
support inplace buffer reuse #1486


Merged (40 commits) on Oct 15, 2022

Conversation

@jgong5 commented Oct 5, 2022

Enable the inplace_buffers = True feature. This addresses issue #823 and the second optimization mentioned in #1401. It is turned off by default for now due to #1670; currently only one CI case, gmixer_24_224 training on CUDA, fails because of that issue.
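
For anyone who wants to experiment once this lands, here is a minimal sketch of flipping the flag and compiling a small function so the scheduler can attempt inplace reuse. The torchinductor.config path and the torchdynamo.optimize entry point follow the repo layout of this era and are assumptions here; only the inplace_buffers flag name comes from this PR.

```python
# Sketch only (not part of this PR's diff): enable inplace buffer reuse
# and compile a tiny function. Assumes the torchdynamo-era layout where
# TorchInductor's config lives in torchinductor.config.
import torch
import torchdynamo
import torchinductor.config

torchinductor.config.inplace_buffers = True  # left off by default by this PR

@torchdynamo.optimize("inductor")
def f(x, y):
    # a chain of pointwise ops gives the scheduler intermediate buffers
    # whose storage can potentially be reused in place
    a = x + y
    b = a * 2.0
    return b.relu()

f(torch.randn(1024), torch.randn(1024))
```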

This PR makes the following fixes:

  1. Support more than two other_names in make_inplace (common.py) to allow a chain of inplace reuses.
  2. Do not delete a buffer in make_buffer_reuse (wrapper.py) if it is a graph output.
  3. Add a new field V.graph.inplaced_to_remove. Inplace buffers subject to removal are not actually removed but are placed in this dedicated set, which simplifies the life-cycle management of inplace buffers. The set is used to 1) avoid unnecessary stores in DeferredLine and 2) avoid alias variable definitions in the kernel (graph.py/scheduler.py).
  4. Do not allocate buffers for ExternKernelAlloc and MultiOutput (wrapper.py) since these are pre-allocated.
  5. Make sure the reuse key matches on inplace buffer reuse in allocate (scheduler.py).
  6. Do not inplace-reuse buffers with the layouts ir.MultiOutputLayout, ir.MutationLayout, or ir.AliasedLayout. This is a conservative choice and can be revisited later if inplace reuse turns out to be safe for some of these layouts.
  7. Add kernel output buffers to available_buffer_names in codegen (scheduler.py) to make sure remaining_uses is correct in allocate (scheduler.py).

@jgong5 jgong5 marked this pull request as draft October 5, 2022 06:38
@jgong5 jgong5 marked this pull request as ready for review October 6, 2022 15:39
@jgong5 jgong5 requested review from jansel and EikanWang October 6, 2022 15:39
@jansel (Contributor) left a comment

The approach seems right, but the test failures look real.

@jgong5 jgong5 marked this pull request as draft October 9, 2022 03:58
@desertfire (Contributor)

Ah, I managed to repro it locally. It seems to only happen on a clean run without the code cache; if I run it a second time, the test passes with the cached code.

This is weird. Can you give our minifier a try? See https://github.com/pytorch/torchdynamo/blob/main/documentation/TROUBLESHOOTING.md and

repro_after = os.environ.get("TORCHDYNAMO_REPRO_AFTER", None)

If the minifier works smoothly, you should be able to get a smaller reproducible graph, which makes debugging much easier.
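
For reference, a hedged sketch of wiring the minifier into a driver script; the choice of "aot" here (vs "dynamo") and the stand-in function are assumptions, so follow the linked TROUBLESHOOTING.md for the authoritative instructions.

```python
# Sketch: point TorchDynamo's minifier at the failing run.
# The env var must be set before torchdynamo is imported, since config.py
# reads TORCHDYNAMO_REPRO_AFTER at import time (see the line quoted above).
import os
os.environ["TORCHDYNAMO_REPRO_AFTER"] = "aot"

import torch
import torchdynamo

@torchdynamo.optimize("inductor")
def failing_fn(x):
    # stand-in for the graph that fails on the first clean-cache run
    return (x + 1.0).relu()

failing_fn(torch.randn(8))
# On a compile failure the minifier dumps a standalone repro script to disk
# (file name/location varies by version) that can be re-run and shrunk.
```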

@jgong5 (Author) commented Oct 12, 2022

If the minifier works smoothly, you should be able to get a smaller reproducible graph, which makes debugging much easier.

Thanks a lot for the tip, I will give it a try! BTW, it seems trunk is broken? I got something like an argument-count mismatch in all CUDA-related tests...

@soumith (Member) commented Oct 12, 2022

@jgong5 I believe that is because main now requires a newer Triton: #1585. Please upgrade your Triton to the commit outlined there and you should be good.

@jgong5 (Author) commented Oct 12, 2022

If the minifier works smoothly, you should be able to get a smaller reproducible graph, which makes debugging much easier.

@desertfire Can this troubleshoot accuracy issues, or does it only handle compilation/crash errors?

@soumith (Member) commented Oct 12, 2022

@jgong5 it can minify accuracy issues. This feature was introduced in this PR: #1242
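
A hedged sketch of what the accuracy-minification setup along the lines of #1242 roughly looks like; the TORCHDYNAMO_REPRO_LEVEL=4 value (bisect on accuracy rather than crashes) is taken from the troubleshooting doc, and the stand-in model is made up, so double-check against your version.

```python
# Sketch: minify a wrong-result (accuracy) failure rather than a crash.
# Env vars are read at import time, so set them before importing torchdynamo;
# REPRO_LEVEL=4 asks the minifier to bisect on divergence from eager mode.
import os
os.environ["TORCHDYNAMO_REPRO_AFTER"] = "aot"
os.environ["TORCHDYNAMO_REPRO_LEVEL"] = "4"

import torch
import torchdynamo

@torchdynamo.optimize("inductor")
def model_under_test(x):
    # stand-in for the gmixer_24_224 forward pass; any graph whose compiled
    # output diverges from eager would trigger the accuracy minifier here
    return torch.nn.functional.gelu(x @ x.t())

model_under_test(torch.randn(64, 64))
```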

@jgong5 jgong5 marked this pull request as draft October 13, 2022 00:58
@jansel (Contributor) left a comment

Requesting changes to take this off my review queue; re-request review when tests pass.

@jgong5 (Author) commented Oct 14, 2022

Thanks @desertfire @soumith for the hints on the minifier tool. It helped generate a repro, though not a small one, since the issue only happens on the first clean-cache run. I managed to narrow it down manually to a smaller repro: #1670. It seems to be an issue inside the Triton compiler, or in the way TorchInductor integrates Triton? @jansel

@jgong5 jgong5 marked this pull request as ready for review October 14, 2022 23:18
@jgong5 (Author) commented Oct 14, 2022

@jansel Are you OK with landing this PR with inplace_buffers set to False first? If it is turned on with the same PR change, only gmixer_24_224 training fails the accuracy test, for the reason I explained in the description. I don't think the failure is caused by this change, but I will continue to look for the root cause. Landing it first allows others to experiment with inplace_buffers if needed. If you are OK with that, the PR is in good shape for review.

@jgong5 jgong5 requested a review from jansel October 14, 2022 23:50
@jansel (Contributor) commented Oct 15, 2022

Yes, sounds good to me.
