TorchInductor missing ops tracker #93757

Comments
Unfortunately, there's no good way of decomposing this today. The gradient (wrt the weights, IIRC?) cannot be expressed with a forward convolution.
This one is WIP: #78491
Well, to be more precise: if we have a suitably generalized convolution that takes an output shape and a transposition flag, that would be sufficient for all convolutions, forwards and backwards.
It can. You have to pad the output image appropriately and then convolve it with a transposed kernel weight matrix. See the last diagram in http://soumith.ch/ex/pages/2014/08/07/why-rotate-weights-convolution-gradient/
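A quick numerical sketch of that claim (illustrative, for the simplest case of stride 1 and no padding): the gradient of a forward conv2d with respect to its input equals a transposed convolution of the output gradient with the same weights.

```
# Check that the input gradient of a forward conv equals a transposed
# convolution of the output gradient (stride 1, no padding, no dilation).
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)
w = torch.randn(4, 3, 3, 3)

y = F.conv2d(x, w)
grad_out = torch.randn_like(y)

# Gradient w.r.t. the input, computed by autograd.
(grad_autograd,) = torch.autograd.grad(y, x, grad_out)

# The same quantity expressed as a convolution: conv_transpose2d implicitly
# pads the output gradient and convolves it with the flipped/transposed kernel.
grad_conv_t = F.conv_transpose2d(grad_out, w)

print(torch.allclose(grad_autograd, grad_conv_t, atol=1e-4))  # True
```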
btw, this is currently a prim in Primtorch. So there won't be a further decomposition for it.
Also, this is a CompositeImplicitAutograd op - where are you seeing this today?
I am seeing it in AOT Autograd graphs for training. I see it in multiple models, for example BERT_pytorch.
Interesting, I will have to check where it comes from. Perhaps it's in a decomposition and we're not properly recursively decomposing? But I don't see it being used in any decomposition right now besides
A convolution that has strides, dilation, and zero padding can express all needs in closed form.
Summary: Requested in https://github.com/pytorch/torchdynamo/issues/327 Pull Request resolved: #78919 Approved by: https://github.com/mruberry Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ea3c4d0c75c99855d6f6aae2e67860a627052104 Reviewed By: osalpekar Differential Revision: D36959175 Pulled By: Chillee fbshipit-source-id: aa1cd2563ea0c185d0efa294c47fcea40fa8a21a
I'll have a stab at
@lezcano I think @eellison just did that one here: https://github.com/pytorch/torchdynamo/pull/962/files It requires
AGGGGG. Would it be possible to update the OP with the ops that have already been implemented? Also, I was able to implement this one without raw loops (but still peeking twice into the data, once per dimension). Do you think this could be of interest by itself? I am writing this in
Lowerings are more powerful: there are some things they can represent which decomps currently can't support. Decomps are more portable: they work outside of TorchInductor and could be used to generate things like double-backwards. If something can be done (with the same performance) both ways, a decomp is preferred, but we may need to use lowerings to support more complex stuff or to match the performance of eager. "Peeking twice into the data" sounds slow. IMO we should be performance-testing decomps and ensuring that they match or beat the performance of eager mode.
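To make the portability point concrete, a minimal sketch (assuming the public `torch._decomp.get_decompositions` and `make_fx` APIs; the traced function is just an illustration) of running a decomposition table outside of Inductor:

```
import torch
from torch._decomp import get_decompositions
from torch.fx.experimental.proxy_tensor import make_fx

aten = torch.ops.aten

def f(x):
    return torch.nn.functional.silu(x)

# Collect the registered decomposition for aten.silu (x * sigmoid(x)).
decomp_table = get_decompositions([aten.silu])

# Tracing through the table replaces the fused op with simpler aten ops,
# which any consumer (Inductor, nvFuser, double-backward infra) can reuse.
gm = make_fx(f, decomposition_table=decomp_table)(torch.randn(8))
print(gm.graph)
```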
"peeking twice into the data" = getting an integer from a Tensor into Python-land, i.e., 2 syncs. Also, the way I implemented it, basically implements the general When it comes to performance, without a fuser, eager is always going ot be fasterter than pretty much any decomp, as most kernels are implemented in one kernel, while in decompositions we are literally decomposing them into a number of kernels. |
We can measure performance with a fuser. Perhaps using pytorch/torchdynamo#785 |
I updated the OP for a few items. Looks like we still don't have a decomp/lowering for grid_sampler_2d_backward: You can check that file to confirm something is still missing. Anything implemented with a fallback is in need of decomp/lowering. |
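One rough way to check this (relying on Inductor internals as of this thread, so the exact names may differ across versions): the `lowerings` dict in `torch/_inductor/lowering.py` maps each aten op to its handler, and ops registered via `make_fallback()` show up there with a generic wrapper instead of a dedicated lowering.

```
# Check whether an op is still handled by a fallback rather than a real
# lowering (uses Inductor internals, so this is subject to change).
import torch
from torch._inductor import lowering

aten = torch.ops.aten
op = aten.grid_sampler_2d_backward.default

handler = lowering.lowerings.get(op)
# A dedicated lowering appears under its own function name; a fallback
# appears as the generic handler produced by make_fallback().
print(op, "->", getattr(handler, "__name__", handler))
```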
@lezcano By "measuring performance", I think @jansel means "measuring performance with inductor". For example, we benchmarked inductor performance here with the lowering: pytorch/torchdynamo#934 |
I will have a go at
FWIW, see pytorch/torch/_decomp/decompositions.py, line 340 (at commit b8e1c54).
Awesome! I think we just need to add it to this list:
I'll take
cc @SherlockNoMad, for things like grid_sampler_2d_backward and upsample_bilinear2d_backward we thought that a better solution would be to create infra to generate the backward from the forward decomposition trace, but a stop-gap direct decomposition solution is also fine, I guess (we'd still need implicit grads for higher-order gradients).
I added a decomp for aten.im2col.
aten._fused_moving_avg_obs_fq_helper is showing up in the mobilenet_v2_quantized_qat and resnet50_quantized_qat models. These are quantization models, and the op originates from the FusedMovingAvgObsFakeQuantize module. Does Inductor, or PT2, want to cover quantization models? I thought not...
@SherlockNoMad I think
Any idea why this is the case? Maybe @ngimel? Re: quantization, I don't see any reason why we shouldn't support quantized models :) I guess it's a separate question whether we should decompose that op, or just ... not have it.
I'll take
I'll also take
Ref pytorch/torchdynamo#327

The use of as_strided does require in-memory manipulations; however, this lowering allows those memory ops to be fused with any preceding calculations, e.g.

```
def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10, b * 2 - 4, size=(a.numel() // 2,), stride=(2,))
```

Before this, the example compiles to two kernels and a call to `aten.as_strided_scatter`; with this PR it compiles to just two kernels and no additional operator calls. In theory I think this could be a decomposition, but in practice I saw the `output_view.copy_(src)` being optimized out in some cases when this was implemented as a decomposition.

Pull Request resolved: #88379 Approved by: https://github.com/jansel
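A minimal sketch for trying the example from that PR end to end (the exact kernel count depends on the Inductor version; `TORCH_COMPILE_DEBUG=1` can be used to inspect the generated code):

```
import torch

def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10, b * 2 - 4, size=(a.numel() // 2,), stride=(2,))

a = torch.randn(64)
b = torch.randn(32)  # src must match size=(a.numel() // 2,)

compiled = torch.compile(f)  # uses the Inductor backend by default
torch.testing.assert_close(compiled(a, b), f(a, b))
```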
Sounds good to me, thanks!
Hi @jansel, I think we can also decompose
Once
I wanted to ask if you have any comments on this before digging deeper, thanks!
In any case, I get the feeling that many of these sampling ops should be primitives in PrimTorch. One rule of thumb we chose when deciding what's a primitive for PrimTorch and what's not is whether the STL has a dedicated operation for it. If it does, it makes sense that that operation may have something numerically particular about it that makes it not amenable to being decomposed. Note that I bring up PrimTorch here even though Inductor does not necessarily abide by it, as I don't think these decompositions are specific to Inductor; they could be used by nvFuser or other consumers really. cc @mruberry
Numerics are one aspect. The other is the tradeoff between a narrowed opset to support vs. performance: a narrowed opset reduces the complexity of the backend support, but it might be harder for the backend to map the decomposed pattern onto a more efficient implementation. Good to know that the STL was used as one of the rules for defining the opset. Is there any documentation on the guidelines for defining the Prim opset?
Given that backends typically provide
Presumably this can close #109784. Also related to #93757 (though `take` is not listed there). There's no bounds checking here (out-of-bounds indices cause a segfault or undefined behavior). Should that be added somehow? Pull Request resolved: #114813 Approved by: https://github.com/peterbell10, https://github.com/lezcano
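For context on the semantics the lowering has to reproduce (illustrative only): `torch.take` indexes into the flattened input, which is also why out-of-bounds indices are dangerous without a bounds check.

```
import torch

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([0, 5, 11])

# take gathers at flat positions in the flattened view of x.
print(torch.take(x, idx))  # tensor([ 0,  5, 11])
```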
The following ops are using `ir.FallbackKernel` via `make_fallback()` in lowering.py and appear in benchmarks. We should rewrite them to use decomps or lowerings.

Might not be possible (in a performant way), but currently use fallbacks:

Done:
- aten.upsample_bilinear2d.vec #80964

cc @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @soumith @ngimel