
TorchInductor missing ops tracker #93757


Closed
33 of 45 tasks
jansel opened this issue Jun 6, 2022 · 50 comments
Labels
module: inductor, oncall: pt2, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@jansel
Contributor

jansel commented Jun 6, 2022

The following ops are using ir.FallbackKernel via make_fallback() in lowering.py and appear in benchmarks. We should rewrite them to use decomps or lowerings.

  • Add decomp/lowering: aten.as_strided_scatter #93650
  • aten.grid_sampler_2d_backward (higher priority)
  • aten.upsample_bilinear2d_backward (higher priority)
  • aten._adaptive_avg_pool2d_backward
  • aten.upsample_bicubic2d_backward
  • aten._fused_moving_avg_obs_fq_helper
  • aten.upsample_nearest3d (needed for FAIR model)
  • aten.avg_pool3d (needed for FAIR model)
  • aten.bucketize (needed for internal model) - not targeting for codegen, do a fallback
  • aten.prod (needed for research model) - not targeting for codegen, do a fallback

Might not be possible (in a performant way), but these currently use fallbacks:

  • aten.convolution_backward (might need to hold off on this if perf doesn't match)
  • aten._cudnn_rnn (might need to hold off on this if perf doesn't match) - not targeting for codegen, do a fallback
  • aten._cudnn_rnn_backward (might need to hold off on this if perf doesn't match) - not targeting for codegen, do a fallback
  • aten._embedding_bag (may have a template internally) (Attempted with [No CI] Decomp for _embedding_bag #84235, but it's very hard to make it performant and inductor-friendly)
  • [inductor] Lower aten.cumsum #93631 - not targeting for codegen, do a fallback
  • torchvision.roi_align (need to sort out decomps for domain libs)

Done:

cc @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @soumith @ngimel

@Chillee
Collaborator

Chillee commented Jun 6, 2022

aten.convolution_backward

Unfortunately, there's no good way of decomposing this today. The gradient (wrt the weights iirc?) cannot be expressed with a forwards convolution.

aten.nll_loss_forward

This one is WIP: #78491

@ezyang
Contributor

ezyang commented Jun 6, 2022

Unfortunately, there's no good way of decomposing this today. The gradient (wrt the weights iirc?) cannot be expressed with a forwards convolution.

Well, to be more precise, if we have a suitably generalized convolution that has output shape and transposition that would be sufficient for all convolutions forwards/backwards.

@soumith
Member

soumith commented Jun 6, 2022

The gradient (wrt the weights iirc?) cannot be expressed with a forwards convolution.

It can: you have to pad the output image appropriately, and then convolve it with a transposed kernel weight matrix.

See the last diagram in http://soumith.ch/ex/pages/2014/08/07/why-rotate-weights-convolution-gradient/
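
For the grad-input case that the linked post illustrates, a small self-contained check (a sketch with illustrative shapes; stride 1, no padding — the weight gradient is a separate identity, see the sketch after the next comment): pad the upstream gradient and convolve it with the kernel flipped spatially and transposed in its channel dimensions.

```
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8, requires_grad=True)
w = torch.randn(4, 3, 3, 3)   # (out_ch, in_ch, kH, kW)

y = F.conv2d(x, w)            # stride=1, padding=0
grad_out = torch.randn_like(y)
y.backward(grad_out)

# grad w.r.t. the input: pad grad_out (implicitly, via padding=kH-1) and
# convolve with the kernel flipped spatially and transposed in its channel dims.
w_t = w.flip(-1, -2).transpose(0, 1)           # (in_ch, out_ch, kH, kW)
grad_in = F.conv2d(grad_out, w_t, padding=2)   # padding = kH - 1 = 2

torch.testing.assert_close(x.grad, grad_in)
```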

@Chillee
Collaborator

Chillee commented Jun 7, 2022

aten.log1p

btw, this is currently a prim in PrimTorch, so there won't be a further decomposition for it.

aten.expand_as

Also, this is a CompositeImplicitAutograd op - where are you seeing this today?

@jansel
Contributor Author

jansel commented Jun 7, 2022

aten.expand_as

Also, this is a CompositeImplicitAutograd op - where are you seeing this today?

I am seeing it in AOT Autograd graphs for training. I see it in multiple models, for example BERT_pytorch.

@Chillee
Collaborator

Chillee commented Jun 7, 2022

Interesting, will have to check where it comes from. Perhaps it's in a decomposition and we're not properly recursively decomposing? But I don't see it being used in any decomposition right now besides embedding_dense_backward.

@soumith
Member

soumith commented Jun 7, 2022

Well, to be more precise, if we have a suitably generalized convolution that has output shape and transposition that would be sufficient for all convolutions forwards/backwards.

A convolution that has strides, dilation and zero padding can express all needs in a closed form.
Convolution forward, backward-input, backward-weight, etc. can be expressed in terms of this convolution.
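
For the weight gradient specifically (the case Chillee raised), a hedged sketch of the simplest instance of that closed form, with stride 1 and no padding or dilation: cross-correlate the input with the upstream gradient, swapping the batch and channel dimensions. Strided/dilated/padded cases need the generalized convolution described above.

```
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3, requires_grad=True)

y = F.conv2d(x, w)            # stride=1, padding=0
grad_out = torch.randn_like(y)
y.backward(grad_out)

# grad w.r.t. the weight: a forward conv of the input with grad_out, with the
# batch and channel dims swapped (the batch dim plays the role of channels,
# so the sum over the batch happens inside the convolution).
gw = F.conv2d(x.transpose(0, 1), grad_out.transpose(0, 1)).transpose(0, 1)
torch.testing.assert_close(w.grad, gw)
```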

@lezcano
Collaborator

lezcano commented Aug 25, 2022

I'll have a stab at aten._adaptive_avg_pool2d and its backward.

@jansel
Contributor Author

jansel commented Aug 25, 2022

I'll have a stab at aten._adaptive_avg_pool2d and its backward.

@lezcano I think @eellison just did that one here https://github.com/pytorch/torchdynamo/pull/962/files

It requires ops.masked so it had to be done as a lowering rather than a decomp.
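
For intuition on why ops.masked comes in: when the output size divides the input size, every pooling window has the same shape and adaptive_avg_pool2d decomposes trivially (sketch below, with made-up sizes); the lowering earns its keep in the non-divisible case, where window sizes vary per output element.

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 12, 12)
out_h, out_w = 4, 4                      # 12 is divisible by 4
kh, kw = x.shape[-2] // out_h, x.shape[-1] // out_w

ref = F.adaptive_avg_pool2d(x, (out_h, out_w))
dec = F.avg_pool2d(x, kernel_size=(kh, kw), stride=(kh, kw))
torch.testing.assert_close(ref, dec)
```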

@lezcano
Collaborator

lezcano commented Aug 25, 2022

AGGGGG. Would it be possible to update the OP with the ops that have already been implemented?

Also, I was able to implement this one without raw loops (but still peeking twice into the data, once per dimension). Do you think this could be of interest by itself? I am writing this in _decomp; I am not sure what the relation is between _decomp and these lowerings. Which one is preferable? The lowerings use some constructions I don't understand at first sight.

@jansel
Contributor Author

jansel commented Aug 25, 2022

Lowerings are more powerful, there are some things they can represent which decomps currently can't support.

Decomps are more portable, they work outside of TorchInductor and could be used to generate things like double-backwards.

If something can be done (with the same performance) both ways, a decomp is preferred, but we may need to use lowerings to support more complex stuff or in order to match the performance of eager.

"peeking twice into the data" sounds slow. IMO we should be performance testing decomps and ensuring that they match or beat the performance of eager mode.

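As a concrete illustration of the "decomp" side (a sketch: roughly what the existing mse_loss_backward decomposition in torch/_decomp/decompositions.py does, checked against the eager op): a decomposition is a pure aten-to-aten rewrite. A lowering, by contrast, is registered via register_lowering in lowering.py and builds Inductor IR directly, which lets it use IR-only constructs such as ops.masked.

```
import torch

# Sketch of a decomposition for aten.mse_loss_backward; the real one is
# registered with @register_decomposition in torch/_decomp/decompositions.py.
def mse_loss_backward_decomp(grad_output, inp, target, reduction):
    norm = 2.0 / inp.numel() if reduction == 1 else 2.0  # 1 == Reduction.MEAN
    return norm * (inp - target) * grad_output

g, x, t = torch.randn(4, 4), torch.randn(4, 4), torch.randn(4, 4)
ref = torch.ops.aten.mse_loss_backward(g, x, t, 1)
torch.testing.assert_close(ref, mse_loss_backward_decomp(g, x, t, 1))
```
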
@lezcano
Collaborator

lezcano commented Aug 25, 2022

"peeking twice into the data" = getting an integer from a Tensor into Python-land, i.e., 2 syncs.

Also, the way I implemented it basically gives the general nd version for free, but I don't know whether this is of interest. I'll post the draft today, tag you guys, and let you decide; that may be best.

When it comes to performance, without a fuser eager is always going to be faster than pretty much any decomp, as most ops are implemented in a single kernel, while a decomposition literally splits them into a number of kernels.

@jansel
Contributor Author

jansel commented Aug 25, 2022

We can measure performance with a fuser. Perhaps using pytorch/torchdynamo#785

@jansel
Contributor Author

jansel commented Aug 25, 2022

I updated the OP for a few items. Looks like we still don't have a decomp/lowering for grid_sampler_2d_backward:
https://github.com/pytorch/torchdynamo/blob/cb9add9e63ea6a45716aeca89d3669fc82ab94d5/torchinductor/lowering.py#L894

You can check that file to confirm something is still missing. Anything implemented with a fallback is in need of decomp/lowering.

@Chillee
Collaborator

Chillee commented Aug 25, 2022

@lezcano By "measuring performance", I think @jansel means "measuring performance with inductor".

For example, we benchmarked inductor performance here with the lowering: pytorch/torchdynamo#934

@fdrocha
Collaborator

fdrocha commented Aug 30, 2022

I will have a go at grid_sampler_2d_backward

@lezcano
Collaborator

lezcano commented Aug 30, 2022

FWIW, a decomposition for aten.mse_loss_backward (def mse_loss_backward) already exists in PyTorch. Should we mark that one as resolved?

@jansel
Contributor Author

jansel commented Aug 30, 2022

FWIW, a decomposition for aten.mse_loss_backward (def mse_loss_backward) already exists in PyTorch. Should we mark that one as resolved?

Awesome! I think we just need to add it to this list:
https://github.com/pytorch/torchdynamo/blob/a44c28b3ccc2a3643666cf6167990730bf585aa6/torchinductor/decomposition.py#L18

@lezcano
Collaborator

lezcano commented Aug 30, 2022

I'll take upsample_bilinear2d_backward

@ngimel
Collaborator

ngimel commented Aug 30, 2022

cc @SherlockNoMad, for things like grid_sampler_2d_backward and upsample_bilinear2d_backward we thought a better solution would be to create infra to generate the backward from the forward decomposition trace, but a stop-gap direct decomposition is also fine I guess (we'd still need implicit grads for higher-order gradients).

@SherlockNoMad
Contributor

I added a decomp for aten.im2col.
Also, I noticed that nn.functional.unfold would eventually call aten.im2col, so I am wondering where we are seeing aten.unfold?

```
# Module stack: {'self_network_0': 'Sequential', 'self_network_0_2': 'Outlooker', 'self_network_0__2__attn': 'OutlookAttention', 'self_network_0__2__attn_unfold': 'Unfold'}, File: /fsx/users/bahuang/conda/envs/pt_dev/lib/python3.9/site-packages/timm/models/volo.py:111, code: v = self.unfold(v).reshape(
im2col_default_2 = torch.ops.aten.im2col.default(permute_default_15, [3, 3], [1, 1], [1, 1], [2, 2]);  permute_default_15 = None
```

@SherlockNoMad
Contributor

aten._fused_moving_avg_obs_fq_helper is showing up in mobilenet_v2_quantized_qat and resnet50_quantized_qat models.

These are quantization models, and the op originates from the FusedMovingAvgObsFakeQuantize module. Does Inductor, or PT2, want to cover quantization models? I thought not...

```
# Module stack: {'self_activation_post_process_0': 'FusedMovingAvgObsFakeQuantize'}, File: <eval_with_key>.2:5, code: activation_post_process_0 = self.activation_post_process_0(x);  x = None
_fused_moving_avg_obs_fq_helper_default = torch.ops.aten._fused_moving_avg_obs_fq_helper.default(primals_1075, primals_159, primals_160, primals_163, primals_164, primals_161, primals_162, 0.01, 0, 127, -1);  primals_1075 = primals_159 = primals_160 = primals_163 = primals_164 = primals_161 = primals_162 = None
getitem = _fused_moving_avg_obs_fq_helper_default[0];  _fused_moving_avg_obs_fq_helper_default = None
```

@Chillee
Collaborator

Chillee commented Aug 30, 2022

@SherlockNoMad I think torch.Tensor.unfold calls aten::unfold (https://pytorch.org/docs/stable/generated/torch.Tensor.unfold.html), while the functional version calls im2col: https://pytorch.org/docs/stable/generated/torch.nn.Unfold.html

Any idea why this is the case? Maybe @ngimel?

re: quantization, I don't see any reason why we shouldn't support quantized models :) I guess it's a separate question whether we should decompose that op, or just ... not have it.
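
A quick way to see the two ops side by side in a trace (a sketch; make_fx is under torch.fx.experimental.proxy_tensor in this era of the codebase): nn.Unfold / F.unfold lowers to aten.im2col, while Tensor.unfold is the separate aten.unfold view op.

```
import torch
import torch.nn.functional as F
from torch.fx.experimental.proxy_tensor import make_fx

x = torch.randn(1, 4, 8, 8)

# nn.Unfold / F.unfold -> aten.im2col
print(make_fx(lambda t: F.unfold(t, kernel_size=3, padding=1))(x).graph)

# Tensor.unfold -> aten.unfold (a sliding-window view, unrelated to im2col)
print(make_fx(lambda t: t.unfold(2, 3, 1))(x).graph)
```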

@fdrocha
Collaborator

fdrocha commented Sep 27, 2022

I'll take bucketize
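
Not necessarily the approach a PR should take (eager bucketize is a binary search, and the OP lists it as a fallback candidate rather than a codegen target), but a naive comparison-based sketch pins down the semantics:

```
import torch

def bucketize_sketch(x, boundaries, right=False):
    # O(N * B) comparisons instead of a binary search -- illustration only.
    cmp = boundaries <= x.unsqueeze(-1) if right else boundaries < x.unsqueeze(-1)
    return cmp.sum(dim=-1)

x = torch.tensor([0.5, 1.0, 2.5, 3.0, 9.0])
b = torch.tensor([1.0, 2.0, 3.0])
assert torch.equal(bucketize_sketch(x, b), torch.bucketize(x, b))
assert torch.equal(bucketize_sketch(x, b, right=True), torch.bucketize(x, b, right=True))
```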

@jansel jansel unpinned this issue Oct 11, 2022
@fdrocha
Collaborator

fdrocha commented Oct 11, 2022

I'll also take softplus and its backward

peterbell10 referenced this issue Nov 3, 2022
Ref pytorch/torchdynamo#327

The use of as_strided does require in-memory manipulations; however, this
lowering allows those memory ops to be fused with any preceding calculations,
e.g.

```
def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10,
        b * 2 - 4,
        size=(a.numel() // 2,),
        stride=(2,))
```

Before this PR it compiles to two kernels plus a call to `aten.as_strided_scatter`;
with this PR it compiles to just two kernels and no additional operator calls.

In theory I think this could be a decomposition, but in practice I saw the
`output_view.copy_(src)` being optimized out in some cases when this was
implemented as a decomposition.
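
A hedged way to sanity-check the fusion claim today (only `f` above comes from the PR; the harness below uses the torch.compile entry point that replaced torch._dynamo.optimize("inductor") shortly after this thread):

```
import torch

def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10,
        b * 2 - 4,
        size=(a.numel() // 2,),
        stride=(2,))

compiled = torch.compile(f, backend="inductor")
a, b = torch.randn(64), torch.randn(32)
torch.testing.assert_close(compiled(a, b), f(a, b))
# Set TORCH_COMPILE_DEBUG=1 (or TORCH_LOGS=output_code on newer builds)
# before running to dump the generated kernels and count them.
```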

peterbell10 referenced this issue Nov 3, 2022
ghstack-source-id: 735c3c3
Pull Request resolved: #88379
pytorchmergebot referenced this issue Nov 7, 2022
Pull Request resolved: #88379
Approved by: https://github.com/jansel
peterbell10 referenced this issue in peterbell10/pytorch Nov 7, 2022
kulinseth referenced this issue in kulinseth/pytorch Dec 10, 2022
@min-jean-cho
Collaborator

min-jean-cho commented Dec 14, 2022

Hi @jansel, I will add an aten.uniform_ decomposition -- please refer to #90815. Please let me know if you have any comments on this, thanks!

@jansel
Contributor Author

jansel commented Dec 15, 2022

Hi @jansel, I will add an aten.uniform_ decomposition -- please refer to pytorch/pytorch#90815. Please let me know if you have any comments on this, thanks!

Sounds good to me, thanks!

@min-jean-cho
Collaborator

Hi @jansel, I think we can also decompose aten.normal_ with torch.rand, torch.log, etc., similar to how numpy implements it (https://github.com/numpy/numpy/blob/main/numpy/random/src/legacy/legacy-distributions.c#L18-L40), based on the Marsaglia polar method (https://en.wikipedia.org/wiki/Marsaglia_polar_method).

Once aten.normal_ is decomposed, other distributions can be decomposed fairly simply on top of it.

I wanted to ask if you have any comments on this before digging deeper, thanks!
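
Roughly what the vectorized polar-method proposal would look like (a sketch only; the rejection loop is the iterative part discussed in the following comments, and every retry costs extra kernel launches and a potential sync on GPU):

```
import torch

def normal_polar(shape, generator=None):
    # Marsaglia polar method, vectorized with a rejection loop
    # (~21% of candidate pairs are rejected on average).
    out = torch.empty(shape).flatten()
    filled = 0
    while filled < out.numel():
        n = out.numel() - filled
        u = torch.rand(n, generator=generator) * 2 - 1
        v = torch.rand(n, generator=generator) * 2 - 1
        s = u * u + v * v
        ok = (s > 0) & (s < 1)
        z = u[ok] * torch.sqrt(-2 * torch.log(s[ok]) / s[ok])
        take = min(z.numel(), n)
        out[filled:filled + take] = z[:take]
        filled += take
    return out.reshape(shape)

x = normal_polar((1_000_000,))
print(x.mean().item(), x.std().item())  # ~0.0, ~1.0
```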

@lezcano
Collaborator

lezcano commented Dec 16, 2022

Note that these iterative methods will be particularly inefficient on GPUs. I've just realised that the do-while is just there to handle an extremely rare edge-case.

In any case, I get the feeling that many of these sampling ops should be primitives in PrimTorch. One rule of thumb we chose when deciding what is a primitive for PrimTorch and what is not is whether the STL has a dedicated operation for it. If it does, it makes sense that the operation may have something numerically particular about it that makes it not amenable to being decomposed.

Note that I bring up PrimTorch here even though Inductor does not necessarily abide by it, as I don't think these decompositions are specific to Inductor, but could be used by nvfuser or other consumers really.

cc @mruberry

@jgong5
Collaborator

jgong5 commented Dec 19, 2022

If it does, it makes sense that the operation may have something numerically particular about it that makes it not amenable to being decomposed.

Numerics is one aspect. The other is the tradeoff between a narrowed opset to support vs. performance: a narrowed opset reduces the complexity of backend support, but it can make it harder for a backend to map the decomposed pattern onto a more efficient implementation. Good to know that the STL was used as one of the rules for defining the opset. Is there any documentation on the guidelines for defining the prim opset?

@ngimel
Collaborator

ngimel commented Dec 19, 2022

randn already has a lowering that can be reused for normal_. randn is already a primitive op in Inductor, with codegen defined for CPU and Triton.

@min-jean-cho
Collaborator

Hi @jansel, I think we can also decompose aten.normal_

Hi all, I know there's an ongoing discussion of prim op vs. decomposed aten op regarding aten.uniform_, but I just wanted to add that I've created an RFC for decomposing aten.normal_: #91085

@ngimel
Collaborator

ngimel commented Dec 19, 2022

Given that backends typically provide a randn implementation, I don't think a normal decomposition using the polar method is needed; normal should just call randn.
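
In other words, something along these lines (a sketch, not necessarily the exact decomposition that eventually landed):

```
import torch

def normal_decomp(mean, std, size, generator=None, **kwargs):
    # Reuse the existing randn primitive/lowering: N(mean, std) == mean + std * N(0, 1).
    return mean + std * torch.randn(size, generator=generator, **kwargs)

g = torch.Generator().manual_seed(0)
x = normal_decomp(2.0, 0.5, (1_000_000,), generator=g)
print(x.mean().item(), x.std().item())  # ~2.0, ~0.5
```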

@malfet malfet transferred this issue from pytorch/torchdynamo Feb 1, 2023
@gchanan gchanan added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Feb 16, 2023
@jansel jansel closed this as completed Nov 27, 2023
pytorchmergebot pushed a commit that referenced this issue Dec 22, 2023
Presumably this can close #109784

Also related to #93757 (though `take` is not listed there).

There's no bounds checking here (out of bounds indices cause a segfault or undefined behavior). Should that be added somehow?

Pull Request resolved: #114813
Approved by: https://github.com/peterbell10, https://github.com/lezcano