
Fixes new tf32 failures in test_nn.py #52871


Closed

wants to merge 3 commits into from

Conversation

zasdfgbnm
Collaborator

Also modify the tf32_on_and_off decorator to make it support functions without a device argument.
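
For illustration, here is a minimal sketch of how a decorator in the spirit of tf32_on_and_off can support test functions that have no device argument: it reads the wrapped function's signature and only consults device when such a parameter exists. This is not the actual torch.testing._internal implementation; the handling of the test's tolerance is deliberately simplified.

```
# Minimal sketch, not the actual PyTorch implementation. Assumes an Ampere-or-newer
# GPU for the TF32 pass; the wiring of `precision` into the test object is omitted.
import functools
import inspect

import torch


def tf32_is_not_fp32():
    # TF32 only behaves differently from FP32 on Ampere (compute capability >= 8.0).
    return torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8


def tf32_on_and_off(precision=1e-5):
    def decorator(f):
        # Knowing the parameter names lets us treat positional and keyword arguments
        # uniformly, and detect whether a `device` parameter exists at all.
        arg_names = list(inspect.signature(f).parameters)

        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            kwargs.update(zip(arg_names, args))
            # Only add the TF32 pass where TF32 can actually change numerics.
            # If the test has no `device` argument, assume it targets CUDA.
            cond = tf32_is_not_fp32()
            if 'device' in kwargs:
                cond = cond and torch.device(kwargs['device']).type == 'cuda'

            old_matmul = torch.backends.cuda.matmul.allow_tf32
            old_cudnn = torch.backends.cudnn.allow_tf32
            try:
                # Always run once with TF32 disabled (plain FP32).
                torch.backends.cuda.matmul.allow_tf32 = False
                torch.backends.cudnn.allow_tf32 = False
                f(**kwargs)
                if cond:
                    # Run again with TF32 enabled; the real decorator also relaxes the
                    # test's tolerance to `precision` for this pass (omitted here).
                    torch.backends.cuda.matmul.allow_tf32 = True
                    torch.backends.cudnn.allow_tf32 = True
                    f(**kwargs)
            finally:
                torch.backends.cuda.matmul.allow_tf32 = old_matmul
                torch.backends.cudnn.allow_tf32 = old_cudnn

        return wrapped
    return decorator
```

With a sketch like this, a test written as def test_foo(self): with no device parameter can still be decorated with @tf32_on_and_off(0.05) and get both the FP32 and the TF32 pass on an Ampere GPU.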

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 25, 2021

💊 CI failures summary and remediations

As of commit 77af952 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@@ -3692,7 +3693,7 @@ def fractional_max_pool3d_test(test_case):
         check_gradgrad=False,
         desc='gelu_activation',
         with_tf32=True,
-        tf32_precision=0.01,
+        tf32_precision=0.05,
Collaborator

Isn't it worrying? Did a new cuBLAS version come out?

Collaborator Author

@zasdfgbnm zasdfgbnm Feb 25, 2021

This is the cuBLAS that ships with CUDA 11.2.1; we have never gone through the tests and made them green against it before.

Collaborator Author

Threshold issues come up so often that we don't fix this kind of bug immediately after we see it. We usually wait until there are enough failures to submit a fix, so the last time we saw this pass could have been many versions ago (I don't remember exactly which version).

@anjali411 anjali411 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 25, 2021
@codecov

codecov bot commented Feb 26, 2021

Codecov Report

Merging #52871 (77af952) into master (1b792a7) will increase coverage by 0.00%.
The diff coverage is 86.66%.

@@           Coverage Diff           @@
##           master   #52871   +/-   ##
=======================================
  Coverage   77.47%   77.47%           
=======================================
  Files        1892     1892           
  Lines      185623   185623           
=======================================
+ Hits       143803   143804    +1     
+ Misses      41820    41819    -1     

def wrapped(*args, **kwargs):
    for k, v in zip(arg_names, args):
        kwargs[k] = v
    cond = tf32_is_not_fp32()
Collaborator

Add a comment explaining the computation of cond and its effect

Collaborator

Also update the comment explaining the decorator so readers know when it has an effect

Collaborator Author

I just added all the comments in front of this function.
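
For readers following the thread without opening the diff, a rough paraphrase of what such a leading comment needs to convey (not the exact text added in the PR) is:

```
# Paraphrase only; the wording in the PR may differ.
#
# cond = tf32_is_not_fp32() is True only on GPUs where TF32 and FP32 actually
# produce different results (Ampere / compute capability 8.0 and newer). If the
# wrapped test takes a `device` argument, cond is further restricted to CUDA
# devices; if it takes no `device` argument, the test is assumed to run on CUDA.
#
# Effect: when cond is False the decorator is effectively a no-op and the test
# runs once with its default tolerance; when cond is True the test is exercised
# both with TF32 disabled and with TF32 enabled, the latter using the looser
# tf32_precision.
```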

@zasdfgbnm
Collaborator Author

ping @ngimel @mruberry

@mruberry
Collaborator

@zasdfgbnm Thank you for the ping. This looks good to me now. @ngimel, would you like to take a look?

@mruberry mruberry self-requested a review March 24, 2021 05:59
Collaborator

@mruberry mruberry left a comment

Thanks @zasdfgbnm!

@facebook-github-bot
Contributor

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mruberry merged this pull request in 9f336bd.

@zasdfgbnm zasdfgbnm deleted the tf32-cudnn branch March 25, 2021 06:14
facebook-github-bot pushed a commit that referenced this pull request Jun 24, 2021
Summary:
Allow those tests to pass on A100 GPUs which support tf32

Basically a follow-up to #52871, which also increased some precisions to 0.05.

For reference, these are the failures I see (only the ones in test_nn with 1.9.0):
```
FAIL: test_Conv3d_pad_same_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 161 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan compariso
ns). The greatest difference was 0.032408137116391345 (-33.45570601919647 vs. -33.42329788208008), which occurred at index (2, 0, 0, 1, 0).

======================================================================
FAIL: test_Conv3d_pad_same_dilated_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 111 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan compariso
ns). The greatest difference was 0.024654212557543076 (35.104286017977465 vs. 35.07963180541992), which occurred at index (3, 0, 0, 0, 2).

======================================================================
FAIL: test_Conv3d_pad_valid_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 41 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.010903167642320355 (8.074376869119371 vs. 8.06347370147705), which occurred at index (0, 0, 1, 0, 0).

```

Pull Request resolved: #60451

Reviewed By: albanD

Differential Revision: D29353255

Pulled By: ngimel

fbshipit-source-id: 155a02242be5a11dcbd9dd40ab63f15c6757ae1b
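
For context on why these differences land in the 1e-2 range, the sketch below (an illustration assuming an Ampere GPU, using PyTorch's public TF32 switches rather than any test-suite internals) shows how TF32 shifts matmul numerics relative to plain FP32:

```
# Illustration only: compare an FP32 and a TF32 matmul on an Ampere GPU. The exact
# error magnitude depends on GPU, library versions and problem size; the point is
# that TF32 needs a much looser atol than FP32 round-off would.
import torch

assert torch.cuda.is_available(), "requires a CUDA device (Ampere or newer for TF32)"

a = torch.randn(512, 512, device='cuda')
b = torch.randn(512, 512, device='cuda')

torch.backends.cuda.matmul.allow_tf32 = False
ref = a @ b  # full FP32 accumulation

torch.backends.cuda.matmul.allow_tf32 = True
approx = a @ b  # TF32: inputs rounded to a 10-bit mantissa before multiplying

# Typically far above FP32 round-off, which is why tolerances like 0.01 or 0.05
# are used for the *_tf32 variants of these tests.
print((ref - approx).abs().max().item())
```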
Labels
cla signed, Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
6 participants