
Fixes new tf32 failures in test_nn.py #52871


Closed

wants to merge 3 commits into from

Conversation

zasdfgbnm
Collaborator

Also modify the tf32_on_and_off decorator to make it support functions without a device argument.
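
For illustration, here is a minimal sketch of how a decorator in the spirit of tf32_on_and_off can support test functions that have no device argument: it reads the wrapped function's signature and only consults device when such a parameter exists. This is not the actual torch.testing._internal implementation; the handling of the test's tolerance is deliberately simplified.

```
# Minimal sketch, not the actual PyTorch implementation. Assumes an Ampere-or-newer
# GPU for the TF32 pass; the wiring of `precision` into the test object is omitted.
import functools
import inspect

import torch


def tf32_is_not_fp32():
    # TF32 only behaves differently from FP32 on Ampere (compute capability >= 8.0).
    return torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8


def tf32_on_and_off(precision=1e-5):
    def decorator(f):
        # Knowing the parameter names lets us treat positional and keyword arguments
        # uniformly, and detect whether a `device` parameter exists at all.
        arg_names = list(inspect.signature(f).parameters)

        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            kwargs.update(zip(arg_names, args))
            # Only add the TF32 pass where TF32 can actually change numerics.
            # If the test has no `device` argument, assume it targets CUDA.
            cond = tf32_is_not_fp32()
            if 'device' in kwargs:
                cond = cond and torch.device(kwargs['device']).type == 'cuda'

            old_matmul = torch.backends.cuda.matmul.allow_tf32
            old_cudnn = torch.backends.cudnn.allow_tf32
            try:
                # Always run once with TF32 disabled (plain FP32).
                torch.backends.cuda.matmul.allow_tf32 = False
                torch.backends.cudnn.allow_tf32 = False
                f(**kwargs)
                if cond:
                    # Run again with TF32 enabled; the real decorator also relaxes the
                    # test's tolerance to `precision` for this pass (omitted here).
                    torch.backends.cuda.matmul.allow_tf32 = True
                    torch.backends.cudnn.allow_tf32 = True
                    f(**kwargs)
            finally:
                torch.backends.cuda.matmul.allow_tf32 = old_matmul
                torch.backends.cudnn.allow_tf32 = old_cudnn

        return wrapped
    return decorator
```

With a sketch like this, a test written as def test_foo(self): with no device parameter can still be decorated with @tf32_on_and_off(0.05) and get both the FP32 and the TF32 pass on an Ampere GPU.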

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 25, 2021

💊 CI failures summary and remediations

As of commit 77af952 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@@ -3692,7 +3693,7 @@ def fractional_max_pool3d_test(test_case):
         check_gradgrad=False,
         desc='gelu_activation',
         with_tf32=True,
-        tf32_precision=0.01,
+        tf32_precision=0.05,
Collaborator

Isn't it worrying? Did a new cuBLAS version come out?

Collaborator Author

@zasdfgbnm zasdfgbnm Feb 25, 2021

This is the cuBLAS that ships with CUDA 11.2.1; we have never gone through the tests and made them green against it before.

Collaborator Author

Threshold issues come up so often that we don't fix this kind of bug immediately after we see it. We usually wait until there are enough failures to submit a fix, so the last time we saw this pass could have been many versions ago (I don't remember exactly which version).

@anjali411 anjali411 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 25, 2021
@codecov

codecov bot commented Feb 26, 2021

Codecov Report

Merging #52871 (77af952) into master (1b792a7) will increase coverage by 0.00%.
The diff coverage is 86.66%.

@@           Coverage Diff           @@
##           master   #52871   +/-   ##
=======================================
  Coverage   77.47%   77.47%           
=======================================
  Files        1892     1892           
  Lines      185623   185623           
=======================================
+ Hits       143803   143804    +1     
+ Misses      41820    41819    -1     

def wrapped(*args, **kwargs):
    for k, v in zip(arg_names, args):
        kwargs[k] = v
    cond = tf32_is_not_fp32()
Collaborator

Add a comment explaining the computation of cond and its effect

Collaborator

Also update the comment explaining the decorator so readers know when it has an effect

Collaborator Author

I just added all the comments in front of this function.
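
For readers following the thread without opening the diff, a rough paraphrase of what such a leading comment needs to convey (not the exact text added in the PR) is:

```
# Paraphrase only; the wording in the PR may differ.
#
# cond = tf32_is_not_fp32() is True only on GPUs where TF32 and FP32 actually
# produce different results (Ampere / compute capability 8.0 and newer). If the
# wrapped test takes a `device` argument, cond is further restricted to CUDA
# devices; if it takes no `device` argument, the test is assumed to run on CUDA.
#
# Effect: when cond is False the decorator is effectively a no-op and the test
# runs once with its default tolerance; when cond is True the test is exercised
# both with TF32 disabled and with TF32 enabled, the latter using the looser
# tf32_precision.
```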

@zasdfgbnm
Collaborator Author

ping @ngimel @mruberry

@mruberry
Collaborator

@zasdfgbnm Thank you for the ping. This looks good to me now. @ngimel, would you like to take a look?

@mruberry mruberry self-requested a review March 24, 2021 05:59
Collaborator

@mruberry mruberry left a comment

Thanks @zasdfgbnm!

@facebook-github-bot
Contributor

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mruberry merged this pull request in 9f336bd.

@zasdfgbnm zasdfgbnm deleted the tf32-cudnn branch March 25, 2021 06:14
facebook-github-bot pushed a commit that referenced this pull request Jun 24, 2021
Summary:
Allow those tests to pass on A100 GPUs which support tf32

Basically a follow-up to #52871, which also increased some precisions to 0.05.

For reference, these are the failures I see (only the ones in test_nn with 1.9.0):
```
FAIL: test_Conv3d_pad_same_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 161 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan compariso
ns). The greatest difference was 0.032408137116391345 (-33.45570601919647 vs. -33.42329788208008), which occurred at index (2, 0, 0, 1, 0).

======================================================================
FAIL: test_Conv3d_pad_same_dilated_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 111 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan compariso
ns). The greatest difference was 0.024654212557543076 (35.104286017977465 vs. 35.07963180541992), which occurred at index (3, 0, 0, 0, 2).

======================================================================
FAIL: test_Conv3d_pad_valid_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 41 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.010903167642320355 (8.074376869119371 vs. 8.06347370147705), which occurred at index (0, 0, 1, 0, 0).

```

Pull Request resolved: #60451

Reviewed By: albanD

Differential Revision: D29353255

Pulled By: ngimel

fbshipit-source-id: 155a02242be5a11dcbd9dd40ab63f15c6757ae1b
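
For context on why these differences land in the 1e-2 range, the sketch below (an illustration assuming an Ampere GPU, using PyTorch's public TF32 switches rather than any test-suite internals) shows how TF32 shifts matmul numerics relative to plain FP32:

```
# Illustration only: compare an FP32 and a TF32 matmul on an Ampere GPU. The exact
# error magnitude depends on GPU, library versions and problem size; the point is
# that TF32 needs a much looser atol than FP32 round-off would.
import torch

assert torch.cuda.is_available(), "requires a CUDA device (Ampere or newer for TF32)"

a = torch.randn(512, 512, device='cuda')
b = torch.randn(512, 512, device='cuda')

torch.backends.cuda.matmul.allow_tf32 = False
ref = a @ b  # full FP32 accumulation

torch.backends.cuda.matmul.allow_tf32 = True
approx = a @ b  # TF32: inputs rounded to a 10-bit mantissa before multiplying

# Typically far above FP32 round-off, which is why tolerances like 0.01 or 0.05
# are used for the *_tf32 variants of these tests.
print((ref - approx).abs().max().item())
```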
Labels
cla signed, Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
6 participants