Fixes new tf32 failures in test_nn.py #52871
@@ -3692,7 +3693,7 @@ def fractional_max_pool3d_test(test_case):
         check_gradgrad=False,
         desc='gelu_activation',
         with_tf32=True,
-        tf32_precision=0.01,
+        tf32_precision=0.05,
Isn't this worrying? Did a new cuBLAS version come out?
This is the cuBLAS in CUDA 11.2.1, for which we have never gone through the tests to get them green before.
Threshold issues come up so often that we don't fix this kind of bug as soon as we see it. We usually wait until enough failures have accumulated to submit a fix. So the last time we saw this pass could be many versions ago. (I don't remember exactly which version.)
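To give a feel for why TF32 thresholds sit so far above FP32 ones, here is a minimal sketch that emulates TF32 input rounding in pure Python. It assumes round-to-nearest down to 10 explicit mantissa bits; the helper names `to_tf32` and `dot` and the input values are hypothetical, and this is not how cuBLAS or the test suite computes anything.

```python
import struct

def to_tf32(x: float) -> float:
    """Round x to float32, then to 10 explicit mantissa bits (TF32-like)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # float32 stores 23 mantissa bits and TF32 keeps 10, so 13 low bits go.
    # Adding half of the dropped range first rounds to nearest (an assumption).
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot(a, b):
    """Plain dot product, accumulated in Python floats."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical inputs, sized like a 3x3x3 convolution window.
a = [0.37 * i - 4.1 for i in range(27)]
b = [1.9 - 0.23 * i for i in range(27)]

exact = dot(a, b)
rounded = dot([to_tf32(x) for x in a], [to_tf32(y) for y in b])
error = abs(exact - rounded)

# Each rounded input is off by at most a relative 2**-11, so each product
# is off by roughly twice that; 3 * 2**-11 leaves some slack.
bound = 3 * 2**-11 * dot([abs(x) for x in a], [abs(y) for y in b])
print(f"exact={exact:.4f} tf32_inputs={rounded:.4f} error={error:.4f} bound={bound:.4f}")
```

The error scales with the magnitude of the products being summed, not with the magnitude of the result, which is one reason a fixed absolute threshold tends to need revisiting when kernels or library versions change.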
b6df6a6 to cbbf9f7
Codecov Report
@@ Coverage Diff @@
## master #52871 +/- ##
=======================================
Coverage 77.47% 77.47%
=======================================
Files 1892 1892
Lines 185623 185623
=======================================
+ Hits 143803 143804 +1
+ Misses 41820 41819 -1
def wrapped(*args, **kwargs):
    for k, v in zip(arg_names, args):
        kwargs[k] = v
    cond = tf32_is_not_fp32()
Add a comment explaining the computation of cond and its effect
Also update the comment explaining the decorator so readers know when it has an effect
I just added all comments in front of this function.
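For readers following the review thread, the role of `cond` can be sketched roughly as below. This is a simplified stand-in, not the code in this PR: `TF32_STATE`, `tf32_is_not_fp32_stub`, and `run_with_and_without_tf32` are hypothetical names, the hardware check is faked with a dictionary, and the wrapper returns a list of results only so the behaviour is easy to observe.

```python
import functools
import inspect

# Hypothetical stand-in for the real hardware query and backend flag.
TF32_STATE = {"hardware_supports_tf32": True, "enabled": False}

def tf32_is_not_fp32_stub():
    """Pretend check: True when TF32 would actually differ from FP32."""
    return TF32_STATE["hardware_supports_tf32"]

def run_with_and_without_tf32(f):
    arg_names = tuple(inspect.signature(f).parameters)

    @functools.wraps(f)
    def wrapped(*args, **kwargs):
        # Fold positional arguments into kwargs so they can be looked up by name.
        for k, v in zip(arg_names, args):
            kwargs[k] = v
        # cond answers: is there any point in running a separate TF32 pass?
        cond = tf32_is_not_fp32_stub()
        # Only narrow on the device when the wrapped function takes one.
        if "device" in kwargs:
            cond = cond and str(kwargs["device"]).startswith("cuda")
        if not cond:
            return [f(**kwargs)]
        results = []
        for enabled in (False, True):
            previous = TF32_STATE["enabled"]
            TF32_STATE["enabled"] = enabled
            try:
                results.append(f(**kwargs))
            finally:
                TF32_STATE["enabled"] = previous
        return results

    return wrapped

@run_with_and_without_tf32
def check(device):
    return (device, TF32_STATE["enabled"])

@run_with_and_without_tf32
def check_no_device():
    return TF32_STATE["enabled"]

print(check("cuda:0"))
print(check("cpu"))
print(check_no_device())
```

When `cond` is false the decorator has no effect and the function runs once as written; when it is true the function runs twice, once per TF32 setting.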
@zasdfgbnm Thank you for the ping. This looks good to me now. @ngimel, would you like to take a look?
Thanks @zasdfgbnm!
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Allow those tests to pass on A100 GPUs which support tf32

Basically follow-up to #52871 which also increased some precisions to 0.05

For reference these are the failures I see (only ones in testnn with 1.9.0):
```
FAIL: test_Conv3d_pad_same_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 161 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.032408137116391345 (-33.45570601919647 vs. -33.42329788208008), which occurred at index (2, 0, 0, 1, 0).

======================================================================
FAIL: test_Conv3d_pad_same_dilated_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 111 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.024654212557543076 (35.104286017977465 vs. 35.07963180541992), which occurred at index (3, 0, 0, 0, 2).

======================================================================
FAIL: test_Conv3d_pad_valid_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e8846fa367de36e7fe58b9463678adf5f)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 41 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.010903167642320355 (8.074376869119371 vs. 8.06347370147705), which occurred at index (0, 0, 1, 0, 0).
```

Pull Request resolved: #60451

Reviewed By: albanD

Differential Revision: D29353255

Pulled By: ngimel

fbshipit-source-id: 155a02242be5a11dcbd9dd40ab63f15c6757ae1b
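As a quick sanity check on the numbers in the failures above, the sketch below replays the pure-absolute comparison (`rtol=0`) on the three reported worst-case pairs. The dictionary and helper name are hypothetical; the values are copied from the assertion messages, and 0.05 is taken from the summary's mention of the precision used in #52871, so whether it is the exact value applied to each of these tests is an assumption.

```python
def exceeds_atol(first: float, second: float, atol: float) -> bool:
    """Absolute-tolerance check with rtol=0, as in the failing assertions."""
    return abs(first - second) > atol

# Worst-case pairs copied from the three assertion messages above.
reported = {
    "test_Conv3d_pad_same_cuda_tf32": (-33.45570601919647, -33.42329788208008),
    "test_Conv3d_pad_same_dilated_cuda_tf32": (35.104286017977465, 35.07963180541992),
    "test_Conv3d_pad_valid_cuda_tf32": (8.074376869119371, 8.06347370147705),
}

for name, (first, second) in reported.items():
    diff = abs(first - second)
    print(
        f"{name}: diff={diff:.6f} "
        f"fails_at_0.005={exceeds_atol(first, second, 0.005)} "
        f"fails_at_0.05={exceeds_atol(first, second, 0.05)}"
    )
```

All three worst-case differences exceed 0.005 and stay below 0.05. Only the largest difference per test is reported, so this says nothing about the distribution of the remaining elements.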
Also modify the `tf32_on_and_off` decorator so that it supports functions without a `device` argument.