[NCCL] Join work clean up thread before aborting communicators #55444


Closed
wants to merge 2 commits

Conversation

@rohan-varma (Member) commented Apr 7, 2021

Stack from ghstack:

Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

With this change, we no longer see these false positive logs.

Differential Revision: D27613035
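
For readers skimming the change, here is a minimal, self-contained C++ sketch of the shutdown ordering being adopted. The names (`CleanupOwner`, `abortCommunicators`, `cleanupThread_`, etc.) are illustrative stand-ins rather than the actual `ProcessGroupNCCL` members; only the join-the-cleanup-thread-before-aborting order reflects what the PR does.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>

// Illustrative stand-in for the process group; only the destructor ordering
// mirrors what this PR does to ~ProcessGroupNCCL.
class CleanupOwner {
 public:
  CleanupOwner() : cleanupThread_([this] { workCleanupLoop(); }) {}

  ~CleanupOwner() {
    // 1. Signal shutdown and join the work cleanup thread first, so the loop
    //    can no longer observe outstanding work objects at all...
    terminate_.store(true);
    cv_.notify_all();
    cleanupThread_.join();
    // 2. ...and only then tear down the communicators. With the old order
    //    (abort first, join later) the loop could see exceptions that were
    //    set on outstanding work purely because of the abort itself.
    abortCommunicators();
  }

 private:
  void workCleanupLoop() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!terminate_.load()) {
      // Wake up periodically (or when notified) and drain completed work.
      cv_.wait_for(lock, std::chrono::milliseconds(100));
      workMetaList_.clear();
    }
  }

  void abortCommunicators() {
    // Placeholder for calling ncclCommAbort() on each outstanding communicator.
    std::cout << "aborting communicators\n";
  }

  std::atomic<bool> terminate_{false};
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<int> workMetaList_;
  std::thread cleanupThread_;
};

int main() {
  CleanupOwner pg;  // destruction demonstrates the join-before-abort order
}
```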

After `ProcessGroupNCCL` is destroyed, it is possible that `workCleanupLoop` is still running and processing work that may have timed out. We don't actually act on timed-out work in that case, since `handleNCCLGuard` is gated by the `terminateProcessGroup_` check, so instead of iterating through `workMetaList`, we just clear it if we are shutting down. This also reduces log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after pg shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

[ghstack-poisoned]
@facebook-github-bot (Contributor) commented Apr 7, 2021

💊 CI failures summary and remediations

As of commit e3e80ea (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 2/5 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_rocm3_9_py3_6_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

```
Apr 09 06:54:24 Error generating file
Apr 09 06:54:24         AT_CUDA_CHECK(cub::DeviceSegmentedReduce::Reduce(
Apr 09 06:54:24                       ^
Apr 09 06:54:24 /var/lib/jenkins/workspace/aten/src/ATen/native/hip/SegmentReduce.hip:90:23: error: use of undeclared identifier 'cub'
Apr 09 06:54:24         AT_CUDA_CHECK(cub::DeviceSegmentedReduce::Reduce(
Apr 09 06:54:24                       ^
Apr 09 06:54:24 /var/lib/jenkins/workspace/aten/src/ATen/native/hip/SegmentReduce.hip:105:23: error: use of undeclared identifier 'cub'
Apr 09 06:54:24         AT_CUDA_CHECK(cub::DeviceSegmentedReduce::Reduce(
Apr 09 06:54:24                       ^
Apr 09 06:54:24 18 errors generated when compiling for gfx900.
Apr 09 06:54:24 CMake Error at torch_hip_generated_SegmentReduce.hip.o.cmake:192 (message):
Apr 09 06:54:24   Error generating file
Apr 09 06:54:24   /var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/./torch_hip_generated_SegmentReduce.hip.o
Apr 09 06:54:24 
Apr 09 06:54:24 
Apr 09 06:54:24 caffe2/CMakeFiles/torch_hip.dir/build.make:1195: recipe for target 'caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_SegmentReduce.hip.o' failed
Apr 09 06:54:24 make[2]: *** [caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_SegmentReduce.hip.o] Error 1
Apr 09 06:54:24 make[2]: *** Waiting for unfinished jobs....
Apr 09 06:54:25 In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/hip/UnaryLogKernels.hip:4:
Apr 09 06:54:25 In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/hip/Loops.cuh:18:
Apr 09 06:54:25 /var/lib/jenkins/workspace/aten/src/ATen/native/hip/MemoryAccess.cuh:38:26: warning: template template parameter using 'typename' is a C++17 extension [-Wc++17-extensions]
Apr 09 06:54:25 template<template<int i> typename func, int end, int current=0>
```

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

```
Apr 09 08:06:55 ERROR [0.006s]: TestViewOpsXLA (unittest.loader._FailedTest)
Apr 09 08:06:53 + XLA_EXPERIMENTAL=nonzero:masked_select
Apr 09 08:06:53 + run_test python3 /var/lib/jenkins/workspace/xla/test/../../test/test_view_ops.py -v TestViewOpsXLA
Apr 09 08:06:53 + python3 /var/lib/jenkins/workspace/xla/test/../../test/test_view_ops.py -v TestViewOpsXLA
Apr 09 08:06:55 Test results will be stored in test-reports/python-unittest/.var.lib.jenkins.workspace.xla.test.......test.test_view_ops
Apr 09 08:06:55 
Apr 09 08:06:55 Running tests...
Apr 09 08:06:55 ----------------------------------------------------------------------
Apr 09 08:06:55   TestViewOpsXLA (unittest.loader._FailedTest) ... ERROR (0.006s)
Apr 09 08:06:55 
Apr 09 08:06:55 ======================================================================
Apr 09 08:06:55 ERROR [0.006s]: TestViewOpsXLA (unittest.loader._FailedTest)
Apr 09 08:06:55 ----------------------------------------------------------------------
Apr 09 08:06:55 AttributeError: module '__main__' has no attribute 'TestViewOpsXLA'
Apr 09 08:06:55 
Apr 09 08:06:55 ----------------------------------------------------------------------
Apr 09 08:06:55 Ran 1 test in 0.007s
Apr 09 08:06:55 
Apr 09 08:06:55 FAILED (errors=1)
Apr 09 08:06:55 
Apr 09 08:06:55 Generating XML reports...
Apr 09 08:06:55 Generated XML report: test-reports/python-unittest/.var.lib.jenkins.workspace.xla.test.......test.test_view_ops/TEST-unittest.loader._FailedTest-20210409080655.xml
```

1 failure not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI pytorch_ios_12_0_0_x86_64_build | Build | 🔁 rerun |

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and cla signed labels Apr 7, 2021
rohan-varma added a commit that referenced this pull request Apr 7, 2021
After `ProcessGroupNCCL` is destroyed, it is possible that `workCleanupLoop` is still running and processing work that may have timed out. We don't actually act on timed-out work in that case, since `handleNCCLGuard` is gated by the `terminateProcessGroup_` check, so instead of iterating through `workMetaList`, we just clear it if we are shutting down. This also reduces log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after pg shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

ghstack-source-id: 125919880
Pull Request resolved: #55444
@rohan-varma requested a review from osalpekar April 7, 2021 05:21
```cpp
// workNCCL objects from work vector.
if (!terminateProcessGroup_.load()) {
  work.handleNCCLGuard();

if (terminateProcessGroup_.load()) {
```
A reviewer (Contributor) commented on this diff:

I'm not sure if these conditionals help, since the variable could be set to true right after you do this check (same goes for line 682). Would a better fix be to join the workCleanupThread before aborting the communicators in the ProcessGroupNCCL destructor?
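
To make the race concrete, here is the guarded call from the diff above with the problematic interleaving spelled out as comments. This fragment is illustrative only, not the actual PyTorch source:

```cpp
// Inside workCleanupLoop(), as in the first revision of this PR:
if (!terminateProcessGroup_.load()) {   // check passes while the pg is alive...
  // ...but ~ProcessGroupNCCL can run on another thread right here, set
  // terminateProcessGroup_ = true and abort the communicators, so the call
  // below can still observe an abort-induced exception and log it.
  work.handleNCCLGuard();
}
```

Joining the cleanup thread before aborting removes this window entirely, since the loop has already exited by the time any communicator is torn down.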

@rohan-varma (Member, Author) replied:

Ah ok, so the root cause of those false-positive error logs is that we abort the communicators during destruction. Yes, this seems like a better fix.

@rohan-varma changed the title from "[rfc] Add terminated check in workCleanupLoop" to "[NCCL] Join work clean up thread before aborting communicators" Apr 9, 2021
…tors"


Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

With this change, we no longer see these false positive logs.

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 9, 2021
Pull Request resolved: #55444

Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```
With this change, we no longer see these false positive logs.
ghstack-source-id: 126145284

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)
@osalpekar (Member) left a comment:

Looks good to me!

@facebook-github-bot (Contributor) commented:

This pull request has been merged in c218ac3.

krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
…ch#55444)

Summary:
Pull Request resolved: pytorch#55444

Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```
With this change, we no longer see these false positive logs.
ghstack-source-id: 126145284

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D27613035

fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96