[NCCL] Join work clean up thread before aborting communicators #55444


Closed
wants to merge 2 commits

Conversation

@rohan-varma (Member) commented Apr 7, 2021

Stack from ghstack:

Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

With this change, we no longer see these false positive logs.

Differential Revision: D27613035
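
For readers skimming the change, here is a minimal, self-contained C++ sketch of the shutdown ordering being adopted. The names (`CleanupOwner`, `abortCommunicators`, `cleanupThread_`, etc.) are illustrative stand-ins rather than the actual `ProcessGroupNCCL` members; only the join-the-cleanup-thread-before-aborting order reflects what the PR does.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>

// Illustrative stand-in for the process group; only the destructor ordering
// mirrors what this PR does to ~ProcessGroupNCCL.
class CleanupOwner {
 public:
  CleanupOwner() : cleanupThread_([this] { workCleanupLoop(); }) {}

  ~CleanupOwner() {
    // 1. Signal shutdown and join the work cleanup thread first, so the loop
    //    can no longer observe outstanding work objects at all...
    terminate_.store(true);
    cv_.notify_all();
    cleanupThread_.join();
    // 2. ...and only then tear down the communicators. With the old order
    //    (abort first, join later) the loop could see exceptions that were
    //    set on outstanding work purely because of the abort itself.
    abortCommunicators();
  }

 private:
  void workCleanupLoop() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!terminate_.load()) {
      // Wake up periodically (or when notified) and drain completed work.
      cv_.wait_for(lock, std::chrono::milliseconds(100));
      workMetaList_.clear();
    }
  }

  void abortCommunicators() {
    // Placeholder for calling ncclCommAbort() on each outstanding communicator.
    std::cout << "aborting communicators\n";
  }

  std::atomic<bool> terminate_{false};
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<int> workMetaList_;
  std::thread cleanupThread_;
};

int main() {
  CleanupOwner pg;  // destruction demonstrates the join-before-abort order
}
```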

After `ProcessGroupNCCL` is destroyed, it is possible that `workCleanupLoop` is still running and processing work that may have timed out. We don't actually act on timed-out work in that case, since `handleNCCLGuard` is gated by the `terminateProcessGroup_` check, so instead of iterating through `workMetaList`, we just clear it if we are shutting down. This also reduces log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after pg shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

[ghstack-poisoned]
@facebook-github-bot (Contributor) commented Apr 7, 2021

💊 CI failures summary and remediations

As of commit e3e80ea (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 2/5 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_rocm3_9_py3_6_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

```
Apr 09 06:54:24 Error generating file
Apr 09 06:54:24         AT_CUDA_CHECK(cub::DeviceSegmentedReduce::Reduce(
Apr 09 06:54:24                       ^
Apr 09 06:54:24 /var/lib/jenkins/workspace/aten/src/ATen/native/hip/SegmentReduce.hip:90:23: error: use of undeclared identifier 'cub'
Apr 09 06:54:24         AT_CUDA_CHECK(cub::DeviceSegmentedReduce::Reduce(
Apr 09 06:54:24                       ^
Apr 09 06:54:24 /var/lib/jenkins/workspace/aten/src/ATen/native/hip/SegmentReduce.hip:105:23: error: use of undeclared identifier 'cub'
Apr 09 06:54:24         AT_CUDA_CHECK(cub::DeviceSegmentedReduce::Reduce(
Apr 09 06:54:24                       ^
Apr 09 06:54:24 18 errors generated when compiling for gfx900.
Apr 09 06:54:24 CMake Error at torch_hip_generated_SegmentReduce.hip.o.cmake:192 (message):
Apr 09 06:54:24   Error generating file
Apr 09 06:54:24   /var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/./torch_hip_generated_SegmentReduce.hip.o
Apr 09 06:54:24 
Apr 09 06:54:24 
Apr 09 06:54:24 caffe2/CMakeFiles/torch_hip.dir/build.make:1195: recipe for target 'caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_SegmentReduce.hip.o' failed
Apr 09 06:54:24 make[2]: *** [caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_SegmentReduce.hip.o] Error 1
Apr 09 06:54:24 make[2]: *** Waiting for unfinished jobs....
Apr 09 06:54:25 In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/hip/UnaryLogKernels.hip:4:
Apr 09 06:54:25 In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/hip/Loops.cuh:18:
Apr 09 06:54:25 /var/lib/jenkins/workspace/aten/src/ATen/native/hip/MemoryAccess.cuh:38:26: warning: template template parameter using 'typename' is a C++17 extension [-Wc++17-extensions]
Apr 09 06:54:25 template<template<int i> typename func, int end, int current=0>
```

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

```
Apr 09 08:06:55 ERROR [0.006s]: TestViewOpsXLA (unittest.loader._FailedTest)
Apr 09 08:06:53 + XLA_EXPERIMENTAL=nonzero:masked_select
Apr 09 08:06:53 + run_test python3 /var/lib/jenkins/workspace/xla/test/../../test/test_view_ops.py -v TestViewOpsXLA
Apr 09 08:06:53 + python3 /var/lib/jenkins/workspace/xla/test/../../test/test_view_ops.py -v TestViewOpsXLA
Apr 09 08:06:55 Test results will be stored in test-reports/python-unittest/.var.lib.jenkins.workspace.xla.test.......test.test_view_ops
Apr 09 08:06:55 
Apr 09 08:06:55 Running tests...
Apr 09 08:06:55 ----------------------------------------------------------------------
Apr 09 08:06:55   TestViewOpsXLA (unittest.loader._FailedTest) ... ERROR (0.006s)
Apr 09 08:06:55 
Apr 09 08:06:55 ======================================================================
Apr 09 08:06:55 ERROR [0.006s]: TestViewOpsXLA (unittest.loader._FailedTest)
Apr 09 08:06:55 ----------------------------------------------------------------------
Apr 09 08:06:55 AttributeError: module '__main__' has no attribute 'TestViewOpsXLA'
Apr 09 08:06:55 
Apr 09 08:06:55 ----------------------------------------------------------------------
Apr 09 08:06:55 Ran 1 test in 0.007s
Apr 09 08:06:55 
Apr 09 08:06:55 FAILED (errors=1)
Apr 09 08:06:55 
Apr 09 08:06:55 Generating XML reports...
Apr 09 08:06:55 Generated XML report: test-reports/python-unittest/.var.lib.jenkins.workspace.xla.test.......test.test_view_ops/TEST-unittest.loader._FailedTest-20210409080655.xml
```

1 failure not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI pytorch_ios_12_0_0_x86_64_build | Build | 🔁 rerun |

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and cla signed labels Apr 7, 2021
rohan-varma added a commit that referenced this pull request Apr 7, 2021
After `ProcessGroupNCCL` is destroyed, it is possible that `workCleanupLoop` is still running and processing work that may have timed out. We don't actually act on timed-out work in that case, since `handleNCCLGuard` is gated by the `terminateProcessGroup_` check, so instead of iterating through `workMetaList`, we just clear it if we are shutting down. This also reduces log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after pg shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

ghstack-source-id: 125919880
Pull Request resolved: #55444
@rohan-varma requested a review from osalpekar April 7, 2021 05:21
```cpp
// workNCCL objects from work vector.
if (!terminateProcessGroup_.load()) {
  work.handleNCCLGuard();

if (terminateProcessGroup_.load()) {
```
A reviewer (Contributor) commented on this diff:

I'm not sure if these conditionals help, since the variable could be set to true right after you do this check (same goes for line 682). Would a better fix be to join the workCleanupThread before aborting the communicators in the ProcessGroupNCCL destructor?
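
To make the race concrete, here is the guarded call from the diff above with the problematic interleaving spelled out as comments. This fragment is illustrative only, not the actual PyTorch source:

```cpp
// Inside workCleanupLoop(), as in the first revision of this PR:
if (!terminateProcessGroup_.load()) {   // check passes while the pg is alive...
  // ...but ~ProcessGroupNCCL can run on another thread right here, set
  // terminateProcessGroup_ = true and abort the communicators, so the call
  // below can still observe an abort-induced exception and log it.
  work.handleNCCLGuard();
}
```

Joining the cleanup thread before aborting removes this window entirely, since the loop has already exited by the time any communicator is torn down.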

@rohan-varma (Member, Author) replied:

Ah ok, so the root cause of those false-positive error logs is that we abort the communicators during destruction. Yes, this seems like a better fix.

@rohan-varma changed the title from "[rfc] Add terminated check in workCleanupLoop" to "[NCCL] Join work clean up thread before aborting communicators" Apr 9, 2021
…tors"


Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

With this change, we no longer see these false positive logs.

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 9, 2021
Pull Request resolved: #55444

Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```
With this change, we no longer see these false positive logs.
ghstack-source-id: 126145284

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)
@osalpekar (Member) left a comment:

Looks good to me!

@facebook-github-bot (Contributor) commented:

This pull request has been merged in c218ac3.

krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
…ch#55444)

Summary:
Pull Request resolved: pytorch#55444

Changes `~ProcessGroupNCCL` so that we join the work cleanup thread before aborting NCCL communicators. If we abort the NCCL communicators first during destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling thanks to the terminated check, but it seems cleaner to join this thread first.

The other main motivation is to reduce log spam: we added logging when an exception is set on `WorkNCCL`, and this unexpectedly resulted in many false-positive errors being logged even after process group shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```
With this change, we no longer see these false positive logs.
ghstack-source-id: 126145284

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D27613035

fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96