[NCCL] Join work clean up thread before aborting communicators #55444
Conversation
After ProcessGroupNCCL is destroyed, it is possible that `workCleanupLoop` is still running and processing possibly timed-out work. We don't actually do anything with timed-out work in that case, since `handleNCCLGuard` is gated by the `terminateProcessGroup_` check. So instead of iterating through `workMetaList`, just clear it if we are shutting down.

This also reduces log spam: we added logging when an exception is set on `WorkNCCL`, but this unexpectedly resulted in a lot of false-positive errors being logged even after pg shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

[ghstack-poisoned]
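For context, here is a minimal, self-contained sketch of the idea; it is not the actual ProcessGroupNCCL code, and `Work`, `workList`, and `terminateFlag` are made-up stand-ins for `WorkNCCL`, `workMetaList`, and `terminateProcessGroup_`. While the flag indicates shutdown, the cleanup loop drops outstanding work instead of inspecting it:

```cpp
// Simplified model only (NOT the real ProcessGroupNCCL implementation).
#include <atomic>
#include <chrono>
#include <list>
#include <mutex>
#include <thread>

struct Work {
  // The real WorkNCCL would re-throw a stored NCCL exception here.
  void handleGuard() {}
};

std::atomic<bool> terminateFlag{false};  // stand-in for terminateProcessGroup_
std::mutex workMutex;                    // guards workList
std::list<Work> workList;                // stand-in for workMetaList

void workCleanupLoop() {
  bool shuttingDown = false;
  while (!shuttingDown) {
    shuttingDown = terminateFlag.load();
    {
      std::lock_guard<std::mutex> lock(workMutex);
      if (shuttingDown) {
        // Shutting down: drop outstanding work instead of inspecting it, so
        // no exceptions are surfaced or logged after the pg is terminating.
        workList.clear();
      } else {
        for (auto& work : workList) {
          work.handleGuard();  // only acted on while the pg is alive
        }
      }
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}

int main() {
  std::thread cleanup(workCleanupLoop);
  {
    std::lock_guard<std::mutex> lock(workMutex);
    workList.emplace_back();  // some outstanding work
  }
  terminateFlag.store(true);  // simulate process group shutdown
  cleanup.join();
  return 0;
}
```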
💊 CI failures summary and remediations

As of commit e3e80ea (more details on the Dr. CI page):

🕵️ 2 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
ci.pytorch.org: 1 failed
ghstack-source-id: 125919880

Pull Request resolved: #55444
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)

```
// workNCCL objects from work vector.
if (!terminateProcessGroup_.load()) {
  work.handleNCCLGuard();
if (terminateProcessGroup_.load()) {
```
I'm not sure these conditionals help, since the variable could be set to true right after you do this check (same goes for line 682). Would a better fix be to join the workCleanupThread before aborting the communicators in the ProcessGroupNCCL destructor?
Ah ok, so the root cause of those false-positive error logs is that we abort the communicators during destruction. Yes, this seems like a better fix.
…tors"

Changes ~ProcessGroupNCCL so that we join the work cleanup thread before aborting NCCL communicators. This is because if we abort NCCL communicators first on destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling due to the terminated check, but it seems a bit cleaner to just join this thread first.

The main motivation is also to reduce log spam: we added logging when an exception is set on `WorkNCCL`, but this unexpectedly resulted in a lot of false-positive errors being logged even after pg shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

With this change, we no longer see these false-positive logs.

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)

[ghstack-poisoned]
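As a rough illustration of the ordering this revision moves to, here is a simplified stand-in, not the actual ProcessGroupNCCL destructor; `FakeProcessGroup`, `FakeComm::abort`, and `cleanupLoop` are invented names marking where the real destructor, `ncclCommAbort`, and `workCleanupLoop` would sit. The destructor signals termination and joins the cleanup thread before aborting communicators:

```cpp
// Simplified model of the destructor ordering (NOT the real ProcessGroupNCCL code).
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

struct FakeComm {
  void abort() { /* ncclCommAbort would be called here */ }
};

class FakeProcessGroup {
 public:
  FakeProcessGroup() : cleanupThread_([this] { cleanupLoop(); }) {}

  ~FakeProcessGroup() {
    terminate_.store(true);
    // 1. Join the work cleanup thread first, so it can finish (or drop) any
    //    outstanding work while the communicators are still healthy.
    cleanupThread_.join();
    // 2. Only then abort the communicators; nothing is left watching the
    //    work list, so aborts can no longer surface as false-positive errors.
    for (auto& comm : comms_) {
      comm.abort();
    }
  }

 private:
  void cleanupLoop() {
    while (!terminate_.load()) {
      // Poll outstanding work, surface exceptions, etc.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::atomic<bool> terminate_{false};
  std::vector<FakeComm> comms_{FakeComm{}};
  std::thread cleanupThread_;
};

int main() {
  FakeProcessGroup pg;  // destructor at end of scope: join, then abort
  return 0;
}
```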
Pull Request resolved: #55444

ghstack-source-id: 126145284

Differential Revision: [D27613035](https://our.internmc.facebook.com/intern/diff/D27613035/)
Looks good to me!
This pull request has been merged in c218ac3.
…ch#55444)

Summary: Pull Request resolved: pytorch#55444

Changes ~ProcessGroupNCCL so that we join the work cleanup thread before aborting NCCL communicators; with this change, we no longer see the false-positive error logs described above.

ghstack-source-id: 126145284

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D27613035

fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96
Stack from ghstack:

Changes ~ProcessGroupNCCL so that we join the work cleanup thread before aborting NCCL communicators. This is because if we abort NCCL communicators first on destruction, outstanding work objects in `workMetaList` can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling due to the terminated check, but it seems a bit cleaner to just join this thread first.

The main motivation is also to reduce log spam: we added logging when an exception is set on `WorkNCCL`, but this unexpectedly resulted in a lot of false-positive errors being logged even after pg shutdown. An example is below:

```
I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
```

With this change, we no longer see these false-positive logs.

Differential Revision: D27613035