Update ddp_tutorial.rst #1618
Conversation
Relax GPU count to support devices with 2 GPUs
✔️ Deploy Preview for pytorch-tutorials-preview ready! 🔨 Explore the source changes: 856a6c2 🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/6101c7b122cc0f0007d631b8 😎 Browse the preview: https://deploy-preview-1618--pytorch-tutorials-preview.netlify.app
Hey Shen, Pritam, and Brian. This is a small change to make the DDP tutorial accept a dev server with >= 2 GPUs instead of hard-coding it to 8. PTAL
Preview looks good. https://deploy-preview-1618--pytorch-tutorials-preview.netlify.app/intermediate/ddp_tutorial.html I can merge anytime. Thanks Bo!
Thanks a lot, Brian. Let's wait for Shen or Pritam to take a look before merging.
```diff
@@ -265,8 +265,8 @@ either the application or the model ``forward()`` method.
     setup(rank, world_size)

     # setup mp_model and devices for this process
-    dev0 = rank * 2
-    dev1 = rank * 2 + 1
+    dev0 = (rank * 2) % world_size
```
Unfortunately, we cannot do this. This will make rank 0 and rank 1 share the two GPUs if there are just two GPUs. DDP and collective communication require each process to work exclusively on its own GPUs; otherwise, the communication might hang.
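To make the concern concrete, here is a small illustration of what the modulo mapping produces on a 2-GPU machine. The truncated diff above only shows the new dev0 line, so the dev1 formula below is an assumption; treat this as a sketch rather than the exact patch:

```python
# Hypothetical illustration, not the patch itself: the dev1 formula
# is assumed by analogy with the dev0 line shown in the diff above.
world_size = 2  # e.g. two processes on a machine with only two GPUs
for rank in range(world_size):
    dev0 = (rank * 2) % world_size      # rank 0 -> 0, rank 1 -> 0
    dev1 = (rank * 2 + 1) % world_size  # rank 0 -> 1, rank 1 -> 1
    print(f"rank {rank}: cuda:{dev0}, cuda:{dev1}")
# Both ranks map to cuda:0 and cuda:1, so neither process owns its GPUs
# exclusively, which is why the collective communication might hang.
```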
The issue I got is as follows. I tried to run the code on a 2-GPU machine:
For rank0: dev0=0, dev1=1
For rank1: dev0=2, dev1=3
With that I got an exception about invalid device ordinals, since only cuda:0 and cuda:1 are valid, but rank1 ends up with cuda:2 and cuda:3.
Any idea how to avoid such an exception?
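For reference, a minimal sketch of the arithmetic behind that exception with the original mapping (device indices only; no CUDA calls are made here):

```python
# The original mapping, evaluated for two processes on a 2-GPU machine.
n_gpus = 2  # what torch.cuda.device_count() returns on that machine
for rank in range(2):
    dev0 = rank * 2      # rank 0 -> 0, rank 1 -> 2
    dev1 = rank * 2 + 1  # rank 0 -> 1, rank 1 -> 3
    valid = dev0 < n_gpus and dev1 < n_gpus
    print(f"rank {rank}: cuda:{dev0}, cuda:{dev1}, valid={valid}")
# rank 1 asks for cuda:2 and cuda:3, which do not exist on a 2-GPU
# machine, hence the "invalid device ordinal" error.
```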
This tutorial will require 8 GPUs to run, as highlighted in the note block in the summary section: "The code in this tutorial runs on an 8-GPU server, but it can be easily generalized to other environments."
Maybe we can error out when the number of GPUs is less than 8?
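One possible guard, sketched here rather than taken from the tutorial, would be to check the device count up front:

```python
import torch

# Hypothetical guard (not the tutorial's actual code): fail fast with a
# clear message instead of hitting a device-ordinal error mid-demo.
n_gpus = torch.cuda.device_count()
if n_gpus < 8:
    raise RuntimeError(
        f"This tutorial expects at least 8 GPUs, but only {n_gpus} were found."
    )
```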
On second thought, I think only the model parallel part, demo_model_parallel, needs 8 GPUs, so we could skip just that one when the number of GPUs is less than 8. Actually, it looks like it only needs >= 4 GPUs. Maybe we can do the following (a rough sketch follows below)?
- skip demo_model_parallel when there are less than 4 GPUs
- pass ngpus / 2 as world_size
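A sketch of that proposal. The run_demo helper and the demo_basic, demo_checkpoint, and demo_model_parallel functions are assumed to be the ones defined in the tutorial; this is not the exact change that landed:

```python
import torch

# Sketch only: run_demo, demo_basic, demo_checkpoint and
# demo_model_parallel are assumed from the tutorial.
if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    # demo_basic and demo_checkpoint use one GPU per process, so any
    # machine with >= 2 GPUs can run them with one process per GPU.
    world_size = n_gpus
    run_demo(demo_basic, world_size)
    run_demo(demo_checkpoint, world_size)
    # demo_model_parallel places each model across two GPUs, so it needs
    # at least 4 GPUs; skip it otherwise and pass ngpus / 2 as world_size
    # so that every process owns two GPUs exclusively.
    if n_gpus >= 4:
        run_demo(demo_model_parallel, world_size // 2)
```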
Thanks, Shen, for your super quick review. Please take a look at the inline comment on the error I got when running the example.
Summary: #1618 was merged unintentionally before all the comments were resolved. Sending out the CL to roll it back. [ghstack-poisoned]
Oops, this is not ready to merge. Sent out https://github.com/pytorch/tutorials/pull/1621/files to roll it back.
Summary: #1618 was merged unintentionally before all the comments were resolved. Also did a minor fix to skip demo_model_parallel when ngpus is < 4. [ghstack-poisoned]
Relax GPU count to support devices with 2 GPUs Co-authored-by: Brian Johnson <brianjo@fb.com>