
Update ddp_tutorial.rst #1618


Merged · 2 commits merged into pytorch:master on Jul 28, 2021

Conversation

bowangbj (Contributor)

Relax GPU count to support devices with 2 GPUs
netlify bot commented Jul 27, 2021

✔️ Deploy Preview for pytorch-tutorials-preview ready!

🔨 Explore the source changes: 856a6c2

🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/6101c7b122cc0f0007d631b8

😎 Browse the preview: https://deploy-preview-1618--pytorch-tutorials-preview.netlify.app

bowangbj (Contributor, Author)

Hey Shen, Pritam, and Brian.

This is a small change to make the DDP tutorial accept a dev server with >= 2 GPUs instead of hard-coding the count to 8. PTAL

brianjo (Contributor) commented Jul 27, 2021

Preview looks good: https://deploy-preview-1618--pytorch-tutorials-preview.netlify.app/intermediate/ddp_tutorial.html I can merge anytime. Thanks Bo!

bowangbj (Contributor, Author)

Thanks a lot, Brian. Let's wait for Shen or Pritam to take a look before merging.

@@ -265,8 +265,8 @@ either the application or the model ``forward()`` method.
     setup(rank, world_size)

     # setup mp_model and devices for this process
-    dev0 = rank * 2
-    dev1 = rank * 2 + 1
+    dev0 = (rank * 2) % world_size
Contributor

Unfortunately, we cannot do this: it will make rank 0 and rank 1 share the same two GPUs when there are only two GPUs. DDP and collective communication require each process to work exclusively on its own GPUs; otherwise, the communication might hang.
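
To make the concern concrete, here is a minimal sketch (illustration only) of the device assignment produced by the proposed modulo mapping, assuming the model-parallel demo is spawned with world_size = 2 on a 2-GPU machine:

    # Illustration: device assignment under dev = (rank * 2 + offset) % world_size
    world_size = 2
    for rank in range(world_size):
        dev0 = (rank * 2) % world_size
        dev1 = (rank * 2 + 1) % world_size
        print(f"rank {rank} -> cuda:{dev0}, cuda:{dev1}")

    # Prints:
    #   rank 0 -> cuda:0, cuda:1
    #   rank 1 -> cuda:0, cuda:1   (both processes end up on the same two GPUs)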

bowangbj (Contributor, Author)

The issue I hit is as follows. I tried to run the code on a 2-GPU machine:

  For rank 0: dev0=0, dev1=1
  For rank 1: dev0=2, dev1=3

With that, I got an exception about invalid device ordinals, since only cuda:0 and cuda:1 are valid, but rank 1 ends up with cuda:2 and cuda:3.

Any idea how to avoid this exception?

Contributor

This tutorial requires 8 GPUs to run, as highlighted in the note block in the summary section: "The code in this tutorial runs on an 8-GPU server, but it can be easily generalized to other environments."

Maybe we can error out when the number of GPUs is less than 8?
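
A minimal sketch of that check, assuming it sits in the tutorial's __main__ block; torch.cuda.device_count() is the standard way to query the GPU count:

    import torch

    n_gpus = torch.cuda.device_count()
    # Fail fast instead of hitting invalid-device-ordinal errors later on.
    assert n_gpus >= 8, (
        f"This tutorial assumes an 8-GPU server, but only {n_gpus} GPU(s) were found."
    )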

mrshenli (Contributor) commented Jul 29, 2021

On second thought, I think only the model-parallel part needs 8 GPUs; we could skip just demo_model_parallel when the number of GPUs is less than 8.

Actually, it looks like it only needs >= 4 GPUs. Maybe we can do the following (sketched after this list)?

  1. Skip demo_model_parallel when there are fewer than 4 GPUs.
  2. Pass ngpus / 2 as world_size.
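
A hedged sketch of that suggestion, reusing the tutorial's run_demo and demo_model_parallel helpers; the exact guard below is illustrative, not necessarily the code that was eventually merged:

    import torch

    if __name__ == "__main__":
        n_gpus = torch.cuda.device_count()
        if n_gpus < 4:
            print(f"Skipping demo_model_parallel: it needs at least 4 GPUs, but only {n_gpus} were found.")
        else:
            # Each demo_model_parallel rank drives two GPUs, so spawn n_gpus // 2 processes.
            run_demo(demo_model_parallel, n_gpus // 2)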

brianjo merged commit a1ad9ed into pytorch:master on Jul 28, 2021
bowangbj (Contributor, Author)

Thanks, Shen, for your super quick review. Please take a look at the inline comment about the error I got when running the example.

bowangbj added a commit that referenced this pull request on Jul 28, 2021

Summary: #1618 was merged unintentionally before all the comments were resolved. Sending out the CL to roll it back.

ghstack-source-id: e90ca2d
Pull Request resolved: #1621
bowangbj (Contributor, Author)

Oops, this is not ready to merge. Sent out https://github.com/pytorch/tutorials/pull/1621/files to roll it back.

bowangbj added a series of commits that referenced this pull request on Jul 30, 2021, with the following message:

Summary: #1618 was merged unintentionally before all the comments were resolved. Also did a minor fix to skip demo_model_parallel when ngpus is < 4.

ghstack-source-id: 2c37d77
Pull Request resolved: #1621
rodrigo-techera pushed a commit to Experience-Monks/tutorials that referenced this pull request Nov 29, 2021
Relax GPU count to support devices with 2 GPUs

Co-authored-by: Brian Johnson <brianjo@fb.com>