Update ddp_tutorial.rst #1618
Conversation
Relax GPU count to support devices with 2 GPUs
✔️ Deploy Preview for pytorch-tutorials-preview ready! 🔨 Explore the source changes: 856a6c2 🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/6101c7b122cc0f0007d631b8 😎 Browse the preview: https://deploy-preview-1618--pytorch-tutorials-preview.netlify.app
Hey Shen, Pritam, and Brian. This is a small change to make the DDP tutorial accept a dev server with >= 2 GPUs instead of hard-coding it to 8. PTAL
Preview looks good. https://deploy-preview-1618--pytorch-tutorials-preview.netlify.app/intermediate/ddp_tutorial.html I can merge anytime. Thanks Bo!
Thanks a lot, Brian. Let's wait for Shen or Pritam to take a look before merging.
```diff
@@ -265,8 +265,8 @@ either the application or the model ``forward()`` method.
     setup(rank, world_size)

     # setup mp_model and devices for this process
-    dev0 = rank * 2
-    dev1 = rank * 2 + 1
+    dev0 = (rank * 2) % world_size
```
Unfortunately, we cannot do this. This will make rank 0 and rank 1 share the two GPUs if there are just two GPUs. DDP and collective communication require each process to work exclusively on its own GPUs; otherwise, the communication might hang.
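To make the concern concrete, here is a small illustration of what the modulo mapping produces on a 2-GPU machine. The truncated diff above only shows the new dev0 line, so the dev1 formula below is an assumption; treat this as a sketch rather than the exact patch:

```python
# Hypothetical illustration, not the patch itself: the dev1 formula
# is assumed by analogy with the dev0 line shown in the diff above.
world_size = 2  # e.g. two processes on a machine with only two GPUs
for rank in range(world_size):
    dev0 = (rank * 2) % world_size      # rank 0 -> 0, rank 1 -> 0
    dev1 = (rank * 2 + 1) % world_size  # rank 0 -> 1, rank 1 -> 1
    print(f"rank {rank}: cuda:{dev0}, cuda:{dev1}")
# Both ranks map to cuda:0 and cuda:1, so neither process owns its GPUs
# exclusively, which is why the collective communication might hang.
```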
The issue I got is as follows. I tried to run the code on a 2-GPU machine:
For rank0: dev0=0, dev1=1
For rank1: dev0=2, dev1=3
With that I got an exception about invalid device ordinals, since only cuda:0 and cuda:1 are valid, but rank1 ends up with cuda:2 and cuda:3.
Any idea how to avoid such an exception?
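For reference, a minimal sketch of the arithmetic behind that exception with the original mapping (device indices only; no CUDA calls are made here):

```python
# The original mapping, evaluated for two processes on a 2-GPU machine.
n_gpus = 2  # what torch.cuda.device_count() returns on that machine
for rank in range(2):
    dev0 = rank * 2      # rank 0 -> 0, rank 1 -> 2
    dev1 = rank * 2 + 1  # rank 0 -> 1, rank 1 -> 3
    valid = dev0 < n_gpus and dev1 < n_gpus
    print(f"rank {rank}: cuda:{dev0}, cuda:{dev1}, valid={valid}")
# rank 1 asks for cuda:2 and cuda:3, which do not exist on a 2-GPU
# machine, hence the "invalid device ordinal" error.
```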
This tutorial will require 8 GPUs to run, as highlighted in the note block in the summary section: "The code in this tutorial runs on an 8-GPU server, but it can be easily generalized to other environments."
Maybe we can error out when the number of GPUs is less than 8?
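One possible guard, sketched here rather than taken from the tutorial, would be to check the device count up front:

```python
import torch

# Hypothetical guard (not the tutorial's actual code): fail fast with a
# clear message instead of hitting a device-ordinal error mid-demo.
n_gpus = torch.cuda.device_count()
if n_gpus < 8:
    raise RuntimeError(
        f"This tutorial expects at least 8 GPUs, but only {n_gpus} were found."
    )
```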
On second thought, I think only the model parallel part, demo_model_parallel, needs 8 GPUs, so we could skip just that one when the number of GPUs is less than 8. Actually, it looks like it only needs >= 4 GPUs. Maybe we can do the following (a rough sketch follows below)?
- skip demo_model_parallel when there are less than 4 GPUs
- pass ngpus / 2 as world_size
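A sketch of that proposal. The run_demo helper and the demo_basic, demo_checkpoint, and demo_model_parallel functions are assumed to be the ones defined in the tutorial; this is not the exact change that landed:

```python
import torch

# Sketch only: run_demo, demo_basic, demo_checkpoint and
# demo_model_parallel are assumed from the tutorial.
if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    # demo_basic and demo_checkpoint use one GPU per process, so any
    # machine with >= 2 GPUs can run them with one process per GPU.
    world_size = n_gpus
    run_demo(demo_basic, world_size)
    run_demo(demo_checkpoint, world_size)
    # demo_model_parallel places each model across two GPUs, so it needs
    # at least 4 GPUs; skip it otherwise and pass ngpus / 2 as world_size
    # so that every process owns two GPUs exclusively.
    if n_gpus >= 4:
        run_demo(demo_model_parallel, world_size // 2)
```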
Thanks, Shen, for your super quick review. Please take a look at the inline comment on the error I got when running the example.
Summary: #1618 was merged unintentionally before all the comments were resolved. Sending out the CL to roll it back. [ghstack-poisoned]
Oops, this is not ready to merge. Sent out https://github.com/pytorch/tutorials/pull/1621/files to roll it back.
Summary: #1618 was merged unintentionally before all the comments were resolved. Also did a minor fix to skip demo_model_parallel when ngpus is < 4. [ghstack-poisoned]
Relax GPU count to support devices with 2 GPUs Co-authored-by: Brian Johnson <brianjo@fb.com>