Rollback ddp_tutorial fix #1621
Conversation
Summary: #1618 was merged unintentionally before all the comments were resolved. Sending out the CL to roll it back. Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
```diff
@@ -265,8 +265,8 @@ either the application or the model ``forward()`` method.
     setup(rank, world_size)

     # setup mp_model and devices for this process
-    dev0 = (rank * 2) % world_size
-    dev1 = (rank * 2 + 1) % world_size
+    dev0 = rank * 2
+    dev1 = rank * 2 + 1
```
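For context, a minimal sketch of how the surrounding `demo_model_parallel` function reads after the rollback; `setup`, `cleanup`, and the two-GPU `ToyMpModel` are assumed from the rest of the tutorial, and exact tensor shapes may differ:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_model_parallel(rank, world_size):
    print(f"Running DDP with model parallel example on rank {rank}.")
    setup(rank, world_size)  # assumed tutorial helper: init_process_group, etc.

    # After the rollback each rank claims its own pair of GPUs,
    # so this demo needs at least 2 * world_size GPUs.
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    mp_model = ToyMpModel(dev0, dev1)  # assumed two-GPU toy model from the tutorial
    ddp_mp_model = DDP(mp_model)       # no device_ids for a multi-device module

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    # The toy model's forward moves activations from dev0 to dev1,
    # so outputs (and labels) live on dev1.
    outputs = ddp_mp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(dev1)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()  # assumed tutorial helper: destroy_process_group
```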
Copied the conversation from #1618:

Shen:
Unfortunately, we cannot do this. If there are just two GPUs, this will make rank 0 and rank 1 share them. DDP and collective communication require each process to work exclusively on its own GPUs; otherwise the communication might hang.
Bo:
The issue I hit is this: I tried to run the code on a 2-GPU machine.
For rank 0: dev0=0, dev1=1
For rank 1: dev0=2, dev1=3
With that I got an exception about invalid device ordinals, since only cuda:0 and cuda:1 are valid, but rank 1 ends up with cuda:2 and cuda:3.
Any idea how to avoid that exception?
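An editorial illustration of both problems (not tutorial code): on a 2-GPU machine with two ranks, the modulo mapping from #1618 puts both ranks on cuda:0/cuda:1 (Shen's concern about shared GPUs), while the mapping restored by this PR sends rank 1 to cuda:2/cuda:3, which do not exist on that machine (Bo's exception):

```python
world_size = 2  # two processes on a 2-GPU machine

for rank in range(world_size):
    # mapping from #1618 (rolled back here): wraps around with % world_size
    dev0_mod = (rank * 2) % world_size
    dev1_mod = (rank * 2 + 1) % world_size
    # mapping restored by this PR: each rank gets its own GPU pair
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    print(f"rank {rank}: modulo -> cuda:{dev0_mod}, cuda:{dev1_mod}; "
          f"restored -> cuda:{dev0}, cuda:{dev1}")

# Output:
# rank 0: modulo -> cuda:0, cuda:1; restored -> cuda:0, cuda:1
# rank 1: modulo -> cuda:0, cuda:1; restored -> cuda:2, cuda:3
```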
Looks like only demo_model_parallel needs >= 4 GPUs. Maybe we can do the following (sketched below)?
- skip demo_model_parallel when there are less than 4 GPUs
- pass ngpus / 2 as world_size
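A minimal sketch of that suggestion for the tutorial's launcher, assuming the tutorial's `run_demo` helper and demo functions; the final fix may differ in details:

```python
import torch

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    if n_gpus < 2:
        print(f"Requires at least 2 GPUs to run, but got {n_gpus}.")
    else:
        run_demo(demo_basic, n_gpus)       # assumed tutorial helper and demos
        run_demo(demo_checkpoint, n_gpus)
        # demo_model_parallel uses two GPUs per process, so it needs >= 4 GPUs;
        # skip it otherwise, and spawn only n_gpus // 2 processes when it runs.
        if n_gpus >= 4:
            run_demo(demo_model_parallel, n_gpus // 2)
        else:
            print("Skipping demo_model_parallel: it requires at least 4 GPUs.")
```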
SG, changed as suggested
Continued discussion of #1618 here.
Thanks for fixing!
Summary: #1618 was merged unintentionally before all the comments were resolved. Also did a minor fix to skip demo_model_parallel when ngpus < 4. Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Thanks Shen for the quick review, fixed as commented, PTAL
Summary: See pytorch#1621 Test Plan: Reviewers: Subscribers: Tasks: Tags:
Closing this PR, which is subsumed by #1641.
Stack from ghstack:
Summary:
#1618 was merged unintentionally before all the comments were resolved.
Also did a minor fix to skip demo_model_parallel when ngpus < 4.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags: