Rollback ddp_tutorial fix #1621
Closed · bowangbj wants to merge 4 commits

Conversation

bowangbj (Contributor) commented Jul 28, 2021

Stack from ghstack:

Summary:

#1618 was merged unintentionally before all the comments were resolved.
Also made a minor fix to skip demo_model_parallel when ngpus < 4.

Summary:

#1618 was merged unintentionally before all the comments were resolved.
Sending out this change to roll it back.
bowangbj added a commit that referenced this pull request Jul 28, 2021
ghstack-source-id: e90ca2d
Pull Request resolved: #1621
bowangbj requested review from mrshenli and brianjo on July 28, 2021 at 22:55
@@ -265,8 +265,8 @@ either the application or the model ``forward()`` method.
     setup(rank, world_size)
 
     # setup mp_model and devices for this process
-    dev0 = (rank * 2) % world_size
-    dev1 = (rank * 2 + 1) % world_size
+    dev0 = rank * 2
+    dev1 = rank * 2 + 1
bowangbj (Contributor, Author) commented Jul 28, 2021:

@mrshenli

Copied the conversation from #1618:

Shen:
Unfortunately, we cannot do this. If there are just two GPUs, this will make rank 0 and rank 1 share them. DDP and collective communication require each process to work exclusively on its own GPUs; otherwise, the communication might hang.

Bo:
The issue I ran into is as follows:

I tried to run the code on a 2-GPU machine.
For rank 0 -- dev0=0, dev1=1
For rank 1 -- dev0=2, dev1=3

With that I got an exception about invalid device ordinals, since only cuda:0 and cuda:1 are valid, but rank 1 ends up with cuda:2 and cuda:3.

Any idea how to avoid this exception?
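
Not part of the PR, but to make the two failure modes concrete, here is a small Python sketch of the device indices each rank picks on a 2-GPU machine (world_size == ngpus == 2), mirroring the tutorial's arithmetic:

    # Pure-arithmetic sketch: which CUDA devices each rank of demo_model_parallel
    # would pick on a 2-GPU machine where world_size == ngpus == 2.
    ngpus = 2
    world_size = 2

    for rank in range(world_size):
        # Original tutorial code: rank 1 asks for cuda:2 / cuda:3, which do not
        # exist on this machine -> "invalid device ordinal" exception.
        plain = (rank * 2, rank * 2 + 1)
        # The #1618 change: both ranks land on cuda:0 / cuda:1, so two processes
        # share the same GPUs and DDP collective communication may hang.
        modulo = ((rank * 2) % world_size, (rank * 2 + 1) % world_size)
        print(f"rank {rank}: without modulo {plain}, with modulo {modulo}")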

mrshenli (Contributor) replied:
Looks like only demo_model_parallel needs >= 4 GPUs. Maybe we can do the following?

  1. skip demo_model_parallel when there are less than 4 GPUs
  2. pass ngpus / 2 as world_size

bowangbj (Contributor, Author) replied:

Sounds good; changed as suggested.
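
A minimal sketch of the suggested guard, using hypothetical stand-ins for the tutorial's run_demo and demo_model_parallel (the real functions live in the DDP tutorial); only the skip for fewer than 4 GPUs and the ngpus // 2 world size are the point here:

    import torch
    import torch.multiprocessing as mp

    def demo_model_parallel(rank, world_size):
        # Stand-in for the tutorial's demo_model_parallel: each rank drives two GPUs.
        dev0, dev1 = rank * 2, rank * 2 + 1
        print(f"rank {rank}/{world_size} would use cuda:{dev0} and cuda:{dev1}")

    def run_demo(demo_fn, world_size):
        # Stand-in for the tutorial's run_demo helper.
        mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)

    if __name__ == "__main__":
        n_gpus = torch.cuda.device_count()
        if n_gpus < 4:
            # demo_model_parallel needs two exclusive GPUs per process, so it is
            # skipped entirely on machines with fewer than 4 GPUs.
            print(f"Skipping demo_model_parallel: requires at least 4 GPUs, found {n_gpus}.")
        else:
            # Each process owns two GPUs, so the world size is half the GPU count.
            run_demo(demo_model_parallel, n_gpus // 2)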

bowangbj (Contributor, Author) commented:
Continued discussion of #1618 here.
Shen, Brian, PTAL

mrshenli (Contributor) left a review comment:
Thanks for fixing!

bowangbj added a commit that referenced this pull request Jul 30, 2021
ghstack-source-id: 2c37d77
Pull Request resolved: #1621
bowangbj (Contributor, Author) commented:

Thanks Shen for the quick review; fixed as commented. PTAL.

bowangbj added a commit to bowangbj/tutorials that referenced this pull request Aug 6, 2021
Summary: See pytorch#1621
bowangbj (Contributor, Author) commented Aug 6, 2021:

Closing this PR; it is subsumed by #1641.

bowangbj closed this on Aug 6, 2021
facebook-github-bot deleted the gh/bowangbj/2/head branch on September 6, 2021 at 14:18