Fix the param: rank --> trainer_rank #1619


Merged: 1 commit, Jul 28, 2021

Conversation

@bowangbj (Contributor) commented on Jul 27, 2021:

Stack from ghstack:

Summary:

Without the fix, _run_trainer is invoked with rank hardcoded to 2, which triggers a CUDA error: invalid device ordinal.
With this fix, _run_trainer is invoked with rank 0 and rank 1, which are valid GPU device ordinals.

Example code fix: pytorch/examples#924
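
For reference, a minimal sketch of the kind of change being described. The dispatch loop below is an assumption about what the rank-2 process in main.py does (the wrapper name dispatch_to_trainers is made up); _run_trainer, remote_emb_module, and trainer_rank are the names that appear in this PR and the traceback, and rank is the caller's own rank, which is 2 here:

import torch.distributed.rpc as rpc

def dispatch_to_trainers(remote_emb_module, rank):
    # Sketch only: launch _run_trainer on each trainer via RPC.
    futs = []
    for trainer_rank in [0, 1]:
        trainer_name = "trainer{}".format(trainer_rank)
        # Before the fix, the caller's rank (always 2) was forwarded, so each
        # trainer tried to use CUDA device 2, which does not exist on a 2-GPU box:
        #   rpc.rpc_async(trainer_name, _run_trainer, args=(remote_emb_module, rank))
        # After the fix, the trainer's own rank (0 or 1) is forwarded instead:
        fut = rpc.rpc_async(trainer_name, _run_trainer,
                            args=(remote_emb_module, trainer_rank))
        futs.append(fut)
    for fut in futs:
        fut.wait()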

Test Plan:

(pytorch) examples/distributed/rpc/ddp_rpc % python main.py -- training runs successfully.

bowangbj added a commit that referenced this pull request Jul 27, 2021
ghstack-source-id: 4916d77
Pull Request resolved: #1619
@bowangbj requested review from wayi1, pritamdamania87, mrshenli, and bdhirsh, and removed the review request for wayi1 on July 27, 2021 at 21:12
@bowangbj (Contributor, Author) commented:

Hey Pritam, Yi, Shen, and Brian,

I got an exception when I ran examples/distributed/rpc/ddp_rpc/main.py (see the error below). I guess it's caused by the param 'rank' being passed incorrectly; it seems it should be trainer_rank, since the ps rank (ps_rank) is always 2 (a minimal repro sketch follows the traceback below). The CL for the example code will follow. Please let me know whether this makes sense. Thanks.

The exception I got:

(pytorch) [[email protected] /data/users/bowangbj/examples/distributed/rpc/ddp_rpc] python main.py
On WorkerInfo(id=0, name=trainer0):
RuntimeError('CUDA error: invalid device ordinal\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 204, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 48, in _run_trainer
model = HybridModel(remote_emb_module, rank)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 30, in init
self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in cuda
return self._apply(lambda t: t.cuda(device))
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 558, in _apply
param_applied = fn(param)
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
On WorkerInfo(id=1, name=trainer1):
RuntimeError('CUDA error: invalid device ordinal\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 204, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 48, in _run_trainer
model = HybridModel(remote_emb_module, rank)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 30, in init
self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in cuda
return self._apply(lambda t: t.cuda(device))
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 558, in _apply
param_applied = fn(param)
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

[W tensorpipe_agent.cpp:707] RPC agent for trainer1 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for trainer0 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for ps encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for ps encountered error when reading incoming request from trainer0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for ps encountered error when reading incoming request from trainer1: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Traceback (most recent call last):
File "main.py", line 180, in
mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 146, in run_worker
fut.wait()
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 218, in _handle_exception
raise result.exception_type(result.msg.encode("utf-8").decode("unicode_escape"))
RuntimeError: On WorkerInfo(id=0, name=trainer0):
RuntimeError('CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 204, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 48, in _run_trainer
model = HybridModel(remote_emb_module, rank)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 30, in init
self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in cuda
return self._apply(lambda t: t.cuda(device))
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 558, in _apply
param_applied = fn(param)
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
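
The traceback shows that the rank argument of _run_trainer ends up as the CUDA device inside HybridModel.__init__ (self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])). Below is a minimal standalone reproduction of that failure mode, assuming a machine with exactly two GPUs as in the setup above; it is illustrative only, not code from the example:

import torch

assert torch.cuda.is_available()
num_gpus = torch.cuda.device_count()            # 2 on the machine described above
print("valid device ordinals:", list(range(num_gpus)))

try:
    # What the unfixed code effectively did: use the caller's rank (2) as the
    # CUDA device when building the linear layer that gets wrapped in DDP.
    torch.nn.Linear(16, 8).cuda(2)
except RuntimeError as err:
    print(err)                                  # CUDA error: invalid device ordinal

# With the fix, a trainer rank (0 or 1) is used instead, which is a valid ordinal.
torch.nn.Linear(16, 8).cuda(0)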

bowangbj added a commit to pytorch/examples that referenced this pull request Jul 27, 2021
ghstack-source-id: 78c9b9e
Pull Request resolved: #924
bowangbj added a commit to pytorch/examples that referenced this pull request Jul 27, 2021
@bowangbj (Contributor, Author) commented:

Example code fix: pytorch/examples#924

@wayi1 (Contributor) left a comment:

Thanks for the fix!

@bowangbj merged commit 85bdc83 into gh/bowangbj/1/base on Jul 28, 2021
@bowangbj (Contributor, Author) commented:

> Thanks for the fix!

Thanks Yi for the super quick review; merging the fix.

brianjo added a commit to pytorch/examples that referenced this pull request Jul 28, 2021
bowangbj added a commit that referenced this pull request Jul 30, 2021
Summary:

Context: #1619 was merged into the wrong branch (non-master).
Example code fix: pytorch/examples#924
bowangbj added a commit that referenced this pull request Jul 30, 2021

ghstack-source-id: 2c37d77
Pull Request resolved: #1623
jainbilnach87 pushed a commit to jainbilnach87/pytoc that referenced this pull request Aug 11, 2021
ghstack-source-id: 3504362
Pull Request resolved: pytorch/tutorials#1619
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025