Fix the param: rank --> trainer_rank #1619


Merged: 1 commit, Jul 28, 2021

Conversation

@bowangbj (Contributor) commented on Jul 27, 2021:

Stack from ghstack:

Summary:

Without the fix, _run_trainer is invoked with rank hardcoded to 2, which triggers a CUDA error: invalid device ordinal.
With this fix, _run_trainer is invoked with rank 0 and rank 1, which are valid GPU device ordinals.

Example code fix: pytorch/examples#924
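
For reference, a minimal sketch of the kind of change being described. The dispatch loop below is an assumption about what the rank-2 process in main.py does (the wrapper name dispatch_to_trainers is made up); _run_trainer, remote_emb_module, and trainer_rank are the names that appear in this PR and the traceback, and rank is the caller's own rank, which is 2 here:

import torch.distributed.rpc as rpc

def dispatch_to_trainers(remote_emb_module, rank):
    # Sketch only: launch _run_trainer on each trainer via RPC.
    futs = []
    for trainer_rank in [0, 1]:
        trainer_name = "trainer{}".format(trainer_rank)
        # Before the fix, the caller's rank (always 2) was forwarded, so each
        # trainer tried to use CUDA device 2, which does not exist on a 2-GPU box:
        #   rpc.rpc_async(trainer_name, _run_trainer, args=(remote_emb_module, rank))
        # After the fix, the trainer's own rank (0 or 1) is forwarded instead:
        fut = rpc.rpc_async(trainer_name, _run_trainer,
                            args=(remote_emb_module, trainer_rank))
        futs.append(fut)
    for fut in futs:
        fut.wait()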

Test Plan:

(pytorch) examples/distributed/rpc/ddp_rpc % python main.py -- training runs successfully.

bowangbj added a commit that referenced this pull request Jul 27, 2021
ghstack-source-id: 4916d77
Pull Request resolved: #1619
@bowangbj requested review from wayi1, pritamdamania87, mrshenli, and bdhirsh, and removed the review request for wayi1 on July 27, 2021 at 21:12
@bowangbj (Contributor, Author) commented:

Hey Pritam, Yi, Shen, and Brian,

I got an exception when I ran examples/distributed/rpc/ddp_rpc/main.py (see the error below). I guess it's caused by the param 'rank' being passed incorrectly; it seems it should be trainer_rank, since the ps rank (ps_rank) is always 2 (a minimal repro sketch follows the traceback below). The CL for the example code will follow. Please let me know whether this makes sense. Thanks.

The exception I got:

(pytorch) [[email protected] /data/users/bowangbj/examples/distributed/rpc/ddp_rpc] python main.py
On WorkerInfo(id=0, name=trainer0):
RuntimeError('CUDA error: invalid device ordinal\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 204, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 48, in _run_trainer
model = HybridModel(remote_emb_module, rank)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 30, in init
self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in cuda
return self._apply(lambda t: t.cuda(device))
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 558, in _apply
param_applied = fn(param)
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
On WorkerInfo(id=1, name=trainer1):
RuntimeError('CUDA error: invalid device ordinal\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 204, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 48, in _run_trainer
model = HybridModel(remote_emb_module, rank)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 30, in init
self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in cuda
return self._apply(lambda t: t.cuda(device))
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 558, in _apply
param_applied = fn(param)
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

[W tensorpipe_agent.cpp:707] RPC agent for trainer1 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for trainer0 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for ps encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for ps encountered error when reading incoming request from trainer0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:707] RPC agent for ps encountered error when reading incoming request from trainer1: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Traceback (most recent call last):
File "main.py", line 180, in
mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 146, in run_worker
fut.wait()
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 218, in _handle_exception
raise result.exception_type(result.msg.encode("utf-8").decode("unicode_escape"))
RuntimeError: On WorkerInfo(id=0, name=trainer0):
RuntimeError('CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/data/users/bowangbj/pytorch/torch/distributed/rpc/internal.py", line 204, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 48, in _run_trainer
model = HybridModel(remote_emb_module, rank)
File "/data/users/bowangbj/examples/distributed/rpc/ddp_rpc/main.py", line 30, in init
self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in cuda
return self._apply(lambda t: t.cuda(device))
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 558, in _apply
param_applied = fn(param)
File "/data/users/bowangbj/pytorch/torch/nn/modules/module.py", line 645, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
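
The traceback shows that the rank argument of _run_trainer ends up as the CUDA device inside HybridModel.__init__ (self.fc = DDP(torch.nn.Linear(16, 8).cuda(device), device_ids=[device])). Below is a minimal standalone reproduction of that failure mode, assuming a machine with exactly two GPUs as in the setup above; it is illustrative only, not code from the example:

import torch

assert torch.cuda.is_available()
num_gpus = torch.cuda.device_count()            # 2 on the machine described above
print("valid device ordinals:", list(range(num_gpus)))

try:
    # What the unfixed code effectively did: use the caller's rank (2) as the
    # CUDA device when building the linear layer that gets wrapped in DDP.
    torch.nn.Linear(16, 8).cuda(2)
except RuntimeError as err:
    print(err)                                  # CUDA error: invalid device ordinal

# With the fix, a trainer rank (0 or 1) is used instead, which is a valid ordinal.
torch.nn.Linear(16, 8).cuda(0)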

bowangbj added a commit to pytorch/examples that referenced this pull request Jul 27, 2021
ghstack-source-id: 78c9b9e
Pull Request resolved: #924
bowangbj added a commit to pytorch/examples that referenced this pull request Jul 27, 2021
@bowangbj (Contributor, Author) commented:

Example code fix: pytorch/examples#924

@wayi1 (Contributor) left a comment:

Thanks for the fix!

@bowangbj merged commit 85bdc83 into gh/bowangbj/1/base on Jul 28, 2021
@bowangbj (Contributor, Author) commented:

> Thanks for the fix!

Thanks Yi for the super quick review; merging the fix.

brianjo added a commit to pytorch/examples that referenced this pull request Jul 28, 2021
bowangbj added a commit that referenced this pull request Jul 30, 2021
Summary:

Context: #1619 was merged into the wrong branch (non-master).
Example code fix: pytorch/examples#924
bowangbj added a commit that referenced this pull request Jul 30, 2021

ghstack-source-id: 2c37d77
Pull Request resolved: #1623
jainbilnach87 pushed a commit to jainbilnach87/pytoc that referenced this pull request Aug 11, 2021
ghstack-source-id: 3504362
Pull Request resolved: pytorch/tutorials#1619
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025