Fix the param: rank --> trainer_rank #1619
Summary: Without the fix, _run_trainer is invoked with rank hardcoded as 2, which causes [CUDA error: invalid device ordinal]. With this fix, _run_trainer is invoked with rank 0 and 1, which are valid GPU devices. Test Plan: (pytorch) examples/distributed/rpc/ddp_rpc % python main.py runs training successfully. ghstack-source-id: 4916d77 Pull Request resolved: #1619
Hey Pritam, Yi, Shen and Brian,
I got an exception when running examples/distributed/rpc/ddp_rpc/main.py (see the error below).
The exception I got:
(pytorch) [[email protected] /data/users/bowangbj/examples/distributed/rpc/ddp_rpc] python main.py
[W tensorpipe_agent.cpp:707] RPC agent for trainer1 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
-- Process 2 terminated with the following error:
Example code fix: pytorch/examples#924
Thanks for the fix!
Thanks Yi for the super quick fix. Merging the fix.
Example code fix for pytorch/tutorials#1619
Summary: Context: #1619 was merged into the wrong branch (non-master). Example code fix: pytorch/examples#924. ghstack-source-id: 2c37d77 Pull Request resolved: #1623
Summary: Without the fix, _run_trainer is invoked with rank hardcoded as 2, which causes [CUDA error: invalid device ordinal]. With this fix, _run_trainer is invoked with rank 0 and 1, which are valid GPU devices. Test Plan: (pytorch) examples/distributed/rpc/ddp_rpc % python main.py runs training successfully. ghstack-source-id: 3504362 Pull Request resolved: pytorch/tutorials#1619
Stack from ghstack:
Summary:
Without the fix, _run_trainer is invoked with rank hardcoded as 2, which causes [CUDA error: invalid device ordinal].
With this fix, _run_trainer is invoked with rank 0 and 1, which are valid GPU devices.
Example code fix: pytorch/examples#924
Test Plan:
(pytorch) examples/distributed/rpc/ddp_rpc % python main.py
Training runs successfully.
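To illustrate the bug described in the summary, here is a minimal, torch-free sketch of the rank mix-up. It is not the tutorial's actual code: NUM_GPUS, run_trainer, and launch_trainers are hypothetical names, and the assumption of a 2-GPU machine with a master process at rank 2 is taken only from the summary's statement that 0 and 1 are the valid GPU devices.

```python
# Hypothetical simulation of the rank bug; no CUDA or torch required.
NUM_GPUS = 2  # assumption: the machine exposes devices 0 and 1 only


def run_trainer(rank, num_gpus=NUM_GPUS):
    """Stand-in for _run_trainer: uses `rank` as the GPU device index."""
    if not 0 <= rank < num_gpus:
        # Mirrors the failure mode reported in the PR summary.
        raise RuntimeError("CUDA error: invalid device ordinal")
    return f"trainer on cuda:{rank}"


def launch_trainers(master_rank, use_fix):
    """Stand-in for the master loop that dispatches work to each trainer."""
    results = []
    for trainer_rank in [0, 1]:
        # Bug: passing the master's own rank (2) to every trainer.
        # Fix: passing the loop variable trainer_rank instead.
        rank_arg = trainer_rank if use_fix else master_rank
        results.append(run_trainer(rank_arg))
    return results


if __name__ == "__main__":
    print(launch_trainers(master_rank=2, use_fix=True))
    try:
        launch_trainers(master_rank=2, use_fix=False)
    except RuntimeError as err:
        print("without the fix:", err)
```

With use_fix=True each trainer receives its own rank (0 or 1); with use_fix=False both receive 2, which is outside the assumed device range and raises the error.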