.. automodule:: torch.distributed.elastic.rendezvous

Below is a state diagram describing how rendezvous works.

.. autoclass:: RendezvousParameters
   :members:
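
As a rough illustration of how ``RendezvousParameters`` is typically filled in, the sketch below builds one instance and reads values back. The endpoint, run id, and the extra ``timeout`` and ``is_host`` keys are placeholder values chosen for this example, not recommended settings.

```python
from torch.distributed.elastic.rendezvous import RendezvousParameters

# All values below are illustrative placeholders.
params = RendezvousParameters(
    backend="c10d",
    endpoint="localhost:29400",
    run_id="example-job",
    min_nodes=1,
    max_nodes=4,
    # Extra keyword arguments are stored as backend-specific configuration.
    timeout="900",
    is_host="true",
)

# Typed accessors parse string configuration values.
timeout_seconds = params.get_as_int("timeout")
is_host = params.get_as_bool("is_host")

# A default is returned when the key is absent.
missing = params.get("no_such_key", "fallback")
```

Building the object does not contact the endpoint; it only validates and stores the values.
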
.. autoclass:: RendezvousHandlerRegistry
   :members:
.. automodule:: torch.distributed.elastic.rendezvous.registry
.. currentmodule:: torch.distributed.elastic.rendezvous
.. autoclass:: RendezvousHandler
   :members:
.. autoclass:: RendezvousError
.. autoclass:: RendezvousClosedError
.. autoclass:: RendezvousTimeoutError
.. autoclass:: RendezvousConnectionError
.. autoclass:: RendezvousStateError
.. currentmodule:: torch.distributed.elastic.rendezvous.dynamic_rendezvous
.. autofunction:: create_handler
.. autoclass:: DynamicRendezvousHandler()
   :members: from_backend

.. autoclass:: RendezvousBackend
   :members:

.. autoclass:: RendezvousTimeout
   :members:
.. currentmodule:: torch.distributed.elastic.rendezvous.c10d_rendezvous_backend
.. autofunction:: create_backend
.. autoclass:: C10dRendezvousBackend
   :members:
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous_backend
.. autofunction:: create_backend
.. autoclass:: EtcdRendezvousBackend
   :members:

.. warning::
    The ``DynamicRendezvousHandler`` class supersedes the ``EtcdRendezvousHandler``
    class, and is recommended for most users. ``EtcdRendezvousHandler`` is in
    maintenance mode and will be deprecated in the future.
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous
.. autoclass:: EtcdRendezvousHandler

The ``EtcdStore`` is the C10d ``Store`` instance type returned by
``next_rendezvous()`` when etcd is used as the rendezvous backend.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_store

.. autoclass:: EtcdStore
   :members:

The ``EtcdServer`` is a convenience class that makes it easy to start and
stop an etcd server in a subprocess. This is useful for testing or
single-node (multi-worker) deployments where manually setting up an
etcd server on the side is cumbersome.

.. warning::
    For production and multi-node deployments, consider deploying a highly
    available etcd server, because the etcd server is a single point of
    failure for your distributed jobs.
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_server
.. autoclass:: EtcdServer
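
The start-and-stop lifecycle that ``EtcdServer`` provides can be pictured with a generic standard-library sketch. This is not how ``EtcdServer`` is implemented: ``managed_subprocess`` is a hypothetical helper, and a sleeping Python child process stands in for the etcd binary so the example runs without etcd installed.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_subprocess(cmd):
    """Start `cmd` as a child process and stop it when the block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        # Ask the child to exit, then force-kill it if it does not comply.
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


# Hypothetical stand-in for a server binary: a child process that only sleeps.
stand_in_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]

with managed_subprocess(stand_in_cmd) as proc:
    # poll() returns None while the child is still running.
    running_inside = proc.poll() is None

# After the block, the child has been stopped and has an exit code.
exited_after = proc.poll() is not None
```

Tying the child process to a ``with`` block means it is stopped even if the test body raises, which is the main convenience of a managed helper over a manually started server.
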