Getting Started With Distributed Data Parallel - PyTorch Tutorials 2.4.0+cu124 Documentation
Author: Shen Li
Note
Prerequisites:
• DistributedDataParallel notes
DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should
spawn multiple processes and create a single DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize
gradients and buffers. More specifically, DDP registers an autograd hook for each parameter given by model.parameters() and the hook will fire when the
corresponding gradient is computed in the backward pass. Then DDP uses that signal to trigger gradient synchronization across processes. Please refer to
DDP design note for more details.
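As an aside, the firing point that DDP relies on can be sketched with ordinary autograd hooks. The snippet below is only an illustration, not DDP's implementation (DDP registers its hooks internally and buckets gradients before launching allreduce); the toy model and the printout are made up for the example.

import torch
import torch.nn as nn

# Illustration only: register a hook per parameter that fires once that
# parameter's gradient has been accumulated during the backward pass.
model = nn.Linear(10, 10)
for name, param in model.named_parameters():
    param.register_post_accumulate_grad_hook(
        lambda p, n=name: print(f"gradient ready for {n}: {tuple(p.grad.shape)}")
    )

model(torch.randn(4, 10)).sum().backward()
# Each hook fires as its parameter's gradient is computed; DDP uses this
# signal to kick off gradient synchronization while backward is still running.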
The recommended way to use DDP is to spawn one process for each model replica, where a model replica can span multiple devices. DDP processes can be
placed on the same machine or across machines, but GPU devices cannot be shared across processes. This tutorial starts from a basic DDP use case and then
demonstrates more advanced use cases including checkpointing models and combining DDP with model parallel.
Note
The code in this tutorial runs on an 8-GPU server, but it can be easily generalized to other environments.
• First, DataParallel is single-process, multi-threaded, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi-machine training. DataParallel is usually slower than DistributedDataParallel even on a single machine due to GIL contention across threads, the per-iteration replicated model, and the additional overhead introduced by scattering inputs and gathering outputs.
• Recall from the prior tutorial that if your model is too large to fit on a single GPU, you must use model parallel to split it across multiple GPUs. DistributedDataParallel works with model parallel, while DataParallel does not at this time. When DDP is combined with model parallel, each DDP process uses model parallel, and all processes collectively use data parallel.
• If your model needs to span multiple machines or if your use case does not fit into the data parallelism paradigm, please see the RPC API for more generic distributed training support.
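To get started, we need to start multiple processes and initialize a process group in each of them. Below is a minimal sketch of the imports and a setup() helper used by the demos that follow (the gloo backend and the localhost MASTER_ADDR/MASTER_PORT values are convenient single-machine defaults; adjust them for your environment). The cleanup() helper after it tears the process group down.

import os
import tempfile
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)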
def cleanup():
    dist.destroy_process_group()
Now, let's create a toy module, wrap it with DDP, and feed it some dummy input data. Please note that, because DDP broadcasts model states from the rank 0 process to all other processes in the DDP constructor, you do not need to worry about different DDP processes starting from different initial model parameter values.
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create the model and move it to the GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()
As you can see, DDP wraps the lower-level distributed communication details and provides a clean API, as if it were a local model. Gradient synchronization communications take place during the backward pass and overlap with the backward computation. When backward() returns, param.grad already contains the synchronized gradient tensor. For basic use cases, DDP only requires a few more lines of code to set up the process group. When applying DDP to more advanced use cases, some caveats require caution.
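As a sanity check, you can verify inside demo_basic that every rank holds the same averaged gradient after backward(): averaging the gradients again across processes should change nothing. The helper below is a sketch (check_grads_synced is hypothetical and not part of the tutorial).

def check_grads_synced(ddp_model, world_size):
    # After backward(), each rank should already hold the gradient averaged
    # across all ranks, so re-averaging is a no-op.
    for param in ddp_model.parameters():
        if param.grad is None:
            continue
        averaged = param.grad.clone()
        dist.all_reduce(averaged, op=dist.ReduceOp.SUM)
        averaged /= world_size
        assert torch.allclose(param.grad, averaged, atol=1e-6)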
One common caveat is checkpointing: when using DDP, an optimization is to save the model in only one process and load it on all processes, reducing write overhead; the loading processes must wait for the save to finish (hence the barrier() below), and map_location must be configured so that a process does not step into another process's devices.

def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    CHECKPOINT_PATH = tempfile.gettempdir() + "/model.checkpoint"
    if rank == 0:
        # All processes start from the same random parameters and gradients are
        # synchronized in backward passes, so saving in one process is sufficient.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

    # Use a barrier() to make sure that process 1 loads the model after process
    # 0 saves it.
    dist.barrier()
    # configure map_location properly
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # The AllReduce ops in DDP's backward pass already act as a synchronization
    # point, so rank 0 can safely delete the checkpoint here.
    if rank == 0:
        os.remove(CHECKPOINT_PATH)

    cleanup()
When passing a multi-GPU model to DDP, device_ids and output_device must NOT be set. Input and output data will be placed on the proper devices by either the application or the model's forward() method.
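For example, a toy model-parallel module that places its two linear layers on two different devices could look like the sketch below, following the single-machine model parallel tutorial; dev0 and dev1 are the two GPUs assigned to one process.

class ToyMpModel(nn.Module):
    def __init__(self, dev0, dev1):
        super(ToyMpModel, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        self.net1 = torch.nn.Linear(10, 10).to(dev0)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = x.to(self.dev0)
        x = self.relu(self.net1(x))
        x = x.to(self.dev1)
        return self.net2(x)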
def demo_model_parallel(rank, world_size):
    print(f"Running DDP with model parallel example on rank {rank}.")
    setup(rank, world_size)

    # set up mp_model and the two devices for this process
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    mp_model = ToyMpModel(dev0, dev1)
    ddp_mp_model = DDP(mp_model)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    # outputs will be on dev1
    outputs = ddp_mp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(dev1)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()
def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    world_size = n_gpus
    run_demo(demo_basic, world_size)
    run_demo(demo_checkpoint, world_size)
    # each model-parallel process uses two GPUs
    world_size = n_gpus // 2
    run_demo(demo_model_parallel, world_size)
We can leverage PyTorch Elastic to simplify the DDP code and initialize the job more easily. Let's still use the ToyModel example and save the following as elastic_ddp.py:

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    # torchrun provides the rank and world size through environment variables
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create the model and move it to the GPU assigned to this process
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    demo_basic()
One can then run a torch elastic/torchrun command on all nodes to initialize the DDP job created above:
torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py
We are running the DDP script on two hosts with 8 processes on each host, i.e., we are running it on 16 GPUs. Note that $MASTER_ADDR must be the same across all nodes.
Here, torchrun will launch 8 processes and invoke elastic_ddp.py in each process on the node it is launched on, but the user also needs to use a cluster management tool like SLURM to actually run this command on 2 nodes.
For example, on a SLURM-enabled cluster, we can write a script that runs the command above and sets MASTER_ADDR as:
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
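Putting the two together, torchrun_script.sh could look roughly like this (a sketch; the rendezvous id, port, and script name simply reuse the values from the command above):

#!/bin/bash
# torchrun_script.sh -- launched once per node by srun
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d \
         --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py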
Then we can just run this script using the SLURM command: srun --nodes=2 ./torchrun_script.sh. Of course, this is just an example; you can choose
your own cluster scheduling tools to initiate the torchrun job.
For more information about Elastic run, please refer to the quick start document.