
Title: Distributed Deep Learning: Data Parallelism vs. Model Parallelism, Saving, and Serving

Table of Contents:

1. Introduction
2. Data Parallelism
3. Model Parallelism
4. Code Example of Model Parallelism in PyTorch
5. Saving and Serving a Model Trained with Model Parallelism
○ Saving the Model
○ Serving for Online Inference
○ Inference on Multiple vs. Single Devices
6. Conclusion

1. Introduction

In distributed deep learning, there are two primary strategies for scaling training across multiple
devices (e.g., GPUs): Data Parallelism and Model Parallelism. Understanding these
strategies is crucial for efficiently training large models or training on large datasets.

2. Data Parallelism

Definition: Each device (GPU) holds a full copy of the model. The dataset is split into batches
that are distributed across devices. Each GPU processes a separate batch, computes
gradients, and the gradients are then aggregated to update the model weights.

Pros:

● Straightforward to implement.
● Scales well with large datasets.

Cons:

● Requires that the full model fits on a single device.
● Communication overhead when synchronizing gradients.

Data parallelism is best when the model comfortably fits into a single GPU’s memory, and you
have a large amount of data.
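
For comparison with the model-parallel code in Section 4, here is a minimal data-parallel sketch using torch.nn.DataParallel. The toy layer sizes are placeholders, and for multi-machine or large-scale training, torch.nn.parallel.DistributedDataParallel is the usual choice:

import torch
import torch.nn as nn

# A small stand-in model (placeholder sizes); every GPU holds a full copy of it.
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # Replicates the model on each visible GPU, scatters each input batch
    # across them, and gathers the outputs back onto the default device.
    model = nn.DataParallel(model)
model = model.to(device)

data = torch.randn(64, 1024).to(device)
labels = torch.randint(0, 10, (64,)).to(device)

outputs = model(data)                          # each GPU processes a slice of the batch
loss = nn.CrossEntropyLoss()(outputs, labels)
loss.backward()                                # gradients from all replicas are accumulated
                                               # on the original parameters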

3. Model Parallelism

Definition: The model is split across multiple devices. Each device holds only a part of the
model. During the forward pass, intermediate outputs are passed between devices.

Pros:

● Enables training of very large models that cannot fit into a single GPU’s memory.

Cons:

● More complex to implement than data parallelism.
● Requires inter-device communication of intermediate activations, which can increase
overhead.

Model parallelism is ideal when model size is the bottleneck rather than dataset size.

4. Code Example of Model Parallelism in PyTorch

Note: This is a simplified example assuming two GPUs, GPU 0 and GPU 1. The model’s first
half runs on GPU 0 and the second half on GPU 1.

import torch
import torch.nn as nn
import torch.optim as optim

# Device setup
device0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device1 = torch.device("cuda:1" if (torch.cuda.is_available() and torch.cuda.device_count() > 1) else "cpu")

class ModelParallelNN(nn.Module):
    def __init__(self):
        super(ModelParallelNN, self).__init__()
        # Part of model on GPU 0
        self.fc1 = nn.Linear(1024, 512).to(device0)
        self.relu = nn.ReLU()

        # Part of model on GPU 1
        self.fc2 = nn.Linear(512, 256).to(device1)
        self.fc3 = nn.Linear(256, 10).to(device1)

    def forward(self, x):
        x = x.to(device0)
        x = self.fc1(x)
        x = self.relu(x)

        # Move activations to GPU 1
        x = x.to(device1)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

# Instantiate and train
model = ModelParallelNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Dummy data
data = torch.randn(64, 1024)  # 64 examples, 1024 features each
labels = torch.randint(0, 10, (64,)).to(device1)  # labels live on device1, where the final output is produced

for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(data)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

5. Saving and Serving a Model Trained with Model Parallelism

Saving the Model

Saving works the same way as for a standard PyTorch model. The state_dict includes all
parameters from all devices.

torch.save(model.state_dict(), 'model_parallel.pth')

Loading:

● To the original devices:

model = ModelParallelNN()
model.load_state_dict(torch.load('model_parallel.pth'))
# Ensure the parts of the model are on the correct devices if re-instantiated
model.fc1.to(device0)
model.fc2.to(device1)
model.fc3.to(device1)

● To a single device (e.g., CPU or GPU 0):

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = ModelParallelNN()
model.load_state_dict(torch.load('model_parallel.pth', map_location=device))
model.to(device)
# forward() still moves activations to device0/device1, so point both at the
# single target device to keep inference consistent:
device0 = device1 = device

Serving for Online Inference

Inference on a Single Device:

If the model fits on a single GPU (or on the CPU), it is simpler to run inference on one device.
This avoids the complexity of multi-device communication.

def infer(input_data):
    input_data = input_data.to(device)
    with torch.no_grad():
        output = model(input_data)
    return output

Inference on Multiple Devices:

If the model is too large to fit on one device, you can perform inference similarly to the training
forward pass, with parts of the model on different GPUs.

def infer_parallel(input_data):
    input_data = input_data.to(device0)
    with torch.no_grad():
        output = model(input_data)
    return output

In Practice:

● If possible, consolidate the model onto one device for inference to reduce complexity
and overhead.
● Use frameworks like TorchServe or NVIDIA Triton to handle multi-GPU deployment and
scaling.
● Convert models to ONNX and use efficient inference engines if needed (see the sketch below).
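
As a hedged illustration of the last point, here is a minimal sketch that consolidates the model onto one device and exports it to ONNX. It continues from the model-parallel example above (so ModelParallelNN, device0, and device1 are assumed to exist), assumes a CPU-only or single-GPU machine, and uses placeholder file and tensor names:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = ModelParallelNN()
model.load_state_dict(torch.load("model_parallel.pth", map_location=device))
model.to(device).eval()
device0 = device1 = device  # keep forward()'s explicit device moves consistent

# Trace the model with a dummy input of the expected feature size (1024)
# and export an ONNX graph with a dynamic batch dimension.
dummy_input = torch.randn(1, 1024, device=device)
torch.onnx.export(
    model,
    dummy_input,
    "model_parallel.onnx",   # placeholder output path
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},
)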

6. Conclusion

● Data Parallelism replicates the model across multiple devices so that each one processes
a different slice of the dataset; it is the straightforward choice when the model fits on a single device.
● Model Parallelism is used when the model is too large for a single device, splitting it
across multiple devices.
● When serving models for online inference, consider consolidating onto a single device if
feasible. If the model is too large, maintain model parallelism for inference.
● Saving and loading model-parallel-trained models involves saving the state_dict and
carefully loading it onto the appropriate devices.
