S8906: Fast Data Pipelines For Deep Learning Training: Przemek Tredak, Simon Layton

This document discusses the CPU bottleneck in deep learning training and proposes a solution called DALI to optimize data pipelines. DALI aims to centralize efforts to build efficient data pipelines, integrate into all frameworks, and provide both flexibility and high performance. It describes DALI's architecture with GPU-optimized primitives and a pipeline that processes data from disk through decoding, augmentation and onto the GPU in a batched fashion. DALI seeks to set data free across frameworks and file formats for faster, more flexible deep learning training.

Uploaded by

Ryan Bloxham

S8906: FAST DATA PIPELINES FOR DEEP LEARNING TRAINING

Przemek Tredak, Simon Layton
THE PROBLEM

2
CPU BOTTLENECK OF DL TRAINING
CPU : GPU ratio

• Multi-GPU, dense systems are more common (DGX-1V, DGX-2)

• Using more cores / sockets is very expensive

• CPU to GPU ratio becomes lower:

    • DGX-1V: 40 cores / 8 GPUs = 5 cores per GPU

    • DGX-2: 48 cores / 16 GPUs = 3 cores per GPU
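
The ratios above are plain division of CPU cores by GPU count. A minimal sketch of that arithmetic, using only the core and GPU counts quoted on the slide (the dictionary and function names are illustrative):

```python
# Core and GPU counts as quoted on the slide: (CPU cores, GPUs).
systems = {"DGX-1V": (40, 8), "DGX-2": (48, 16)}


def cores_per_gpu(cpu_cores, gpus):
    """CPU cores available to feed each GPU."""
    return cpu_cores / gpus


ratios = {name: cores_per_gpu(cores, gpus)
          for name, (cores, gpus) in systems.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g} CPU cores per GPU")
```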

3
CPU BOTTLENECK OF DL TRAINING
Complexity of I/O pipeline

AlexNet:
256x256 image -> 224x224 crop and mirror -> Training

ResNet-50:
480p image -> Random resize -> Color augment -> 224x224 crop and mirror -> Training
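
Both pipelines end with the same "224x224 crop and mirror" step. A rough pure-Python sketch of that step is below. It assumes the image is a list of pixel rows and that the mirror is applied with probability 0.5; the function name and the toy image are illustrative, not taken from any framework.

```python
import random


def random_crop_and_mirror(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window, then flip it horizontally
    with probability 0.5. `image` is a list of rows of pixel values."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]   # mirror each row left-to-right
    return crop


rng = random.Random(0)
# Toy 256x256 "image" whose pixels record their own (y, x) position.
image = [[(y, x) for x in range(256)] for y in range(256)]
out = random_crop_and_mirror(image, 224, 224, rng)
print(len(out), len(out[0]))
```

The ResNet-50 pipeline runs a random resize and a color augmentation before this step, which is where the extra per-image CPU cost comes from.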
4
CPU BOTTLENECK OF DL TRAINING

Higher GPU to CPU ratio
Increased complexity of CPU-based I/O pipeline

[Figure: throughput over time, GPU vs CPU]

5
LOTS OF FRAMEWORKS
Lots of effort

[Diagram: I/O pipeline options across MXNet, Caffe2 and TensorFlow: ImageRecordIter, ImageIO, ImageInputOp, Dataset, manual graph construction, Python]

Frameworks have their own I/O pipelines (often more than 1!)
Lots of duplicated effort to optimize them all
Training process is not portable even if the model is (e.g. via ONNX)

6
LOTS OF FRAMEWORKS
Lots of effort

Optimized I/O pipelines are not flexible and often unsuitable for research
Inflexible, fast:

train = mx.io.ImageRecordIter(
    path_imgrec = args.data_train,
    path_imgidx = args.data_train_idx,
    label_width = 1,
    mean_r = rgb_mean[0],
    mean_g = rgb_mean[1],
    mean_b = rgb_mean[2],
    data_name = 'data',
    label_name = 'softmax_label',
    data_shape = image_shape,
    batch_size = 128,
    rand_crop = True,
    max_random_scale = 1,
    pad = 0,
    fill_value = 127,
    min_random_scale = 0.533,
    max_aspect_ratio = args.max_random_aspect_ratio,
    random_h = args.max_random_h,
    random_s = args.max_random_s,
    random_l = args.max_random_l,
    max_rotate_angle = args.max_random_rotate_angle,
    max_shear_ratio = args.max_random_shear_ratio,
    rand_mirror = args.random_mirror,
    preprocess_threads = args.data_nthreads,
    shuffle = True,
    num_parts = 0,
    part_index = 1)

vs

Flexible, slow:

    image, _ = mx.image.random_size_crop(image,
        (data_shape, data_shape), 0.08, (3/4., 4/3.))
    image = mx.nd.image.random_flip_left_right(image)
    image = mx.nd.image.to_tensor(image)
    image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406),
                                  std=(0.229, 0.224, 0.225))
    return mx.nd.cast(image, dtype), label

7


SOLUTION: ONE LIBRARY

MXNet Caffe2 PyTorch TF etc.

DALI

• Centralize the effort


• Integrate into all frameworks
• Provide both flexibility and performance

8
DALI: OVERVIEW

9
DALI

• Flexible, high-performance image data pipeline
• Python / C++ frontends with C++ / CUDA backend
• Minimal (or no) changes to the frameworks required
• Full pipeline - from disk to GPU, ready to train
• OSS (soon)

[Diagram: Framework -> Plugin -> DALI]

10
GRAPH WITHIN A GRAPH
I/O in Frameworks today
Data pipeline is just a (simple) graph

[Diagram: Loader -> Images (JPEG) -> Decode -> Resize -> Augment -> Training; Loader -> Labels -> Training. Legend: CPU, GPU]
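
One way to read "just a (simple) graph" is as a small dependency graph whose nodes are the pipeline stages. The sketch below encodes the stages named on the slide and derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). The dictionary layout is an illustrative choice, not how any framework stores its pipeline.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
pipeline_graph = {
    "Loader": set(),
    "Decode": {"Loader"},              # JPEG images come from the loader
    "Resize": {"Decode"},
    "Augment": {"Resize"},
    "Training": {"Augment", "Loader"}, # labels reach training straight from the loader
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(pipeline_graph).static_order())
print(" -> ".join(order))
```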

11
GPU OPTIMIZED PRIMITIVES
DALI
High performance, GPU optimized implementations

[Diagram: Loader -> Images (JPEG) -> Decode -> Resize -> Augment -> Training; Loader -> Labels -> Training. Legend: CPU, GPU]

12
GPU ACCELERATED JPEG DECODE
DALI with nvJPEG
Hybrid approach to JPEG decoding; can move fully to GPU in the future

[Diagram: Loader -> Images (JPEG) -> Decode -> Resize -> Augment -> Training; Loader -> Labels -> Training. Legend: CPU, GPU]

13
SET YOUR DATA FREE
DALI

• LMDB (Caffe, Caffe2)
• RecordIO (MXNet)
• TFRecord (TensorFlow)
• List of JPEGs (PyTorch, others)

Use any file format in any framework

14
BEHIND THE SCENES:
PIPELINE

15
PIPELINE
Overview

Framework

One pipeline per GPU


The same logic for multithreaded and multiprocess frameworks

16
PIPELINE
Overview

3 stages: CPU -> Mixed -> GPU
Single direction

[Diagram: CPU, Mixed and GPU stages feeding the framework]
17
PIPELINE
Overview

[Diagram: operators numbered 1-9 placed in the CPU, Mixed and GPU stages, with the output going to the framework]

Simple scheduling of operations

18
PIPELINE
CPU

[Diagram: CPU operators 1-5, each applied separately to every sample in the batch]

Operations processed per-sample in a thread pool
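
A minimal sketch of per-sample processing in a thread pool, using Python's standard `concurrent.futures`. The `decode` and `augment` functions are hypothetical stand-ins for real CPU operators, and the sketch does not reflect how DALI's C++ backend is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def decode(sample):
    """Hypothetical stand-in for a CPU operator such as image decoding."""
    return sample * 2


def augment(sample):
    """Hypothetical stand-in for a second CPU operator."""
    return sample + 1


def process_sample(sample):
    # Every CPU operator runs on one sample at a time.
    return augment(decode(sample))


def run_cpu_stage(batch, num_threads):
    # Samples of a batch are spread across the worker threads;
    # pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(process_sample, batch))


result = run_cpu_stage([0, 1, 2, 3], num_threads=2)
print(result)
```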

19
PIPELINE
GPU

[Diagram: GPU operators 8 and 9, each applied to the whole batch]

Batched processing of data

20
PIPELINE
Mixed

A bridge between CPU and GPU

Per-sample input, batched output
Also used for batching CPU data (for CPU outputs of the pipeline)
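
The "per-sample input, batched output" idea can be sketched as packing individual samples into one contiguous buffer plus a shape. The example below uses the standard-library `array` module and assumes all samples have the same length; the host-to-device copy a real Mixed operator would perform is left out, and `make_batch` is an illustrative name.

```python
from array import array


def make_batch(samples):
    """Pack equally sized per-sample lists into one contiguous float buffer.

    Returns the flat buffer and its logical (batch, sample_len) shape.
    """
    if not samples:
        raise ValueError("cannot build a batch from zero samples")
    sample_len = len(samples[0])
    if any(len(sample) != sample_len for sample in samples):
        raise ValueError("all samples must have the same length")
    buffer = array("f")            # contiguous 32-bit floats
    for sample in samples:
        buffer.extend(sample)
    return buffer, (len(samples), sample_len)


buffer, shape = make_batch([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(shape, list(buffer))
```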
21
EXECUTOR
Pipelining the pipeline

time ->

CPU 1 -> Mixed 1 -> GPU 1 -> CPU 2 -> Mixed 2 -> GPU 2 -> CPU 3 -> Mixed 3

For a given batch, the CPU, Mixed and GPU stages need to be executed serially

But each batch of data is independent…


22
EXECUTOR
Pipelining the pipeline

time ->

CPU:    CPU 1    CPU 2    CPU 3
Mixed:           Mixed 1  Mixed 2  Mixed 3
GPU:                      GPU 1    GPU 2    GPU 3

Each stage is asynchronous

Stages of a given batch are synchronized via events
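
The overlap between stages can be sketched with one thread per stage and bounded queues between them. This is only an analogy: the slide describes synchronization via events, whereas the sketch swaps in blocking `queue.Queue` hand-offs, and every name in it is hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the batch stream


def stage_worker(fn, inbox, outbox):
    """Run one stage: take a batch, process it, hand it to the next stage."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def run_pipelined(batches, stage_fns, queue_depth=2):
    # One queue in front of each stage plus one for the final output.
    queues = [queue.Queue(maxsize=queue_depth)
              for _ in range(len(stage_fns) + 1)]
    workers = [threading.Thread(target=stage_worker,
                                args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(stage_fns)]

    def feed():
        for batch in batches:
            queues[0].put(batch)
        queues[0].put(SENTINEL)

    feeder = threading.Thread(target=feed)
    for thread in workers + [feeder]:
        thread.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for thread in workers + [feeder]:
        thread.join()
    return results


# Each stand-in stage just tags the batch with its name.
stages = [lambda b: b + ["cpu"], lambda b: b + ["mixed"], lambda b: b + ["gpu"]]
results = run_pipelined([[1], [2], [3]], stages)
print(results)
```

While batch 1 is in the GPU stage, batch 2 can already be in the Mixed stage and batch 3 in the CPU stage. Within a single batch the stages still run in order.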


23
OPERATORS
Gallery

24
USING DALI

25
EXAMPLE: RESNET-50 PIPELINE
Pipeline class
import dali
import dali.ops as ops

class HybridRN50Pipe(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_devices):
        super(HybridRN50Pipe, self).__init__(batch_size,
                                             num_threads, device_id)
        # define used operators

    def define_graph(self):
        # define graph of operations
        pass

26
EXAMPLE: RESNET-50 PIPELINE
Defining operators

def __init__(self, batch_size, num_threads, device_id, num_devices):
    super(HybridRN50Pipe, self).__init__(batch_size, num_threads,
                                         device_id)
    # lmdb_path is assumed to be defined elsewhere (path to the LMDB dataset)
    self.loader = ops.Caffe2Reader(path=lmdb_path, shard_id=device_id,
                                   num_shards=num_devices)
    self.decode = ops.HybridDecode(output_type=dali.types.RGB)
    self.resize = ops.Resize(device="gpu", resize_a=256,
                             resize_b=480, random_resize=True,
                             image_type=dali.types.RGB)
    self.crop = ops.CropMirrorNormalize(device="gpu",
                                        random_crop=True, crop=(224, 224),
                                        mirror_prob=0.5, mean=[128., 128., 128.],
                                        std=[1., 1., 1.],
                                        output_layout=dali.types.NCHW)

27
EXAMPLE: RESNET-50 PIPELINE
Defining graph

def define_graph(self):
    jpeg, labels = self.loader(name="Reader")
    images = self.decode(jpeg)
    resized_images = self.resize(images)
    cropped_images = self.crop(resized_images)
    return [cropped_images, labels]

[Diagram: Loader -> jpeg -> Decode -> Resize -> Crop -> Data; Loader -> labels -> MakeContiguous -> Label]

28
EXAMPLE: RESNET-50 PIPELINE
Usage: MXNet

import mxnet as mx
from dali.plugin.mxnet import DALIIterator

# arguments: batch_size, num_threads, device_id, num_devices
pipe = HybridRN50Pipe(128, 2, 0, 1)
pipe.build()
train = DALIIterator(pipe, pipe.epoch_size("Reader"))

model.fit(train,
          # other parameters
          )

29
EXAMPLE: RESNET-50 PIPELINE
Usage: TensorFlow
import tensorflow as tf
from dali.plugin.tf import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
train = DALIIterator()

with tf.Session() as sess:
    images, labels = train(serialized_pipe)
    # rest of the model using images and labels
    sess.run(...)

30
EXAMPLE: RESNET-50 PIPELINE
Usage: Caffe 2

from caffe2.python import brew

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()

data, label = brew.dali_input(model, ["data", "label"],
                              serialized_pipe=serialized_pipe)

# Add the rest of your network as normal
conv1 = brew.conv(model, data, "conv1", ...)

31
PERFORMANCE

32
PERFORMANCE
I/O Pipeline
Throughput, DGX-2, RN50 pipeline, Batch 128, NCHW

[Bar chart, y-axis: images / second. Bar values: 23000, 14350, 8000, 5150, 5450]
33
PERFORMANCE
End-to-end training
End-to-end DGX-2, RN50 training - MXNet, Batch 192 / GPU

[Bar chart, y-axis: images / second]
Native: 8000
DALI: 15500
Synthetic: 17000
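
A quick way to read the chart is as two ratios: DALI against the native pipeline, and DALI against the synthetic-data ceiling. The sketch below assumes the three bar values map to Native = 8000, DALI = 15500 and Synthetic = 17000 images / second. That mapping is inferred from the chart layout rather than stated in the text.

```python
# Assumed mapping of the chart's bar values (images / second).
throughput = {"Native": 8000, "DALI": 15500, "Synthetic": 17000}

# How much faster end-to-end training is with DALI than with the native pipeline.
speedup_vs_native = throughput["DALI"] / throughput["Native"]

# How close DALI gets to training on synthetic data (no input pipeline cost).
fraction_of_synthetic = throughput["DALI"] / throughput["Synthetic"]

print(f"DALI vs native: {speedup_vs_native:.2f}x")
print(f"DALI vs synthetic: {fraction_of_synthetic:.0%}")
```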
34
NEXT STEPS

35
NEXT: MORE WORKLOADS
Segmentation

def define_graph(self):
    images, masks = self.loader(name="Reader")
    images = self.decode(images)
    masks = self.decode(masks)

    # Apply identical transformations
    resized_images, resized_masks = self.resize([images, masks], …)
    cropped_images, cropped_masks = self.crop([resized_images,
                                               resized_masks], …)
    return [cropped_images, cropped_masks]
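
"Apply identical transformations" means the random parameters are drawn once and reused for the image and its mask, so the two stay aligned. A pure-Python sketch of that idea follows, with images as lists of pixel rows; the function names are hypothetical and this is not the DALI operator API.

```python
import random


def crop_mirror(image, top, left, size, mirror):
    """Deterministic crop and optional horizontal flip with given parameters."""
    crop = [row[left:left + size] for row in image[top:top + size]]
    return [row[::-1] for row in crop] if mirror else crop


def paired_random_crop_mirror(image, mask, size, rng):
    height, width = len(image), len(image[0])
    # Draw the random parameters once...
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    mirror = rng.random() < 0.5
    # ...and apply them to both inputs.
    return (crop_mirror(image, top, left, size, mirror),
            crop_mirror(mask, top, left, size, mirror))


rng = random.Random(0)
# Toy data: image and mask both record pixel coordinates, so matching
# outputs show that the same window and flip were applied to each.
image = [[(y, x) for x in range(32)] for y in range(32)]
mask = [[(y, x) for x in range(32)] for y in range(32)]
out_image, out_mask = paired_random_crop_mirror(image, mask, 24, rng)
print(out_image == out_mask)
```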
36
NEXT: MORE FORMATS

What would be useful to you?

PNG Video frames

37
NEXT++: MORE OFFLOADING

Fully GPU-based decode

HW-based via NVDEC

Transcode to video

38
SOON: EARLY ACCESS

Looking for:

General feedback

New workloads

New transformations

Contact: Milind Kukanur
[email protected]

39
ACKNOWLEDGEMENTS

Trevor Gale
Andrei Ivanov
Serge Panev
Cliff Woolley
DL Frameworks team @ NVIDIA

40
