S8906: Fast Data Pipelines For Deep Learning Training: Przemek Tredak, Simon Layton
CPU BOTTLENECK OF DL TRAINING
CPU : GPU ratio
CPU BOTTLENECK OF DL TRAINING
Complexity of I/O pipeline
Increased complexity of CPU-based I/O pipeline (AlexNet vs. ResNet-50)
Higher GPU to CPU ratio
[Chart: throughput over time, GPU vs. CPU]
LOTS OF FRAMEWORKS
Lots of effort
[Diagram: framework I/O pipelines; labels: ImageRecordIter, ImageIO, ImageInputOp, Python, Dataset, manual graph construction]
Frameworks have their own I/O pipelines (often more than 1!)
Lots of duplicated effort to optimize them all
Training process is not portable even if the model is (e.g. via ONNX)
LOTS OF FRAMEWORKS
Lots of effort
Optimized I/O pipelines are not flexible and often unsuitable for research
train = mx.io.ImageRecordIter(
    path_imgrec        = args.data_train,
    path_imgidx        = args.data_train_idx,
    label_width        = 1,
    mean_r             = rgb_mean[0],
    mean_g             = rgb_mean[1],
    mean_b             = rgb_mean[2],
    data_name          = 'data',
    label_name         = 'softmax_label',
    data_shape         = image_shape,
    batch_size         = 128,
    rand_crop          = True,
    max_random_scale   = 1,
    pad                = 0,
    fill_value         = 127,
    min_random_scale   = 0.533,
    max_aspect_ratio   = args.max_random_aspect_ratio,
    random_h           = args.max_random_h,
    random_s           = args.max_random_s,
    random_l           = args.max_random_l,
    max_rotate_angle   = args.max_random_rotate_angle,
    max_shear_ratio    = args.max_random_shear_ratio,
    rand_mirror        = args.random_mirror,
    preprocess_threads = args.data_nthreads,
    shuffle            = True,
    num_parts          = 0,
    part_index         = 1)

vs

image, _ = mx.image.random_size_crop(image,
    (data_shape, data_shape), 0.08, (3/4., 4/3.))
image = mx.nd.image.random_flip_left_right(image)
image = mx.nd.image.to_tensor(image)
image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225))
return mx.nd.cast(image, dtype), label
DALI
DALI: OVERVIEW
DALI
• Python / C++ frontends with C++ / CUDA backend
• Minimal (or no) changes to the frameworks required
• Full pipeline - from disk to GPU, ready to train
• OSS (soon)
[Diagram: Framework, Plugin, DALI]
GRAPH WITHIN A GRAPH
I/O in Frameworks today
[Diagram: Loader, JPEG images, labels, Decode, Resize, Augment, Training; GPU region marked]
GPU OPTIMIZED PRIMITIVES
DALI
[Diagram: Loader, JPEG images, labels, Decode, Resize, Augment, Training; GPU region marked]
GPU ACCELERATED JPEG DECODE
DALI with nvJPEG
Hybrid approach to JPEG decoding – can move fully to GPU in the future
[Diagram: Loader, JPEG images, labels, Decode, Resize, Augment, Training; CPU and GPU regions marked]
SET YOUR DATA FREE
DALI
• List of JPEGs (PyTorch, others)
• LMDB (Caffe, Caffe2)
• RecordIO (MXNet)
• TFRecord (TensorFlow)
BEHIND THE SCENES: PIPELINE
PIPELINE
Overview
[Diagram: CPU, Mixed and GPU stages, with output passed to the framework]
• Single direction
• 3 stages: CPU -> Mixed -> GPU
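The single-direction rule can be illustrated with a small check. This is a hypothetical sketch, not DALI code: the `STAGE_ORDER` table, the `check_single_direction` helper, and the operator names and their stage assignments are all assumptions made for the example. It reports any edge in an operator graph that would move data from a later stage back to an earlier one.

```python
# Hypothetical illustration of the single-direction rule; not DALI's API.
STAGE_ORDER = {"cpu": 0, "mixed": 1, "gpu": 2}

def check_single_direction(op_stage, edges):
    """Return the edges that would move data backwards between stages.

    op_stage maps an operator name to "cpu", "mixed" or "gpu".
    edges is a list of (producer, consumer) operator names.
    """
    violations = []
    for producer, consumer in edges:
        if STAGE_ORDER[op_stage[producer]] > STAGE_ORDER[op_stage[consumer]]:
            violations.append((producer, consumer))
    return violations

# Assumed stage assignment, chosen only for the example.
op_stage = {"reader": "cpu", "decode": "mixed", "resize": "gpu", "crop": "gpu"}
valid_edges = [("reader", "decode"), ("decode", "resize"), ("resize", "crop")]
invalid_edges = valid_edges + [("crop", "reader")]  # GPU output fed back to CPU

print(check_single_direction(op_stage, valid_edges))    # []
print(check_single_direction(op_stage, invalid_edges))  # [('crop', 'reader')]
```

Edges within one stage are allowed here; only a move to an earlier stage counts as a violation.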
PIPELINE
Overview
[Diagram: example operator graph with nodes numbered 1-9; labels: CPU, Mixed, Framework]
PIPELINE
CPU
[Diagram: CPU operators 1-5 applied across the samples of a batch]
Operations processed per-sample in a thread pool
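Per-sample processing in a thread pool can be sketched in plain Python. This is only an illustration of the scheduling idea: the slides say DALI's backend is C++ / CUDA, and the `flip_sample` operator and `run_cpu_stage` helper below are hypothetical names, not part of DALI.

```python
from concurrent.futures import ThreadPoolExecutor

def flip_sample(sample):
    # Stand-in for a per-sample CPU operator (here: reverse a row of pixels).
    return sample[::-1]

def run_cpu_stage(batch, op, num_threads):
    # Each sample is an independent task; map() keeps the batch order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(op, batch))

batch = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
processed = run_cpu_stage(batch, flip_sample, num_threads=2)
print(processed)  # [[3, 2, 1], [6, 5, 4], [9, 8, 7], [12, 11, 10]]
```

Because samples are independent, the work scales with the number of threads, which is presumably what the `num_threads` constructor argument in the ResNet-50 example controls.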
PIPELINE
GPU
[Diagram: GPU operators 8 and 9]
Batched processing of data
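By contrast with the per-sample CPU stage, a batched operator receives the whole batch in a single call. The sketch below uses plain Python lists as a stand-in for GPU tensors; `normalize_batch` and its arguments are hypothetical and chosen only for illustration.

```python
def normalize_batch(batch, mean, std):
    # One call covers every sample in the batch, mirroring a batched operator.
    return [[(px - mean) / std for px in sample] for sample in batch]

batch = [[0.0, 2.0], [4.0, 6.0]]
out = normalize_batch(batch, mean=2.0, std=2.0)
print(out)  # [[-1.0, 0.0], [1.0, 2.0]]
```

On a GPU, the equivalent of this single call would typically be one launch over the whole batch rather than one launch per image, which keeps the per-call overhead constant as the batch grows.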
PIPELINE
Mixed
[Diagram: Mixed stage timelines; axis: time]
USING DALI
EXAMPLE: RESNET-50 PIPELINE
Pipeline class
import dali
import dali.ops as ops

class HybridRN50Pipe(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_devices):
        super(HybridRN50Pipe, self).__init__(batch_size,
                                             num_threads, device_id)
        # define used operators

    def define_graph(self):
        # define graph of operations
        pass
EXAMPLE: RESNET-50 PIPELINE
Defining operators
EXAMPLE: RESNET-50 PIPELINE
Defining graph
def define_graph(self):
    jpeg, labels = self.loader(name="Reader")
    images = self.decode(jpeg)
    resized_images = self.resize(images)
    cropped_images = self.crop(resized_images)
    return [cropped_images, labels]
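The pattern of creating operators in the constructor and wiring them together in `define_graph` can be tried out with a toy stand-in. Nothing below is DALI's API: `ToyOp`, `ToyPipeline`, the file names, and the string "images" are hypothetical, and the toy runs each operator eagerly, whereas a real pipeline would presumably record a graph to be executed later.

```python
# Toy stand-in for the define_graph() pattern; none of this is DALI's API.
class ToyOp:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *inputs, **kwargs):
        # A real graph builder would record a node here; the toy runs eagerly
        # and ignores keyword arguments such as name="Reader".
        return self.fn(*inputs)

class ToyPipeline:
    def __init__(self):
        # Operators are created once, in the constructor.
        self.loader = ToyOp(lambda: (["img0.jpg", "img1.jpg"], [0, 1]))
        self.decode = ToyOp(lambda files: [f"decoded({f})" for f in files])
        self.resize = ToyOp(lambda imgs: [f"resized({i})" for i in imgs])
        self.crop = ToyOp(lambda imgs: [f"cropped({i})" for i in imgs])

    def define_graph(self):
        # Same wiring as on the slide: each call feeds the next operator.
        jpeg, labels = self.loader(name="Reader")
        images = self.decode(jpeg)
        resized_images = self.resize(images)
        cropped_images = self.crop(resized_images)
        return [cropped_images, labels]

outputs = ToyPipeline().define_graph()
print(outputs[0][0])  # cropped(resized(decoded(img0.jpg)))
print(outputs[1])     # [0, 1]
```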
EXAMPLE: RESNET-50 PIPELINE
Usage: MXNet
import mxnet as mx
from dali.plugin.mxnet import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
pipe.build()
train = DALIIterator(pipe, pipe.epoch_size("Reader"))
model.fit(train,
          # other parameters
          )
EXAMPLE: RESNET-50 PIPELINE
Usage: TensorFlow
import tensorflow as tf
from dali.plugin.tf import DALIIterator
pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
train = DALIIterator()
EXAMPLE: RESNET-50 PIPELINE
Usage: Caffe 2
pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
PERFORMANCE
PERFORMANCE
I/O Pipeline
[Bar chart: Throughput, DGX-2, RN50 pipeline, Batch 128, NCHW; y-axis: images / second; bar values shown: 23000, 14350, 8000, 5150, 5450; bar labels not recovered]
PERFORMANCE
End-to-end training
[Bar chart: End-to-end DGX-2, RN50 training - MXNet, Batch 192 / GPU; y-axis: images / second; bars: Native, DALI, Synthetic; values shown: 8000, 15500, 17000]
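The end-to-end numbers can be turned into ratios with a few lines of arithmetic. The mapping of values to bars is an assumption here (Native = 8000, DALI = 15500, Synthetic = 17000 images / second); the extracted text lists the values and the labels but not which belongs to which. "Synthetic" is also assumed to mean training on synthetic data with no input pipeline, that is, an upper bound.

```python
# Assumed mapping of the chart's values to its bars (not stated in the text).
throughput = {"Native": 8000, "DALI": 15500, "Synthetic": 17000}  # images / second

# How much faster training runs with DALI than with the native I/O pipeline.
speedup_over_native = throughput["DALI"] / throughput["Native"]

# How close DALI gets to the assumed no-I/O upper bound.
fraction_of_synthetic = throughput["DALI"] / throughput["Synthetic"]

print(f"DALI vs. native I/O: {speedup_over_native:.2f}x")
print(f"DALI as a share of synthetic throughput: {fraction_of_synthetic:.0%}")
```

Under that assumed mapping, DALI roughly doubles end-to-end throughput and reaches about nine tenths of the synthetic-data rate.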
NEXT STEPS
NEXT: MORE WORKLOADS
Segmentation
def define_graph(self):
    images, masks = self.loader(name="Reader")
    images = self.decode(images)
    masks = self.decode(masks)
NEXT++: MORE OFFLOADING
Transcode to video
SOON: EARLY ACCESS
Looking for:
• General feedback
• New workloads
• New transformations
Contact: Milind Kukanur
[email protected]
ACKNOWLEDGEMENTS
Trevor Gale
Andrei Ivanov
Serge Panev
Cliff Woolley
DL Frameworks team @ NVIDIA