
S8906: Fast Data Pipelines For Deep Learning Training: Przemek Tredak, Simon Layton

This document discusses the CPU bottleneck in deep learning training and proposes a solution called DALI to optimize data pipelines. DALI aims to centralize efforts to build efficient data pipelines, integrate into all frameworks, and provide both flexibility and high performance. It describes DALI's architecture with GPU-optimized primitives and a pipeline that processes data from disk through decoding, augmentation and onto the GPU in a batched fashion. DALI seeks to set data free across frameworks and file formats for faster, more flexible deep learning training.


S8906: FAST DATA PIPELINES

FOR DEEP LEARNING TRAINING

Przemek Tredak, Simon Layton

THE PROBLEM

CPU BOTTLENECK OF DL TRAINING
CPU : GPU ratio

• Multi-GPU, dense systems are more common (DGX-1V, DGX-2)

• Using more cores / sockets is very expensive

• CPU to GPU ratio becomes lower:

• DGX-1V: 40 cores / 8 GPUs = 5 cores / GPU

• DGX-2: 48 cores / 16 GPUs = 3 cores / GPU
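
The ratios above are plain divisions of CPU cores by GPUs. A minimal sketch of that arithmetic follows; the dictionary layout and the helper name `cores_per_gpu` are illustrative assumptions, and only the core and GPU counts come from the slide.

```python
# Core and GPU counts as quoted on the slide.
systems = {
    "DGX-1V": {"cpu_cores": 40, "gpus": 8},
    "DGX-2": {"cpu_cores": 48, "gpus": 16},
}


def cores_per_gpu(cpu_cores, gpus):
    """CPU cores available to feed each GPU (hypothetical helper)."""
    return cpu_cores / gpus


ratios = {name: cores_per_gpu(**spec) for name, spec in systems.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:g} cores / GPU")
```

Fewer cores per GPU means less CPU time is available to prepare each GPU's input, which is the bottleneck the rest of the deck addresses.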

CPU BOTTLENECK OF DL TRAINING
Complexity of I/O pipeline

Alexnet: 256x256 image -> 224x224 crop and mirror -> Training

ResNet 50: 480p image -> Random resize -> Color augment -> 224x224 crop and mirror -> Training

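
To make the shared "crop and mirror" step concrete, here is a minimal pure-Python sketch. It is not taken from any framework: the function name, the nested-list image representation, and the 0.5 mirror probability are assumptions made for illustration.

```python
import random


def random_crop_and_mirror(image, crop_h, crop_w, rng):
    """Take a random crop_h x crop_w window and flip it left-right half the time.

    `image` is a list of rows; each row is a list of pixel values.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < 0.5:                   # assumed mirror probability
        crop = [row[::-1] for row in crop]
    return crop


rng = random.Random(0)
image = [[(y, x) for x in range(256)] for y in range(256)]  # 256x256 stand-in image
crop = random_crop_and_mirror(image, 224, 224, rng)
print(len(crop), len(crop[0]))  # 224 224
```

The ResNet 50 pipeline adds a random resize and a color augmentation before this step, so the CPU does more work per image.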
CPU BOTTLENECK OF DL TRAINING

Higher GPU to CPU ratio
Increased complexity of CPU-based I/O pipeline

[Figure: Throughput vs. Time, with GPU and CPU curves]

LOTS OF FRAMEWORKS
Lots of effort

[Diagram: data loading options in MXNet, Caffe2 and TensorFlow: ImageRecordIter, ImageIO, ImageInputOp, Dataset, manual graph construction, Python]

Frameworks have their own I/O pipelines (often more than 1!)
Lots of duplicated effort to optimize them all
Training process is not portable even if the model is (e.g. via ONNX)

LOTS OF FRAMEWORKS
Lots of effort

Optimized I/O pipelines are not flexible and often unsuitable for research

Inflexible, fast:

    train = mx.io.ImageRecordIter(
        path_imgrec        = args.data_train,
        path_imgidx        = args.data_train_idx,
        label_width        = 1,
        mean_r             = rgb_mean[0],
        mean_g             = rgb_mean[1],
        mean_b             = rgb_mean[2],
        data_name          = 'data',
        label_name         = 'softmax_label',
        data_shape         = image_shape,
        batch_size         = 128,
        rand_crop          = True,
        max_random_scale   = 1,
        pad                = 0,
        fill_value         = 127,
        min_random_scale   = 0.533,
        max_aspect_ratio   = args.max_random_aspect_ratio,
        random_h           = args.max_random_h,
        random_s           = args.max_random_s,
        random_l           = args.max_random_l,
        max_rotate_angle   = args.max_random_rotate_angle,
        max_shear_ratio    = args.max_random_shear_ratio,
        rand_mirror        = args.random_mirror,
        preprocess_threads = args.data_nthreads,
        shuffle            = True,
        num_parts          = 0,
        part_index         = 1)

vs

Flexible, slow:

    image, _ = mx.image.random_size_crop(image,
        (data_shape, data_shape), 0.08, (3/4., 4/3.))
    image = mx.nd.image.random_flip_left_right(image)
    image = mx.nd.image.to_tensor(image)
    image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406),
                                  std=(0.229, 0.224, 0.225))
    return mx.nd.cast(image, dtype), label


SOLUTION: ONE LIBRARY

MXNet Caffe2 PyTorch TF etc.

DALI

• Centralize the effort

• Integrate into all frameworks
• Provide both flexibility and performance


DALI: OVERVIEW

DALI

• Flexible, high-performance image data pipeline

• Python / C++ frontends with C++ / CUDA backend

• Minimal (or no) changes to the frameworks required

• Full pipeline - from disk to GPU, ready to train

• OSS (soon)

[Diagram: Framework, Plugin, DALI]

GRAPH WITHIN A GRAPH
I/O in Frameworks today

Data pipeline is just a (simple) graph

[Diagram: Loader -> Images (JPEG) -> Decode -> Resize -> Augment -> Training; Loader -> Labels -> Training; legend: CPU, GPU]

GPU OPTIMIZED PRIMITIVES
DALI

High performance, GPU optimized implementations

[Diagram: Loader -> Images (JPEG) -> Decode -> Resize -> Augment -> Training; Loader -> Labels -> Training; legend: CPU, GPU]

GPU ACCELERATED JPEG DECODE
DALI with nvJPEG

Hybrid approach to JPEG decoding - can move fully to GPU in the future

[Diagram: Loader -> Images (JPEG) -> Decode -> Resize -> Augment -> Training; Loader -> Labels -> Training; legend: CPU, GPU]

SET YOUR DATA FREE
DALI

LMDB (Caffe, Caffe2)
RecordIO (MXNet)
TFRecord (TensorFlow)
List of JPEGs (PyTorch, others)

Use any file format in any framework


BEHIND THE SCENES:
PIPELINE

PIPELINE
Overview

Framework

One pipeline per GPU

The same logic for multithreaded and multiprocess frameworks
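
With one pipeline per GPU, each pipeline needs its own slice of the dataset. The reader in the ResNet-50 example later in this deck takes `shard_id` and `num_shards` arguments; the sketch below shows one way such a split could work. The strided scheme and the helper name `shard_indices` are assumptions, not a description of how DALI partitions data.

```python
def shard_indices(num_samples, shard_id, num_shards):
    """Indices of the samples read by one shard (strided split, an assumption)."""
    return list(range(shard_id, num_samples, num_shards))


num_gpus = 4
shards = [shard_indices(10, gpu, num_gpus) for gpu in range(num_gpus)]
for gpu, indices in enumerate(shards):
    print(f"pipeline on GPU {gpu} reads samples {indices}")
```

Every sample is assigned to exactly one pipeline, so the pipelines can run independently whether they live in threads of one process or in separate processes.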


PIPELINE
Overview

[Diagram: Framework; CPU, Mixed and GPU stages]

Single direction
3 stages: CPU -> Mixed -> GPU

PIPELINE
Overview

[Diagram: Framework; operations numbered 1 to 9 across the CPU, Mixed and GPU stages]

Simple scheduling of operations

PIPELINE
CPU

[Diagram: CPU operations 1 to 5, each repeated for every sample in the batch]

Operations processed per-sample in a thread pool

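
Per-sample processing in a thread pool can be sketched with the Python standard library. This is an illustration of the idea only: `decode`, `scale` and `run_cpu_stage` are hypothetical names, and the real CPU stage is implemented in C++.

```python
from concurrent.futures import ThreadPoolExecutor


def decode(sample):
    # Hypothetical stand-in for a CPU operator such as JPEG decoding.
    return list(sample)


def scale(sample):
    # Hypothetical second CPU operator applied to the same sample.
    return [value * 2 for value in sample]


def run_cpu_stage(batch, operators, num_threads):
    """Apply the operator chain to each sample; samples are processed in parallel."""
    def process(sample):
        for op in operators:
            sample = op(sample)
        return sample

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(process, batch))  # map preserves the batch order


batch = [bytes([i, i + 1]) for i in range(4)]
outputs = run_cpu_stage(batch, [decode, scale], num_threads=2)
print(outputs)
```

Because each sample is handled independently, samples of different sizes (for example JPEGs of different resolutions) do not need to be padded or batched at this stage.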
PIPELINE
GPU

[Diagram: GPU operations 8 and 9]

Batched processing of data

PIPELINE
Mixed

A bridge between CPU and GPU

Per-sample input, batched output
Also used for batching CPU data (for CPU outputs of the pipeline)

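
"Per-sample input, batched output" can be pictured as packing separate samples into one contiguous buffer. The sketch below is an assumption-laden illustration: the names `make_contiguous` and `sample_view` are hypothetical, and a real Mixed operator would fill a device buffer rather than a Python list.

```python
def make_contiguous(samples):
    """Pack per-sample lists into one flat buffer plus per-sample offsets."""
    buffer, offsets = [], [0]
    for sample in samples:
        buffer.extend(sample)
        offsets.append(len(buffer))  # end of this sample, start of the next
    return buffer, offsets


def sample_view(buffer, offsets, index):
    """Recover one sample from the packed batch."""
    return buffer[offsets[index]:offsets[index + 1]]


samples = [[1, 2, 3], [4, 5], [6]]   # per-sample outputs of the CPU stage
buffer, offsets = make_contiguous(samples)
print(buffer, offsets)
print(sample_view(buffer, offsets, 1))
```

A single packed buffer is what makes one batched transfer and batched GPU kernels possible, instead of one small operation per sample.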
EXECUTOR
Pipelining the pipeline

time: CPU 1 -> Mixed 1 -> GPU 1 -> CPU 2 -> Mixed 2 -> GPU 2 -> CPU 3 -> Mixed 3

CPU, Mixed and GPU stages need to be executed serially

But each batch of data is independent…

EXECUTOR
Pipelining the pipeline

[Timeline: the three stages overlap in time]
CPU:   CPU 1, CPU 2, CPU 3, …
Mixed: Mixed 1, Mixed 2, Mixed 3
GPU:   GPU 1, GPU 2, GPU 3

Each stage is asynchronous

Stages of given batch synchronized via events

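
The overlap between stages can be sketched with one thread per stage. Note the substitution: the slide says stages are synchronized via events, while this sketch uses standard-library queues to hand batches from one stage to the next. All names (`stage`, `run_pipelined`, `STOP`) and the dictionary-based batches are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(name, work, inbox, outbox, log):
    """Pull batches from inbox, process them, and pass results downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        log.append((name, item["id"]))
        outbox.put(work(item))


def run_pipelined(num_batches):
    log = []
    queues = [queue.Queue() for _ in range(4)]
    works = [
        ("CPU", lambda b: {**b, "decoded": True}),
        ("Mixed", lambda b: {**b, "batched": True}),
        ("GPU", lambda b: {**b, "augmented": True}),
    ]
    threads = [
        threading.Thread(target=stage,
                         args=(name, work, queues[i], queues[i + 1], log))
        for i, (name, work) in enumerate(works)
    ]
    for thread in threads:
        thread.start()
    for batch_id in range(num_batches):
        queues[0].put({"id": batch_id})
    queues[0].put(STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results, log


results, log = run_pipelined(3)
print([r["id"] for r in results])  # [0, 1, 2]
```

Within one batch the order CPU, Mixed, GPU is always respected, but the CPU stage is free to start batch 2 while the Mixed stage is still working on batch 1.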
OPERATORS
Gallery

USING DALI

EXAMPLE: RESNET-50 PIPELINE
Pipeline class

    import dali
    import dali.ops as ops

    class HybridRN50Pipe(dali.Pipeline):
        def __init__(self, batch_size, num_threads, device_id, num_devices):
            super(HybridRN50Pipe, self).__init__(batch_size,
                                                 num_threads, device_id)
            # define used operators

        def define_graph(self):
            # define graph of operations
            pass

EXAMPLE: RESNET-50 PIPELINE
Defining operators


    def __init__(self, batch_size, num_threads, device_id, num_devices):
        super(HybridRN50Pipe, self).__init__(batch_size, num_threads,
                                             device_id)
        # Reader: each pipeline reads its own shard of the LMDB dataset
        # (lmdb_path is defined elsewhere)
        self.loader = ops.Caffe2Reader(path=lmdb_path, shard_id=device_id,
                                       num_shards=num_devices)
        # JPEG decoder producing RGB images
        self.decode = ops.HybridDecode(output_type=dali.types.RGB)
        # Random resize on the GPU
        self.resize = ops.Resize(device="gpu", resize_a=256,
                                 resize_b=480, random_resize=True,
                                 image_type=dali.types.RGB)
        # Random 224x224 crop, random mirror and normalization on the GPU
        self.crop = ops.CropMirrorNormalize(device="gpu",
                                            random_crop=True, crop=(224, 224),
                                            mirror_prob=0.5, mean=[128., 128., 128.],
                                            std=[1., 1., 1.], output_layout=dali.types.NCHW)

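
The normalization and layout change requested from `CropMirrorNormalize` above (`mean`, `std`, `output_layout=NCHW`) can be illustrated in plain Python. This is a sketch of the arithmetic under the assumption that the operator computes `(pixel - mean) / std` per channel; the function name `normalize_hwc_to_chw` and the nested-list layout are hypothetical.

```python
def normalize_hwc_to_chw(image, mean, std):
    """Normalize an H x W x C image per channel and return it as C x H x W."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[(image[y][x][c] - mean[c]) / std[c] for x in range(width)]
         for y in range(height)]
        for c in range(channels)
    ]


image = [[[128, 130, 126], [0, 255, 128]]]  # 1 x 2 image with RGB pixels
out = normalize_hwc_to_chw(image, mean=[128.0, 128.0, 128.0], std=[1.0, 1.0, 1.0])
print(out)
```

Stacking such C x H x W images along a leading batch dimension gives the NCHW layout.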
EXAMPLE: RESNET-50 PIPELINE
Defining graph


    def define_graph(self):
        jpeg, labels = self.loader(name="Reader")
        images = self.decode(jpeg)
        resized_images = self.resize(images)
        cropped_images = self.crop(resized_images)
        return [cropped_images, labels]

[Diagram: Loader -> jpeg -> Decode -> Resize -> Crop -> Data; Loader -> labels -> MakeContiguous -> Label]

EXAMPLE: RESNET-50 PIPELINE
Usage: MXNet


    import mxnet as mx
    from dali.plugin.mxnet import DALIIterator

    # Arguments: batch_size, num_threads, device_id, num_devices
    pipe = HybridRN50Pipe(128, 2, 0, 1)
    pipe.build()
    train = DALIIterator(pipe, pipe.epoch_size("Reader"))

    model.fit(train,
              # other parameters
              )

EXAMPLE: RESNET-50 PIPELINE
Usage: TensorFlow

    import tensorflow as tf
    from dali.plugin.tf import DALIIterator

    # Arguments: batch_size, num_threads, device_id, num_devices
    pipe = HybridRN50Pipe(128, 2, 0, 1)
    serialized_pipe = pipe.serialize()
    train = DALIIterator()

    with tf.Session() as sess:
        images, labels = train(serialized_pipe)
        # rest of the model using images and labels
        sess.run(...)

EXAMPLE: RESNET-50 PIPELINE
Usage: Caffe 2


    from caffe2.python import brew

    # Arguments: batch_size, num_threads, device_id, num_devices
    pipe = HybridRN50Pipe(128, 2, 0, 1)
    serialized_pipe = pipe.serialize()

    data, label = brew.dali_input(model, ["data", "label"],
                                  serialized_pipe=serialized_pipe)

    # Add the rest of your network as normal
    conv1 = brew.conv(model, data, "conv1", …)

PERFORMANCE

PERFORMANCE
I/O Pipeline
Throughput, DGX-2, RN50 pipeline, Batch 128, NCHW

[Bar chart: Images / Second, axis up to 25000; bar values 23000, 14350, 8000, 5150, 5450; bar labels not recovered]

PERFORMANCE
End-to-end training
End-to-end DGX-2, RN50 training - MXNet, Batch 192 / GPU

[Bar chart: images / second; Native: 8000, DALI: 15500, Synthetic: 17000]

NEXT STEPS

NEXT: MORE WORKLOADS
Segmentation


    def define_graph(self):
        images, masks = self.loader(name="Reader")
        images = self.decode(images)
        masks = self.decode(masks)

        # Apply identical transformations
        resized_images, resized_masks = self.resize([images, masks], …)
        cropped_images, cropped_masks = self.crop([resized_images,
                                                   resized_masks], …)
        return [cropped_images, cropped_masks]

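
"Apply identical transformations" means the random parameters are drawn once and reused for the image and its mask. A minimal sketch of that idea in plain Python follows; the helper names and the nested-list data are hypothetical and do not reflect the planned DALI API.

```python
import random


def sample_crop_params(height, width, crop_h, crop_w, rng):
    """Draw one set of random parameters to share between image and mask."""
    return {
        "top": rng.randint(0, height - crop_h),
        "left": rng.randint(0, width - crop_w),
        "mirror": rng.random() < 0.5,
    }


def apply_crop(data, params, crop_h, crop_w):
    """Crop (and optionally mirror) using already-sampled parameters."""
    rows = data[params["top"]:params["top"] + crop_h]
    rows = [row[params["left"]:params["left"] + crop_w] for row in rows]
    return [row[::-1] for row in rows] if params["mirror"] else rows


rng = random.Random(0)
image = [[(y, x) for x in range(8)] for y in range(8)]    # pixel = its coordinates
mask = [[y * 8 + x for x in range(8)] for y in range(8)]  # label aligned with image
params = sample_crop_params(8, 8, 4, 4, rng)
cropped_image = apply_crop(image, params, 4, 4)
cropped_mask = apply_crop(mask, params, 4, 4)
print(params)
```

If the parameters were sampled separately for each input, the mask would no longer line up with the image it annotates.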
NEXT: MORE FORMATS

What would be useful to you?

PNG Video frames

NEXT++: MORE OFFLOADING

Fully GPU-based decode

HW-based via NVDEC

Transcode to video

SOON: EARLY ACCESS

Looking for:

General feedback

New workloads

New transformations

Contact: Milind Kukanur, [email protected]

ACKNOWLEDGEMENTS

Trevor Gale
Andrei Ivanov
Serge Panev
Cliff Woolley
DL Frameworks team @ NVIDIA
