Lesson 6:
Software and hardware for
deep learning
2
Outline
1. Hardware for deep learning (CPU vs GPU)
2. Deep learning frameworks
3. Accelerator and compression tools
3
Hardware for deep learning
(CPU vs GPU)
4
A not-so-common computer
5
CPU vs GPU
6
Example: Matrix multiplication
7
GigaFLOPs per 1$
8
CPU vs GPU in practice
9
CPU vs GPU in practice (2)
10
CPU vs GPU vs TPU
11
GigaFLOPs per 1$
12
NVIDIA DGX-2
13
NVIDIA edge computing
14
Google Coral
15
ARM edge computing
16
ARM NPU
17
Programming GPUs
• CUDA (NVIDIA only)
• Write C-like code that runs directly on the GPU
• Optimized APIs: cuBLAS, cuFFT, cuDNN, etc.
• OpenCL
• Similar to CUDA, but runs on anything
• Usually slower on NVIDIA hardware
• HIP https://github.com/ROCm-Developer-Tools/HIP
• New project that automatically converts CUDA code to something that can run on AMD GPUs
CPU / GPU Communication
20
There are many ...
21
Computational graphs
Computational graphs (2)
Computational graphs (3)
• Pros:
• Clean API, easy to write numeric code
• Cons:
• Have to compute our own gradients
• Can’t run on GPU
24
Computational graphs (4)
26
The point of deep learning frameworks
• Quick to develop and test new ideas
• Automatically compute gradients
• Run it all efficiently on GPU (wrapping cuDNN, cuBLAS, OpenCL, etc.)
PyTorch: Tensors
• Like a numpy array, but can run on GPU
• The data model and API are almost the same as numpy's
• Here is an example of training a two-layer neural network using PyTorch Tensors (see the sketch below)
28
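The example code from the slide is not reproduced in the text, so below is a minimal sketch of the same idea, assuming random data and arbitrary layer sizes and learning rate: a two-layer ReLU network trained purely with PyTorch tensor operations and hand-written gradients.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Random training data: N samples, D_in inputs, H hidden units, D_out outputs
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialized weights of the two-layer network
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
    # Forward pass, written entirely with tensor operations
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: gradients computed by hand
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Gradient descent step
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2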
PyTorch: Autograd
• Create Tensors with requires_grad=True to enable autograd
• Operations wrapped in torch.no_grad() are not recorded in the computational graph
• We need to set the gradients to zero before starting backpropagation, because PyTorch accumulates gradients over successive backward passes. This behaviour is convenient when training RNNs.
29
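A minimal sketch of the same two-layer network using autograd (sizes and learning rate are again arbitrary): the backward pass is now done by loss.backward(), the weight update is wrapped in torch.no_grad(), and the gradients are zeroed after each step.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# requires_grad=True asks autograd to track operations on these tensors
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: autograd records the graph for us
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: populates w1.grad and w2.grad
    loss.backward()

    # The update itself should not be recorded in the graph
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Zero the gradients, otherwise the next backward() accumulates into them
        w1.grad.zero_()
        w2.grad.zero_()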
PyTorch: nn
• Higher-level wrapper for working with neural networks
• Makes programming easier
30
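A minimal sketch of the nn package (layer sizes, loss, and learning rate are arbitrary choices): the model is a Sequential stack of layers, nn owns the weights, and the update loop simply iterates over model.parameters().

import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define the model as a sequence of layers; nn manages the weights
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    model.zero_grad()
    loss.backward()

    # Manual SGD step over all model parameters
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad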
PyTorch: nn (2)
• It is possible to define new modules in PyTorch
• Modules can contain weights or other modules
• PyTorch automatically handles autograd for new modules
31
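A minimal sketch of a custom module (the class name and sizes are illustrative): the module contains two Linear sub-modules, defines forward(), and autograd handles the backward pass automatically.

import torch

class TwoLayerNet(torch.nn.Module):
    # A module containing two Linear sub-modules; autograd is handled for us
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

model = TwoLayerNet(D_in=1000, H=100, D_out=10)
x = torch.randn(64, 1000)
y_pred = model(x)   # forward pass through the custom module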
PyTorch: optim
• Optimization algorithms such as Adam are available in PyTorch via torch.optim
32
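A minimal sketch using torch.optim (model, loss, and learning rate are arbitrary): the optimizer is built from model.parameters(), and zero_grad() / step() replace the manual update.

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(64, 1000)
y = torch.randn(64, 10)

for t in range(500):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()   # clear accumulated gradients
    loss.backward()         # compute new gradients
    optimizer.step()        # let Adam update the weights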
PyTorch: Pretrained models
• PyTorch has several pre-trained models available.
• These models can be used directly.
33
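The exact models shown on the slide are not reproduced, so here is a hedged sketch assuming the usual route via the separate torchvision package (newer torchvision versions use a weights= argument instead of pretrained=True).

import torch
import torchvision

# Load a ResNet-18 with ImageNet weights (downloaded on first use)
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# Run it on a dummy batch of one 224x224 RGB image
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    scores = model(x)          # shape (1, 1000): ImageNet class scores
print(scores.argmax(dim=1))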
PyTorch: Visdom
• Tool to help visualize the calculation process
• Currently does not support the visualization of computational graph structures
34
PyTorch: tensorboardx
• A Python wrapper around TensorFlow's web-based visualization tool (TensorBoard)
• pip install tensorboardx
• https://github.com/lanpa/tensorboardX
35
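A minimal usage sketch (the log directory and tag name are arbitrary): scalars logged through SummaryWriter can then be viewed by pointing tensorboard --logdir at the log directory.

from tensorboardX import SummaryWriter

writer = SummaryWriter(log_dir='runs/experiment1')
for step in range(100):
    # Log a scalar; view it with:  tensorboard --logdir runs
    writer.add_scalar('train/loss', 1.0 / (step + 1), step)
writer.close()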
PyTorch: Dynamic computational graph
• Create tensor
36
PyTorch: Dynamic computational graph (2)
• Build graph and perform computation
37
PyTorch: Dynamic computational graph (3)
• Build graph and perform computation
38
PyTorch: Dynamic computational graph (4)
• Find the path on the graph from the objective function to w1 and w2 for backprop, then do the calculation
39
PyTorch: Dynamic computational graph (5)
• On the next iteration, the graph (and its backward paths) from the previous step is discarded and everything is rebuilt from scratch
• Seems inefficient, especially when building the same graph multiple times...
40
PyTorch: Static computation graphs
Static graph
Step 1: Build a computational graph describing our computation (including finding paths for backprop)
Step 2: Reuse the same graph on every iteration
41
Tensorflow Pre2.0
• Step 1: Build a calculation graph
• Step 2: Run this calculation graph several times
42
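A minimal sketch of the pre-2.0 style (written here with the tf.compat.v1 API so it also runs under TF 2.x; shapes and values are arbitrary): the graph is defined once and then executed repeatedly inside a session.

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

# Step 1: build the graph (no computation happens yet)
x = tf.compat.v1.placeholder(tf.float32, shape=(None, 3))
w = tf.Variable(tf.ones((3, 1)))
y = tf.matmul(x, w)

# Step 2: run the same graph several times inside a session
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    for _ in range(3):
        out = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]})
        print(out)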
Tensorflow 2.0
• TensorFlow's eager execution mode is an imperative programming environment that executes operations immediately, without first building a computation graph
• Operations return concrete values instead of building a graph of the calculation to run later
• This makes it easier to get started with TensorFlow models and easier to debug
43
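A tiny illustration of eager execution (the values are arbitrary): the matmul runs immediately and returns a concrete tensor rather than a graph node.

import tensorflow as tf

# Eager execution is on by default in TF 2.x: operations run immediately
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)
print(b.numpy())   # a concrete value, not a graph node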
Tensorflow 2.0 vs Pre2.0
44
Tensorflow 2.0 vs Pre2.0
45
Tensorflow 2.0: Neural Network
• Turn numpy array into TF tensor
46
Tensorflow 2.0: Neural Network
• Use tf.GradientTape() to build dynamic computational
graphs
47
Tensorflow 2.0: Neural Network
• All operations in the forward step are tracked for later
gradient calculations.
48
Tensorflow 2.0: Neural Network
• tape.gradient() uses the previously tracked calculation
graph to calculate the gradient.
49
Tensorflow 2.0: Neural Network
• Neural network training: build the graph inside a loop and use the gradients to update the weights (see the sketch below)
50
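A minimal sketch tying the previous slides together (random data, arbitrary sizes and learning rate): numpy arrays are converted to TF tensors, the forward pass is recorded under tf.GradientTape(), tape.gradient() returns the gradients, and the weights are updated in a loop.

import numpy as np
import tensorflow as tf

N, D_in, H, D_out = 64, 1000, 100, 10
# Turn numpy arrays into TF tensors
x = tf.convert_to_tensor(np.random.randn(N, D_in).astype(np.float32))
y = tf.convert_to_tensor(np.random.randn(N, D_out).astype(np.float32))

w1 = tf.Variable(tf.random.normal((D_in, H)) * 0.01)
w2 = tf.Variable(tf.random.normal((H, D_out)) * 0.01)

learning_rate = 1e-6
for step in range(500):
    # Operations inside the tape are tracked for gradient computation
    with tf.GradientTape() as tape:
        h = tf.maximum(tf.matmul(x, w1), 0.0)
        y_pred = tf.matmul(h, w2)
        loss = tf.reduce_sum((y_pred - y) ** 2)

    # tape.gradient() walks the recorded graph backwards
    grad_w1, grad_w2 = tape.gradient(loss, [w1, w2])
    w1.assign_sub(learning_rate * grad_w1)
    w2.assign_sub(learning_rate * grad_w2)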
Tensorflow 2.0: Neural Network
• A built-in optimization algorithm (optimizer) can be used to apply the gradients and update the weights
51
Tensorflow 2.0: Neural Network
• Predefined objective (loss) functions can also be used
52
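A minimal sketch using a built-in optimizer and a predefined loss (sizes and learning rate are arbitrary): tf.keras.optimizers.Adam applies the gradients and tf.keras.losses.MeanSquaredError serves as the objective.

import tensorflow as tf

x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 10))

w1 = tf.Variable(tf.random.normal((1000, 100)) * 0.01)
w2 = tf.Variable(tf.random.normal((100, 10)) * 0.01)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()   # predefined objective

for step in range(500):
    with tf.GradientTape() as tape:
        h = tf.nn.relu(tf.matmul(x, w1))
        y_pred = tf.matmul(h, w2)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, [w1, w2])
    # The optimizer applies the gradients and updates the weights
    optimizer.apply_gradients(zip(grads, [w1, w2]))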
Keras: High-Level wrapper
• Keras is a layer on top of TensorFlow that makes common things easy to do (it used to be a third-party library and is now merged into TensorFlow as tf.keras)
53
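A minimal Keras sketch on random data (layer sizes, loss, and epoch count are arbitrary): the model is defined, compiled, and trained in a few calls.

import tensorflow as tf

x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 10))

# Keras builds, compiles, and trains the model with a few calls
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu', input_shape=(1000,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
model.fit(x, y, epochs=5, batch_size=64)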
Keras: High-Level wrapper
54
Tensorflow 2.0: @tf.function
• The tf.function decorator (implicitly) compiles Python functions to a static graph for better performance
• Here we compare the forward-pass time of the same model under dynamic graph mode and static graph mode
55
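A minimal sketch of such a comparison (model size and iteration count are arbitrary): the same forward pass is timed once eagerly and once wrapped in @tf.function, which traces it into a static graph on its first call.

import time
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(10),
])
x = tf.random.normal((64, 1000))
model(x)                       # build the model's weights eagerly

def eager_forward(x):
    return model(x)

@tf.function                   # traced into a static graph on first call
def graph_forward(x):
    return model(x)

graph_forward(x)               # warm-up call triggers the tracing

for name, fn in [('eager', eager_forward), ('tf.function', graph_forward)]:
    start = time.time()
    for _ in range(100):
        fn(x)
    print(name, time.time() - start)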
TensorFlow: Pretrained Models
• tf.keras: https://www.tensorflow.org/api_docs/python/tf/keras/applications
• TF-Slim: https://github.com/tensorflow/models/tree/master/research/slim
56
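A minimal sketch using tf.keras.applications (the model choice and dummy input are illustrative; real images would need the model's preprocessing): ResNet-50 is loaded with ImageNet weights and run on one dummy image.

import tensorflow as tf

# ResNet-50 with ImageNet weights from tf.keras.applications
model = tf.keras.applications.ResNet50(weights='imagenet')
x = tf.random.normal((1, 224, 224, 3))
scores = model(x)       # shape (1, 1000): ImageNet class scores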
TensorFlow: Tensorboard
• Add logging calls in the code to record the objective function, parameters, ...
• Run the tensorboard server and view the results
57
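A minimal logging sketch with the tf.summary API (log directory, tag, and the stand-in loss values are arbitrary):

import tensorflow as tf

writer = tf.summary.create_file_writer('logs/run1')
with writer.as_default():
    for step in range(100):
        loss = 1.0 / (step + 1)                # stand-in for a real training loss
        tf.summary.scalar('train/loss', loss, step=step)
# Then run:  tensorboard --logdir logs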
Static vs Dynamic
59
Static PyTorch
• Caffe2:
https://caffe2.ai/
• ONNX:
https://github.com/onnx/onnx
60
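As a hedged illustration of getting a static graph out of PyTorch, here is a minimal ONNX export sketch (the model and file name are arbitrary): the exported graph can then be run by other backends such as Caffe2 or ONNX Runtime.

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
# Trace the model and export the static graph in ONNX format
torch.onnx.export(model, dummy_input, 'resnet18.onnx')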
Accelerator and compression tools
61
Tensorflow Lite
• TensorFlow Lite is a set of tools for optimizing TensorFlow models, making them more compact and faster at inference on mobile platforms
62
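A minimal conversion sketch (the toy model is arbitrary): a Keras model is converted into a compact .tflite flatbuffer with the default optimizations enabled.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(1000,)),
])

# Convert the Keras model to a compact TensorFlow Lite flatbuffer
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable e.g. weight quantization
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)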
NVIDIA TensorRT
• NVIDIA TensorRT is a library for optimizing trained networks for high-performance inference on NVIDIA GPUs (see reference 3)
63
Other tools
• PocketFlow: https://github.com/Tencent/PocketFlow
• Tencent NCNN: https://github.com/Tencent/ncnn
64
References
1. The lecture is based on Stanford's cs231n: http://cs231n.stanford.edu
2. TensorFlow vs Keras vs PyTorch: https://databricks.com/session/a-tale-of-three-deep-learning-frameworks-tensorflow-keras-pytorch
3. NVIDIA TensorRT: Fast Neural Network Inference with TensorRT on Autonomous
4. ARM chip: Design And Reuse 2018 Keynote
65
Thank you for your attention!
66