DEEP LEARNING OPTIMIZATION WORKFLOW
Deepshikha Kumari, Data Scientist II, Deep Learning
AGENDA
1. AI Use Cases for Industry
2. End-to-End Deep Learning Workflow
Training Pipeline
  a. NGC
  b. Transfer Learning
  c. Automatic Mixed Precision
  d. Code walkthrough
Inference Pipeline
  a. TensorRT (Float 16)
  b. TensorRT (INT8)
  c. Custom plugin support
  d. DeepStream
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY
DEEP LEARNING IN PRODUCTION
Speech Recognition
Recommender Systems
Autonomous Driving
Robotics
Real-time Language Translation
Many More…
NGC
WHY CONTAINERS?
Benefits of containers:
● Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
● Isolate individual deep learning frameworks and applications
● Share, collaborate, and test applications across different environments
Virtual Machine vs. Container
Not so similar
NVIDIA container runtime
https://github.com/NVIDIA/nvidia-docker
Container
A container is a runtime instance of a Docker image.
A Docker container consists of:
● A Docker image
● Execution environment
● A standard set of instructions
https://docs.docker.com/engine/reference/glossary/
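As a rough illustration (not from the slides) of starting such a container from Python, the sketch below uses the docker-py SDK to launch an NGC framework image with GPU access. The image tag, the command, and the assumption of Docker 19.03+ with the NVIDIA runtime and the docker Python package installed are mine.

# Sketch: run an NGC container with GPUs via docker-py (image tag and command are placeholders).
import docker

client = docker.from_env()
logs = client.containers.run(
    "nvcr.io/nvidia/tensorflow:19.07-py3",   # assumed NGC framework image
    "nvidia-smi",                            # quick check that the GPUs are visible inside the container
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    remove=True,
)
print(logs.decode())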
PRUNING
Prune (tlt-prune) → Retrain (tlt-train)
Selecting Unnecessary Neurons
pruned_model = TLT.prune(model, t)   # pseudocode: t is the pruning threshold
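The line above is pseudocode for the TLT pruning step. As a rough illustration of the underlying idea, selecting and dropping low-magnitude neurons, here is a minimal NumPy sketch; it is not the actual tlt-prune implementation, and the threshold convention is assumed.

# Illustrative magnitude-based neuron selection (not the actual tlt-prune code).
import numpy as np

def prune_layer(weights: np.ndarray, t: float) -> np.ndarray:
    """weights: (num_neurons, num_inputs). Drop neurons whose weight norm is small."""
    norms = np.linalg.norm(weights, axis=1)     # one magnitude per neuron
    keep = norms >= t * norms.max()             # threshold relative to the strongest neuron
    return weights[keep]

layer = np.random.randn(256, 128).astype(np.float32)
pruned = prune_layer(layer, t=0.1)
print(f"kept {pruned.shape[0]} of {layer.shape[0]} neurons")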
SCENE ADAPTATION
Same network adapting to new data (Data → Adapt): train with new data from another vantage point or camera location, or with an added attribute (e.g., person with a blue shirt).
TLT
TENSORFLOW
The Automatic Mixed Precision feature is available both in native TensorFlow and inside the TensorFlow container on the NVIDIA NGC container registry. To enable it, set the environment variable TF_ENABLE_AUTO_MIXED_PRECISION=1. As an alternative, the environment variable can be set inside the TensorFlow Python script:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
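Another in-script alternative is the mixed-precision graph-rewrite wrapper around the optimizer. The sketch below assumes TensorFlow 1.14+ in graph mode; the tiny graph is a toy stand-in, not from the slides.

# Sketch: enabling AMP through the graph-rewrite API (TensorFlow 1.14+).
import tensorflow as tf

x = tf.random.normal([32, 128])
w = tf.Variable(tf.random.normal([128, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
# Wraps the optimizer with dynamic loss scaling and casts eligible ops to FP16.
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)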
PYTORCH
The Automatic Mixed Precision feature is available in the Apex repository on GitHub. To enable it, add these two lines of code to your existing training script:
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss: scaled_loss.backward()
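To show where those two lines sit in a script, here is a minimal sketch with a toy model and random data; it assumes Apex is installed and a CUDA device is available, and the model and data are stand-ins, not from the slides.

# Sketch: the two Apex AMP lines inside an ordinary PyTorch training loop.
import torch
from apex import amp

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")   # AMP line 1

for step in range(100):
    x = torch.randn(64, 128, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:              # AMP line 2
        scaled_loss.backward()   # backward on the scaled loss keeps FP16 gradients in range
    optimizer.step()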
MXNET
Automatic Mixed Precision feature is available both in native MXNet (1.5 or later) and inside the MXNet
container (19.04 or later) on NVIDIA NGC container registry. To enable the feature, add the following
lines of code to your existing training script:
from mxnet.contrib import amp

amp.init()
amp.init_trainer(trainer)
with amp.scale_loss(loss, trainer) as scaled_loss:
    autograd.backward(scaled_loss)
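A minimal Gluon sketch showing where these calls go; the network, data, and hyperparameters are toy stand-ins, and it assumes MXNet 1.5+ with a GPU.

# Sketch: the AMP calls inside a tiny Gluon training step.
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

amp.init()                                   # call before building the network

ctx = mx.gpu(0)
net = gluon.nn.Dense(10)
net.initialize(ctx=ctx)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.01})
amp.init_trainer(trainer)                    # attach dynamic loss scaling to the trainer

x = mx.nd.random.normal(shape=(64, 128), ctx=ctx)
y = mx.nd.random.randint(0, 10, shape=(64,), ctx=ctx).astype("float32")
with autograd.record():
    loss = loss_fn(net(x), y)
with amp.scale_loss(loss, trainer) as scaled_loss:
    autograd.backward(scaled_loss)
trainer.step(batch_size=64)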
AUTOMATIC MIXED PRECISION IN TENSORFLOW
Up to 3X speedup
TensorFlow Medium Post: Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs
● More AMP-enabled model scripts coming soon: Mask R-CNN, GNMT, NCF, etc.
[Chart: FP32 vs. mixed precision (AMP enabled) training throughput, roughly 2x speedup shown]
Source: https://github.com/NVIDIA/apex/tree/master/examples/imagenet
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup ~1.5X to 2X in comparison with FP32
https://github.com/apache/incubator-mxnet/pull/14173
NVIDIA TENSORRT
Programmable Inference Accelerator
TensorRT: Optimizer + Runtime
Runs across the NVIDIA platform: Tesla P4, Jetson TX2, DRIVE PX 2, NVIDIA DLA, Tesla V100
developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE
40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)
[Chart: ResNet50 inference throughput, images/sec, CPU-only vs. V100 + TensorFlow vs. V100 + TensorRT, up to 5,700 images/sec]
Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: Preview of Volta optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)
[Chart: OpenNMT inference throughput, sentences/sec, CPU-only + Torch vs. V100 + Torch vs. V100 + TensorRT, up to 550 sentences/sec]
Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On.
developer.nvidia.com/tensorrt
TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)
Step 2: Runtime inference with the optimized plans via the C++ or Python API
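A minimal sketch of the optimize-and-serialize step with the TensorRT Python API, assuming the trained TensorFlow graph has already been converted to UFF (for example with the convert-to-uff utility); file names, tensor names, and shapes below are placeholders, not values from the slides.

# Sketch: build and serialize a TensorRT engine ("plan") from a UFF model.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:
    parser.register_input("input", (3, 224, 224))     # CHW input shape (placeholder)
    parser.register_output("predictions")              # output tensor name (placeholder)
    parser.parse("frozen_model.uff", network)

    builder.max_batch_size = 8
    builder.max_workspace_size = 1 << 30                # 1 GiB of scratch memory for tactics
    builder.fp16_mode = True                             # allow FP16 kernels where supported

    engine = builder.build_cuda_engine(network)
    with open("model_fp16.plan", "wb") as f:
        f.write(engine.serialize())                      # the serialized "plan"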
developer.nvidia.com/tensorrt
TENSORRT OPTIMIZATIONS
● Layer & Tensor Fusion
● Kernel Auto-Tuning
● Dynamic Tensor Memory
LAYER & TENSOR FUSION
KERNEL AUTO-TUNING
DYNAMIC TENSOR MEMORY
Steps:
• Start with a frozen TensorFlow model
• Create a model parser
• Optimize model and create a runtime engine
• Perform inference using the optimized runtime engine
[Diagram: Trained Neural Network → TensorRT Optimizer → Optimized Runtime Engine; New Data → Runtime Engine → Inference Results]
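A matching inference-side sketch with the Python runtime API; it assumes a previously serialized plan file and the pycuda package, and the file name and shapes are placeholders.

# Sketch: deserialize an optimized plan and run inference on new data (batch size 1).
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("model_fp16.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One pagelocked host buffer and one device buffer per binding (input, output).
host_bufs, dev_bufs = [], []
for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding))   # per-sample size (implicit batch)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_bufs.append(cuda.pagelocked_empty(size, dtype))
    dev_bufs.append(cuda.mem_alloc(host_bufs[-1].nbytes))

frame = np.random.rand(3, 224, 224).astype(np.float32)     # stand-in for a preprocessed image
np.copyto(host_bufs[0], frame.ravel())

stream = cuda.Stream()
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async(batch_size=1, bindings=[int(d) for d in dev_bufs], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_bufs[1], dev_bufs[1], stream)
stream.synchronize()
print("top-1 class:", int(np.argmax(host_bufs[1])))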
developer.nvidia.com/tensorrt
7 STEPS TO DEPLOYMENT WITH TENSORRT
Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
developer.nvidia.com/tensorrt
TensorRT Inference with TensorFlow
TensorFlow
An end-to-end open source machine learning platform
41m Downloads
NVIDIA TensorRT
Platform for High-Performance Deep Learning Inference
● Deploy responsive and memory efficient apps with INT8 & FP16
● Additional Resources
Benefits to using TF-TRT
● Optimize TF inference while still using the TF ecosystem
● Simple API: up to 8x performance gain with little effort
● Falls back to native TensorFlow for operations that TensorRT does not support
Over 10 optimized models with published examples
Model          TF FP32 (imgs/s)   TF-TRT INT8 (imgs/s)   Speedup
ResNet-50      399                3053                   7.7x
Inception V4   158                1128                   7.1x

● Performance optimizations soon: more NLP and Object Detection models

TensorFlow FP32 vs TensorFlow-TensorRT INT8 on T4, largest possible batch size, no I/O.
NGC TensorFlow 19.07 with scripts: https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
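As a sketch of the "simple API" point above, converting a SavedModel with the TF 1.x TrtGraphConverter looks roughly like this; the directories, batch size, and precision mode are placeholders, and INT8 additionally requires a calibration step.

# Sketch: TF-TRT conversion of a SavedModel (TensorFlow 1.x API, paths are placeholders).
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="resnet50_saved_model",
    precision_mode="FP16",            # "INT8" needs calibration data as well
    max_batch_size=32,
)
converter.convert()                   # unsupported ops fall back to native TensorFlow
converter.save("resnet50_trt_fp16")   # deploy the converted SavedModel as usual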
FP16 accuracy
Models TF FP32 TF-TRT FP16
Top-1 metric (%) for classification models. mAP for SSD detection models.
Complete data: https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#verified-models
INT8 accuracy
Models TF FP32 TF-TRT INT8
developer.nvidia.com/tensorrt
INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference
ASR | NLP | Recommender
● Some systems are overused while others are underutilized
● Solutions can only support models from one framework
● Developers need to reinvent the plumbing for every application
NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server
● Maximize real-time inference performance of GPUs
● Quickly deploy and manage multiple models per GPU per node
● Easily scale to heterogeneous GPUs and multi-GPU nodes
● Multi-GPU support
● Concurrent model execution
● HTTP REST API/gRPC
[Diagram: NVIDIA TensorRT Inference Server running on Tesla T4 and Tesla V100 servers]
Additional resources
- GTC Technical presentation: https://developer.nvidia.com/gtc/2019/video/S9431/video
- TF-TRT user guide: https://docs.nvidia.com/deeplearning/dgx/tf-trt-user-guide/index.html
- NVIDIA DLI course on TF-TRT: https://www.nvidia.com/en-us/deep-learning-ai/education/
- Monthly release notes: https://docs.nvidia.com/deeplearning/dgx/tf-trt-release-notes/index.html
- Google Blog on TF-TRT inference: https://cloud.google.com/blog/products/ai-machine-learning/running-tensorflow-inference-workloads-at-scale-with-tensorrt-5-and-nvidia-t4-gpus
- NVIDIA Developer Blog: https://devblogs.nvidia.com/tensorrt-integration-speeds-tensorflow-inference/
AGENDA
1. Intelligent Video Analytics
2. DeepStream SDK
  a. What is the DeepStream SDK?
  b. Why the DeepStream SDK?
  c. What's new with DS 4.0?
  d. DeepStream building blocks
3. Getting started with the DeepStream SDK
  a. Where to start?
  b. Directory hierarchy
  c. Configuration file and pipeline details
  d. Running the application
4. Building with the DeepStream SDK
  a. Real-world use cases with demos
  b. Resources
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY
WHAT IS DEEPSTREAM?
DEEPSTREAM SDK
[Diagram: DeepStream SDK stack: Docker containers, hardware-accelerated plugins, reference applications and orchestration recipes, analytics and IoT runtime, built on CUDA-X and running on JETSON | TESLA]
WHY DEEPSTREAM?
The most comprehensive end-to-end development platform for IVA.
● DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins, …
● DeepStream in containers, multi-GPU orchestration, tracking & analytics across large scale/multi-camera, streaming and batch analytics, event fabric
● End-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes & adaptation guides, plugin templates, custom IP integration
DeepStream SDK on Linux, CUDA
Perception infra: Jetson, Tesla server (edge and cloud) | Analytics infra: edge server, NGC, AWS, Azure
REAL-TIME INSIGHTS, HIGHEST STREAM DENSITY
Smart Parking
PERCEPTION GRAPH
Comm plugin (RTSP) → Preprocessing plugins (Decoder, Dewarp library, using camera calibration and ROI calibration) → Detection, classification & tracking plugins (Detection and classification → Tracker → Detection and classification → Global positioning) → Communications plugins (Transmit metadata → Analytics server)
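As a rough sketch (not from the slides) of wiring such a graph from Python with DeepStream's GStreamer plugins: the element names come from the DeepStream SDK, the file and config paths are placeholders, a local H.264 file stands in for the RTSP feed, and the tracker, secondary classifiers, and message broker stages of the full graph are omitted.

# Minimal DeepStream-style pipeline via GStreamer parse_launch (paths are placeholders).
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)
pipeline = Gst.parse_launch(
    "filesrc location=sample_720p.h264 ! h264parse ! nvv4l2decoder ! m.sink_0 "
    "nvstreammux name=m batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=config_infer_primary.txt ! "   # detection/classification
    "nvvideoconvert ! nvdsosd ! nveglglessink"                # draw boxes, render to screen
)
pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)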
Perception → Analytics → Visualization
[Diagram: DeepStream runtime in a container on an IoT Edge device, sending telemetry data for analytics and visualization]
Detect and flag packages on a conveyor belt. A DeepStream container can connect to Azure IoT Central through the Azure IoT Edge runtime.
THANK YOU!
QUESTIONS?