
ACCELERATING THE END-TO-END DEEP LEARNING WORKFLOW
Deepshikha Kumari, Data Scientist II, Deep Learning
AGENDA
1. AI use cases for industry
2. End-to-end deep learning workflow

Training pipeline
  a. NGC
  b. Transfer Learning
  c. Automatic Mixed Precision
  d. Code walkthrough

Inference pipeline
  a. TensorRT (FP16)
  b. TensorRT (INT8)
  c. Custom plugin support
  d. DeepStream
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY

Access Control • Public Transit • Industrial Inspection • Traffic Engineering
Retail Analytics • Logistics • Critical Infrastructure • Public Safety
DEEP LEARNING IN PRODUCTION
Speech Recognition

Recommender Systems

Autonomous Driving

Real-time Object Recognition

Robotics

Real-time Language Translation
Many More…

NGC
WHY CONTAINERS?

Benefits of containers:
• Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
• Isolate individual deep learning frameworks and applications
• Share, collaborate, and test applications across different environments
VIRTUAL MACHINE VS. CONTAINER
Not so similar

Virtual machines run each application on a full guest OS on top of a hypervisor, so every app carries its own guest OS plus bins/libs. Containers share the host operating system: each app and its bins/libs run directly on the Docker Engine, with no guest OS layer, on the same server infrastructure.
NVIDIA CONTAINER RUNTIME
https://github.com/NVIDIA/nvidia-docker

• Colloquially called "nvidia-docker"
• Docker containers are hardware-agnostic and platform-agnostic
• NVIDIA GPUs are specialized hardware that require the NVIDIA driver
• Docker does not natively support NVIDIA GPUs in containers
• NVIDIA Container Runtime makes the images agnostic of the NVIDIA driver
Docker Terms
Definitions
Image
Docker images are the basis of containers. An Image is an ordered collection of root filesystem changes
and the corresponding execution parameters for use within a container runtime. An image typically
contains a union of layered filesystems stacked on top of each other. An image does not have state and it
never changes.

Container
A container is a runtime instance of a docker image.
A Docker container consists of
● A Docker image
● Execution environment
● A standard set of instructions

https://docs.docker.com/engine/reference/glossary/
PRUNING

1. Reduce model size and increase throughput
2. Incrementally retrain the model after pruning to recover accuracy

Prune (tlt-prune) → Retrain (tlt-train)
SELECTING UNNECESSARY NEURONS

1. Data-driven operation
2. Non-data-driven operation
3. Handling element-wise operations of multiple inputs

pruned_model = TLT.prune(model, t)
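The tlt-prune call above is a one-liner inside TLT. As an illustration of the general idea (removing whole filters with small weight magnitude, then retraining), here is a minimal sketch using PyTorch's built-in structured pruning; the ResNet-18 model, the 30% pruning amount, and the dummy retraining loop are placeholders, not part of the TLT API.

# Hedged sketch: magnitude-based structured pruning followed by retraining,
# analogous in spirit to tlt-prune / tlt-train. Model, amount, and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(num_classes=10)

# Prune 30% of the output channels (dim=0) of every conv layer by L1 norm.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=1, dim=0)
        prune.remove(module, "weight")  # make the zeroed channels permanent

# Incrementally retrain to recover accuracy (random data stands in for the real set).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()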
SCENE ADAPTATION
Camera location, vantage point, person with blue shirt

• Same network adapting to different angles and vantage points
• Same network adapting to new data
• Train with new data from another vantage point, camera location, or added attribute

TLT
TENSORFLOW

The Automatic Mixed Precision feature is available both in native TensorFlow and inside the TensorFlow container on the NVIDIA NGC container registry. To enable it, set:

export TF_ENABLE_AUTO_MIXED_PRECISION=1

Alternatively, the environment variable can be set inside the TensorFlow Python script:

os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
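As a minimal sketch of where the variable fits, the snippet below sets it before TensorFlow is imported and then runs an ordinary Keras fit; the tiny model and random data are placeholders, and the automatic FP16 cast/loss-scaling rewrite only activates on a recent NGC TensorFlow build (or TF 1.14+) running on a Volta/Turing-class GPU.

# Hedged sketch: enable TF Automatic Mixed Precision via the environment variable.
# The toy model and random data are placeholders.
import os
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'  # set before importing TensorFlow

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = np.random.rand(1024, 784).astype('float32')
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=256, epochs=1)  # AMP graph rewrite is applied automatically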
PYTORCH

The Automatic Mixed Precision feature is available in the Apex repository on GitHub. To enable it, add these lines to your existing training script:

from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
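To show where these two calls sit in a training loop, here is a minimal sketch assuming Apex is installed and a CUDA GPU is available; the toy model, random data, and hyperparameters are placeholders.

# Hedged sketch: Apex AMP (opt_level "O1") in a complete toy training loop.
# Requires https://github.com/NVIDIA/apex and a CUDA-capable GPU.
import torch
import torch.nn as nn
from apex import amp

device = torch.device("cuda")
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# amp.initialize patches the model and optimizer for mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(100):
    x = torch.randn(64, 784, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    # scale_loss applies dynamic loss scaling before the backward pass.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()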
MXNET

The Automatic Mixed Precision feature is available both in native MXNet (1.5 or later) and inside the MXNet container (19.04 or later) on the NVIDIA NGC container registry. To enable the feature, add the following lines to your existing training script:

from mxnet import autograd
from mxnet.contrib import amp

amp.init()                 # patch MXNet operators for mixed precision
amp.init_trainer(trainer)  # patch the Gluon Trainer for loss scaling
with amp.scale_loss(loss, trainer) as scaled_loss:
    autograd.backward(scaled_loss)
AUTOMATIC MIXED PRECISION IN TENSORFLOW
Up to 3X speedup

TensorFlow Medium post: Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs

All models can be found at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow, except for ssd-rn50-fpn-640, which is here: https://github.com/tensorflow/models/tree/master/research/object_detection. All performance collected on 1xV100-16GB, except bert-squadqa on 1xV100-32GB.
Speedup is the ratio of time to train for a fixed number of epochs in single precision and with Automatic Mixed Precision. The number of epochs for each model matched the literature or common practice (both training sessions were also confirmed to reach the same model accuracy).
Batch sizes: rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; NCF: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; GNMT: 128 for FP32, 192 for AMP.
AUTOMATIC MIXED PRECISION IN PYTORCH
https://developer.nvidia.com/automatic-mixed-precision

• Plot shows the ResNet-50 result with and without automatic mixed precision (AMP): roughly 2X throughput with AMP enabled vs. FP32
• More AMP-enabled model scripts coming soon: Mask R-CNN, GNMT, NCF, etc.

Source: https://github.com/NVIDIA/apex/tree/master/examples/imagenet
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup of ~1.5X to 2X in comparison with FP32

https://github.com/apache/incubator-mxnet/pull/14173
NVIDIA TENSORRT
Programmable Inference Accelerator

Frameworks → TensorRT (Optimizer + Runtime) → GPU platforms: Tesla V100, Tesla P4, Jetson TX2, DRIVE PX 2, NVIDIA DLA

developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE

• 40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet-50): about 5,700 images/sec with V100 + TensorRT vs. 140 images/sec CPU-only
• 140x faster language translation RNNs on V100 vs. CPU-only inference (OpenNMT): about 550 sentences/sec with V100 + TensorRT vs. 4 sentences/sec CPU-only

ResNet-50: V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512.
OpenNMT 692M: V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB. CPU-only: Torch (FP32), batch size 1, Intel E5-2690 (Broadwell). Host CPU in GPU configurations: E5-2690 (Broadwell), HT on.
TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize the trained model
Trained neural network → Import model → TensorRT Optimizer → Serialize engine → Optimized plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with the runtime
Optimized plans → De-serialize engine → TensorRT Runtime Engine → Deploy (data center, automotive, embedded)

MODEL IMPORTING
For AI researchers and data scientists

• Example: import a TensorFlow model via the Model Importer (Python/C++ API)
• Other frameworks: use the Network Definition API (Python/C++ API)
• Runtime inference via the C++ or Python API
TENSORRT OPTIMIZATIONS

• Layer & tensor fusion
• Weights & activation precision calibration
• Kernel auto-tuning
• Dynamic tensor memory

➢ Optimizations are completely automatic
➢ Performed with a single function call
LAYER & TENSOR FUSION

Un-Optimized Network TensorRT Optimized Network


• Vertical Fusion
next input
• Horizonal Fusion next input
concat Elimination
• Layer
relu relu relu relu
bias
bias Network Layersbias Layers bias 3x3 CBR 5x5 CBR 1x1 CBR
1x1 conv. 3x3 conv. 5x5 conv. 1x1 conv.
before after
relu relu
VGG19
bias
43bias 27max pool
1x1 CBR max pool
Inception
1x1 conv. 1x1
309conv. 113
V3
input
ResNet-152 670 159 input
concat

31
32
KERNEL AUTO-TUNING AND DYNAMIC TENSOR MEMORY

Kernel auto-tuning
• Selects from 100s of specialized kernels, optimized for every GPU platform (Tesla V100, Jetson TX2, Drive PX2, ...)
• Tunes over multiple parameters: batch size, input dimensions, filter dimensions, ...

Dynamic tensor memory
• Reduces memory footprint and improves memory re-use
• Manages memory allocation for each tensor only for the duration of its usage
EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
Deployment and inference

Import, optimize, and deploy TensorFlow models using the TensorRT Python API.

Steps:
• Start with a frozen TensorFlow model
• Create a model parser
• Optimize the model and create a runtime engine
• Perform inference on new data using the optimized runtime engine
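A minimal sketch of these steps with the TensorRT 5/6-era Python API and the UFF parser is shown below; it assumes the frozen TensorFlow graph has already been converted to UFF, and the file name, tensor names, and shapes are placeholders.

# Hedged sketch: build a TensorRT engine from a UFF model (TensorRT 5/6-era API).
# "model.uff", tensor names, and shapes are placeholders.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:

    # Register the graph's input/output tensors, then parse the UFF file.
    parser.register_input("input", (3, 224, 224))
    parser.register_output("predictions/Softmax")
    parser.parse("model.uff", network)

    # Builder settings control optimization (workspace, batch size, precision).
    builder.max_batch_size = 8
    builder.max_workspace_size = 1 << 30   # 1 GB
    builder.fp16_mode = True               # use FP16 kernels where beneficial

    engine = builder.build_cuda_engine(network)

    # Serialize the optimized plan for later deployment.
    with open("model.plan", "wb") as f:
        f.write(engine.serialize())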
7 STEPS TO DEPLOYMENT WITH TENSORRT

Step 1: Convert the trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize the model and create a runtime engine
Step 5: Serialize the optimized engine
Step 6: De-serialize the engine
Step 7: Perform inference
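Steps 5 to 7 can be sketched as follows, reusing the serialized plan from the previous example and PyCUDA for device buffers; the shapes and binding order are placeholders that must match the actual network.

# Hedged sketch: de-serialize a TensorRT plan and run inference (steps 5-7).
# Assumes the "model.plan" file from the previous sketch; shapes are placeholders.
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

with engine.create_execution_context() as context:
    # Host buffers: one input image and one output vector (placeholder sizes).
    h_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
    h_output = np.empty((1, 1000), dtype=np.float32)

    # Device buffers and host-to-device copy.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    cuda.memcpy_htod(d_input, h_input)

    # Bindings follow the engine's input/output order.
    context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])

    cuda.memcpy_dtoh(h_output, d_output)
    print("top-1 class:", int(h_output.argmax()))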

TensorRT Inference with TensorFlow
TensorFlow
An end-to-end open source machine learning platform

● Powerful platform for research and experimentation


● Versatile, easy model building
● Robust ML production anywhere
● Most popular ML project on GitHub

41m Downloads
NVIDIA TensorRT
Platform for High-Performance Deep Learning Inference

● Optimize and Deploy neural networks in production environments

● Maximize throughput for latency-critical apps with optimizer and runtime

● Deploy responsive and memory efficient apps with INT8 & FP16

300k Downloads in 2018


TF-TRT = TF + TRT

AGENDA: TensorRT Inference with TensorFlow
● Benefits of using TF-TRT
● How to use it
● Customer experience: Clarifai
● How TF-TRT works
● Additional resources
Benefits of using TF-TRT
● Optimize TF inference while still using the TF ecosystem
● Simple API: up to 8x performance gain with little effort
● Fallback to native TensorFlow for layers TensorRT does not support
● Over 10 optimized models with published examples
● More performance optimizations coming soon: additional NLP and object detection models

Models            TF FP32 (imgs/s)   TF-TRT INT8 (imgs/s)   Speedup
ResNet-50         399                3053                   7.7x
Inception V4      158                1128                   7.1x
Mobilenet V1      1203               4975                   4.1x
NASNet large      43                 162                    3.8x
VGG16             245                1568                   6.4x
SSD Mobilenet V2  102                411                    4.0x
SSD Inception V2  82                 327                    4.0x

TensorFlow FP32 vs. TensorFlow-TensorRT INT8 on T4, largest possible batch size, no I/O.
NGC TensorFlow 19.07 with scripts: https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
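As a minimal sketch of that simple API (TF 1.14/1.15-era TF-TRT), the snippet below converts a SavedModel to an FP16 TF-TRT SavedModel; the directory names are placeholders, and INT8 mode additionally requires a calibration step that is not shown.

# Hedged sketch: convert a TensorFlow SavedModel with TF-TRT (TF 1.x-era API).
# Input/output paths are placeholders; INT8 would additionally need calibration data.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="resnet50_saved_model",   # trained TF model
    precision_mode=trt.TrtPrecisionMode.FP16,       # or INT8 with calibration
    max_batch_size=32)

converter.convert()                                 # replace subgraphs with TRT ops
converter.save("resnet50_trt_fp16_saved_model")     # serve with standard TF tooling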
FP16 ACCURACY

Models            TF FP32   TF-TRT FP16
Mobilenet V2      74.08     74.07
NASNet Mobile     73.97     73.87
ResNet 50 V1.5    76.51     76.48
ResNet 50 V2      76.43     76.40
VGG 16            70.89     70.91
Inception V3      77.99     77.97
SSD Mobilenet v1  23.06     23.07

FP16 accuracy is within 0.1% of FP32 accuracy.
Top-1 metric (%) for classification models; mAP for SSD detection models.
Complete data: https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#verified-models
INT8 ACCURACY

Models            TF FP32   TF-TRT INT8
Mobilenet V2      74.08     73.90
NASNet Mobile     73.97     73.55
ResNet 50 V1.5    76.51     76.23
ResNet 50 V2      76.43     76.30
VGG 16            70.89     70.78
Inception V3      77.99     77.85

INT8 accuracy is within 0.2% of FP32 accuracy, except for NASNet Mobile (within 0.5%).
Top-1 metric (%) for classification models.
Complete data: https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#verified-models
TENSORRT ONNX PARSER

● Parser to import ONNX models into TensorRT
● Optimize and deploy models from ONNX-supported frameworks in production
● Apply TensorRT optimizations to any ONNX framework (Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, PyTorch)
● C++ and Python APIs to import ONNX models
● New samples demonstrating the step-by-step process to get started
INEFFICIENCY LIMITS INNOVATION
Difficulties with deploying data center inference

• Single model only: some systems are overused while others are underutilized (ASR, NLP, and recommender workloads compete for capacity)
• Single framework only: solutions can only support models from one framework
• Custom development: developers need to reinvent the plumbing for every application
NVIDIA TENSORRT INFERENCE SERVER
Production data center inference server

• Maximize real-time inference performance of GPUs
• Quickly deploy and manage multiple models per GPU per node
• Easily scale to heterogeneous GPUs and multi-GPU nodes (Tesla T4, Tesla V100, Tesla P4)
• Integrates with orchestration systems and auto-scalers via latency and health metrics
• Now open source for thorough customization and integration
FEATURES

Concurrent model execution
Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously

Dynamic batching
Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA

CPU model inference execution
Framework-native models can execute inference requests on the CPU

Multiple model format support
PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow and TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT plans, Caffe2 NetDef (ONNX import path)

Metrics
Utilization, count, memory, and latency

Custom backend
Gives the user more flexibility by providing their own implementation of an execution engine through the use of a shared library

CMake build
Build the inference server from source, making it more portable to multiple OSes and removing the build dependency on Docker

Model ensemble
Pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend)

Streaming API
Built-in support for audio streaming input, e.g. for speech recognition
INFERENCE SERVER ARCHITECTURE
Available with monthly updates

Models supported
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)

● Multi-GPU support
● Concurrent model execution
● Server HTTP REST API/gRPC
● Python/C++ client libraries
Additional resources
- GTC technical presentation: https://developer.nvidia.com/gtc/2019/video/S9431/video
- TF-TRT user guide: https://docs.nvidia.com/deeplearning/dgx/tf-trt-user-guide/index.html
- NVIDIA DLI course on TF-TRT: https://www.nvidia.com/en-us/deep-learning-ai/education/
- Monthly release notes: https://docs.nvidia.com/deeplearning/dgx/tf-trt-release-notes/index.html
- Google blog on TF-TRT inference: https://cloud.google.com/blog/products/ai-machine-learning/running-tensorflow-inference-workloads-at-scale-with-tensorrt-5-and-nvidia-t4-gpus
- NVIDIA Developer Blog: https://devblogs.nvidia.com/tensorrt-integration-speeds-tensorflow-inference/
AGENDA
1. Intelligent Video Analytics
2. DeepStream SDK
   a. What is the DeepStream SDK?
   b. Why DeepStream SDK?
   c. What's new with DS 4.0?
   d. DeepStream building blocks
3. Getting started with the DeepStream SDK
   a. Where to start?
   b. Directory hierarchy
   c. Configuration file and pipeline details
   d. Running the application
4. Building with the DeepStream SDK
   a. Real-world use cases with demo
   b. Resources
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY

Access Control • Public Transit • Industrial Inspection • Traffic Engineering
Retail Analytics • Logistics • Critical Infrastructure • Public Safety
WHAT IS DEEPSTREAM?

Applications and Services

DEEPSTREAM SDK: hardware-accelerated plugins, Docker containers & orchestration, reference applications & recipes, analytic IoT runtime

CUDA-X: Kubernetes on GPUs, NVIDIA Container RT, CUDA, Multimedia, TensorRT

NVIDIA COMPUTING PLATFORM - EDGE TO CLOUD
JETSON | TESLA
WHY DEEPSTREAM?
The most comprehensive end-to-end development platform for IVA.

• Broader use cases and industries: build your own application for smart cities, retail analytics, industrial inspection, logistics, and more
• Faster time to market: ready-to-use building blocks and IP simplify building your innovative product
• Performance driven: low latency and exceptional performance optimized for NVIDIA GPUs for real-time edge analytics
• Cloud integration: push-button IoT solution integration to build applications and services with cloud service providers
• Faster time to progress: iterate and integrate quickly with plug-and-play of popular pre-packaged plug-ins, or build your own


DEEPSTREAM SDK

Plugins (built with open source, 3rd party, NVIDIA): DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins

Analytics (multi-camera, multi-sensor framework): DeepStream in containers, multi-GPU orchestration, tracking & analytics across large-scale/multi-camera deployments, streaming and batch analytics, event fabric

Development tools: end-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes & adaptation guides, plugin templates and custom IP integration

DeepStream SDK builds on: TensorRT, Multimedia APIs/Video Codec SDK, imaging & dewarping library, metadata & messaging (message bus clients), multi-camera tracking lib, NV containers, Linux, CUDA

Perception infra: Jetson, Tesla servers (edge and cloud). Analytics infra: edge server, NGC, AWS, Azure.
REAL-TIME INSIGHTS, HIGHEST STREAM DENSITY

NGC | ANY CLOUD
NVIDIA Metropolis application framework, NVIDIA Edge Stack, NVIDIA EGX Server
Pixels → Information → Dashboard (analytics, visualization, cloud monitoring)

68 streams of 1080p per T4
SMART PARKING: PERCEPTION GRAPH

Comm plugin → Preprocessing plugins → Detection, classification & tracking plugins → Communications plugins

RTSP (360° feeds) → Decoder → Dewarp library (camera calibration, dewarping) → Detection and classification (ROI calibration, ROI: lines) → Tracker → Detection and classification (ROI: polygon) → Global positioning → Transmit metadata → Analytics server
VIDEO: INTELLIGENT TRAFFIC SYSTEM
Perception → Analytics → Visualization
WAREHOUSE LOGISTICS: INVENTORY SORTING

USE CASE: Detect and flag packages on a conveyor belt.

SOLUTION: A DeepStream container on an NVIDIA IoT edge device connects to Azure IoT Central through the Azure IoT Edge runtime, sending telemetry data to business logic services in the cloud.
THANK YOU!

QUESTIONS?