
ACCELERATING THE END-TO-END DEEP LEARNING WORKFLOW
Deepshikha Kumari, Data Scientist II, Deep Learning
AGENDA
1. AI use cases for industry
2. End-to-end deep learning workflow

Training pipeline
  a. NGC
  b. Transfer Learning
  c. Automatic Mixed Precision
  d. Code walkthrough

Inference pipeline
  a. TensorRT (FP16)
  b. TensorRT (INT8)
  c. Custom plugin support
  d. DeepStream
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY

Access Control • Public Transit • Industrial Inspection • Traffic Engineering
Retail Analytics • Logistics • Critical Infrastructure • Public Safety
DEEP LEARNING IN PRODUCTION
Speech Recognition

Recommender Systems

Autonomous Driving

Real-time Object Recognition

Robotics

Real-time Language Translation
Many More…

NGC
WHY CONTAINERS?

Benefits of containers:
• Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
• Isolate individual deep learning frameworks and applications
• Share, collaborate, and test applications across different environments
VIRTUAL MACHINE VS. CONTAINER
Not so similar

Virtual machines run each application on a full guest OS on top of a hypervisor, so every app carries its own guest OS plus bins/libs. Containers share the host operating system: each app and its bins/libs run directly on the Docker Engine, with no guest OS layer, on the same server infrastructure.
NVIDIA CONTAINER RUNTIME
https://github.com/NVIDIA/nvidia-docker

• Colloquially called "nvidia-docker"
• Docker containers are hardware-agnostic and platform-agnostic
• NVIDIA GPUs are specialized hardware that require the NVIDIA driver
• Docker does not natively support NVIDIA GPUs in containers
• NVIDIA Container Runtime makes the images agnostic of the NVIDIA driver
Docker Terms
Definitions
Image
Docker images are the basis of containers. An Image is an ordered collection of root filesystem changes
and the corresponding execution parameters for use within a container runtime. An image typically
contains a union of layered filesystems stacked on top of each other. An image does not have state and it
never changes.

Container
A container is a runtime instance of a docker image.
A Docker container consists of
● A Docker image
● Execution environment
● A standard set of instructions

https://docs.docker.com/engine/reference/glossary/
PRUNING

1. Reduce model size and increase throughput
2. Incrementally retrain the model after pruning to recover accuracy

Prune (tlt-prune) → Retrain (tlt-train)
SELECTING UNNECESSARY NEURONS

1. Data-driven operation
2. Non-data-driven operation
3. Handling element-wise operations of multiple inputs

pruned_model = TLT.prune(model, t)
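The tlt-prune call above is a one-liner inside TLT. As an illustration of the general idea (removing whole filters with small weight magnitude, then retraining), here is a minimal sketch using PyTorch's built-in structured pruning; the ResNet-18 model, the 30% pruning amount, and the dummy retraining loop are placeholders, not part of the TLT API.

# Hedged sketch: magnitude-based structured pruning followed by retraining,
# analogous in spirit to tlt-prune / tlt-train. Model, amount, and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(num_classes=10)

# Prune 30% of the output channels (dim=0) of every conv layer by L1 norm.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=1, dim=0)
        prune.remove(module, "weight")  # make the zeroed channels permanent

# Incrementally retrain to recover accuracy (random data stands in for the real set).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()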
SCENE ADAPTATION
Camera location, vantage point, person with blue shirt

• Same network adapting to different angles and vantage points
• Same network adapting to new data
• Train with new data from another vantage point, camera location, or added attribute

TLT
TENSORFLOW

The Automatic Mixed Precision feature is available both in native TensorFlow and inside the TensorFlow container on the NVIDIA NGC container registry. To enable it, set:

export TF_ENABLE_AUTO_MIXED_PRECISION=1

Alternatively, the environment variable can be set inside the TensorFlow Python script:

os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
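As a minimal sketch of where the variable fits, the snippet below sets it before TensorFlow is imported and then runs an ordinary Keras fit; the tiny model and random data are placeholders, and the automatic FP16 cast/loss-scaling rewrite only activates on a recent NGC TensorFlow build (or TF 1.14+) running on a Volta/Turing-class GPU.

# Hedged sketch: enable TF Automatic Mixed Precision via the environment variable.
# The toy model and random data are placeholders.
import os
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'  # set before importing TensorFlow

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = np.random.rand(1024, 784).astype('float32')
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=256, epochs=1)  # AMP graph rewrite is applied automatically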
PYTORCH

The Automatic Mixed Precision feature is available in the Apex repository on GitHub. To enable it, add these lines to your existing training script:

from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
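To show where these two calls sit in a training loop, here is a minimal sketch assuming Apex is installed and a CUDA GPU is available; the toy model, random data, and hyperparameters are placeholders.

# Hedged sketch: Apex AMP (opt_level "O1") in a complete toy training loop.
# Requires https://github.com/NVIDIA/apex and a CUDA-capable GPU.
import torch
import torch.nn as nn
from apex import amp

device = torch.device("cuda")
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# amp.initialize patches the model and optimizer for mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(100):
    x = torch.randn(64, 784, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    # scale_loss applies dynamic loss scaling before the backward pass.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()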
MXNET

The Automatic Mixed Precision feature is available both in native MXNet (1.5 or later) and inside the MXNet container (19.04 or later) on the NVIDIA NGC container registry. To enable the feature, add the following lines to your existing training script:

from mxnet import autograd
from mxnet.contrib import amp

amp.init()                 # patch MXNet operators for mixed precision
amp.init_trainer(trainer)  # patch the Gluon Trainer for loss scaling
with amp.scale_loss(loss, trainer) as scaled_loss:
    autograd.backward(scaled_loss)
AUTOMATIC MIXED PRECISION IN TENSORFLOW
Up to 3X speedup

TensorFlow Medium post: Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs

All models can be found at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow, except for ssd-rn50-fpn-640, which is here: https://github.com/tensorflow/models/tree/master/research/object_detection. All performance collected on 1xV100-16GB, except bert-squadqa on 1xV100-32GB.
Speedup is the ratio of time to train for a fixed number of epochs in single precision and with Automatic Mixed Precision. The number of epochs for each model matched the literature or common practice (both training sessions were also confirmed to reach the same model accuracy).
Batch sizes: rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; NCF: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; GNMT: 128 for FP32, 192 for AMP.
AUTOMATIC MIXED PRECISION IN PYTORCH
https://developer.nvidia.com/automatic-mixed-precision

• Plot shows the ResNet-50 result with and without automatic mixed precision (AMP): roughly 2X throughput with AMP enabled vs. FP32
• More AMP-enabled model scripts coming soon: Mask R-CNN, GNMT, NCF, etc.

Source: https://github.com/NVIDIA/apex/tree/master/examples/imagenet
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup of ~1.5X to 2X in comparison with FP32

https://github.com/apache/incubator-mxnet/pull/14173
NVIDIA TENSORRT
Programmable Inference Accelerator

Frameworks → TensorRT (Optimizer + Runtime) → GPU platforms: Tesla V100, Tesla P4, Jetson TX2, DRIVE PX 2, NVIDIA DLA

developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE

• 40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet-50): about 5,700 images/sec with V100 + TensorRT vs. 140 images/sec CPU-only
• 140x faster language translation RNNs on V100 vs. CPU-only inference (OpenNMT): about 550 sentences/sec with V100 + TensorRT vs. 4 sentences/sec CPU-only

ResNet-50: V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512.
OpenNMT 692M: V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB. CPU-only: Torch (FP32), batch size 1, Intel E5-2690 (Broadwell). Host CPU in GPU configurations: E5-2690 (Broadwell), HT on.
TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize the trained model
Trained neural network → Import model → TensorRT Optimizer → Serialize engine → Optimized plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with the runtime
Optimized plans → De-serialize engine → TensorRT Runtime Engine → Deploy (data center, automotive, embedded)

MODEL IMPORTING
For AI researchers and data scientists

• Example: import a TensorFlow model via the Model Importer (Python/C++ API)
• Other frameworks: use the Network Definition API (Python/C++ API)
• Runtime inference via the C++ or Python API
TENSORRT OPTIMIZATIONS

• Layer & tensor fusion
• Weights & activation precision calibration
• Kernel auto-tuning
• Dynamic tensor memory

➢ Optimizations are completely automatic
➢ Performed with a single function call
LAYER & TENSOR FUSION

Un-Optimized Network TensorRT Optimized Network


• Vertical Fusion
next input
• Horizonal Fusion next input
concat Elimination
• Layer
relu relu relu relu
bias
bias Network Layersbias Layers bias 3x3 CBR 5x5 CBR 1x1 CBR
1x1 conv. 3x3 conv. 5x5 conv. 1x1 conv.
before after
relu relu
VGG19
bias
43bias 27max pool
1x1 CBR max pool
Inception
1x1 conv. 1x1
309conv. 113
V3
input
ResNet-152 670 159 input
concat

31
32
KERNEL AUTO-TUNING AND DYNAMIC TENSOR MEMORY

Kernel auto-tuning
• Selects from 100s of specialized kernels, optimized for every GPU platform (Tesla V100, Jetson TX2, Drive PX2, ...)
• Tunes over multiple parameters: batch size, input dimensions, filter dimensions, ...

Dynamic tensor memory
• Reduces memory footprint and improves memory re-use
• Manages memory allocation for each tensor only for the duration of its usage
EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
Deployment and inference

Import, optimize, and deploy TensorFlow models using the TensorRT Python API.

Steps:
• Start with a frozen TensorFlow model
• Create a model parser
• Optimize the model and create a runtime engine
• Perform inference on new data using the optimized runtime engine
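A minimal sketch of these steps with the TensorRT 5/6-era Python API and the UFF parser is shown below; it assumes the frozen TensorFlow graph has already been converted to UFF, and the file name, tensor names, and shapes are placeholders.

# Hedged sketch: build a TensorRT engine from a UFF model (TensorRT 5/6-era API).
# "model.uff", tensor names, and shapes are placeholders.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:

    # Register the graph's input/output tensors, then parse the UFF file.
    parser.register_input("input", (3, 224, 224))
    parser.register_output("predictions/Softmax")
    parser.parse("model.uff", network)

    # Builder settings control optimization (workspace, batch size, precision).
    builder.max_batch_size = 8
    builder.max_workspace_size = 1 << 30   # 1 GB
    builder.fp16_mode = True               # use FP16 kernels where beneficial

    engine = builder.build_cuda_engine(network)

    # Serialize the optimized plan for later deployment.
    with open("model.plan", "wb") as f:
        f.write(engine.serialize())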
7 STEPS TO DEPLOYMENT WITH TENSORRT

Step 1: Convert the trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize the model and create a runtime engine
Step 5: Serialize the optimized engine
Step 6: De-serialize the engine
Step 7: Perform inference
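Steps 5 to 7 can be sketched as follows, reusing the serialized plan from the previous example and PyCUDA for device buffers; the shapes and binding order are placeholders that must match the actual network.

# Hedged sketch: de-serialize a TensorRT plan and run inference (steps 5-7).
# Assumes the "model.plan" file from the previous sketch; shapes are placeholders.
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

with engine.create_execution_context() as context:
    # Host buffers: one input image and one output vector (placeholder sizes).
    h_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
    h_output = np.empty((1, 1000), dtype=np.float32)

    # Device buffers and host-to-device copy.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    cuda.memcpy_htod(d_input, h_input)

    # Bindings follow the engine's input/output order.
    context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])

    cuda.memcpy_dtoh(h_output, d_output)
    print("top-1 class:", int(h_output.argmax()))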

TensorRT Inference with TensorFlow
TensorFlow
An end-to-end open source machine learning platform

● Powerful platform for research and experimentation


● Versatile, easy model building
● Robust ML production anywhere
● Most popular ML project on GitHub

41m Downloads
NVIDIA TensorRT
Platform for High-Performance Deep Learning Inference

● Optimize and Deploy neural networks in production environments

● Maximize throughput for latency-critical apps with optimizer and runtime

● Deploy responsive and memory efficient apps with INT8 & FP16

300k Downloads in 2018


TF-TRT = TF + TRT

AGENDA: TensorRT Inference with TensorFlow
● Benefits of using TF-TRT
● How to use it
● Customer experience: Clarifai
● How TF-TRT works
● Additional resources
Benefits of using TF-TRT
● Optimize TF inference while still using the TF ecosystem
● Simple API: up to 8x performance gain with little effort
● Fallback to native TensorFlow for layers TensorRT does not support
● Over 10 optimized models with published examples
● More performance optimizations coming soon: additional NLP and object detection models

Models            TF FP32 (imgs/s)   TF-TRT INT8 (imgs/s)   Speedup
ResNet-50         399                3053                   7.7x
Inception V4      158                1128                   7.1x
Mobilenet V1      1203               4975                   4.1x
NASNet large      43                 162                    3.8x
VGG16             245                1568                   6.4x
SSD Mobilenet V2  102                411                    4.0x
SSD Inception V2  82                 327                    4.0x

TensorFlow FP32 vs. TensorFlow-TensorRT INT8 on T4, largest possible batch size, no I/O.
NGC TensorFlow 19.07 with scripts: https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
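As a minimal sketch of that simple API (TF 1.14/1.15-era TF-TRT), the snippet below converts a SavedModel to an FP16 TF-TRT SavedModel; the directory names are placeholders, and INT8 mode additionally requires a calibration step that is not shown.

# Hedged sketch: convert a TensorFlow SavedModel with TF-TRT (TF 1.x-era API).
# Input/output paths are placeholders; INT8 would additionally need calibration data.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="resnet50_saved_model",   # trained TF model
    precision_mode=trt.TrtPrecisionMode.FP16,       # or INT8 with calibration
    max_batch_size=32)

converter.convert()                                 # replace subgraphs with TRT ops
converter.save("resnet50_trt_fp16_saved_model")     # serve with standard TF tooling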
FP16 ACCURACY

Models            TF FP32   TF-TRT FP16
Mobilenet V2      74.08     74.07
NASNet Mobile     73.97     73.87
ResNet 50 V1.5    76.51     76.48
ResNet 50 V2      76.43     76.40
VGG 16            70.89     70.91
Inception V3      77.99     77.97
SSD Mobilenet v1  23.06     23.07

FP16 accuracy is within 0.1% of FP32 accuracy.
Top-1 metric (%) for classification models; mAP for SSD detection models.
Complete data: https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#verified-models
INT8 ACCURACY

Models            TF FP32   TF-TRT INT8
Mobilenet V2      74.08     73.90
NASNet Mobile     73.97     73.55
ResNet 50 V1.5    76.51     76.23
ResNet 50 V2      76.43     76.30
VGG 16            70.89     70.78
Inception V3      77.99     77.85

INT8 accuracy is within 0.2% of FP32 accuracy, except for NASNet Mobile (within 0.5%).
Top-1 metric (%) for classification models.
Complete data: https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#verified-models
TENSORRT ONNX PARSER

● Parser to import ONNX models into TensorRT
● Optimize and deploy models from ONNX-supported frameworks in production
● Apply TensorRT optimizations to any ONNX framework (Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, PyTorch)
● C++ and Python APIs to import ONNX models
● New samples demonstrating the step-by-step process to get started
INEFFICIENCY LIMITS INNOVATION
Difficulties with deploying data center inference

• Single model only: some systems are overused while others are underutilized (ASR, NLP, and recommender workloads compete for capacity)
• Single framework only: solutions can only support models from one framework
• Custom development: developers need to reinvent the plumbing for every application
NVIDIA TENSORRT INFERENCE SERVER
Production data center inference server

• Maximize real-time inference performance of GPUs
• Quickly deploy and manage multiple models per GPU per node
• Easily scale to heterogeneous GPUs and multi-GPU nodes (Tesla T4, Tesla V100, Tesla P4)
• Integrates with orchestration systems and auto-scalers via latency and health metrics
• Now open source for thorough customization and integration
FEATURES

Concurrent model execution
Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously

Dynamic batching
Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA

CPU model inference execution
Framework-native models can execute inference requests on the CPU

Multiple model format support
PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow and TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT plans, Caffe2 NetDef (ONNX import path)

Metrics
Utilization, count, memory, and latency

Custom backend
Gives the user more flexibility by providing their own implementation of an execution engine through the use of a shared library

CMake build
Build the inference server from source, making it more portable to multiple OSes and removing the build dependency on Docker

Model ensemble
Pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend)

Streaming API
Built-in support for audio streaming input, e.g. for speech recognition
INFERENCE SERVER ARCHITECTURE
Available with monthly updates

Models supported
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)

● Multi-GPU support
● Concurrent model execution
● Server HTTP REST API/gRPC
● Python/C++ client libraries
Additional resources
- GTC technical presentation: https://developer.nvidia.com/gtc/2019/video/S9431/video
- TF-TRT user guide: https://docs.nvidia.com/deeplearning/dgx/tf-trt-user-guide/index.html
- NVIDIA DLI course on TF-TRT: https://www.nvidia.com/en-us/deep-learning-ai/education/
- Monthly release notes: https://docs.nvidia.com/deeplearning/dgx/tf-trt-release-notes/index.html
- Google blog on TF-TRT inference: https://cloud.google.com/blog/products/ai-machine-learning/running-tensorflow-inference-workloads-at-scale-with-tensorrt-5-and-nvidia-t4-gpus
- NVIDIA Developer Blog: https://devblogs.nvidia.com/tensorrt-integration-speeds-tensorflow-inference/
AGENDA
1. Intelligent Video Analytics
2. DeepStream SDK
   a. What is the DeepStream SDK?
   b. Why DeepStream SDK?
   c. What's new with DS 4.0?
   d. DeepStream building blocks
3. Getting started with the DeepStream SDK
   a. Where to start?
   b. Directory hierarchy
   c. Configuration file and pipeline details
   d. Running the application
4. Building with the DeepStream SDK
   a. Real-world use cases with demo
   b. Resources
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY

Access Control • Public Transit • Industrial Inspection • Traffic Engineering
Retail Analytics • Logistics • Critical Infrastructure • Public Safety
WHAT IS DEEPSTREAM?

Applications and Services

DEEPSTREAM SDK: hardware-accelerated plugins, Docker containers & orchestration, reference applications & recipes, analytic IoT runtime

CUDA-X: Kubernetes on GPUs, NVIDIA Container RT, CUDA, Multimedia, TensorRT

NVIDIA COMPUTING PLATFORM - EDGE TO CLOUD
JETSON | TESLA
WHY DEEPSTREAM?
The most comprehensive end-to-end development platform for IVA.

• Broader use cases and industries: build your own application for smart cities, retail analytics, industrial inspection, logistics, and more
• Faster time to market: ready-to-use building blocks and IP simplify building your innovative product
• Performance driven: low latency and exceptional performance optimized for NVIDIA GPUs for real-time edge analytics
• Cloud integration: push-button IoT solution integration to build applications and services with cloud service providers
• Faster time to progress: iterate and integrate quickly with plug-and-play of popular pre-packaged plug-ins, or build your own


DEEPSTREAM SDK

Plugins (built with open source, 3rd party, NVIDIA): DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins

Analytics (multi-camera, multi-sensor framework): DeepStream in containers, multi-GPU orchestration, tracking & analytics across large-scale/multi-camera deployments, streaming and batch analytics, event fabric

Development tools: end-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes & adaptation guides, plugin templates and custom IP integration

DeepStream SDK builds on: TensorRT, Multimedia APIs/Video Codec SDK, imaging & dewarping library, metadata & messaging (message bus clients), multi-camera tracking lib, NV containers, Linux, CUDA

Perception infra: Jetson, Tesla servers (edge and cloud). Analytics infra: edge server, NGC, AWS, Azure.
REAL-TIME INSIGHTS, HIGHEST STREAM DENSITY

NGC | ANY CLOUD
NVIDIA Metropolis application framework, NVIDIA Edge Stack, NVIDIA EGX Server
Pixels → Information → Dashboard (analytics, visualization, cloud monitoring)

68 streams of 1080p per T4
SMART PARKING: PERCEPTION GRAPH

Comm plugin → Preprocessing plugins → Detection, classification & tracking plugins → Communications plugins

RTSP (360° feeds) → Decoder → Dewarp library (camera calibration, dewarping) → Detection and classification (ROI calibration, ROI: lines) → Tracker → Detection and classification (ROI: polygon) → Global positioning → Transmit metadata → Analytics server
VIDEO: INTELLIGENT TRAFFIC SYSTEM
Perception → Analytics → Visualization
WAREHOUSE LOGISTICS: INVENTORY SORTING

USE CASE: Detect and flag packages on a conveyor belt.

SOLUTION: A DeepStream container on an NVIDIA IoT edge device connects to Azure IoT Central through the Azure IoT Edge runtime, sending telemetry data to business logic services in the cloud.
THANK YOU!

QUESTIONS?