Universal Model Serving via Triton and TensorRT: Ke Ma, GenAI@Snap, Inc.
Take Home
Outline
• Introduction
• TF SavedModel Deployment
• PyTorch Model Deployment
• Serving Parameter Optimization
• Serving Monitor
• Efficient Inference with TensorRT
• Something We Could Do Better
• Conclusion
Introduction
● In a production pipeline, the models may be trained or generated with multiple frameworks.
Introduction
Example use cases: fashion detection, fashion embedding, similar search.
TF SavedModel Deployment
● TF SavedModel file: the whole pipeline is packaged into one graph.
  Raw jpeg image bytes → decoding → resizing → normalization → YOLOv5 backbone → filtering → NMS

● config.pbtxt:
  backend: tensorflow
  max_batch_size: 16
  dynamic_batching: {}

● Model repository layout:
  models/
  - 1/
    - model.savedmodel
  - config.pbtxt
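For reference, here is a minimal Python client sketch for such a deployment. The model name "yolov5_savedmodel" and the tensor names "INPUT"/"OUTPUT" are placeholders, not from the slides; use the names exposed by your config.pbtxt and SavedModel signature.

import numpy as np
import tritonclient.http as httpclient

# Minimal sketch: send raw JPEG bytes to the TF SavedModel model on Triton.
client = httpclient.InferenceServerClient(url="localhost:8000")

with open("test.jpg", "rb") as f:
    raw = np.frombuffer(f.read(), dtype=np.uint8)

# Shape [1, N]: one batch element holding the variable-length JPEG byte array.
infer_input = httpclient.InferInput("INPUT", [1, raw.size], "UINT8")
infer_input.set_data_from_numpy(raw.reshape(1, -1))

result = client.infer(
    model_name="yolov5_savedmodel",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT")],
)
print(result.as_numpy("OUTPUT"))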
PyTorch Model Deployment
An Ensemble
Preprocessing (Python + PyTorch):

import torch
import torchvision

transforms = torchvision.transforms.Compose([
    torchvision.transforms.Resize((400, 400)),
    # Convert uint8 [0, 255] to float [0, 1] so Normalize can be applied.
    torchvision.transforms.ConvertImageDtype(torch.float32),
    torchvision.transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

img = torchvision.io.decode_jpeg(
    torch.frombuffer(raw_bytes_data, dtype=torch.uint8),
    mode=torchvision.io.ImageReadMode.RGB,
    device="cpu",
)
preprocessed_img = transforms(img)

torch.nn.Module → ONNX export:

torch.onnx.export(
    model,
    dummy_input,
    "onnx_model.onnx",
    export_params=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "dynamic_axis_0"},
        "output": {0: "dynamic_axis_0"},
    },
)
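As a quick sanity check (not on the slide), the exported graph can be loaded back with onnxruntime and compared against the eager PyTorch output. A minimal sketch, reusing model and dummy_input from above:

import numpy as np
import onnxruntime as ort

# Sketch: verify that the ONNX export matches the PyTorch model on the dummy input.
session = ort.InferenceSession("onnx_model.onnx", providers=["CPUExecutionProvider"])
onnx_out = session.run(["output"], {"input": dummy_input.numpy()})[0]

with torch.no_grad():
    torch_out = model(dummy_input).numpy()

np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)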
An Ensemble
● Triton ensemble backend
  ○ Preprocessing model
    ■ Python backend
  ○ Inference model
    ■ onnxruntime backend
● Data flow: ensemble_input → preproc_input → [Preprocess: Python] → preproc_output → onnx_input → [Inference: onnxruntime] → onnx_output → ensemble_output
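For illustration, a minimal model.py sketch for the preprocessing step under Triton's Python backend, reusing the torchvision transforms above. The tensor names match the diagram; everything else (one image per request, no batching or error handling) is a simplifying assumption, not the production code.

import torch
import torchvision
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Sketch of the preprocessing model served by Triton's Python backend."""

    def initialize(self, args):
        self.transforms = torchvision.transforms.Compose([
            torchvision.transforms.Resize((400, 400)),
            torchvision.transforms.ConvertImageDtype(torch.float32),
            torchvision.transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])

    def execute(self, requests):
        responses = []
        for request in requests:
            # Raw JPEG bytes arrive as a UINT8 tensor named "preproc_input".
            raw = pb_utils.get_input_tensor_by_name(request, "preproc_input")
            img = torchvision.io.decode_jpeg(
                torch.frombuffer(raw.as_numpy().tobytes(), dtype=torch.uint8),
                mode=torchvision.io.ImageReadMode.RGB,
            )
            out = pb_utils.Tensor("preproc_output", self.transforms(img).numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses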
platform: ensemble
max_batch_size: 16
input:
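The ensemble config above is truncated on the slide. For orientation, here is a hedged sketch of what a complete config.pbtxt for this two-step ensemble typically looks like; the step tensor names follow the diagram above, while the data types, dims, and the intermediate tensor name are placeholder assumptions.

# Sketch only: data types, dims, and "preprocessed_image" are placeholders.
platform: "ensemble"
max_batch_size: 16
input [
  { name: "ensemble_input", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "ensemble_output", data_type: TYPE_FP32, dims: [ -1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "preproc_input" value: "ensemble_input" }
      output_map { key: "preproc_output" value: "preprocessed_image" }
    },
    {
      model_name: "onnx_model"
      model_version: -1
      input_map { key: "onnx_input" value: "preprocessed_image" }
      output_map { key: "onnx_output" value: "ensemble_output" }
    }
  ]
}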
Serving Parameter Optimization
● Performance analyzer
  ○ Analyzes the model's performance for a fixed set of serving parameters.
    ■ Throughput vs. latency
    ■ Concurrency mode: N concurrent requests
    ■ (Request-rate mode: simulate a given QPS)
● Model analyzer
  ○ Analyzes the model by measuring its performance across multiple sets of parameters, using the Performance analyzer under the hood (see the sketch after this list).
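For illustration only, a Python sketch of sweeping Performance Analyzer concurrency levels by hand. It assumes perf_analyzer is on PATH, Triton's gRPC endpoint is at localhost:8001, and reuses the IMAGE:42872 shape that appears in the Model analyzer config on the next slide; none of these are prescribed by the slides.

import subprocess

# Sketch: run perf_analyzer at several concurrency levels against the
# "preprocess" model (endpoint, protocol, and shape are assumptions).
for concurrency in (1, 2, 4, 8):
    subprocess.run(
        [
            "perf_analyzer",
            "-m", "preprocess",                      # model under test
            "-u", "localhost:8001", "-i", "grpc",    # Triton gRPC endpoint
            "--shape", "IMAGE:42872",                # variable-length byte input
            "--concurrency-range", f"{concurrency}:{concurrency}",
        ],
        check=True,
    )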
Serving Parameter Optimization
Model analyzer config (YAML):

# the model repository in the docker container
model_repository: /models
# the list of models you want to profile
profile_models:
  # the preprocess model
  preprocess:
    # parameters controlling the Performance Analyzer client
    parameters:
      # the batch size from the client
      batch_sizes: [1, 2, 4, 8, 16, 32]
    model_config_parameters:
      # the maximal batch size for the server
      max_batch_size: [1, 2, 4, 8, 16, 32]
      # turn on dynamic batching
      dynamic_batching: {}
      # always turn off warmup as it occupies GPU memory
      model_warmup: []
      # the instance group parameters to sweep over
      instance_group:
        - kind: KIND_CPU
          # we evaluate 1, 2, 4, or 8 instances
          count: [1, 2, 4, 8]
    perf_analyzer_flags:
      shape:
        - IMAGE:42872
  # the configs to search for the onnx inference step
  onnx_model: ...

● 6 max_batch_size configs × 4 instance_group counts = 24 serving configurations, i.e. 24 Performance Analyses.
Serving Parameter Optimization

(Model analyzer result charts.)
Serving Monitor
● Example metrics: batching, QPS.
● Triton metrics reference: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html
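Triton exports these metrics in Prometheus text format, by default on port 8002. A hedged sketch of scraping them and estimating the average dynamic-batching batch size from nv_inference_count and nv_inference_exec_count (the parsing below is illustrative, not a hardened client):

from urllib.request import urlopen

# Sketch: scrape Triton's Prometheus metrics endpoint and estimate the
# average batch size as inference count / batch execution count.
metrics = urlopen("http://localhost:8002/metrics").read().decode()

def metric_total(name: str) -> float:
    # Sum a counter over all labeled series (model, version, GPU, ...).
    return sum(
        float(line.rsplit(" ", 1)[1])
        for line in metrics.splitlines()
        if line.startswith(name + "{")
    )

inference_count = metric_total("nv_inference_count")
exec_count = metric_total("nv_inference_exec_count")
if exec_count:
    print(f"average batch size: {inference_count / exec_count:.2f}")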
Efficient Inference with TensorRT
YOLOv5 TF SavedModel → TF-TRT conversion:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model_dir,
    input_saved_model_signature_key='serving_default',
    precision_mode=trt.TrtPrecisionMode.FP16,
    ...
)

Use the TensorFlow NGC Docker image from NVIDIA to avoid TensorFlow/TensorRT version mismatches!
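The converter call above is elided on the slide; to round it out, a hedged sketch of the remaining TF-TRT steps (the output directory name is a placeholder):

# Sketch: finish the TF-TRT conversion and save a servable SavedModel.
converter.convert()                    # replace supported subgraphs with TRT ops
# converter.build(input_fn=...)        # optional: pre-build engines for known input shapes
converter.save("trt_fp16_savedmodel")  # placeholder output directory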
Efficient Inference with TensorRT
● Validation shows that the results from the TRT FP16-optimized model are almost identical to those of the native FP32 model.
● At the same latency level (QoS), the TRT FP16 model delivers 2.9x the throughput, i.e. a potential ~66% cut in serving cost (1 − 1/2.9 ≈ 0.66).
Something We Could Do Better
Bottleneck and Hardware
● 8 CPUs, 1 T4 GPU
● YOLOv5
  ○ CPU (pre/post-processing): 100% utilization
  ○ GPU (YOLOv5 backbone): 20-30% utilization
  ○ Solutions:
    ■ More CPUs
    ■ Move some of the CPU workload to the GPU
      ● DALI/CV-CUDA (see the sketch after this list)
● ViT
  ○ CPU (preprocessing): 40% utilization
  ○ GPU (ONNX inference): 40% utilization
  ○ (We keep the desired latency (QoS) to reach the 40% load.)
● T4 vs. L4
  ○ At the same latency level, L4 offers ~50-70% more throughput than T4.
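As an illustration of moving the JPEG decode and resize onto the GPU, here is a hedged DALI pipeline sketch for the preprocessing above. It is not from the slides; the batch size, image size, and normalization constants mirror the earlier PyTorch code, and in practice this would be served through Triton's DALI backend.

from nvidia.dali import fn, pipeline_def, types

@pipeline_def(batch_size=16, num_threads=4, device_id=0)
def preprocess_pipeline():
    # Raw JPEG bytes fed in by the caller (e.g. Triton's DALI backend).
    raw = fn.external_source(device="cpu", name="IMAGE", dtype=types.UINT8)
    # "mixed" decodes JPEG partly on the GPU via nvJPEG.
    images = fn.decoders.image(raw, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=400, resize_y=400)
    # Normalize and convert HWC uint8 to CHW float on the GPU
    # (DALI expects the mean/std on the 0-255 scale here).
    return fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )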
Conclusion
● Model deployment
● Serving parameters
● Model optimization
  ○ TensorRT
Acknowledgement
Snap and NVIDIA:
Annie Huang, Cindy Wu, Chen Wang, Alexander McCauley, Derek Hao Hu, Stephen Chen
Thank you!
Q&A