
GTC 2024

Universal Model Serving via


Triton and TensorRT
Ke Ma, GenAI@Snap, Inc.

1
Take Home

Our journey to use Triton and TensorRT in our ScreenShop service

● Triton provides a single platform hosting multiple models of different formats


○ Greatly simplifies serving, maintenance, and monitoring

● Serving optimization using the Triton built-in analyzers and TensorRT


○ ~3x throughput at the same latency level compared to native serving

2
Outline

• Introduction

• TF SavedModel Deployment

• PyTorch Model Deployment - An Ensemble

• Serving Parameter Optimization

• Efficient Inference with TensorRT

3
Introduction

● In the production pipeline, the models may be trained/generated from multiple frameworks.

● Maintaining multiple serving platforms for multiple models is tedious.


○ TF Serving, TorchServe, etc.
● The Triton Inference Server provides a unified serving solution for different model formats.
○ Config driven

4
Introduction

5
Introduction

Pipeline: Fashion detection (TensorFlow YOLOV5 SavedModel) -> Fashion embedding (PyTorch ViT model) -> Similar search

6
TF SavedModel Deployment

7
TF SavedModel Deployment

Our YOLOV5 SavedModel is self-contained: it contains both the preprocessing steps and the postprocessing steps.

TF SavedModel file contents:
- Preprocessing: decoding, resizing, normalization
- YOLOV5 backbone
- Postprocessing: filtering, NMS

Input: raw JPEG image bytes
Output: N * 4 bounding boxes, used as the input to the embedding model
8
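For illustration, a minimal sketch of how such a self-contained SavedModel can be exported with a tf.function serving signature; the backbone handle, input size, and prediction layout below are placeholders, not the actual ScreenShop export code:

import tensorflow as tf

class SelfContainedDetector(tf.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone  # placeholder for the trained YOLOV5 Keras model

    @tf.function(input_signature=[tf.TensorSpec([], tf.string)])
    def serving_fn(self, raw_jpeg_bytes):
        # Preprocessing: decode, resize, normalize
        img = tf.io.decode_jpeg(raw_jpeg_bytes, channels=3)
        img = tf.image.resize(img, (640, 640)) / 255.0  # input size is illustrative
        preds = self.backbone(img[tf.newaxis, ...])
        # Postprocessing: filtering + NMS (box/score layout is model specific, shown schematically)
        boxes, scores = preds[0, :, :4], preds[0, :, 4]
        keep = tf.image.non_max_suppression(boxes, scores, max_output_size=100)
        return {"bboxes": tf.gather(boxes, keep)}

module = SelfContainedDetector(backbone)  # backbone: hypothetical trained model
tf.saved_model.save(
    module,
    "model.savedmodel",
    signatures={"serving_default": module.serving_fn},
)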
TF SavedModel Deployment

● Minimal Triton Inference Server configuration

backend: tensorflow
max_batch_size: 16
dynamic_batching: {}

● Other configs are automatically generated


● Deploy on GKE
○ Use Triton docker base image
○ Provide the SavedModel folder and the minimal serving configuration as the assets
$ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models

models/
- config.pbtxt
- 1/
  - model.savedmodel/

9
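Once the server is up, the deployment can be sanity-checked with the Triton HTTP client; the model name and tensor names below (yolov5, INPUT, OUTPUT) are placeholders for whatever the generated config declares:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Raw JPEG bytes sent as a UINT8 tensor shaped [batch, num_bytes]
raw = np.frombuffer(open("test.jpg", "rb").read(), dtype=np.uint8)
inp = httpclient.InferInput("INPUT", [1, raw.size], "UINT8")
inp.set_data_from_numpy(raw.reshape(1, -1))

result = client.infer(model_name="yolov5", inputs=[inp])
print(result.as_numpy("OUTPUT"))  # N x 4 bounding boxes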
PyTorch Model Deployment
An Ensemble

10
An Ensemble
Preprocessing (Python + PyTorch):

transforms = torchvision.transforms.Compose([
    torchvision.transforms.Resize((400, 400)),
    torchvision.transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
img = torchvision.io.decode_jpeg(
    torch.frombuffer(raw_bytes_data, dtype=torch.uint8),
    mode=torchvision.io.ImageReadMode.RGB,
    device="cpu",
)
preprocessed_img = transforms(img)

Exporting the torch.nn.Module to ONNX:

torch.onnx.export(
    model,
    dummy_input,
    "onnx_model.onnx",
    export_params=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "dynamic_axis_0"},
        "output": {0: "dynamic_axis_0"},
    },
)

11
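One way to sanity-check the exported graph is to run it with onnxruntime and compare against the eager PyTorch output (a sketch; model and dummy_input are the same objects used for the export above):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx_model.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {"input": dummy_input.numpy()})[0]
torch_out = model(dummy_input).detach().numpy()
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)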
An Ensemble
● Triton ensemble backend
○ Preprocessing model
■ Python backend
○ Inference model
■ onnxruntime backend

Data flow: ensemble_input -> preproc_input -> Preprocess (Python) -> preproc_output -> onnx_input -> Inference (onnxruntime) -> onnx_output -> ensemble_output
12
An Ensemble

Triton serving config

Ensemble:

platform: ensemble
max_batch_size: 16
input:
- name: ensemble_input
  data_type: TYPE_UINT8
  dims: [-1]
output:
- name: ensemble_output
  data_type: TYPE_FP32
  dims: [-1, -1]
ensemble_scheduling:
  step:
  - model_name: preprocess
    model_version: -1
    input_map:
      input: ensemble_input
    output_map:
      output: poutput
  - model_name: onnx_model
    model_version: -1
    input_map:
      input: poutput
    output_map:
      output: ensemble_output
  ...

Preprocessing model (python backend):

backend: python
max_batch_size: 16
input:
- name: input
  data_type: TYPE_UINT8
  dims: [-1]
output:
- name: output
  data_type: TYPE_FP32
  dims: [-1, 3, 224, 224]
instance_group:
- count: 1
  kind: KIND_CPU
dynamic_batching: {}
model_warmup: ...

Inference model (onnxruntime backend):

backend: onnxruntime
max_batch_size: 16
input:
- name: input
  data_type: TYPE_FP32
  dims: [-1, 3, 224, 224]
output:
- name: output
  data_type: TYPE_FP32
  dims: [-1, -1]
instance_group:
- count: 1
  kind: KIND_AUTO
dynamic_batching: {}
...

13
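For completeness, a minimal sketch of what the Python-backend preprocessing model's model.py could look like; the transform and tensor handling are illustrative and only follow the names in the config above, not Snap's actual implementation:

import numpy as np
import torch
import torchvision
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Resize to the ONNX input size declared in the config above
        self.transforms = torchvision.transforms.Compose([
            torchvision.transforms.Resize((224, 224)),
            torchvision.transforms.ConvertImageDtype(torch.float32),
            torchvision.transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])

    def execute(self, requests):
        responses = []
        for request in requests:
            # "input" / "output" match the tensor names in the config above
            raw = pb_utils.get_input_tensor_by_name(request, "input").as_numpy()
            imgs = []
            for jpeg_bytes in raw:  # one (possibly padded) byte buffer per batch item
                img = torchvision.io.decode_jpeg(
                    torch.from_numpy(jpeg_bytes.astype(np.uint8)),
                    mode=torchvision.io.ImageReadMode.RGB,
                )
                imgs.append(self.transforms(img))
            out = torch.stack(imgs).numpy()
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("output", out)]))
        return responses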
Serving Parameter Optimization

14
Serving Parameter Optimization

● Performance analyzer
○ Analyze the model performance given a fixed set of parameters.
■ Throughput vs latency
■ Concurrency mode: N concurrent requests
■ (Request rate mode: simulate a certain QPS)

● Model analyzer
○ Analyze the model by measuring its performance across multiple sets of parameters using the Performance Analyzer

15
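As an illustration, a single Performance Analyzer run at one fixed set of client parameters could look like the following (the model name mirrors the profiling config on the next slide; shape and concurrency values are illustrative):

perf_analyzer -m preprocess \
    -u localhost:8001 -i grpc \
    -b 4 \
    --shape IMAGE:42872 \
    --concurrency-range 1:8:2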
Serving Parameter Optimization
Model analyzer config (config.yaml):

# the model repository in the docker container
model_repository: /models
# the list of models you want to profile
profile_models:
  # the preprocess model
  preprocess:
    # parameters controlling the Performance Analyzer client
    parameters:
      # the batch size from the client
      batch_sizes: [1,2,4,8,16,32]
    model_config_parameters:
      # the maximal batch size for the server
      max_batch_size: [1,2,4,8,16,32]
      # turn on dynamic batching
      dynamic_batching: {}
      # always turn off warmup as it occupies GPU mem
      model_warmup: []
      # the instance group parameters to sweep over
      instance_group:
      - kind: KIND_CPU
        # we evaluate 1 or 2 or 4 or 8 instances
        count: [1,2,4,8]
    perf_analyzer_flags:
      shape:
      - IMAGE:42872
  # the configs to search for the onnx inference step
  onnx_model: ...

● 6 max_batch_size configs * 4 instance_group counts = 24 serving parameters
● 24 Performance Analyses
16
Serving Parameter Optimization

docker run --gpus=1 --shm-size=8GB --net host --rm \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
    model-analyzer profile \
    -f /profile_results/config.yaml \
    --export-path /profile_results

Model analyzer

17
Serving Parameter Optimization

Model analyzer

18
Serving Monitor

Example dashboards built from the Triton metrics endpoint: batching behavior and QPS.

https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html
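The metrics behind such dashboards are exposed in Prometheus text format on the Triton metrics port (8002 in the docker command earlier), so they can be scraped by Prometheus or inspected directly:

curl localhost:8002/metrics

Per-model counters such as inference request counts and queue/compute durations can then be charted to track batching behavior and QPS.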

19
Efficient Inference with TensorRT

20
Efficient Inference with TensorRT

Post training optimization

YOLOV5 TF SavedModel

from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
input_saved_model_dir=saved_model_dir,
input_saved_model_signature_key='serving_default',
precision_mode=trt.TrtPrecisionMode.FP16,
...
)
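The conversion itself then completes with the converter's convert and save calls (the output directory name is illustrative):

converter.convert()                # convert the graph; TRT engines are built lazily at runtime unless converter.build() is called
converter.save("yolov5_trt_fp16")  # write the optimized SavedModel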

Use the TensorFlow NGC docker image from Nvidia to avoid version mismatches!

21
Efficient Inference with TensorRT

22
Efficient Inference with TensorRT

● The validation shows that the results from the TRT fp16 optimized model are almost identical to those of the
native fp32 model.

● At the same latency level (QoS), the TRT fp16 model delivers 2.9x the throughput (a potential 66% cut in
the serving cost).

● Similar improvement on the PyTorch ViT model


○ 2.6x the throughput compared to the native fp32 model.

23
Something We Could Do Better

24
Bottleneck and Hardware

● 8 CPUs, 1 T4 GPU
● YOLOV5
○ CPU (pre/post processing): 100%
○ GPU (YOLOV5 backbone): 20-30%
○ Solutions:
■ More CPUs
■ Move some CPU workload to GPU
● DALI/CV-CUDA (see the sketch below)
● ViT
○ CPU (preprocessing): 40%
○ GPU (onnx inference): 40%
○ We keep the load at 40% to reach the desired latency (QoS).
● T4 vs L4
○ At the same latency level, L4 offers ~50-70% more throughput than T4.

25
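A minimal sketch of what moving the JPEG decode and resize onto the GPU with DALI could look like (a standalone pipeline; batch size, sizes, and normalization values are illustrative, and the production path would feed bytes through Triton's DALI backend rather than a file reader):

from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=16, num_threads=4, device_id=0)
def preprocess_pipeline():
    # Read encoded JPEGs from disk; with the Triton DALI backend the bytes come from the request instead
    jpegs, _ = fn.readers.file(file_root="images/")
    # "mixed" decodes with nvJPEG, so the decoded image already lives on the GPU
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=400, resize_y=400)
    # HWC uint8 -> CHW float, normalized (ImageNet stats scaled to the 0-255 range)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images

pipe = preprocess_pipeline()
pipe.build()
gpu_batch, = pipe.run()  # a GPU tensor list ready for the detection/embedding model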
Conclusion

Model deployment

A simple TF YOLOV5 SavedModel

A complex ensemble model with PyTorch preprocessing and ONNX inference

Serving parameters

Performance analyzer and model analyzer

Model optimization

TensorRT

26
Acknowledgement
Snap: Ke Ma ([email protected]), Andres Talero, Timothy Hyde, Josh Moore, Huseyin Coskun, Leo Lu, Annie Huang, Cindy Wu, Chen Wang, Alexander McCauley, Derek Hao Hu, Stephen Chen

Nvidia: Sean Kohler, Farzan Memarian, Haohang Huang, Michael Boone, Sean Pieper, Sandeep Hiremath

27
Thank you!
Q&A

28
