Universal Model Serving via Triton and TensorRT: Ke Ma, GenAI@Snap, Inc.
Take Home
Outline
• Introduction
• TF SavedModel Deployment
• PyTorch Model Deployment
• Serving Parameter Optimization
• Serving Monitor
• Efficient Inference with TensorRT
• Something We Could Do Better
• Conclusion
Introduction
● In a production pipeline, the models may be trained or generated with multiple frameworks.
Introduction
Example use cases: fashion detection, fashion embedding, similar search.
TF SavedModel Deployment
● TF SavedModel file: the whole pipeline is packaged into one graph.
  Raw jpeg image bytes → decoding → resizing → normalization → YOLOv5 backbone → filtering → NMS

● config.pbtxt:
  backend: tensorflow
  max_batch_size: 16
  dynamic_batching: {}

● Model repository layout:
  models/
  - 1/
    - model.savedmodel
  - config.pbtxt
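For reference, here is a minimal Python client sketch for such a deployment. The model name "yolov5_savedmodel" and the tensor names "INPUT"/"OUTPUT" are placeholders, not from the slides; use the names exposed by your config.pbtxt and SavedModel signature.

import numpy as np
import tritonclient.http as httpclient

# Minimal sketch: send raw JPEG bytes to the TF SavedModel model on Triton.
client = httpclient.InferenceServerClient(url="localhost:8000")

with open("test.jpg", "rb") as f:
    raw = np.frombuffer(f.read(), dtype=np.uint8)

# Shape [1, N]: one batch element holding the variable-length JPEG byte array.
infer_input = httpclient.InferInput("INPUT", [1, raw.size], "UINT8")
infer_input.set_data_from_numpy(raw.reshape(1, -1))

result = client.infer(
    model_name="yolov5_savedmodel",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT")],
)
print(result.as_numpy("OUTPUT"))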
PyTorch Model Deployment
An Ensemble
Preprocessing (Python + PyTorch):

import torch
import torchvision

transforms = torchvision.transforms.Compose([
    torchvision.transforms.Resize((400, 400)),
    # Convert uint8 [0, 255] to float [0, 1] so Normalize can be applied.
    torchvision.transforms.ConvertImageDtype(torch.float32),
    torchvision.transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

img = torchvision.io.decode_jpeg(
    torch.frombuffer(raw_bytes_data, dtype=torch.uint8),
    mode=torchvision.io.ImageReadMode.RGB,
    device="cpu",
)
preprocessed_img = transforms(img)

torch.nn.Module → ONNX export:

torch.onnx.export(
    model,
    dummy_input,
    "onnx_model.onnx",
    export_params=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "dynamic_axis_0"},
        "output": {0: "dynamic_axis_0"},
    },
)
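As a quick sanity check (not on the slide), the exported graph can be loaded back with onnxruntime and compared against the eager PyTorch output. A minimal sketch, reusing model and dummy_input from above:

import numpy as np
import onnxruntime as ort

# Sketch: verify that the ONNX export matches the PyTorch model on the dummy input.
session = ort.InferenceSession("onnx_model.onnx", providers=["CPUExecutionProvider"])
onnx_out = session.run(["output"], {"input": dummy_input.numpy()})[0]

with torch.no_grad():
    torch_out = model(dummy_input).numpy()

np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)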
An Ensemble
● Triton ensemble backend
  ○ Preprocessing model
    ■ Python backend
  ○ Inference model
    ■ onnxruntime backend
● Data flow: ensemble_input → preproc_input → [Preprocess: Python] → preproc_output → onnx_input → [Inference: onnxruntime] → onnx_output → ensemble_output
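For illustration, a minimal model.py sketch for the preprocessing step under Triton's Python backend, reusing the torchvision transforms above. The tensor names match the diagram; everything else (one image per request, no batching or error handling) is a simplifying assumption, not the production code.

import torch
import torchvision
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Sketch of the preprocessing model served by Triton's Python backend."""

    def initialize(self, args):
        self.transforms = torchvision.transforms.Compose([
            torchvision.transforms.Resize((400, 400)),
            torchvision.transforms.ConvertImageDtype(torch.float32),
            torchvision.transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])

    def execute(self, requests):
        responses = []
        for request in requests:
            # Raw JPEG bytes arrive as a UINT8 tensor named "preproc_input".
            raw = pb_utils.get_input_tensor_by_name(request, "preproc_input")
            img = torchvision.io.decode_jpeg(
                torch.frombuffer(raw.as_numpy().tobytes(), dtype=torch.uint8),
                mode=torchvision.io.ImageReadMode.RGB,
            )
            out = pb_utils.Tensor("preproc_output", self.transforms(img).numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses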
platform: ensemble
max_batch_size: 16
input:
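The ensemble config above is truncated on the slide. For orientation, here is a hedged sketch of what a complete config.pbtxt for this two-step ensemble typically looks like; the step tensor names follow the diagram above, while the data types, dims, and the intermediate tensor name are placeholder assumptions.

# Sketch only: data types, dims, and "preprocessed_image" are placeholders.
platform: "ensemble"
max_batch_size: 16
input [
  { name: "ensemble_input", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "ensemble_output", data_type: TYPE_FP32, dims: [ -1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "preproc_input" value: "ensemble_input" }
      output_map { key: "preproc_output" value: "preprocessed_image" }
    },
    {
      model_name: "onnx_model"
      model_version: -1
      input_map { key: "onnx_input" value: "preprocessed_image" }
      output_map { key: "onnx_output" value: "ensemble_output" }
    }
  ]
}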
Serving Parameter Optimization
● Performance analyzer
  ○ Analyzes the model's performance for a fixed set of serving parameters.
    ■ Throughput vs. latency
    ■ Concurrency mode: N concurrent requests
    ■ (Request-rate mode: simulate a given QPS)
● Model analyzer
  ○ Analyzes the model by measuring its performance across multiple sets of parameters, using the Performance analyzer under the hood (see the sketch after this list).
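For illustration only, a Python sketch of sweeping Performance Analyzer concurrency levels by hand. It assumes perf_analyzer is on PATH, Triton's gRPC endpoint is at localhost:8001, and reuses the IMAGE:42872 shape that appears in the Model analyzer config on the next slide; none of these are prescribed by the slides.

import subprocess

# Sketch: run perf_analyzer at several concurrency levels against the
# "preprocess" model (endpoint, protocol, and shape are assumptions).
for concurrency in (1, 2, 4, 8):
    subprocess.run(
        [
            "perf_analyzer",
            "-m", "preprocess",                      # model under test
            "-u", "localhost:8001", "-i", "grpc",    # Triton gRPC endpoint
            "--shape", "IMAGE:42872",                # variable-length byte input
            "--concurrency-range", f"{concurrency}:{concurrency}",
        ],
        check=True,
    )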
Serving Parameter Optimization
Model analyzer config (YAML):

# the model repository in the docker container
model_repository: /models
# the list of models you want to profile
profile_models:
  # the preprocess model
  preprocess:
    # parameters controlling the Performance Analyzer client
    parameters:
      # the batch size from the client
      batch_sizes: [1, 2, 4, 8, 16, 32]
    model_config_parameters:
      # the maximal batch size for the server
      max_batch_size: [1, 2, 4, 8, 16, 32]
      # turn on dynamic batching
      dynamic_batching: {}
      # always turn off warmup as it occupies GPU memory
      model_warmup: []
      # the instance group parameters to sweep over
      instance_group:
        - kind: KIND_CPU
          # we evaluate 1, 2, 4, or 8 instances
          count: [1, 2, 4, 8]
    perf_analyzer_flags:
      shape:
        - IMAGE:42872
  # the configs to search for the onnx inference step
  onnx_model: ...

● 6 max_batch_size configs × 4 instance_group counts = 24 serving configurations, i.e. 24 Performance Analyses.
Serving Parameter Optimization

(Model analyzer result charts.)
Serving Monitor
● Example metrics: batching, QPS.
● Triton metrics reference: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html
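Triton exports these metrics in Prometheus text format, by default on port 8002. A hedged sketch of scraping them and estimating the average dynamic-batching batch size from nv_inference_count and nv_inference_exec_count (the parsing below is illustrative, not a hardened client):

from urllib.request import urlopen

# Sketch: scrape Triton's Prometheus metrics endpoint and estimate the
# average batch size as inference count / batch execution count.
metrics = urlopen("http://localhost:8002/metrics").read().decode()

def metric_total(name: str) -> float:
    # Sum a counter over all labeled series (model, version, GPU, ...).
    return sum(
        float(line.rsplit(" ", 1)[1])
        for line in metrics.splitlines()
        if line.startswith(name + "{")
    )

inference_count = metric_total("nv_inference_count")
exec_count = metric_total("nv_inference_exec_count")
if exec_count:
    print(f"average batch size: {inference_count / exec_count:.2f}")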
Efficient Inference with TensorRT
YOLOv5 TF SavedModel → TF-TRT conversion:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model_dir,
    input_saved_model_signature_key='serving_default',
    precision_mode=trt.TrtPrecisionMode.FP16,
    ...
)

Use the TensorFlow NGC Docker image from NVIDIA to avoid TensorFlow/TensorRT version mismatches!
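The converter call above is elided on the slide; to round it out, a hedged sketch of the remaining TF-TRT steps (the output directory name is a placeholder):

# Sketch: finish the TF-TRT conversion and save a servable SavedModel.
converter.convert()                    # replace supported subgraphs with TRT ops
# converter.build(input_fn=...)        # optional: pre-build engines for known input shapes
converter.save("trt_fp16_savedmodel")  # placeholder output directory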
Efficient Inference with TensorRT
● Validation shows that the results from the TRT FP16-optimized model are almost identical to those of the native FP32 model.
● At the same latency level (QoS), the TRT FP16 model delivers 2.9x the throughput, i.e. a potential ~66% cut in serving cost (1 − 1/2.9 ≈ 0.66).
Something We Could Do Better
Bottleneck and Hardware
● 8 CPUs, 1 T4 GPU
● YOLOv5
  ○ CPU (pre/post-processing): 100% utilization
  ○ GPU (YOLOv5 backbone): 20-30% utilization
  ○ Solutions:
    ■ More CPUs
    ■ Move some of the CPU workload to the GPU
      ● DALI/CV-CUDA (see the sketch after this list)
● ViT
  ○ CPU (preprocessing): 40% utilization
  ○ GPU (ONNX inference): 40% utilization
  ○ (We keep the desired latency (QoS) to reach the 40% load.)
● T4 vs. L4
  ○ At the same latency level, L4 offers ~50-70% more throughput than T4.
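As an illustration of moving the JPEG decode and resize onto the GPU, here is a hedged DALI pipeline sketch for the preprocessing above. It is not from the slides; the batch size, image size, and normalization constants mirror the earlier PyTorch code, and in practice this would be served through Triton's DALI backend.

from nvidia.dali import fn, pipeline_def, types

@pipeline_def(batch_size=16, num_threads=4, device_id=0)
def preprocess_pipeline():
    # Raw JPEG bytes fed in by the caller (e.g. Triton's DALI backend).
    raw = fn.external_source(device="cpu", name="IMAGE", dtype=types.UINT8)
    # "mixed" decodes JPEG partly on the GPU via nvJPEG.
    images = fn.decoders.image(raw, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=400, resize_y=400)
    # Normalize and convert HWC uint8 to CHW float on the GPU
    # (DALI expects the mean/std on the 0-255 scale here).
    return fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )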
Conclusion
● Model deployment
● Serving parameters
● Model optimization
  ○ TensorRT
Acknowledgement
Snap and NVIDIA:
Annie Huang, Cindy Wu, Chen Wang, Alexander McCauley, Derek Hao Hu, Stephen Chen
Thank you!
Q&A