
22 February 2023

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Deep learning on AWS with NVIDIA:
From training to deployment

Michael Lang
Solutions Architect Manager – APAC South
NVIDIA

Agenda
• NVIDIA and AWS relationship
• NVIDIA AI on AWS
• ML model training (at scale)
• ML model deployment and inference
• Conclusion
• Next steps

NVIDIA AI
End-to-end open platform for production AI

Application workflows

• CLARA – Medical imaging
• RIVA – Speech AI
• TOKKIO – Customer service
• MERLIN – Recommenders
• MODULUS – Physics ML
• MAXINE – Video
• METROPOLIS – Video analytics
• CUOPT – Logistics
• NEMO – Conversational AI
• ISAAC – Robotics
• DRIVE – Autonomous vehicles
• MORPHEUS – Cybersecurity

NVIDIA AI Enterprise
• AI and data science development and deployment tools
• Cloud-native management and orchestration
• Infrastructure optimization

NVIDIA LaunchPad – Hands-on labs

Accelerated infrastructure: Cloud | Data center | Edge | Embedded

NVIDIA and AWS relationship

GPU power from the cloud to the edge

• Machine learning – ML training and cost-effective inference
• Virtual workstations – Work from anywhere
• High-performance compute – Solve large computational problems
• Internet of things – Extend AI/ML to edge devices that act locally

Powerful | Cost-Effective | Flexible

https://aws.amazon.com/nvidia/

GPU power from the cloud to the edge
• The highest-performance instance for ML training and HPC applications, powered by NVIDIA A100 GPUs
• High-performance instances for graphics-intensive applications and ML inference, powered by NVIDIA A10G GPUs
• The best price performance in Amazon EC2 for graphics workloads, powered by NVIDIA T4G GPUs

• Improve your operations with computer vision at the edge, powered by NVIDIA Jetson
• Spot defects with automated quality inspection, powered by NVIDIA Jetson
• NVIDIA GPU-optimized software, available for free on the NVIDIA NGC portal
• Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker

NVIDIA AI on AWS

NVIDIA A100
High-performing AI supercomputing GPU

• 80 GB HBM2e – for the largest datasets and models
• 2 TB/s+ memory bandwidth – to feed an extremely fast GPU
• 3rd-gen Tensor Cores
• Multi-instance GPU (MIG)
• 3rd-gen NVLink

Powering Amazon EC2 P4d/P4de instances

NVIDIA H100 – Coming soon to AWS
The new engine of the world’s AI infrastructure

• Advanced chip
• Transformer engine
• 2nd-gen MIG
• Confidential computing
• 4th-gen NVLink
• DPX instructions

Powering the next generation of GPU systems on AWS

NVIDIA H100 supercharges large language models
Hopper architecture addresses LLM needs at scale

Supercharged LLM training | High-performance prompt learning | 30x real-time inference throughput

• Time-to-train by LLM size (70B–1,000B parameters): about 1 month on 4K A100s vs. about 1 week on 4K H100s for the largest models
• 530B P-tuning time-to-train: from days (DGX A100) to hours (DGX H100), a 5x speedup
• 530B inference on 10 DGX systems: 300 concurrent users on H100 vs. 10 on A100 (30x)

LLM training | 4,096 GPUs | H100 NDR IB | A100 HDR IB | 300 billion tokens
P-tuning | DGX H100 | DGX A100 | 530B Q&A tuning using SQuAD dataset
Inference | Chatbot | 10 DGX H100 NDR IB | 10 DGX A100 HDR IB | <1 second latency | 1 inference/second/user
H100 data center projected workload performance, subject to change

NGC
Portal to AI services, software, support
NGC catalog

• Cloud services – End-to-end AI development; AI services for NLP, biology, and speech; AI workflow management and support; multiple cloud providers
• Performance optimized – Tested across GPU-accelerated platforms; monthly SW container updates; SOTA models; up to 1.9x faster training on the same stack (May '21 to May '22)
• Fully transparent – Quickly find and deploy the right SW; detailed security scan reports; model resumes
• Accelerates development – Focus on building, not setup; one-click deploy from NGC; develop once, deploy anywhere with NVIDIA VMI

ngc.nvidia.com

Amazon EC2 instances powered by NVIDIA GPUs
Accessible via AWS, AWS Marketplace, and AWS services

NVIDIA GPU | AWS instance | GA | Use case recommendations | Regions | GPU memory | GPUs | On-demand price/hour
T4G | G5g | 11/2021 | Graphics workloads such as Android game streaming, ML inference, graphics rendering, and AV simulation | 5 | 16 GB | 1, 2 | $0.42
A10G | G5 | 11/2021 | Best performance for graphics, HPC, and cost-effective ML inference | 3 | 24 GB | 1, 4, 8 | $1.00
A100 | P4d, P4de | 11/2020 | Best performance; ML training, HPC across industries | 8 | 40, 80 GB | 8 | $32.77
V100 | P3, P3dn | 10/2017 | ML training, HPC across industries | 14+ | 16, 32 GB | 1, 4, 8 | $3.06–$31.21
T4 | G4 | 9/2019 | The universal GPU: ML inference, training, remote visualization workstations, rendering, video transcoding; includes Quadro Virtual Workstation | 20+ | 16 GB | 1, 4, 8 | $0.52–$7.82

EC2 G5g is now available in US East (N. Virginia), US West (Oregon), and Asia Pacific (Tokyo, Seoul, and Singapore) Regions; On-Demand, Reserved, and Spot pricing available
EC2 G5 is now available in US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions; On-Demand, Reserved, Spot, or as part of Savings Plans
EC2 P4d is now available in US East (N. Virginia and Ohio), US West (Oregon), Europe (Ireland and Frankfurt), and Asia Pacific (Tokyo and Seoul) Regions; On-Demand, Reserved, Spot, Dedicated Hosts, or Savings Plans availability
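As a rough illustration of the on-demand prices in the table above, the sketch below compares total job cost across two scenarios. The instance picks and job durations are invented examples, not benchmarks or sizing recommendations.

```python
# Back-of-envelope on-demand cost math using per-hour prices from the table
# above. Durations are made-up illustrative numbers.
def job_cost(price_per_hour: float, hours: float) -> float:
    """Cost of a single-instance on-demand job (ignores storage/data transfer)."""
    return price_per_hour * hours

scenarios = {
    "g5 (A10G), 40 h of batch inference": job_cost(1.00, 40),
    "p4d (8x A100), 12 h of training": job_cost(32.77, 12),
}
for name, cost in scenarios.items():
    print(f"{name}: ${cost:,.2f}")
```

Real bills also depend on Region, Savings Plans, and Spot pricing, as the availability notes above indicate.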

Training computer vision and
conversational AI

Proliferation of use cases
• Healthcare – Patient monitoring, smart hospitals, robot-assisted surgery
• Industrial manufacturing – Automated optical inspection, worker safety, process automation
• Retail – Detecting people movement, analyzing action, warehouse logistics
• Smart infrastructure – Pedestrian safety, traffic management, waste management

Creating an AI application is hard and complex

INGEST DATA → PREP DATA → SELECT MODEL → TRAIN → VALIDATE → OPTIMIZE & EXPORT → DEPLOY → INTEGRATE IN APP → MONITOR

DATA PREPARATION – Labeling, annotating, and augmenting
TRAINING – Model training, pruning, and optimizing
DEPLOYMENT – Deploying and monitoring

Get started today with the TAO Toolkit: https://developer.nvidia.com/tao-toolkit-get-started
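The stage sequence above can be sketched as a simple pipeline over a shared state. Everything here (the state dict, the toy "model", the "endpoint" string) is a placeholder to show the shape of the workflow, not a TAO Toolkit API.

```python
# The workflow above as a chain of stages, each taking and returning state.
def ingest(data):
    return {"raw": data}

def prep(state):
    state["examples"] = [x.lower() for x in state["raw"]]  # toy "annotation"
    return state

def train(state):
    state["model"] = {"vocab": sorted(set(state["examples"]))}  # toy "model"
    return state

def validate(state):
    state["ok"] = len(state["model"]["vocab"]) > 0
    return state

def deploy(state):
    state["endpoint"] = "local://demo"  # placeholder deployment target
    return state

PIPELINE = [ingest, prep, train, validate, deploy]

def run(data):
    state = data
    for stage in PIPELINE:
        state = stage(state)
    return state

result = run(["Cat", "dog", "cat"])
print(result["model"]["vocab"])  # ['cat', 'dog']
```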

NVIDIA TAO Toolkit
Train, adapt, optimize
Create custom, production-ready AI models in hours rather than months

How can I run this?
• Containerized on Amazon EC2
• Bring-your-own-container on Amazon SageMaker

All available from the NGC catalog

• TRAIN EASILY – Fine-tune NVIDIA pretrained models with a fraction of the data
• CUSTOMIZE FASTER – Built on TensorFlow and PyTorch, abstracting away the AI framework complexity
• OPTIMIZE FOR DEPLOYMENT – Optimize for inference and integrate with Riva or DeepStream
• SUPPORTED BY EXPERTS – Supported by NVIDIA experts to help resolve issues from development to deployment

The NVIDIA TAO stack

High-performance pretrained vision AI models
Inference performance (FPS):

Model | Nano | Xavier NX | AGX Xavier | A30 | A100
PeopleNet | 11 | 296 | 462 | 4163 | 6001
PeopleSemSegNet | 1.4 | 17 | 28 | 330 | 519
TrafficCamNet | 18 | 340 | 656 | 4991 | 9520
FaceDetectIR | 104 | 2000 | 3915 | 26635 | 50541
LPD | 66 | 1158 | 1880 | 12207 | 21931
LPR | 94 | 564 | 1045 | 15960 | 26600
Facial Landmark | 5 | 48 | 84 | 1515 | 2686
Gaze Estimation | 125 | 747 | 1451 | 10078 | 23117
2D Pose Estimation | 98 | 923 | 1627 | 15172 | 26534

Accuracy: detection and segmentation models range from 56% to 98% (bar chart); Facial Landmark: 6.1-pixel landmark error; Gaze Estimation: 6.5 RMSE

15+ pretrained models – download for free from NGC

Pretrained conversational AI models

Jasper | QuartzNet | CitriNet | N-Gram | BERT Punctuation | BERT NER | BERT Text Classification | BERT Intent & Slot | Domain-Specific BERT & Megatron NER/QA | FastPitch | HiFi-GAN

• Support for models that are used in the conversational AI pipeline
• Adapt with your dataset using the NVIDIA TAO Toolkit
• Deploy with turnkey inference applications in NVIDIA Riva

https://developer.nvidia.com/blog/building-and-deploying-conversational-ai-models-using-nvidia-tao-toolkit/

Resources
Getting Started with the TAO Toolkit

• TAO Toolkit product page – All information related to product features and developer blogs
• TAO Toolkit getting started page – Detailed information on how to get started with the TAO Toolkit
• TAO Toolkit whitepaper – Includes examples on data augmentation and adding new classes

Developer resources
Computer vision
• TAO Toolkit computer vision models and container collection: download from NGC
• To deploy TAO Toolkit models using DeepStream, go to download resources
• Collection of Jupyter Notebooks and training specs for vision AI models
• 2D Pose Estimation Model with NVIDIA TAO Toolkit: Part 1 | Part 2
• Supercharge your AI workflow with TAO Toolkit whitepaper
• Train and deploy action recognition model

Conversational AI
• TAO Toolkit conversational AI models and container collection: download from NGC
• To deploy with Riva, go to download resources
• Building conversational AI models using the NVIDIA TAO Toolkit
• Get started with Jupyter Notebooks: Speech Recognition | Question Answering | Text Classification | Named Entity Recognition | Punctuation & Capitalization | Intent Detection & Slot Tagging

TAO TOOLKIT GETTING STARTED PAGE

Training large language models (at scale)

LLMs unlock new opportunities
LLMs transcend language and pattern matching

• TEXT GENERATION (GPT-3) – Summarization, marketing copy
• TRANSLATION (NLLB-200) – Translating Wikipedia, real-time metaverse translation
• IMAGE GENERATION (DALL-E-2) – Brand creation, gaming characters
• CODING (CODEX) – Dynamic code commenting, function generation
• LIFE SCIENCE (MegaMolBART) – Molecular representations, drug discovery

When large language models make sense

| | Traditional NLP approach | Large language models |
| Requires corpus of labeled data | Yes | No |
| Parameters | 100s of millions | Billions to trillions |
| Desired model capability | Specific (one model per task) | General (model can do many tasks) |
| Training frequency | Retrain frequently with task-specific training data | Never retrain, or retrain minimally |

• Painful and impractical to get large labeled datasets; LLMs can learn new tasks zero-shot (or few-shot)
• If you want models with "common sense" that can generalize well to new tasks
• A single model can serve all use cases
• At scale, you avoid the costs and complexity of many models, saving cost in data curation, training, and managing deployment

Training and deploying LLMs is not for the faint of heart
LLMs are challenging to build & Deploy

Unmet needs
• Large-scale data processing
• Multilingual data processing and training
• Finding optimal hyperparameters
• Convergence of models
• Scaling on clouds
• Deploying for inference
• Deployment at scale
• Evaluating models in industry-standard benchmarks
• Differing infrastructure setups
• Lack of knowledge

• Training and deploying models takes months to years
• Requires deep technical expertise
• Extensive compute resources on the scale of 1,000s of GPUs for training a 530B model over several months
• Tools to scale to 1,000s of GPUs are limited
• All leading to high financial investments, on the order of tens of millions of dollars for 175B+ models

NeMo Megatron
End-to-end framework for training and deploying large-scale language models with trillions of parameters

Model availability

NVIDIA-verified training recipes
• GPT-3: 126M, 5B, 20B, 40B, 175B
• T5: 220M, 3B, 11B, 23B, 41B
• mT5: 170M, 390M, 3B, 11B, 23B

NVIDIA publicly available model checkpoints
• T5: 3B
• GPT-3: 5B, 20B

Training and inference support for popular community pretrained models (coming in Q4 2022)

• Rapidly create and tune state-of-the-art custom language models
• Linear scaling to 1,000s of GPUs for up to trillion-parameter language models
• 30% speed-up in training using new sequence parallelism and selective activation recomputation techniques
• Distributed inference using Triton Inference Server
• Prompt learning capabilities with p-tuning and prompt tuning

Now in open beta – find out more: NVIDIA NeMo Megatron
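Why trillion-parameter models need scaling across thousands of GPUs follows from simple arithmetic: the weights alone exceed any single GPU's memory. The back-of-envelope sketch below counts FP16 weight storage only; real training needs several times more for optimizer state, gradients, and activations.

```python
# Minimum GPUs needed just to hold a model's weights (no optimizer state,
# gradients, or activations -- real training footprints are several times larger).
import math

def min_gpus_for_weights(n_params: float, bytes_per_param: int, gpu_mem_gb: int) -> int:
    total_gb = n_params * bytes_per_param / 1e9
    return math.ceil(total_gb / gpu_mem_gb)

# 1T parameters in FP16 (2 bytes each) on 80 GB A100s:
print(min_gpus_for_weights(1e12, 2, 80))  # 25
```

Even before activations and optimizer state, a 1T-parameter model's weights span dozens of 80 GB GPUs, which is what forces the tensor/pipeline parallelism NeMo Megatron provides.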

Solving pain points across the stack
NeMo Megatron simplifies the path to an LLM
Unmet needs → How we are helping

• Large-scale data processing → Data curation and preprocessing tools
• Multilingual data processing and training → Relative positional embedding (RPE) – multilingual support
• Finding optimal hyperparameters → Hyperparameter tool
• Convergence of models → Verified recipes for large GPT and T5-style models
• Scaling on clouds → Scripts/configs to run on AWS
• Deploying for inference → Model Navigator + export-to-FT functionality
• Deployment at scale → Quantization to accelerate inferencing
• Evaluating models in industry-standard benchmarks → Production evaluation harness
• Differing infrastructure setups → Full-stack support with FP8 and Hopper support
• Lack of knowledge → Documentation

NeMo Megatron
Value Proposition
• End-to-end – Bring your own data; train and deploy LLMs
• Performance at scale – SOTA training techniques
• Easy to use – Containerized framework and tools
• Fastest time to solution – SOTA performance
• Customization – Source-open approach
• Availability – Train on your choice of infrastructure
• Battle-hardened – Enterprise-grade framework with verified recipes that work OOTB (training and inference containers)

• NeMo Megatron is an end-to-end application framework for training and deploying LLMs with billions to trillions of parameters
• Turnkey containerized framework with recipes for training and deploying GPT-3 (up to 1T parameters), T5, and mT5 (up to 50B parameters) style models

Resources
GETTING STARTED
Register here for open beta
NVIDIA NeMo Megatron
NVIDIA brings large language AI models to enterprises worldwide | NVIDIA newsroom

DEV BLOGS
Adapting P-Tuning to solve non-English downstream tasks
NVIDIA AI platform delivers big gains for large language models
Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, the world’s largest and most powerful generative language
model | NVIDIA developer blog

CUSTOMER STORIES
The King’s Swedish: AI rewrites the book in Scandinavia

Deployment and inference

AI inference workflow
Two-part process implemented by multiple personas

Personas: data scientist / ML engineer → MLOps, DevOps → app developer

Trained models → model optimization → model repo → inference serving → query/result → AI application

• Model optimization – Optimize trained models against multiple constraints for high-performance inference on GPU/CPU
• Inference serving – Scaled, multi-framework inference serving for high performance and utilization

Inference is complex
REAL TIME | COMPETING CONSTRAINTS | RAPID UPDATES

Large trained models → TensorRT (inference optimization) → Triton (inference serving) → low-latency inference for every framework

• FRAMEWORKS – Model architectures across frameworks
• CONSTRAINTS – Accuracy, memory, response time, throughput
• HARDWARE – Data center, Jetson, DRIVE

World-leading inference performance
TensorRT accelerates every workload

Best-in-class response time and throughput vs. CPUs:
• Computer vision – 36x, <7 ms
• Speech recognition – 583x, <100 ms
• NLP – 21x, <50 ms
• Reinforcement learning – 10x
• Text-to-speech – 178x, <100 ms
• Recommenders – 12x, <1 sec

NVIDIA TensorRT
SDK for High-Performance Deep Learning Inference​

Optimize and deploy neural networks in production

Maximize throughput for latency-critical applications with the TensorRT compiler and runtime; optimize every network, including CNNs, RNNs, and transformers:

1. Reduced/mixed precision: FP32, TF32, FP16, and INT8
2. Layer and tensor fusion: optimizes use of GPU memory bandwidth
3. Kernel auto-tuning: selects the best algorithm on the target GPU
4. Dynamic tensor memory: deploys memory-efficient applications
5. Multi-stream execution: scalable design to process multiple streams
6. Time fusion: optimizes RNNs over time steps

Trained DNN → TensorRT Optimizer → TensorRT Runtime
Targets: embedded (Jetson) | automotive (DRIVE) | data center GPUs
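Layer and tensor fusion (item 2 above) can be illustrated with toy arithmetic: folding a per-output scale (such as an inference-time batch norm) into the preceding linear layer, so one fused kernel does the work of two. This is conceptual 1-D math to show why fusion preserves results, not the TensorRT API.

```python
# Toy illustration of layer fusion: a linear layer followed by a per-output
# scale collapses into a single linear layer with rescaled weights and bias.
def linear(w, b, x):
    """y_i = sum_j w[i][j] * x[j] + b[i]"""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def fuse_scale(w, b, scale):
    """Return (fw, fb) such that linear(fw, fb, x) == scale * linear(w, b, x)."""
    fw = [[s * wij for wij in row] for row, s in zip(w, scale)]
    fb = [s * bi for bi, s in zip(b, scale)]
    return fw, fb

w, b = [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5]
s, x = [2.0, 10.0], [1.0, 1.0]
fw, fb = fuse_scale(w, b, s)
unfused = [si * yi for si, yi in zip(s, linear(w, b, x))]
print(linear(fw, fb, x) == unfused)  # True
```

The fused form reads and writes memory once instead of twice, which is the bandwidth saving the optimization targets.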

Download TensorRT today
TensorRT | Torch-TensorRT | TensorFlow-TensorRT

TensorRT 8.4 GA is available for free to members of the NVIDIA Developer Program: developer.nvidia.com/tensorrt

NVIDIA Triton Inference Server
Open-source software for fast, scalable, simplified inference serving

• Any framework – Natively supports multiple framework backends, e.g., TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, Python, and more
• Any query type – Optimized for real-time, batch, streaming, and ensemble inferencing
• Any platform – x86 CPU | Arm CPU | NVIDIA GPUs | MIG; Linux | Windows | virtualization; public cloud, data center, and edge/embedded (Jetson); available across all major cloud AI platforms
• DevOps & MLOps – Integration with Kubernetes, KServe, Prometheus, and Grafana
• Performance & utilization – Model Analyzer for optimal configuration; optimized for high GPU/CPU utilization, high throughput, and low latency
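Triton serves models from an on-disk model repository: one directory per model, holding a config.pbtxt and numbered version subdirectories. A minimal sketch of that layout follows; the config fields shown are an illustrative subset, and the model/backend names are placeholders.

```python
# Build a Triton-style model repository layout on disk:
#   <repo>/<model-name>/config.pbtxt
#   <repo>/<model-name>/<version>/   (holds the model file, omitted here)
import tempfile
from pathlib import Path

def make_repo(root: Path, name: str, version: int = 1) -> Path:
    model_dir = root / name
    (model_dir / str(version)).mkdir(parents=True)  # version subdirectory
    (model_dir / "config.pbtxt").write_text(
        f'name: "{name}"\n'
        'platform: "onnxruntime_onnx"\n'
        "max_batch_size: 8\n"
    )
    return model_dir

root = Path(tempfile.mkdtemp())
repo = make_repo(root, "spellcheck")
print(sorted(p.name for p in repo.iterdir()))  # ['1', 'config.pbtxt']
```

Pointing the server at the repository root (`tritonserver --model-repository=<repo>`) is then enough for it to discover and load the models.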

Triton's architecture
Delivering high performance across frameworks

• Multiple client applications send queries via the standard Python/C++ client library over HTTP/gRPC, via an in-process API (directly integrated into the client app through the C or Java API), or via a custom client
• Per-model scheduler queues with dynamic batching (real time, batch, stream)
• Many active models; flexible model loading (all, selective) from the model repository
• Multiple GPU and CPU backends
• Metrics (utilization, throughput, latency) exported to Kubernetes/Prometheus
• Model analyzer and model orchestration

Runs on GPU and CPU

• Concurrent model execution – Increase throughput and utilization
• Dynamic batching scheduler – Group requests to form larger batches and increase GPU utilization
• Optimal model configuration – Using the Model Analyzer capability
• Large language model inference – Using Triton's FasterTransformer backend
• Model pipelines with business logic scripting – Control flow and loops in model ensembles
• Decoupled models – Allows 0, 1, or more responses per request

Triton Inference Server | NVIDIA Developer
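The dynamic batching idea can be sketched in a few lines: queued requests are grouped up to a size cap, with any remainder forming a final partial batch. Real Triton batching also honors a configurable queue delay and preferred batch sizes; this toy version only caps batch size.

```python
# Toy dynamic batching: drain a request queue into batches of at most
# max_batch_size, so the GPU sees fewer, larger invocations.
from collections import deque

def form_batches(queue: deque, max_batch_size: int):
    batches = []
    while queue:
        take = min(max_batch_size, len(queue))
        batches.append([queue.popleft() for _ in range(take)])
    return batches

requests = deque(f"req{i}" for i in range(7))
print([len(b) for b in form_batches(requests, max_batch_size=4)])  # [4, 3]
```

Batching seven single requests into two GPU calls instead of seven is where the throughput and utilization gains come from.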


Real-time spell check for product search
Amazon Search

• One of the most visited ecommerce websites
• Deep learning (DL) AI model for automatic spell correction to search effortlessly
• Triton + TensorRT meet the sub-50 ms latency target and deliver 5x throughput for the DL model on GPUs on AWS
• Triton Model Analyzer reduced the time to find the optimal configuration from weeks to hours

Workflow: optimize model with TensorRT → choose best config with Triton Model Analyzer → deploy with Triton Inference Server

https://aws.amazon.com/blogs/machine-learning/how-amazon-search-achieves-low-latency-high-throughput-t5-inference-with-nvidia-triton-on-aws/

Learn more and download
For more information
https://developer.nvidia.com/nvidia-triton-inference-server

Get the ready-to-deploy container with monthly updates from the NGC catalog
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver

Open-source GitHub repository


https://github.com/NVIDIA/triton-inference-server

Latest release information


https://github.com/triton-inference-server/server/releases

Quick start guide


https://github.com/triton-inference-server/server/blob/main/docs/getting_started/quickstart.md

Triton Inference Server on Amazon SageMaker

A Triton Inference Server container developed with NVIDIA – includes NVIDIA Triton Inference Server along with useful environment variables to tune performance (e.g., set thread count) on SageMaker

Use with SageMaker Python SDK to deploy your models on scalable, cost-effective
SageMaker endpoints without worrying about Docker

Code examples to find readily usable code samples using Triton Inference Server
with popular machine learning frameworks on Amazon SageMaker
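A sketch of the pieces such a deployment assembles: the Triton container image, a model artifact on S3, and the tuning environment variables mentioned above. The dict shape loosely follows boto3's create_model PrimaryContainer, the URIs are placeholders, and the environment variable names follow AWS's Triton-on-SageMaker examples (verify against current docs before use).

```python
# Assemble the container definition a SageMaker Triton deployment would use.
# No AWS calls are made here -- this only builds the spec dict.
def triton_model_spec(image_uri: str, model_data_url: str, default_model: str,
                      thread_count: int = 8) -> dict:
    return {
        "Image": image_uri,                # Triton serving container in ECR
        "ModelDataUrl": model_data_url,    # model repository tarball on S3
        "Environment": {
            # Env var names per AWS's Triton-on-SageMaker examples:
            "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model,
            "SAGEMAKER_TRITON_THREAD_COUNT": str(thread_count),
        },
    }

spec = triton_model_spec(
    "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>",
    "s3://<bucket>/models/model.tar.gz",
    "spellcheck",
)
print(spec["Environment"]["SAGEMAKER_TRITON_DEFAULT_MODEL_NAME"])  # spellcheck
```

The SageMaker Python SDK wraps this same information, so you never hand-build the dict in practice; the sketch just shows what travels to the endpoint.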

Amazon SageMaker & Triton technical resources
Triton on Amazon SageMaker
Achieve hyperscale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker
Amazon announces new NVIDIA Triton Inference Server on Amazon SageMaker
Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker
Use Triton Inference Server with Amazon SageMaker
How Amazon Search achieves low-latency, high-throughput T5 inference with NVIDIA Triton on AWS
Getting the most out of NVIDIA T4 on AWS G4 Instances
Deploying the NVIDIA Triton Inference Server on Amazon ECS

AWS AI/ML Heroes collaboration


NVIDIA Triton spam detection engine of C-suite labs
Blurry faces: Training, optimizing and deploying a segmentation model on Amazon SageMaker with NVIDIA TensorRT
and NVIDIA Triton

Sign up for the free NVIDIA and AWS ML course
In this course, you will gain hands-on experience building, training, and deploying scalable machine learning models with Amazon SageMaker and Amazon EC2 instances powered by NVIDIA GPUs

Hands-on Machine Learning with AWS/NVIDIA | Coursera
https://www.coursera.org/learn/machine-learning-aws-nvidia

Free e-book: Dive into deep learning


https://d2l.ai

Recap and next steps

Recap and key takeaways
What did we learn today?

• NVIDIA GPUs power the most compute-intensive workloads, from computer vision to speech to language and many more
• NVIDIA TAO is a toolkit for training CV and speech models efficiently
• NVIDIA NeMo Megatron is an open-source toolkit for large language model training and deployment
• NVIDIA TensorRT is an SDK for optimizing deep learning models
• NVIDIA Triton is an inference server for deploying your models

Join the NVIDIA Inception program for startups
Accelerate your startup’s growth and build your solutions faster with engineering guidance, free
technical training, preferred pricing on NVIDIA products, opportunities for customer introductions
and co-marketing, and exposure to the VC community

APPLY TO INCEPTION TODAY


https://www.nvidia.com/en-us/startups

GET THE LATEST NEWS, UPDATES, AND MORE


https://www.nvidia.com/en-us/preferences/email-signup/

Thank you!
Michael Lang
[email protected]

