Module 3
Henry Axelrod
Principal WW Data & AI PSA
AWS Inferentia
AWS Inferentia2
Inf2, powered by Inferentia2
[Diagram: Inferentia2 device architecture — the chip attaches to the host over PCIe and pairs DMA engines and collective-communication hardware with HBM and two NeuronCore-v2 cores; each NeuronCore-v2 combines on-chip SRAM memory with a Tensor Engine and a GPSIMD Engine, and chips interconnect over NeuronLink-v2.]
NeuronCore-v2: 2.3 PFLOPS BF16/FP16, 4.6 petaOPS INT8
NeuronLink-v2
Supports PyTorch & TensorFlow
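Since the Neuron stack supports PyTorch, here is a minimal sketch of compiling a model for Inferentia2 with the torch-neuronx tracer (the BERT model choice, sequence length, and output path are illustrative assumptions, not from the slide):

```python
import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A small Hugging Face model as an illustrative workload (placeholder choice).
# torchscript=True makes the model return plain tuples, which tracing needs.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True
)
model.eval()

# Example inputs fix the static shapes the Neuron compiler specializes for.
enc = tokenizer("Compile me for Inferentia2", return_tensors="pt",
                padding="max_length", max_length=128)
example = (enc["input_ids"], enc["attention_mask"])

# Ahead-of-time compilation targeting NeuronCore-v2.
neuron_model = torch_neuronx.trace(model, example)

# The compiled artifact can be saved and reloaded on an Inf2 instance.
torch.jit.save(neuron_model, "bert_neuron.pt")
```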
Amazon EC2 Inf2 instances
IN PREVIEW: THE MOST COST-EFFICIENT DL INFERENCE INSTANCE
[Table: Inf2 instance sizes — columns: Instance size | vCPUs | Inferentia2 chips | Accelerator memory | NeuronLink | Instance memory | Instance networking]
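These specs can also be pulled live from the EC2 DescribeInstanceTypes API; a minimal boto3 sketch (the Region is an illustrative choice, and the InferenceAcceleratorInfo response layout is an assumption worth verifying against your SDK version):

```python
import boto3

# List Inf2 instance sizes with vCPU, memory, and accelerator counts.
# Region is an illustrative choice; Inf2 availability varies by Region.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instance_types(
    Filters=[{"Name": "instance-type", "Values": ["inf2.*"]}]
)
for it in sorted(resp["InstanceTypes"],
                 key=lambda t: t["VCpuInfo"]["DefaultVCpus"]):
    # InferenceAcceleratorInfo describes the Inferentia2 devices (assumed layout).
    accel = it.get("InferenceAcceleratorInfo", {}).get("Accelerators", [{}])[0]
    print(f'{it["InstanceType"]}: '
          f'{it["VCpuInfo"]["DefaultVCpus"]} vCPUs, '
          f'{it["MemoryInfo"]["SizeInMiB"] // 1024} GiB memory, '
          f'{accel.get("Count", "?")} x {accel.get("Name", "Inferentia2")}')
```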
Deploy Llama 2 with up to 4.7x lower cost with AWS Inferentia2
[Chart: Average Higher Throughput (Tokens/Sec) — Llama 2 7B: Amazon Inf2 395.1 vs. comparable instance* 131.3 (3x); Llama 2 13B: Amazon Inf2 331.9 vs. comparable instance* 89.5 (3.7x); Llama 2 70B: Amazon Inf2 139.7 vs. comparable instance* OOM]
Achieve up to 3.7x higher inference throughput on Llama 2 models
[Chart: Average Lower Per Token Latency (ms) — Amazon Inf2 vs. comparable instance* across Llama 2 7B, 13B, and 70B (e.g., 33.7 ms vs. 66.2 ms); the comparable instance runs out of memory (OOM) on 70B]
Up to 4.7x lower cost with Inf2 for Llama 2 models
* Comparable inference-optimized Amazon EC2 instance
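To make the Llama 2 deployment concrete, here is a minimal sketch of serving it on Inf2 with the transformers-neuronx library (the checkpoint path, tp_degree, and sampling parameters are illustrative assumptions; tp_degree should match the NeuronCores on the target instance size, e.g. 2 on inf2.xlarge):

```python
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

# Shard Llama 2 7B across NeuronCores with tensor parallelism.
# The checkpoint path is a placeholder for a locally saved model.
neuron_model = LlamaForSampling.from_pretrained(
    "./llama-2-7b", batch_size=1, tp_degree=2, amp="f16"
)
neuron_model.to_neuron()  # compile and load weights onto the accelerators

tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b")
input_ids = tokenizer("Inferentia2 delivers", return_tensors="pt").input_ids

with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256, top_k=50)
print(tokenizer.decode(generated[0]))
```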
Additional model results versus a comparable inference-optimized Amazon EC2 instance:

Model        | Higher Throughput | Lower Latency | Lower Cost
BERT-base    | 2.6x              | 6.7x          | 3.4x
RoBERTa-base | 3x                | 8.1x          | 4x

Up to 3x higher throughput
Up to 8.1x lower latency
Up to 4x lower cost
Speculative Decoding
MAINTAIN LARGER MODEL ACCURACY AND BENEFIT FROM SMALLER MODEL SPEED
[Chart: Latency (ms) — Time to First Token (TTFT): Traditional Sampling with Llama 3 70B vs. Speculative Decoding with Llama 3 8B/70B, 4x lower latency; Per Token Latency (PTL): Traditional Sampling with Llama 3 70B vs. Speculative Decoding with Llama 3 8B/70B, 4x lower latency]
4x lower Time to First Token and Per Token Latency
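For intuition, here is a framework-agnostic sketch of the speculative decoding loop (illustrative only, not the Neuron implementation; it uses simple greedy verification rather than the full acceptance-rejection sampling rule, and skips KV caching): a small draft model proposes k tokens cheaply, and the large target model verifies the whole proposal in a single forward pass.

```python
import torch

def speculative_decode(target, draft, input_ids, k=4, max_new_tokens=64):
    """Greedy speculative decoding sketch.

    `target` and `draft` are assumed to be Hugging Face-style causal LMs
    returning an object with a `.logits` tensor of shape (batch, seq, vocab).
    """
    ids = input_ids
    prompt_len = input_ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1. The small draft model proposes k tokens autoregressively (cheap).
        draft_ids = ids
        for _ in range(k):
            next_tok = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=1)
        proposed = draft_ids[:, ids.shape[1]:]                       # (1, k)

        # 2. The large target model scores prefix + proposal in ONE pass.
        tgt_logits = target(draft_ids).logits
        # Logits at position i predict token i+1, so these are the target's
        # greedy choices at each of the k proposed positions.
        tgt_pred = tgt_logits[:, ids.shape[1] - 1:-1, :].argmax(-1)  # (1, k)

        # 3. Accept the longest prefix where draft and target agree, then
        #    take the target's own token at the first disagreement.
        agree = (proposed == tgt_pred)[0]
        n_ok = int(agree.cumprod(0).sum())
        ids = torch.cat([ids, proposed[:, :n_ok],
                         tgt_pred[:, n_ok:n_ok + 1]], dim=1)
    return ids
```

Because the target model checks k draft tokens per forward pass instead of producing one token per pass, accepted runs of draft tokens cut per-token latency without changing which tokens the target would have produced under greedy decoding.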
AWS Inferentia2 delivers lower cost for SD models
Up to 80% higher throughput per dollar to deploy diffusion models with Amazon Inf2
[Chart: Throughput / $ — Amazon Inf2 running SDXL 1.1: 887.2 vs. comparable Amazon EC2 instances: 491.5]
SDXL 1.1, 1024x1024, 30 steps, AWS 3 Yr RI Instance Pricing
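A minimal sketch of compiling and running SDXL on Inf2 with Hugging Face's optimum-neuron pipeline, assuming the NeuronStableDiffusionXLPipeline API (the prompt and output paths are placeholders; input shapes are fixed at compile time to match the 1024x1024 benchmark configuration above):

```python
from optimum.neuron import NeuronStableDiffusionXLPipeline

# export=True triggers Neuron compilation with static shapes.
pipe = NeuronStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    export=True,
    batch_size=1,
    height=1024,
    width=1024,
)
pipe.save_pretrained("./sdxl_neuron")  # reuse the compiled artifacts later

# 30-step generation, matching the slide's benchmark configuration.
image = pipe("a photo of an astronaut riding a horse",
             num_inference_steps=30).images[0]
image.save("astronaut.png")
```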
AWS Inferentia2 Performance: Vision models
UP TO 5X LOWER INFERENCE COST FOR DIFFUSION MODELS
Model                | Image Size | Time (sec) | Cost per 1000 images
Stable Diffusion 1.5 | 512        | 2.4        | $0.51
Stable Diffusion 1.5 | 768        | 7.9        | $1.60
Amazon EC2 Inf1, Inf2, and Trn1 instances are available in over 23 Regions
AVAILABLE WORLDWIDE
Thank you!