
Intro to Inferentia

Accelerate Inference on AWS

Henry Axelrod
Principal WW Data & AI PSA

AWS Inferentia
HIGH PERFORMANCE AND LOWEST COST INFERENCE IN THE CLOUD

- Up to 25% higher throughput vs. comparable GPU-based Amazon EC2 instances
- Up to 70% lower cost per inference than comparable GPU-based Amazon EC2 instances
- Support for popular ML frameworks, including PyTorch and TensorFlow

AWS Inferentia2

Inf2 instances are powered by Inferentia2.

[Diagram: Inferentia2 chip — host PCIe interface, DMA engines, collective
communication, and HBM feed two NeuronCore-v2 cores; each core pairs on-chip
SRAM memory with Tensor, Vector, Scalar, and GPSIMD engines, and the cores
are connected by NeuronLink-v2.]

Key figures:
- 2.3 PFLOPS BF16/FP16
- 4.6 petaOPS INT8
- 384 GB aggregate accelerator memory (HBM)
- 100 Gbps network connectivity
- Supports PyTorch & TensorFlow
Amazon EC2 Inf2 instances
IN PREVIEW: THE MOST COST-EFFICIENT DL INFERENCE INSTANCE

- Up to 2.5x higher throughput (vs. Inf1)
- 15x more memory bandwidth
- Deploy 175B-parameter models in a single server

Instance size | vCPUs | Inferentia2 chips | Accelerator memory | NeuronLink | Instance memory | Instance networking
Inf2.xlarge   | 4     | 1                 | 32 GB              | N/A        | 16 GB           | Up to 15 Gbps
Inf2.8xlarge  | 32    | 1                 | 32 GB              | N/A        | 128 GB          | Up to 25 Gbps
Inf2.24xlarge | 96    | 6                 | 192 GB             | N/A        | 384 GB          | 50 Gbps
Inf2.48xlarge | 192   | 12                | 384 GB             | 192 GB/s   | 768 GB          | 100 Gbps
Deploy Llama 2 with up to 4.7x lower cost with AWS Inferentia2

Achieve up to 3.7x higher inference throughput on Llama 2 models, and up to
4.7x lower cost to deploy, with Amazon Inf2.

Average throughput (tokens/sec, higher is better):
Model       | Amazon Inf2 | Comparable instance* | Advantage
Llama 2 7B  | 395.1       | 131.3                | 3x
Llama 2 13B | 331.9       | 89.5                 | 3.7x
Llama 2 70B | 139.7       | OOM                  | -

Average per-token latency (lower is better):
Model       | Amazon Inf2 | Comparable instance* | Advantage
Llama 2 7B  | 10.2        | 33.7                 | 69% lower
Llama 2 13B | 15.2        | 66.2                 | 77% lower
Llama 2 70B | 28.6        | OOM                  | -

*Comparable inference-optimized Amazon EC2 instances. Comparing average
performance of Llama 2 7B, 13B, and 70B across batch sizes 1-8 and 512-2K
context lengths, using AWS 3-year RI instance pricing.
AWS Inferentia2 Performance: BERT/RoBERTa models

Inf2 vs. a comparable inference-optimized Amazon EC2 instance:
Model        | Higher throughput | Lower latency | Lower cost
BERT-base    | 2.6x              | 6.7x          | 3.4x
BERT-large   | 2.2x              | 5.0x          | 2.9x
RoBERTa-base | 3x                | 8.1x          | 4x

[Chart: RoBERTa-base benchmarks, latency vs. throughput — Inf2 (BF16) vs.
comparable EC2 instance (mixed precision), sequence length 128]

Up to 3x higher throughput, up to 8.1x lower latency, and up to 4x lower cost.
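The PyTorch path behind these numbers is the AWS Neuron SDK's torch-neuronx
tracing API. As a minimal sketch (not the exact benchmark setup; the model id,
task head, and shapes here are illustrative assumptions), compiling a
BERT-base model at sequence length 128 looks roughly like this:

```python
# Minimal sketch: compile BERT-base for Inferentia2 with torch-neuronx.
# Assumes an Inf2 instance with the AWS Neuron SDK, torch-neuronx, and
# transformers installed; model id and sequence length are illustrative.
import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# torchscript=True makes the model return plain tuples, which tracing needs
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True)
model.eval()

# Fixed-shape example input (seqlen 128, as in the benchmarks above)
enc = tokenizer("Inferentia2 accelerates inference.", padding="max_length",
                max_length=128, truncation=True, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# Compile for NeuronCores; the artifact can be saved and reloaded for serving
neuron_model = torch_neuronx.trace(model, example)
torch.jit.save(neuron_model, "bert_neuron.pt")

with torch.no_grad():
    logits = neuron_model(*example)  # runs on the Inferentia2 accelerator
```

Neuron compiles for static shapes, which is why the benchmark pins the
sequence length to 128; padding serving-time inputs to the compiled shape
keeps the artifact reusable.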
Speculative Decoding
MAINTAIN LARGER MODEL ACCURACY AND BENEFIT FROM SMALLER MODEL SPEED

[Chart: latency (ms) — traditional sampling with Llama 3 70B vs. speculative
decoding with Llama 3 8B (draft) / 70B (target), for both time to first token
(TTFT) and per-token latency (PTL); speculative decoding is 4x lower on both]

With speculative decoding on Inferentia, deliver up to 4x lower time to first
token and per-token latency.
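The mechanics behind those numbers: a small draft model (here, Llama 3 8B)
proposes a short run of tokens, and the large target model (70B) verifies the
whole run at once, accepting the longest correct prefix plus one token from
its own pass. A minimal greedy sketch of that control flow in plain Python —
the `draft_next`/`target_next` callables are illustrative stand-ins, not the
Neuron API:

```python
# Toy greedy speculative decoding: draft proposes k tokens, target verifies.
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       k: int = 4, max_new: int = 32) -> List[int]:
    tokens = list(prompt)
    target_len = len(prompt) + max_new
    while len(tokens) < target_len:
        # 1) The draft model cheaply proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target model verifies the proposal. On real hardware all k
        #    positions are scored in ONE forward pass; that batching is the
        #    source of the latency win. This toy calls it per position.
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)   # the target's token is always kept
            if tok != expected:         # first mismatch: stop accepting drafts
                break
        else:
            # All k draft tokens matched; the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]

# Demo with toy "models" that follow a fixed pattern, so drafts always match.
pattern = [1, 2, 3, 4, 5, 6, 7, 8]
next_tok = lambda seq: pattern[len(seq) % len(pattern)]
print(speculative_decode([1, 2], next_tok, next_tok, k=4, max_new=8))
```

The win comes from step 2: verifying k positions costs roughly one target
forward pass, so when the draft's acceptance rate is high, per-token latency
approaches the draft model's.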
AWS Inferentia2 delivers lower cost for SD models

Up to 80% higher throughput per dollar to deploy diffusion models with Amazon
Inf2.

SDXL 1.1 throughput per dollar:
- Amazon Inf2: 887.2
- Comparable Amazon EC2 instances: 491.5

SDXL 1.1, 1024x1024, 30 steps, AWS 3-year RI instance pricing.
AWS Inferentia2 Performance: Vision models
UP TO 5X LOWER INFERENCE COST FOR DIFFUSION MODELS

Model                         | Image size | Time (sec) | Cost per 1000 images
Stable Diffusion 1.5          | 512        | 2.4        | $0.51
Stable Diffusion 1.5          | 768        | 7.9        | $1.60
Stable Diffusion 2.1          | 512        | 1.9        | $0.41
Stable Diffusion 2.1          | 768        | 6.1        | $1.40
Stable Diffusion XL           | 1024       | 14.9       | $3.41
Stable Diffusion XL + Refiner | 1024       | 13.0       | $2.96

Notes:
- SD 2.1, SD 1.5, and SD XL models from Hugging Face
- Batch = 1, Iterations = 50
- Neuron results: FP32/autocast and BF16
- Using on-demand instance pricing
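One way to reproduce this kind of deployment is Hugging Face's optimum-neuron
wrapper around the Neuron SDK. A minimal sketch, assuming the optimum-neuron
NeuronStableDiffusionPipeline interface — the model id, shapes, and step count
are illustrative; treat the exact arguments as assumptions against your
installed version:

```python
# Minimal sketch: compile and run Stable Diffusion on Inf2 via optimum-neuron.
# Assumes an Inf2 instance with the Neuron SDK and optimum-neuron installed;
# model id, shapes, and step count below are illustrative.
from optimum.neuron import NeuronStableDiffusionPipeline

# Shapes are fixed at compile time on Neuron, so they are passed at export.
pipe = NeuronStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    export=True,                           # compile submodels for NeuronCores
    batch_size=1, height=512, width=512,
)
pipe.save_pretrained("sd15_neuron/")       # reuse compiled artifacts later

# 50 steps matches the "Iterations = 50" note in the benchmark table above.
image = pipe("a scenic photo of the Swiss Alps",
             num_inference_steps=50).images[0]
image.save("alps.png")
```

Compilation happens once at export; the saved artifacts reload for serving
without recompiling.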
Amazon EC2 Inf1/2 and Trn1 are available in over 23 Regions
AVAILABLE WORLDWIDE

Americas
- Inf1: N. Virginia, Ohio, Oregon, Sao Paulo
- Trn1/n: N. Virginia, Oregon
- Inf2: N. Virginia, Ohio, Oregon

EMEA
- Inf1: Dublin, Frankfurt, Stockholm, London, Milan, Paris, Cape Town, Bahrain
- Inf2: Stockholm*

APAC
- Inf1: Tokyo, Beijing, Singapore, Mumbai, Sydney, Hong Kong, Seoul
- Trn1/n: Mumbai*, Melbourne*
- Inf2: Tokyo*

*Coming in 2024
AWS Neuron Getting Started
https://bityl.co/IcJO
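Before working through the getting-started guide, a quick way to confirm a
Neuron environment is healthy is to run a trivial tensor op on a NeuronCore.
A minimal sketch, assuming torch-neuronx is installed (PyTorch support on
Neuron is built on torch-xla, so the device is exposed as an XLA device):

```python
# Minimal "hello NeuronCore" sanity check on a fresh Inf2/Trn1 instance.
import torch
import torch_neuronx                 # registers the Neuron XLA backend
import torch_xla.core.xla_model as xm

device = xm.xla_device()             # resolves to a NeuronCore on Inf2/Trn1
x = torch.rand(2, 3, device=device)
y = torch.rand(2, 3, device=device)
z = x + y                            # staged lazily, compiled, run on device
print(z.cpu())                       # forces execution, copies result back
```

If no device is found, the neuron-ls CLI (part of the Neuron tools) lists the
Neuron devices visible on the instance.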

Thank you!

- Sample codes: training and inference
- Performance

© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon Web Services, AWS, the Powered by AWS logo, and all AWS service names
used in this slide deck are trademarks of Amazon.com, Inc. or its affiliates.
