High-Performance & Cost-Effective Model Deployment With Amazon SageMaker

The document discusses Amazon SageMaker and its capabilities for deploying machine learning models for inference at scale. It provides an overview of SageMaker's different inference options including real-time, asynchronous, batch, and serverless inference. It also discusses features such as automatic deployment recommendations, cost optimization strategies, and integration with MLOps workflows. Additionally, the document provides a simple guide to help choose the best inference option based on factors like payload size, processing time needs, and traffic patterns.


TORONTO | JUNE 22–23, 2022

AIM302

High-performance & cost-effective model deployment with Amazon SageMaker
Mani Khanuja
Sr. AI/ML Specialist Solutions Architect – Amazon SageMaker
AWS

© 2022, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
Topic 1: Choosing the best inference option
• Introduction to Amazon SageMaker model deployment
• Overview of different inference options
• Simple guide to choose an inference option

Topic 2: Cost optimization options
• SageMaker Savings Plan
• Improving utilization
• Picking the right instance
• Auto scaling
• Optimize models

Topic 3: Demo

Deploy ML models for inference at scale with fully managed deployment

Wide selection of infrastructures
70+ instance types with varying levels of compute and memory to meet the needs of every use case

Automatic deployment recommendations
Optimal instance type/count and container parameters, and fully managed load testing

Breadth of deployment options
Real-time, asynchronous, batch, and serverless endpoints

Fully managed deployment strategies
Canary and linear traffic-shifting modes with built-in safeguards such as auto-rollbacks

Cost-effective deployment
Multi-model/multi-container endpoints, serverless inference, and elastic scaling

Built-in integration for MLOps
ML workflows, model monitoring, CI/CD, lineage tracking, and model registry
SageMaker model deployment options

Online: an inference for each request. SageMaker offers:
• Real-time inference
• Serverless inference
• Asynchronous inference

Batch: inference on a set of data. SageMaker offers batch inference.

Real-time inference

Properties
• Synchronous
• Instance-based (supports CPU/GPU)
• Low latency
• Payload size <6 MB, request timeout: 60 seconds

Key features
• Optimize cost and utilization by deploying multiple models/containers on an instance
• Flight changes with A/B testing
• Safely deploy changes with blue/green deployments
• Capture model inputs and outputs for later use

Example use cases: ad serving, personalized recommendations, fraud detection
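As a rough sketch of the request path, the snippet below assembles the arguments for the `sagemaker-runtime` `invoke_endpoint` call and enforces the 6 MB real-time payload limit client-side. The endpoint name and the `{"instances": ...}` JSON schema are illustrative placeholders, not part of the talk.

```python
import json

MAX_REALTIME_PAYLOAD = 6 * 1024 * 1024  # real-time endpoints cap payloads at 6 MB


def build_invoke_args(endpoint_name, features):
    """Build keyword arguments for sagemaker-runtime invoke_endpoint,
    rejecting payloads over the 6 MB real-time limit up front."""
    body = json.dumps({"instances": features}).encode("utf-8")
    if len(body) > MAX_REALTIME_PAYLOAD:
        raise ValueError(
            "payload exceeds the 6 MB real-time limit; "
            "consider asynchronous or batch inference instead"
        )
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": body,
    }
```

With boto3, the result would be passed along as `boto3.client("sagemaker-runtime").invoke_endpoint(**build_invoke_args("my-endpoint", rows))`.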

Serverless inference

Properties
• Synchronous
• No need to pick and choose instances
• Cost-effective for intermittent/unpredictable traffic
• Good for workloads that tolerate higher p99 latency
• Payload size <4 MB, request timeout: 60 seconds

Key features
• Pay only for the duration of each inference request
• No cost at idle
• Automatic and fast scaling
• Similar deploy/invoke model to real-time inference

Example use cases: analyzing data from documents, form processing, chatbots
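To make "no need to pick instances" concrete, the sketch below builds parameters for `sagemaker.create_endpoint_config` with a `ServerlessConfig` in place of instance settings; you size memory and concurrency instead. The config and model names are placeholders, and the allowed memory sizes reflect my understanding of the serverless inference limits at the time of the talk.

```python
def serverless_endpoint_config(config_name, model_name,
                               memory_mb=2048, max_concurrency=5):
    """Parameters for sagemaker.create_endpoint_config using a
    ServerlessConfig instead of instance type/count."""
    # Serverless inference accepts memory in 1 GB steps from 1 GB to 6 GB.
    allowed = {1024, 2048, 3072, 4096, 5120, 6144}
    if memory_mb not in allowed:
        raise ValueError(f"memory_mb must be one of {sorted(allowed)}")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,
                "MaxConcurrency": max_concurrency,
            },
        }],
    }
```

The resulting dict would be passed to `boto3.client("sagemaker").create_endpoint_config(**cfg)` before creating the endpoint as usual.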

Asynchronous inference

Properties
• Asynchronous
• Instance-based (supports CPU/GPU)
• Good for large payloads (up to 1 GB) of unstructured data (images, videos, text, etc.)
• Suitable when processing time is on the order of minutes (up to 15 minutes)

Key features
• Built-in queue for requests
• Configure auto scaling for queue drain rate
• Scale down to zero to optimize for costs
• Safely deploy changes with blue/green deployments

Example use cases: image synthesis, known entity extraction, anomaly detection with time-series data
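Unlike real-time inference, the asynchronous variant takes its payload from S3 rather than inline, which is how it supports requests up to 1 GB. The sketch below builds the arguments for `sagemaker-runtime` `invoke_endpoint_async`; the endpoint name and S3 URI are placeholders.

```python
def build_async_invoke_args(endpoint_name, input_s3_uri):
    """Arguments for sagemaker-runtime invoke_endpoint_async.
    The request payload must already be staged in S3 (up to 1 GB);
    the call returns immediately, and the result is written later to
    an S3 output location named in the response."""
    if not input_s3_uri.startswith("s3://"):
        raise ValueError("async inference reads its input from S3; "
                         "upload the payload first")
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,
        "ContentType": "application/json",
    }
```

The caller would then poll (or subscribe to a notification) for the object at the `OutputLocation` returned by the call.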

Batch inference

Properties
• High-throughput inference in batches
• Instance-based (supports CPU/GPU)
• Good for processing gigabytes of data of all data types
• Payload size in GBs and processing time in days

Key features
• Built-in features to split, filter, and join structured data
• Automatic distributed processing of structured tabular data for high performance
• Pay only for the duration of the job

Example use cases: propensity modeling, predictive maintenance, churn prediction
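A minimal sketch of how the split/join features surface in the API: the parameters below target `sagemaker.create_transform_job`, with `SplitType` and `AssembleWith` doing the built-in splitting and joining of line-delimited data. Job name, model name, S3 paths, and instance type are illustrative defaults, not values from the talk.

```python
def batch_transform_params(job_name, model_name, input_s3, output_s3,
                           instance_type="ml.m5.xlarge", instance_count=1):
    """Parameters for sagemaker.create_transform_job; billing covers
    only the time the job's instances are running."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3,
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",  # built-in splitting of structured data
        },
        "TransformOutput": {
            "S3OutputPath": output_s3,
            "AssembleWith": "Line",  # join per-record results back together
        },
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,  # raise for distributed processing
        },
    }
```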

Choosing model deployment options

Start with these questions, in order:

1. Does your workload need to return an inference for each request to your model?
   No, I can wait until all requests are processed → Batch (payload size: GBs; runtime: days)
2. Would it be helpful to queue requests due to longer processing times or larger payloads?
   Yes → Asynchronous (payload size: 1 GB; runtime: 15 minutes)
3. Does your workload have intermittent traffic patterns or periods of no traffic?
   Yes → Serverless (payload size: 4 MB; runtime: 60 seconds)
4. Does your workload have sustained traffic and need lower and consistent latency?
   Yes → Real-time (payload size: 6 MB; runtime: 60 seconds)
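The decision flow above is mechanical enough to express directly in code. The sketch below walks the same questions in the same order; the flag names are my own shorthand for the questions on the slide.

```python
def choose_inference_option(per_request=True, queue_ok=False,
                            intermittent_traffic=False):
    """Walk the slide's decision flow: batch -> async -> serverless -> real-time.

    per_request:           must each request get its own inference?
    queue_ok:              would queuing help (longer processing, larger payloads)?
    intermittent_traffic:  does traffic drop to zero for stretches of time?
    """
    if not per_request:
        return "batch"        # payload: GBs, runtime: days
    if queue_ok:
        return "async"        # payload: 1 GB, runtime: 15 minutes
    if intermittent_traffic:
        return "serverless"   # payload: 4 MB, runtime: 60 seconds
    return "real-time"        # payload: 6 MB, runtime: 60 seconds
```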

Programmatically calling SageMaker

• AWS Command Line Interface (AWS CLI)
• SageMaker REST APIs
• AWS CloudFormation
• AWS Cloud Development Kit (AWS CDK)
• AWS SDKs
• SageMaker Python SDK

AWS SDKs and SageMaker Python SDK

AWS SDKs (low-level API)
• Language support: Java, C++, Go, JavaScript, .NET, Node.js, PHP, Ruby, Python
• AWS services supported: most AWS services
• Persona: DevOps, ML engineers
• Size: lightweight (~67 MB); pre-installed in AWS Lambda
• High-level features: more verbose but more transparent
• Code complexity: medium

SageMaker Python SDK (high-level API)
• Language support: Python
• AWS services supported: Amazon SageMaker
• Persona: data scientists
• Size: ~250 MB (may be lower with SageMaker SDK v2)
• High-level features: hides Docker images, copies scripts from local to Amazon S3, and creates the model and endpoint configurations for you; native support for sync/async API calls; simpler request/response schema; less code
• Code complexity: low

SageMaker model deployment cost optimizations

Cost optimizations

SageMaker Savings Plans apply across all options. Per option:

Real-time (instance-based)
• Auto scaling
• Pick the right instance
• Use multiple models/containers

Batch (instance-based)
• Pick the right instance

Asynchronous (instance-based)
• Auto scaling (can be zero)
• Pick the right instance

Serverless
• Choose the right memory size
Buy a SageMaker Savings Plan
• Reduce your costs by up to 64% with a Savings Plan
• 1- or 3-year term commitment to a consistent amount of usage ($/hour)
• Applies automatically to eligible SageMaker ML instance usage for:
• SageMaker Studio Notebook
• SageMaker on-demand notebook instances
• SageMaker processing
• SageMaker Data Wrangler
• SageMaker training
• SageMaker real-time inference
• SageMaker batch transform

Improve utilization of real-time inference

Multi-model endpoints
• Deploy thousands of models
• Works best when models are of similar size and latency
• Models must be able to run in the same container
• Dynamic model loading

Multi-container endpoints
• Up to 15 different containers
• Containers can be directly invoked
• Works best when containers exhibit similar usage and performance characteristics
• Always in memory

Serial inference pipeline
• Chain 2–15 containers
• Reuse the data transformers developed for training models
• Low latency: all containers run on the same underlying Amazon EC2 instance
• Pipeline is immutable
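At invocation time, the difference between a plain endpoint and a multi-model endpoint is one extra routing field. The sketch below builds `invoke_endpoint` arguments that select a specific artifact via `TargetModel`; the endpoint and artifact names are placeholders.

```python
def build_multi_model_invoke_args(endpoint_name, model_artifact, payload):
    """invoke_endpoint arguments for a multi-model endpoint: TargetModel
    names the model artifact (relative to the endpoint's S3 prefix) that
    should serve this request, loaded dynamically on first use.
    (Multi-container endpoints route similarly via TargetContainerHostname.)"""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": model_artifact,   # e.g. "churn-v3.tar.gz" (placeholder)
        "ContentType": "application/json",
        "Body": payload,
    }
```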

Inference recommender

• Run extensive load tests
• Get instance type recommendations (based on throughput, latency, and cost)
• Integrate with model registry
• Review performance metrics from SageMaker Studio
• Customize your load tests
• Fine-tune your model, model server, and containers
• Get detailed metrics from Amazon CloudWatch

Inference recommender job types
• Default: preliminary recommendations
• Advanced: custom load testing and granular control for performance tuning
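The two job types map to a single parameter on `sagemaker.create_inference_recommendations_job`. The sketch below shows the minimal shape as I understand it, using a model registry package as input; all names and ARNs are placeholders.

```python
def recommender_job_params(job_name, role_arn, model_package_arn,
                           job_type="Default"):
    """Parameters for sagemaker.create_inference_recommendations_job.
    'Default' runs preliminary recommendations; 'Advanced' runs a custom
    load test with granular control for performance tuning."""
    if job_type not in ("Default", "Advanced"):
        raise ValueError("job_type must be 'Default' or 'Advanced'")
    return {
        "JobName": job_name,
        "JobType": job_type,
        "RoleArn": role_arn,
        # Model registry integration: recommend against a registered package.
        "InputConfig": {"ModelPackageVersionArn": model_package_arn},
    }
```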

Auto scaling

• Distributes your instances across Availability Zones
• Dynamically adjusts the number of instances
• No traffic interruption while instances are being added or removed
• Scale-in and scale-out options suitable for different traffic patterns
• Support for predefined and custom metrics for the auto scaling policy
• Support for a cooldown period for scaling in and scaling out

[Diagram: a client application sends inference requests to a secure endpoint backed by {ProductionVariants}, with automatic scaling of instances across Availability Zones 1–3]
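SageMaker endpoint auto scaling is driven by Application Auto Scaling. As a sketch of the predefined-metric and cooldown features, the function below builds a target-tracking configuration for `application-autoscaling` `put_scaling_policy`; endpoint and variant names are placeholders, and the target value of 100 invocations per instance is an arbitrary illustrative choice.

```python
def variant_scaling_policy(endpoint_name, variant_name,
                           invocations_per_instance=100,
                           scale_out_cooldown=60, scale_in_cooldown=300):
    """Target-tracking scaling policy for a SageMaker production variant,
    to be passed to the application-autoscaling put_scaling_policy call."""
    return {
        "PolicyName": f"{variant_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": scale_out_cooldown,  # add instances quickly
            "ScaleInCooldown": scale_in_cooldown,    # remove them conservatively
        },
    }
```

The endpoint variant must first be registered as a scalable target for the same `ResourceId` and dimension before the policy takes effect.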

Optimize models

Better-performing models mean you can run more on an instance over a shorter duration.

Automatically optimize models with SageMaker Neo.

https://aws.amazon.com/blogs/machine-learning/increasing-performance-and-reducing-the-cost-of-mxnet-inference-using-amazon-sagemaker-neo-and-amazon-elastic-inference/
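A Neo compilation is submitted as a job rather than run in-process. The sketch below builds parameters for `sagemaker.create_compilation_job`; the MXNet framework and input shape echo the linked blog post, while the job name, role ARN, S3 paths, target device, and timeout are placeholder assumptions.

```python
def neo_compilation_params(job_name, role_arn, model_s3, output_s3,
                           target_device="ml_c5"):
    """Parameters for sagemaker.create_compilation_job, which runs a
    SageMaker Neo compilation of a trained model for a target device."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3,  # trained model artifact (e.g. model.tar.gz)
            # Input shape the compiler should optimize for (NCHW here).
            "DataInputConfig": '{"data": [1, 3, 224, 224]}',
            "Framework": "MXNET",
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3,
            "TargetDevice": target_device,  # instance family to compile for
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }
```

The compiled artifact written to `S3OutputLocation` is then deployed like any other model, typically with a Neo-compatible serving container.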

Learn in-demand AWS Cloud skills

AWS Skill Builder
• Access 500+ free digital courses and Learning Plans
• Explore resources with a variety of skill levels and 16+ languages to meet your learning needs
• Deepen your skills with digital learning on demand

AWS Certifications
• Earn an industry-recognized credential
• Receive Foundational, Associate, Professional, and Specialty certifications
• Join the AWS Certified community and get exclusive benefits
• Access new exam guides

Thank you!
Mani Khanuja
@mani_Khanuja

@manikhanuja
