Amazon SageMaker Developer Guide
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
What Is Amazon SageMaker? ............................................................................................................... 1
Amazon SageMaker Pricing ......................................................................................................... 1
Are You a First-time User of Amazon SageMaker? .......................................................................... 1
How It Works .................................................................................................................... 2
SageMaker Features ................................................................................................................... 2
New features ..................................................................................................................... 2
Machine learning environments ........................................................................................... 3
Major features ................................................................................................................... 3
Machine Learning with Amazon SageMaker ................................................................................... 5
Explore, Analyze, and Process Data .............................................................................................. 7
Fairness and Model Explainability ................................................................................................. 8
Best Practices for Evaluating Fairness and Explainability in the ML Lifecycle ............................... 8
Sample Notebooks ............................................................................................................. 9
Guide to the SageMaker Clarify Documentation ................................................................... 10
Model Training ......................................................................................................................... 10
Model Deployment ................................................................................................................... 13
Validating Models ..................................................................................................................... 13
Model Monitoring ..................................................................................................................... 14
ML Frameworks and Toolkits ..................................................................................................... 15
Apache MXNet ................................................................................................................. 15
Apache Spark ................................................................................................................... 16
Chainer ........................................................................................................................... 24
Hugging Face ................................................................................................................... 25
PyTorch ........................................................................................................................... 27
R .................................................................................................................................... 28
Scikit-learn ...................................................................................................................... 30
SparkML Serving .............................................................................................................. 31
TensorFlow ...................................................................................................................... 32
Triton Inference Server ...................................................................................................... 32
Supported Regions and Quotas .................................................................................................. 33
Quotas ............................................................................................................................ 34
Get Started ..................................................................................................................................... 35
Set Up Amazon SageMaker Prerequisites ..................................................................................... 35
Create an AWS Account ..................................................................................................... 35
Create an Administrative User and Group ............................................................................ 36
AWS CLI Prerequisites ....................................................................................................... 37
Onboard to Domain .................................................................................................................. 37
Onboard Using Quick setup ............................................................................................... 38
Onboard Using IAM Identity Center .................................................................................... 39
Onboard Using IAM .......................................................................................................... 43
Choose an Amazon VPC .................................................................................................... 46
SageMaker JumpStart ............................................................................................................... 47
Open and use JumpStart .................................................................................................. 47
Solution Templates ........................................................................................................... 50
Foundation Models ........................................................................................................... 58
Task-Specific Models ......................................................................................................... 66
Shared Models and Notebooks ........................................................................................... 79
SageMaker JumpStart Industry: Financial ............................................................................ 83
Get Started with Notebook Instances .......................................................................................... 87
Machine Learning with the SageMaker Python SDK .............................................................. 87
Tutorial Overview ............................................................................................................. 87
Step 1: Create an Amazon SageMaker Notebook Instance ...................................................... 88
Step 2: Create a Jupyter Notebook ..................................................................................... 89
Step 3: Download, Explore, and Transform Data ................................................................... 90
What Is Amazon SageMaker?
Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly build and train machine learning models, and then deploy them into a production-ready hosted environment.
This guide includes information and tutorials on SageMaker features. For additional information, see
Amazon SageMaker developer resources.
Are You a First-time User of Amazon SageMaker?
If you are a first-time user of SageMaker, we recommend that you read the following sections in order:
1. Read How Amazon SageMaker Works (p. 2) – This section provides an overview of SageMaker,
explains key concepts, and describes the core components involved in building AI solutions with
SageMaker. We recommend that you read this topic in the order presented.
2. Set Up Amazon SageMaker Prerequisites (p. 35) – This section explains how to set up your AWS
account.
3. Amazon SageMaker Autopilot simplifies the machine learning experience by automating machine
learning tasks. If you are new to SageMaker, it provides the easiest learning path. It also serves as an
excellent ML learning tool that provides visibility into the code with notebooks generated for each
of the automated ML tasks. For an introduction to its capabilities, see Automate model development
with Amazon SageMaker Autopilot (p. 467). To get started building, training, and deploying machine
learning models, Autopilot provides:
• Samples: Explore modeling with Amazon SageMaker Autopilot (p. 468)
• Videos: Use Autopilot to automate and explore the machine learning process (p. 469)
• Tutorials: Get started with Amazon SageMaker Autopilot (p. 470)
4. Get Started with Amazon SageMaker (p. 35) – This section walks you through training your first
model using SageMaker Studio, or the SageMaker console and the SageMaker API. You use training
algorithms provided by SageMaker.
How It Works
Topics
• New features for re:Invent 2022 (p. 2)
• Machine learning environments (p. 3)
• Major features (p. 3)
New features for re:Invent 2022
SageMaker Model Cards
Document information about your ML models in a single place for streamlined governance and reporting throughout the ML lifecycle.
SageMaker Model Dashboard (p. 3261)
A pre-built, visual overview of all the models in your account. Model Dashboard integrates
information from SageMaker Model Monitor, transform jobs, endpoints, lineage tracking, and
CloudWatch so you can access high-level model information and track model performance in one
unified view.
SageMaker Role Manager (p. 3108)
Administrators can define least-privilege permissions for common ML activities using custom and
preconfigured persona-based IAM roles.
AutoML step (p. 2733)
Launch an Autopilot experiment directly from a step in SageMaker Model Building Pipelines.
Collaboration with shared spaces
A shared space consists of a shared JupyterServer application and a shared directory. All user profiles in a Domain have access to all shared spaces in the Domain.
Data Wrangler data preparation widget (p. 1138)
Interact with your data, get visualizations, explore actionable insights, and fix data quality issues.
Inference shadow tests (p. 2467)
Evaluate any changes to your model-serving infrastructure by comparing its performance against the currently deployed infrastructure.
Notebook-based Workflows (p. 2908)
Run your SageMaker Studio notebooks as scheduled, noninteractive jobs.
SageMaker Studio Git extension
A Git extension to enter the URL of a Git repository, clone it into your environment, push changes, and view commit history.
Machine learning environments
SageMaker includes the following machine learning environments.
SageMaker Studio
An integrated machine learning environment where you can build, train, deploy, and analyze your models all in the same application.
SageMaker Studio Lab (p. 230)
A free service that gives customers access to AWS compute resources in an environment based on
open-source JupyterLab.
SageMaker Canvas (p. 258)
An auto ML service that gives people with no coding experience the ability to build models and make
predictions with them.
RStudio on Amazon SageMaker (p. 432)
A fully managed RStudio Workbench in the cloud, so you can use the familiar RStudio IDE together with SageMaker resources.
Major features
SageMaker includes the following major features, listed in alphabetical order (ignoring any SageMaker prefix).
Amazon Augmented AI
Build the workflows required for human review of ML predictions. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers.
SageMaker Autopilot (p. 467)
Users without machine learning knowledge can quickly build classification and regression models.
Batch Transform (p. 2421)
Preprocess datasets, run inference when you don't need a persistent endpoint, and associate input
records with inferences to assist the interpretation of results.
SageMaker Clarify (p. 8)
Improve your machine learning models by detecting potential bias and help explain the predictions
that models make.
SageMaker Data Wrangler (p. 981)
Import, analyze, prepare, and featurize data in SageMaker Studio. You can integrate Data Wrangler
into your machine learning workflows to simplify and streamline data pre-processing and feature
engineering using little to no coding. You can also add your own Python scripts and transformations
to customize your data prep workflow.
SageMaker Debugger (p. 1649)
Inspect training parameters and data throughout the training process. Automatically detect and
alert users to commonly occurring errors such as parameter values getting too large or small.
SageMaker Edge Manager (p. 2510)
Optimize custom models for edge devices, create and manage fleets and run models with an
efficient runtime.
SageMaker Elastic Inference (p. 2628)
Speed up the throughput and decrease the latency of getting real-time inferences.
SageMaker Experiments (p. 1587)
Experiment management and tracking. You can use the tracked data to reconstruct an experiment,
incrementally build on experiments conducted by peers, and trace model lineage for compliance and
audit verifications.
SageMaker Feature Store (p. 1210)
A centralized store for features and associated metadata so features can be easily discovered and
reused. You can create two types of stores, an Online or Offline store. The Online Store can be used
for low latency, real-time inference use cases and the Offline Store can be used for training and
batch inference.
SageMaker Ground Truth (p. 526)
Create high-quality labeled training datasets by using human workers along with machine learning.
SageMaker Ground Truth Plus (p. 844)
A turnkey data labeling feature to create high-quality training datasets without having to build
labeling applications and manage the labeling workforce on your own.
SageMaker Inference Recommender (p. 2159)
Get recommendations on inference instance types and configurations (for example, instance count, container parameters, and model optimizations) to use for your ML models and workloads.
SageMaker JumpStart (p. 47)
Learn about SageMaker features and capabilities through curated 1-click solutions, example notebooks, and pretrained models that you can deploy. You can also fine-tune the models and deploy them.
SageMaker ML Lineage Tracking (p. 2828)
Track the lineage of machine learning workflows.
SageMaker Model Building Pipelines
Create and manage machine learning pipelines integrated directly with SageMaker jobs.
SageMaker Model Monitor (p. 2299)
Monitor and analyze models in production (endpoints) to detect data drift and deviations in model
quality.
SageMaker Model Registry (p. 2484)
Versioning, artifact and lineage tracking, approval workflow, and cross account support for
deployment of your machine learning models.
SageMaker Neo (p. 2562)
Train machine learning models once, then run anywhere in the cloud and at the edge.
Preprocessing (p. 1196)
Analyze and preprocess data, tackle feature engineering, and evaluate models.
SageMaker Projects (p. 2801)
Create end-to-end ML solutions with CI/CD by using SageMaker projects.
Reinforcement Learning
Maximize the long-term reward that an agent receives as a result of its actions.
SageMaker Serverless Endpoints (p. 2371)
A serverless endpoint option for hosting your ML model. Automatically scales in capacity to serve
your endpoint traffic. Removes the need to select instance types or manage scaling policies on an
endpoint.
SageMaker Studio Notebooks (p. 144)
The next generation of SageMaker notebooks that include AWS IAM Identity Center (successor to AWS Single Sign-On) integration, fast start-up times, and single-click sharing.
SageMaker Studio Notebooks and Amazon EMR (p. 1164)
Easily discover, connect to, create, terminate and manage Amazon EMR clusters in single account
and cross account configurations directly from SageMaker Studio.
SageMaker Training Compiler (p. 1948)
Train deep learning models faster on scalable GPU instances managed by SageMaker.
Machine Learning with Amazon SageMaker
In machine learning, you "teach" a computer to make predictions, or inferences. First, you use an
algorithm and example data to train a model. Then you integrate your model into your application to
generate inferences in real time and at scale. In a production environment, a model typically learns from millions of example data items and produces inferences in times ranging from a few hundred milliseconds down to less than 20 milliseconds.
The following diagram illustrates the typical workflow for creating a machine learning model:
1. Generate example data—To train a model, you need example data. The type of data that you need
depends on the business problem that you want the model to solve (the inferences that you want
the model to generate). For example, suppose that you want to create a model to predict a number
given an input image of a handwritten digit. To train such a model, you need example images of
handwritten numbers.
Data scientists often spend a lot of time exploring and preprocessing, or "wrangling," example data
before using it for model training. To preprocess data, you typically do the following:
a. Fetch the data— You might have in-house example data repositories, or you might use datasets
that are publicly available. Typically, you pull the dataset or datasets into a single repository.
b. Clean the data—To improve model training, inspect the data and clean it as needed. For example, if
your data has a country name attribute with values United States and US, you might want to
edit the data to be consistent.
c. Prepare or transform the data—To improve performance, you might perform additional data
transformations. For example, you might choose to combine attributes. If your model predicts the
conditions that require de-icing an aircraft, instead of using temperature and humidity attributes
separately, you might combine those attributes into a new attribute to get a better model.
In SageMaker, you preprocess example data in a Jupyter notebook on your notebook instance. You
use your notebook to fetch your dataset, explore it, and prepare it for model training. For more
information, see Explore, Analyze, and Process Data (p. 7). For more information about preparing
data in AWS Marketplace, see data preparation.
2. Train a model—Model training includes both training and evaluating the model, as follows:
• Training the model— To train a model, you need an algorithm or a pre-trained base model. The
algorithm you choose depends on a number of factors. For a quick, out-of-the-box solution, you
might be able to use one of the algorithms that SageMaker provides. For a list of algorithms
provided by SageMaker and related considerations, see Use Amazon SageMaker Built-in Algorithms
or Pre-trained Models (p. 1281). For a UI-based training solution that provides algorithms and
models, see SageMaker JumpStart (p. 47).
You also need compute resources for training. Depending on the size of your training dataset and
how quickly you need the results, you can use resources ranging from a single general-purpose
instance to a distributed cluster of GPU instances. For more information, see Train a Model with
Amazon SageMaker (p. 10).
• Evaluating the model—After you've trained your model, you evaluate it to determine whether
the accuracy of the inferences is acceptable. In SageMaker, you use either the AWS SDK for Python
(Boto) or the high-level Python library that SageMaker provides to send requests to the model for
inferences.
You use a Jupyter notebook in your SageMaker notebook instance to train and evaluate your model.
3. Deploy the model— You traditionally re-engineer a model before you integrate it with your
application and deploy it. With SageMaker hosting services, you can deploy your model
independently, decoupling it from your application code. For more information, see Deploy Models for
Inference (p. 2155).
Machine learning is a continuous cycle. After deploying a model, you monitor the inferences, collect
"ground truth," and evaluate the model to identify drift. You then increase the accuracy of your
inferences by updating your training data to include the newly collected ground truth. You do this by
retraining the model with the new dataset. As more and more example data becomes available, you
continue retraining your model to increase accuracy.
Explore, Analyze, and Process Data
Amazon SageMaker Processing enables running jobs to preprocess and postprocess data, perform
feature engineering, and evaluate models on SageMaker easily and at scale. When combined with the
other critical machine learning tasks provided by SageMaker, such as training and hosting, Processing
provides you with the benefits of a fully managed machine learning environment, including all the
security and compliance support built into SageMaker. With Processing, you have the flexibility to use
the built-in data processing containers or to bring your own containers and submit custom jobs to run on
managed infrastructure. After you submit a job, SageMaker launches the compute instances, processes
and analyzes the input data, and releases the resources upon completion. For more information, see
Process Data (p. 1196).
• For information about how to run your own data processing scripts, see Data Processing with scikit-
learn (p. 1198).
• For information about how to build your own processing container to run scripts, see Build Your Own
Processing Container (Advanced Scenario) (p. 1205).
• For information about how to perform exploratory data analysis (EDA) with a visual no-code interface,
see Prepare ML Data with Amazon SageMaker Data Wrangler (p. 981).
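As a minimal sketch of a Processing job with the SageMaker Python SDK, assuming an execution role, S3 bucket paths, and a preprocess.py script of your own (all placeholder names):

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder role ARN

processor = SKLearnProcessor(
    framework_version="0.23-1",     # a supported scikit-learn container version
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# SageMaker stages the input under /opt/ml/processing/ inside the container,
# runs the script, then uploads the declared outputs back to S3.
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed")],
)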
Fairness and Model Explainability
Machine learning models and data-driven systems are being increasingly used to help make decisions
across domains such as financial services, healthcare, education, and human resources. Machine learning
applications provide benefits such as improved accuracy, increased productivity, and cost savings to help
meet regulatory requirements, improve business decisions, and provide better insights into data science
procedures.
For a blog post that shows how to architect and build a complete machine learning use case involving fraudulent automobile claims, integrating SageMaker Clarify into a SageMaker pipeline, see Architect and build the full machine learning lifecycle with AWS: An end-to-end Amazon SageMaker demo. This blog post discusses how to assess pre-training and post-training bias, how to mitigate the bias, and how the data features impact the prediction. It links to the relevant code for each task in the ML lifecycle, including the creation of an automated workflow that integrates the fairness and explainability functionality of SageMaker Clarify into a SageMaker Pipeline.
Best Practices for Evaluating Fairness and Explainability in the ML Lifecycle
Engaging relevant stakeholders (including product, policy, legal, and engineering teams, as well as end users and communities) is a prerequisite for the successful adoption of fairness-aware ML approaches in practice.
Fairness and Explainability by Design in the ML Lifecycle – You should consider fairness and
explainability during each stage of the ML lifecycle: problem formation, dataset construction, algorithm
selection, model training process, testing process, deployment, and monitoring/feedback. It is important
to have the right tools to do this analysis. To encourage engaging with these considerations, here are a
few example questions we recommend you ask during each of these stages.
Sample Notebooks
Amazon SageMaker Clarify provides the following sample notebooks:
• Explainability and bias detection with Amazon SageMaker Clarify – Use SageMaker Clarify to create a processing job for detecting bias and explaining model predictions with feature attributions.
• Monitoring bias drift and feature attribution drift with Amazon SageMaker Clarify – Use Amazon SageMaker Model Monitor to monitor bias drift and feature attribution drift over time.
• Fairness and Explainability with SageMaker Clarify (Bring Your Own Container) – This sample notebook
introduces key terms and concepts needed to understand SageMaker Clarify, and it walks you through
an end-to-end data science workflow demonstrating how to build your own model and container
that can work seamlessly with your Clarify jobs, use the model and SageMaker Clarify to measure
bias, explain the importance of the various input features on the model's decision and then access the
reports through SageMaker Studio if you have an instance set up.
• Fairness and Explainability with SageMaker Clarify - Spark Distributed Processing – This sample
notebook walks you through key terms and concepts needed to understand SageMaker Clarify,
measures the pre-training bias of a dataset and post-training bias of a model, explains the importance
of the various input features on the model's decision, and accesses the reports through SageMaker
Studio if you have an instance set up.
• Mitigate Bias, Train another unbiased Model and Put in the Model Registry – This notebook describes
how to detect bias using SageMaker Clarify, mitigate it with Synthetic Minority Over-sampling
Technique (SMOTE), train another model, then put it in the Model Registry along with all the lineage
of the artifacts created along the way: data, code and model metadata. This notebook forms part of a
series that shows how to integrate SageMaker Clarify into a SageMaker Pipeline that is described in the
Architect and build the full machine learning lifecycle with AWS blog.
These notebooks have been verified to run in Amazon SageMaker Studio only. If you need instructions on
how to open a notebook in Amazon SageMaker Studio, see Create or Open an Amazon SageMaker Studio
Notebook (p. 148). If you're prompted to choose a kernel, choose Python 3 (Data Science).
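To give a flavor of what these notebooks run, here is a minimal sketch of launching a pre-training bias analysis with the SageMaker Python SDK; the dataset path, headers, and facet column are hypothetical:

from sagemaker import Session, clarify

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=Session(),
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",   # placeholder dataset
    s3_output_path="s3://my-bucket/clarify-output",
    label="target",
    headers=["target", "gender", "age", "income"],   # hypothetical columns
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # the favorable label value
    facet_name="gender",             # hypothetical sensitive attribute
)

# Runs a processing job and writes a bias report to the output path.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
)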
Guide to the SageMaker Clarify Documentation
• For further information on detecting bias in preprocessing data before it's used to train a model, see
Detect Pre-training Data Bias (p. 968).
• For further information on detecting posttraining data and model bias, see Detect Post-training Data
and Model Bias with Amazon SageMaker Clarify (p. 2072).
• For further information on the model-agnostic feature attribution approach to explain model
predictions after training, see Amazon SageMaker Clarify Model Explainability (p. 2093).
• For further information on monitoring for bias in production model inferences due to the drift
of data away from the baseline used to train the model, see Monitor Bias Drift for Models in
Production (p. 2325).
• For further information on monitoring for the drift of features' contributions away from the baseline
that was established during model training, see Monitor Feature Attribution Drift for Models in
Production (p. 2334).
Model Training
The area labeled SageMaker highlights the two components of SageMaker: model training and model
deployment.
To train a model in SageMaker, you create a training job. The training job includes the following
information:
• The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training
data.
• The compute resources that you want SageMaker to use for model training. Compute resources are
machine learning (ML) compute instances that are managed by SageMaker.
• The URL of the S3 bucket where you want to store the output of the job.
• The Amazon Elastic Container Registry path where the training code is stored. For more information,
see Docker Registry Paths and Example Code.
Note
Your input dataset must be in the same AWS Region as your training job.
After you create the training job, SageMaker launches the ML compute instances and uses the training
code and the training dataset to train the model. It saves the resulting model artifacts and other output
in the S3 bucket you specified for that purpose.
You can create a training job with the SageMaker console or the API. For information about creating a
training job with the API, see the CreateTrainingJob API.
When you create a training job with the API, SageMaker replicates the entire dataset on ML compute
instances by default. To make SageMaker replicate a subset of the data on each ML compute instance,
you must set the S3DataDistributionType field to ShardedByS3Key. You can set this field using the
low-level SDK. For more information, see S3DataDistributionType in S3DataSource.
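For example, with the SageMaker Python SDK you can request sharding per channel through a TrainingInput object; here estimator stands for any already-configured SageMaker estimator, and the S3 path is a placeholder:

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train",   # placeholder prefix containing many objects
    distribution="ShardedByS3Key",    # each instance receives a subset of the objects
)

# estimator is an already-configured SageMaker estimator.
estimator.fit({"train": train_input})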
Important
To prevent your algorithm container from contending for memory, SageMaker reserves memory for critical system processes on your ML compute instances. Therefore, you cannot expect to see all of your instance type's memory available to your container.
Model Deployment
After you train your model, you can deploy it to get predictions in one of the following ways, depending on your use case:
• For persistent, real-time endpoints that make one prediction at a time, use SageMaker real-time hosting services. See Real-time inference (p. 2195).
• For workloads that have idle periods between traffic spurts and can tolerate cold starts, use Serverless Inference. See Serverless Inference (p. 2371).
• For requests with large payload sizes up to 1 GB, long processing times, and near real-time latency requirements, use Amazon SageMaker Asynchronous Inference. See Asynchronous inference (p. 2398).
• To get predictions for an entire dataset, use SageMaker batch transform. See Use Batch Transform (p. 2421).
SageMaker also provides features to manage resources and optimize inference performance when
deploying machine learning models:
• To manage models on edge devices so that you can optimize, secure, monitor, and maintain machine
learning models on fleets of edge devices such as smart cameras, robots, personal computers, and
mobile devices, see Deploy models at the edge with SageMaker Edge Manager (p. 2510).
• To optimize Gluon, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, and ONNX models for
inference on Android, Linux, and Windows machines based on processors from Ambarella, ARM,
Intel, Nvidia, NXP, Qualcomm, Texas Instruments, and Xilinx, see Optimize model performance using
Neo (p. 2562).
For more information about all deployment options, see Deploy Models for Inference (p. 2155).
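As a sketch of the first two options with the SageMaker Python SDK, assuming model is a SageMaker Model object you have already created; instance types and serverless settings are illustrative:

from sagemaker.serverless import ServerlessInferenceConfig

# Real-time endpoint on a dedicated instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Serverless endpoint: no instance type to pick; capacity scales with traffic.
serverless_predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,   # illustrative settings
        max_concurrency=5,
    ),
)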
Validating Models
You can evaluate your model using historical data (offline) or live data:
• Offline testing—Use historical, not live, data to send requests to the model for inferences.
Deploy your trained model to an alpha endpoint, and use historical data to send inference requests to
it. To send the requests, use a Jupyter notebook in your Amazon SageMaker notebook instance and
either the AWS SDK for Python (Boto) or the high-level Python library provided by SageMaker.
• Online testing with live data—SageMaker supports A/B testing for models in production by using
production variants. Production variants are models that use the same inference code and are
deployed on the same SageMaker endpoint. You configure the production variants so that a small portion of the live traffic goes to the model that you want to validate. For example, you might choose to send 10% of the traffic to a model variant for evaluation. After you are satisfied with the model's performance, you can route 100% of the traffic to the updated model. For an example of testing models in production, see Production variants (p. 2270).
For more information, see articles and books about how to evaluate models, for example, Evaluating
Machine Learning Models.
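Returning to online testing, the following boto3 sketch creates an endpoint that sends 90% of traffic to the current model and 10% to a challenger; all names are hypothetical, and both models must already exist in your account:

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "current",
            "ModelName": "model-a",          # hypothetical model names
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,     # 90% of traffic
        },
        {
            "VariantName": "challenger",
            "ModelName": "model-b",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,     # 10% of traffic
        },
    ],
)

sm.create_endpoint(
    EndpointName="ab-test-endpoint",
    EndpointConfigName="ab-test-config",
)

When you are satisfied with the challenger, you can shift traffic with the UpdateEndpointWeightsAndCapacities API rather than recreating the endpoint.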
• Validating using a holdout set—Machine learning practitioners often set aside a part of the data as a
"holdout set." They don’t use this data for model training.
With this approach, you evaluate how well your model provides inferences on the holdout set. You
then assess how effectively the model generalizes what it learned in the initial training, as opposed to simply recalling the training data from memory. This approach to validation gives you an idea of how often the model is able to infer the correct answer.
In some ways, this approach is similar to teaching elementary school students. First, you provide them
with a set of examples to learn, and then test their ability to generalize from their learning. With
homework and tests, you pose problems that were not included in the initial learning and determine
whether they are able to generalize effectively. Students with perfect memories could memorize the
problems, instead of learning the rules.
• k-fold validation—In this validation approach, you split the example dataset into k parts. You treat
each of these parts as a holdout set for k training runs, and use the other k-1 parts as the training set
for that run. You produce k models using a similar process, and aggregate the models to generate your
final model. The value k is typically in the range of 5-10.
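The same idea is easy to sketch locally with scikit-learn; here k = 5, so each of the 5 folds serves once as the holdout set:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 folds is held out once; the other 4 are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"accuracy per fold: {scores}")
print(f"mean accuracy: {scores.mean():.3f}")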
Model Monitoring
SageMaker Model Monitor lets you monitor and analyze models in production (endpoints) to detect data drift and deviations in model quality. For more information about SageMaker model monitoring products, see Monitor models for data and model quality, bias, and explainability (p. 2299).
To start your machine learning journey with SageMaker, sign up for an AWS account and follow the steps in Set Up Amazon SageMaker Prerequisites (p. 35).
ML Frameworks and Toolkits
For information about using specific frameworks or how to use R in SageMaker, see the following topics.
Apache MXNet
For a sample Jupyter notebook, see the MXNet example notebooks in the Amazon SageMaker Examples GitHub repository.
I want to see the API documentation for Amazon SageMaker Python SDK MXNet classes.
For general information about writing MXNet script mode training scripts and using MXNet script mode
estimators and models with SageMaker, see Using MXNet with the SageMaker Python SDK.
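A minimal MXNet script-mode sketch with the SageMaker Python SDK follows; the entry-point script, version numbers, role, and data location are placeholders:

from sagemaker.mxnet import MXNet

mxnet_estimator = MXNet(
    entry_point="train.py",        # your script-mode training script (placeholder)
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="1.8.0",     # example version; use a supported one
    py_version="py37",
)

mxnet_estimator.fit({"train": "s3://my-bucket/train"})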
Apache Spark
SageMaker provides an Apache Spark library, in both Python and Scala, that you can use to easily train
models in SageMaker using org.apache.spark.sql.DataFrame data frames in your Spark clusters.
After model training, you can also host the model using SageMaker hosting services.
With SageMaker Studio, you can easily connect to an Amazon EMR cluster. For more information, see
Prepare data at Scale with Studio Notebooks.
• You can download the source code for both PySpark and Scala libraries from the SageMaker Spark
GitHub repository.
• For the Python Spark library, you have the following additional options:
• Use pip to install the library from PyPI (the PySpark library is published as sagemaker_pyspark):

pip install sagemaker_pyspark
• In a notebook instance, create a new notebook that uses either the Sparkmagic (PySpark) or the
Sparkmagic (PySpark3) kernel and connect to a remote Amazon EMR cluster.
Note
The EMR cluster must be configured with an IAM role that has the
AmazonSageMakerFullAccess policy attached. For information about configuring roles
for an EMR cluster, see Configure IAM Roles for Amazon EMR Permissions to AWS Services
in the Amazon EMR Management Guide.
• You can get the Scala library from Maven. Add the Spark library to your project by adding the
following dependency to your pom.xml file:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>sagemaker-spark_2.11</artifactId>
<version>spark_2.2.0-1.0</version>
</dependency>
1. Continue data preprocessing using the Apache Spark library that you are familiar with. Your dataset
remains a DataFrame in your Spark cluster. Load your data into a DataFrame and preprocess it so
that you have a features column with org.apache.spark.ml.linalg.Vector of Doubles,
and an optional label column with values of Double type.
2. Use the estimator in the SageMaker Spark library to train your model. For example, if you
choose the k-means algorithm provided by SageMaker for model training, you call the
KMeansSageMakerEstimator.fit method.
a. Converts the input DataFrame to the protobuf format by selecting the features and label
columns from the input DataFrame and uploading the protobuf data to an Amazon S3 bucket.
The protobuf format is efficient for model training in SageMaker.
b. Starts model training in SageMaker by sending a SageMaker CreateTrainingJob request.
After model training has completed, SageMaker saves the model artifacts to an S3 bucket.
SageMaker assumes the IAM role that you specified for model training to perform tasks on your
behalf. For example, it uses the role to read training data from an S3 bucket and to write model
artifacts to a bucket.
c. Creates and returns a SageMakerModel object. The constructor does the following tasks, which
are related to deploying your model to SageMaker.
3. Get inferences—To get inferences from the model hosted in SageMaker, call the SageMakerModel.transform method. Provide an input DataFrame with features as input. The transform method transforms it to a DataFrame containing inferences. Internally, the transform method sends a request to the InvokeEndpoint SageMaker API to get inferences. The transform method appends the inferences to the input DataFrame.
Amazon SageMaker provides an Apache Spark library (in both Python and Scala) that you can use to
integrate your Apache Spark applications with SageMaker. For example, you might use Apache Spark
for data preprocessing and SageMaker for model training and hosting. For more information, see Use
Apache Spark with Amazon SageMaker (p. 16). This section provides example code that uses the
Apache Spark Scala library provided by SageMaker to train a model in SageMaker using DataFrames
in your Spark cluster. The example also hosts the resulting model artifacts using SageMaker hosting
services. Specifically, this example does the following:
Because the example uses the k-means algorithm provided by SageMaker to train a model, you
use the KMeansSageMakerEstimator. You train the model using images of handwritten single-
digit numbers (from the MNIST dataset). You provide the images as an input DataFrame. For your
convenience, SageMaker provides this dataset in an S3 bucket.
To get inferences from a model hosted in SageMaker, you call the SageMakerModel.transform
method. You pass a DataFrame as input. The method transforms the input DataFrame to another
DataFrame containing inferences obtained from the model.
For a given input image of a handwritten single-digit number, the inference identifies a cluster that the
image belongs to. For more information, see K-Means Algorithm (p. 1485).
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator
// train
val model = estimator.fit(trainingData)
The show method displays the first 20 rows in the data frame:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 5.0|(784,[152,153,154...|
| 0.0|(784,[127,128,129...|
| 4.0|(784,[160,161,162...|
| 1.0|(784,[158,159,160...|
| 9.0|(784,[208,209,210...|
| 2.0|(784,[155,156,157...|
| 1.0|(784,[124,125,126...|
| 3.0|(784,[151,152,153...|
| 1.0|(784,[152,153,154...|
| 4.0|(784,[134,135,161...|
| 3.0|(784,[123,124,125...|
| 5.0|(784,[216,217,218...|
| 3.0|(784,[143,144,145...|
| 6.0|(784,[72,73,74,99...|
| 1.0|(784,[151,152,153...|
| 7.0|(784,[211,212,213...|
| 2.0|(784,[151,152,153...|
| 8.0|(784,[159,160,161...|
| 6.0|(784,[100,101,102...|
| 9.0|(784,[209,210,211...|
+-----+--------------------+
only showing top 20 rows
In each row:
• The label column identifies the image's label. For example, if the image of the handwritten number
is the digit 5, the label value is 5.
• The features column stores a vector (org.apache.spark.ml.linalg.Vector) of Double
values. These are the 784 features of the handwritten number. (Each handwritten number is a 28 x
28-pixel image, making 784 features.)
The fit method of this estimator uses the k-means algorithm provided by SageMaker to train models
using an input DataFrame. In response, it returns a SageMakerModel object that you can use to get
inferences.
Note
The KMeansSageMakerEstimator extends the SageMaker SageMakerEstimator, which
extends the Apache Spark Estimator.
The constructor parameters provide information that is used for training a model and deploying it on
SageMaker:
• trainingInstanceType and trainingInstanceCount—Identify the type and number of ML
compute instances to use for model training.
• endpointInstanceType—Identifies the ML compute instance type to use when hosting the model
in SageMaker. By default, one ML compute instance is assumed.
• sagemakerRole—SageMaker assumes this IAM role to perform tasks on your behalf. For example,
for model training, it reads data from S3 and writes training results (model artifacts) to S3.
Note
This example implicitly creates a SageMaker client. To create this client, you must provide
your credentials. The API uses these credentials to authenticate requests to SageMaker. For
example, it uses the credentials to authenticate requests to create a training job and API
calls for deploying the model using SageMaker hosting services.
• After the KMeansSageMakerEstimator object has been created, you set the following parameters, which are used in model training:
• The number of clusters that the k-means algorithm should create during model training. You
specify 10 clusters, one for each digit, 0 through 9.
• Identifies that each input image has 784 features (each handwritten number is a 28 x 28-pixel
image, making 784 features).
// train
val model = estimator.fit(trainingData)
You pass the input DataFrame as a parameter. The fit method does all the work of training the model and deploying it to SageMaker. For more information, see Integrate Your Apache Spark Application with SageMaker (p. 17). In response, you get a SageMakerModel object, which you can use to get inferences from your model deployed in SageMaker.
You provide only the input DataFrame. You don't need to specify the registry path to the k-means
algorithm used for model training because the KMeansSageMakerEstimator knows it.
• Calls the SageMakerModel.transform method to get inferences from the model deployed in
SageMaker.
The transform method takes a DataFrame as input, transforms it, and returns another DataFrame
containing inferences obtained from the model.
For simplicity, we use the same DataFrame as input to the transform method that we used for
model training in this example. The transform method does the following:
• Serializes the features column in the input DataFrame to protobuf and sends it to the SageMaker
endpoint for inference.
• Deserializes the protobuf response into the two additional columns (distance_to_cluster and
closest_cluster) in the transformed DataFrame.
The show method gets inferences to the first 20 rows in the input DataFrame:
+-----+--------------------+-------------------+---------------+
|label| features|distance_to_cluster|closest_cluster|
+-----+--------------------+-------------------+---------------+
| 5.0|(784,[152,153,154...| 1767.897705078125| 4.0|
| 0.0|(784,[127,128,129...| 1392.157470703125| 5.0|
| 4.0|(784,[160,161,162...| 1671.5711669921875| 9.0|
| 1.0|(784,[158,159,160...| 1182.6082763671875| 6.0|
| 9.0|(784,[208,209,210...| 1390.4002685546875| 0.0|
| 2.0|(784,[155,156,157...| 1713.988037109375| 1.0|
| 1.0|(784,[124,125,126...| 1246.3016357421875| 2.0|
| 3.0|(784,[151,152,153...| 1753.229248046875| 4.0|
| 1.0|(784,[152,153,154...| 978.8394165039062| 2.0|
| 4.0|(784,[134,135,161...| 1623.176513671875| 3.0|
| 3.0|(784,[123,124,125...| 1533.863525390625| 4.0|
| 5.0|(784,[216,217,218...| 1469.357177734375| 6.0|
| 3.0|(784,[143,144,145...| 1736.765869140625| 4.0|
| 6.0|(784,[72,73,74,99...| 1473.69384765625| 8.0|
Use Custom Algorithms for Model Training and Hosting on Amazon SageMaker
with Apache Spark
In Example 1: Use Amazon SageMaker for Training and Inference with Apache Spark (p. 18), you use the KMeansSageMakerEstimator because the example uses the k-means algorithm provided by
Amazon SageMaker for model training. You might choose to use your own custom algorithm for model
training instead. Assuming that you have already created a Docker image, you can create your own
SageMakerEstimator and specify the Amazon Elastic Container Registry path for your custom image.
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.SageMakerEstimator
import com.amazonaws.services.sagemaker.sparksdk.transformation.serializers.ProtobufRequestRowSerializer
import com.amazonaws.services.sagemaker.sparksdk.transformation.deserializers.KMeansProtobufResponseRowDeserializer
• trainingImage —Identifies the Docker registry path to the training image containing your custom
code.
• modelImage —Identifies the Docker registry path to the image containing inference code.
• requestRowSerializer —Implements
com.amazonaws.services.sagemaker.sparksdk.transformation.RequestRowSerializer.
This parameter serializes rows in the input DataFrame to send them to the model hosted in
SageMaker for inference.
• responseRowDeserializer —Implements
com.amazonaws.services.sagemaker.sparksdk.transformation.ResponseRowDeserializer.
This parameter deserializes responses from the model, hosted in SageMaker, back into a DataFrame.
• trainingSparkDataFormat —Specifies the data format that Spark uses when uploading training
data from a DataFrame to S3. For example, "sagemaker" for protobuf format, "csv" for comma-
separated values, and "libsvm" for LibSVM format.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator
// train
val pipelineModel = pipeline.fit(trainingData)
Because we want to make inferences using the "projectedFeatures" column, we pass the column name
into the ProtobufRequestRowSerializer.
+-----+--------------------+--------------------+-------------------+---------------+
|label| features| projectedFeatures|distance_to_cluster|closest_cluster|
+-----+--------------------+--------------------+-------------------+---------------+
| 5.0|(784,[152,153,154...|[880.731433034386...| 1500.470703125| 0.0|
| 0.0|(784,[127,128,129...|[1768.51722024166...| 1142.18359375| 4.0|
| 4.0|(784,[160,161,162...|[704.949236329314...| 1386.246826171875| 9.0|
| 1.0|(784,[158,159,160...|[-42.328192193771...| 1277.0736083984375| 5.0|
| 9.0|(784,[208,209,210...|[374.043902028333...| 1211.00927734375| 3.0|
| 2.0|(784,[155,156,157...|[941.267714528850...| 1496.157958984375| 8.0|
| 1.0|(784,[124,125,126...|[30.2848596410594...| 1327.6766357421875| 5.0|
| 3.0|(784,[151,152,153...|[1270.14374062052...| 1570.7674560546875| 0.0|
| 1.0|(784,[152,153,154...|[-112.10792566485...| 1037.568359375| 5.0|
| 4.0|(784,[134,135,161...|[452.068280676606...| 1165.1236572265625| 3.0|
| 3.0|(784,[123,124,125...|[610.596447285397...| 1325.953369140625| 7.0|
| 5.0|(784,[216,217,218...|[142.959601818422...| 1353.4930419921875| 5.0|
| 3.0|(784,[143,144,145...|[1036.71862533658...| 1460.4315185546875| 7.0|
| 6.0|(784,[72,73,74,99...|[996.740157435754...| 1159.8631591796875| 2.0|
| 1.0|(784,[151,152,153...|[-107.26076167417...| 960.963623046875| 5.0|
| 7.0|(784,[211,212,213...|[619.771820430940...| 1245.13623046875| 6.0|
| 2.0|(784,[151,152,153...|[850.152101817161...| 1304.437744140625| 8.0|
| 8.0|(784,[159,160,161...|[370.041887230547...| 1192.4781494140625| 0.0|
| 6.0|(784,[100,101,102...|[546.674328209335...| 1277.0908203125| 2.0|
| 9.0|(784,[209,210,211...|[-29.259112927426...| 1245.8182373046875| 6.0|
+-----+--------------------+--------------------+-------------------+---------------+
Note
To run the notebooks on a notebook instance, see Example Notebooks (p. 220). To run the
notebooks on Studio, see Create or Open an Amazon SageMaker Studio Notebook (p. 148).
Chainer
For a sample Jupyter notebook, see the Chainer example notebooks in the Amazon SageMaker
Examples GitHub repository.
For more information, see the SageMaker Chainer Container GitHub repository.
For information about supported Chainer versions, and for general information about writing Chainer
training scripts and using Chainer estimators and models with SageMaker, see Using Chainer with the
SageMaker Python SDK.
Hugging Face
To use the Hugging Face Deep Learning Containers with the SageMaker Python SDK for training, see the
Hugging Face SageMaker Estimator. With the Hugging Face Estimator, you can use the Hugging Face
models as you would any other SageMaker Estimator. However, using the SageMaker Python SDK is
optional. You can also orchestrate your use of the Hugging Face Deep Learning Containers with the AWS
CLI and AWS SDK for Python (Boto3).
For more information on Hugging Face and the models available in it, see the Hugging Face
documentation.
Training
To run training, you can use any of the thousands of models available in Hugging Face and fine-tune
them for your specific use case with additional training. With SageMaker, you can use standard training
or take advantage of SageMaker Distributed Data and Model Parallel training. As with other SageMaker
training jobs using custom code, you can capture your own metrics by passing a metrics definition to the
SageMaker Python SDK as shown in Defining Training Metrics (SageMaker Python SDK) . The captured
metrics are then accessible via CloudWatch and as a Pandas DataFrame via the TrainingJobAnalytics
method. Once your model is trained and fine-tuned, you can use it like any other model to run inference
jobs.
With the SageMaker Python SDK, you can run training jobs using the Hugging Face Estimator in the
following environments:
• SageMaker Studio: Amazon SageMaker Studio is the first fully integrated development environment
(IDE) for machine learning (ML). SageMaker Studio provides a single, web-based visual interface where
you can perform all ML development steps required to prepare, build, train and tune, deploy and
manage models. For information on using Jupyter Notebooks in Studio, see Use Amazon SageMaker
Studio Notebooks.
• SageMaker Notebook Instances: An Amazon SageMaker notebook instance is a machine learning
(ML) compute instance running the Jupyter Notebook App. This app lets you run Jupyter Notebooks
in your notebook instance to prepare and process data, write code to train models, deploy models
to SageMaker hosting, and test or validate your models without SageMaker Studio features like
Debugger, Model Monitoring, and a web-based IDE.
• Locally: If you have connectivity to AWS and have appropriate SageMaker permissions, you can use
the SageMaker Python SDK locally to launch remote training and inference jobs for Hugging Face in
SageMaker on AWS. This works on your local machine, as well as other AWS services with a connected
SageMaker Python SDK and appropriate permissions.
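A minimal Hugging Face Estimator sketch follows; the entry-point script, hyperparameters, role, and data location are placeholders, and the framework versions are examples — pick a combination from the supported Deep Learning Containers:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",                # your training script (placeholder)
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.26",           # example versions
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={
        "epochs": 3,
        "model_name_or_path": "distilbert-base-uncased",
    },
)

huggingface_estimator.fit({"train": "s3://my-bucket/train"})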
Inference
For inference, you can use your trained Hugging Face model or one of the pretrained Hugging Face
models to deploy an inference job with SageMaker. With this collaboration, you only need one line of
code to deploy both your trained models and pre-trained models with SageMaker. You can also run
inference jobs without having to write any custom inference code. With custom inference code, you can
customize the inference logic by providing your own Python script.
How to deploy an inference job using the Hugging Face Deep Learning
Containers
You have two options for running inference with SageMaker. You can run inference using a model that
you trained, or deploy a pre-trained Hugging Face model.
• Run inference with your trained model: You have two options for running inference with your own
trained model. You can run inference with a model that you trained using an existing Hugging Face
model with the SageMaker Hugging Face Deep Learning Containers, or you can bring your own existing
Hugging Face model and deploy it using SageMaker. When you run inference with a model that you
trained with the SageMaker Hugging Face Estimator, you can deploy the model immediately after
training completes or you can upload the trained model to an Amazon S3 bucket and ingest it when
running inference later. If you bring your own existing Hugging Face model, you must upload the
trained model to an Amazon S3 bucket and ingest that bucket when running inference as shown in
Deploy your Hugging Face Transformers for inference example.
• Run inference with a pre-trained Hugging Face model: You can use one of the thousands of pre-
trained Hugging Face models to run your inference jobs with no additional training needed. To run
inference, you select the pre-trained model from the list of Hugging Face models, as outlined in
Deploy pre-trained Hugging Face Transformers for inference example.
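For the second option, deploying a model straight from the Hugging Face Hub can look like the following sketch; the model ID, task, role, and versions are examples:

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # Hub model ID
        "HF_TASK": "text-classification",
    },
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    transformers_version="4.26",   # example versions
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

print(predictor.predict({"inputs": "SageMaker makes this deployment simple."}))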
I want to train and deploy a text classification model using Hugging Face in SageMaker with PyTorch.
For a sample Jupyter Notebook, see the PyTorch Getting Started Demo.
I want to train and deploy a text classification model using Hugging Face in SageMaker with TensorFlow.
For a sample Jupyter Notebook, see the TensorFlow Getting Started example.
I want to run distributed training with data parallelism using Hugging Face and SageMaker Distributed.
For a sample Jupyter Notebook, see the Training with Custom Metrics example.
I want to train a distributed question-answering TensorFlow model using Hugging Face in SageMaker.
For a sample Jupyter Notebook, see the Distributed TensorFlow Training example.
I want to train a distributed summarization model using Hugging Face in SageMaker.
For a sample Jupyter Notebook, see the Distributed Summarization Training example.
I want to train an image classification model using Hugging Face in SageMaker.
For a sample Jupyter Notebook, see the Vision Transformer Training example.
I want to deploy my trained Hugging Face model in SageMaker.
For a sample Jupyter Notebook, see the Deploy your Hugging Face Transformers for inference
example.
I want to deploy a pre-trained Hugging Face model in SageMaker.
For a sample Jupyter Notebook, see the Deploy pre-trained Hugging Face Transformers for inference
example.
PyTorch
For a sample Jupyter notebook, see the PyTorch example notebook in the Amazon SageMaker Examples GitHub repository.
I have a PyTorch model that I trained in SageMaker, and I want to deploy it to a hosted endpoint.
For general information about writing PyTorch training scripts and using PyTorch estimators and models
with SageMaker, see Using PyTorch with the SageMaker Python SDK.
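A sketch of that deployment with the SageMaker Python SDK follows; the S3 model artifact path, inference script, role, and versions are placeholders:

from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/output/model.tar.gz",  # training job artifacts (placeholder)
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    entry_point="inference.py",    # script implementing model_fn/predict_fn (placeholder)
    framework_version="1.13",      # example versions
    py_version="py39",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)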
R
The examples are organized in three levels: Beginner, Intermediate, and Advanced. They start
from Getting Started with R on SageMaker, continue to end-to-end machine learning with R on
SageMaker, and then finish with more advanced topics such as SageMaker Processing with R script, and
Bring-Your-Own (BYO) R algorithm to SageMaker.
For information on how to bring your own custom R image to Studio, see Bring your own SageMaker
image (p. 169). For a similar blog article, see Bringing your own R environment to Amazon SageMaker
Studio.
R Kernel in SageMaker
SageMaker notebook instances support R using a pre-installed R kernel. Also, the R kernel has the
reticulate library, an R to Python interface, so you can use the features of SageMaker Python SDK from
within an R script.
• reticulate library: Provides an R interface to the Amazon SageMaker Python SDK. The reticulate package translates between R and Python objects.
• Wait until the status of the notebook is In Service, and then click Open Jupyter.
• Create a new notebook with R kernel from the list of available environments.
• When the new notebook is created, you should see an R logo in the upper right corner of the notebook
environment, and also R as the kernel under that logo. This indicates that SageMaker has successfully
launched the R kernel for this notebook.
• Alternatively, when you are in a Jupyter notebook, you can use Kernel menu, and then select R from
Change Kernel option.
Example Notebooks
Prerequisites
Getting Started with R on SageMaker: This sample notebook describes how you can develop R scripts
using Amazon SageMaker's R kernel. In this notebook you set up your SageMaker environment and
permissions, download the abalone dataset from the UCI Machine Learning Repository, do some basic
processing and visualization on the data, then save the data as .csv format to S3.
Beginner Level
SageMaker Batch Transform using R Kernel: This sample Notebook describes how to conduct a batch
transform job using SageMaker’s Transformer API and the XGBoost algorithm. The notebook also uses
the Abalone dataset.
Intermediate Level
Hyperparameter Optimization for XGBoost in R: This sample notebook extends the previous
beginner notebooks that use the abalone dataset and XGBoost. It describes how to do model tuning
with hyperparameter optimization. You will also learn how to use batch transform for batching
predictions, as well as how to create a model endpoint to make real-time predictions.
Amazon SageMaker Processing with R: SageMaker Processing lets you preprocess, post-process and
run model evaluation workloads. This example shows you how to create an R script to orchestrate a
Processing job.
Advanced Level
Train and Deploy Your Own R Algorithm in SageMaker: Do you already have an R algorithm, and you
want to bring it into SageMaker to tune, train, or deploy it? This example walks you through how to
customize SageMaker containers with custom R packages, all the way to using a hosted endpoint for
inference on your R-origin model.
Scikit-learn
Requirements
• Python 3.8
• NumPy 1.17.3
• SciPy 1.3.2
• joblib 1.1.1
• threadpoolctl 2.0.0
The following Scikit-learn versions are supported, each with its corresponding Python version:
Scikit-learn version    Python version
1.2-1                   3.8
1.0-1                   3.7
0.23-1                  3.6
For general information about writing Scikit-learn training scripts and using Scikit-learn estimators and
models with SageMaker, see Using Scikit-learn with the SageMaker Python SDK.
I want to use Scikit-learn for data processing, feature engineering, or model evaluation in SageMaker.
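As a hedged sketch of this workflow with the SageMaker Python SDK, the following runs a Scikit-learn Processing job; the script name and S3 paths are placeholders.

import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = sagemaker.get_execution_role()
processor = SKLearnProcessor(
    framework_version="1.2-1",  # one of the supported container versions above
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
# preprocess.py is a placeholder for your own processing script.
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://my-bucket/processed/")],
)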
SparkML Serving
For information about using the SparkML Serving container to deploy models to SageMaker, see
SageMaker Spark ML Container GitHub repository. For information about the Amazon SageMaker
Python SDK SparkML Serving model and predictors, see the SparkML Serving Model and Predictor API
documentation.
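For orientation, the following is a minimal sketch of deploying a serialized Spark ML pipeline with the SageMaker Python SDK; it assumes the pipeline was exported with MLeap and uploaded to S3, and the artifact path and sample payload are placeholders.

import sagemaker
from sagemaker.sparkml.model import SparkMLModel

role = sagemaker.get_execution_role()
# model.tar.gz is assumed to contain an MLeap-serialized Spark ML pipeline.
sparkml_model = SparkMLModel(
    model_data="s3://my-bucket/sparkml/model.tar.gz",
    role=role,
)
predictor = sparkml_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
# The endpoint accepts CSV rows that match the pipeline's input schema.
print(predictor.predict("3.0,1.0,2.0"))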
TensorFlow
For a sample Jupyter notebook, see TensorFlow script mode training and serving.
For general information about writing TensorFlow script mode training scripts and using TensorFlow
script mode estimators and models with SageMaker, see Using TensorFlow with the SageMaker Python
SDK.
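A minimal script mode sketch with the SageMaker Python SDK follows; the entry point script, framework version, instance types, and S3 paths are example values.

import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()
estimator = TensorFlow(
    entry_point="train.py",  # your script mode training script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)
estimator.fit("s3://my-bucket/training-data/")
# Deploy the trained model behind a real-time endpoint served by TensorFlow Serving.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")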
Use TensorFlow legacy mode scripts only if either of the following applies:
• You have existing legacy mode scripts that you do not want to convert to script mode.
• You want to use a TensorFlow version earlier than 1.11.
For information about writing legacy mode TensorFlow scripts to use with the SageMaker Python SDK,
see TensorFlow SageMaker Estimators and Models.
Triton Inference Server
Triton Inference Server containers include NVIDIA Triton Inference Server, support for common ML frameworks, and useful
environment variables that let you optimize performance on SageMaker. For a list of all available Deep
Learning Containers images, see Available Deep Learning Containers Images. Deep Learning Containers
images are maintained and regularly updated with security patches.
You can use the Triton Inference Server Container with SageMaker Python SDK as you would any other
container in your SageMaker models. However, using the SageMaker Python SDK is optional. You can use
Triton Inference Server Containers with the AWS CLI and AWS SDK for Python (Boto3).
For more information on NVIDIA Triton Inference Server, see the Triton documentation.
Inference
Note
The Triton Python backend uses shared memory (SHMEM) to connect your code to Triton.
SageMaker Inference provides up to half of the instance memory as SHMEM so you can use an
instance with more memory for larger SHMEM size.
For inference, you can use your trained ML models with Triton Inference Server to deploy an inference
job with SageMaker.
• Support for multiple frameworks: Triton can be used to deploy models from all major ML
frameworks. Triton supports TensorFlow GraphDef and SavedModel, ONNX, PyTorch TorchScript,
TensorRT, and custom Python/C++ model formats.
• Model pipelines: A Triton model ensemble represents a pipeline of one or more models, along with
pre/post-processing logic and the connection of input and output tensors between them. A single
inference request to an ensemble triggers the execution of the entire pipeline.
• Concurrent model execution: Multiple instances of the same model can run simultaneously on the
same GPU or on multiple GPUs.
• Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and
batching algorithms that combine individual inference requests together to improve inference
throughput. These scheduling and batching decisions are transparent to the client requesting
inference.
• Diverse CPU and GPU support: The models can be executed on CPUs or GPUs for maximum flexibility
and to support heterogeneous computing requirements.
For a sample Jupyter Notebook, see the Deploy your PyTorch Resnet50 model with Triton Inference
Server example.
I want to deploy my trained Hugging Face model in SageMaker.
For a sample Jupyter Notebook, see the Deploy your PyTorch BERT model with Triton Inference
Server example.
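A hedged sketch of deploying a Triton model with the SageMaker Python SDK follows. The image URI placeholder stands for the region-specific SageMaker Triton Inference Server container, and the model artifact is assumed to be a packaged Triton model repository; both values are illustrative.

import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()
# Placeholder: substitute the Triton Inference Server image URI for your Region.
triton_image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>"
model = Model(
    image_uri=triton_image_uri,
    model_data="s3://my-bucket/triton/model.tar.gz",  # packaged Triton model repository
    role=role,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # GPU instance; Triton also supports CPU instances
)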
Supported Regions and Quotas
For a list of the SageMaker service endpoints for each Region, see Amazon SageMaker endpoints and
quotas in the AWS General Reference.
Quotas
For a list of SageMaker quotas, see Amazon SageMaker endpoints and quotas in the AWS General
Reference.
The Service Quotas console provides information about your service quotas. You can use the Service
Quotas console to view your default service quotas or to request quota increases. To request a quota
increase for adjustable quotas, see Requesting a quota increase.
You can set up a quota request template for your AWS Organization that automatically requests quota
increases during account creation. For more information, see Using Service Quotas request templates.
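If you prefer to inspect quotas programmatically, the following sketch uses the AWS SDK for Python (Boto3) and assumes your credentials and Region are already configured.

import boto3

client = boto3.client("service-quotas")
# List the SageMaker quotas applied to the current account and Region.
paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        print(quota["QuotaName"], quota["Value"])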
Set Up Amazon SageMaker Prerequisites
Amazon SageMaker Studio Lab does not require an AWS account or IAM integration.
After you complete these tasks, continue to one of the following topics, depending on your use case.
• Onboard to Amazon SageMaker Domain (p. 37): Follow these steps to create a Domain, which gives
you access to Amazon SageMaker Studio and RStudio on Amazon SageMaker. For more information
about Domains, see Amazon SageMaker Domain (p. 105).
• SageMaker JumpStart (p. 47): Follow these steps to start working with SageMaker JumpStart
and learn about SageMaker features and capabilities through curated one-click solutions, example
notebooks, and pretrained models that you can deploy. To use SageMaker JumpStart, which is a
feature of Amazon SageMaker Studio, you must first onboard to an Amazon SageMaker Domain.
• Get Started with Amazon SageMaker Notebook Instances (p. 87): Follow these steps to train and
deploy Machine Learning (ML) models using SageMaker notebook instances. SageMaker notebook
instances help create the environment by initiating Jupyter servers on Amazon Elastic Compute Cloud
(Amazon EC2) and providing preconfigured kernels. For more information, see Amazon SageMaker
Notebook Instances (p. 204).
• Amazon SageMaker Studio Lab (p. 230): Follow these steps to start working with Amazon SageMaker
Studio Lab. Studio Lab is a free service that gives you access to AWS compute resources, in an
environment based on open-source JupyterLab, without requiring an AWS account.
Topics
• Set Up Amazon SageMaker Prerequisites (p. 35)
• Onboard to Amazon SageMaker Domain (p. 37)
• SageMaker JumpStart (p. 47)
• Get Started with Amazon SageMaker Notebook Instances (p. 87)
If you're new to SageMaker, we recommend that you read How Amazon SageMaker Works (p. 2).
Topics
• Create an AWS Account (p. 35)
• Create an Administrative User and Group (p. 36)
• AWS CLI Prerequisites (p. 37)
When you sign up for Amazon Web Services (AWS), your AWS account is automatically signed up for all
AWS services, including SageMaker. You are charged only for the services that you use.
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.
When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.
Write down your AWS account ID because you'll need it for the next task.
Create an Administrative User and Group
We strongly recommend that you not use the root user for everyday tasks, even the administrative
ones. Instead, adhere to the Security best practices in IAM, and create an administrative user. Then
securely lock away the root user credentials and use them to perform only a few account and service
management tasks.
1. Create an administrative user in your AWS account. For instructions, see Create an administrative
user in the IAM User Guide.
Note
We assume that you use administrator user credentials for the exercises and procedures
in this guide. If you choose to create and use another user, grant that user minimum
permissions. For more information, see Authenticating with Identities (p. 3049).
2. Ensure that your administrator user has the AmazonSageMakerFullAccess policy, as well as a policy
with the following content needed to create a SageMaker domain. For more information about
creating IAM policies, see Creating IAM policies.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:*"
],
"Resource": [
"arn:aws:sagemaker:*:*:domain/*",
"arn:aws:sagemaker:*:*:user-profile/*",
"arn:aws:sagemaker:*:*:app/*",
"arn:aws:sagemaker:*:*:flow-definition/*"
]
},
{
"Effect": "Allow",
"Action": [
"iam:GetRole",
"servicecatalog:*"
],
"Resource": [
"*"
]
}
]
}
AWS CLI Prerequisites
• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.
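To confirm that your credentials are set up correctly, you can make a simple call with the AWS SDK for Python (Boto3), as in the following sketch.

import boto3

# Returns the account and identity associated with your configured credentials.
print(boto3.client("sts").get_caller_identity()["Account"])
# Lists any existing SageMaker Domains in the current Region.
print(boto3.client("sagemaker").list_domains()["Domains"])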
When onboarding, you can choose either AWS IAM Identity Center (successor to AWS Single Sign-On)
(IAM Identity Center) or AWS Identity and Access Management (IAM) as your authentication method.
When you use IAM authentication, you can choose either the Quick setup or the Standard setup
procedure. RStudio setup is only available when using the Standard setup procedure.
Note
If you onboard using IAM authentication and want to switch to authentication using IAM
Identity Center later, you must delete the Domain that you created. Then, you need to manually
re-import all notebooks and other user data that you created. For more information, see Delete
an Amazon SageMaker Domain (p. 116).
The simplest way to create an Amazon SageMaker Domain is to follow the Quick setup procedure from
the SageMaker console. Quick setup uses the same default settings as the Standard setup procedure.
These settings include shareable notebooks and public internet access. For more control, including
the option of IAM Identity Center authentication and RStudio, use the Standard setup procedure.
To use IAM Identity Center authentication with Studio and RStudio, you must onboard to an AWS
Organizations organization.
Note
The AWS Organizations account must be in the same AWS Region as Studio and RStudio.
Authentication using IAM Identity Center provides the following benefits over IAM authentication:
• Members given access to Studio have a unique sign-in URL that directly opens Studio, and they sign in
with their IAM Identity Center credentials. When you use IAM authentication, you must sign in through
the SageMaker console.
• Organizations manage their members in IAM Identity Center instead of the Domain. You can assign
multiple members access to the Domain at the same time. When you use IAM authentication, you must
add and manage members manually, one at a time, using the Domain Control Panel.
Topics
• Onboard to Amazon SageMaker Domain Using Quick setup (p. 38)
• Onboard to Amazon SageMaker Domain Using IAM Identity Center (p. 39)
• Onboard to Amazon SageMaker Domain Using IAM (p. 43)
• Choose an Amazon VPC (p. 46)
Onboard Using Quick setup
RStudio support is not currently available when onboarding using the Quick setup procedure.
For information on how to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On)
(IAM Identity Center), see Onboard Using IAM Identity Center (p. 39).
If you choose Enter a custom IAM role ARN, the role must have, at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).
If you choose Create a new role, the Create an IAM role dialog opens:
• For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
• Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.
If you choose Create role using the role creation wizard, the Amazon SageMaker Role Manager
page opens. For more information about using SageMaker Role Manager, see Amazon SageMaker
Role Manager (p. 3108).
8. Turn on Enable SageMaker Canvas permissions (by default this option is turned on).
9. Choose Submit.
10. From the pop-up window, select an Amazon Virtual Private Cloud (Amazon VPC) and subnet to use.
11. Choose Save and continue.
Note
If you receive an error message that you need to create an Amazon VPC, see Choose an
Amazon VPC (p. 46).
When Status is Ready, the user name that you specified is enabled.
Now that you've onboarded to the Domain, you can launch an app following the steps in Launch Amazon
SageMaker Studio (p. 133). For information about adding users to your Domain, see Add and Remove
User Profiles (p. 119).
For information about using SageMaker Studio, see SageMaker Studio (p. 128).
Onboard Using IAM Identity Center
For information about setting up IAM Identity Center for use with Domain, see Set Up IAM Identity
Center for use with Amazon SageMaker Domain (p. 42).
4. Under Permission, for Default execution role, choose an option from the role selector.
If you choose Enter a custom IAM role ARN, the role must have, at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).
If you choose Create a new role, the Create an IAM role dialog opens:
a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.
5. Under Network and storage, specify the following:
• Your Amazon Virtual Private Cloud (Amazon VPC) information – For more information, see Choose
an Amazon VPC (p. 46).
• (Optional) Encryption key – SageMaker uses an AWS KMS key to encrypt your Amazon Elastic File
System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) file systems. By default, it
uses an AWS managed key. To use a customer managed key, enter its key ID or Amazon Resource
Name (ARN). For more information, see Protect Data at Rest Using Encryption (p. 3043).
Note
Encryption in transit is only available for Amazon SageMaker Studio.
6. Select Next.
1. Under Default JupyterLab version, select a JupyterLab version from the dropdown to use as
the default for your Domain. For information on selecting a JupyterLab version, see JupyterLab
Versioning (p. 135).
2. Under Notebook Sharing Configuration, accept the default notebook sharing configuration or
customize the options.
3. Under SageMaker Projects and JumpStart, accept the default Project and JumpStart settings,
or customize whether administrators and users can create projects and use JumpStart. For more
information, see SageMaker Studio Permissions Required to Use Projects (p. 2806).
4. Select Next.
1. Under RStudio Workbench, verify that your RStudio license is automatically detected. For more
information about getting an RStudio license and activating it with SageMaker, see RStudio
license (p. 435).
2. Select an instance type to launch your RStudio Server on. For more information, see
RStudioServerPro instance type (p. 437).
3. Under Permission, create your role or select an existing role. The role must have the following
permissions policy. This policy allows the RStudioServerPro app to access necessary resources and
allows Amazon SageMaker to automatically launch an RStudioServerPro app when the existing
RStudioServerPro app is in a Deleted or Failed status. For information about adding permissions
to a role, see Modifying a role permissions policy (console).
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"license-manager:ExtendLicenseConsumption",
"license-manager:ListReceivedLicenses",
"license-manager:GetLicense",
"license-manager:CheckoutLicense",
"license-manager:CheckInLicense",
"logs:CreateLogDelivery",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:DeleteLogDelivery",
"logs:Describe*",
"logs:GetLogDelivery",
"logs:GetLogEvents",
"logs:ListLogDeliveries",
"logs:PutLogEvents",
"logs:PutResourcePolicy",
"logs:UpdateLogDelivery",
"sagemaker:CreateApp"
],
"Resource": "*"
}
]
}
4. Under RStudio Connect, add the URL for your RStudio Connect server. RStudio Connect is a
publishing platform for Shiny applications, R Markdown reports, dashboards, plots, and more.
When you onboard to RStudio on SageMaker, an RStudio Connect server is not created. For more
information, see RStudio Connect URL (p. 438).
5. Under RStudio Package Manager, add the URL for your RStudio Package Manager. SageMaker
creates a default package repository for the Package Manager when you onboard RStudio. For more
information about RStudio Package Manager, see RStudio Package Manager (p. 438).
6. Select Next.
1. For the Canvas base permissions configuration, leave the Enable Canvas base permissions option
turned on (it is turned on by default). This establishes the minimum required permissions to use the
SageMaker Canvas app.
2. (Optional) For the Time series forecasting configuration, leave the Enable time series forecasting
option turned on to give your users permissions to do time series forecasting in SageMaker Canvas
(it is turned on by default).
3. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role. However, if you already have an IAM role with the required Amazon Forecast
permissions attached, select Use an existing execution role. For more information, see the IAM role
setup method (p. 278).
4. Use the default IAM role suffix or provide a custom suffix for the role.
5. For Local file upload configuration, select Enable local file upload to enable users to upload local
files into their SageMaker Canvas application (it's already checked by default).
6. Choose Submit.
1. Create an execution role that is used to create a Domain and attach the
AmazonSageMakerFullAccess policy. You can also use an existing role that has, at a minimum, an
attached trust policy that grants SageMaker permission to assume the role. For more information,
see SageMaker Roles (p. 3086).
2. Get the default Amazon Virtual Private Cloud (Amazon VPC) of your account.
3. Get the subnets of the default Amazon VPC.
4. Create a Domain by passing the default Amazon VPC ID, subnets, and execution role ARN. You must
also pass a SageMaker image ARN. For information on the available JupyterLab version ARNs, see
Setting a default JupyterLab version (p. 137).
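A sketch of these steps with the AWS SDK for Python (Boto3) follows; the Domain name, role ARN, VPC ID, subnet ID, and SageMaker image ARN are placeholders.

import boto3

sm = boto3.client("sagemaker")
response = sm.create_domain(
    DomainName="my-domain",
    AuthMode="SSO",  # use "IAM" for IAM authentication instead
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/my-execution-role",
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                # Placeholder: ARN of the SageMaker image for your default JupyterLab version.
                "SageMakerImageArn": "arn:aws:sagemaker:<region>:<account>:image/<image-name>",
            }
        },
    },
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0"],
)
print(response["DomainArn"])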
After you are given access to the Domain, you are sent an email inviting you to create a password and
use IAM Identity Center. The email also contains the URL to sign in to the Domain. For more information
about signing in and session duration, see How to sign in to the user portal.
After you activate your account, go to the Domain URL, sign in, and wait for your user profile to be
created. On subsequent visits, you only need to wait for the Studio or RStudio app to load.
Bookmark the URL. The URL is also available on the Domain settings page.
For information about using Studio, see SageMaker Studio (p. 128).
For information about using RStudio, see RStudio on Amazon SageMaker (p. 432).
After you have created your organization and user, you can create a SageMaker user profile for that user
in IAM Identity Center as follows.
1. From the Amazon SageMaker console – You can use the Amazon SageMaker console to create a
user profile for the user in IAM Identity Center. If the user in IAM Identity Center hasn't already been
associated with the Domain, it is automatically associated.
2. Using the AWS CLI or AWS CloudFormation – A user in IAM Identity Center assigned to the Domain
can create a user profile using the SageMaker console, the AWS CLI or AWS CloudFormation.
• The user in IAM Identity Center, or a group in IAM Identity Center containing that user, must first
be assigned to the Domain from the IAM Identity Center console. For more information about
application assignment, see Assign user access.
• A user profile can then be created for the user in IAM Identity Center with the AWS CLI or AWS
CloudFormation.
Note
To simplify administration of access permissions, we recommend assigning groups in IAM
Identity Center to the Domain instead of assigning users in IAM Identity Center. Groups allow
permissions to be granted or denied to multiple users at once. A user can be moved out of a
group or to a different group if needed. When assigning user access to applications, IAM Identity
Center does not currently support users being added to nested groups. If a user is added to a
nested group, they may receive a "You do not have any applications" error message during sign-
in. Assignments must be made to the immediate group the user is a member of.
Return to the Domains page to continue onboarding with IAM Identity Center authentication.
Onboard Using IAM
For information on how to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On)
(IAM Identity Center), see Onboard Using IAM Identity Center (p. 39).
If you choose Enter a custom IAM role ARN, the role must have, at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).
If you choose Create a new role, the Create an IAM role dialog opens:
a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.
4. For Space default execution role, choose an option from the role selector.
If you choose Enter a custom IAM role ARN, the role must have, at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).
If you choose Create a new role, the Create an IAM role dialog opens:
a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.
5. Under Network and storage, specify the following:
• Your Amazon Virtual Private Cloud (Amazon VPC) information – For more information, see Choose
an Amazon VPC (p. 46).
• (Optional) Encryption key – SageMaker uses an AWS KMS key to encrypt your Amazon Elastic File
System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) file systems. By default, it
uses an AWS managed key. To use a customer managed key, enter its key ID or Amazon Resource
Name (ARN). For more information, see Protect Data at Rest Using Encryption (p. 3043).
Note
Encryption in transit is only available for Amazon SageMaker Studio.
6. Select Next.
1. Under Default JupyterLab version, select a JupyterLab version from the dropdown to use as
the default for your Domain. For information on selecting a JupyterLab version, see JupyterLab
Versioning (p. 135).
2. Under Notebook Sharing Configuration, accept the default notebook sharing configuration or
customize the options.
3. Under SageMaker Projects and JumpStart, accept the default Project and JumpStart settings,
or customize whether administrators and users can create projects and use JumpStart. For more
information, see SageMaker Studio Permissions Required to Use Projects (p. 2806).
4. Select Next.
1. Under RStudio Workbench, verify that your RStudio license is automatically detected. For more
information about getting an RStudio license and activating it with SageMaker, see RStudio
license (p. 435).
2. Select an instance type to launch your RStudio Server on. For more information, see
RStudioServerPro instance type (p. 437).
3. Under Permission, create your role or select an existing role. The role must have the following
permissions policy. This policy allows the RStudioServerPro app to access necessary resources and
allows Amazon SageMaker to automatically launch an RStudioServerPro app when the existing
RStudioServerPro app is in a Deleted or Failed status. For information on adding permissions to a
role, see Modifying a role permissions policy (console).
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"license-manager:ExtendLicenseConsumption",
"license-manager:ListReceivedLicenses",
"license-manager:GetLicense",
"license-manager:CheckoutLicense",
"license-manager:CheckInLicense",
"logs:CreateLogDelivery",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:DeleteLogDelivery",
"logs:Describe*",
"logs:GetLogDelivery",
"logs:GetLogEvents",
"logs:ListLogDeliveries",
"logs:PutLogEvents",
"logs:PutResourcePolicy",
"logs:UpdateLogDelivery",
"sagemaker:CreateApp"
],
"Resource": "*"
}
]
}
4. Under RStudio Connect, add the URL for your RStudio Connect Server. RStudio Connect is a
publishing platform for Shiny applications, R Markdown reports, dashboards, plots, and more. When
you onboard to RStudio on Amazon SageMaker, an RStudio Connect server is not created. You must
create an RStudio Connect server on an EC2 instance to use Connect with Amazon SageMaker. For
more information, see RStudio Connect URL (p. 438).
5. Under RStudio Package Manager, add the URL for your RStudio Package Manager. SageMaker
creates a default package repository for the Package Manager when you onboard RStudio. For more
information about RStudio Package Manager, see RStudio Package Manager (p. 438).
6. Select Next.
1. For the Canvas base permissions configuration, leave the Enable Canvas base permissions option
turned on (it is turned on by default). This establishes the minimum required permissions to use the
SageMaker Canvas app.
2. (Optional) For the Time series forecasting configuration, leave the Enable time series forecasting
option turned on to give your users permissions to do time series forecasting in SageMaker Canvas
(it is turned on by default).
3. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role, or select Use an existing execution role if you already have an IAM role with the
required Amazon Forecast permissions attached (for more information, see the IAM role setup
method (p. 278)).
4. Use the default IAM role suffix or provide a custom suffix for the role.
5. For Local file upload configuration, select Enable local file upload to enable users to upload local
files into their SageMaker Canvas application (it's already checked by default).
6. Choose Submit.
1. Create an execution role that is used to create a Domain and attach the
AmazonSageMakerFullAccess policy. You can also use an existing role that has, at a minimum, an
attached trust policy that grants SageMaker permission to assume the role. For more information,
see SageMaker Roles (p. 3086).
2. Get the default Amazon Virtual Private Cloud (Amazon VPC) of your account.
3. Get the subnets of the default Amazon VPC.
4. Create a Domain by passing the default Amazon VPC ID, subnets, and execution role ARN. You must
also pass a SageMaker image ARN. For information on the available JupyterLab version ARNs, see
Setting a default JupyterLab version (p. 137).
For information about using Amazon SageMaker Studio, see SageMaker Studio (p. 128).
For information about using RStudio, see RStudio on Amazon SageMaker (p. 432).
Choose an Amazon VPC
By default, SageMaker Domain uses two Amazon VPCs. One Amazon VPC is managed by Amazon
SageMaker and provides direct internet access. You specify the other Amazon VPC, which provides
encrypted traffic between the Domain and your Amazon Elastic File System (Amazon EFS) volume.
You can change this behavior so that SageMaker sends all traffic over your specified Amazon VPC. When
you choose this option, you must provide the subnets, security groups, and interface endpoints that are
necessary to communicate with the SageMaker API and SageMaker runtime, and various AWS services,
such as Amazon Simple Storage Service (Amazon S3) and Amazon CloudWatch, that are used by Amazon
SageMaker Studio and your Studio notebooks.
When you onboard to SageMaker Domain, you tell SageMaker to send all traffic over your Amazon VPC
by setting the network access type to VPC only.
• No entities – You must create one or more entities in order to use Domain. Choose Create <entity> to
open the VPC console in a new browser tab. After you create the entities, return to the Domain Get
started page to continue the onboarding process.
This procedure is part of the Amazon SageMaker Domain onboarding process when you choose Standard
setup. Your Amazon VPC information is specified under the Network section.
• Public internet only – Non-Amazon EFS traffic goes through a SageMaker managed Amazon VPC,
which allows internet access. Traffic between the Domain and your Amazon EFS volume is through
the specified Amazon VPC.
• VPC only – All SageMaker traffic is through the specified Amazon VPC and subnets. You must use
a subnet that does not have direct internet access in VPC only mode. Internet access is disabled by
default.
4. Choose the security groups. If you chose Public internet only, this step is optional. If you chose VPC
only, this step is required.
Note
For the maximum number of allowed security groups, see UserSettings.
For Amazon VPC requirements in VPC only mode, see Connect SageMaker Studio Notebooks in a VPC to
External Resources (p. 3209).
SageMaker JumpStart
SageMaker JumpStart provides pretrained, open-source models for a wide range of problem types to
help you get started with machine learning. You can incrementally train and tune these models before
deployment. JumpStart also provides solution templates that set up infrastructure for common use
cases, and executable example notebooks for machine learning with SageMaker.
You can access the pretrained models, solution templates, and examples through the JumpStart landing
page in Amazon SageMaker Studio. The following steps show how to access JumpStart models and
solutions using Amazon SageMaker Studio.
You can also access JumpStart models using the SageMaker Python SDK. For information about how
to use JumpStart models programmatically, see Use SageMaker JumpStart Algorithms with Pretrained
Models.
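For example, the following sketch deploys a JumpStart model with the SageMaker Python SDK; the model ID is one example from the model table, and the request format varies by model.

from sagemaker.jumpstart.model import JumpStartModel

# Example model ID; look up others in the JumpStart available model table.
model = JumpStartModel(model_id="huggingface-text2text-flan-t5-base")
predictor = model.deploy()
# FLAN-T5 text2text models accept an "inputs" field; other models differ.
print(predictor.predict({"inputs": "Translate to German: How are you?"}))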
Open and use JumpStart
Open JumpStart
In Amazon SageMaker Studio, open the JumpStart landing page either through the Home page or the
Home menu on the left-side panel.
The Launch quick start assets page lists your currently launched solutions, deployed model
endpoints, and training jobs created with Quick start. You can access the JumpStart landing page
from this tab by clicking on the Browse Quick start solutions button at the top right of the tab.
The JumpStart landing page lists available end-to-end machine learning solutions, pretrained models,
and example notebooks. From any individual solution or model page, you can choose the Browse
JumpStart button at the top right of the tab to return to the SageMaker JumpStart page.
Important
Before downloading or using third-party content: You are responsible for reviewing and
complying with any applicable license terms and making sure that they are acceptable for your
use case.
Use JumpStart
From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and
other resources.
You can find JumpStart resources by using the search bar, or by browsing each category. Use the tabs to
filter the available solutions by categories:
• Solutions – In one step, launch comprehensive machine learning solutions that tie SageMaker to other
AWS services. Select Explore All Solutions to view all available solutions.
• ML tasks – Find a model by problem type (e.g., Image Classification, Image Embedding, Object
Detection, Text Generation). Select Explore All Models to view all available models.
• Data types – Find a model by data type (e.g., Vision, Text, Tabular, Audio). Select Explore All Models to
view all available models.
• Notebooks – Find example notebooks that use SageMaker features across multiple model types and
use cases. Select Explore All Notebooks to view all available example notebooks.
• Frameworks – Find a model by framework (e.g., PyTorch, TensorFlow, Hugging Face).
• Resources – Use example notebooks, blogs, and video tutorials to learn about SageMaker and get a
head start on your problem types.
• Blogs – Read details and solutions from machine learning experts.
• Video tutorials – Watch video tutorials for SageMaker features and machine learning use cases from
machine learning experts.
• Example notebooks – Run example notebooks that use SageMaker features like Spot Instance
training and experiments over a large variety of model types and use cases.
Manage JumpStart
From the Home menu in the left panel, navigate to Quick start solutions, then choose Launched Quick
start assets to list your currently launched solutions, deployed model endpoints, and training jobs
created with Quick start.
Topics
• Solution Templates (p. 50)
• JumpStart Foundation Models (p. 58)
• Task-Specific Models (p. 66)
• Shared Models and Notebooks (p. 79)
• Amazon SageMaker JumpStart Industry: Financial (p. 83)
Solution Templates
SageMaker JumpStart provides one-click, end-to-end solutions for many common machine learning use
cases. Explore the following use cases for more information on available solution templates.
Choose the solution template that best fits your use case from the JumpStart landing page. When you
choose a solution template, JumpStart opens a new tab showing a description of the solution and a
Launch button. When you select Launch, JumpStart creates all of the resources that you need to run the
solution, including training and model hosting instances. For more information on launching a JumpStart
solution, see the section called “Launch a Solution” (p. 56).
After launching the solution, you can explore solution features and any generated artifacts in JumpStart.
Use the Launched Quick start assets menu to find your solution. In your solution's tab, select Open
Notebook to use provided notebooks and explore the solution’s features. When artifacts are generated
during launch or after running the provided notebooks, they're listed in the Generated Artifacts table.
You can delete individual artifacts with the trash icon. You can delete all of the solution's resources
by choosing Delete solution resources.
Demand forecasting
Demand forecasting uses historical time series data to estimate future customer demand over a specific
period and streamline supply-demand decision making across businesses.
Demand forecasting use cases include predicting ticket sales in the transportation industry, stock prices,
number of hospital visits, number of customer representatives to hire for multiple locations in the next
month, product sales across multiple regions in the next quarter, cloud server usage for the next day for
a video streaming service, electricity consumption for multiple regions over the next week, metrics from
IoT devices and sensors such as energy consumption, and more.
Time series data is categorized as univariate and multi-variate. For example, the total electricity
consumption for a single household is a univariate time series over a period of time. When multiple
univariate time series are stacked on each other, it’s called a multi-variate time series. For example, the
total electricity consumption of 10 different (but correlated) households in a single neighborhood make
up a multi-variate time series dataset.
Fraud detection
Many businesses lose billions annually to fraud. Machine learning based fraud detection models can help
systematically identify likely fraudulent activities from a tremendous amount of data. The following
solutions use transaction and user identity datasets to identify fraudulent transactions.
Computer vision
With the rise of business use cases such as autonomous vehicles, smart video surveillance, healthcare
monitoring and various object counting tasks, fast and accurate object detection systems are rising
in demand. These systems involve not only recognizing and classifying every object in an image, but
localizing each one by drawing the appropriate bounding box around it. In the last decade, the rapid
advances of deep learning techniques greatly accelerated the momentum of object detection.
Object detection for bird species – Identify bird species in a scene using a SageMaker object detection
model. Find in Amazon SageMaker Studio.
Predictive maintenance
Predictive maintenance aims to optimize the balance between corrective and preventative maintenance
by facilitating the timely replacement of components. The following solutions use sensor data from
industrial assets to predict machine failures, unplanned downtime, and repair costs.
Churn prediction
Customer churn, or rate of attrition, is a costly problem faced by a wide range of companies. In an effort
to reduce churn, companies can identify customers that are likely to leave their service in order to focus
their efforts on customer retention. Use a JumpStart churn prediction solution to analyze data sources
such as user behavior and customer support chat logs to identify customers that are at a high risk of
cancelling a subscription or service.
Churn prediction for mobile phone customers – Identify unhappy mobile phone customers using
SageMaker XGBoost. Find in Amazon SageMaker Studio.
Personalized recommendations
You can use JumpStart solutions to analyze customer identity graphs or user sessions to
better understand and predict customer behavior. Use the following solutions for personalized
recommendations to model customer identity across multiple devices, to determine the likelihood of
a customer making a purchase, or to create a custom movie recommender based on past customer
behavior.
Reinforcement learning
Reinforcement learning (RL) is a type of learning that is based on interaction with the environment. This
type of learning is used by an agent that must learn behavior through trial-and-error interactions with
a dynamic environment in which the goal is to maximize the long-term rewards that the agent receives
as a result of its actions. Rewards are maximized by trading off exploring actions that have uncertain
rewards with exploiting actions that have known rewards.
RL is well-suited for solving large, complex problems, such as supply chain management, HVAC systems,
industrial robotics, game artificial intelligence, dialog systems, and autonomous vehicles.
Financial pricing
Many businesses dynamically adjust pricing on a regular basis in order to maximize their returns. Use
the following JumpStart solutions for price optimization, dynamic pricing, option pricing, or portfolio
optimization use cases.
Causal inference
Researchers can use machine learning models such as Bayesian networks to represent causal
dependencies and draw causal conclusions based on data. Use the following JumpStart solution to
understand the causal relationship between Nitrogen-based fertilizer application and corn crop yields.
Launch a Solution
First, choose a solution through the SageMaker JumpStart landing page in the Amazon SageMaker
Studio UI. For information on the onboarding steps to sign in to Amazon SageMaker Studio, see Onboard
to Amazon SageMaker Domain. For details on getting to the SageMaker JumpStart landing page, see
Open and use JumpStart (p. 47).
After you choose a solution, a solution's tab opens showing a description of the solution and a Launch
button. To launch a solution, select Launch in the Launch Solution section. JumpStart then creates all
of the resources needed to run the solution. This includes training and model hosting instances.
Advanced parameters
The solution that you choose may have advanced parameters that you can select. Choose Advanced
Parameters to specify the AWS Identity and Access Management role for the solution.
Solutions are able to launch resources across 9 AWS services that interact with each other. For the
solution to work as expected, newly created components from one service must be able to act on newly
created components from another service. We recommend that you use the default IAM role to ensure
that all needed permissions are added. For more information about IAM roles, see Identity and Access
Management for Amazon SageMaker (p. 3048).
If you select this option, the default IAM roles that are required by this solution are used. Each solution
requires different resources. The following list describes the default roles that are used for the solutions
based on the service needed. For a description of the permissions required for each service, see AWS
Managed Policies for SageMaker projects and JumpStart (p. 3172).
If you are using a new SageMaker Domain with JumpStart project templates enabled, these roles are
automatically created in your account.
If you are using an existing SageMaker domain, these roles may not exist in your account. If this is the
case, you will receive the following error when launching the solution.
Unable to locate the updated roles required to launch this solution, a general role '/
service-role/AmazonSageMakerServiceCatalogProductsUseRole' will be used. Please update your
studio domain to generate these roles.
You can still launch a solution without the needed role, but the legacy default role
AmazonSageMakerServiceCatalogProductsUseRole is used in place of the needed role. The legacy
default role has trust relationships with all of the services that JumpStart solutions need to interact with.
For the best security, we recommend that you update your domain to have the newly created default
roles for each AWS service.
If you have already onboarded to a SageMaker domain, you can update your domain to generate the
default roles using the following procedure.
You should be able to see the default roles listed in Projects - Amazon SageMaker project templates
enabled for this account under the Apps - Studio tab.
If you select this option, you must select an existing IAM role from the dropdown list for each of the
required services. The selected role must have at least the minimum permissions required for the
corresponding service. For a description of the permissions required for each service, see AWS Managed
Policies for SageMaker projects and JumpStart (p. 3172).
If you select this option, you must manually enter the ARN for an existing IAM role. The selected role
must have at least the minimum permissions required for the corresponding service. For a description
of the permissions required for each service, see AWS Managed Policies for SageMaker projects and
JumpStart (p. 3172).
JumpStart Foundation Models
A foundation model is a large pre-trained model that is adaptable to many downstream tasks and often
serves as the starting point for developing more specialized models. Examples of foundation models
include AlexaTM, BLOOM, and FLAN, which are pre-trained on massive amounts of text data and can
be fine-tuned for specific language tasks. Amazon SageMaker JumpStart onboards and maintains open
source, community, and third-party foundation models for you to access, customize, and integrate into
your machine learning lifecycles.
To get started exploring and experimenting with available models, see How to use JumpStart
foundation models (p. 60). All foundation models are available to use programmatically with the
SageMaker Python SDK. For more information, see Use foundation models with the SageMaker Python
SDK (p. 60).
For more information on considerations to make when choosing a model, see Choose a foundation
model (p. 61).
For specifics about customization and fine-tuning foundation models, see Customize a foundation
model (p. 63).
For more general information on foundation models, see the paper On the Opportunities and Risks of
Foundation Models.
To get started with one of these featured models, see How to use JumpStart foundation
models (p. 60) or explore one of the available Example notebooks (p. 59). In a given example
notebook, try switching out the model ID to experiment with different models within the same model
family.
Example notebooks
For step-by-step examples on how to use JumpStart foundation models with the SageMaker Python
SDK, refer to the following notebooks on text generation, image generation, and model customization.
If a notebook is associated with a specific foundation model, you can find the foundation model in
SageMaker Studio, navigate to the Run in notebook section, and choose Open notebook.
Alternatively, for instructions on how to create and access Jupyter notebook instances that you can
use to run the example in SageMaker, see Amazon SageMaker Notebook Instances. After you have
created a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of
the SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.
Text generation
Explore text generation example notebooks, including guidance on general text generation workflows,
multilingual text classification, real-time batch inference, few-shot learning, chatbot interactions, and
more.
Image generation
Get started with text-to-image Stable Diffusion models, learn how to run image generation inference,
and experiment with a simple workflow to generate images of your dog.
Model customization
Sometimes your use case requires greater foundation model customization for specific tasks. For more
information on model customization approaches, see Customize a foundation model (p. 63) or
explore one of the following example notebooks.
• SageMaker JumpStart Foundation Models - Fine-tuning text generation GPT-J 6B model on domain
specific dataset
• SageMaker JumpStart Foundation Models - HuggingFace Text2Text Instruction Fine-Tuning
• Retrieval-Augmented Generation: Question Answering based on Custom Dataset
• Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced
LangChain Library
To reference available model IDs, see the Built-in Algorithms with pre-trained Model Table. Search for
the name of the foundation model of your choice in the Search bar, change the number of entries shown
using the Show entries dropdown menu, or choose the Next text highlighted in blue on the lefthand
side of the page to navigate through the available models.
For example notebooks with detailed steps on using JumpStart foundation models with the SageMaker
Python SDK, see Example notebooks (p. 59).
In the SageMaker JumpStart section of the navigation pane, choose Models, notebooks, solutions.
Then, scroll down to find the Foundation Models section. You can choose a model from here by choosing
View model, or choose Explore All Foundation Models to see all available foundation models. If you
choose to see all available foundation models, you can further filter them by task, data type, content
type, or framework. You can also search for a model name in the Search bar. If you need guidance on
selecting a model, see Choose a foundation model (p. 61).
After you choose View model for the foundation model of your choice in Studio, you can deploy the
model. For more information, see Deploy a Model (p. 69). You can also choose Open notebook in
the Run in notebook section to run an example notebook for the foundation model directly in Studio.
If the model is fine-tunable, you can also fine-tune the model. For more information, see Fine-Tune
a Model (p. 76). For a list of which JumpStart foundation models are fine-tunable, see Fine-tune a
foundation model (p. 64).
Foundation models can be customized for multiple use cases. When choosing your foundation model,
start by defining a specific task.
Try out trending publicly available text generation and text-to-text models for the following tasks:
Try out some recommended text-to-image models for the following tasks:
Note
Proprietary foundation models are currently in preview. You need the
AmazonSageMakerFullAccess policy attached to your role to access proprietary foundation
models. If you don't have access to the proprietary models, choose Request access. You can reach out
to your account administrator or Amazon SageMaker JumpStart support for further details.
The recommended way to first customize a foundation model to a specific use case is through prompt
engineering. Providing your foundation model with well-engineered, context-rich prompts can help
achieve desired results without any fine-tuning or changing of model weights. For more information, see
Prompt engineering for foundation models (p. 63).
If prompt engineering alone is not enough to customize your foundation model to a specific task, you
can fine-tune a foundation model on additional domain-specific data. For more information, see Fine-
tune a foundation model (p. 64). The fine-tuning process involves changing model weights.
If you want to customize your model with information from a knowledge library without any retraining,
see Retrieval Augmented Generation (RAG) (p. 65).
Effective prompt engineering is crucial for directing model behavior and achieving desired responses.
Through prompt engineering, you can control a model’s tone, style, and domain expertise without
more involved customization measures like fine-tuning. We recommend dedicating time to prompt
engineering before you consider fine-tuning a model on additional data. The goal is to provide sufficient
context and guidance to the model so that it can generalize and perform well on unseen or limited data
scenarios.
Zero-shot learning
Zero-shot learning involves training a model to generalize and make predictions on unseen classes or
tasks. To perform prompt engineering in zero-shot learning environments, we recommend constructing
prompts that explicitly provide information about the target task and the desired output format. For
example, if you want to use a foundation model for zero-shot text classification on a set of classes
that the model did not see during training, a well-engineered prompt could be: "Classify the
following text as either sports, politics, or entertainment: [input text]." By
explicitly specifying the target classes and the expected output format, you can guide the model to make
accurate predictions even on unseen classes.
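As an illustration, the following sketch sends such a zero-shot prompt to an already deployed text generation endpoint; the endpoint name and payload shape are hypothetical and depend on the model you deployed.

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Hypothetical endpoint name; payload format depends on the deployed model.
predictor = Predictor(
    endpoint_name="my-text-generation-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
prompt = (
    "Classify the following text as either sports, politics, or entertainment: "
    "The quarterback threw three touchdown passes in the fourth quarter."
)
print(predictor.predict({"inputs": prompt}))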
Few-shot learning
Few-shot learning involves training a model with a limited amount of data for new classes or tasks.
Prompt engineering in few-shot learning environments focuses on designing prompts that effectively
use the limited available training data. For example, if you use a foundation model for an image
classification task and only have a few examples of a new image class, you can engineer a prompt that
includes the available labeled examples with a placeholder for the target class. For example, the prompt
could be: "[image 1], [image 2], and [image 3] are examples of [target class].
Classify the following image as [target class]". By incorporating the limited labeled
examples and explicitly specifying the target class, you can guide the model to generalize and make
accurate predictions even with minimal training data.
If prompt engineering is not sufficient to adapt your foundation model to specific business needs,
domain-specific language, target tasks, or other requirements, you can consider fine-tuning your
model on additional data or using Retrieval Augmented Generation (RAG) to augment your model
architecture with enhanced context from archived knowledge sources. For more information, see Fine-
tune a foundation model (p. 64) or Retrieval Augmented Generation (RAG) (p. 65).
There are two main approaches that you can take for fine-tuning depending on your use case and chosen
foundation model. If you're interested in fine-tuning your model on domain-specific data, see Domain
adaptation fine-tuning (p. 64). If you're interested in instruction-based fine-tuning using prompt and
response examples, see Instruction-based fine-tuning (p. 64).
Domain adaptation fine-tuning allows you to leverage pre-trained foundation models and adapt them
to specific tasks using limited domain-specific data. If prompt engineering efforts do not provide enough
customization, you can use domain adaptation fine-tuning to get your model working with domain-specific
language, such as industry jargon, technical terms, or other specialized data. This fine-tuning process
modifies the weights of the model. For more information, see the SageMaker JumpStart Foundation
Models - Fine-tuning text generation GPT-J 6B model on domain specific dataset example notebook.
Domain adaptation fine-tuning is available with the following foundation models:
• GPT-J 6B
• GPT Neo 2.7B
• BloomZ 7b1
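A hedged sketch of domain adaptation fine-tuning with the SageMaker Python SDK's JumpStart classes follows; the model ID is one of the models listed above, and the S3 path and channel name are placeholders that depend on the model's expected data format.

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(model_id="huggingface-textgeneration1-gpt-j-6b")
# The channel name and file format expected for training data vary by model.
estimator.fit({"training": "s3://my-bucket/domain-specific-corpus/"})
# Deploy the fine-tuned model for inference.
predictor = estimator.deploy()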
Instruction-based fine-tuning
Fine-tuned LAnguage Net (FLAN) models use instruction tuning to make models more amenable to
solving general downstream NLP tasks. Amazon SageMaker JumpStart provides a number of foundation
models in the FLAN model family. For example, FLAN-T5 models are instruction fine-tuned on a wide
range of tasks to increase zero-shot performance for a variety of common use cases. With additional
data and fine-tuning, instruction-based models can be further adapted to more specific tasks that
weren’t considered during pre-training. For more information, see the SageMaker JumpStart Foundation
Models - HuggingFace Text2Text Instruction Fine-Tuning example notebook.
Instruction-based fine-tuning is available with the following foundation models:
• FLAN-T5 XL
• FLAN-T5 Large
• FLAN-T5 Small
• FLAN-T5 Base
Retrieval Augmented Generation (RAG)
With RAG, the external data used to augment your prompts can come from multiple data sources, such
as document repositories, databases, or APIs. The first step is to convert your documents and any
user queries into a compatible format to perform relevancy search. To make the formats compatible,
a document collection, or knowledge library, and user-submitted queries are converted to numerical
representations using embedding language models. Embedding is the process by which text is given
numerical representation in a vector space. RAG model architectures compare the embeddings of user
queries within the vector of the knowledge library. The original user prompt is then appended with
relevant context from similar documents within the knowledge library. This augmented prompt is then
sent to the foundation model. You can update knowledge libraries and their relevant embeddings
asynchronously.
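The following Python sketch illustrates the retrieval step described above. The toy embed function stands in for an embedding language model, and the document collection is illustrative.
import numpy as np

def embed(text):
    # Toy stand-in for an embedding language model that maps text
    # into a vector space; replace with a real embedding model.
    vec = np.zeros(64)
    for token in text.lower().split():
        vec[hash(token) % 64] += 1.0
    return vec

# Embed the knowledge library once; embeddings can be updated asynchronously.
documents = ["First archived document ...", "Second archived document ..."]
doc_vectors = np.stack([embed(d) for d in documents])

def augment_prompt(user_prompt, top_k=1):
    # Compare the query embedding with each document embedding using
    # cosine similarity, then append the most relevant context.
    query = embed(user_prompt)
    scores = doc_vectors @ query / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query) + 1e-9
    )
    context = "\n".join(documents[i] for i in np.argsort(scores)[-top_k:])
    return f"Context:\n{context}\n\nQuestion: {user_prompt}"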
Task-Specific Models
JumpStart supports task-specific models across fifteen of the most popular problem types. Of the
supported problem types, Vision and NLP-related types total thirteen. There are eight problem types
that support incremental training and fine-tuning. For more information about incremental training and
hyperparameter tuning, see SageMaker Automatic Model Tuning. JumpStart also supports four popular
algorithms for tabular data modeling.
You can search and browse models from the JumpStart landing page in Studio. When you select a
model, the model detail page provides information about the model, and you can train and deploy your
model in a few steps. The description section describes what you can do with the model, the expected
types of inputs and outputs, and the data type needed for fine-tuning your model.
You can also programmatically utilize models with the SageMaker Python SDK. For a list of all available
models, see the JumpStart Available Model Table.
Each problem type links to an example Jupyter notebook. The example notebooks for tabular problem
types include the following:
• Introduction to JumpStart - Tabular Classification - AutoGluon Learner
• Introduction to JumpStart - Tabular Classification - TabTransformer Learner
• Introduction to JumpStart - Tabular Regression - AutoGluon Learner
• Introduction to JumpStart - Tabular Regression - TabTransformer Learner
Deploy a Model
When you deploy a model from JumpStart, SageMaker hosts the model and deploys an endpoint that
you can use for inference. JumpStart also provides an example notebook that you can use to access the
model after it's deployed.
The default instance type for deploying a model depends on the model. The instance type is the
hardware that the endpoint runs on. In the following example, the ml.p2.xlarge instance is the
default for this particular BERT model.
You can also change the endpoint name, add key-value resource tags, activate or deactivate the
jumpstart- prefix for any JumpStart resources related to the model, and specify an Amazon S3 bucket
for storing model artifacts used by your SageMaker endpoint.
Choose Security Settings to specify the AWS Identity and Access Management (IAM) role, Amazon
Virtual Private Cloud (Amazon VPC), and encryption keys for the model.
When you deploy a model with JumpStart, you can specify an IAM role, Amazon VPC, and encryption
keys for the model. If you don't specify values for these entries, the default IAM role is your Studio
runtime role, default encryption is used, and no Amazon VPC is used.
IAM role
You can select an IAM role that is passed as part of training jobs and hosting jobs. SageMaker uses this
role to access training data and model artifacts. If you don't select an IAM role, SageMaker deploys the
model using your Studio runtime role. For more information about IAM roles, see Identity and Access
Management for Amazon SageMaker (p. 3048).
The role that you pass must have access to the resources that the model needs, and must include all of
the following.
Note
You can scope down the Amazon S3 permissions granted in each of the following roles. Do
this by using the ARN of your Amazon Simple Storage Service (Amazon S3) bucket and the
JumpStart Amazon S3 bucket.
{
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListMultipartUploadParts",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::jumpstart-cache-prod-<region>/*",
        "arn:aws:s3:::jumpstart-cache-prod-<region>",
        "arn:aws:s3:::bucket/*"
    ]
}
Find role
If you select this option, you must select an existing IAM role from the dropdown list.
Input role
If you select this option, you must manually enter the ARN for an existing IAM role. If your Studio
runtime role or Amazon VPC blocks the iam:list* call, you must use this option to use an existing IAM
role.
Amazon VPC
All JumpStart models run in network isolation mode. After the model container is created, no inbound
or outbound network calls can be made to or from the container.
SageMaker uses this Amazon VPC to push and pull resources from your Amazon S3 bucket. This Amazon
VPC is different from the Amazon VPC that limits access to the public internet from your Studio instance.
For more information about the Studio Amazon VPC, see Connect SageMaker Studio Notebooks in a VPC
to External Resources (p. 3209).
The Amazon VPC that you pass does not need access to the public internet, but it does need access
to Amazon S3. The Amazon VPC endpoint for Amazon S3 must allow access to at least the following
resources that the model needs.
{
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListMultipartUploadParts",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::jumpstart-cache-prod-<region>/*",
        "arn:aws:s3:::jumpstart-cache-prod-<region>",
        "arn:aws:s3:::bucket/*"
    ]
}
Find VPC
If you select this option, you must select an existing Amazon VPC from the dropdown list. After you
select an Amazon VPC, you must select a subnet and security group for your Amazon VPC. For more
information about subnets and security groups, see Overview of VPCs and subnets.
Input VPC
If you select this option, you must manually select the subnet and security group that compose your
Amazon VPC. If your Studio runtime role or Amazon VPC blocks the ec2:list* call, you must use this
option to select the subnet and security group.
Encryption keys
You can select an AWS KMS key that is passed as part of training jobs and hosting jobs. SageMaker uses
this key to encrypt the Amazon EBS volume for the container, and the repackaged model in Amazon S3
for hosting jobs and the output for training jobs. For more information about AWS KMS keys, see AWS
KMS keys.
The key that you pass must trust the IAM role that you pass. If you do not specify an IAM role, the AWS
KMS key must trust your Studio runtime role.
If you do not select an AWS KMS key, SageMaker provides default encryption for the data in the Amazon
EBS volume and the Amazon S3 artifacts.
Find encryption keys
If you select this option, you must select existing AWS KMS keys from the dropdown list.
Input encryption keys
If you select this option, you must manually enter the AWS KMS keys. If your Studio runtime role or
Amazon VPC blocks the kms:list* call, you must use this option to specify existing AWS KMS keys.
Fine-Tune a Model
Fine-tuning trains a pretrained model on a new dataset without training from scratch. This process, also
known as transfer learning, can produce accurate models with smaller datasets and less training time.
You can fine-tune a model if its card shows a fine-tunable attribute set to Yes.
To browse the buckets available to you, choose Find S3 bucket. These buckets are limited by the
permissions used to set up your Studio account. You can also specify an Amazon S3 URI by choosing
Enter Amazon S3 bucket location.
Tip
To find out how to format the data in your bucket, choose Learn more. The description section
for the model has detailed information about inputs and outputs.
Note
The Amazon S3 bucket must be in the same AWS Region where you're running SageMaker
Studio because SageMaker doesn't allow cross-Region requests.
Instance type    Number of GPUs
p3.2xlarge    1
p3.8xlarge    4
p3.16xlarge    8
p3dn.24xlarge    8
Hyperparameters
You can customize the hyperparameters of the training job that are used to fine-tune the model. The
hyperparameters available for each fine-tunable model differ depending on the model. For information
on each available hyperparameter, reference the hyperparameters documentation for the model of your
choosing in Use Amazon SageMaker Built-in Algorithms or Pre-trained Models (p. 1281). For example,
see Image Classification - TensorFlow Hyperparameters (p. 1526) for details on the fine-tunable Image
Classification - TensorFlow hyperparameters.
If you use the default dataset for text models without changing the hyperparameters, you get a nearly
identical model as a result. For vision models, the default dataset is different from the dataset used to
train the pretrained models, so your model is different as a result.
• Epochs – One epoch is one cycle through the entire dataset. Multiple intervals complete a batch, and
multiple batches eventually complete an epoch. Multiple epochs are run until the accuracy of the
model reaches an acceptable level, or when the error rate drops below an acceptable level.
• Learning rate – The amount that values should be changed between epochs. As the model is refined,
its internal weights are being nudged and error rates are checked to see if the model improves. A
typical learning rate is 0.1 or 0.01, where 0.01 is a much smaller adjustment and could cause the
training to take a long time to converge, whereas 0.1 is much larger and can cause the training to
overshoot. It is one of the primary hyperparameters that you might adjust for training your model.
Note that for text models, a much smaller learning rate (5e-5 for BERT) can result in a more accurate
model.
• Batch size – The number of records from the dataset to be selected for each interval to send to
the GPUs for training.
In an image example, you might send out 32 images per GPU, so 32 would be your batch size. If
you choose an instance type with more than one GPU, the batch is divided by the number of GPUs.
Suggested batch size varies depending on the data and the model that you are using. For example,
how you optimize for image data differs from how you handle language data.
In the instance type chart in the deployment configuration section, you can see the number of GPUs
per instance type. Start with a standard recommended batch size (for example, 32 for a vision model).
Then, multiply this by the number of GPUs in the instance type that you selected. For example, if
you're using a p3.8xlarge, this would be 32 (batch size) multiplied by 4 (GPUs), for a total of 128, as
your batch size adjusts for the number of GPUs. For a text model like BERT, try starting with a batch
size of 64, and then reduce as needed. A sketch of setting these hyperparameters programmatically
follows this list.
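As a sketch of setting these hyperparameters programmatically, the following example assumes the JumpStartEstimator class from the SageMaker Python SDK; the model ID, hyperparameter values, and Amazon S3 path are illustrative and vary by model.
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Illustrative model ID and hyperparameter values; check the
# hyperparameters documentation for the model of your choosing.
estimator = JumpStartEstimator(
    model_id="huggingface-text2text-flan-t5-base",
    hyperparameters={
        "epochs": "3",
        "learning_rate": "5e-5",  # small learning rate for a text model
        "batch_size": "64",
    },
)
estimator.fit({"training": "s3://amzn-s3-demo-bucket/fine-tuning-data/"})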
Training output
When the fine-tuning process is complete, JumpStart provides information about the model: parent
model, training job name, training job ARN, training time, and output path. The output path is where you
can find your new model in an Amazon S3 bucket. The folder structure uses the model name that you
provided, and the model file is in an /output subfolder, always named model.tar.gz.
Example: s3://bucket/model-name/output/model.tar.gz
Share Models
You can share JumpStart models through the Studio UI directly from the Launched Quick start assets
page using the following procedure:
1. Open Amazon SageMaker Studio and choose Launched Quick start assets in the Quick start
solutions section of the left navigation pane.
2. Select the Training jobs tab to view the list of your model training jobs.
3. Under the Training jobs list, select the training job that you want to share. This opens the training job
details page. You cannot share more than one training job at a time.
4. In the header for the training job, choose Share, and select either Share to Canvas or Share with my
organization.
For more information about how to share a model with a SageMaker Canvas user, see Bring Your Own
Model Into Canvas.
Note
Only tabular models can be shared to SageMaker Canvas. Trying to share a non-tabular model
to SageMaker Canvas throws the error Unsupported Data Type.
For more information about sharing models with your organization, see Shared Models and
Notebooks (p. 79).
All models that you share and models that are shared with you are searchable in a centralized location
directly in Amazon SageMaker Studio. For information on the onboarding steps to sign into Amazon
SageMaker Studio, see Onboard to Amazon SageMaker Domain.
Models and notebooks are organized into the following tabs:
1. Shared by me – Models and notebooks that you shared to either JumpStart or SageMaker Canvas.
2. Shared with me – Models and notebooks shared with you.
3. Shared by my organization – All models and notebooks that are shared with anyone in your
organization.
You can also sort your models and notebooks based on the time they were last updated or by ascending
or descending alphabetical order. Choose the filter icon to further sort your selections.
You can filter for models and notebooks shared to and from SageMaker Canvas by selecting the filter
icon in the Shared by me or Shared with me tabs. For more information about how to share a
model to SageMaker Canvas, see Bring Your Own Model Into Canvas.
To share a model or notebook with your organization, choose Shared by my organization, and then
select the Add dropdown list. Choose to either add a model or add a notebook.
Add a model
To add a model, choose Shared by my organization, and then select Add model from the Add
dropdown list. Enter the basic information for your model, and add any training or inference information
you want to share with collaborators to train or deploy your model. After you enter all the necessary
information, choose Add model in the lower right corner.
Basic information
First, add the basic descriptive information about your model. This information is used to improve the
searchability of your model.
1. Add a title for this model. Adding a title automatically populates a unique identifier in the ID field
based on the model title.
2. Add a description of the model.
3. Select a data type from the options: text, vision, tabular, or audio.
4. Select a machine learning task from the list of available tasks, such as image classification or text
generation.
5. Select a machine learning framework.
6. Add metadata information with keywords or phrases to use when searching for a model. Use commas
to separate keywords. Any spaces are automatically replaced with commas.
Enable training
When adding a model to share, you can optionally provide a training environment and allow
collaborators in your organization to train the shared model.
Note
If you are adding a tabular model, you also need to specify a column format and target column
to enable training. For more information, see Amazon SageMaker Canvas in the Amazon
SageMaker Developer Guide.
1. Add a container to use for model training. You can select a container used for an existing training job,
bring your own container in Amazon ECR, or use an Amazon SageMaker Deep Learning Container.
2. Add environment variables.
3. Provide a training script location.
4. Provide a script mode entry point.
5. Provide an Amazon S3 URI for model artifacts generated during training.
6. Provide the Amazon S3 URI to the default training dataset.
7. Provide a model output path. The model output path should be the Amazon S3 URI path for any
model artifacts generated from training. SageMaker saves the model artifacts as a single compressed
TAR file in Amazon S3.
8. Provide a validation dataset to use for evaluating your model during training. Validation datasets must
contain the same number of columns and the same feature headers as the training dataset.
9. Turn on network isolation. Network isolation isolates the model container so that no inbound or
outbound network calls can be made to or from the model container.
10. Provide training channels through which SageMaker can access your data. For example, you might
specify input channels named train or test. For each channel, specify a channel name and a URI to
the location of your data. Choose Browse to search for Amazon S3 locations.
11. Provide hyperparameters. Add any hyperparameters with which collaborators should experiment
during training. Provide a range of valid values for these hyperparameters. This range is used
for training job hyperparameter validation. You can define ranges based on the datatype of the
hyperparameter.
12. Select an instance type. We recommend a GPU instance with more memory for training with large
batch sizes. For a comprehensive list of SageMaker training instances across AWS Regions, see the On-
Demand Pricing table in Amazon SageMaker Pricing.
13. Provide metrics. Define metrics for a training job by specifying a name and a regular expression for
each metric that your training monitors. Design the regular expressions to capture the values of
metrics that your algorithm emits. For example, the metric loss might have the regular expression
"Loss =(.*?);".
Enable deployment
When adding a model to share, you can optionally provide an inference environment in which
collaborators in your organization can deploy the shared model for inference.
1. Add a container to use for inference. You can bring your own container in Amazon ECR or use an
Amazon SageMaker Deep Learning Container.
2. Provide the Amazon S3 URI to an inference script. Custom inference scripts run inside your chosen
container. Your inference script should include a function for model loading and, optionally, functions
for generating predictions and for input and output processing. For more information on creating
inference scripts for the framework of your choice, see Frameworks in the SageMaker Python SDK
documentation. For example, for TensorFlow, see How to implement the pre- and/or post-processing
handler(s). A sketch of such a script follows this list.
3. Provide an Amazon S3 URI for model artifacts. Model artifacts are the output that results from
training a model, and typically consist of trained parameters, a model definition that describes how to
compute inferences, and other metadata. If you trained your model in SageMaker, the model artifacts
are saved as a single compressed TAR file in Amazon S3. If you trained your model outside SageMaker,
you need to create this single compressed TAR file and save it in an Amazon S3 location.
4. Select an instance type. We recommend a GPU instance with more memory for inference with large
batch sizes. For a comprehensive list of SageMaker hosting instances across AWS Regions, see the On-
Demand Pricing table in Amazon SageMaker Pricing.
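The following is a minimal sketch of an inference script, assuming a PyTorch Deep Learning Container and the SageMaker Python SDK's PyTorch serving conventions; the handler names and model file name depend on your chosen framework.
# inference.py - minimal sketch for a PyTorch framework container.
# The model_fn/predict_fn handler names follow the SageMaker Python SDK
# PyTorch serving convention; other frameworks use different handlers.
import os
import torch

def model_fn(model_dir):
    # Load the model from the extracted model artifacts directory.
    model = torch.jit.load(os.path.join(model_dir, "model.pt"))
    model.eval()
    return model

def predict_fn(input_data, model):
    # Generate predictions; add input_fn and output_fn handlers for
    # custom input and output processing.
    with torch.no_grad():
        return model(input_data)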
Add a notebook
To add a notebook, choose Shared by my organization, and then select Add notebook from the
Add dropdown list. Enter the basic information for your notebook and provide an Amazon S3 URI for the
location of that notebook.
Basic information
First, add the basic descriptive information about your notebook. This information is used to improve the
searchability of your notebook.
1. Add a title for this notebook. Adding a title automatically populates a unique identifier in the ID field
based on the notebook title.
2. Add a description of the notebook.
3. Select a data type from the options: text, vision, tabular, or audio.
4. Select an ML task from the list of available tasks, such as image classification or text generation.
5. Select an ML framework.
6. Add metadata information with keywords or phrases to use when searching for a notebook. Use
commas to separate keywords. Any spaces are automatically replaced with commas.
Add notebook
Provide an Amazon S3 URI for the location of that notebook. You can choose Browse to search through
your Amazon S3 buckets for your notebook file location. After you find your notebook, copy the Amazon
S3 URI, choose Cancel, and then add the Amazon S3 URI to the Notebook Location field.
After you enter all the necessary information, choose Add notebook in the lower right corner.
SageMaker JumpStart Industry: Financial
Topics
• Amazon SageMaker JumpStart Industry Python SDK (p. 84)
• Amazon SageMaker JumpStart Industry: Financial Solution (p. 84)
• Amazon SageMaker JumpStart Industry: Financial Models (p. 84)
• Amazon SageMaker JumpStart Industry: Financial Example Notebooks (p. 86)
• Amazon SageMaker JumpStart Industry: Financial Blog Posts (p. 86)
Amazon SageMaker JumpStart Industry: Financial Solution
This SageMaker JumpStart Industry: Financial solution provides a template for a text-enhanced
corporate credit rating model. It shows how to take a model based on numeric features (in this case,
Altman's famous 5 financial ratios) combined with texts from SEC filings to achieve an improvement in
the prediction of credit ratings. In addition to the 5 Altman ratios, you can add more variables as needed
or set custom variables. This solution notebook shows how the SageMaker JumpStart Industry Python
SDK helps with Natural Language Processing (NLP) scoring of texts from SEC filings. Furthermore,
the solution demonstrates how to train a model using the enhanced dataset to achieve a best-in-class
model, deploy the model to a SageMaker endpoint for production, and receive improved predictions in
real time.
Credit ratings are traditionally generated using models that use financial statement data and market
data, which is tabular only (numeric and categorical). This solution constructs a network of firms using
SEC filings and shows how to use the network of firm relationships with tabular data to generate accurate
rating predictions. This solution demonstrates a methodology to use data on firm linkages to extend
the traditionally tabular-based credit scoring models, which have been used by the ratings industry for
decades, to the class of machine learning models on networks.
Note
The solution notebooks are for demonstration purposes only. They should not be relied on as
financial or investment advice.
You can find these financial services solutions through the SageMaker JumpStart page in Studio.
Note
The SageMaker JumpStart Industry: Financial solutions, model cards, and example notebooks
are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and
launch SageMaker Studio. For more information about how to find the solution card, see the
previous topic at SageMaker JumpStart.
Amazon SageMaker JumpStart Industry: Financial Models
SageMaker JumpStart Industry: Financial provides the following pretrained text embedding models:
• RoBERTa-SEC-Base
• RoBERTa-SEC-Large
• RoBERTa-SEC-WIKI-Base
• RoBERTa-SEC-WIKI-Large
The RoBERTa-SEC-Base and RoBERTa-SEC-Large models are the text embedding models based on
GluonNLP's RoBERTa model and pretrained on S&P 500 SEC 10-K/10-Q reports of the 2010s (from
2010 to 2019). In addition to these, SageMaker JumpStart Industry: Financial provides two
more RoBERTa variations, RoBERTa-SEC-WIKI-Base and RoBERTa-SEC-WIKI-Large, which are pretrained
on the SEC filings and common texts of Wikipedia.
You can find these models in SageMaker JumpStart by navigating to the Text Models node, choosing
Explore All Text Models, and then filtering for the ML Task Text Embedding. You can access any
corresponding notebooks after selecting the model of your choice. The paired notebooks will walk you
through how the pretrained models can be fine-tuned for specific classification tasks on multimodal
datasets, which are enhanced by the SageMaker JumpStart Industry Python SDK.
Note
The model notebooks are for demonstration purposes only. They should not be relied on as
financial or investment advice.
The following screenshot shows the pretrained model cards provided through the SageMaker JumpStart
page on Studio.
Note
The SageMaker JumpStart Industry: Financial solutions, model cards, and example notebooks
are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and
launch SageMaker Studio. For more information about how to find the model cards, see the
previous topic at SageMaker JumpStart.
Amazon SageMaker JumpStart Industry: Financial Example Notebooks
• Financial TabText Data Construction – This example introduces how to use the SageMaker JumpStart
Industry Python SDK for processing the SEC filings, such as text summarization and scoring texts based
on NLP score types and their corresponding word lists. To preview the content of this notebook, see
Simple Construction of a Multimodal Dataset from SEC Filings and NLP Scores.
• Multimodal ML on TabText Data – This example shows how to merge different types of datasets
into a single dataframe called TabText and perform multimodal ML. To preview the content of this
notebook, see Machine Learning on a TabText Dataframe – An Example Based on the Paycheck
Protection Program.
• Multi-category ML on SEC filings data – This example shows how to train an AutoGluon NLP model
over the multimodal (TabText) datasets curated from SEC filings for a multiclass classification task. To
preview the content of this notebook, see Classify SEC 10K/Q Filings to Industry Codes Based on the
MDNA Text Column.
Note
The example notebooks are for demonstration purposes only. They should not be relied on as
financial or investment advice.
Note
The SageMaker JumpStart Industry: Financial solutions, model cards, and example notebooks
are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and
launch SageMaker Studio. For more information about how to find the example notebooks, see
the previous topic at SageMaker JumpStart.
To preview the content of the example notebooks, see Tutorials – Finance in the SageMaker JumpStart
Industry Python SDK documentation.
Amazon SageMaker JumpStart Industry: Financial Blog Posts
• Use pre-trained financial language models for transfer learning in Amazon SageMaker JumpStart
• Use SEC text for ratings classification using multimodal ML in Amazon SageMaker JumpStart
• Create a dashboard with SEC text for financial NLP in Amazon SageMaker JumpStart
• Build a corporate credit ratings classifier using graph machine learning in Amazon SageMaker
JumpStart
• Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial
data
Get Started with Amazon SageMaker Notebook Instances
You can also take advantage of SageMaker features that help you deal with every stage of a complete
ML cycle: data labeling, data preprocessing, model training, model deployment, evaluation of prediction
performance, and monitoring the quality of the model in production.
If you're a first-time SageMaker user, we recommend that you use the SageMaker Python SDK, following
the end-to-end ML tutorial. To find the open source documentation, see the Amazon SageMaker Python
SDK.
Tutorial Overview
This Get Started tutorial walks you through how to create a SageMaker notebook instance, open a
Jupyter notebook with a preconfigured kernel with the Conda environment for machine learning, and
start a SageMaker session to run an end-to-end ML cycle. You'll learn how to save a dataset to a default
Amazon S3 bucket automatically paired with the SageMaker session, submit a training job of an ML
model to Amazon EC2, and deploy the trained model for prediction by hosting or batch inferencing
through Amazon EC2.
This tutorial explicitly shows a complete ML flow of training the XGBoost model from the SageMaker
built-in model pool. You use the US Adult Census dataset, and you evaluate the performance of the
trained SageMaker XGBoost model on predicting individuals' income.
• SageMaker XGBoost – The XGBoost model is adapted to the SageMaker environment and
preconfigured as Docker containers. SageMaker provides a suite of built-in algorithms that are
prepared for using SageMaker features. To learn more about what ML algorithms are adapted to
SageMaker, see Choose an Algorithm and Use Amazon SageMaker Built-in Algorithms. For the
SageMaker built-in algorithm API operations, see First-Party Algorithms in the Amazon SageMaker
Python SDK.
• Adult Census dataset – The dataset from the 1994 Census bureau database by Ronny Kohavi and Barry
Becker (Data Mining and Visualization, Silicon Graphics). The SageMaker XGBoost model is trained
using this dataset to predict if an individual makes over $50,000 a year or less.
Topics
• Step 1: Create an Amazon SageMaker Notebook Instance (p. 88)
• Step 2: Create a Jupyter Notebook (p. 89)
• Step 3: Download, Explore, and Transform a Dataset (p. 90)
• Step 4: Train a Model (p. 94)
• Step 5: Deploy the Model to Amazon EC2 (p. 98)
• Step 6: Evaluate the Model (p. 100)
• Step 7: Clean Up (p. 103)
a. For Notebook instance name, type a name for your notebook instance.
b. For Notebook Instance type, choose ml.t2.medium. This is the least expensive instance type
that notebook instances support, and it suffices for this exercise. If an ml.t2.medium instance
type isn't available in your current AWS Region, choose ml.t3.medium.
c. For Platform Identifier, choose a platform type to create the notebook instance on. This
platform type dictates the Operating System and the JupyterLab version that your notebook
instance is created with. For information about platform identifier type, see Amazon Linux 2 vs
Amazon Linux notebook instances (p. 205). For information about JupyterLab versions, see
JupyterLab versioning (p. 208).
d. For IAM role, choose Create a new role, and then choose Create role. This IAM role
automatically gets permissions to access any S3 bucket that has sagemaker in the name. It
gets these permissions through the AmazonSageMakerFullAccess policy, which SageMaker
attaches to the role.
Note
If you want to grant the IAM role permission to access S3 buckets without sagemaker
in the name, you need to attach the S3FullAccess policy or limit the permissions
to specific S3 buckets to the IAM role. For more information and examples of adding
bucket policies to the IAM role, see Bucket Policy Examples.
e. Choose Create notebook instance.
For more information about creating a SageMaker notebook instance, see Create a Notebook
Instance.
To change and update the SageMaker Notebook instance type and the EBS volume
1. On the Notebook instances page in the SageMaker console, choose your notebook instance.
2. Choose Actions, choose Stop, and then wait until the notebook instance fully stops.
3. After the notebook instance status changes to Stopped, choose Actions, and then choose Update
settings.
For more information about updating SageMaker notebook instance settings, see Update a Notebook
Instance.
For complete documentation about SageMaker notebook instances, see Use Amazon SageMaker
Notebook Instances.
Step 2: Create a Jupyter Notebook
• If you opened the notebook in the JupyterLab view, on the File menu, choose New, and then
choose Notebook. For Select Kernel, choose conda_python3. This preinstalled environment
includes the default Anaconda installation and Python 3.
• If you opened the notebook in the classic Jupyter view, on the Files tab, choose New, and then
choose conda_python3. This preinstalled environment includes the default Anaconda installation
and Python 3.
3. Save the notebook as follows:
• In the JupyterLab view, choose File, choose Save Notebook As..., and then rename the notebook.
• In the Jupyter classic view, choose File, choose Save as..., and then rename the notebook.
Step 3: Download, Explore, and Transform a Dataset
To run the following example, paste the sample code into a cell in your notebook instance.
import shap
X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)
feature_names = list(X.columns)
feature_names
Note
If the current Jupyter kernel does not have the SHAP library, install it by running the following
conda command:
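%conda install -c conda-forge shap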
If you're using JupyterLab, you must manually refresh the kernel after the installation and
updates have completed. Run the following IPython script to shut down the kernel (the kernel
will restart automatically):
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
The feature_names list object should return the following list of features:
['Age',
'Workclass',
'Education-Num',
'Marital Status',
'Occupation',
'Relationship',
'Race',
'Sex',
'Capital Gain',
'Capital Loss',
'Hours per week',
'Country']
Tip
If you're starting with unlabeled data, you can use Amazon SageMaker Ground Truth to create a
data labeling workflow in minutes. To learn more, see Label Data.
display(X.describe())
hist = X.hist(bins=30, sharey=True, figsize=(20, 10))
Tip
If you want to use a dataset that needs to be cleaned and transformed, you can simplify and
streamline data preprocessing and feature engineering using Amazon SageMaker Data Wrangler.
To learn more, see Prepare ML Data with Amazon SageMaker Data Wrangler.
Split the training set to separate out a validation set. The validation set is used to evaluate the
performance of the trained model while tuning the model's hyperparameters. 75 percent of the training
set becomes the final training set, and the rest is the validation set.
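One way to produce these splits is with scikit-learn, as in the following sketch; the 20 percent test fraction is an assumption, and the 25 percent validation split follows the ratio described above.
from sklearn.model_selection import train_test_split

# Hold out a test set (the fraction here is illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
# Split off 25 percent of the remaining training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1
)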
Using the pandas package, explicitly align each dataset by concatenating the numeric features with the
true labels.
import pandas as pd
train = pd.concat([pd.Series(y_train, index=X_train.index,
name='Income>50K', dtype=int), X_train], axis=1)
validation = pd.concat([pd.Series(y_val, index=X_val.index,
name='Income>50K', dtype=int), X_val], axis=1)
test = pd.concat([pd.Series(y_test, index=X_test.index,
name='Income>50K', dtype=int), X_test], axis=1)
train
validation
test
The following code sets up the default S3 bucket URI for your current SageMaker session, creates a
new demo-sagemaker-xgboost-adult-income-prediction folder, and uploads the training and
validation datasets to the data subfolder.
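The bucket and prefix variables and the local CSV files used in the following code can be prepared as in this sketch; writing the label column first without a header follows the CSV input format that the SageMaker XGBoost algorithm expects.
import os
import boto3
import sagemaker

# Default S3 bucket automatically paired with the current SageMaker session.
bucket = sagemaker.Session().default_bucket()
prefix = "demo-sagemaker-xgboost-adult-income-prediction"

# Save the datasets locally as headerless CSV files with the label first.
train.to_csv("train.csv", index=False, header=False)
validation.to_csv("validation.csv", index=False, header=False)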
boto3.Session().resource('s3').Bucket(bucket).Object(
os.path.join(prefix, 'data/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(
os.path.join(prefix, 'data/validation.csv')).upload_file('validation.csv')
Run the following AWS CLI to check if the CSV files are successfully uploaded to the S3 bucket.
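! aws s3 ls {bucket}/{prefix}/data --recursive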
Step 4: Train a Model
Topics
• Choose the Training Algorithm (p. 94)
• Create and Run a Training Job (p. 94)
Create and Run a Training Job
1. Import the Amazon SageMaker Python SDK and start by retrieving the basic information from your
current SageMaker session.
import sagemaker
region = sagemaker.Session().boto_region_name
print("AWS Region: {}".format(region))
role = sagemaker.get_execution_role()
print("RoleArn: {}".format(role))
• region – The current AWS Region where the SageMaker notebook instance is running.
• role – The IAM role used by the notebook instance.
Note
Check the SageMaker Python SDK version by running sagemaker.__version__. This
tutorial is based on sagemaker>=2.20. If the SDK is outdated, install the latest version by
running the following command:
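! pip install -qU sagemaker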
If you run this installation in your existing SageMaker Studio or notebook instances, you
need to manually refresh the kernel to finish applying the version update.
2. Create an XGBoost estimator using the sagemaker.estimator.Estimator class. In the following
example code, the XGBoost estimator is named xgb_model.
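The container and s3_output_location variables referenced in the following code, along with the Debugger rule classes, can be set up as in this sketch; the XGBoost container version and output folder name are illustrative.
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

# Retrieve the SageMaker XGBoost training container URI (version is illustrative).
container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
# S3 location where SageMaker stores the model artifact and training results.
s3_output_location = "s3://{}/{}/{}".format(bucket, prefix, "xgboost_model")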
xgb_model=sagemaker.estimator.Estimator(
image_uri=container,
role=role,
instance_count=1,
instance_type='ml.m4.xlarge',
volume_size=5,
output_path=s3_output_location,
sagemaker_session=sagemaker.Session(),
rules=[
Rule.sagemaker(rule_configs.create_xgboost_report()),
ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]
)
• image_uri – Specify the training container image URI. In this example, the SageMaker XGBoost
training container URI is specified using sagemaker.image_uris.retrieve.
• role – The AWS Identity and Access Management (IAM) role that SageMaker uses to perform
tasks on your behalf (for example, reading training results, calling model artifacts from Amazon S3,
and writing training results to Amazon S3).
• instance_count and instance_type – The type and number of Amazon EC2 ML compute
instances to use for model training. For this training exercise, you use a single ml.m4.xlarge
instance, which has 4 CPUs, 16 GB of memory, Amazon Elastic Block Store (Amazon EBS)
storage, and high network performance. For more information about EC2 compute instance
types, see Amazon EC2 Instance Types. For more information about billing, see Amazon
SageMaker pricing.
• volume_size – The size, in GB, of the EBS storage volume to attach to the training instance. This
must be large enough to store training data if you use File mode (File mode is on by default). If
you don't specify this parameter, its value defaults to 30.
• output_path – The path to the S3 bucket where SageMaker stores the model artifact and
training results.
• sagemaker_session – The session object that manages interactions with SageMaker API
operations and other AWS service that the training job uses.
• rules – Specify a list of SageMaker Debugger built-in rules. In this example, the
create_xgboost_report() rule creates an XGBoost report that provides insights into the
training progress and results, and the ProfilerReport() rule creates a report regarding the EC2
compute resource utilization. For more information, see SageMaker Debugger XGBoost Training
Report (p. 1685).
Tip
If you want to run distributed training of large deep learning models, such as
convolutional neural networks (CNN) and natural language processing (NLP) models, use
SageMaker Distributed for data parallelism or model parallelism. For more information, see
Distributed Training in Amazon SageMaker (p. 1821).
3. Set the hyperparameters for the XGBoost algorithm by calling the set_hyperparameters
method of the estimator. For a complete list of XGBoost hyperparameters, see XGBoost
Hyperparameters (p. 1377).
xgb_model.set_hyperparameters(
max_depth = 5,
eta = 0.2,
gamma = 4,
min_child_weight = 6,
subsample = 0.7,
objective = "binary:logistic",
num_round = 1000
)
Tip
You can also tune the hyperparameters using the SageMaker hyperparameter
optimization feature. For more information, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
4. Use the TrainingInput class to configure a data input flow for training. The following example
code shows how to configure TrainingInput objects to use the training and validation
datasets you uploaded to Amazon S3 in the Split the Dataset into Train, Validation, and Test
Datasets (p. 92) section.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
"s3://{}/{}/{}".format(bucket, prefix, "data/train.csv"), content_type="csv"
)
validation_input = TrainingInput(
"s3://{}/{}/{}".format(bucket, prefix, "data/validation.csv"), content_type="csv"
)
5. To start model training, call the estimator's fit method with the training and validation datasets.
By setting wait=True, the fit method displays progress logs and waits until training is complete.
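# The train and validation channel names match what the built-in XGBoost algorithm expects.
xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)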
For more information about model training, see Train a Model with Amazon SageMaker (p. 10). This
tutorial training job might take up to 10 minutes.
After the training job is complete, you can download an XGBoost training report and a profiling
report generated by SageMaker Debugger. The XGBoost training report offers you insights into the
training progress and results, such as the loss function with respect to iteration, feature importance,
confusion matrix, accuracy curves, and other statistical results of training. For example, you can find
the following loss curve from the XGBoost training report which clearly indicates that there is an
overfitting problem.
Run the following code to specify the S3 bucket URI where the Debugger training reports are
generated and check if the reports exist.
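rule_output_path = (
    xgb_model.output_path + "/" + xgb_model.latest_training_job.job_name + "/rule-output"
)
! aws s3 ls {rule_output_path} --recursive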
Download the Debugger XGBoost training and profiling reports to the current workspace:
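! aws s3 cp {rule_output_path} ./ --recursive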
Run the following IPython script to get the file link of the XGBoost training report:
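from IPython.display import FileLink, FileLinks
# The report is saved under the create_xgboost_report rule's output folder.
display("Click link below to view the XGBoost Training report", FileLink("CreateXgboostReport/xgboost_report.html"))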
The following IPython script returns the file link of the Debugger profiling report that shows
summaries and details of the EC2 instance resource utilization, system bottleneck detection results,
and python operation profiling results:
profiler_report_name = [rule["RuleConfigurationName"]
for rule in xgb_model.latest_training_job.rule_job_summary()
if "Profiler" in rule["RuleConfigurationName"]][0]
profiler_report_name
display("Click link below to view the profiler report", FileLink(profiler_report_name
+"/profiler-output/profiler-report.html"))
Tip
If the HTML reports do not render plots in the JupyterLab view, you must choose Trust
HTML at the top of the reports.
To identify training issues, such as overfitting, vanishing gradients, and other problems
that prevent your model from converging, use SageMaker Debugger and take automated
actions while prototyping and training your ML models. For more information, see Debug
and Profile Training Jobs Using Amazon SageMaker Debugger (p. 1649). To find a complete
analysis of model parameters, see the Explainability with Amazon SageMaker Debugger
example notebook.
You now have a trained XGBoost model. SageMaker stores the model artifact in your S3 bucket. To
find the location of the model artifact, run the following code to print the model_data attribute of the
xgb_model estimator:
xgb_model.model_data
Tip
To measure biases that can occur during each stage of the ML lifecycle (data collection, model
training and tuning, and monitoring of ML models deployed for prediction), use SageMaker
Clarify. For more information, see Amazon SageMaker Clarify Model Explainability (p. 2093). For
an end-to-end example, see the Fairness and Explainability with SageMaker Clarify example
notebook.
Step 5: Deploy the Model to Amazon EC2
Topics
• Deploy the Model to SageMaker Hosting Services (p. 98)
• (Optional) Use SageMaker Predictor to Reuse the Hosted Endpoint (p. 99)
• (Optional) Make Prediction with Batch Transform (p. 99)
Deploy the Model to SageMaker Hosting Services
To deploy the model, call the deploy method of the estimator, as shown in the following sketch (the
instance type is illustrative):
import sagemaker
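from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a real-time endpoint. The CSV serializer
# matches the input format that the XGBoost endpoint expects; the
# instance type shown here is an illustrative choice.
xgb_predictor = xgb_model.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    serializer=CSVSerializer()
)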
The deploy method creates a deployable model, configures the SageMaker hosting services endpoint,
and launches the endpoint to host the model. For more information, see the SageMaker generic
Estimator's deploy class method in the Amazon SageMaker Python SDK. To retrieve the name of
endpoint that's generated by the deploy method, run the following code:
xgb_predictor.endpoint_name
This should return the endpoint name of the xgb_predictor. The format of the endpoint name is
"sagemaker-xgboost-YYYY-MM-DD-HH-MM-SS-SSS". This endpoint stays active in the ML instance,
and you can make instantaneous predictions at any time unless you shut it down later. Copy this
endpoint name and save it to reuse and make real-time predictions elsewhere in SageMaker Studio or
SageMaker notebook instances.
Tip
To learn more about compiling and optimizing your model for deployment to Amazon EC2
instances or edge devices, see Compile and Deploy Models with Neo.
(Optional) Use SageMaker Predictor to Reuse the Hosted Endpoint
To reuse the hosted endpoint, for example from another notebook, create a Predictor object with the
saved endpoint name, as in the following code:
import sagemaker
xgb_predictor_reuse=sagemaker.predictor.Predictor(
endpoint_name="sagemaker-xgboost-YYYY-MM-DD-HH-MM-SS-SSS",
sagemaker_session=sagemaker.Session(),
serializer=sagemaker.serializers.CSVSerializer()
)
The xgb_predictor_reuse Predictor behaves exactly the same as the original xgb_predictor. For
more information, see the SageMaker Predictor class in the Amazon SageMaker Python SDK.
(Optional) Make Prediction with Batch Transform
1. Run the following code to convert the feature columns of the test dataset to a CSV file and upload
it to the S3 bucket. The first line is a sketch of the conversion step; it drops the label column and
writes no header:
test.iloc[:, 1:].to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(
os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
2. Specify S3 bucket URIs of input and output for the batch transform job as shown following:
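# The location of the test dataset uploaded in the previous step.
batch_input = "s3://{}/{}/test".format(bucket, prefix)
# The location to store the results of the batch transform job.
batch_output = "s3://{}/{}/batch-prediction".format(bucket, prefix)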
3. Create a transformer object specifying the minimal number of parameters: the instance_count
and instance_type parameters to run the batch transform job, and the output_path to save
prediction data as shown following:
transformer = xgb_model.transformer(
instance_count=1,
instance_type='ml.m4.xlarge',
output_path=batch_output
)
4. Initiate the batch transform job by executing the transform() method of the transformer object
as shown following:
transformer.transform(
data=batch_input,
data_type='S3Prefix',
content_type='text/csv',
split_type='Line'
)
transformer.wait()
5. When the batch transform job is complete, SageMaker creates the test.csv.out prediction data
saved in the batch_output path, which should be in the following format: s3://sagemaker-
<region>-111122223333/demo-sagemaker-xgboost-adult-income-prediction/batch-
prediction. Run the following AWS CLI to download the output data of the batch transform job:
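! aws s3 cp {batch_output} ./ --recursive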
This should create the test.csv.out file under the current working directory. You'll be able to see
the float values that are predicted based on the logistic regression of the XGBoost training job.
Step 6: Evaluate the Model
1. Set up the following function to predict each line of the test set. In the following example code, the
rows argument specifies the number of lines to predict at a time. You can change its value
to perform a batch inference that fully utilizes the instance's hardware resources.
import numpy as np
def predict(data, rows=1000):
split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
predictions = ''
for array in split_array:
predictions = ','.join([predictions,
xgb_predictor.predict(array).decode('utf-8')])
return np.fromstring(predictions[1:], sep=',')
2. Run the following code to make predictions of the test dataset and plot a histogram. You need to
take only the feature columns of the test dataset, excluding the 0th column for the actual values.
import matplotlib.pyplot as plt

predictions=predict(test.to_numpy()[:,1:])
plt.hist(predictions)
plt.show()
3. The predicted values are float type. To determine True or False based on the float values, you
need to set a cutoff value. As shown in the following example code, use the Scikit-learn library to
return the output confusion matrix and classification report with a cutoff of 0.5.
import sklearn
cutoff=0.5
print(sklearn.metrics.confusion_matrix(test.iloc[:, 0], np.where(predictions > cutoff,
1, 0)))
print(sklearn.metrics.classification_report(test.iloc[:, 0], np.where(predictions >
cutoff, 1, 0)))
4. To find the best cutoff with the given test set, compute the log loss function of the logistic
regression. The log loss function is defined as the negative log-likelihood of a logistic model that
returns prediction probabilities for its ground truth labels. The following example code numerically
and iteratively calculates the log loss values (-(y*log(p)+(1-y)log(1-p)), where y is the true
label and p is a probability estimate of the corresponding test sample. It returns a log loss versus
cutoff graph.
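# Compute the log loss for a range of cutoff values.
cutoffs = np.arange(0.01, 1, 0.01)
log_loss = []
for c in cutoffs:
    log_loss.append(
        sklearn.metrics.log_loss(test.iloc[:, 0], np.where(predictions > c, 1, 0))
    )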
plt.figure(figsize=(15,10))
plt.plot(cutoffs, log_loss)
plt.xlabel("Cutoff")
plt.ylabel("Log loss")
plt.show()
5. Find the minimum points of the error curve using the NumPy argmin and min functions:
print(
    'Log loss is minimized at a cutoff of ', cutoffs[np.argmin(log_loss)],
    ', and the log loss value at the minimum is ', np.min(log_loss)
)
This should return: Log loss is minimized at a cutoff of 0.53, and the log loss
value at the minimum is 4.348539186773897.
Instead of computing and minimizing the log loss function, you can estimate a cost function as
an alternative. For example, if you want to train a model to perform a binary classification for a
business problem such as a customer churn prediction problem, you can set weights to the elements
of confusion matrix and calculate the cost function accordingly.
You have now trained, deployed, and evaluated your first model in SageMaker.
Tip
To monitor model quality, data quality, and bias drift, use Amazon SageMaker Model Monitor
and SageMaker Clarify. To learn more, see Amazon SageMaker Model Monitor, Monitor Data
Quality, Monitor Model Quality, Monitor Bias Drift, and Monitor Feature Attribution Drift.
Tip
To get human review of low confidence ML predictions or a random sample of predictions, use
Amazon Augmented AI human review workflows. For more information, see Using Amazon
Augmented AI for Human Review.
Step 7: Clean Up
To avoid incurring unnecessary charges, use the AWS Management Console to delete the endpoints and
resources that you created while running the exercises.
Note
Training jobs and logs cannot be deleted and are retained indefinitely.
Note
If you plan to explore other exercises in this guide, you might want to keep some of these
resources, such as your notebook instance, S3 bucket, and IAM role.
1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/, and then choose
Notebook instances.
2. Choose the notebook instance that you created in the example, choose Actions, and then
choose Stop. The notebook instance takes several minutes to stop. When the Status changes to
Stopped, move on to the next step.
3. Choose Actions, and then choose Delete.
2. Open the Amazon S3 console at https://console.aws.amazon.com/s3/, and then delete the bucket
that you created for storing model artifacts and the training dataset.
3. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/, and then
delete all of the log groups that have names starting with /aws/sagemaker/.
Amazon SageMaker supports the following machine learning environments:
• Amazon SageMaker Studio: Lets you build, train, debug, deploy, and monitor your machine learning
models.
• Amazon SageMaker Notebook Instances: Lets you prepare and process data, and train and deploy
machine learning models from a compute instance running the Jupyter Notebook application.
• Amazon SageMaker Studio Lab: Studio Lab is a free service that gives you access to AWS compute
resources, in an environment based on open-source JupyterLab, without requiring an AWS account.
• Amazon SageMaker Canvas: Gives you the ability to use machine learning to generate predictions
without needing to code.
• Amazon SageMaker geospatial: Gives you the ability to build, train, and deploy geospatial models.
• RStudio on Amazon SageMaker: RStudio is an IDE for R, with a console, syntax-highlighting editor
that supports direct code execution, and tools for plotting, history, debugging and workspace
management.
To use these machine learning environments, except Studio Lab and SageMaker Notebook Instances,
you or your organization's administrator must create an Amazon SageMaker Domain. Studio Lab has a
separate onboarding process.
Topics
• Amazon SageMaker Domain (p. 105)
• Amazon SageMaker Studio (p. 128)
• Amazon SageMaker Notebook Instances (p. 204)
• Amazon SageMaker Studio Lab (p. 230)
• Amazon SageMaker Canvas (p. 258)
• Amazon SageMaker geospatial capabilities (p. 401)
• RStudio on Amazon SageMaker (p. 432)
The Domain and its supporting entities are defined as follows:
• Domain: An Amazon SageMaker Domain consists of an associated Amazon Elastic File System (Amazon
EFS) volume; a list of authorized users; and a variety of security, application, policy, and Amazon
Virtual Private Cloud (Amazon VPC) configurations. Users within a Domain can share notebook files
and other artifacts with each other. An account can have multiple Domains. For more information
about multiple Domains, see Multiple Domains Overview (p. 108).
• UserProfile: A user profile represents a single user within a Domain. It is the main way to reference a
user for the purposes of sharing, reporting, and other user-oriented features. This entity is created
when a user onboards to the Amazon SageMaker Domain. For more information about user profiles,
see Domain User Profiles (p. 118).
• shared space: A shared space consists of a shared JupyterServer application and shared directory. All
users within the Domain have access to the shared space. All user profiles in a Domain have access
to all shared spaces in the Domain. For more information about shared spaces, see Collaborate with
shared spaces (p. 123).
• App: An app represents an application that supports the reading and execution experience of the
user’s notebooks, terminals, and consoles. The type of app can be JupyterServer, KernelGateway,
RStudioServerPro, or RSession. A user may have multiple apps active simultaneously.
The following tables describe the status values for the Domain, UserProfile, shared space, and App
entities. Where applicable, they also give troubleshooting steps.
[Status value tables for the Domain, UserProfile, shared space, and App entities] For example, for a
UserProfile with a failed status, delete the failed UserProfile and recreate it after fixing the error
mentioned in FailureReason.
Topics
• Prerequisites (p. 108)
• Multiple Domains Overview (p. 108)
• Domain resource isolation (p. 110)
• Setting Defaults for a Domain (p. 112)
• Environment (p. 114)
• View and Edit Domains (p. 114)
• Delete an Amazon SageMaker Domain (p. 116)
• Domain User Profiles (p. 118)
• IAM Identity Center Groups in a Domain (p. 122)
• Collaborate with shared spaces (p. 123)
Prerequisites
To use the features available in an Amazon SageMaker Domain, you must first onboard to a Domain. For
more information, see Onboard to Amazon SageMaker Domain.
If you are interacting with your Domain using the AWS CLI, you must also complete the following
prerequisites.
• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.
Multiple Domains Overview
Topics
• Automatic tag propagation (p. 109)
• Scoping each Domain (p. 109)
• Backfilling Domain tags (p. 110)
Automatic tag propagation
The following API calls do not support tagging for their respective resource ARNs:

ARN    API calls
ImageVersionArn    describe-image-version, update-image-version, delete-image-version
ModelCardExportJobArn    describe-model-card-export-job
PipelineExecutionArn    retry-pipeline-execution, update-pipeline-execution, describe-pipeline-execution, describe-pipeline-definition-for-execution
ModelPackageArn    describe-action
You can also use these tags for cost allocation using AWS Billing and Cost Management. For more
information, see Using AWS cost allocation tags.
Scoping each Domain
To enable resource isolation, you must modify the IAM execution role of your Domain, as follows.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CreateAPIs",
"Effect": "Allow",
"Action": "sagemaker:Create*",
"NotResource": [
"arn:aws:sagemaker:*:*:domain/*",
"arn:aws:sagemaker:*:*:user-profile/*",
"arn:aws:sagemaker:*:*:space/*"
]
},
{
"Sid": "ResourceAccessRequireDomainTag",
"Effect": "Allow",
"Action": [
"sagemaker:Update*",
"sagemaker:Delete*",
"sagemaker:Describe*"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/sagemaker:domain-arn": "domain-arn"
}
}
},
{
"Sid": "AllowActionsThatDontSupportTagging",
"Effect": "Allow",
"Action": [
"sagemaker:DescribeImageVersion",
"sagemaker:UpdateImageVersion",
"sagemaker:DeleteImageVersion",
"sagemaker:DescribeModelCardExportJob",
"sagemaker:RetryPipelineExecution",
"sagemaker:DescribePipelineExecution",
"sagemaker:UpdatePipelineExecution",
"sagemaker:DescribeAction"
],
"Resource": "*"
},
{
"Sid": "DeleteDefaultApp",
"Effect": "Allow",
"Action": "sagemaker:DeleteApp",
"Resource": "arn:aws:sagemaker:*:*:app/domain-ID/*/jupyterserver/default"
}
]
}
Backfilling Domain tags
To accurately attribute resources to their respective Domain, you must add the Domain tag to existing
resources using the AWS CLI, as follows.
1. Map all existing SageMaker resources and their respective ARNs to the Domains that exist in your
account.
2. Run the following command from your local machine to tag the resource with the ARN of the
resource's respective Domain. This must be repeated for every SageMaker resource in your account.
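aws sagemaker add-tags \
    --resource-arn resource-arn \
    --tags '[{"Key": "sagemaker:domain-arn", "Value": "domain-arn"}]'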
Domain resource isolation
Console
The following section shows how to create a new IAM policy that limits access to resources in the Domain
to user profiles with the Domain tag, as well as how to attach this policy to the IAM execution role of the
Domain, from the Amazon SageMaker console.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CreateAPIs",
"Effect": "Allow",
"Action": [
"SageMaker:Create*"
],
"NotResource":
[
"arn:aws:sagemaker:*:*:domain/*",
"arn:aws:sagemaker:*:*:user-profile/*",
"arn:aws:sagemaker:*:*:space/*"
]
},
{
"Sid": "ResourceAccessRequireDomainTag",
"Effect": "Allow",
"Action": [
"SageMaker:Update*",
"SageMaker:Delete*",
"SageMaker:Describe*"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/sagemaker:domain-arn":
"arn:aws:sagemaker:region:account-id:domain/domain-id"
}
}
}
]
}
AWS CLI
The following section shows how to create a new IAM policy that limits access to resources in the Domain to user profiles with the Domain tag, and how to attach this policy to the Domain's execution role, from the AWS CLI.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CreateAPIs",
"Effect": "Allow",
"Action": [
"SageMaker:Create*"
],
"NotResource":
[
"arn:aws:sagemaker:*:*:domain/*",
"arn:aws:sagemaker:*:*:user-profile/*",
"arn:aws:sagemaker:*:*:space/*"
]
},
{
"Sid": "ResourceAccessRequireDomainTag",
"Effect": "Allow",
"Action": [
"SageMaker:Update*",
"SageMaker:Delete*",
"SageMaker:Describe*"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/sagemaker:domain-arn":
"arn:aws:sagemaker:region:account-id:domain/domain-id"
}
}
}
]
}
3. Attach the newly created policy to a new or existing role that is used as the Domain's execution role.
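As a sketch, assuming the policy document is saved locally as policy.json and that domain-resource-isolation-policy and domain-execution-role are placeholder names, you might create and attach the policy as follows:

aws iam create-policy \
--policy-name domain-resource-isolation-policy \
--policy-document file://policy.json

aws iam attach-role-policy \
--role-name domain-execution-role \
--policy-arn arn:aws:iam::account-id:policy/domain-resource-isolation-policy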
Setting Defaults for a Domain
Topics
• Domain default settings (p. 112)
• Context keys (p. 113)
• DefaultUserSettings
• DefaultSpaceSettings
Note
DefaultSpaceSettings only supports the use of JupyterLab 3 image ARNs for
SageMakerImageArn. For more information, see JupyterLab Versioning (p. 135).
"DefaultSpaceSettings": {
"ExecutionRole": "string",
"JupyterServerAppSettings": {
"DefaultResourceSpec": {
"InstanceType": "string",
"LifecycleConfigArn": "string",
"SageMakerImageArn": "string",
"SageMakerImageVersionArn": "string"
},
"LifecycleConfigArns": [ "string" ]
},
"KernelGatewayAppSettings": {
"CustomImages": [
{
"AppImageConfigName": "string",
"ImageName": "string",
"ImageVersionNumber": number
}
],
"DefaultResourceSpec": {
"InstanceType": "string",
"LifecycleConfigArn": "string",
"SageMakerImageArn": "string",
"SageMakerImageVersionArn": "string"
},
"LifecycleConfigArns": [ "string" ]
},
"SecurityGroups": [ "string" ]
}
Context keys
You can add context keys to the IAM policy that creates a Domain. This restricts the values that users can
pass for those fields. The following list shows the context keys that Domain supports and where they're
implemented.
• sagemaker:ImageArns
• Implemented as part of DefaultUserSettings: SageMakerImageArn in DefaultUserSettings.JupyterServerAppSettings and DefaultUserSettings.KernelGatewayAppSettings, and CustomImages in DefaultUserSettings.KernelGatewayAppSettings.
• Implemented as part of DefaultSpaceSettings: SageMakerImageArn in DefaultSpaceSettings.JupyterServerAppSettings and DefaultSpaceSettings.KernelGatewayAppSettings, and CustomImages in DefaultSpaceSettings.KernelGatewayAppSettings.
• sagemaker:VpcSecurityGroupIds
• Implemented as part of DefaultUserSettings:SecurityGroups in DefaultUserSettings.
• Implemented as part of DefaultSpaceSettings:SecurityGroups in
DefaultSpaceSettings.
• sagemaker:DomainSharingOutputKmsKey
When using context keys for the defaults, you cannot restrict users to incompatible values. The values for SageMakerImageArn set as part of DefaultUserSettings and DefaultSpaceSettings must be compatible. For example, you cannot combine the following incompatible default values. For more information about the available JupyterLab version ARNs, see Setting a default JupyterLab version (p. 137).
• A JupyterLab version 1 ARN for the SageMakerImageArn value in DefaultUserSettings
• A JupyterLab version 3 ARN for the SageMakerImageArn value in DefaultSpaceSettings
Environment
This page gives information about modifications to the Amazon SageMaker Domain environment. This includes custom images, lifecycle configurations, and Git repositories attached to a Domain environment. These can also be attached to a shared space using the AWS CLI by passing values to the create-space command using the space-settings parameter.
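For example, a sketch of creating a shared space with a lifecycle configuration attached; domain-id, space-name, and lifecycle-config-arn are placeholders:

aws sagemaker create-space \
--domain-id domain-id \
--space-name space-name \
--space-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "InstanceType": "system",
            "LifecycleConfigArn": "lifecycle-config-arn"
        },
        "LifecycleConfigArns": [ "lifecycle-config-arn" ]
    }
}'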
For more information about bringing a custom Amazon SageMaker Studio image, see Bring your own
SageMaker image.
For more information about bringing a custom RStudio image, see Bring your own image to RStudio on
SageMaker.
For instructions on using a lifecycle configuration with Studio, see Use Lifecycle Configurations with
Amazon SageMaker Studio.
For information about attaching a git repository to a Domain, see Attach Suggested Git Repos to
SageMaker.
Complete the following procedure to view the custom images, lifecycle configurations, and Git repositories attached to a Domain environment.
View and Edit Domains
Topics
• View Domains (p. 114)
• Edit Domain settings (p. 115)
View Domains
The following section shows how to view a list of your Domains, and details of an individual Domain
from the SageMaker console or the AWS CLI.
Console
The console's Domain overview page gives information about the structure of a Domain, and it provides
a list of your Domains. The page's Domain structure diagram describes Domain components and how
they interact with each other.
The following procedure shows how to view a list of your Domains from the SageMaker console.
To view the details of the Domain, complete the following procedure. This page gives information about
the general settings for the Domain, including the name, Domain ID, execution role used to create the
Domain, and the authentication method of the Domain.
1. From the list of Domains, select the Domain that you want to open the Domain settings page for.
2. On the Domain details page, choose the Domain settings tab.
AWS CLI
Run the following command from the terminal of your local machine to view a list of Domains from the
AWS CLI.
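A sketch of the commands; domain-id is a placeholder. The list-domains operation returns the list of Domains, and describe-domain returns the details of an individual Domain:

aws sagemaker list-domains

aws sagemaker describe-domain \
--domain-id domain-id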
Edit Domain settings
The following section shows how to edit Domain settings from the SageMaker console or the AWS CLI.
Console
You can edit the Domain from the SageMaker console using the following procedure.
AWS CLI
Run the following command from the terminal of your local machine to update a Domain from the AWS
CLI. For more information about the structure of default-user-settings, see CreateDomain.
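A sketch, with domain-id as a placeholder and an illustrative default-user-settings value:

aws sagemaker update-domain \
--domain-id domain-id \
--default-user-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "InstanceType": "system"
        }
    }
}'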
Delete an Amazon SageMaker Domain
You can delete a Domain using any of the following:
• AWS console
• AWS Command Line Interface (AWS CLI)
• SageMaker SDK
The following sections explain how to delete a Domain and the requirements for doing so.
Requirements
You must satisfy the following requirements to delete a Domain.
EFS files
Your files are kept in an Amazon EFS volume as a backup. This backup includes the files in the mounted
directory, which is /home/sagemaker-user for Jupyter and /root for your kernel.
When you delete files from these mounted directories, the kernel or app may move the deleted files
into a hidden trash folder. If the trash folder is inside the mounted directory, those files are copied
into the Amazon EFS volume and will incur charges. To avoid these Amazon EFS charges, you must
identify and clean the trash folder location. The trash folder location for default apps and kernels is
~/.local/. This may vary depending on the Linux distribution used for custom apps or kernels. For
more information about the Amazon EFS volume, see Manage Your Amazon EFS Storage Volume in
SageMaker Studio (p. 198).
When you use the SageMaker console to delete the Domain, the Amazon EFS volume is detached but not
deleted. The same behavior occurs by default when you use the AWS CLI or the SageMaker Python SDK
to delete the Domain. However, when you use the AWS CLI or the SageMaker Python SDK, you can set
the RetentionPolicy to HomeEfsFileSystem=Delete to delete the Amazon EFS volume along with
the Domain.
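For example, a sketch of deleting a Domain along with its Amazon EFS volume; domain-id is a placeholder, and omitting --retention-policy (or specifying HomeEfsFileSystem=Retain) keeps the volume:

aws sagemaker delete-domain \
--domain-id domain-id \
--retention-policy HomeEfsFileSystem=Delete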
Important
When a user is deleted, they lose access to the Amazon EFS volume that contains their data,
including notebooks and other artifacts. The data is not deleted and can be accessed by an
administrator.
5. When all users are deleted, choose the Space management tab.
6. Repeat the following steps for each shared space in the Spaces list.
8. Delete the Domain. To also delete the Amazon EFS volume, specify HomeEfsFileSystem=Delete.
Domain User Profiles
Each user's Studio application is directly associated with the user profile and has an isolated Amazon EFS directory, an execution role associated with the user profile, and Kernel Gateway applications.
Topics
• Add and Remove User Profiles (p. 119)
• View User Profiles and User Profile Details (p. 121)
Add and Remove User Profiles
Topics
• Add user profiles (p. 119)
• Remove user profiles (p. 120)
If you choose Create a new role, the Create an IAM role dialog box opens:
a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM role, AmazonSageMaker-ExecutionPolicy, with the AmazonSageMakerFullAccess policy attached.
8. (Optional) Add tags to the user profile. All resources that the user profile creates will have a Domain
ARN tag and a user profile ARN tag. The Domain ARN tag is based on Domain ID, while the user
profile ARN tag is based on the user profile name.
9. Choose Next.
10. Under Default JupyterLab version, select a JupyterLab version from the dropdown to use as the
default for your user profile. For information about selecting a JupyterLab version, see JupyterLab
Versioning.
11. In the SageMaker Projects and JumpStart section, you have two options. You can accept the default
Project and JumpStart settings, or you can customize whether the user profile can create projects
and use JumpStart. For more information, see SageMaker Studio Permissions Required to Use
Projects.
12. Choose Next.
13. (Optional) If the Domain has an RStudio license associated, select whether you want to create the
user with one of the following authorizations:
• Unauthorized
• RStudio Admin
• RStudio User
14. Choose Next.
15. For the Canvas base permissions configuration, select whether to establish the minimum required
permissions to use the SageMaker Canvas application.
16. (Optional) For the Time series forecasting configuration: To grant user permissions for time series
forecasting in SageMaker Canvas, leave the Enable time series forecasting option turned on. It is
turned on by default.
17. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role. Alternatively, if you already have an IAM role with the required Amazon Forecast
permissions attached, select Use an existing execution role. For more information, see the IAM role
setup method (p. 278).
18. Choose Submit.
To create a user profile in a Domain from the AWS CLI, run the following command from the terminal of
your local machine. For information about the available JupyterLab version ARNs, see Setting a default
JupyterLab version (p. 137).
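A sketch, where domain-id, user-profile-name, and jupyter-server-3-image-arn are placeholders:

aws sagemaker create-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "SageMakerImageArn": "jupyter-server-3-image-arn"
        }
    }
}'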
View User Profiles and User Profile Details
Topics
• View user profiles (p. 121)
• View user profile details (p. 121)
To describe a user profile from the AWS CLI, run the following command from the terminal of your local
machine.
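A sketch, with domain-id and user-profile-name as placeholders:

aws sagemaker describe-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name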
IAM Identity Center Groups in a Domain
Topics
• View groups and users (p. 122)
• Add groups and users (p. 122)
• Remove groups (p. 122)
Remove groups
Complete the following procedure to remove groups from your Domain from the SageMaker console. For
information about deleting a user, see Remove user profiles (p. 120).
1. On the Groups tab, choose the group that you want to remove.
2. Choose Unassign groups.
3. On the pop-up window, choose Yes, unassign groups.
4. Enter unassign in the field.
5. Choose Unassign groups.
Collaborate with shared spaces
A shared space only supports Studio and KernelGateway applications, and only the use of JupyterLab 3 image Amazon Resource Names (ARNs). For more information, see JupyterLab Versioning (p. 135).
Amazon SageMaker automatically tags all SageMaker resources that you create within the scope of
a shared space. You can use these tags to monitor costs and plan budgets using tools, such as AWS
Budgets.
A shared space uses the same VPC settings as the Domain that it's created in.
Note
Domains with AWS IAM Identity Center (successor to AWS Single Sign-On) authentication do not
currently support the use of shared spaces. Shared spaces do not support the use of Amazon
SageMaker Data Wrangler or Amazon EMR cross-account clusters.
Automatic tagging
All resources created in a shared space are automatically tagged with a Domain ARN tag and shared
space ARN tag. The Domain ARN tag is based on the Domain ID, while the shared space ARN tag is based
on the shared space name.
You can use these tags to monitor AWS CloudTrail usage. For more information, see Log Amazon
SageMaker API Calls with AWS CloudTrail.
You can also use these tags to monitor costs with AWS Billing and Cost Management. For more
information, see Using AWS cost allocation tags.
A key benefit of a shared space is that it facilitates collaboration between members of the shared space
in real time. Users collaborating in a workspace get access to a shared Studio application where they
can access, read, and edit their notebooks in real time. Real time collaboration is only supported for
JupyterServer applications within a shared space.
Users with access to a shared space can simultaneously open, view, edit, and execute Jupyter notebooks
in the shared Studio application in that space.
The notebook indicates each co-editing user with a different cursor that shows the user profile name.
While multiple users can view the same notebook, co-editing is best suited for small groups of two to
five users.
To track changes being made by multiple users, we strongly recommend using Studio's built-in Git-based version control.
JupyterServer 2
To use shared spaces, Jupyter Server version 2 is required. Certain JupyterLab extensions and packages can forcefully downgrade Jupyter Server to version 1, which prevents the use of shared spaces. Run the following from the command prompt to change the version number and continue using shared spaces.
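The exact fix depends on which extension caused the downgrade; a minimal sketch, assuming pip is used inside the studio conda environment to reinstall a version 2 release of jupyter-server (restart the JupyterServer application afterward for the change to take effect):

conda activate studio
# Confirm which version of jupyter-server is installed
pip show jupyter-server
# Reinstall a version 2 release if an extension downgraded it
pip install "jupyter-server>=2,<3"
conda deactivate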
To attach a lifecycle configuration or custom image to a shared space, you must use the AWS CLI. For
more information about creating and attaching lifecycle configurations, see Creating and Associating a
Lifecycle Configuration (p. 183). For more information about creating and attaching custom images,
see Bring your own SageMaker image (p. 169).
Topics
• Add shared space support to an existing Domain (p. 124)
• Create from the console (p. 125)
• Create from AWS CLI (p. 125)
Console
Complete the following procedure to add support for shared spaces to an existing Domain from the
SageMaker console.
AWS CLI
Run the following command from the terminal of your local machine to add default shared space settings to a Domain from the AWS CLI. If you are adding default shared space settings to a Domain within an Amazon VPC, you must also include a list of security groups. Shared spaces only support the use of JupyterLab 3 image ARNs. For more information, see JupyterLab Versioning (p. 135).
# VPCOnly domain
aws --region region \
sagemaker update-domain \
--domain-id domain-id \
--default-space-settings "ExecutionRole=execution-role-arn,JupyterServerAppSettings={DefaultResourceSpec={InstanceType=system,SageMakerImageArn=sagemaker-image-arn}},SecurityGroups=[security-groups]"
Verify that the default shared space settings have been updated.
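For example, by describing the Domain and checking the DefaultSpaceSettings field in the output; domain-id is a placeholder:

aws sagemaker describe-domain \
--domain-id domain-id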
You cannot set the execution role of a shared space when creating or updating it. The DefaultDomainExecRole can only be set when creating or updating the Domain. Shared spaces only support the use of JupyterLab 3 image ARNs. For more information, see JupyterLab Versioning (p. 135).
To create a shared space from the AWS CLI, run the following command from the terminal of your local machine. The values shown are placeholders.

aws sagemaker create-space \
--domain-id domain-id \
--space-name space-name \
--space-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "SageMakerImageArn": "sagemaker-image-arn",
            "InstanceType": "system"
        }
    }
}'
Topics
• List shared spaces (p. 126)
• View shared space details (p. 126)
To view the details of a shared space from the AWS CLI, run the following command from the terminal of
your local machine.
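A sketch, with domain-id and space-name as placeholders:

aws sagemaker describe-space \
--domain-id domain-id \
--space-name space-name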
To edit the details of a shared space from the AWS CLI, run the following command from the terminal of your local machine. Shared spaces only support the use of JupyterLab 3 image ARNs. For more information, see JupyterLab Versioning (p. 135).
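A sketch, with placeholder values and an illustrative space-settings payload:

aws sagemaker update-space \
--domain-id domain-id \
--space-name space-name \
--space-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "SageMakerImageArn": "jupyter-server-3-image-arn"
        }
    }
}'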
Topics
• Console (p. 127)
• AWS CLI (p. 128)
Console
Complete the following procedure to delete a shared space in the Amazon SageMaker Domain from the
SageMaker console.
AWS CLI
To delete a shared space from the AWS CLI, run the following command from the terminal of your local
machine.
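A sketch, with domain-id and space-name as placeholders:

aws sagemaker delete-space \
--domain-id domain-id \
--space-name space-name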
Amazon SageMaker Studio
For information on the onboarding steps to sign in to SageMaker Studio, see Onboard to Amazon
SageMaker Domain (p. 37).
For the AWS Regions supported by SageMaker Studio, see Supported Regions and Quotas (p. 33).
Topics
• Studio Features (p. 128)
• Amazon SageMaker Studio UI Overview (p. 129)
• Launch Amazon SageMaker Studio (p. 133)
• JupyterLab Versioning (p. 135)
• Use the Amazon SageMaker Studio Launcher (p. 141)
• Use Amazon SageMaker Studio Notebooks (p. 144)
• Customize Amazon SageMaker Studio (p. 168)
• Perform Common Tasks in Amazon SageMaker Studio (p. 194)
• Amazon SageMaker Studio Pricing (p. 200)
• Troubleshooting Amazon SageMaker Studio (p. 201)
Studio Features
Studio includes the following features:
• SageMaker Autopilot
• SageMaker Clarify
• SageMaker Data Wrangler
• SageMaker Debugger
• SageMaker Experiments
• SageMaker Feature Store
• SageMaker JumpStart
• Amazon SageMaker Model Building Pipelines
• SageMaker Model Registry
• SageMaker Projects
• SageMaker Studio Notebooks
• SageMaker Studio Universal Notebook
The following image shows the default view upon launching Amazon SageMaker Studio. The left navigation panel displays all top-level categories of features, and a Studio Home page (p. 130) is open in the main working area. Come back to this central point of orientation by choosing the Home icon at any time, then selecting the Home node in the navigation menu.
Try the Getting started notebook for an in-product hands-on guide on how to set up and get familiar
with Amazon SageMaker Studio features. On the Quick actions section of the Studio Home page, choose
Open the Getting started notebook.
Note
This chapter is based on Studio's updated user interface (UI), available on version v5.38.x and above on JupyterLab 3.
• To retrieve your version of Studio UI, from the Studio Launcher, open a System Terminal, then
Topics
• Studio Home page (p. 130)
• Studio layout (p. 130)
Studio Home page
The Prebuilt and automated solutions help you get started quickly with SageMaker's low-code solutions, such as Amazon SageMaker JumpStart and Autopilot.
In Workflows and tasks, you can find a list of relevant tasks for each step of your ML workflow that takes you to the right tool for the job. For example, Transform, analyze, and export data takes you to Amazon SageMaker Data Wrangler and opens the workflow to create a new data flow, and View all experiments takes you to SageMaker Experiments and opens the experiments list view.
Upon Studio launch, the Home page is open in the main working area. You can customize your
SageMaker Home page by choosing Customize Layout at the top right of the Home tab.
Studio layout
The Amazon SageMaker Studio interface consists of a menu bar at the top, a collapsible left sidebar
displaying a variety of icons such as the Home icon and the File Browser, a status bar at the bottom
of the screen, and a central area divided horizontally into two panes. The left pane is a collapsible
navigation panel. The right pane, or main working area, contains one or more tabs for resources such as
launchers, notebooks, terminals, metrics, and graphs, and can be further divided.
On the right corner of the menu bar, you can report a bug in Studio or choose the notification icon to view notifications from Studio, such as new Studio versions and new SageMaker features. To update to a new version of Studio, see Shut Down and Update SageMaker Studio and Studio Apps (p. 198).
The following sections describe the Studio main user interface areas.
Left sidebar
The left sidebar includes the following icons. When hovering over an icon, a tooltip displays the icon
name. A single click on an icon opens up the left navigation panel with the described functionality. A
double click minimizes the left navigation panel.
Icon Description
Home
Choose the Home icon to open a top-level navigation menu in the left
navigation panel.
Using the Home navigation menu, you can discover and navigate to the right
tools for each step of your ML workflow. The menu also provides shortcuts
to quick-start solutions and learning resources such as documentation and
guided tutorials.
File Browser
The File Browser displays lists of your notebooks, experiments, trials, trial
components, endpoints, and low-code solutions.
Whether you are in a personal or shared space determines who has access to your files. You can identify which type of space you are in by looking at the top right corner. If you are in a personal app, you see a user icon followed by "[user_name] / Personal Studio"; if you are in a collaborative space, you see a globe icon followed by "[user_name] / [space_name]".
• Personal Studio app: A private Amazon EFS directory that only you can
access.
• Studio launcher: Choose the plus (+) sign on the menu at the top of the
file browser to open the Amazon SageMaker Studio Launcher.
• Upload files: Choose the Upload Files icon to add files to Studio, or drag and drop them from your desktop.
• Open files: Double-click a file to open the file in a new tab or right-click
and select Open.
• Panel management: To work in adjacent files, choose a tab that contains
a notebook, Python, or text file, then choose New View for File.
Running Terminals and Kernels
You can check the list of all the kernels and terminals currently running
across all notebooks, code consoles, and directories. You can shut down
individual resources, including notebooks, terminals, kernels, apps, and
instances. You can also shut down all resources in one of these categories at
the same time.
Git
You can connect to a Git repository and then access a full range of Git tools
and operations.
Table of Contents
You can navigate the structure of the currently open notebook or file using its generated table of contents.
Extensions
You can turn on and manage third-party JupyterLab extensions. You can
check the already installed extensions and search for extensions by typing
the name in the search bar. When you have found the extension you want to
install, choose Install. After installing your new extensions, be sure to restart
JupyterLab by refreshing your browser.
For example, choosing the Home icon displays the navigation menu. Choosing File browser lists all
the files and directories available in your workspace (notebooks, experiments, data flows, trials, trial
components, endpoints, or low-code solutions).
In the navigation menu, choosing a node brings up the corresponding feature page in the main working
area. For example, choosing Data Wrangler in the Data menu opens up the Data Wrangler tab listing all
existing flows.
Launch Amazon SageMaker Studio
Topics
• Launch Studio Using the Amazon SageMaker Console (p. 133)
• Launch Studio Using the AWS CLI (p. 134)
Launch Studio Using the Amazon SageMaker Console
Topics
• Prerequisite (p. 133)
• Launch Studio from the Domain details page (p. 133)
• Launch Studio from the Studio landing page (p. 133)
Prerequisite
To complete this procedure, you must onboard to a Domain by following the steps in Onboard to
Amazon SageMaker Domain.
The following procedure shows how to navigate to the Domain details page.
3. From the list of Domains, select the Domain that you want to launch the Studio application in.
The following procedure shows how to launch a Studio application that is scoped to a user profile.
The following procedure shows how to launch a Studio application that is scoped to a shared space.
Launch Studio Using the AWS CLI
Prerequisites
• Onboard to Amazon SageMaker Domain. For more information, see Onboard to Amazon SageMaker
Domain.
• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.
The following code snippet demonstrates how to launch Amazon SageMaker Studio from the AWS CLI
using a presigned Domain URL. For more information, see create-presigned-domain-url.
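A sketch, with domain-id and user-profile-name as placeholders; opening the URL returned by the command in a browser launches Studio:

aws sagemaker create-presigned-domain-url \
--domain-id domain-id \
--user-profile-name user-profile-name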
JupyterLab Versioning
The Amazon SageMaker Studio interface is based on JupyterLab, which is a web-based interactive
development environment for notebooks, code, and data. Studio now supports using both JupyterLab
1 and JupyterLab 3. The default version of JupyterLab in Studio is JupyterLab 3. If you created your
Amazon SageMaker Domain and user profile using the AWS Management Console before 08/31/2022
or using the AWS Command Line Interface before 02/22/23, then your Studio instance defaults to
JupyterLab 1. After 08/31/2022, JupyterLab version 1 on Amazon SageMaker Studio only receives
security fixes. You can choose the version that you want to run. However, you can run only a single
instance of JupyterLab at one time per user profile. You can’t run multiple versions of JupyterLab
simultaneously.
After 03/31/23, Studio only supports the creation of JupyterLab 3 applications. After that date, Studio stops supporting JupyterLab 1 application creation. On 04/30/2023, Studio removes all existing applications that run JupyterLab 1. Update your existing JupyterLab 1 applications to JupyterLab 3 before 04/30/2023 by following the steps in View and update the JupyterLab version of an application from the console (p. 140).
Topics
• JupyterLab 3 (p. 135)
• Restricting default JupyterLab version using an IAM policy condition key (p. 136)
• Setting a default JupyterLab version (p. 137)
• View and update the JupyterLab version of an application from the console (p. 140)
• Installing JupyterLab and Jupyter Server extensions (p. 140)
JupyterLab 3
JupyterLab 3 includes the following features that are not available in previous versions. For more
information about these features, see JupyterLab 3.0 is released!.
• Visual debugger when using the Base Python 2.0 and Data Science 2.0 kernels.
• File browser filter
• Table of Contents (TOC)
• Multi-language support
• Simple mode
• Single interface mode
• When setting the JupyterLab version using the AWS CLI, select the corresponding image for your
Region and JupyterLab version from the image list in From the AWS CLI (p. 137).
• In JupyterLab 3, you must activate the studio conda environment before installing extensions. For
more information, see Installing JupyterLab and Jupyter Server extensions (p. 140).
Restricting default JupyterLab version using an IAM policy condition key
The following policy shows how to limit the JupyterLab version at the Domain level.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Block users from creating JupyterLab 3 apps at the domain level",
"Effect": "Deny",
"Action": [
"sagemaker:CreateDomain",
"sagemaker:UpdateDomain"
],
"Resource": "*",
"Condition": {
"ForAnyValue:StringLike": {
"sagemaker:ImageArns": "*image/jupyter-server-3"
}
}
}
]
}
The following policy shows how to limit the JupyterLab version at the user profile level.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Block users from creating JupyterLab 3 apps at the user profile level",
"Effect": "Deny",
"Action": [
"sagemaker:CreateUserProfile",
"sagemaker:UpdateUserProfile"
],
"Resource": "*",
"Condition": {
"ForAnyValue:StringLike": {
"sagemaker:ImageArns": "*image/jupyter-server-3"
}
}
}
]
}
The following policy shows how to limit the JupyterLab version at the application level. The CreateApp
request must include the image ARN for this policy to apply.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Block users from creating JupyterLab 3 apps at the application level",
"Effect": "Deny",
"Action": "sagemaker:CreateApp",
"Resource": "*",
"Condition": {
"ForAnyValue:StringLike": {
"sagemaker:ImageArns": "*image/jupyter-server-3"
}
}
}
]
}
Setting a default JupyterLab version
From the AWS CLI
To set the default JupyterLab version using the AWS CLI, you must include the ARN of the desired default
JupyterLab version as part of an AWS CLI command. This ARN differs based on the version and the
Region of the SageMaker Domain.
The following table lists the ARNs of the available JupyterLab versions for each Region:
The following shows how to create a Domain with JupyterLab 3 as the default, using the AWS CLI:
The following shows how to update a Domain to use JupyterLab 3 as the default, using the AWS CLI:
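Sketches of both commands, assuming jupyter-server-3-image-arn stands in for the Region-specific JupyterLab 3 ARN and the remaining values are placeholders:

# Create a Domain with JupyterLab 3 as the default
aws sagemaker create-domain \
--domain-name domain-name \
--auth-mode IAM \
--vpc-id vpc-id \
--subnet-ids subnet-id \
--default-user-settings '{
    "ExecutionRole": "execution-role-arn",
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "SageMakerImageArn": "jupyter-server-3-image-arn",
            "InstanceType": "system"
        }
    }
}'

# Update an existing Domain to use JupyterLab 3 as the default
aws sagemaker update-domain \
--domain-id domain-id \
--default-user-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "SageMakerImageArn": "jupyter-server-3-image-arn",
            "InstanceType": "system"
        }
    }
}'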
You can set a default JupyterServer version at the user profile level
by invoking CreateUserProfile or UpdateUserProfile and passing
the UserSettings.JupyterServerAppSettings.DefaultResourceSpec.SageMakerImageArn
field.
The following shows how to create a user profile with JupyterLab 3 as the default on an existing Domain, using the AWS CLI:

aws sagemaker create-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "SageMakerImageArn": "jupyter-server-3-image-arn"
        }
    }
}'
The following shows how to update a user profile to use JupyterLab 3 as the default, using the AWS CLI:
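A sketch, mirroring the create command above; the values are placeholders:

aws sagemaker update-user-profile \
--domain-id domain-id \
--user-profile-name user-profile-name \
--user-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "SageMakerImageArn": "jupyter-server-3-image-arn"
        }
    }
}'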
Installing JupyterLab and Jupyter Server extensions
If you're reusing an existing lifecycle configuration script that must work with both versions of JupyterLab, use the following code in your script:
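A minimal sketch of such a script, assuming the studio conda environment exists only on JupyterLab 3 images and that your-package is a placeholder for the packages you install:

# Make conda's activate/deactivate available in this non-interactive script
eval "$(conda shell.bash hook)"

# Activate the studio environment only when it exists (JupyterLab 3)
if conda env list | grep -q studio; then
    conda activate studio
fi

# Install your packages here (your-package is a placeholder)
pip install your-package

# Deactivate only if the studio environment was activated
if [[ "$CONDA_DEFAULT_ENV" == "studio" ]]; then
    conda deactivate
fi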
If you're writing a new lifecycle configuration script that only uses JupyterLab 3, you can use the following code in your script:

conda activate studio
# Install your packages here
conda deactivate
Use the Amazon SageMaker Studio Launcher
You can open the Launcher in any of the following ways:
• Choose Amazon SageMaker Studio at the top left of the Studio interface.
• Use the keyboard shortcut Ctrl + Shift + L.
• From the Studio menu, choose File and then choose New Launcher.
• If the SageMaker file browser is open, choose the plus (+) sign in the Studio file browser menu.
• In the Quick actions section of the Home tab, choose Open Launcher. The Launcher opens in a new
tab. The Quick actions section is visible by default but can be toggled off. Choose Customize Layout
to turn this section back on.
Topics
• Notebooks and compute resources (p. 142)
• Utilities and files (p. 143)
1. Choose Change environment to select a SageMaker image, a kernel, an instance type, and, optionally, a lifecycle configuration script that runs on image start-up. For more information on lifecycle configuration scripts, see Use Lifecycle Configurations with Amazon SageMaker Studio (p. 182). For more information about kernel updates, see Change an Image or a Kernel (p. 159).
2. Select an item.
Note
When you choose an item from this section, you might incur additional usage charges. For more
information, see Usage Metering (p. 161).
• Notebook
Creates the notebook in the folder that you have currently selected in the file browser. To view the file
browser, in the left sidebar of Studio, choose the File Browser icon.
• Console
Opens the shell in the folder that you have currently selected in the file browser.
• Image terminal
Opens the terminal in the root folder for the user (as shown by the Home folder in the file browser).
Note
By default, CPU instances launch on an ml.t3.medium instance, while GPU instances launch on an ml.g4dn.xlarge instance.
Opens a new tab that displays contextual help for functions in a Studio notebook. To display the help,
choose a function in an active notebook. To make it easier to see the help in context, drag the help tab
so that it's adjacent to the notebook tab. To open the help tab from within a notebook, press Ctrl +
I.
The following screenshot shows the contextual help for the Experiment.create method.
• System terminal
Opens a bash shell in the root folder for the user (as shown by the Home folder in the file browser).
• Text File and Markdown File
Creates a file of the associated type in the folder that you have currently selected in the file browser. To view the file browser, in the left sidebar, choose the File Browser icon.
Use Amazon SageMaker Studio Notebooks
You can share your notebooks with others, so that they can easily reproduce your results and collaborate
while building models and exploring your data. You provide access to a read-only copy of the notebook
through a secure URL. Dependencies for your notebook are included in the notebook's metadata. When
your colleagues copy the notebook, it opens in the same environment as the original notebook.
• Amazon EC2 instance type – The hardware configuration the notebook runs on. The configuration
includes the number and type of processors (vCPU and GPU), and the amount and type of memory.
The instance type determines the pricing rate.
• SageMaker image – A container image that is compatible with SageMaker Studio. The image consists
of the kernels, language packages, and other files required to run a notebook in Studio. There can be
multiple images in an instance. For more information, see Bring your own SageMaker image (p. 169).
• KernelGateway app – A SageMaker image runs as a KernelGateway app. The app provides access to
the kernels in the image. There is a one-to-one correspondence between a SageMaker image and a
SageMaker app.
• Kernel – The process that inspects and runs the code contained in the notebook. A kernel is defined by
a kernel spec in the image. There can be multiple kernels in an image.
You can change any of these resources from within the notebook.
The following diagram outlines how a notebook kernel runs in relation to the KernelGateway App, User,
and Domain.
Sample SageMaker Studio notebooks are available in the aws_sagemaker_studio folder of the Amazon
SageMaker example GitHub repository. Each notebook comes with the necessary SageMaker image that
opens the notebook with the appropriate kernel.
We recommend that you familiarize yourself with the SageMaker Studio interface and the Studio
notebook toolbar before creating or using a Studio notebook. For more information, see Amazon
SageMaker Studio UI Overview (p. 129) and Use the Studio Notebook Toolbar (p. 150).
Topics
• How Are Amazon SageMaker Studio Notebooks Different from Notebook Instances? (p. 146)
• Get Started (p. 146)
• Amazon SageMaker Studio Tour (p. 147)
• Create or Open an Amazon SageMaker Studio Notebook (p. 148)
• Use the Studio Notebook Toolbar (p. 150)
• Install External Libraries and Kernels in Amazon SageMaker Studio (p. 152)
• Share and Use an Amazon SageMaker Studio Notebook (p. 154)
• Get Studio Notebook and App Metadata (p. 155)
• Get Notebook Differences (p. 157)
How Are Amazon SageMaker Studio Notebooks Different from Notebook Instances?
• Faster: Starting a Studio notebook is faster than launching an instance-based notebook. Typically, it is 5-10 times faster than instance-based notebooks.
• Easy notebook sharing: Notebook sharing is an integrated feature in Studio. Users can generate a
shareable link that reproduces the notebook code and also the SageMaker image required to execute
it, in just a few clicks.
• Latest Python SDK: Studio notebooks come pre-installed with the latest Amazon SageMaker Python
SDK.
• Access all Studio features: Studio notebooks are accessed from within Studio. This enables you to
build, train, debug, track, and monitor your models without leaving Studio.
• Persistent user directories: Each member of a Studio team gets their own home directory to store
their notebooks and other files. The directory is automatically mounted onto all instances and kernels
as they're started, so their notebooks and other files are always available. The home directories are
stored in Amazon Elastic File System (Amazon EFS) so that you can access them from other services.
• Direct access: When using IAM Identity Center, you use your IAM Identity Center credentials through a
unique URL to directly access Studio. You don't have to interact with the AWS Management Console to
run your notebooks.
• Optimized images: Studio notebooks are equipped with a set of predefined SageMaker image settings
to get you started faster.
Note
Studio notebooks don't support local mode. However, you can use a notebook instance to train a
sample of your dataset locally, and then use the same code in a Studio notebook to train on the
full dataset.
When you open a notebook in SageMaker Studio, the view is an extension of the JupyterLab interface.
The primary features are the same, so you'll find the typical features of a Jupyter notebook and
JupyterLab. For more information about the Studio interface, see Amazon SageMaker Studio UI
Overview (p. 129).
Get Started
To get started, you or your organization's administrator need to complete the Amazon SageMaker Studio
onboarding process. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
Depending on how you were onboarded, you access Studio in one of the following ways:
• You receive an email invitation to access Studio through your organization's IAM Identity Center, which includes a direct link to log in to Studio without having to use the Amazon SageMaker console. You can proceed to the section called “Next Steps” (p. 147).
• You receive a link to a shared Studio notebook, which includes a direct link to log in to Studio without having to use the SageMaker console. You can proceed to the section called “Next Steps” (p. 147).
• You onboard to Studio and then log in to the SageMaker console. For more information, see Onboard
to Amazon SageMaker Domain (p. 37).
Next Steps
Now that you're in Studio, you can try any of the following options:
• To create a Studio notebook or explore Studio end-to-end tutorial notebooks – See Amazon
SageMaker Studio Tour (p. 147) in the next section.
• To familiarize yourself with the Studio interface – See Amazon SageMaker Studio UI
Overview (p. 129) or try the Getting started notebook by selecting Open the Getting started
notebook in the Quick actions section of the Studio Home page.
Prerequisites
• An IAM account to sign in to Studio. For information, see Onboard to Amazon SageMaker
Domain (p. 37).
• Basic familiarity with the Studio user interface and Jupyter notebooks. For information, see Amazon
SageMaker Studio UI Overview (p. 129).
• A copy of the aws/amazon-sagemaker-examples repository in your Studio environment.
1. Sign in to Studio. For users in IAM Identity Center, sign in using the URL from your invitation email.
For IAM users, follow these steps.
2. Open the following notebook:
~/amazon-sagemaker-examples/aws_sagemaker_studio/getting_started/xgboost_customer_churn_studio.ipynb
3. Follow the notebook to learn about Studio's main features.
Note
If you encounter an error when you run the sample notebook, and some time has passed from
when you cloned the repository, review the notebook on the remote repository for updates.
If you create or open additional notebooks that use the same instance type, whether or not the
notebooks use the same kernel, the notebooks run on the same instance of that instance type.
After you launch a notebook, you can change its instance type, SageMaker image, and kernel from within
the notebook. For more information, see Change an Instance Type (p. 158) and Change an Image or a
Kernel (p. 159).
Note
You can have only one instance of each instance type. Each instance can have multiple
SageMaker images running on it. Each SageMaker image can run multiple kernels or terminal
instances.
Billing occurs per instance and starts when the first instance of a given instance type is launched. If
you want to create or open a notebook without the risk of incurring charges, open the notebook from
the File menu and choose No Kernel from the Select Kernel dialog. You can read and edit a notebook
without a running kernel but you can't run cells.
Billing ends when the SageMaker image for the instance is shut down. For more information, see Usage
Metering (p. 161).
For information about shutting down the notebook, see Shut Down Resources (p. 160).
Topics
• Open a notebook in Studio (p. 148)
• Create a Notebook from the File Menu (p. 149)
• Create a Notebook from the Launcher (p. 149)
• List of the available instance types, images, and kernels (p. 150)
To open a notebook
1. In the left sidebar, choose the File Browser icon to display the file browser.
2. Browse to a notebook file and double-click it to open the notebook in a new tab.
Create a Notebook from the File Menu
1. From the Studio menu, choose File, choose New, and then choose Notebook.
2. In the Change environment dialog, use the dropdown menus to select your Image, Kernel, Instance
type, and Start-up script, then choose Select. Your notebook launches and opens in a new Studio
tab.
Create a Notebook from the Launcher
1. To open the Launcher, choose Amazon SageMaker Studio at the top left of the Studio interface or use the keyboard shortcut Ctrl + Shift + L.
To learn about all the available ways to open the Launcher, see Use the Amazon SageMaker Studio
Launcher (p. 141)
2. In the Launcher, in the Notebooks and compute resources section, choose Change environment.
3. In the Change environment dialog, use the dropdown menus to select your Image, Kernel, Instance
type, and Start-up script, then choose Select.
4. In the Launcher, choose Create notebook. Your notebook launches and opens in a new Studio tab.
To view the notebook's kernel session, in the left sidebar, choose the Running Terminals and Kernels icon. You can stop the notebook's kernel session from this view.
Use the Studio Notebook Toolbar
The following image shows the toolbar and an empty cell from a Studio notebook.
When you pause on a toolbar icon, a tooltip displays the icon function. Additional notebook commands
are found in the Studio main menu. The toolbar includes the following icons:
Icon Description
Save notebook and create checkpoint
Saves the notebook and updates the checkpoint file. For more information, see Get the Difference Between the Last Checkpoint (p. 157).
Insert cell
Inserts a code cell below the current cell. The current cell is noted by the
blue vertical marker in the left margin.
Run cells
Runs the selected cells and then makes the cell that follows the last selected
cell the new selected cell.
Interrupt kernel
Interrupts the kernel, which cancels the currently running operation. The
kernel remains active.
Restart kernel
Restarts the kernel. Variables are reset. Unsaved information is not affected.
Restart kernel and run all cells
Restarts the kernel, then runs all the cells of the notebook.
Cell type
Displays or changes the current cell type. The cell types are Code, Markdown, and Raw.
Launch terminal
Opens a terminal in the SageMaker image that the notebook runs in.
Checkpoint diff
Opens a new tab that displays the difference between the notebook and the
checkpoint file. For more information, see Get the Difference Between the
Last Checkpoint (p. 157).
Git diff
Opens a new tab that displays the difference between the notebook and the last Git commit.
Instance type
Displays or changes the instance type the notebook runs on, displayed in the format 2 vCPU + 4 GiB.
Cluster
Connect your notebook to an Amazon EMR cluster and scale your ETL jobs
or run large-scale model training using Apache Spark, Hive, or Presto.
For more information, see Prepare data using Amazon EMR (p. 1164).
Displays the busy status of the kernel. When the edge of the circle and
its interior are the same color, the kernel is busy. The kernel is busy when
it is starting and when it is processing cells. Additional kernel states are
displayed in the status bar at the bottom-left corner of SageMaker Studio.
Share notebook
Shares the notebook. For more information, see Share and Use an Amazon
SageMaker Studio Notebook (p. 154).
To select multiple cells, click in the left margin outside of a cell. Hold down the Shift key and use K or
the Up key to select previous cells, or use J or the Down key to select following cells.
The different Jupyter kernels in Amazon SageMaker Studio notebooks are separate conda environments.
For information about conda environments, see Managing environments.
• Notebooks – The following commands are supported. If one of the following does not work on your
image, try the other one.
• %conda install
• %pip install
• The Jupyter terminal – You can install packages using pip and conda directly. You can also use apt-
get install to install system packages from the terminal.
Note
We do not recommend using pip install -u or pip install --user, because those
commands install packages on the user's Amazon EFS volume and can potentially block
JupyterServer app restarts. Instead, use a lifecycle configuration to reinstall the required
packages on app restarts as shown in Install packages using lifecycle configurations (p. 154).
We recommend using %pip and %conda to install packages from within a notebook because they
correctly take into account the active environment or interpreter being used. For more information, see
Add %pip and %conda magic functions. You can also use the system command syntax (lines starting
with !) to install packages. For example, !pip install and !conda install.
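For example, in a notebook cell (matplotlib is only an illustrative package):

# Installs into the environment of the notebook's active kernel
%pip install matplotlib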
Conda
Conda is an open source package management system and environment management system that can
install packages and their dependencies. SageMaker supports using conda with either of these two main
channels: the default channel or the conda-forge channel. For more information, see Conda channels.
The conda-forge channel is a community channel where contributors can upload packages.
Note
Installing packages from conda-forge can take up to 10 minutes. Timing relates to how conda
resolves the dependency graph.
All of the SageMaker-provided environments are functional. User-installed packages may not function correctly.
Conda has two methods for activating environments: conda activate and source activate. For more information, see Managing environments.
Pip
Pip is the tool for installing and managing Python packages. Pip searches for packages on the Python Package Index (PyPI) by default. Unlike conda, pip doesn't have built-in environment support. Therefore, pip isn't as thorough as conda when it comes to packages with native or system library dependencies. Pip can be used to install packages in conda environments. You can use alternative package repositories with pip instead of PyPI.
Unsupported
SageMaker aims to support as many package installation operations as possible. However, if the
packages were installed by SageMaker and you use the following operations on these packages, it might
make your environment unstable:
• Uninstalling
• Downgrading
• Upgrading
Due to potential issues with network conditions or configurations, or the availability of conda or PyPI, packages may not install in a fixed or deterministic amount of time.
Note
Attempting to install a package in an environment with incompatible dependencies can result
in a failure. If issues occur, you can contact the library maintainer about updating the package
dependencies. When you modify the environment, such as removing or updating existing
packages, this may result in instability of that environment.
Share and Use an Amazon SageMaker Studio Notebook
Topics
• Share a Notebook (p. 154)
• Use a Shared Notebook (p. 155)
Share a Notebook
The following screenshot shows the menu from a Studio notebook.
To share a notebook
• Include Git repo information – Includes a link to the Git repository that contains the notebook.
This enables you and your colleague to collaborate and contribute to the same Git repository.
• Include output – Includes all notebook output that has been saved.
Note
If you're a user in IAM Identity Center and you don't see these options, your IAM Identity Center administrator probably disabled the feature. Contact your administrator.
3. Choose Create.
4. After the snapshot is created, choose Copy link and then choose Close.
5. Share the link with your colleague.
After selecting your sharing options, you are provided with a URL. You can share this link with users that
have access to Amazon SageMaker Studio. When the user opens the URL, they're prompted to log in
using IAM Identity Center or IAM authentication. This shared notebook becomes a copy, so changes made
by the recipient will not be reproduced in your original notebook.
When you choose a link to a shared notebook for the first time, a read-only version of the notebook
opens. To edit the shared notebook, choose Create a Copy. This copies the shared notebook to your
personal storage.
The copied notebook launches on an instance of the instance type and SageMaker image that the
notebook was using when the sender shared it. If you aren't currently running an instance of the instance
type, a new instance is started. Customization to the SageMaker image isn't shared. You can also inspect
the notebook snapshot by choosing Snapshot Details.
The following are some important considerations about sharing and authentication:
• If you have an active session, you see a read-only view of the notebook until you choose Create a
Copy.
• If you don't have an active session, you need to log in.
• If you use IAM to log in, after you log in, select your user profile and then choose Open Studio. Then choose the link you were sent.
• If you use IAM Identity Center to log in, the shared notebook opens automatically in Studio after you log in.
Get Studio Notebook and App Metadata
Topics
• Get Studio Notebook Metadata (p. 155)
• Get App Metadata (p. 156)
Get Studio Notebook Metadata
1. In the right sidebar, choose the Property Inspector icon.
2. Open the Advanced Tools section.
{
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:<acct-id>:image/
datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
}
Get App Metadata
• AppType – KernelGateway
• DomainId – Same as the Studio ID
• UserProfileName – The profile name of the current user
• ResourceArn – The Amazon Resource Name (ARN) of the App, which includes the instance type
• ResourceName – The name of the SageMaker image
Additional metadata might be included for internal use by Studio and is subject to change.
1. In the center of the notebook menu, choose the Launch Terminal icon. This opens a terminal in the SageMaker image that the notebook runs in.
2. Run the following commands to display the contents of the resource-metadata.json file.
cd /opt/ml/metadata/
cat resource-metadata.json
{
"AppType": "KernelGateway",
"DomainId": "d-xxxxxxxxxxxx",
"UserProfileName": "profile-name",
"ResourceArn": "arn:aws:sagemaker:us-east-2:account-id:app/d-xxxxxxxxxxxx/profile-
name/KernelGateway/datascience--1-0-ml-t3-medium",
"ResourceName": "datascience--1-0-ml",
"AppImageVersion":""
}
Get Notebook Differences
Topics
• Get the Difference Between the Last Checkpoint (p. 157)
• Get the Difference Between the Last Commit (p. 158)
Get the Difference Between the Last Checkpoint
By default, a notebook is auto-saved every 120 seconds and also when you close the notebook. However, the checkpoint file isn't updated to match the notebook. To save the notebook and update the checkpoint file to match, you must choose the Save notebook and create checkpoint icon on the left of the notebook menu or use the Ctrl + S keyboard shortcut.
To view the changes between the notebook and the checkpoint file, choose the Checkpoint diff icon in the notebook menu.
To revert the notebook to the checkpoint file, from the main Studio menu, choose File then Revert Notebook to Checkpoint.
Get the Difference Between the Last Commit
To view the changes in the notebook from the last Git commit, choose the Git diff icon in the center of the notebook menu.
Manage Resources
You can change the instance type, SageMaker image, and kernel from within an Amazon SageMaker Studio notebook. To create a custom kernel to use with your notebooks, see Bring your own SageMaker image (p. 169).
Topics
• Change an Instance Type (p. 158)
• Change an Image or a Kernel (p. 159)
• Shut Down Resources (p. 159)
Change an Instance Type
You can change the instance type that your Studio notebook runs on from within the notebook. The following information only applies to Studio notebooks. For information about how to change the instance type of an Amazon SageMaker notebook instance, see Update a Notebook Instance (p. 212).
Important
If you change the instance type, unsaved information and existing settings for the notebook are
lost, and installed packages must be re-installed.
The previous instance type continues to run even if no kernel sessions or apps are active. You
must explicitly stop the instance to stop accruing charges. To stop the instance, see Shut Down
Resources (p. 160).
The following screenshot shows the menu from a Studio notebook. The processor and memory of the
instance type powering the notebook are displayed as 2 vCPU + 4 GiB.
For a list of the available instance types, see Available Studio Instance Types (p. 162).
Change an Image or a Kernel
The following screenshot shows the menu from a Studio notebook. The current SageMaker kernel and image are displayed as Python 3 (Data Science), where Python 3 denotes the kernel and Data Science denotes the SageMaker image that contains the kernel. The color of the circle to the right indicates whether the kernel is idle or busy. The kernel is busy when the center and the edge of the circle are the same color.
For a list of available SageMaker images, see Available Amazon SageMaker Images (p. 164).
For a list of available SageMaker kernels, see Available Amazon SageMaker Kernels (p. 167).
Shut Down Resources
Note
Amazon SageMaker Studio does not support shutting down resources from within a notebook.
Topics
• Shut Down an Open Notebook (p. 160)
• Shut Down Resources (p. 160)
Shut Down an Open Notebook
You can shut down an open notebook from the Amazon SageMaker Studio File menu or from the Running Terminals and Kernels pane.
Note
When you shut down a notebook, any unsaved information in the notebook is lost. The
notebook is not deleted.
1. Optionally, save the notebook contents by choosing the Disk icon on the left of the notebook menu.
2. Choose File then Close and Shutdown Notebook.
3. Choose OK.
Shut Down Resources
You can reach the Running Terminals and Kernels pane on the left side of Amazon SageMaker Studio by choosing the Running Terminals and Kernels icon. The Running Terminals and Kernels pane consists of four sections. Each section lists all the resources of that type. You can shut down each resource individually or shut down all the resources in a section at the same time.
When you choose to shut down all resources in a section, the following occurs:
• RUNNING INSTANCES/RUNNING APPS – All instances, apps, notebooks, kernel sessions, consoles/
shells, and image terminals are shut down. System terminals aren't shut down.
Note
When you shut down the Studio notebook instances, any additional resources, such as SageMaker endpoints, Amazon EMR clusters, and Amazon S3 buckets created from Studio, are not deleted. Delete those resources to stop accrual of charges.
• KERNEL SESSIONS – All kernels, notebooks and consoles/shells are shut down.
• TERMINAL SESSIONS – All image terminals and system terminals are shut down.
1.
In the left sidebar, choose the Running Terminals and Kernels icon ( ).
2. Do either of the following:
•
To shut down a specific resource, choose the Shut Down icon ( ) on the same row as the
resource.
For running instances, a confirmation dialog lists all the resources that will be shut down. For
running apps, a confirmation dialog is displayed. Choose Shut down all to proceed.
Note
No confirmation dialog is displayed for kernel sessions or terminal sessions.
• To shut down all resources in a section, choose the X to the right of the section label. A
confirmation dialog is displayed. Choose Shut down all to proceed.
Usage Metering
There is no additional charge for using Amazon SageMaker Studio. The costs incurred for running
Amazon SageMaker Studio notebooks, interactive shells, consoles, and terminals are based on Amazon
Elastic Compute Cloud (Amazon EC2) instance usage.
When you run the following resources, you must choose a SageMaker image and kernel:
• Notebook
• Interactive Shell
• Image Terminal
• Console
When launched, the resource is run on an Amazon EC2 instance of the chosen instance type. If an
instance of that type was previously launched and is available, the resource is run on that instance.
For CPU-based images, the default suggested instance type is ml.t3.medium. For GPU-based images,
the default suggested instance type is ml.g4dn.xlarge.
The costs incurred are based on the instance type. You are billed separately for each instance.
Metering starts when an instance is created. Metering ends when all the apps on the instance are shut
down, or the instance is shut down. For information about how to shut down an instance, see Shut Down
Resources (p. 159).
Important
You must shut down the instance to stop incurring charges. If you shut down the notebook
running on the instance but don't shut down the instance, you will still incur charges. When
you shut down the Studio notebook instances, any additional resources, such as SageMaker
endpoints, Amazon EMR clusters, and Amazon S3 buckets created from Studio, are not deleted.
Delete those resources to stop the accrual of charges.
When you open multiple notebooks on the same instance type, the notebooks run on the same instance
even if they are using different kernels. You are billed only for the time that one instance is running.
You can change the instance type from within the notebook after you open it. For more information, see
Change an Instance Type (p. 158).
For information about billing along with pricing examples, see Amazon SageMaker Pricing.
Available Resources
The following sections list the available resources for Amazon SageMaker Studio notebooks.
Topics
• Available Studio Instance Types (p. 162)
• Available Amazon SageMaker Images (p. 164)
• Available Amazon SageMaker Kernels (p. 167)
For detailed information on which instance types fit your use case, and their performance capabilities,
see Amazon Elastic Compute Cloud Instance types.
For information about available Amazon SageMaker Notebook Instance types, see
CreateNotebookInstance.
Note
For most use cases, you should use an ml.t3.medium instance. This is the default instance type for
CPU-based SageMaker images, and it is available as part of the AWS Free Tier.
>> Fast launch instance types are optimized to start in under two minutes.
• ml.m5.12xlarge
• ml.m5.16xlarge
• ml.m5.24xlarge
• ml.m5d.large
• ml.m5d.xlarge
• ml.m5d.2xlarge
• ml.m5d.4xlarge
• ml.m5d.8xlarge
• ml.m5d.12xlarge
• ml.m5d.16xlarge
• ml.m5d.24xlarge
• ml.r5.large
• ml.r5.xlarge
• ml.r5.2xlarge
• ml.r5.4xlarge
• ml.r5.8xlarge
• ml.r5.12xlarge
• ml.r5.16xlarge
• ml.r5.24xlarge
• ml.p3.2xlarge
• ml.p3.8xlarge
• ml.p3.16xlarge
• ml.p3dn.24xlarge
• ml.g4dn.xlarge >> Fast launch
• ml.g4dn.2xlarge
• ml.g4dn.4xlarge
• ml.g4dn.8xlarge
• ml.g4dn.12xlarge
• ml.g4dn.16xlarge
• ml.g5.xlarge
• ml.g5.2xlarge
• ml.g5.4xlarge
• ml.g5.8xlarge
• ml.g5.12xlarge
• ml.g5.24xlarge
• ml.g5.48xlarge
• Base Python [python-3.6]
Official Python 3.6 image from DockerHub with boto3 and AWS CLI included.
• Base Python 2.0 [sagemaker-base-python-38]
Official Python 3.8 image from DockerHub with boto3 and AWS CLI included.
• Base Python 3.0 [sagemaker-base-python-310-v1]
Official Python 3.10 image from DockerHub with boto3 and AWS CLI included.
• Data Science [datascience-1.0]
Data Science is a Python 3.7 conda image with the most commonly used Python packages and
libraries, such as NumPy and scikit-learn.
• Data Science 2.0 [sagemaker-data-science-38]
Data Science 2.0 is a Python 3.8 conda image based on Anaconda version 2021.11 with the most
commonly used Python packages and libraries, such as NumPy and scikit-learn.
• Data Science 3.0 [sagemaker-data-science-310-v1]
Data Science 3.0 is a Python 3.10 conda image based on Anaconda version 2022.10 with the most
commonly used Python packages and libraries, such as NumPy and scikit-learn.
• Amazon SageMaker geospatial [sagemaker-geospatial-1.0]
Amazon SageMaker geospatial is a Python image consisting of commonly used geospatial libraries
such as GDAL, Fiona, GeoPandas, Shapely, and Rasterio, and allows you to visualize geospatial data
within SageMaker. For more information, see Amazon SageMaker geospatial Notebook SDK
• SparkMagic [sagemaker-sparkmagic]
Anaconda Individual Edition with PySpark and Spark kernels. For more information, see sparkmagic.
• SparkAnalytics 1.0 [sagemaker-sparkanalytics-v1]
Anaconda Individual Edition with PySpark and Spark kernels. For more information, see sparkmagic.
• SparkAnalytics 2.0 [sagemaker-sparkanalytics-310-v1]
Anaconda Individual Edition with PySpark and Spark kernels. For more information, see sparkmagic.
• MXNet 1.6 Python 3.6 (optimized for CPU) [mxnet-1.6-cpu-py36]
The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.6 include containers for
training on CPU, optimized for performance and scale on AWS. For more information, see AWS Deep
Learning Containers for MXNet 1.6.0 .
• MXNet 1.6 Python 3.6 (optimized for GPU) [mxnet-1.6-gpu-py36]
The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.6 with CUDA 10.1
include containers for training on GPU, optimized for performance and scale on AWS. For more
information, see AWS Deep Learning Containers for MXNet 1.6.0 .
• MXNet 1.8 Python 3.7 (optimized for CPU) [mxnet-1.8-cpu-py37-ubuntu16.04-v1]
The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.8 include containers for
training on CPU, optimized for performance and scale on AWS. For more information, see AWS Deep
Learning Containers for AWS MX 1.8.0 .
• MXNet 1.8 Python 3.7 (optimized for GPU) [mxnet-1.8-gpu-py37-cu110-ubuntu16.04-v1]
The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.8 with CUDA 11.0
include containers for training on GPU, optimized for performance and scale on AWS. For more
information, see AWS Deep Learning Containers for AWS MX 1.8.0 .
• MXNet 1.9 Python 3.8 (optimized for CPU) [mxnet-1.9-cpu-py38-ubuntu20.04-sagemaker-v1.0]
The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.9 include containers for
training on CPU, optimized for performance and scale on AWS. For more information, see AWS Deep
Learning Containers for MX 1.9.0 on SageMaker .
• MXNet 1.9 Python 3.8 (optimized for GPU) [mxnet-1.9-gpu-py38-cu112-ubuntu20.04-sagemaker-v1.0]
The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.9 with CUDA 11.2
include containers for training on GPU, optimized for performance and scale on AWS. For more
information, see AWS Deep Learning Containers for MX 1.9.0 on SageMaker .
• PyTorch 1.10 Python 3.8 (optimized for CPU) [pytorch-1.10-cpu-py38]
The AWS Deep Learning Containers for PyTorch 1.10 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers for
PyTorch 1.10.2 on SageMaker .
• PyTorch 1.10 Python 3.8 (optimized for GPU) [pytorch-1.10-gpu-py38]
The AWS Deep Learning Containers for PyTorch 1.10 with CUDA 11.3 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.10.2 on SageMaker .
• PyTorch 1.4 Python 3.6 (optimized for CPU) [pytorch-1.4-cpu-py36]
The AWS Deep Learning Containers for PyTorch 1.4 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers v3.2 for
PyTorch .
• PyTorch 1.4 Python 3.6 (optimized for GPU) [pytorch-1.4-gpu-py36]
The AWS Deep Learning Containers for PyTorch 1.4 with CUDA 10.1 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v3.2 for PyTorch .
• PyTorch 1.6 Python 3.6 (optimized for CPU) [pytorch-1.6-cpu-py36-ubuntu16.04-v1]
The AWS Deep Learning Containers for PyTorch 1.6 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers for
PyTorch 1.6.0 .
• PyTorch 1.6 Python 3.6 (optimized for GPU) [pytorch-1.6-gpu-py36-cu110-ubuntu18.04-v3]
The AWS Deep Learning Containers for PyTorch 1.6 with CUDA 11.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.6.0 with CUDA 11.0 .
• PyTorch 1.8 Python 3.6 (optimized for CPU) [1.8.1-cpu-py36]
The AWS Deep Learning Containers for PyTorch 1.8 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers for
PyTorch 1.8.0 .
• PyTorch 1.8 Python 3.6 (optimized for GPU) [pytorch-1.8-gpu-py36]
The AWS Deep Learning Containers for PyTorch 1.8 with CUDA 11.1 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.8.0 .
• PyTorch 1.12 Python 3.8 (optimized for CPU) [pytorch-1.12-cpu-py38]
The AWS Deep Learning Containers for PyTorch 1.12 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers for
PyTorch 1.12.0 .
• PyTorch 1.12 Python 3.8 (optimized for GPU) [pytorch-1.12-gpu-py38]
The AWS Deep Learning Containers for PyTorch 1.12 with CUDA 11.3 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.12.0.
• TensorFlow 1.15 Python 3.6 (optimized for CPU) [tensorflow-1.15-cpu-py36]
The AWS Deep Learning Containers for TensorFlow 1.15 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers with TensorFlow 1.15.3 .
• TensorFlow 1.15 Python 3.6 (optimized for GPU) [tensorflow-1.15-gpu-py36]
The AWS Deep Learning Containers for TensorFlow 1.15 with CUDA 10.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers with TensorFlow 1.15.3 .
• TensorFlow 1.15 Python 3.7 (optimized for CPU) [tensorflow-1.15-cpu-py37-ubuntu18.04-v7]
The AWS Deep Learning Containers for TensorFlow 1.15 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v7.0 for TensorFlow .
• TensorFlow 1.15 Python 3.7 (optimized for GPU) [tensorflow-1.15-gpu-py37-cu110-ubuntu18.04-v8]
The AWS Deep Learning Containers for TensorFlow 1.15 with CUDA 11.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v7.0 for TensorFlow .
• TensorFlow 2.1 Python 3.6 (optimized for CPU) [tensorflow-2.1-cpu-py36]
The AWS Deep Learning Containers for TensorFlow 2.1 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v6.2 for Tensorflow .
• TensorFlow 2.1 Python 3.6 (optimized for GPU) [tensorflow-2.1-gpu-py36]
The AWS Deep Learning Containers for TensorFlow 2.1 with CUDA 10.1 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v6.2 for Tensorflow .
• TensorFlow 2.3 Python 3.7 (optimized for CPU) [tensorflow-2.3-cpu-py37-ubuntu18.04-v1]
The AWS Deep Learning Containers for TensorFlow 2.3 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers with TensorFlow 2.3.0 .
• TensorFlow 2.3 Python 3.7 (optimized for GPU) [tensorflow-2.3-gpu-py37-cu110-ubuntu18.04-v3]
The AWS Deep Learning Containers for TensorFlow 2.3 with CUDA 11.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for TensorFlow 2.3.1 with CUDA 11.0 .
• TensorFlow 2.6 Python 3.8 (optimized for CPU) [tensorflow-2.6-cpu-py38-ubuntu20.04-v1]
The AWS Deep Learning Containers for TensorFlow 2.6 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for TensorFlow 2.6 .
• TensorFlow 2.6 Python 3.8 (optimized for GPU) [tensorflow-2.6-gpu-py38-cu112-ubuntu20.04-v1]
The AWS Deep Learning Containers for TensorFlow 2.6 with CUDA 11.2 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for TensorFlow 2.6 .
• TensorFlow 2.10 Python 3.9 (optimized for CPU) [2.10.0-cpu-py39-ubuntu20.04-sagemaker-v1.0]
The AWS Deep Learning Containers for TensorFlow 2.10 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see Release Notes for
Deep Learning Containers.
• TensorFlow 2.10 Python 3.9 (optimized for GPU) [tensorflow-2.10-gpu-py39-cu112-ubuntu20.04-
sagemaker-v1]
The AWS Deep Learning Containers for TensorFlow 2.10 with CUDA 11.2 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see Release Notes for
Deep Learning Containers.
Data Science is a conda image with the most commonly used Python packages and libraries, such as
NumPy and scikit-learn.
• Bring your own SageMaker image: A SageMaker image is a file that identifies the kernels, language
packages, and other dependencies required to run a Jupyter notebook in Amazon SageMaker Studio.
Amazon SageMaker provides many built-in images for you to use. If you need different functionality,
you can bring your own custom images to Studio.
• Use Lifecycle Configurations with Amazon SageMaker Studio: Lifecycle Configurations are
shell scripts triggered by Amazon SageMaker Studio lifecycle events, such as starting a new
Studio notebook. You can use Lifecycle Configurations to automate customization for your Studio
environment. For example, you can install custom packages, configure notebook extensions, preload
datasets, and set up source code repositories.
• Attach suggested Git repos to Studio: You can attach suggested Git repository URLs at the Amazon
SageMaker Domain or user profile level. Then, you can select the repo URL from the list of suggestions
and clone it into your environment by using the Git extension in Studio.
• Persist Conda environments to the Studio Amazon EFS volume: Studio uses an Amazon EFS volume
as a persistent storage layer. You can save your Conda environment on this Amazon EFS volume, then
use the saved environment to create kernels. Studio automatically picks up all valid environments
saved in Amazon EFS as KernelGateway kernels. These kernels persist through restart of the kernel,
app, and Studio. For more information, see the Persist Conda environments to the Studio EFS
volume section in Four approaches to manage Python packages in Amazon SageMaker Studio
notebooks.
The following topics show how to use these options to customize your Amazon SageMaker Studio
environment.
Topics
• Bring your own SageMaker image (p. 169)
• Use Lifecycle Configurations with Amazon SageMaker Studio (p. 182)
If you need different functionality, you can bring your own custom images to Studio. You can create
images and image versions, and attach image versions to your domain or shared space, using the
SageMaker control panel, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS
CLI). You can also create images and image versions using the SageMaker console, even if you haven't
onboarded to a SageMaker domain. SageMaker provides sample Dockerfiles to use as a starting point for
your custom SageMaker images in the SageMaker Studio Custom Image Samples repository.
The following topics explain how to bring your own image using the SageMaker console or AWS CLI,
then launch the image in Studio. For a similar blog article, see Bringing your own R environment to
Amazon SageMaker Studio. For notebooks that show how to bring your own image for use in training
and inference, see Amazon SageMaker Studio Container Build CLI.
Key terminology
The following section defines key terms for bringing your own image to use with Studio.
• Dockerfile: A Dockerfile is a file that identifies the language packages and other dependencies for your
Docker image.
• Docker image: The Docker image is a built Dockerfile. This image is checked into Amazon ECR and
serves as the basis of the SageMaker image.
• SageMaker image: A SageMaker image is a holder for a set of SageMaker image versions based on
Docker images. Each image version is immutable.
• Image version: An image version of a SageMaker image represents a Docker image and is stored in an
Amazon ECR repository. Each image version is immutable. These image versions can be attached to a
domain or shared space and used with Studio.
Topics
• Custom SageMaker image specifications (p. 169)
• Prerequisites (p. 171)
• Add a Docker image compatible with Studio to Amazon ECR (p. 171)
• Create a custom SageMaker image (p. 172)
• Attach a custom SageMaker image (p. 175)
• Launch a custom SageMaker image in Amazon SageMaker Studio (p. 180)
• Clean up resources (p. 181)
Running the image
ENTRYPOINT and CMD instructions are overridden to enable the image to run as a KernelGateway
app.
Port 8888 in the image is reserved for running the KernelGateway web server.
Stopping the image
The DeleteApp API issues the equivalent of a docker stop command. Other processes in the
container won’t get the SIGKILL/SIGTERM signals.
Kernel discovery
You can specify a list of kernels to display before running the image. If not specified, python3 is
displayed. Use the DescribeAppImageConfig API to view the list of kernels.
File system data
The /opt/.sagemakerinternal and /opt/ml directories are reserved. Any data in these
directories might not be visible at runtime.
User data
Each user in a domain gets a user directory on a shared Amazon Elastic File System volume in the
image. The location of the current user's directory on the Amazon EFS volume is configurable. By
default, the location of the directory is /home/sagemaker-user.
SageMaker configures POSIX UID/GID mappings between the image and the host. This defaults to
mapping the root user's UID/GID (0/0) to the UID/GID on the host.
Amazon SageMaker Studio only supports the following DefaultUID and DefaultGID
combinations:
• DefaultUID: 1000 and DefaultGID: 100, which corresponds to a non-privileged user.
• DefaultUID: 0 and DefaultGID: 0, which corresponds to root access.
GPU
On a GPU instance, the image is run with the --gpus option. Only the CUDA toolkit should be
included in the image, not the NVIDIA drivers. For more information, see NVIDIA User Guide.
Metrics and logging
Logs from the KernelGateway process are sent to Amazon CloudWatch in the customer’s account.
The name of the log group is /aws/sagemaker/studio. The name of the log stream is
$domainID/$userProfileName/KernelGateway/$appName.
Image size
Limited to 25 GB. To view the size of your image, run docker image ls.
Sample Dockerfile
The following sample Dockerfile creates an image based on Amazon Linux 2, installs third-party
packages and the python3 kernel, and sets the scope to the non-privileged user.
FROM public.ecr.aws/amazonlinux/amazonlinux:2

ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"

RUN yum install --assumeyes python3 shadow-utils && \
    useradd --create-home --shell /bin/bash --gid "${NB_GID}" --uid ${NB_UID} ${NB_USER} && \
    yum clean all && \
    python3 -m pip install ipykernel && \
    python3 -m ipykernel install

USER ${NB_UID}
Prerequisites
You must satisfy the following prerequisites to bring your own container for use with Amazon SageMaker
Studio.
• The Docker application. For information about setting up Docker, see Orientation and setup.
• Install the AWS CLI by following the steps in Getting started with the AWS CLI.
• A local copy of any Dockerfile for creating a Studio compatible image. For sample custom images, see
the SageMaker Studio custom image samples repository.
• Permissions to access the Amazon Elastic Container Registry (Amazon ECR) service. For more
information, see Amazon ECR Managed Policies.
• An AWS Identity and Access Management execution role that has the AmazonSageMakerFullAccess
policy attached. If you have onboarded to an Amazon SageMaker domain, you can get the role from the
Domain Summary section of the SageMaker control panel.
• Install the Studio image build CLI by following the steps in SageMaker Docker Build. This CLI enables
you to build a Dockerfile using AWS CodeBuild.
Note
The Amazon ECR repository must be in the same AWS Region as Studio.
1. Create an Amazon ECR repository using the AWS CLI. To create the repository using the Amazon ECR
console, see Creating a repository.
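A minimal sketch of that command, assuming the repository name smstudio-custom that appears in
the output below:

aws ecr create-repository \
    --repository-name smstudio-custom

The response is similar to the following.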
{
"repository": {
"repositoryArn": "arn:aws:ecr:us-east-2:acct-id:repository/smstudio-custom",
"registryId": "acct-id",
"repositoryName": "smstudio-custom",
"repositoryUri": "acct-id.dkr.ecr.us-east-2.amazonaws.com/smstudio-custom",
...
}
}
2. Build the Dockerfile using the Studio image build CLI. The period (.) specifies that the current
directory, which contains the Dockerfile, is the build context. This command builds the image,
uploads the built image to the Amazon ECR repository, and then outputs the image URI.
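A hedged sketch of the build command; the repository name and tag are placeholders, and the
sm-docker CLI comes from the Studio image build CLI listed in the prerequisites:

sm-docker build . --repository smstudio-custom:custom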
When you create an image from the console, SageMaker also creates an initial image version. The image
version represents a container image in Amazon Elastic Container Registry (ECR). The container image
must satisfy the requirements to be used in Amazon SageMaker Studio. For more information, see
Custom SageMaker image specifications (p. 169). For information on testing your image locally and
resolving common issues, see the SageMaker Studio Custom Image Samples repo.
After you have created your custom SageMaker image, you must attach it to your domain or shared
space to use it with Studio. For more information, see Attach a custom SageMaker image (p. 175).
To create an image
acct-id.dkr.ecr.region.amazonaws.com/repo-name[:tag] or [@digest]
5. Choose Next.
6. Under Image properties, enter the following:
• Image name – The name must be unique to your account in the current AWS Region.
• (Optional) Display name – The name displayed in the Studio user interface. When not provided,
Image name is displayed.
• (Optional) Description – A description of the image.
• IAM role – The role must have the AmazonSageMakerFullAccess policy attached. Use the
dropdown menu to choose one of the following options:
• Create a new role – Specify any additional Amazon Simple Storage Service (Amazon S3) buckets
that you want users of your notebooks to have access to. If you don't want to allow access to
additional buckets, choose None.
SageMaker attaches the AmazonSageMakerFullAccess policy to the role. The role allows
users of your notebooks access to the S3 buckets listed next to the checkmarks.
• Enter a custom IAM role ARN – Enter the Amazon Resource Name (ARN) of your IAM role.
• Use existing role – Choose one of your existing roles from the list.
• (Optional) Image tags – Choose Add new tag. You can add up to 50 tags. Tags are searchable
using the Studio user interface, the SageMaker console, or the SageMaker Search API.
7. Choose Submit.
The new image is displayed in the Custom images list and briefly highlighted. After the image has been
successfully created, you can choose the image name to view its properties or choose Create version to
create another version.
• Create an Image.
• Create an ImageVersion.
• Create a configuration file.
• Create an AppImageConfig.
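The following is a hedged sketch of the first three calls; the image name custom-image matches the
responses below, while the role ARN and Amazon ECR image URI are placeholders:

# Create the SageMaker image. The role must have the AmazonSageMakerFullAccess policy attached.
aws sagemaker create-image \
    --image-name custom-image \
    --role-arn arn:aws:iam::acct-id:role/execution-role

# Create an image version from the container image in Amazon ECR.
aws sagemaker create-image-version \
    --image-name custom-image \
    --base-image acct-id.dkr.ecr.us-east-2.amazonaws.com/smstudio-custom:custom

# Confirm that the image version reached the CREATED status.
aws sagemaker describe-image-version --image-name custom-image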
{
"ImageArn": "arn:aws:sagemaker:us-east-2:acct-id:image/custom-image"
}
{
"ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/custom-
image/1"
}
{
"ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/custom-
image/1",
"ImageVersionStatus": "CREATED"
}
Note
If the response is "ImageVersionStatus": "CREATED_FAILED", the response also
includes the failure reason. A permissions issue is a common cause of failure. You also
can check your Amazon CloudWatch logs if you experience a failure when starting or
running the KernelGateway app for a custom image. The name of the log group is /aws/
sagemaker/studio. The name of the log stream is $domainID/$userProfileName/
KernelGateway/$appName.
4. Create a configuration file, named app-image-config-input.json. The Name value of
KernelSpecs must match the name of the kernelSpec available in the Image associated with this
AppImageConfig. This value is case sensitive. You can find the available kernelSpecs in an image
by running jupyter-kernelspec list from a shell inside the container. MountPath is the path
within the image to mount your Amazon Elastic File System (Amazon EFS) home directory. It needs
to be different from the path you use inside the container because that path will be overridden when
your Amazon EFS home directory is mounted.
Note
The following DefaultUid and DefaultGid combinations are the only accepted values:
1000/100 (a non-privileged user) and 0/0 (root access).
{
"AppImageConfigName": "custom-image-config",
"KernelGatewayImageConfig": {
"KernelSpecs": [
{
"Name": "python3",
"DisplayName": "Python 3 (ipykernel)"
}
],
"FileSystemConfig": {
"MountPath": "/home/sagemaker-user",
"DefaultUid": 1000,
"DefaultGid": 100
}
}
}
5. Create the AppImageConfig using the file created in the previous step.
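A hedged sketch of that call:

aws sagemaker create-app-image-config \
    --cli-input-json file://app-image-config-input.json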
{
"AppImageConfigArn": "arn:aws:sagemaker:us-east-2:acct-id:app-image-config/custom-
image-config"
}
To make a custom SageMaker image available to all users within a domain, you attach the image to the
domain. To make an image available to all users within a shared space, you can attach the image to the
shared space. To make an image available to a single user, you attach the image to the user's profile.
When you attach an image, SageMaker uses the latest image version by default. You can also attach a
specific image version. After you attach the version, you can choose the version from the SageMaker
Launcher or the image selector when you launch a notebook.
There is a limit to the number of image versions that can be attached at any given time. After you reach
the limit, you must detach a version in order to attach another version of the image.
The following sections demonstrate how to attach a custom SageMaker image to your domain using
either the SageMaker console or the AWS CLI. You can only attach a custom image to a shared space
using the AWS CLI.
This topic describes how you can attach an existing custom SageMaker image version to your domain
using the SageMaker control panel. You can also create a custom SageMaker image and image version,
and then attach that version to your domain. For the procedure to create an image and image version,
see Create a custom SageMaker image (p. 172).
9. Choose Next.
10. Verify the values for Image name, Image display name, and Description.
11. Choose the IAM role. For more information, see Create a custom SageMaker image (p. 172).
12. (Optional) Add tags for the image.
13. Specify the EFS mount path. This is the path within the image to mount the user's Amazon Elastic
File System (EFS) home directory.
14. For Image type, select SageMaker Studio image.
15. For Kernel name, enter the name of an existing kernel in the image. For information on how to
get the kernel information from the image, see DEVELOPMENT in the SageMaker Studio Custom
Image Samples repository. For more information, see the Kernel discovery and User data sections
of Custom SageMaker image specifications (p. 169).
16. (Optional) For Kernel display name, enter the display name for the kernel.
17. Choose Add kernel.
18. Choose Submit.
• Wait for the image version to be attached to the domain. When attached, the version is
displayed in the Custom images list and briefly highlighted.
The following sections demonstrate how to attach a custom SageMaker image when creating a new
domain or updating your existing domain using the AWS CLI.
The following section demonstrates how to create a new domain with the version attached. These steps
require that you specify the Amazon Virtual Private Cloud (VPC) information and execution role required
to create the domain. You perform the following steps to create the domain and attach the custom
SageMaker image:
1. Get the ID of your default Amazon VPC. The response is similar to the following:
vpc-xxxxxxxx
2. Get your default subnet IDs using the VPC ID from the previous step.
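Hedged sketches of both lookups; the filter and query expressions are assumptions, and any method of
finding your default VPC and its subnets works:

# Step 1: Get the ID of the default VPC.
aws ec2 describe-vpcs \
    --filters Name=isDefault,Values=true \
    --query "Vpcs[0].VpcId" --output text

# Step 2: List the subnet IDs in that VPC.
aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=vpc-xxxxxxxx \
    --query "Subnets[*].SubnetId" --output json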
[
"subnet-b55171dd",
"subnet-8a5f99c6",
"subnet-e88d1392"
]
3. Create a configuration file named create-domain-input.json. Insert the VPC ID, subnet IDs,
ImageName, and AppImageConfigName from the previous steps. Because ImageVersionNumber
isn't specified, the latest version of the image is used, which is the only version in this case.
{
"DomainName": "domain-with-custom-image",
"VpcId": "<vpc-id>",
"SubnetIds": [
"<subnet-ids>"
],
"DefaultUserSettings": {
"ExecutionRole": "<execution-role>",
"KernelGatewayAppSettings": {
"CustomImages": [
{
"ImageName": "custom-image",
"AppImageConfigName": "custom-image-config"
}
]
}
},
"AuthMode": "IAM"
}
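4. Create the domain from the configuration file. A hedged sketch of the call:

aws sagemaker create-domain --cli-input-json file://create-domain-input.json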
{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx",
"Url": "https://d-xxxxxxxxxxxx.studio.us-east-2.sagemaker.aws/..."
}
If you have onboarded to a SageMaker domain, you can attach the custom image to your current
domain. For more information about onboarding to a SageMaker domain, see Onboard to Amazon
SageMaker Domain (p. 37). You don't need to specify the VPC information and execution role when
attaching a custom image to your current domain. After you attach the version, you must delete all the
apps in your domain and reopen Studio. For information about deleting the apps, see Delete an Amazon
SageMaker Domain (p. 116).
You perform the following steps to add the SageMaker image to your current domain.
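The JSON response that follows comes from describing your domain; a hedged sketch of that call, with a
placeholder domain ID:

aws sagemaker describe-domain --domain-id d-xxxxxxxxxxxx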
{
"DomainId": "d-xxxxxxxxxxxx",
"DefaultUserSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
],
...
}
}
}
7. Save the default user settings section of the response to a file named default-user-
settings.json.
8. Insert the ImageName and AppImageConfigName from the previous steps as a custom image.
Because ImageVersionNumber isn't specified, the latest version of the image is used, which is the
only version in this case.
{
"DefaultUserSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
{
"ImageName": "string",
"AppImageConfigName": "string"
}
],
...
}
}
}
9. Use the domain ID and default user settings file to update your domain.
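A hedged sketch of the update call, with a placeholder domain ID:

aws sagemaker update-domain \
    --domain-id d-xxxxxxxxxxxx \
    --default-user-settings file://default-user-settings.json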
{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}
You can only attach the SageMaker image to a shared space using the AWS CLI. After you attach the
version, you must delete all of the applications in your shared space and reopen Studio. For information
about deleting the apps, see Delete an Amazon SageMaker Domain (p. 116).
You perform the following steps to add the SageMaker image to a shared space.
{
"DomainId": "d-xxxxxxxxxxxx",
...
"DefaultSpaceSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
],
...
}
}
}
7. Save the default space settings section of the response to a file named default-space-
settings.json.
8. Insert the ImageName and AppImageConfigName from the previous steps as a custom image.
Because ImageVersionNumber isn't specified, the latest version of the image is used, which is the
only version in this case.
{
"DefaultSpaceSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
{
"ImageName": "string",
"AppImageConfigName": "string"
}
],
...
}
}
}
9. Use the domain ID and default space settings file to update your domain.
{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}
After you create the custom SageMaker image and attach it to your domain, the image appears in the
Environment tab of the Domain. You can only view the attached images for shared spaces using the
AWS CLI by using the following command.
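A hedged sketch of that command; the query expression is an assumption that narrows the response to
the custom image list:

aws sagemaker describe-domain \
    --domain-id d-xxxxxxxxxxxx \
    --query "DefaultSpaceSettings.KernelGatewayAppSettings.CustomImages"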
1. In Amazon SageMaker Studio, open the Launcher. To open the Launcher, choose Amazon
SageMaker Studio at the top left of the Studio interface or use the keyboard shortcut Ctrl +
Shift + L.
To learn about all the available ways to open the Launcher, see Use the Amazon SageMaker Studio
Launcher (p. 141).
2. In the Launcher, in the Notebooks and compute resources section, choose Change environment.
3. In the Change environment dialog, use the dropdown menus to select your Image from the Custom
Image section, and your Kernel, then choose Select.
4. In the Launcher, choose Create notebook or Open image terminal. Your notebook or terminal
launches in the selected custom image and kernel.
To change your image or kernel in an open notebook, see Change an Image or a Kernel (p. 159).
Note
If you encounter an error when launching the image, check your Amazon CloudWatch logs.
The name of the log group is /aws/sagemaker/studio. The name of the log stream is
$domainID/$userProfileName/KernelGateway/$appName.
Clean up resources
The following sections show how to clean up the resources you created in the previous sections from the
SageMaker console or AWS CLI. You perform the following steps to clean up the resources:
The following section shows how to clean up resources from the SageMaker console.
When you detach an image from a domain, all versions of the image are detached. When an image
is detached, all users of the domain lose access to the image versions. A running notebook that has a
kernel session on an image version when the version is detached, continues to run. When the notebook is
stopped or the kernel is shut down, the image version becomes unavailable.
To detach an image
1. In the Control Panel, under Custom SageMaker Studio images attached to domain, choose the
image and then choose Detach.
2. (Optional) To delete the image and all versions from SageMaker, select Also delete the selected
images .... This does not delete the associated container images from Amazon ECR.
3. Choose Detach.
The following section shows how to clean up resources from the AWS CLI.
To clean up resources
1. Detach the image and image versions from your domain by passing an empty custom image list to
the domain. Open the default-user-settings.json file you created in Attach the SageMaker
image to your current domain (p. 177). To detach the image and image version from a shared
space, open the default-space-settings.json file.
2. Delete the custom images and then save the file.
"DefaultUserSettings": {
181
Amazon SageMaker Developer Guide
Customize Studio
"KernelGatewayAppSettings": {
"CustomImages": [
],
...
},
...
}
3. Use the domain ID and default user settings file to update your domain. To update your shared
space, use the default space settings file.
{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}
5. Delete the SageMaker image, which also deletes all image versions. The container images in ECR
that are represented by the image versions are not deleted.
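A hedged sketch of the delete call, assuming the image name used earlier:

aws sagemaker delete-image --image-name custom-image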
Using Lifecycle Configurations gives you flexibility and control to configure Studio to meet your specific
needs. For example, you can create a minimal set of base container images with the most commonly
used packages and libraries, then use Lifecycle Configurations to install additional packages for specific
use cases across your data science and machine learning teams.
For example Lifecycle Configuration scripts, see the Studio Lifecycle Configuration examples GitHub
repository. For a blog on implementing Lifecycle Configurations, see Customize Amazon SageMaker
Studio using Lifecycle Configurations.
Note
Each script has a limit of 16384 characters.
Topics
• Creating and Associating a Lifecycle Configuration (p. 183)
• Setting Default Lifecycle Configurations (p. 187)
• Debugging Lifecycle Configurations (p. 189)
• Updating and deleting Lifecycle Configurations (p. 190)
• JupyterServer applications: This application type enables access to the visual interface for Studio.
Every user in Studio gets their own JupyterServer application.
• KernelGateway applications: This application type enables access to the code run environment and
kernels for your Studio notebooks and terminals. For more information, see Jupyter Kernel Gateway.
For more information about Studio's architecture and Studio applications, see Use Amazon SageMaker
Studio Notebooks.
Topics
• Create a Lifecycle Configuration from the AWS CLI (p. 183)
• Create a Lifecycle Configuration from the SageMaker Console (p. 185)
The following topic shows how to create a lifecycle configuration using the AWS CLI to automate
customization for your Studio environment.
Prerequisites
• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.
• Onboard to Amazon SageMaker Studio. For more information, see Onboard to Amazon SageMaker
Studio.
The following procedure shows how to create a lifecycle configuration script that prints Hello World.
1. From your local machine, create a file named my-script.sh with the following content.
#!/bin/bash
set -eux
echo 'Hello World!'
2. Convert your my-script.sh file into base64 format. This requirement prevents errors that occur
from spacing and line break encoding.
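A hedged sketch of the conversion; it stores the encoded script in the LCC_CONTENT variable that the
next step passes to the create call:

LCC_CONTENT=$(openssl base64 -A -in my-script.sh)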
3. Create a Studio lifecycle configuration. The following command creates a lifecycle configuration that
runs when you launch an associated KernelGateway application.
aws sagemaker create-studio-lifecycle-config \
--region region \
--studio-lifecycle-config-name my-studio-lcc \
--studio-lifecycle-config-content $LCC_CONTENT \
--studio-lifecycle-config-app-type KernelGateway
Note the ARN of the newly created lifecycle configuration that is returned. This ARN is required to
attach the lifecycle configuration to your application.
Step 2: Attach the Lifecycle Configuration to your Studio domain, user profile, or shared space
To attach the lifecycle configuration, you must update the UserSettings for your Studio domain or an
individual user profile, or the SpaceSettings for a shared space. Lifecycle configuration scripts that are
associated at the domain level are inherited by all users. However, scripts that are associated at the user
profile level are scoped to a specific user, while scripts that are associated at the shared space level are
scoped to the shared space.
The following example shows how to create a new user profile with the lifecycle configuration attached.
To update an existing user profile, use the update-user-profile command.
Add the lifecycle configuration ARN from the previous step to the settings for the appropriate AppType.
For example, place it in the JupyterServerAppSettings of the user. You can add multiple lifecycle
configurations at the same time by using a list of lifecycle configurations.
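A hedged sketch of such a call, with placeholder names; because the lifecycle configuration in this
example targets KernelGateway apps, its ARN is placed in KernelGatewayAppSettings:

aws sagemaker create-user-profile \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name my-user \
    --user-settings '{
        "KernelGatewayAppSettings": {
            "LifecycleConfigArns": ["arn:aws:sagemaker:region:acct-id:studio-lifecycle-config/my-studio-lcc"]
        }
    }'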
The following example shows how to update an existing shared space to attach the lifecycle
configuration. The lifecycle configuration specified as part of DefaultResourceSpec indicates which
lifecycle configuration is automatically attached to new applications created in the shared space.
After you attach a lifecycle configuration to a user profile or space, the user can select it when launching
an application using the AWS CLI. This section describes how to launch an application with an attached
lifecycle configuration.
Launch the application and specify the lifecycle configuration ARN in the ResourceSpec argument of
the CreateApp API.
• The following example shows how to create a JupyterServer application. When creating the app-
type JupyterServer, the app-name must be default.
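A hedged sketch with placeholder identifiers, assuming a lifecycle configuration whose app type
matches the app being created:

aws sagemaker create-app \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name my-user \
    --app-type JupyterServer \
    --app-name default \
    --resource-spec LifecycleConfigArn=arn:aws:sagemaker:region:acct-id:studio-lifecycle-config/my-studio-lcc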
The following topic shows how to create a lifecycle configuration from the Amazon SageMaker console
to automate customization for your Studio environment.
Prerequisites
Before you can begin this tutorial, complete the following prerequisite:
• Onboard to Amazon SageMaker Studio. For more information, see Onboard to Amazon SageMaker
Studio.
You can create a lifecycle configuration by entering a script from the Amazon SageMaker console.
The following procedure shows how to create a lifecycle configuration script that prints Hello World.
#!/bin/bash
set -eux
echo 'Hello World!'
Lifecycle configuration scripts associated at the domain level are inherited by all users. However, scripts
that are associated at the user profile level are scoped to a specific user.
The following sections show how to attach a lifecycle configuration to your domain and user profile.
The following shows how to attach a lifecycle configuration to your existing domain in Studio.
The following shows how to attach a lifecycle configuration to your existing user profile.
After you attach a lifecycle configuration to a user profile, the user can select it when launching an
application using the Studio Launcher. The following procedure describes how to launch an application
with an attached lifecycle configuration.
You can view the logs for your lifecycle configuration after it has been attached to a Studio domain or
user profile.
1. First, provide access to CloudWatch for your AWS Identity and Access Management (IAM) role. Add
read permissions for the following log group /aws/sagemaker/studio and for the following log
stream <Domain>/<UserProfile>/<AppType>/<AppName>/LifecycleConfigOnStart. For
information about adding permissions, see Enabling logging from certain AWS services.
2.
From within Studio, navigate to the Running Terminals and Kernels icon to monitor your
lifecycle configuration.
3. Select an application from the list of running applications.
Lifecycle Configuration script runs automatically when the user logs into Studio for the first time
or restarts Studio. This can be used to automate one-time set-up actions for the Studio developer
environment, such as installing notebook extensions or setting up a GitHub repo. For an example of
this, see Customize Amazon SageMaker Studio using Lifecycle Configurations.
Note
A default KernelGateway Lifecycle Configuration specified in DefaultResourceSpec applies
to all KernelGateway images in the Studio Domain unless the user selects a different script from
the list presented in the Studio launcher. The default script also runs if No Script is selected
by the user. For more information on selecting a script, see Step 3: Launch an application with
the Lifecycle Configuration (p. 186).
To associate a Lifecycle Configuration when creating a new Studio Domain or UserProfile, you need the
ARN of the Lifecycle Configuration that you created. This ARN is passed to one of the following API calls:
• create-user-profile
• create-domain
For example, the following API call creates a new UserProfile with an associated Lifecycle Configuration.
To associate a Lifecycle Configuration when updating an existing Studio Domain or UserProfile, you need
the ARN of the Lifecycle Configuration that you created. This ARN is passed to one of the following API
calls:
• update-user-profile
• update-domain
The Lifecycle Configuration ARN should be placed in 2 places, the DefaultResourceSpec and the
LifecycleConfigArns list in KernelGatewayAppSettings. For example, the following API call
updates a UserProfile with an associated Lifecycle Configuration.
aws sagemaker update-user-profile \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name my-user \
    --user-settings '{
        "KernelGatewayAppSettings": {
            "DefaultResourceSpec": {
                "LifecycleConfigArn": "arn:aws:sagemaker:region:acct-id:studio-lifecycle-config/my-studio-lcc"
            },
            "LifecycleConfigArns": ["arn:aws:sagemaker:region:acct-id:studio-lifecycle-config/my-studio-lcc"]
        }
    }'
Topics
• Verify Lifecycle Configuration Process from Amazon CloudWatch Logs (p. 189)
• JupyterServer App failure (p. 189)
• KernelGateway App failure (p. 190)
• Lifecycle Config timeout (p. 190)
Lifecycle Configurations log only STDOUT and STDERR. STDOUT is the default output for bash scripts,
while STDERR can be written to by appending >&2 to the end of a bash command. For example, echo
'hello' >&2. Logs for your Lifecycle Configurations are published to your AWS account via CloudWatch.
These logs can be found in the /aws/sagemaker/studio log group in the AWS CloudWatch console, in
log streams named as follows:
<DomainId>/<UserProfileName>/<AppType>/<AppName>
For example, to find the Lifecycle Configuration logs for Domain d-m85lcu8vbqmz, UserProfile i-
sonic-js, Apptype JupyterServer and AppName test-lcc-echo, use the following search
string:
d-m85lcu8vbqmz/i-sonic-js/JupyterServer/test-lcc-echo
6. Select the log stream appended with LifecycleConfigOnStart to view the script execution logs.
If your JupyterServer App crashes because of an issue with the attached Lifecycle Configuration, Studio
displays the following error message on the Studio startup screen.
Click the View script logs link to view the CloudWatch logs for your JupyterServer app.
In the case where the faulty Lifecycle Configuration is specified in the DefaultResourceSpec of your
Studio Domain or UserProfile, Studio continues to use the Lifecycle Configuration even after restarting
Studio.
To resolve this error, follow the steps in Setting Default Lifecycle Configurations (p. 187) to remove the
Lifecycle Configuration script from the DefaultResourceSpec or select another script using the AWS
CLI. Then launch a new JupyterServer app.
If your KernelGateway App crashes because of an issue with the attached Lifecycle Configuration, Studio
displays the error message in your Studio Notebook.
Click the View script logs link to view the CloudWatch logs for your KernelGateway app.
In this case, your Lifecycle Configuration is specified in the Studio Launcher when launching a new Studio
Notebook.
To resolve this error, use the Studio launcher to select a different Lifecycle Configuration or select No
script.
Note
A default KernelGateway Lifecycle Configuration specified in DefaultResourceSpec applies
to all KernelGateway images in the Studio Domain unless the user selects a different script from
the list presented in the Studio launcher. The default script also runs if No Script is selected
by the user. For more information on selecting a script, see Step 3: Launch an application with
the Lifecycle Configuration (p. 186).
There is a Lifecycle Configuration timeout limitation of 5 minutes. If a Lifecycle Configuration script takes
longer than 5 minutes to run, Studio throws an error.
To resolve this error, ensure that your Lifecycle Configuration script completes in less than 5 minutes.
To reduce the script's run time, try the following:
• Reduce the number of steps. For example, limit the conda environments in which you install large
packages.
• Run tasks in parallel processes.
• Use the nohup command in your script.
The following topics show how to attach Git repo URLs to a Domain or user profile from the AWS CLI and
SageMaker console. You'll also learn how to detach these repository URLs.
Topics
Prerequisites
• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.
• Onboard to Amazon SageMaker Domain. For more information, see Onboard to Amazon SageMaker
Domain (p. 37).
Git repo URLs associated at the Domain level are inherited by all users. However, Git repo URLs that are
associated at the user profile level are scoped to a specific user. You can attach multiple Git repo URLs to
a Domain or user profile by passing a list of repository URLs.
The following sections show how to attach a Git repo URL to your Domain and user profile.
Attach to a Domain
The following shows how to attach a Git repo URL to an existing Domain.
Attach to a user profile
The following shows how to attach a Git repo URL to an existing user profile.
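Hedged sketches of both the Domain-level and user-profile-level calls; the domain ID, user profile name,
and repository URL are placeholders, and the CodeRepositories list belongs to
JupyterServerAppSettings:

# Attach a Git repo URL at the Domain level.
aws sagemaker update-domain \
    --domain-id d-xxxxxxxxxxxx \
    --default-user-settings '{
        "JupyterServerAppSettings": {
            "CodeRepositories": [{"RepositoryUrl": "https://github.com/my-org/my-repo.git"}]
        }
    }'

# Attach a Git repo URL at the user profile level.
aws sagemaker update-user-profile \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name my-user \
    --user-settings '{
        "JupyterServerAppSettings": {
            "CodeRepositories": [{"RepositoryUrl": "https://github.com/my-org/my-repo.git"}]
        }
    }'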
Prerequisites
Before you can begin this tutorial, you must onboard to Amazon SageMaker Domain. For more
information, see Onboard to Amazon SageMaker Domain (p. 37).
Git repo URLs associated at the Domain level are inherited by all users. However, Git repo URLs that are
associated at the user profile level are scoped to a specific user.
The following sections show how to attach a Git repo URL to a Domain and user profile.
Attach to a Domain
The following shows how to attach a Git repository URL to an existing Domain.
Attach to a user profile
The following shows how to attach a Git repository URL to an existing user profile.
Topics
• Detach a Git repo using the AWS CLI (p. 192)
• Detach the Git repo using the SageMaker console (p. 193)
To detach all Git repo URLs from a Domain or user profile, you must pass an empty list of code
repositories. This list is passed as part of the JupyterServerAppSettings parameter in an update-
domain or update-user-profile command. To detach only one Git repo URL, pass the code
repositories list without the desired Git repo URL. This section shows how to detach all Git repo URLs
from your Domain or user profile using the AWS Command Line Interface (AWS CLI).
The following command detaches all Git repo URLs from a Domain.
The following command detaches all Git repo URLs from a user profile.
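Hedged sketches of both detach calls, passing the empty CodeRepositories list described above; the IDs
and names are placeholders:

aws sagemaker update-domain \
    --domain-id d-xxxxxxxxxxxx \
    --default-user-settings '{"JupyterServerAppSettings": {"CodeRepositories": []}}'

aws sagemaker update-user-profile \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name my-user \
    --user-settings '{"JupyterServerAppSettings": {"CodeRepositories": []}}'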
The following sections show how to detach a Git repo URL from a Domain or user profile using the
SageMaker console.
Use the following steps to detach a Git repo URL from an existing Domain.
Use the following steps to detach a Git repo URL from a user profile.
Topics
• Upload Files to SageMaker Studio (p. 194)
• Clone a Git Repository in SageMaker Studio (p. 194)
• Stop a Training Job in SageMaker Studio (p. 195)
• Use TensorBoard in Amazon SageMaker Studio (p. 195)
• Using CodeWhisperer and CodeGuru extensions with SageMaker (p. 197)
• Manage Your Amazon EFS Storage Volume in SageMaker Studio (p. 198)
• Provide Feedback on SageMaker Studio (p. 198)
• Shut Down and Update SageMaker Studio and Studio Apps (p. 198)
1.
In the left sidebar, choose the File Browser icon ( ).
2.
In the file browser, choose the Upload Files icon ( ).
3. Select the files you want to upload and then choose Open.
4. Double-click a file to open the file in a new tab in Studio.
1.
In the left sidebar, choose the File Browser icon ( ).
2. Choose the root folder or the folder you want to clone the repo into.
3.
In the left sidebar, choose the Git icon ( ).
1. Follow the View, search, and compare experiment runs (p. 1592) procedure on this page until you
open the Describe Trial Component tab.
2. At the upper-right side of the tab, choose Stop training job. The Status at the top left of the tab
changes to Stopped.
3. To view the training time and billing time, choose AWS Settings.
Prerequisites
This tutorial requires an Amazon SageMaker Studio Domain. For more information, see Onboard to
Amazon SageMaker Domain (p. 37)
Set Up TensorBoardCallback
1. Launch Studio, and open the Launcher. For more information, see Use the Amazon SageMaker
Studio Launcher (p. 141)
2. In the Amazon SageMaker Studio Launcher, under Notebooks and compute resources, choose
the Change environment button.
3. On the Change environment dialog, use the dropdown menus to select the TensorFlow 2.3
Python 3.7 (optimized for CPU) Studio Image.
4. Back in the Launcher, choose the Create notebook tile. Your notebook launches and opens in a new
Studio tab.
5. Run this code from within your notebook cells.
6. Import the required packages.
import os
import datetime
import tensorflow as tf
7. Load and normalize the MNIST dataset, and define the log directory for TensorBoard.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# LOG_DIR is an assumption: any writable, timestamped directory on the Studio EFS volume works.
LOG_DIR = os.path.join(os.getcwd(), "logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
8. Define and compile the model.
def create_model():
    return tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
9. Train the model with a TensorBoard callback that writes logs to LOG_DIR.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR,
                                                      histogram_freq=1)
model.fit(x=x_train,
          y=y_train,
          epochs=5,
          validation_data=(x_test, y_test),
          callbacks=[tensorboard_callback])
10. Generate the EFS path for the TensorBoard logs. You use this path to set up your logs from the
terminal.
EFS_PATH_LOG_DIR = "/".join(LOG_DIR.strip("/").split('/')[1:-1])
print (EFS_PATH_LOG_DIR)
Retrieve the EFS_PATH_LOG_DIR. You will need it in the TensorBoard installation section.
Install TensorBoard
1. Click on the Amazon SageMaker Studio button on the top left corner of Studio to open
the Amazon SageMaker Studio Launcher. This launcher must be opened from your root directory.
For more information, see Use the Amazon SageMaker Studio Launcher (p. 141)
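A hedged sketch of the remaining steps, assuming you run them from a system terminal opened from
the Launcher and substitute the EFS_PATH_LOG_DIR value printed earlier:

pip install tensorboard
tensorboard --logdir ${EFS_PATH_LOG_DIR}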
Launch TensorBoard
1. To launch TensorBoard, copy your Studio URL and replace lab? with proxy/6006/ as follows. You
must include the trailing / character.
https://<YOUR_URL>.studio.region.sagemaker.aws/jupyter/default/proxy/6006/
The following extensions support writing code by generating code recommendations and suggesting
improvements related to code issues:
• Amazon CodeWhisperer
• Amazon CodeGuru
For more information, see the Setting up CodeWhisperer with Amazon SageMaker Studio.
CodeGuru Security helps you improve the security of your code.
From SageMaker, you can call CodeGuru Security by using the open-source Jupyter plugin. You can use
CodeGuru Security to scan notebooks for a variety of issues that can affect the security, correctness,
reproducibility, maintainability, and performance of your code. For more information, see Get started
with the Amazon CodeGuru Extension for JupyterLab and SageMaker Studio.
For information on how to access the Amazon EFS volume, see Using file systems in Amazon EFS.
To delete the Amazon EFS volume, see Deleting an Amazon EFS file system.
To provide feedback
Amazon SageMaker does not update Amazon SageMaker Studio apps while they are in service.
Studio provides a notification icon ( ) in the upper-right corner of the Studio UI. This notification icon
displays the number of unread notices. To read the notices, select the icon.
• Upgrade – Displayed when Studio or one of the Studio apps have released a new version. To update
Studio, see Shut down and Update SageMaker Studio (p. 199). To update Studio apps, see Shut down
and Update Studio Apps (p. 200).
• Information – Displayed for new features and other information.
To reset the notification icon, select the link in each notice. Read notices might still be counted by the
icon; this does not mean that updates are still needed after you have updated Studio and the Studio
apps.
To learn how to update Amazon SageMaker Data Wrangler, see Shut down and Update Studio
Apps (p. 200).
To ensure that you have the most recent software updates, update Amazon SageMaker Studio and your
Studio apps using the methods outlined in the following topics.
Topics
• Shut down and Update SageMaker Studio (p. 199)
• Shut down and Update Studio Apps (p. 200)
Any unsaved notebook information is lost in the process. The user data in the Amazon EFS volume isn't
impacted.
Some of the services within Studio, like Data Wrangler, run on their own app. To update these
services you must delete the app for that service. To learn more, see Shut down and Update Studio
Apps (p. 200).
Note
A JupyterServer app is associated with a single Studio user. When you update the app for one
user it doesn't affect other users.
The following topic shows how to update the JupyterServer App from the SageMaker console or from
inside Studio.
1. Navigate to https://console.aws.amazon.com/sagemaker/.
2. Choose Domains.
3. Select the Domain that includes the Studio application that you want to update.
4. Under User profiles, select your user name.
5. Under Apps, in the row displaying JupyterServer, choose Action, then choose Delete.
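You can also delete the app by using the AWS CLI instead of the console. The following sketch uses placeholder values for the domain ID and user profile name; default is the typical name of the JupyterServer app.
aws sagemaker delete-app \
    --domain-id <domain-id> \
    --user-profile-id <user-profile-name> \
    --app-type JupyterServer \
    --app-name default
To update the app from inside Studio, use the following procedure.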
1. Launch Studio.
2. On the top menu, choose File then Shut Down.
3. Choose one of the following options:
• Shutdown Server – Shuts down the JupyterServer app. Terminal sessions, kernel sessions,
SageMaker images, and instances aren't shut down. These resources continue to accrue charges.
• Shutdown All – Shuts down all apps, terminal sessions, kernel sessions, SageMaker images, and
instances. These resources no longer accrue charges.
4. Close the window.
5. After the app has been deleted, launch a new Studio app to use the latest version.
1. Navigate to https://console.aws.amazon.com/sagemaker/.
2. Choose Domains.
3. Select the Domain that includes the application that you want to update.
4. Under User profiles, select your user name.
5. Under Apps, in the row displaying the App name, choose Action, then choose Delete.
To update Data Wrangler, delete the app that starts with sagemaker-data-wrang.
6. Choose Yes, delete app.
7. Type delete in the confirmation box.
8. Choose Delete.
9. After the app has been deleted, launch a new kernel from within Studio to use the latest version.
When a member of the team opens Studio, a home directory is created in the volume for that member. A storage charge is incurred for this directory. Subsequently, additional storage
charges are incurred for the notebooks and data files stored in the member's home directory. For pricing
information on Amazon EFS, see Amazon EFS Pricing.
Additional costs are incurred when other operations are run inside Studio, for example, running a
notebook, running training jobs, and hosting a model.
For information on the costs associated with using Studio notebooks, see Usage Metering (p. 161).
For information about billing along with pricing examples, see Amazon SageMaker Pricing.
When launching the Studio application, a pop-up displays the following message. No matter which
option is selected, Studio does not load.
Loading...
The loading screen is taking a long time. Would you like to clear the workspace or keep
waiting?
The Studio application can have a launch delay if multiple tabs are open in the Studio workspace
or several files are on Amazon EFS. This pop-up should disappear in a few seconds after the Studio
workspace is ready.
If you continue to see a loading screen with a spinner after selecting either of the options, there could
be connectivity issues with the Amazon Virtual Private Cloud used by Studio.
To resolve connectivity issues with the Amazon Virtual Private Cloud (Amazon VPC) used by Studio,
verify the following networking configurations:
• If your domain is set up in VpcOnly mode: Verify that there is an Amazon VPC endpoint for AWS
STS, or a NAT Gateway for outbound traffic, including traffic over the internet. To do this, follow the
steps in Connect SageMaker Studio Notebooks in a VPC to External Resources (p. 3209).
• If your Amazon VPC is set up with a custom DNS instead of the DNS provided by Amazon: Verify that
the routes are configured using Dynamic Host Configuration Protocol (DHCP) for each Amazon VPC
endpoint added to the Amazon VPC used by Studio. For more information about setting default and
custom DHCP option sets, see DHCP option sets in Amazon VPC.
• Internal Failure when launching Studio
When launching Studio, you are unable to view the Studio UI, and you see an error with Internal Failure as the error detail.
This error can be caused by multiple factors. If completing these steps does not resolve your issue, create a support case at https://aws.amazon.com/premiumsupport/.
• Missing Amazon EFS mount target: Studio uses Amazon EFS for storage. The Amazon EFS volume
needs a mount target for each subnet that the Amazon SageMaker domain is created in. If this
Amazon EFS mount target is deleted accidentally, the Studio application cannot load because it
cannot mount the user’s file directory. To resolve this issue, complete the following steps.
1. Find the Amazon EFS volume that is associated with the domain by using the DescribeDomain API call, as shown in the example following these steps.
2. Sign in to the AWS Management Console and open the Amazon EFS console at https://
console.aws.amazon.com/efs/.
3. From the list of Amazon EFS volumes, select the Amazon EFS volume that is associated with the
domain.
4. On the Amazon EFS details page, select the Network tab. Verify that there are mount targets
for all of the subnets that the domain is set up in.
5. If mount targets are missing, add the missing Amazon EFS mount targets. For instructions, see
Creating and managing mount targets and security groups.
6. After the missing mount targets are created, launch the Studio application.
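For step 1, the following AWS CLI sketch shows one way to find the Amazon EFS file system for a domain; the domain ID is a placeholder.
aws sagemaker describe-domain \
    --domain-id <domain-id> \
    --query 'HomeEfsFileSystemId'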
• Conflicting files in the user’s .local folder: If you're using JupyterLab version 1 on Studio,
conflicting libraries in your .local folder can cause issues when launching the Studio application.
To resolve this, update your user profile's default JupyterLab version to JupyterLab 3.0.
For more information about viewing and updating the JupyterLab version, see JupyterLab
Versioning (p. 135).
• ConfigurationError: LifecycleConfig when launching Studio
You can't view the Studio UI when launching Studio. This is caused by issues with the default lifecycle
configuration script attached to the domain.
1. View the Amazon CloudWatch Logs for the lifecycle configuration to trace the command that
caused the failure. To view the log, follow the steps in Verify Lifecycle Configuration Process from
Amazon CloudWatch Logs (p. 189).
2. Detach the default script from the user profile or domain. For more information, see Updating and
deleting Lifecycle Configurations (p. 190).
3. Launch the Studio application.
4. Debug your lifecycle configuration script. You can run the lifecycle configuration script from the
system terminal to troubleshoot. When the script runs successfully from the terminal, you can
attach the script to the user profile or the domain.
• SageMaker Studio core functionalities are not available.
If you get this error message when opening Studio, it may be due to Python package version conflicts.
This occurs if you used the following commands in a notebook or terminal to install Python packages
that have version conflicts with SageMaker package dependencies.
!pip install
The problem should be resolved once you uninstall the package that caused the conflict. To install packages without causing this issue again, use %pip install without the --user flag.
If the issue persists, create a new user profile and set up your environment with that user profile.
If you are unable to open Studio and cannot create a new running instance with all default settings, create a support case at https://aws.amazon.com/premiumsupport/.
When the user launches a new notebook, they are unable to connect to the notebook session. If the KernelGateway application's status is InService, you can verify the following to resolve the issue.
• Check Security Group configurations
If the domain is set up in VpcOnly mode, the security group associated with the domain must allow TCP traffic on ports in the range 8192-65535 for connectivity between the JupyterServer and KernelGateway apps.
1. Get the security groups associated with the domain by using the DescribeDomain API call, as shown in the example following these steps.
2. Sign in to the AWS Management Console and open the Amazon VPC console at https://
console.aws.amazon.com/vpc/.
3. From the left navigation, under Security, choose Security Groups.
4. Filter by the IDs of the security groups that are associated with the domain.
5. For each security group, verify that its inbound rules allow TCP traffic on ports 8192-65535 from the security groups associated with the domain.
For more information about security group rules, see Control traffic to resources using security
groups. For more information about requirements to use Studio in VPCOnly mode, see Connect
SageMaker Studio Notebooks in a VPC to External Resources (p. 3209).
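For step 1, the following AWS CLI sketch retrieves the security groups configured for the domain's users and then inspects their rules; the domain and security group IDs are placeholders.
aws sagemaker describe-domain \
    --domain-id <domain-id> \
    --query 'DefaultUserSettings.SecurityGroups'
aws ec2 describe-security-groups \
    --group-ids <security-group-id>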
• Verify firewall and WebSocket connections
If the KernelGateway apps have an InService status and the user is unable to connect to the
Studio notebook session, verify the firewall and WebSocket settings.
1. Launch the Studio application. For more information, see Launch Amazon SageMaker
Studio (p. 133).
2. Open your web browser’s developer tools.
3. Choose the Network tab.
4. In the list of network requests, locate the WebSocket request of the following form.
wss://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/api/kernels/<unique-code>/channels?session_id=<unique-code>
If the status or response code for the entry is anything other than 101, then your network settings are preventing the connection between the Studio application and the KernelGateway apps.
To resolve this issue, contact the team that manages your networking settings to allow list the
Studio URL and enable WebSocket connections.
• Unable to launch an app caused by exceeded resource quotas
When a user tries to launch a new notebook, the notebook creation fails with either of the following
errors. This is caused by exceeding resource quotas.
• Unable to start more Apps of AppType [KernelGateway] and ResourceSpec(instanceType=[]) for UserProfile []. Please delete an App with a matching AppType and ResourceSpec, then try again
Studio supports up to four running KernelGateway apps on the same instance. To resolve this issue,
you can do either of the following:
• Delete an existing KernelGateway application running on the instance, then restart the new
notebook.
• Start the new notebook on a different instance type.
• An error indicating that the account-level service quota for the specified instance type has been reached.
In this case, the account does not have sufficient quota to create a Studio application on the
specified instance type. To resolve this, navigate to the Service Quotas console at https://
console.aws.amazon.com/servicequotas/. In that console, request to increase the Studio
KernelGateway Apps running on instance-type instance limit. For more information,
see AWS service quotas.
SageMaker also provides sample notebooks that contain complete code walkthroughs. These
walkthroughs show how to use SageMaker to perform common machine learning tasks. For more
information, see Example Notebooks (p. 220).
Topics
• Amazon Linux 2 vs Amazon Linux notebook instances (p. 205)
• JupyterLab versioning (p. 208)
• Create a Notebook Instance (p. 209)
• Access Notebook Instances (p. 212)
• Update a Notebook Instance (p. 212)
Notebook instances based on AL1 will enter a maintenance phase as of 12/01/2022. To replace AL1, you
now have the option to create Amazon SageMaker notebook instances with AL2. The AL1 maintenance
phase also coincides with the deprecation of Python 2 and Chainer. Notebooks based on AL2 do not have
managed Python 2 and Chainer kernels.
Supported instances
Amazon Linux 2 supports instances listed under Notebook Instances in Amazon SageMaker Pricing with
the exception that Amazon Linux 2 does not support ml.p2 instances.
Available Kernels
notebook-al1-v1: The following kernels are available in notebook instances based on the Amazon
Linux platform. These notebook instances support JupyterLab version 1. For information about
JupyterLab versions, see JupyterLab versioning (p. 208).
Kernel Name
Sparkmagic (PySpark)
Sparkmagic (Spark)
Sparkmagic (SparkR)
conda_amazonei_mxnet_p27
conda_amazonei_mxnet_p36
conda_amazonei_pytorch_latest_p36
conda_amazonei_tensorflow2_p27
conda_amazonei_tensorflow2_p36
conda_amazonei_tensorflow_p27
conda_amazonei_tensorflow_p36
conda_chainer_p27
conda_chainer_p36
conda_mxnet_latest_p37
conda_mxnet_p27
conda_mxnet_p36
conda_python2
conda_python3
conda_pytorch_latest_p36
conda_pytorch_p27
conda_pytorch_p36
conda_tensorflow2_p36
conda_tensorflow_p27
conda_tensorflow_p36
notebook-al2-v1: The following kernels are available in notebook instances based on the Amazon
Linux 2 platform. These notebook instances support JupyterLab version 1. For information about
JupyterLab versions, see JupyterLab versioning (p. 208).
Kernel Name
Sparkmagic (PySpark)
Sparkmagic (Spark)
Sparkmagic (SparkR)
conda_amazonei_mxnet_p36
conda_amazonei_pytorch_latest_p37
conda_amazonei_tensorflow2_p36
conda_mxnet_p38
conda_python3
conda_pytorch_p39
conda_tensorflow2_p310
notebook-al2-v2: The following kernels are available in notebook instances based on the Amazon
Linux 2 platform. These notebook instances support JupyterLab version 3. For information about
JupyterLab versions, see JupyterLab versioning (p. 208).
Kernel Name
Sparkmagic (PySpark)
Sparkmagic (Spark)
Sparkmagic (SparkR)
conda_amazonei_pytorch_latest_p37
conda_mxnet_p38
conda_python3
conda_pytorch_p39
conda_tensorflow2_p310
JupyterLab versioning
The Amazon SageMaker notebook instance interface is based on JupyterLab, which is a web-based
interactive development environment for notebooks, code, and data. Notebooks now support using
either JupyterLab 1 or JupyterLab 3. A single notebook instance runs at most one instance of JupyterLab. You can have multiple notebook instances, each with a different JupyterLab version.
You can configure your notebook to run your preferred JupyterLab version by selecting the appropriate
platform identifier. Use either the AWS CLI or the SageMaker console when creating your notebook
instance. For more information about platform identifiers, see Amazon Linux 2 vs Amazon Linux
notebook instances. If you don’t explicitly configure a platform identifier, your notebook instance
defaults to running JupyterLab 1.
Topics
• JupyterLab 3 (p. 208)
• Creating a notebook with your JupyterLab version (p. 209)
• View the JupyterLab version of a notebook from the console (p. 209)
JupyterLab 3
JupyterLab 3 support is available only on the Amazon Linux 2 operating system platform. JupyterLab 3
includes the following features that are not available in JupyterLab 1. For more information about these
features, see JupyterLab 3.0 is released!.
• nbserverproxy 0.x (0.3.2) has been replaced with jupyter-server-proxy 3.x (3.2.1).
You can also select the JupyterLab version by passing the platform-identifier parameter when creating your notebook instance using the AWS CLI. For example, the following sketch creates a notebook instance that runs JupyterLab 3 on Amazon Linux 2; the instance name, instance type, and role ARN are placeholders:
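aws sagemaker create-notebook-instance \
    --notebook-instance-name MyNotebookInstance \
    --instance-type ml.t3.medium \
    --role-arn arn:aws:iam::<account-id>:role/<role-name> \
    --platform-identifier notebook-al2-v2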
The notebook instance type you choose depends on how you use your notebook instance. You want to
ensure that your notebook instance is not bound by memory, CPU, or IO. If you plan to load a dataset
into memory on the notebook instance for exploration or preprocessing, we recommend that you
choose an instance type with enough RAM memory for your dataset. This would require an instance
with at least 16 GB of memory (.xlarge or larger). If you plan to use the notebook for compute intensive
preprocessing, we recommend you choose a compute-optimized instance such as a c4 or c5.
A best practice when using a SageMaker notebook is to use the notebook instance to orchestrate other
AWS services. For example, you can use the notebook instance to manage large dataset processing by
making calls to AWS Glue for ETL (extract, transform, and load) services or Amazon EMR for mapping
and data reduction using Hadoop. You can use AWS services as temporary forms of computation or
storage for your data.
You can store and retrieve your training and test data using an Amazon S3 bucket. You can then use
SageMaker to train and build your model, so the instance type of your notebook would have no bearing
on the speed of your model training and testing.
• Creates a network interface—If you choose the optional VPC configuration, SageMaker creates the
network interface in your VPC. It uses the subnet ID that you provide in the request to determine which Availability Zone to create the network interface in. SageMaker associates the security group that you
provide in the request with the subnet. For more information, see Connect a Notebook Instance in a
VPC to External Resources (p. 3211).
• Launches an ML compute instance—SageMaker launches an ML compute instance in a SageMaker
VPC. SageMaker performs the configuration tasks that allow it to manage your notebook instance, and
if you specified your VPC, it enables traffic between your VPC and the notebook instance.
• Installs Anaconda packages and libraries for common deep learning platforms—SageMaker installs
all of the Anaconda packages that are included in the installer. For more information, see Anaconda
package list. In addition, SageMaker installs the TensorFlow and Apache MXNet deep learning libraries.
• Attaches an ML storage volume—SageMaker attaches an ML storage volume to the ML compute
instance. You can use the volume as a working area to clean up the training dataset or to temporarily
store validation, test, or other data. Choose any size between 5 GB and 16,384 GB, in 1 GB increments,
for the volume. The default is 5 GB. ML storage volumes are encrypted, so SageMaker can't determine
the amount of available free space on the volume. Because of this, you can increase the volume size
when you update a notebook instance, but you can't decrease the volume size. If you want to decrease
the size of the ML storage volume in use, create a new notebook instance with the desired size.
Only files and data saved within the /home/ec2-user/SageMaker folder persist between notebook
instance sessions. Files and data that are saved outside this directory are overwritten when the
notebook instance stops and restarts. Each notebook instance's /tmp directory provides a minimum
of 10 GB of storage in an instance store. An instance store is temporary, block-level storage that isn't
persistent. When the instance is stopped or restarted, SageMaker deletes the directory's contents. This
temporary storage is part of the root volume of the notebook instance.
• Copies example Jupyter notebooks— These Python code examples illustrate model training and
hosting exercises using various algorithms and training datasets.
a. For Notebook instance name, type a name for your notebook instance.
b. For Notebook instance type, choose an instance size suitable for your use case. For a list of
supported instance types and quotas, see Amazon SageMaker Service Quotas.
c. For Elastic Inference, choose an inference accelerator type to associate with the notebook
instance if you plan to conduct inferences from the notebook instance, or choose none. For
information about elastic inference, see Use Amazon SageMaker Elastic Inference (EI) (p. 2628).
d. For Platform Identifier, choose a platform type to create the notebook instance on. This
platform type dictates the Operating System and the JupyterLab version that your notebook
instance is created with. For information about platform identifier type, see Amazon Linux 2 vs
Amazon Linux notebook instances (p. 205). For information about JupyterLab versions, see
JupyterLab versioning (p. 208).
e. (Optional) Additional configuration lets advanced users create a shell script that can run when
you create or start the instance. This script, called a lifecycle configuration script, can be used
to set the environment for the notebook or to perform other functions. For information, see
Customize a Notebook Instance Using a Lifecycle Configuration Script (p. 213).
f. (Optional) Additional configuration also lets you specify the size, in GB, of the ML storage
volume that is attached to the notebook instance. You can choose a size between 5 GB and
16,384 GB, in 1 GB increments. You can use the volume to clean up the training dataset or to
temporarily store validation or other data.
g. (Optional) For Minimum IMDS Version, select a version from the dropdown list. If this value
is set to v1, both versions can be used with the notebook instance. If v2 is selected, then only
IMDSv2 can be used with the notebook instance. For information about IMDSv2, see Use
IMDSv2.
Note
Starting October 31, 2022, the default minimum IMDS Version for SageMaker
notebook instances changes from IMDSv1 to IMDSv2.
Starting February 1, 2023, IMDSv1 is no longer available for new notebook instance creation. After this date, you can create notebook instances only with a minimum IMDS version of v2.
h. For IAM role, choose either an existing IAM role in your account that has the
necessary permissions to access SageMaker resources or choose Create a new
role. If you choose Create a new role, SageMaker creates an IAM role named
AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS. The AWS managed policy
AmazonSageMakerFullAccess is attached to the role. The role provides permissions that
allow the notebook instance to call SageMaker and Amazon S3.
i. For Root access, to enable root access for all notebook instance users, choose Enable. To disable
root access for users, choose Disable. If you enable root access, all notebook instance users have administrator privileges and can access and edit all files on the instance.
j. (Optional) Encryption key lets you encrypt data on the ML storage volume attached to the
notebook instance using an AWS Key Management Service (AWS KMS) key. If you plan to store
sensitive information on the ML storage volume, consider encrypting the information.
k. (Optional) Network lets you put your notebook instance inside a Virtual Private Cloud (VPC).
A VPC provides additional security and restricts access to resources in the VPC from sources
outside the VPC. For more information on VPCs, see Amazon VPC User Guide.
You can choose Open JupyterLab to open the JupyterLab dashboard. The dashboard provides
access to your notebook instance and sample SageMaker notebooks that contain complete code
walkthroughs. These walkthroughs show how to use SageMaker to perform common machine
learning tasks. For more information, see Example Notebooks (p. 220). For more information, see
Control root access to a SageMaker notebook instance (p. 3042).
For more information about Jupyter notebooks, see The Jupyter notebook.
Choose Notebook instances. The console displays a list of notebook instances in your account. To
open a notebook instance with a standard Jupyter interface, choose Open Jupyter for that instance.
To open a notebook instance with a JupyterLab interface, choose Open JupyterLab for that instance.
Use the Jupyter notebook dashboard to create and manage notebooks and to write code. For more
information about Jupyter notebooks, see http://jupyter.org/documentation.html.
You can update the tags of a notebook instance that is InService. To update any other attribute of a
notebook instance, its status must be Stopped.
When you do this, the notebook instance status changes to Stopping. Wait until the status changes
to Stopped to complete the following steps.
5. Select the Edit button to open the Edit notebook instance page. For information about the
notebook properties you can update, see Create a Notebook Instance (p. 209).
6. Update your notebook instance and select the Update notebook instance button at the bottom
of the page when you are done to return to the notebook instances page. Your notebook instance
status changes to Updating.
When the notebook instance update is complete, the status changes to Stopped.
You can also use a lifecycle configuration script to access AWS services from your notebook. For example,
you can create a script that lets you use your notebook to control other AWS resources, such as an
Amazon EMR instance.
We maintain a public repository of notebook lifecycle configuration scripts that address common use
cases for customizing notebook instances at https://github.com/aws-samples/amazon-sagemaker-
notebook-instance-lifecycle-configuration-samples.
Note
Each script has a limit of 16384 characters.
The value of the $PATH environment variable that is available to both scripts is /usr/local/
sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin. The working directory, which
is the value of the $PWD environment variable, is /.
View CloudWatch Logs for notebook instance lifecycle configurations in the log group /aws/sagemaker/NotebookInstances, in the log stream [notebook-instance-name]/[LifecycleConfigHook]. You can also retrieve these log events with the AWS CLI, as shown in the example following this note.
Scripts cannot run for longer than 5 minutes. If a script runs for longer than 5 minutes, it fails and the notebook instance is not created or started. To help decrease the run time of scripts, try the following:
• Cut down on the number of steps. For example, limit the conda environments in which you install large packages.
• Run tasks in parallel processes.
• Use the nohup command in your script so that long-running tasks continue after the script finishes.
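For example, the following AWS CLI sketch retrieves the lifecycle configuration log events for a notebook instance; the notebook instance name is a placeholder.
aws logs get-log-events \
    --log-group-name /aws/sagemaker/NotebookInstances \
    --log-stream-name <notebook-instance-name>/LifecycleConfigHook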
You can see a list of notebook instance lifecycle configurations you previously created by choosing
Lifecycle configuration in the SageMaker console. You can attach a notebook instance lifecycle
configuration when you create a new notebook instance. For more information about creating a
notebook instance, see Create a Notebook Instance (p. 209).
5. For Name, type a name using alphanumeric characters and "-", but no spaces. The name can have a
maximum of 63 characters.
6. (Optional) To create a script that runs when you create the notebook and every time you start it,
choose Start notebook.
7. In the Start notebook editor, type the script.
8. (Optional) To create a script that runs only once, when you create the notebook, choose Create
notebook.
9. In the Create notebook editor, type the script.
10. Choose Create configuration.
• Lifecycle configurations run as the root user. If your script makes any changes within the /home/ec2-
user/SageMaker directory, (for example, installing a package with pip), use the command sudo -u
ec2-user to run as the ec2-user user. This is the same user that Amazon SageMaker runs as.
• SageMaker notebook instances use conda environments to implement different kernels for Jupyter
notebooks. If you want to install packages that are available to one or more notebook kernels, enclose
the commands to install the packages with conda environment commands that activate the conda
environment that contains the kernel where you want to install the packages.
For example, if you want to install a package only for the python3 environment, use the following
code:
#!/bin/bash
sudo -u ec2-user -i <<EOF
# Activate the conda environment that backs the conda_python3 kernel.
source activate python3
# Replace myPackage with the name of the package you want to install.
pip install myPackage
# You can also perform "conda install" here as well.
source deactivate
EOF
If you want to install a package in all conda environments in the notebook instance, use the following
code:
#!/bin/bash
sudo -u ec2-user -i <<'EOF'
# Iterate over each conda environment on the instance.
for env in /home/ec2-user/anaconda3/envs/*; do
    env_name=$(basename "$env")
    # Installing packages in the Jupyter system environment can affect stability of your
    # SageMaker Notebook Instance. You can remove this check if you'd like to install
    # Jupyter extensions, etc.
    if [ "$env_name" = 'JupyterSystemEnv' ]; then
        continue
    fi
    source /home/ec2-user/anaconda3/bin/activate "$env_name"
    # Replace myPackage with the name of the package you want to install.
    pip install --upgrade --quiet myPackage
    # You can also perform "conda install" here as well.
    source /home/ec2-user/anaconda3/bin/deactivate
done
EOF
• You must store all conda environments in the default environments folder (/home/ec2-user/anaconda3/envs).
Important
When you create or change a script, we recommend that you use a text editor that provides
Unix-style line breaks, such as the text editor available in the console when you create a
notebook. Copying text from a non-Linux operating system might introduce incompatible line
breaks and result in an unexpected error.
The different Jupyter kernels in Amazon SageMaker notebook instances are separate conda
environments. For information about conda environments, see Managing environments in the Conda
documentation.
Install custom environments and kernels on the notebook instance's Amazon EBS volume. This ensures
that they persist when you stop and restart the notebook instance, and that any external libraries you
install are not updated by SageMaker. To do that, use a lifecycle configuration that includes both a script
that runs when you create the notebook instance (on-create) and a script that runs each time you
restart the notebook instance (on-start). For more information about using notebook instance lifecycle
configurations, see Customize a Notebook Instance Using a Lifecycle Configuration Script (p. 213).
There is a GitHub repository that contains sample lifecycle configuration scripts at SageMaker Notebook
Instance Lifecycle Config Samples.
The following package installation tools are supported:
• conda install
• pip install
For example scripts, see SageMaker Notebook Instance Lifecycle Config Samples. For more information
on lifecycle configuration, see Customize a Notebook Instance Using a Lifecycle Configuration Script.
• Notebooks – The following commands are supported.
• %conda install
• %pip install
• The Jupyter terminal – You can install packages using pip and conda directly.
From within a notebook you can use the system command syntax (lines starting with !) to install
packages, for example, !pip install and !conda install. More recently, new commands have been
added to IPython: %pip and %conda. These commands are the recommended way to install packages
from a notebook as they correctly take into account the active environment or interpreter being used.
For more information, see Add %pip and %conda magic functions.
Conda
Conda is an open-source package management system and environment management system that can install packages and their dependencies. SageMaker supports using Conda with either of the two main channels: the default channel and the conda-forge channel. For more information, see Conda channels. The conda-forge channel is a community channel where contributors can upload packages.
Note
Due to how Conda resolves the dependency graph, installing packages from conda-forge can
take significantly longer (in the worst cases, upwards of 10 minutes).
The Deep Learning AMI comes with many conda environments and many packages preinstalled. Due to
the number of packages preinstalled, finding a set of packages that are guaranteed to be compatible
is difficult. You may see a warning "The environment is inconsistent, please check the package plan
carefully". Despite this warning, SageMaker ensures that all the SageMaker provided environments are
correct. SageMaker cannot guarantee that any user installed packages will function correctly.
Note
Users of SageMaker, AWS Deep Learning AMI and Amazon EMR can access the commercial
Anaconda repository without taking a commercial license through February 1, 2024 when
using Anaconda in those services. For any usage outside of these three services, customers are
responsible for determining their own Anaconda license requirements.
Conda has two methods for activating environments: conda activate/deactivate, and source activate/
deactivate. For more information, see Should I use 'conda activate' or 'source activate' in Linux.
SageMaker supports moving Conda environments onto the Amazon EBS volume, which is persisted when
the instance is stopped. The environments aren't persisted when the environments are installed to the
root volume, which is the default behavior. For an example lifecycle script, see persistent-conda-ebs.
Pip
Pip is the de facto tool for installing and managing Python packages. Pip searches for packages on the Python Package Index (PyPI) by default. Unlike Conda, pip doesn't have built-in environment support, and it is not as thorough as Conda for packages with native or system library dependencies. You can use pip to install packages in Conda environments.
You can use alternative package repositories with pip instead of PyPI. For an example lifecycle script, see on-start.sh.
• Using pip to install a package without an active conda environment (install packages system wide)
• Using pip to install a package in a conda environment
• Using pip to install a package in all conda environments
• Changing the pip install location to use EBS
• Using an alternative repository to install packages with pip
Unsupported
SageMaker aims to support as many package installation operations as possible. However, if the
packages were installed by SageMaker or DLAMI, and you use the following operations on these
packages, it might make your notebook instance unstable:
• Uninstalling
• Downgrading
• Upgrading
We do not provide support for installing packages via yum install or installing R packages from CRAN. Due to potential issues with network conditions or configurations, or the availability of Conda or PyPI, we cannot guarantee that packages will install in a fixed or deterministic amount of time.
Note
We cannot guarantee that a package installation will be successful. Attempting to install a
package in an environment with incompatible dependencies can result in a failure. In such a
case you should contact the library maintainer to see if it is possible to update the package
dependencies. Alternatively you can attempt to modify the environment in such a way as to
allow the installation. This modification however will likely mean removing or updating existing
packages, which means we can no longer guarantee stability of this environment.
SageMaker periodically updates the software installed on notebook instances. These updates include the following:
• Kernel updates
• Security patches
• AWS SDK updates
• Amazon SageMaker Python SDK updates
• Open source software updates
To ensure that you have the most recent software updates, stop and restart your notebook instance, either in the SageMaker console or by calling the StopNotebookInstance and StartNotebookInstance API operations.
You can also manually update software installed on your notebook instance while it is running by using
update commands in a terminal or in a notebook.
Note
Updating kernels and some packages might depend on whether root access is enabled for the
notebook instance. For more information, see Control root access to a SageMaker notebook
instance (p. 3042).
You can check the Personal Health Dashboard or the security bulletin at Security Bulletins for updates.
The process requires three procedures using the Amazon SageMaker console:
• Create an Amazon EMR Spark instance that can be controlled from a notebook using Sparkmagic.
• Create a notebook that uses Sparkmagic to control the Amazon EMR Spark instance.
• Test the connection between the Amazon EMR instance and the notebook.
# OVERVIEW
# This script connects an Amazon EMR cluster to an Amazon SageMaker notebook instance that uses Sparkmagic.
#
# Note that this script will fail if the Amazon EMR cluster's master node IP address is not reachable.
#
# 1. Ensure that the EMR master node IP is resolvable from the notebook instance.
#    One way to accomplish this is to have the notebook instance and the Amazon EMR cluster in the same subnet.
# 2. Ensure the EMR master node security group provides inbound access from the notebook instance security group.
#    Type - Protocol - Port - Source
#    Custom TCP - TCP - 8998 - $NOTEBOOK_SECURITY_GROUP
# 3. Ensure the notebook instance has internet connectivity to fetch the SparkMagic example config.
#
# https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
# PARAMETERS
EMR_MASTER_IP=your.emr.master.ip
cd /home/ec2-user/.sparkmagic
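# The following lines are a sketch of the remaining setup: fetch the Sparkmagic
# example config and point it at the EMR master node. See the blog post linked
# above for the complete sample script.
curl -O https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json
sed -i -e "s/localhost/$EMR_MASTER_IP/g" example_config.json
mv example_config.json config.json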
6. In the PARAMETERS section of the script, replace your.emr.master.ip with the Master Public DNS
name for the Amazon EMR instance.
7. Choose Create configuration.
8. On the Create notebook page, choose Network - optional.
9. Choose the VPC and subnet where the Amazon EMR instance is located.
10. Choose the security group used by the Amazon EMR master node.
11. Choose Create notebook instance.
While the notebook instance is being created, the status is Pending. After the instance has been created
and the lifecycle configuration script has successfully run, the status is InService.
Note
If the notebook instance can't connect to the Amazon EMR instance, SageMaker can't create
the notebook instance. The connection can fail if the Amazon EMR instance and notebook are
not in the same VPC and subnet, if the Amazon EMR master security group is not used by the
notebook, or if the Master Public DNS name in the script is incorrect.
To test the connection between the Amazon EMR instance and the notebook
1. When the status of the notebook is InService, choose Open Jupyter to open the notebook.
2. Choose New, then choose Sparkmagic (PySpark).
3. In the code cell, enter %%info and then run the cell.
Example Notebooks
Your notebook instance contains example notebooks provided by Amazon SageMaker. The example
notebooks contain code that shows how to apply machine learning solutions by using SageMaker.
Notebook instances use the nbexamples Jupyter extension, which enables you to view a read-
only version of an example notebook or create a copy of it that you can modify and run. For more
information about the nbexamples extension, see https://github.com/danielballan/nbexamples.
For information about example notebooks for SageMaker Studio, see Use Amazon SageMaker Studio
Notebooks (p. 144).
Note
Example notebooks typically download datasets from the internet. If you disable SageMaker-
provided internet access when you create your notebook instance, example notebooks might
not work. For more information, see Connect a Notebook Instance in a VPC to External
Resources (p. 3211).
To view a read-only version of an example notebook in the Jupyter classic view, on the SageMaker
Examples tab, choose Preview for that notebook. To create a copy of an example notebook in the home
directory of your notebook instance, choose Use. In the dialog box, you can change the notebook's name
before saving it.
To view a read-only version of an example notebook, choose the name of the notebook. This opens the
notebook as a tab in the main area. To create a copy of an example notebook in the home directory of
your notebook instance, choose Create a Copy in the top banner. In the dialog box, type a name for the
notebook and then choose CREATE COPY.
For more information about the example notebooks, see the SageMaker examples GitHub repository.
You can also create a custom kernel that you can use in your notebook instance. For information, see
Install External Libraries and Kernels in Notebook Instances (p. 215).
• Persistence - Notebooks in a notebook instance are stored on durable Amazon EBS volumes, but they
do not persist beyond the life of your notebook instance. Storing notebooks in a Git repository enables
you to store and use notebooks even if you stop or delete your notebook instance.
• Collaboration - Peers on a team often work on machine learning projects together. Storing your
notebooks in Git repositories allows peers working in different notebook instances to share notebooks
and collaborate on them in a source-control environment.
• Learning - Many Jupyter notebooks that demonstrate machine learning techniques are available in
publicly hosted Git repositories, such as on GitHub. You can associate your notebook instance with a
repository to easily load Jupyter notebooks contained in that repository.
There are two ways to associate a Git repository with a notebook instance:
• Add a Git repository as a resource in your Amazon SageMaker account. Then, to access the repository,
you can specify an AWS Secrets Manager secret that contains credentials. That way, you can access
repositories that require authentication.
• Associate a public Git repository that is not a resource in your account. If you do this, you cannot
specify credentials to access the repository.
Topics
• Add a Git Repository to Your Amazon SageMaker Account (p. 222)
• Create a Notebook Instance with an Associated Git Repository (p. 225)
• Associate a CodeCommit Repository in a Different AWS Account with a Notebook Instance (p. 226)
• Use Git Repositories in a Notebook Instance (p. 227)
You can add Git repositories to your SageMaker account in the SageMaker console or by using the AWS
CLI.
Note
You can use the SageMaker API
CreateCodeRepository to add Git repositories to your SageMaker account, but step-by-step
instructions are not provided here.
a. To use an existing AWS Secrets Manager secret, choose Use existing secret, and then choose
a secret from the list. For information about creating and storing a secret, see Creating a Basic
Secret in the AWS Secrets Manager User Guide. The name of the secret you use must contain the
string sagemaker.
Note
The secret must have a staging label of AWSCURRENT and must be in the following format:
{"username": UserName, "password": Password}
For information about creating and storing a secret, see Creating a Basic Secret in the AWS Secrets Manager User Guide. The following command creates a new repository named MyRepository in your Amazon SageMaker account that points to a Git repository hosted at https://github.com/myprofile/my-repo. The sketch below uses placeholder values for the branch name and the secret ARN:
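aws sagemaker create-code-repository \
    --code-repository-name "MyRepository" \
    --git-config '{"RepositoryUrl":"https://github.com/myprofile/my-repo", "Branch":"main", "SecretArn":"<secret-arn>"}'
For Windows, escape the quotation marks in the JSON:
aws sagemaker create-code-repository ^
    --code-repository-name "MyRepository" ^
    --git-config "{\"RepositoryUrl\":\"https://github.com/myprofile/my-repo\", \"Branch\":\"main\", \"SecretArn\":\"<secret-arn>\"}"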
Note
The secret must have a staging label of AWSCURRENT and must be in the following format:
{"username": UserName, "password": Password}
Topics
• Create a Notebook Instance with an Associated Git Repository (Console) (p. 225)
• Create a Notebook Instance with an Associated Git Repository (CLI) (p. 225)
1. Follow the instructions at Step 1: Create an Amazon SageMaker Notebook Instance (p. 88).
2. For Git repositories, choose Git repositories to associate with the notebook instance.
a. For Default repository, choose a repository that you want to use as your default repository.
SageMaker clones this repository as a subdirectory in the Jupyter startup directory at /home/
ec2-user/SageMaker. When you open your notebook instance, it opens in this repository. To
choose a repository that is stored as a resource in your account, choose its name from the list.
To add a new repository as a resource in your account, choose Add a repository to SageMaker (opens the Add repository flow in a new window) and then follow the instructions at Add a Git Repository to Your Amazon SageMaker Account (p. 222). To clone a public
repository that is not stored in your account, choose Clone a public Git repository to this
notebook instance only, and then specify the URL for that repository.
b. For Additional repository 1, choose a repository that you want to add as an additional
directory. SageMaker clones this repository as a subdirectory in the Jupyter startup directory
at /home/ec2-user/SageMaker. To choose a repository that is stored as a resource in your
account, choose its name from the list. To add a new repository as a resource in your account,
choose Add a repository to SageMaker (opens the Add repository flow in a new window) and then follow the instructions at Add a Git Repository to Your Amazon SageMaker Account (p. 222). To clone a repository that is not stored in your account, choose Clone
a public Git repository to this notebook instance only, and then specify the URL for that
repository.
Repeat this step up to three times to add up to three additional repositories to your notebook
instance.
• Specify the repository that you want to use as your default repository as the value of the default-
code-repository argument. Amazon SageMaker clones this repository as a subdirectory in the
Jupyter startup directory at /home/ec2-user/SageMaker. When you open your notebook instance,
it opens in this repository. To use a repository that is stored as a resource in your SageMaker account,
specify the name of the repository as the value of the default-code-repository argument. To use
a repository that is not stored in your account, specify the URL of the repository as the value of the
default-code-repository argument.
• Specify up to three additional repositories as the value of the additional-code-repositories
argument. SageMaker clones this repository as a subdirectory in the Jupyter startup directory at /
home/ec2-user/SageMaker, and the repository is excluded from the default repository by adding
it to the .git/info/exclude directory of the default repository. To use repositories that are stored
as resources in your SageMaker account, specify the names of the repositories as the value of the
additional-code-repositories argument. To use repositories that are not stored in your
account, specify the URLs of the repositories as the value of the additional-code-repositories
argument.
For example, the following sketch of the command creates a notebook instance that has a repository named MyGitRepo, stored as a resource in your SageMaker account, as its default repository, and an additional repository that is hosted on GitHub; the role ARN is a placeholder:
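aws sagemaker create-notebook-instance \
    --notebook-instance-name MyNotebookInstance \
    --instance-type ml.t3.medium \
    --role-arn arn:aws:iam::<account-id>:role/<role-name> \
    --default-code-repository MyGitRepo \
    --additional-code-repositories "https://github.com/myprofile/my-repo"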
Note
If you use an AWS CodeCommit repository that does not contain "SageMaker" in its name, add
the codecommit:GitPull and codecommit:GitPush permissions to the role that you pass
as the role-arn argument to the create-notebook-instance command. For information
about how to add permissions to a role, see Adding and Removing IAM Policies in the AWS
Identity and Access Management User Guide.
To set up cross-account access for a CodeCommit repository and associate it with a notebook
instance:
1. In the AWS account that contains the CodeCommit repository, create an IAM policy that allows
access to the repository from users in the account that contains your notebook instance. For
information, see Step 1: Create a Policy for Repository Access in AccountA in the CodeCommit User
Guide.
2. In the AWS account that contains the CodeCommit repository, create an IAM role, and attach the
policy that you created in the previous step to that role. For information, see Step 2: Create a Role
for Repository Access in AccountA in the CodeCommit User Guide.
3. Create a profile in the notebook instance that uses the role that you created in the previous step:
vi /home/ec2-user/.aws/config
[profile CrossAccountAccessProfile]
region = us-west-2
role_arn = arn:aws:iam::CodeCommitAccount:role/CrossAccountRepositoryContributorRole
credential_source=Ec2InstanceMetadata
output = json
4. Configure the Git credential helper to use the profile that you created:
vi /home/ec2-user/.gitconfig
[credential]
helper = !aws codecommit credential-helper --profile CrossAccountAccessProfile $@
UseHttpPath = true
Where CrossAccountAccessProfile is the name of the profile that you created in the
previous step.
To open any of the additional repositories, navigate up one folder. The additional repositories are also
installed as directories under /home/ec2-user/SageMaker.
If you open the notebook instance with a JupyterLab interface, the jupyter-git extension is installed and
available to use. For information about the jupyter-git extension for JupyterLab, see https://github.com/
jupyterlab/jupyterlab-git.
When you open a notebook instance in JupyterLab, the Git repositories associated with it appear in the left menu.
You can use the jupyter-git extension to manage Git visually, instead of using the command line.
Notebook Instance Metadata
When you create a notebook instance, SageMaker creates the file /opt/ml/metadata/resource-metadata.json on the instance, which contains the instance's Amazon Resource Name (ARN) and name:
{
    "ResourceArn": "NotebookInstanceArn",
    "ResourceName": "NotebookInstanceName"
}
You can use this metadata from within the notebook instance to get other information about the
notebook instance. For example, the following commands get the tags associated with the notebook
instance:
NOTEBOOK_ARN=$(jq '.ResourceArn' /opt/ml/metadata/resource-metadata.json --raw-output)
aws sagemaker list-tags --resource-arn $NOTEBOOK_ARN
{
"Tags": [
{
"Key": "test",
"Value": "true"
}
]
}
1. Sign in to the AWS Management Console and open the SageMaker console at https://
console.aws.amazon.com/sagemaker/.
2. Choose Notebook instances.
3. In the list of notebook instances, choose the notebook instance for which you want to view Jupyter
logs by selecting the Notebook instance Name.
This will bring you to the details page for that notebook instance.
4. Under Monitor on the notebook instance details page, choose View logs.
5. In the CloudWatch console, choose the log stream for your notebook instance. Its name is in the
form NotebookInstanceName/jupyter.log.
For more information about monitoring CloudWatch logs for SageMaker, see Log Amazon SageMaker
Events with Amazon CloudWatch (p. 3284).
With Studio Lab, you can use AWS compute resources to create and run your Jupyter notebooks without
signing up for an AWS account. Because Studio Lab is based on open-source JupyterLab, you can take
advantage of open-source Jupyter extensions to run your Jupyter notebooks.
While Studio Lab provides free access to AWS compute resources, Amazon SageMaker Studio provides advanced machine learning capabilities that Studio Lab does not support.
Studio also supports fine-grained access control and security by using AWS Identity and Access
Management (IAM), Amazon Virtual Private Cloud (Amazon VPC), and AWS Key Management Service
(AWS KMS). Studio Lab does not support these Studio features, nor does it support the use of estimators
and built-in SageMaker algorithms.
To export your Studio Lab projects for use with Studio, see Export an Amazon SageMaker Studio Lab
environment to Amazon SageMaker Studio (p. 251).
The following topics give information about Studio Lab and how to use it.
Topics
• Amazon SageMaker Studio Lab components overview (p. 231)
• Onboard to Amazon SageMaker Studio Lab (p. 234)
• Manage your account (p. 235)
• Launch your Amazon SageMaker Studio Lab project runtime (p. 236)
• Use Amazon SageMaker Studio Lab starter assets (p. 237)
• Use the Amazon SageMaker Studio Lab project runtime (p. 239)
• Troubleshooting (p. 256)
Topics
• Landing page (p. 231)
• Studio Lab account (p. 231)
• Project overview page (p. 231)
• Preview page (p. 232)
• Project (p. 232)
• Compute instance type (p. 233)
• Project runtime (p. 234)
• Session (p. 234)
Landing page
You can request an account and sign in to an existing account on your landing page. To navigate to the
landing page, see the Amazon SageMaker Studio Lab website. For more information about creating a
Studio Lab account, see Onboard to Amazon SageMaker Studio Lab (p. 234).
The following screenshot shows the Studio Lab landing page interface for requesting a user account and
signing in.
You can navigate to your project overview page by using a URL of the following format.
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
The following screenshot shows a project overview in the Studio Lab user interface.
Preview page
On this page, you can access a read-only preview of a Jupyter notebook. You cannot execute the notebook from the preview, but you can copy it into your project. For many customers, this may be the first Studio Lab page that they see, because they may be opening a notebook directly from GitHub. For more information about how to use GitHub resources, see Use GitHub resources (p. 248).
1. Sign in to your Studio Lab account. For more information about creating a Studio Lab account, see
Onboard to Amazon SageMaker Studio Lab (p. 234).
2. Under Notebook compute instance, choose a compute instance type. For more information about
compute instance types, see Compute instance type (p. 233).
3. Choose Start runtime. You might be asked to solve a CAPTCHA puzzle. For more information on
CAPTCHA, see What is a CAPTCHA puzzle?
4. The first time that you start the runtime with your Studio Lab account, complete the following one-time setup:
a. Enter a mobile phone number to associate with your Amazon SageMaker Studio Lab account
and choose Continue.
For information on supported countries and regions, see Supported countries and regions (SMS
channel).
b. Enter the 6-digit code sent to the associated mobile phone number and choose Verify.
5. Choose Copy to project.
Project
Your project contains all of your files and folders, including your Jupyter notebooks. You have full control
over the files in your project. Your project also includes the JupyterLab-based user interface. From this
interface, you can interact with your Jupyter notebooks, edit your source code files, integrate with
GitHub, and connect to Amazon S3. For more information, see Use the Amazon SageMaker Studio Lab
project runtime (p. 239).
The following screenshot shows a Studio Lab project with the file browser open and the Studio Lab
Launcher displayed.
Amazon SageMaker Studio Lab offers the choice of a CPU (central processing unit) or a GPU (graphics processing unit). The following sections give information about these two options, including selection guidance.
CPU
A central processing unit (CPU) is designed to handle a wide range of tasks efficiently, but is limited in how many tasks it can run concurrently. For machine learning, a CPU is recommended for compute-intensive algorithms, such as time series forecasting, and for tabular data.
GPU
A graphics processing unit (GPU) is designed to render high-resolution images and video concurrently. A
GPU is recommended for deep learning tasks, especially for transformers and computer vision.
Compute time
When compute time for Studio Lab reaches its time limit, the instance stops all running computations.
Studio Lab does not support time limit increases.
Studio Lab automatically saves your environment when you update your environment and every time
you create a new file. Custom-installed extensions and packages persist even after your runtime has
ended.
File edits are periodically saved, but are not saved when your runtime ends. To ensure that you do not
lose your progress, save your work manually. If you have content in your Studio Lab project that you
don’t want to lose, we recommend that you back up your content elsewhere. For more information about
exporting your environment and files, see Export an Amazon SageMaker Studio Lab environment to
Amazon SageMaker Studio (p. 251).
During long computation, you do not need to keep your project open. For example, you can start training
a model, then close your browser. The instance keeps running for up to 12 hours on CPU instances and 4
hours on GPU instances. You can then sign in later to continue your work.
We recommend that you use checkpointing in your deep learning jobs. You can use saved checkpoints to
restart a job from the previously saved checkpoint. For more information, see File I/O.
Project runtime
The project runtime is the period of time when your compute instance is running.
Session
A user session begins every time you launch your project.
Topics
• Request a Studio Lab account (p. 234)
• Create a Studio Lab account (p. 235)
• Sign in to Studio Lab (p. 235)
Your account request must be approved before you can register for a Studio Lab account. Your request
will be reviewed within five business days. When your account request is approved, you receive an email
with a link to the Studio Lab account registration page. This link expires seven days after your request is
approved. If the link expires, you must submit a new account request.
Note: Your account request is denied if your email has been associated with activity that violates our
Terms of Service or other agreements.
Referral codes
Studio Lab referral codes enable new account requests to be automatically approved to support machine
learning events like workshops, hackathons, and classes. With a referral code, a trusted host can get their
participants immediate access to Studio Lab. After an account has been created using a referral code, the
account continues to exist after the expiration of the code.
To get a referral code, contact Sales Support. To use a referral code, enter the code as part of the account
request form.
Create a Studio Lab account
1. Select Create account in the account request approval email to open a new page.
2. From the new page, enter your Email, a Password, and a Username.
3. Select Create account.
You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see What is a
CAPTCHA puzzle?
To change your password
1. Navigate to the Studio Lab project overview page. The URL takes the following format.
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
2. From the top-right corner, select your user name to open a dropdown menu.
3. From the dropdown menu, select Change password to open a new page.
4. Enter your current password into the Enter your current password field.
5. Enter your new password into the Create a new password and Confirm your new password fields.
6. Select Submit.
To delete your Studio Lab account
1. Navigate to the Studio Lab project overview page. The URL takes the following format.
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
2. From the top-right corner, select your user name to open a dropdown menu.
3. From the dropdown menu, select Delete account to open a new page.
4. Enter your password to confirm the deletion of your Studio Lab account.
5. Select Delete.
Customer information
Studio Lab collects your email address, user name, encrypted password, project files, and metadata.
When requesting an account, you can optionally choose to provide your first and last name, country,
organization name, occupation, and the reason for your interest in this product. We protect all customer
personal data with encryption. For more information about how your personal information is handled,
see the Privacy Notice.
When you delete your account, all of your information is deleted immediately. If you have an inquiry
about this, submit the Amazon SageMaker Studio Lab Form. For information and support related to AWS
compliance, see Compliance support.
The following topics give information about how to manage your project runtime. These topics require
that you sign in to your Amazon SageMaker Studio Lab account. For more information about signing in,
see Sign in to Studio Lab (p. 235). For more information about your project, see Amazon SageMaker
Studio Lab components overview (p. 231).
Topics
• Start your project runtime (p. 236)
• Stop your project runtime (p. 237)
• View remaining compute time (p. 237)
• Change your compute type (p. 237)
Start your project runtime
1. Navigate to the Studio Lab project overview page. The URL takes the following format.
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
2. Under My Project, select a compute type. For more information about compute types, see Compute
instance type (p. 233).
3. Choose Start runtime. You might be asked to solve a CAPTCHA puzzle. For more information on
CAPTCHA, see What is a CAPTCHA puzzle?
4. One-time setup, required the first time you start the runtime with your Studio Lab account:
a. Enter a mobile phone number to associate with your Amazon SageMaker Studio Lab account
and choose Continue.
For information on supported countries and regions, see Supported countries and regions (SMS
channel).
b. Enter the 6-digit code sent to the associated mobile phone number and choose Verify.
5. After the runtime is running, select Open project to open the project runtime environment in a new
browser tab.
Use Studio Lab starter assets
Studio Lab comes with a starter notebook that gives general information and guides you through key
workflows. When you launch your project runtime for the first time, this notebook automatically opens.
Dive into Deep Learning (D2L) is an interactive, open-source book that teaches the ideas, mathematical
theory, and code that power machine learning. With over 150 Jupyter notebooks, D2L provides a
comprehensive overview of deep learning principles. For more information about D2L, see the D2L
website.
The following procedure shows how to clone the D2L Jupyter notebooks to your instance.
1. Start and open the Studio Lab project runtime environment by following Start your project
runtime (p. 236).
2. Once Studio Lab is open, choose the Git tab on the left sidebar.
3. Choose Clone a Repository. Under Git repository URL (.git), paste the D2L Git repository URL
by following the steps below. If you do not see the Clone a Repository option because you are
currently in a Git repository, return to the user directory to clone a new repository. You return to the
user directory by choosing the Folder tab on the left sidebar. In the Folder tab, beneath the
file search bar, choose the folder icon to the left of the currently open repository. Once you are in the
user directory, choose the Git tab on the left sidebar and choose Clone a Repository.
4. Navigate to the Studio Lab project overview page. The URL takes the following format.
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
The AWS Machine Learning University (MLU) provides access to the machine learning courses used to
train Amazon’s own developers. With AWS MLU, any developer can learn how to use machine learning
with the learn-at-your-own-pace MLU Accelerator learning series. The MLU Accelerator series is designed
to help developers begin their ML journey. It offers three-day foundational courses on these three
subjects: Natural Language Processing, Tabular Data, and Computer Vision. For more information,
see Machine Learning University.
The following procedure shows how to clone the AWS MLU Jupyter notebooks to your instance.
1. Start and open the Studio Lab project runtime environment by following Start your project
runtime (p. 236).
2. Once Studio Lab is open, choose the Git tab on the left sidebar.
3. Choose Clone a Repository. Under Git repository URL (.git) paste the MLU git repository URL
by following the steps below. If you do not see the Clone a Repository option because you are
currently in a Git repository, return to the user directory to clone a new repository. You return to the
user directory by choosing the Folder tab on the left sidebar. In the Folder tab, beneath the
file search bar, choose the folder icon to the left of the currently open repository. Once you are in the
user directory, choose the Git tab on the left sidebar and choose Clone a Repository.
4. Navigate to the Studio Lab project overview page. The URL takes the following format.
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
Roboflow
Roboflow gives you the tools to train, fine-tune, and label objects for computer vision applications. For
more information, see https://roboflow.com/.
The following procedure shows how to clone the Roboflow Jupyter notebooks to your instance.
1. Navigate to the Studio Lab project overview page. The URL takes the following format.
https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>
Use the Amazon SageMaker Studio Lab project runtime
Topics
• Amazon SageMaker Studio Lab UI overview (p. 240)
• Create or open an Amazon SageMaker Studio Lab notebook (p. 241)
• Use the Amazon SageMaker Studio Lab notebook toolbar (p. 242)
• Manage your environment (p. 244)
• Use external resources in Amazon SageMaker Studio Lab (p. 248)
• Get notebook differences (p. 251)
• Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio (p. 251)
• Shut down resources (p. 255)
Amazon SageMaker Studio Lab UI overview
The following image shows Studio Lab with the file browser open and the Studio Lab Launcher
displayed.
You will find the menu bar at the top of the screen. The left sidebar contains icons to open file browsers,
resource browsers, and tools. The status bar is located at the bottom-left corner of Studio Lab.
The main work area is divided horizontally into two panes. The left pane is the file and resource browser.
The right pane contains one or more tabs for resources, such as notebooks and terminals.
Topics
• Left sidebar (p. 240)
• File and resource browser (p. 241)
• Main work area (p. 241)
Left sidebar
The left sidebar includes the following icons. When you hover over an icon, a tooltip displays the icon
name. When you choose an icon, the file and resource browser displays the described functionality.
For hierarchical entries, a selectable breadcrumb at the top of the browser shows your location in the
hierarchy.
File Browser
To have adjacent files open, choose a tab that contains a notebook, Python, or text file, and then
choose New View for File. Choose the plus (+) sign on the menu at the top of the file browser to open
the Studio Lab Launcher.
Running Terminals and Kernels
You can see a list of all of the running terminals and kernels in your project. For more information, see
Shut down resources (p. 255).
Git
You can connect to a Git repository and then access a full range of Git tools and operations. For more
information, see Use external resources in Amazon SageMaker Studio Lab (p. 248).
Table of Contents
You can access the Table of Contents for your current Jupyter notebook.
Extension Manager
You can enable and manage JupyterLab extensions.
Create or open an Amazon SageMaker Studio Lab notebook
For information about shutting down the notebook, see Shut down resources (p. 255).
Topics
• Open a Studio Lab notebook (p. 241)
• Create a notebook from the file menu (p. 242)
• Create a notebook from the Launcher (p. 242)
Open a Studio Lab notebook
To open a notebook
1. In the left sidebar, choose the File Browser icon to display the file browser.
2. Browse to a notebook file and double-click it to open the notebook in a new tab.
Create a notebook from the file menu
1. From the Studio Lab menu, choose File, choose New, and then choose Notebook.
2. To use the default kernel, in the Select Kernel dialog box, choose Select. Otherwise, to select a
different kernel, use the dropdown menu.
Create a notebook from the Launcher
1. Open the Launcher from the left sidebar: Choose the File Browser icon, and then choose the plus (+)
icon.
2. To use the default kernel from the Launcher, under Notebook, choose default:Python. Otherwise,
select a different kernel.
After you choose the kernel, your notebook launches and opens in a new Studio Lab tab.
To view the notebook's kernel session, in the left sidebar, choose the Running Terminals and Kernels
icon. You can stop the notebook's kernel session from this view.
Use the Amazon SageMaker Studio Lab notebook toolbar
The following image shows the toolbar and an empty cell from a Studio Lab notebook.
When you hover over a toolbar icon, a tooltip displays the icon function. You can find additional
notebook commands in the Studio Lab main menu. The toolbar includes the following icons:
Insert cell
Inserts a code cell below the current cell. The current cell is noted by the blue vertical marker in the
left margin.
Run cells
Runs the selected cells. The cell that follows the last-selected cell becomes the newly selected cell.
Interrupt kernel
Interrupts the kernel and cancels the currently running operation.
Restart kernel
Restarts the kernel. Variables are reset. Unsaved information is not affected.
Restart kernel and run all cells
Restarts the kernel. Variables are reset. Unsaved information is not affected. Then reruns the entire
notebook.
Cell type
Displays or changes the current cell type. The cell types are: Code, Markdown, and Raw.
Checkpoint diff
Opens a new tab that displays the difference between the notebook and the checkpoint file. For more
information, see Get notebook differences (p. 251).
Git diff
Opens a new tab that displays the difference between the notebook and the last Git commit. For more
information, see Get notebook differences (p. 251).
default Kernel
Displays or changes the kernel that processes the cells in the notebook.
Kernel status
Displays a kernel's busy status by showing the circle's edge and its interior as the same color. The
kernel is busy when it is starting and when it is processing cells. Additional kernel states are displayed
in the status bar at the bottom-left corner of Studio Lab.
Manage your environment
Your Studio Lab environment comes with a base image installed that includes key packages and
resources. You can customize your environment by adding new packages and libraries to it. You can also
create new environments from Studio Lab, import compatible environments, reset your environment to
create space, and more.
The commands on this page are intended to be run in a Studio Lab terminal. If you want to run these
commands in a Studio Lab Jupyter notebook, prefix the command with a % before running the cell. For
example, the code snippet pip list in a terminal is the same as %pip list in a Jupyter notebook.
Topics
• Base image (p. 244)
• Managing conda environments (p. 245)
Base image
The default Amazon SageMaker Studio Lab base image includes the following packages.
• Python 3.9
• bzip2
• build-essential
• curl
• git
• libgl1-mesa-glx
• nano
• rsync
• unzip
• wget
• ca-certificates
• pip
• ipykernel-6.4
Machine learning frameworks simplify machine learning by abstracting complex algorithms and
processes. This abstraction helps you get started with machine learning. Libraries are collections of
files, programs, and other resources that you can use in your code. Studio Lab supports the following
frameworks and libraries, which you must install manually.
• PyTorch 1.9
• TensorFlow 1.15 and 2.6
• MXNet 1.8
• Hugging Face
• AutoGluon 0.3.1
• Scikit-learn 0.24
• PyTorch ecosystem
• OpenCV
• scipy
• numpy
For a list of all of the packages currently installed in your environment, run the following command from
a Studio Lab terminal (or prefix it with % in a Jupyter notebook).

pip list
View environments
To view the environments in Studio Lab, you can use a terminal or Jupyter notebook. The following
command is for a Studio Lab terminal. If you want to run the corresponding commands in a Jupyter
notebook, see Manage your environment (p. 244).
Open the Studio Lab terminal by opening the File Browser panel, choose the plus (+) sign on the
menu at the top of the file browser to open the Launcher, then choose Terminal. From the Studio Lab
terminal, list the conda environments by running the following.

conda env list
This command outputs a list of the conda environments and their locations in the file system. When you
onboard to Studio Lab, you automatically activate the studiolab conda environment. The following is
an example of listed environments after you onboard.
# conda environments:
#
default                  /home/studio-lab-user/.conda/envs/default
studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
studiolab-safemode       /opt/amazon/sagemaker/safemode-home/.conda/envs/studiolab-safemode
base                     /opt/conda
Create, activate, and use new conda environments
If you would like to maintain multiple environments for different use cases, you can create new conda
environments in your project. The following sections show how to create and activate new conda
environments. For a Jupyter notebook that shows how to create a custom environment, see Setting up a
Custom Environment in SageMaker Studio Lab.
Note
Maintaining multiple environments counts against your available Studio Lab memory.
To create a conda environment, run the following conda command from your terminal. This example
creates a new environment with Python 3.9.
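A typical command, where <ENVIRONMENT_NAME> is a placeholder for the name you choose:

conda create --name <ENVIRONMENT_NAME> python=3.9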
Once the conda environment is created, you can view the environment in your environment list. For more
information on how to view your environment list, see View environments (p. 245).
To activate any conda environment, run the following command in the terminal.
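For example, where <ENVIRONMENT_NAME> is the environment to activate:

conda activate <ENVIRONMENT_NAME>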
When you run this command, any packages installed using conda or pip are installed in the environment.
For more information on installing packages, see Customize your environment (p. 247).
To use your new conda environments with notebooks, make sure the ipykernel package is installed in
the environment.
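One way to do this, assuming the environment is currently activated:

conda install ipykernel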
Once the ipykernel package is installed in the environment, you can select the environment as the
kernel for your notebook.
You may need to restart JupyterLab to see the environment available as a kernel. This can be done
by choosing Amazon SageMaker Studio Lab in the top menu of Studio Lab and choosing Restart
JupyterLab....
When you create a new notebook from the Studio Lab Launcher, you will have the option to choose the
kernel under Notebook. For an overview of the Studio Lab UI, see Amazon SageMaker Studio Lab UI
overview (p. 240).
When a Jupyter notebook is open, you can choose the kernel by choosing Kernel from the top menu and
choose Change Kernel....
Studio Lab provides sample custom environments through the SageMaker Studio Lab Examples
repository. The following shows how to clone and build these environments.
246
Amazon SageMaker Developer Guide
Use the Studio Lab project runtime
1. Clone the SageMaker Studio Lab Examples GitHub repository by following the instructions in Use
GitHub resources (p. 248).
2. In Studio Lab, choose the File Browser icon on the left menu, so that the File Browser panel
shows on the left.
3. Navigate to the studio-lab-examples/custom-environments directory in the File Browser.
4. Open the directory for the environment that you want to build.
5. Right-click the .yml file in the folder, then select Build conda Environment.
6. After your conda environment has finished building, you can use it as a kernel. For instructions
on how to use an existing environment as a kernel, see Create, activate, and use new conda
environments (p. 246).
Customize your environment
You can customize your environment by installing and removing extensions and packages, as needed.
Any installed extensions and packages persist in your project, so you do not need to install your packages
every time you work on your project.
Note
Installed packages count against your available Studio Lab memory.
To activate your environment, see Create, activate, and use new conda environments (p. 246).
Install packages
To install additional packages to your environment, run one of the following commands in a Studio
Lab terminal. These commands install packages in the currently activated environment. Any packages
that you install are saved in your persistent project directory.
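Typical commands, where <PACKAGE_NAME> is a placeholder for the package to install:

conda install <PACKAGE_NAME>

pip install <PACKAGE_NAME>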
We don't recommend using the !pip or !conda commands because they can behave in unexpected
ways when you have multiple environments.
After you install new packages to your environment, restart the kernel to ensure that the packages work
in your notebook. This can be done by choosing Amazon SageMaker Studio Lab in the top menu of
Studio Lab and choosing Restart JupyterLab....
Remove packages
To remove a package from the currently activated environment, run the following command.

conda remove <PACKAGE_NAME>
This command will also remove any package that depends on <PACKAGE_NAME>, unless a replacement
can be found without that dependency.
To remove an entire conda environment, run the following command.

conda deactivate && conda env remove --name <ENVIRONMENT_NAME>

To free up space, you can also delete files from your project. For example, the following command
removes all files with a file extension from the current directory.

rm -rf *.*
Use external resources in Amazon SageMaker Studio Lab
Topics
• Use GitHub resources (p. 248)
• Add an Open in Studio Lab button to your notebook (p. 250)
• Import files from your computer (p. 250)
• Connect to Amazon S3 (p. 250)
Use GitHub resources
The following topics give information about how to use GitHub resources with Studio Lab.
To get started with a repository of sample notebooks tailored for Studio Lab, see Studio Lab Sample
Notebooks.
This repository provides notebooks for the following use cases and others.
• Computer vision
• Connecting to AWS
• Creating custom environments
• Geospatial data analysis
• Natural language processing
• Using R
To clone a GitHub repo to your Studio Lab project, follow these steps.
1. Start your Studio Lab project runtime. For more information on launching Studio Lab project
runtime, see Start your project runtime (p. 236).
2. In Studio Lab, choose the File Browser icon on the left menu, so that the File Browser panel
shows on the left.
3. Navigate to your user directory by choosing the file icon beneath the file search bar.
4. Select the Git icon from the left menu to open a new dropdown menu.
5. Choose Clone a Repository.
6. Paste the repository's URL under Git repository URL (.git).
7. Select Clone.
To open a notebook in Studio Lab, you must have access to the repo that the notebook is in. The
following examples describe Studio Lab permission-related behavior in various situations.
• If a repo is public, you can automatically clone the notebook into your project from the Studio Lab
preview page.
• If a repo is private, you are prompted to sign in to GitHub from the Studio Lab preview page. If you
have access to a private repo, you can clone the notebook into your project.
• If you don't have access to a private repo, you cannot clone the notebook from the Studio Lab preview
page.
The following sections show two options for you to copy a GitHub notebook in your Studio Lab project.
These options depend on whether the notebook has an Open in Studio Lab button.
The following procedure shows how to copy a notebook that has an Open in Studio Lab button.
If you want to add this button to your notebook, see Add an Open in Studio Lab button to your
notebook (p. 250).
1. Sign in to Studio Lab following the steps in Sign in to Studio Lab (p. 235).
2. In a new browser tab, navigate to the GitHub notebook that you want to clone.
3. In the notebook, select the Open in Studio Lab button to open a new page in Studio Lab with a
preview of the notebook.
4. If your project runtime is not already running, start it by choosing the Start runtime button at the
top of the preview page. Wait for the runtime to start before proceeding to the next step.
5. After your project runtime has started, select Copy to project to open your project runtime in a new
browser tab.
6. In the Copy from GitHub? dialog box, select Copy notebook only. This copies the notebook file to
your project.
The following procedure shows how to copy any notebook from GitHub.
1. Sign in to Studio Lab following the steps in Sign in to Studio Lab (p. 235).
2. In a new browser tab, navigate to the GitHub notebook that you want to copy, and modify its URL as
in the following example.

# Original URL
https://github.com/<PATH_TO_NOTEBOOK>

# Modified URL
https://studiolab.sagemaker.aws/import/github/<PATH_TO_NOTEBOOK>
3. Navigate to the modified URL. This opens a preview of the notebook in Studio Lab.
4. If your project runtime is not already running, start it by choosing the Start runtime button at the
top of the preview page. Wait for the runtime to start before proceeding to the next step.
5. After your project runtime has started, select Copy to project to open your project runtime in a new
browser tab.
6. In the Copy from GitHub? dialog box, select Copy notebook only to copy the notebook file to your
project.
Add an Open in Studio Lab button to your notebook
To add a functional Open in Studio Lab button to your Jupyter notebook or repository, add the
following markdown to the top of your notebook or repository.
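The badge markdown follows this pattern, where <PATH_TO_NOTEBOOK> is the notebook's path on
GitHub, as in the import URL described earlier (the badge image URL shown here is an assumption):

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/<PATH_TO_NOTEBOOK>)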
Import files from your computer
You can drag and drop files from your computer into the File Browser panel.
Connect to Amazon S3
The AWS CLI enables AWS integration in your Studio Lab project. With this integration, you can pull
resources from Amazon S3 to use with your Jupyter notebooks.
To use AWS CLI with Studio Lab, complete the following steps. For a notebook that outlines this
integration, see Using Studio Lab with AWS Resources.
1. Install the AWS CLI following the steps in Installing or updating the latest version of the AWS CLI.
2. Configure your AWS credentials by following the steps in Quick setup. The role for your AWS
account must have permissions to access the Amazon S3 bucket that you are copying data from.
3. From your Jupyter notebook, clone resources from the Amazon S3 bucket, as needed. The following
command shows how to clone all resources from an Amazon S3 path to your project. For more
information, see the AWS CLI Command Reference.
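A minimal example, where the bucket name and paths are placeholders for your own:

aws s3 cp s3://<BUCKET_NAME>/<PREFIX> <LOCAL_DIRECTORY> --recursive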
Get notebook differences
Topics
• Get the difference from the last checkpoint (p. 251)
• Get the difference from the last commit (p. 251)
Get the difference from the last checkpoint
To save the Studio Lab notebook and update the checkpoint file to match: Choose the Save notebook
and create checkpoint icon. This is located on the Studio Lab menu's left side. The keyboard shortcut
for Save notebook and create checkpoint is Ctrl + s.
To view changes between the Studio Lab notebook and the checkpoint file: Choose the Checkpoint diff
icon, located in the center of the Studio Lab menu.
To revert the Studio Lab notebook to the checkpoint file: On the main Studio Lab menu, choose File, and
then Revert Notebook to Checkpoint.
Get the difference from the last commit
To view the changes in the notebook from the last Git commit: Choose the Git diff icon in the center of
the notebook menu.
Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio
You can export your Studio Lab environment to Studio to take advantage of more compute capacity,
storage, and features. However, you may want to familiarize yourself with Studio's prebuilt containers,
which are optimized for the full MLOps pipeline. For more information, see Amazon SageMaker Studio
Lab (p. 230).
To migrate your Studio Lab environment to Studio, you must first onboard to Studio following the steps
in Onboard to Amazon SageMaker Domain (p. 37).
Topics
• Step 1: Export your Studio Lab conda environment (p. 252)
• Step 2: Save your Studio Lab artifacts (p. 253)
• Step 3: Import your Studio Lab artifacts to Studio (p. 254)
• Step 4: Install your Studio Lab conda environments in Studio (p. 255)
Step 1: Export your Studio Lab conda environment
1. Open the Studio Lab terminal by opening the File Browser panel, choose the plus (+) sign
on the menu at the top of the file browser to open the Launcher, then choose Terminal. From the
Studio Lab terminal, list the conda environments by running the following.

conda env list

This command outputs a list of the conda environments and their locations in the file system. When
you onboard to Studio Lab, you automatically activate the studiolab conda environment.
# conda environments:
#
default                  /home/studio-lab-user/.conda/envs/default
studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
studiolab-safemode       /opt/amazon/sagemaker/safemode-home/.conda/envs/studiolab-safemode
base                     /opt/conda
We recommend that you do not export the studiolab, studiolab-safemode, and base
environments. These environments are not usable in Studio for the following reasons:
• studiolab: This sets up the JupyterLab environment for Studio Lab. Studio Lab runs a different
major version of JupyterLab than Studio, so it is not usable in Studio.
• studiolab-safemode: This also sets up the JupyterLab environment for Studio Lab. Studio Lab
runs a different major version of JupyterLab than Studio, so it is not usable in Studio.
• base: This environment comes with conda by default. The base environment in Studio Lab and
the base environment in Studio have incompatible versions of many packages.
2. For the conda environment that you want to migrate to Studio, first activate the conda
environment. The default environment changes whenever new libraries are installed or removed
from it, so to capture the exact state of the environment, export it into a YAML file using the
command line. The following command lines export the default environment into a YAML file,
creating a file called myenv.yml.
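A typical sequence, assuming you are exporting the default environment:

conda activate default
conda env export > ~/myenv.yml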
Step 2: Save your Studio Lab artifacts
One option is to save the environment onto your local machine. To do this, use the following
procedure.
1. In Studio Lab, choose the File Browser icon on the left menu, so that the File Browser panel
shows on the left.
2. Navigate to your user directory by choosing the file icon beneath the file search bar.
3. Choose (right-click) the myenv.yml file and then choose Download. You can repeat this process
for other files you want to import to Studio.
Another option is to save your environment to a Git repository. This option uses GitHub as an
example. These steps require a GitHub account and repository. For more information, visit GitHub.
The following procedure shows how to synchronize your content with GitHub using the Studio Lab
terminal.
1. From the Studio Lab terminal, navigate to your user directory and make a new directory to
contain the files you want to export.
cd ~
mkdir <NEW_DIRECTORY_NAME>
2. After you create a new directory, copy any file or directory you want to export to
<NEW_DIRECTORY_NAME>.
cp <FILE_NAME> <NEW_DIRECTORY_NAME>
cp -r <DIRECTORY_NAME> <NEW_DIRECTORY_NAME>
For example, replace <DIRECTORY_NAME> with any directory name in your user directory.
3. Navigate to the new directory and initialize the directory as a Git repository using the following
command. For more information, see the git-init documentation.
cd <NEW_DIRECTORY_NAME>
git init
4. Using Git, add all relevant files and then commit your changes.
git add .
git commit -m "<COMMIT_MESSAGE>"
For example, replace <COMMIT_MESSAGE> with Add Amazon SageMaker Studio Lab
artifacts to GitHub repository to migrate to Amazon SageMaker Studio.
5. Push the commit to your remote repository. This repository has the format
https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git, where
<GITHUB_USERNAME> is your GitHub user name and <REPOSITORY_NAME> is your
remote repository name. Create a branch <BRANCH_NAME> to push the content to the GitHub
repository.
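A typical sequence, using the placeholders above:

git remote add origin https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git
git checkout -b <BRANCH_NAME>
git push origin <BRANCH_NAME>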
Step 3: Import your Studio Lab artifacts to Studio
From Studio, you can import files from your local machine or from a Git repository. You can do this using
the Studio GUI or terminal. The following procedure uses the examples from Step 2: Save your Studio
Lab artifacts (p. 253).
Import using your local machine
If you saved the files to your local machine, you can import the files to Studio using the following
steps.
1. Open the File Browser panel at the top left of Studio.
2. Choose the Upload Files icon on the menu at the top of the File Browser panel.
3. Navigate to the file you want to import, then choose Open.
Note
If you wish to import a directory into Studio, first compress the directory on your
local machine to a file. On a Mac, right-click the directory and choose Compress
"<DIRECTORY_NAME>". In Windows, right-click the directory and choose Send to, and
then choose Compressed (zipped) folder. After the directory is compressed, import the
compressed file using the preceding steps. Unzip the compressed file by navigating to the
Studio terminal and running the command unzip <DIRECTORY_NAME>.zip.
Import using a Git repository
This example provides two options for how to clone a GitHub repository into Studio. You can use
the Studio GUI by choosing the Git tab on the left side of Studio. Choose Clone a Repository,
then paste your GitHub repository URL from Step 2: Save your Studio Lab artifacts (p. 253).
Another option is to use the Studio terminal by using the following procedure.
1. Open the Studio Launcher. For more information on opening the Launcher, see Amazon
SageMaker Studio Launcher.
2. In the Launcher, in the Notebooks and compute resources section, choose Change
environment.
3. In Studio, open the Launcher. To open the Launcher, choose Amazon SageMaker Studio at the
top-left corner of Studio.
To learn about all the available ways to open the Launcher, see Use the Amazon SageMaker
Studio Launcher (p. 141).
4. In the Change environment dialog, use the Image dropdown list to select the Data Science
image and choose Select. This image comes with conda pre-installed.
5. In the Studio Launcher, choose Open image terminal.
6. From the image terminal, run the following command to clone your repository. This command
creates a directory named after <REPOSITORY_NAME> in your Studio instance and clones your
artifacts in that repository.
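A typical command, using the placeholders from Step 2:

git clone https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git

Step 4: Install your Studio Lab conda environments in Studio
From the image terminal, recreate each exported environment from its YAML file. A minimal sketch,
assuming the myenv.yml file from Step 1 is in the cloned repository:

conda env create -f <REPOSITORY_NAME>/myenv.yml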
After these commands are complete, you can select your environment as the kernel for your Studio
notebook instances. To view the available environment, run conda env list. To activate your
environment, run conda activate <ENVIRONMENT_NAME>.
Shut down resources
Topics
• Shut down an open notebook (p. 255)
• Shut down resources (p. 256)
Shut down an open notebook
1. Save the notebook contents by choosing the Save notebook and create checkpoint icon, located in
the notebook menu.
Shut down resources
On the left sidebar of Studio Lab, you will find the Running Terminals and Kernels pane and icon.
The Running Terminals and Kernels pane has three sections. Each section lists all of the resources
of that type. You can shut down each resource individually, or shut down all resources in a section
simultaneously.
To shut down resources, use the following procedure.
1. In the left sidebar, choose the Running Terminals and Kernels icon.
2. Do either of the following:
• To shut down a specific resource: Choose the SHUT DOWN icon on the same row as the resource.
• To shut down all resources in a section: Choose Shut Down All, which is located to the right of the
section label. After a confirmation dialog box appears, choose Shut down all to proceed.
Troubleshooting
This guide shows common errors that might occur when using Amazon SageMaker Studio Lab. Each
error includes a description, as well as a solution.
Note
You cannot share your password with multiple users or use Studio Lab to mine cryptocurrency.
We don’t recommend using Studio Lab for production tasks because of runtime limits.
If you can’t access your account, verify that you are using the correct email and password. If you have
forgotten your password, use the following steps to reset your password. If you still cannot access your
account, you must request and register for a new account using the instructions in Onboard to Amazon
SageMaker Studio Lab (p. 234).
Forgot password
If you forget your password, you can request a password reset from the Studio Lab sign-in page.
If the Studio Lab project runtime does not launch, try launching it again. If this doesn't work, switch
the instance type from CPU to GPU (or in reverse). For more information, see Change your compute
type (p. 237).
If there is an issue with the environment used to run JupyterLab, then Studio Lab will automatically
recreate the environment. Studio Lab does not support manual activation of this process.
Conflicting versions
Because you can add packages and modify your environment as needed, you may run into conflicts
between packages in your environment. If there are conflicts between packages in your environment, you
must remove the conflicting package.
When you build an environment from a YAML file, a package-version conflict or file issue might cause a
build to fail. To resolve this, remove the environment by running the following command. Do this before
attempting to build it again.
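A typical command, where <ENVIRONMENT_NAME> is the environment that failed to build:

conda env remove --name <ENVIRONMENT_NAME>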
Studio Lab uses AWS WAF, a web application firewall service, to protect your resources; AWS WAF uses
JavaScript. If you are using a browser security plugin that prevents JavaScript from downloading, this
error may appear. To use Studio Lab, allow the JavaScript download from *.awswaf.com as a trusted
domain. For more information on AWS WAF, see AWS WAF in the AWS WAF, AWS Firewall Manager, and
AWS Shield Advanced Developer Guide.
Disk space is full
If you run into a notification mentioning that your disk space is full, or a File Load Error
for <FILE_NAME> while attempting to open a file, you can remove files, directories, libraries, or
environments to increase space. For more information on managing your libraries and environments, see
Manage your environment (p. 244).
If you run into a notification that Project runtime is in safe mode, you must free up some disk space to
resume using the Studio Lab project runtime. Follow the instructions in the preceding troubleshooting
item, Disk space is full. Once at least 500 MB of space has been cleared, you can restart the project
runtime to use Studio Lab. To restart, choose Amazon SageMaker Studio Lab in the top menu of Studio
Lab and choose Restart JupyterLab....
If you run into an error when importing cv2 after installing opencv-python, you must uninstall
opencv-python and install opencv-python-headless as follows.
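Typical commands, run in the environment where you installed opencv-python:

pip uninstall opencv-python
pip install opencv-python-headless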
The Studio Lab IDE may fail to render when large files are opened, resulting in blocked access to Studio
Lab resources. To resolve this, reset the Studio Lab workspace using the following procedure.
1. After you open the IDE, copy the URL in your browser's address bar. This URL should be in the
https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab
format. Close the tab.
2. In a new tab, paste the URL and remove anything after https://xxxxxx.studio.us-
east-2.sagemaker.aws/studiolab/default/jupyter/lab.
3. Add ?reset to the end of the URL, so it is in the https://xxxxxx.studio.us-
east-2.sagemaker.aws/studiolab/default/jupyter/lab?reset format.
4. Navigate to the updated URL. This resets the saved UI state and makes the Studio Lab IDE
responsive.
Amazon SageMaker Canvas
With Canvas, you can access Ready-to-use models or build a custom model trained on your data.
The Ready-to-use models (p. 289) in Canvas can extract insights from your data for a variety of use
cases. You don’t have to build a model to use Ready-to-use models because they are powered by Amazon
AI services, including Amazon Rekognition, Amazon Textract, and Amazon Comprehend. You only have to
import your data and start using a solution to generate predictions.
If you want a model that is customized to your use case and trained with your data, you can build a
model (p. 297) and get predictions customized to your data.
You can also bring your own models into Canvas from Amazon SageMaker Studio.
To learn more about pricing, see the SageMaker Canvas pricing page. You can also see Manage billing
and cost in SageMaker Canvas (p. 400) for more information.
SageMaker Canvas is supported in the following AWS Regions:
• US East (Ohio)
• US East (N. Virginia)
• US West (Oregon)
• Asia Pacific (Mumbai)
• Asia Pacific (Seoul)
• Asia Pacific (Singapore)
• Asia Pacific (Sydney)
• Asia Pacific (Tokyo)
• Europe (Frankfurt)
• Europe (Ireland)
Topics
• Are you a first-time SageMaker Canvas user? (p. 259)
• Getting started with using Amazon SageMaker Canvas (p. 259)
• Setting Up and Managing Amazon SageMaker Canvas (for IT Administrators) (p. 264)
• Use Ready-to-use models (p. 289)
• Use custom models (p. 297)
• Logging out of Amazon SageMaker Canvas (p. 392)
• Limitations and troubleshooting (p. 393)
• Manage billing and cost in SageMaker Canvas (p. 400)
Getting started with using Amazon SageMaker Canvas
Topics
• Prerequisites for setting up Amazon SageMaker Canvas (p. 260)
• Step 1: Log in to Amazon SageMaker Canvas as a business user (p. 262)
• Step 2: Use SageMaker Canvas to get predictions (p. 264)
Prerequisites for setting up Amazon SageMaker Canvas
The following sections describe how to set up an Amazon SageMaker Domain and give yourself
SageMaker Canvas permissions.
Important
For you to set up Amazon SageMaker Canvas, your version of Amazon SageMaker Studio must
be 3.19.0 or later. For information about updating Amazon SageMaker Studio, see Shut down
and Update SageMaker Studio (p. 199).
Use the following procedure to configure the general settings for the Domain:
1. Under Permission, for IAM role, choose an option from the role selector.
If you choose Enter a custom IAM role ARN, the role must have at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).
If you choose Create a new role, the Create an IAM role dialog opens.
2. Specify the following settings.
• Your VPC information – For more information, see Choose an Amazon VPC (p. 46) and Configure
Amazon SageMaker Canvas in a VPC without internet access (p. 285).
• (Optional) Encryption key – SageMaker uses an AWS KMS key to encrypt your Amazon Elastic File
System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) file systems. By default, it
uses an AWS managed key. To use a customer managed key, enter its key ID or Amazon Resource
Name (ARN). For more information, see Protect Data at Rest Using Encryption (p. 3043).
Note
Encryption in transit is only available for Amazon SageMaker Studio.
3. Select Next.
Use the following procedure to configure the SageMaker Canvas settings for the Domain:
1. For the Canvas base permissions configuration, leave the Enable Canvas base permissions
option turned on (it is turned on by default). This attaches the AmazonSageMakerCanvasFullAccess
policy to your user's execution role and establishes the minimum required permissions to use the
SageMaker Canvas app.
2. (Optional) For the Canvas Ready-to-use models configuration, leave the Enable Canvas Ready-to-
use models option turned on to give your users permissions to generate predictions with Ready-to-
use models in Canvas (it is turned on by default).
3. (Optional) For the Time series forecasting configuration, leave the Enable time series forecasting
option turned on to give your users permissions to do time series forecasting in SageMaker Canvas
(it is turned on by default).
4. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role, or select Use an existing execution role if you already have an IAM role with the
required Amazon Forecast permissions attached (for more information, see the IAM role setup
method (p. 278)).
5. (Optional) For the ML Ops permissions configuration section, leave the Enable Model Registry
registration permissions for all users option turned on to give your users permissions to register
their model version to the SageMaker model registry (it is turned on by default). For more
information, see Register a model version in the SageMaker model registry (p. 373).
6. (Optional) Add Tags to track your cost and usage trends in AWS Billing and Cost Management.
SageMaker adds the tags you specify in the Domain to all of the SageMaker Canvas apps you
create in the Domain. For more information about billing and tags, see Manage billing and cost in
SageMaker Canvas (p. 400).
7. Finish making any other changes to your Domain setup, and then choose Submit.
Note
If you encounter any issues with granting permissions through the console, such as permissions
for Ready-to-use models, see the topic Troubleshooting issues with granting permissions
through the SageMaker console (p. 393).
When you set up the Domain, SageMaker Canvas creates an Amazon S3 bucket with a name that uses the
following pattern: sagemaker-<Region>-<your-account-id>. Your Canvas application data, such as
imported datasets and batch predictions, are stored in the Canvas/ folder in the bucket.
• Local file upload. The permissions for local file upload are turned on by default in the Canvas base
permissions when setting up your Domain. If you don’t have the ability to upload local files from your
machine to SageMaker Canvas, you can attach a CORS policy to the default bucket that SageMaker
created for your Domain (sagemaker-<Region>-<your-account-id>). For more information, see
Grant Your Users Permissions to Upload Local Files.
• Custom image and text prediction models. The permissions for building custom image and
text prediction models are turned on by default in the Canvas base permissions when setting
up your Domain. However, if you have a custom IAM configuration and don't want to attach the
AmazonSageMakerCanvasFullAccess policy to your user's IAM execution role, then you must explicitly
grant your user the necessary permissions. For more information, see Grant Your Users Permissions to
Build Custom Image and Text Prediction Models (p. 275).
• Ready-to-use models. You might want to have the ability to use the Canvas Ready-to-use models
to make predictions for your data. The permissions are turned on by default when setting up your
Domain, or you can edit the permissions for a Domain that you’ve already created. The Canvas Ready-
to-use models permissions option adds the AmazonSageMakerCanvasAIServicesAccess policy to
your execution role. For more information, see the Get started (p. 290) section of the Ready-to-use
models documentation.
• Time series forecasting. If you’d like to have the ability to perform forecasts on time series data,
you can add time series forecasting permissions when setting up your Domain, or you can edit the
permissions for a Domain or user profile after creating your Domain. The required permissions are the
AmazonSageMakerCanvasForecastAccess managed policy and a trust relationship with Amazon
Forecast to the AWS IAM role you chose when setting up the user profile. For instructions on how
to add these permissions to your IAM role, see Grant Your Users Permissions to Perform Time Series
Forecasting.
• Send batch predictions to Amazon QuickSight. You might want to have the ability to send batch
predictions, or datasets of predictions you generate from a custom model, to Amazon QuickSight for
analysis. In QuickSight, you can build and publish predictive dashboards with your prediction results.
For instructions on how to add these permissions to your Canvas user's IAM role, see Grant Your Users
Permissions to Send Predictions to Amazon QuickSight.
• Register model versions to the model registry. You might want to register versions of your model
to the SageMaker model registry, which is a repository for tracking the status of updated versions of
your model. A data scientist or MLOps team working in the SageMaker model registry can view the
versions of your model that you’ve built and approve or reject them. Then, they can deploy your model
version to production or kick off an automated workflow. Model registration permissions are turned on
by default for your Domain. You can manage permissions at the user profile level and grant or remove
permissions to specific users. For more information, see Register a model version in the SageMaker
model registry (p. 373).
• Collaboration with data scientists. If you want to collaborate with Studio users and share models,
you must add additional permissions to the AWS IAM role you chose when setting up the user profile.
For instructions on how to add the policy to the role, see Grant Users Permissions to Collaborate with
Studio.
• Import data from Amazon Redshift. If you want to import data from Amazon Redshift, you must give
yourself additional permissions. You must add the AmazonRedshiftFullAccess managed policy to
the AWS IAM role you chose when setting up the user profile. For instructions on how to add the policy
to the role, see Grant Users Permissions to Import Amazon Redshift Data.
Note
The necessary permissions to import through other data sources, such as Amazon
Athena and SaaS platforms, are included in the AmazonSageMakerFullAccess and
AmazonSageMakerCanvasFullAccess policies. If you followed the standard setup instructions,
these policies should already be attached to your execution role. For more information about
these data sources and their permissions, see Connect to data sources (p. 310).
When the initial setup is complete, you can log in and begin using SageMaker Canvas.
When you log into SageMaker Canvas for the first time, there is a welcome message with quick getting
started tutorials that you can follow for a walkthrough of the SageMaker Canvas application.
You can follow the Get started with Canvas tutorial for a high-level overview of the SageMaker Canvas
application. There are also shorter tutorials that guide you through the individual steps of using
SageMaker Canvas. These tutorials show you how to import a dataset, build a model, analyze the results
of a built model, and generate predictions with your model. You can revisit the tutorials at any time by
choosing the Help button and then choosing one of the tutorials.
Step 2: Use SageMaker Canvas to get predictions
You can either use Canvas Ready-to-use models to make predictions without building a model, or you
can build a custom model for your specific business problem. Review the following information to decide
whether Ready-to-use models or custom models are best for your use case.
• Ready-to-use models. With Ready-to-use models, you can use pre-built models to extract insights
from your data. The Ready-to-use models cover a variety of use cases, such as language detection and
document analysis. To get started making predictions with Ready-to-use models, see Use Ready-to-use
models (p. 289).
• Custom models. With custom models, you can build a variety of model types that are customized to
make predictions for your data. Use custom models if you’d like to build a model that is trained on
your business-specific data and if you’d like to use features such as collaborating with data scientists
and evaluating your model’s performance. To get started with building a custom model, see Use
custom models (p. 297).
You can also bring your own model (BYOM) from other features in SageMaker. An Amazon SageMaker
Studio user can share their model with a Canvas user, and the Canvas user can generate predictions with
the model. To learn more, see Bring your own model to SageMaker Canvas.
Setting Up and Managing Amazon SageMaker Canvas (for IT Administrators)
You can also set up SageMaker Canvas for your users with AWS CloudFormation. For more information,
see AWS::SageMaker::App in the AWS CloudFormation User Guide.
Topics
• Grant Your Users Permissions to Upload Local Files (p. 265)
• Set Up SageMaker Canvas for Your Users (p. 267)
• Encrypt Your SageMaker Canvas Data with AWS KMS (p. 271)
• Grant Your Users Permissions to Build Custom Image and Text Prediction Models (p. 275)
• Grant Your Users Permissions to Perform Time Series Forecasting (p. 275)
• Update SageMaker Canvas for Your Users (p. 279)
• Request a Quota Increase (p. 280)
• Grant Users Permissions to Import Amazon Redshift Data (p. 281)
• Grant Users Permissions to Collaborate with Studio (p. 282)
• Grant Your Users Permissions to Send Predictions to Amazon QuickSight (p. 283)
• Manage apps (p. 284)
• Configure Amazon SageMaker Canvas in a VPC without internet access (p. 285)
Grant Your Users Permissions to Upload Local Files
To grant users permissions to upload local files to the bucket, you can attach a CORS configuration to
it using either of the following procedures. You can use the first method when setting up your Domain
or editing the existing Domain settings, where you opt in to allow SageMaker to attach the CORS
configuration to the default bucket for you. The second method is the manual method, where you can
attach the CORS configuration to the bucket yourself.
You can turn on this option when doing a Quick setup for your Domain in the console.
If you are doing a Standard setup for your Domain, then use the following procedure for the Canvas
settings section to turn on local file upload.
1. For Enable and configure Canvas permissions, select Local file upload. (It's already checked by
default.)
2. Choose Next.
3. Finish setting up the Domain.
Your users can now upload local files into their SageMaker Canvas application.
You can also turn on or turn off local upload permissions for an existing Domain by using the following
procedure.
1. Sign in to https://console.aws.amazon.com/s3/.
2. Choose the bucket with the name that uses the following pattern:
sagemaker-{region}-{account-ID}.
3. Choose Permissions.
4. Navigate to Cross-origin resource sharing (CORS).
5. Choose Edit.
6. Add the following CORS policy:
[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "POST"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": []
    }
]
In the preceding procedure, the CORS policy must have "POST" listed under AllowedMethods.
If your users still can't upload the local files after you update the CORS policy, the browser might be
caching the CORS settings from a previous upload attempt. If they're running into issues, instruct them
to clear their browser cache and try again.
Set Up SageMaker Canvas for Your Users
Use Okta Single Sign-On (Okta SSO) to grant your users access to Amazon SageMaker Canvas.
SageMaker Canvas supports SAML 2.0 SSO methods. The following sections guide you through the
procedures to set up Okta SSO.
To set up a Domain, see Onboard to Amazon SageMaker Domain Using IAM. You can use the
following information to help you complete the procedure in the section:
Use the following procedure to set up Okta. For all of the following procedures, you specify the same
IAM role for IAM-role.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::123456789012:saml-provider/Okta"
            },
            "Action": [
                "sts:AssumeRoleWithSAML",
                "sts:SetSourceIdentity",
                "sts:TagSession"
            ],
            "Condition": {
                "StringEquals": {
                    "SAML:aud": "https://signin.aws.amazon.com/saml"
                }
            }
        }
    ]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AmazonSageMakerPresignedUrlPolicy",
"Effect": "Allow",
"Action": [
"sagemaker:CreatePresignedDomainUrl",
"sagemaker:CreatePresignedDomainUrlWithPrincipalTag"
],
"Resource": "*"
}
]
}
To configure Amazon SageMaker Canvas to use Okta, follow the steps in this section. You must specify
unique user names for each SageMakerStudioProfileName field. For example, you can use user.login
as a value. If the username is different from the SageMaker Canvas profile name, choose a different
uniquely identifying attribute. For example, you can use an employee's ID number for the profile name.
For an example of values that you can set for Attributes, see the code following the procedure.
• SAML 2.0
• Default Relay State – https://Region.console.aws.amazon.com/sagemaker/home?region=Region#/studio/canvas/open/StudioId. You can find the Studio ID in the console:
https://console.aws.amazon.com/sagemaker/
6. Choose Attributes.
7. In the SageMakerStudioProfileName fields, specify unique values for each username. The
usernames must match the usernames that you've created in the AWS console.
Attribute 1:
Name: https://aws.amazon.com/SAML/Attributes/PrincipalTag:SageMakerStudioUserProfileName
Value: ${user.login}

Attribute 2:
Name: https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys
Value: {"SageMakerStudioUserProfileName"}
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CreateSageMakerStudioUserProfilePolicy",
"Effect": "Allow",
"Action": "sagemaker:CreateUserProfile",
"Resource": "*",
"Condition": {
"ForAnyValue:StringEquals": {
"aws:TagKeys": [
"studiouserid"
]
}
}
}
]
}
If you choose to add the preceding policy to the admin user, you must use the following permissions
from Set up ID federation in IAM (p. 268).
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonSageMakerPresignedUrlPolicy",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:CreatePresignedDomainUrlWithPrincipalTag"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/SageMakerStudioUserProfileName}"
                }
            }
        }
    ]
}
Amazon SageMaker Canvas provides you with several options for encrypting your data. SageMaker
Canvas provides default encryption within the application for tasks such as building your model and
generating insights. You can also choose to encrypt data stored in Amazon S3 to protect your data
at rest. SageMaker Canvas supports importing encrypted datasets, so you can establish an encrypted
workflow. The following sections describe how you can use AWS KMS encryption to protect your data
while building models with SageMaker Canvas.
Prerequisites
To use your own KMS key for either of the previously described purposes, you must first grant your user's
IAM role permission to use the key. Then, you can specify the KMS key when setting up your Domain.
The simplest way to grant your role permission to use the key is to modify the key policy. Use the
following procedure to grant your role the necessary permissions.
{
    "Sid": "ExampleStmt",
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
    ],
    "Effect": "Allow",
    "Principal": {
        "AWS": "<arn:aws:iam::111122223333:role/Jane>"
    },
    "Resource": "*"
}
The less preferred method is to modify the user’s IAM role to grant the user permissions to use or
manage the KMS key. If you use this method, the KMS key policy must also allow access management
through IAM. To learn how to grant permission to a KMS key through the user’s IAM role, see Specifying
KMS keys in IAM policy statements in the AWS KMS Developer Guide.
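If you prefer to script the key policy change instead of editing it in the console, the following sketch appends a statement like the preceding one using the AWS KMS API. The key ID and role ARN are placeholders.

import json

import boto3

kms = boto3.client("kms")

key_id = "1234abcd-12ab-34cd-56ef-1234567890ab"  # placeholder KMS key ID

# Fetch the current key policy, append the new statement, and write it back.
policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
policy["Statement"].append(
    {
        "Sid": "ExampleStmt",
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/Jane"},
        "Resource": "*",
    }
)
kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))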
To use your AWS KMS key to encrypt time series forecasting models in SageMaker Canvas, you must
modify the key policy for the KMS key used to store objects in Amazon S3. Your key policy must
grant permissions to the AmazonSageMakerCanvasForecastRole, which SageMaker creates
when you grant time series forecasting permissions for your users. Amazon Forecast uses the
AmazonSageMakerCanvasForecastRole to perform time series forecasting operations in SageMaker
Canvas. Your KMS key must grant permissions to this role in order to ensure data is encrypted for time
series forecasting.
To modify the permissions of your KMS key policy to allow encrypted time series forecasting, do the
following.
{
    "Sid": "Enable IAM Permissions for Amazon Forecast KMS access",
    "Effect": "Allow",
    "Principal": {
        "AWS": "<arn:aws:iam::111122223333:role/service-role/AmazonSagemakerCanvasForecastRole-444455556666>"
    },
    "Action": [
        "kms:DescribeKey",
        "kms:CreateGrant",
        "kms:RetireGrant",
        "kms:GenerateDataKey",
        "kms:GenerateDataKeyWithoutPlainText",
        "kms:Decrypt"
    ],
    "Resource": "*"
}
You can now use your KMS key to encrypt time series forecasting operations in SageMaker Canvas.
Note
The following permissions are only required if you are using the IAM role setup method to
configure time series forecasting. Add the following permissions policy to your user's IAM role.
You must also update the key policy with the permissions required for Amazon Forecast. For
more information about the permissions required for time series forecasting, see Grant Your
Users Permissions to Perform Time Series Forecasting (p. 275).
{
    "Sid": "Enable IAM Permissions for Amazon Forecast KMS access",
    "Effect": "Allow",
    "Principal": {
        "AWS": "<arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-111122223333444>"
    },
    "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:CreateGrant",
        "kms:RetireGrant",
        "kms:GenerateDataKey",
        "kms:GenerateDataKeyWithoutPlainText"
    ],
    "Resource": "*"
}
The first KMS key you can use in SageMaker Canvas is used for encrypting application data stored on
Amazon Elastic Block Store (EBS) volumes and in the Amazon Elastic File System that SageMaker creates
in your Domain. SageMaker Canvas encrypts your data with this key in the underlying application and
temporary storage systems created when using compute instances for building models and generating
insights. SageMaker Canvas passes the key to other AWS services, such as Autopilot, whenever
SageMaker Canvas initiates jobs with them to process your data.
You can specify this key by setting the KmsKeyId in the CreateDomain API call or while doing the
Standard Domain setup in the console. If you don't specify your own KMS key, SageMaker uses a default
AWS managed KMS key to encrypt your data in the SageMaker Canvas application.
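For example, a sketch of a CreateDomain call that supplies a customer managed key might look like the following. The Domain name, VPC, subnet, role, and key values are all placeholders.

import boto3

sm = boto3.client("sagemaker")

# All identifiers below are placeholders for your own resources.
sm.create_domain(
    DomainName="canvas-domain",
    AuthMode="IAM",
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/SageMakerExecutionRole"
    },
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0abc1234", "subnet-0def5678"],
    # Customer managed key used to encrypt the volumes backing the Domain.
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)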
To specify your own KMS key for use in the SageMaker Canvas application through the console, first set
up your Amazon SageMaker Domain using the Standard setup. Use the following procedure to complete
the Network and Storage Section for the Domain.
The second KMS key you can specify is used for data that SageMaker Canvas stores to Amazon S3.
SageMaker Canvas saves duplicates of your input datasets, application and model data, and output data
to the Region’s default SageMaker S3 bucket for your account. The naming pattern for this bucket is
sagemaker-{region}-{account-ID}, and SageMaker Canvas stores data in the Canvas/ folder.
You must grant your user's IAM role additional permissions if you want to import data from Amazon S3
that is already encrypted with AWS KMS.
To grant your user permissions to import encrypted datasets from Amazon S3 into SageMaker Canvas,
add the following permissions to the IAM execution role that you've used for the user profile.
"kms:Decrypt",
"kms:GenerateDataKey"
To learn how to edit the IAM permissions for a role, see Adding and removing IAM identity permissions
in the IAM User Guide. For more information about KMS keys, see Key policies in AWS Key Management
Service in the AWS KMS Developer Guide.
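As one way to add these permissions, the following sketch attaches an inline policy to the execution role with the IAM API. The role and policy names are placeholders; in practice, scope the Resource down to your KMS key ARN.

import json

import boto3

iam = boto3.client("iam")

# Hypothetical role and policy names.
iam.put_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-111122223333444",
    PolicyName="CanvasEncryptedS3Import",
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                    "Resource": "*",  # scope to your KMS key ARN in practice
                }
            ],
        }
    ),
)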
FAQs
Refer to the following FAQ items for answers to commonly asked questions about SageMaker Canvas
AWS KMS support.
Q: Does SageMaker Canvas store or retain my KMS key?
A: No. SageMaker Canvas may temporarily cache your key or pass it on to other AWS services (such as
Autopilot), but SageMaker Canvas does not retain your KMS key.
Q: I specified a KMS key when setting up my Domain. Why did my dataset fail to import in
SageMaker Canvas?
A: Your user’s IAM role may not have permissions to use that KMS key. To grant your user permissions,
see the Prerequisites (p. 271). Another possible error is that you have a bucket policy on your Amazon
S3 bucket that requires the use of a specific KMS key that doesn’t match the KMS key you specified
in your Domain. Make sure that you specify the same KMS key for your Amazon S3 bucket and your
Domain.
Q: Can I change the default SageMaker S3 bucket used to store SageMaker Canvas data?
A: No. SageMaker Canvas uses the default SageMaker S3 bucket to store duplicates of your input
datasets, model artifacts, and model outputs.
Q: What use cases are supported for using KMS keys with SageMaker Canvas?
A: With SageMaker Canvas, you can use your own encryption keys with AWS KMS for building regression,
binary and multi-class classification, and time series forecasting models, as well as for batch inference
with your model.
Q: Do I need to grant additional permissions to use my KMS key for time series forecasting?
A: Yes. You must give your KMS key additional permissions in order to perform encrypted time series
forecasting. For more information about how to modify your key's policy in order to grant time series
forecasting permissions, see Prerequisites for time series forecasting (p. 272).
If you are using a custom IAM configuration, you must explicitly add permissions to your user's IAM
execution role so that they can build custom image and text prediction model types. To grant the
necessary permissions to build image and text prediction models, read the following section to learn
how to attach a least-permissions policy to your role.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateAutoMLJobV2",
                "sagemaker:DescribeAutoMLJobV2"
            ],
            "Resource": "*"
        }
    ]
}
For more information about AWS managed policies, see Managed policies and inline policies in the IAM
User Guide.
If you want to encrypt your time series forecasts with your own key, you must use an AWS KMS key
and modify your KMS key's policy to grant permissions to the role used by Amazon Forecast. For more
information about setting up your KMS key and modifying the policy for time series forecasting, see
Prerequisites for time series forecasting (p. 272).
If you are setting up your Amazon SageMaker Domain for the first time and want to turn on time series
forecasting permissions for all users in the Domain, then use the following procedures.
Quick setup
Use the following procedure to turn on SageMaker Canvas time series forecasting permissions when
doing a Quick setup for your Domain.
1. In the Amazon SageMaker Domain Quick setup, fill out the Name and Default execution role
fields in the User profile section.
2. Leave the Enable SageMaker Canvas permissions option turned on. It is turned on by default.
3. Choose Submit to finish setting up your Domain.
Standard setup
Use the following procedure to turn on SageMaker Canvas time series forecasting permissions when
doing a Standard setup for your Domain.
1. In the Amazon SageMaker Domain Standard setup, fill out the General settings, Studio
settings, and RStudio settings pages.
2. Choose the Canvas settings page.
3. For the Canvas base permissions configuration, leave the Enable Canvas base permissions
option turned on. It is turned on by default. These permissions are required in order to turn on
time series forecasting permissions.
4. For the Time series forecasting configuration, leave the Enable time series forecasting option
turned on. It is turned on by default.
5. Select Create and use a new execution role, or select Use an existing execution role if you
already have an IAM role with the required Amazon Forecast permissions attached. For more
information, see the IAM role setup method (p. 278).
6. Finish making any other changes to your Domain setup, and then choose Submit.
Your users should now have the necessary permissions to perform time series forecasting in SageMaker
Canvas.
4. In the User profiles tab, select the name of the user whose permissions you want to edit.
5. On the User Details page, choose Edit.
6. Choose the Canvas settings page.
7. Turn on Enable Canvas base permissions. These permissions are required in order to turn on time
series forecasting permissions.
8. Turn on the Enable time series forecasting option.
9. If you want to use a different execution role for the user than the role specified in the Domain, select
Create and use a new execution role, or Use an existing execution role if you already have an IAM
role ready to use.
Note
If you want to use an existing IAM role, make sure that it has the IAM policy
AmazonSageMakerCanvasForecastAccess attached and has a trust relationship that
establishes Amazon Forecast as a service principal. For more information, see the section
IAM role setup method (p. 278).
10. The Canvas settings page should look like the following screenshot. Finish making any other
changes to your user profile, and then choose Submit to save your changes.
Your user should now have permission to do time series forecasting in SageMaker Canvas.
You can also remove your user's permissions by using the preceding procedure and turning off the
Enable time series forecasting option.
The following section shows you how to create the trust relationship and attach the
AmazonSageMakerCanvasForecastAccess managed policy to your IAM role, which grants the
minimum permissions necessary for time series forecasting to work in SageMaker Canvas.
To configure an IAM role with the manual method, use the following procedure.
6. Once you have the name of the user's IAM role, go to the IAM console.
7. Choose Roles.
8. Search for the user's IAM role by name from the list of roles and select it.
9. Under Permissions, choose Add permissions.
10. Choose Attach policies.
11. Search for the AmazonSageMakerCanvasForecastAccess managed policy and select it. Choose
Attach policies to attach the policy to the role.
After attaching the policy, the role's Permissions section should now include
AmazonSageMakerCanvasForecastAccess.
12. Return to the IAM role's page, and under Trust relationships, choose Edit trust policy.
13. In the Edit trust policy editor, update the trust policy to add Forecast as a service principal. The
policy should look like the following example.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "forecast.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
You should now have an IAM role that has the policy AmazonSageMakerCanvasForecastAccess
attached to it and a trust relationship established with Amazon Forecast, giving users permission to
perform time series forecasting in SageMaker Canvas. For information about AWS managed policies, see
Managed policies and inline policies.
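If you script this setup, the following sketch mirrors the preceding console steps using the IAM API. The role name is a placeholder, and you should confirm the managed policy ARN (shown here under the service-role/ path) in your account.

import json

import boto3

iam = boto3.client("iam")

role_name = "AmazonSageMaker-ExecutionRole-111122223333444"  # placeholder

# Attach the managed policy that grants the minimum Forecast permissions.
# Verify the policy ARN/path in your account before using it.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AmazonSageMakerCanvasForecastAccess",
)

# Replace the trust policy so both SageMaker and Forecast can assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com", "forecast.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}
iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=json.dumps(trust_policy))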
Note
If you use this method to set up time series forecasting and want to use AWS KMS encryption
for your forecasts, then you must configure your KMS key’s policy to grant additional
permissions. For more information, see Prerequisites for time series forecasting (p. 272).
To update the Amazon SageMaker Canvas application, you must delete the previous version.
Important
Deleting the previous version of Amazon SageMaker Canvas doesn't delete the data or models
that the users have created.
Use the following procedure to log in to AWS, open Amazon SageMaker Domain, and update Amazon
SageMaker Canvas. The users can start using the SageMaker Canvas application when they log back in.
The following image shows the user profile page and highlights the Delete app action from the
preceding procedure.
Amazon SageMaker Canvas uses several other AWS services to process the requests of your users.
For information about increasing quotas for SageMaker Canvas operations that aren't used to forecast
time series data, see Amazon SageMaker endpoints and quotas.
For information about increasing quotas for SageMaker Canvas operations that are used to forecast time
series data, see Amazon Forecast endpoints and quotas.
To allow SageMaker Canvas to complete post-building analysis of models, you must increase the
SageMaker Hosting endpoint limit for the ml.m5.2xlarge instance type to a non-zero value in your
AWS account. After building a model, SageMaker Canvas hosts the model on a SageMaker Hosting
endpoint and uses the endpoint to generate the post-building analysis. If you don't increase the default
account limit of 0 for ml.m5.2xlarge instances, SageMaker Canvas cannot complete this step and
generates an error during post-building analysis.
Use the following procedure to request a limit increase for your account.
Add the AmazonRedshiftFullAccess policy to the user's IAM role. After attaching the policy, the
role's Permissions section should include AmazonRedshiftFullAccess.
To add Amazon Redshift as a service principal to the IAM role, do the following.
1. On the same page for the IAM role, under Trust relationships, choose Edit trust policy.
2. In the Edit trust policy editor, update the trust policy to add Amazon Redshift as a service principal.
An IAM role that allows Amazon Redshift to access other AWS services on your behalf has a trust
relationship as follows:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
You should now have an IAM role that has the policy AmazonRedshiftFullAccess attached to it and a
trust relationship established with Amazon Redshift, giving users permission to import Amazon Redshift
data into SageMaker Canvas. For more information about AWS managed policies, see Managed policies
and inline policies in the IAM User Guide.
Amazon Redshift modifies the cluster to complete the change, and the IAM role to which you previously
granted Amazon Redshift permissions is now associated with your Amazon Redshift cluster. Your users
now have the required permissions to import Amazon Redshift data into SageMaker Canvas.
For more information about how Canvas users can share models with Studio users, see Collaborate with
data scientists (p. 377). For more information about how Canvas users can bring a model shared from
Studio, see Bring your own model to SageMaker Canvas (p. 384).
Before Canvas and Studio users can collaborate, the users must be in the same Amazon SageMaker
Domain. Add the following IAM permissions to the same IAM execution role that you've used for
their profiles.
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateSharedModel",
"sagemaker:DescribeSharedModel",
"sagemaker:ListSharedModelEvents",
"sagemaker:ListSharedModels",
"sagemaker:ListSharedModelVersions",
"sagemaker:SendSharedModelEvent",
"sagemaker:UpdateSharedModel",
],
"Resource": "*"
}
]
}
For more information about AWS managed policies, see Managed policies and inline policies in the IAM
User Guide.
To grant the necessary permissions to share batch predictions with users in QuickSight, you must add a
permissions policy to the AWS Identity and Access Management (IAM) execution role that you’ve used for
the user profile. The following section shows you how to attach a least-permissions policy to your role.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "quicksight:CreateDataSet",
                "quicksight:ListUsers",
                "quicksight:ListNamespaces",
                "quicksight:CreateDataSource",
                "quicksight:PassDataSet",
                "quicksight:PassDataSource"
            ],
            "Resource": [
                "arn:aws:quicksight:*:<your-account-number>:datasource/*",
                "arn:aws:quicksight:*:<your-account-number>:user/*",
                "arn:aws:quicksight:*:<your-account-number>:namespace/*",
                "arn:aws:quicksight:*:<your-account-number>:dataset/*"
            ]
        }
    ]
}
You should now have a customer-managed IAM policy attached to your execution role that grants your
Canvas users the necessary permissions to send batch predictions to users in QuickSight.
Manage apps
The following sections describe how you can manage your SageMaker Canvas applications. You can view,
delete, or relaunch your apps from the Domains section of the SageMaker console.
The Status column displays the status of the app, such as Ready, Pending, or Deleted. If the app is
Ready, then your SageMaker Canvas workspace instance is active. You can delete the app from the
console or log out from the SageMaker Canvas interface.
Delete app
If you want to end your SageMaker Canvas workspace instance, you can either log out from the
SageMaker Canvas application or delete your application from the SageMaker console. A workspace
instance is dedicated for your use from when you start using SageMaker Canvas to the point when you
stop using it. Deleting the application only ends the workspace instance. Models and datasets aren’t
affected, but Quick build tasks automatically restart when you log in again. The billing for the workspace
instance also stops.
To delete your Canvas app through the AWS console, first close the browser tab in which your Canvas
app was open. Then, use the following procedure to delete your SageMaker Canvas application.
4. On the Domain details page, under User profiles, select the user profile name for the Canvas
application you want to view.
5. Under Apps, find the application that says Canvas in the App type column.
6. In the Action column, choose Delete app.
7. In the Delete app dialog box, select the Yes, delete app prompt, confirm the deletion by typing
delete in the text field, and then choose Delete.
After you've successfully deleted the application, the Status column says Deleted. Otherwise, your
application is still active.
You can also end the workspace instance by logging out (p. 392) from within the SageMaker Canvas
application.
Relaunch app
If you delete or log out of your SageMaker Canvas application, you can relaunch the application from
your user profile page in the SageMaker console.
When the SageMaker Canvas application is running in the AWS managed VPC, it can interact with other
AWS services either over an internet connection or through VPC endpoints created in a customer-
managed VPC (without public internet access). SageMaker Canvas applications can access these VPC
endpoints through a Studio-created network interface that provides connectivity to the customer-
managed VPC. The default behavior of the SageMaker Canvas application is to have internet access.
When using an internet connection, the containers for the preceding jobs access AWS resources over the
internet, such as the Amazon S3 buckets where you store training data and model artifacts.
However, if you have security requirements to control access to your data and job containers, we
recommend that you configure SageMaker Canvas and your VPC so that your data and containers aren’t
accessible over the internet. SageMaker uses the VPC configuration settings you specify when setting up
your Domain for SageMaker Canvas.
If you want to configure your SageMaker Canvas application without internet access, you must configure
your VPC settings when you onboard to Amazon SageMaker Domain (p. 37), set up VPC endpoints,
and grant the necessary AWS Identity and Access Management permissions. For information about
configuring a VPC in Amazon SageMaker, see Choose an Amazon VPC (p. 46). The following sections
describe how to run SageMaker Canvas in a VPC without public internet access.
When onboarding to Domain, if you choose Public internet only as the network access type, the VPC is
SageMaker managed and allows internet access.
You can change this behavior by choosing VPC only so that SageMaker sends all traffic to a network
interface that SageMaker creates in your specified VPC. When you choose this option, you must provide
the subnets, security groups, and VPC endpoints that are necessary to communicate with the SageMaker
API and SageMaker Runtime, and various AWS services, such as Amazon S3 and Amazon CloudWatch,
that are used by SageMaker Canvas. Note that you can only import data from Amazon S3 buckets
located in the same Region as your VPC.
The following procedures show how you can configure these settings to use SageMaker Canvas without
the internet.
To send SageMaker Canvas traffic to a network interface in your own VPC instead of over the internet,
specify the VPC you want to use when onboarding to Amazon SageMaker Domain (p. 37). You must also
specify at least two subnets in your VPC that SageMaker can use. Choose Standard setup and do the
following procedure when configuring the Network and Storage Section for the Domain.
After disabling internet access, finish the onboarding process to set up your Domain. For more
information about the VPC settings for Amazon SageMaker Domain, see Choose an Amazon VPC (p. 46).
SageMaker Canvas only accesses other AWS services to manage and store data for its functionality.
For example, it connects to Amazon Redshift if your users access an Amazon Redshift database. It can
connect to an AWS service such as Amazon Redshift using an internet connection or a VPC endpoint. Use
VPC endpoints if you want connections from your VPC to AWS services that don't traverse the public
internet.
A VPC endpoint creates a private connection to an AWS service that uses a networking path that is
isolated from the public internet. For example, if you set up access to Amazon S3 using a VPC endpoint
from your own VPC, then the SageMaker Canvas application can access Amazon S3 by going through
the network interface in your VPC and then through the VPC endpoint that connects to Amazon S3. The
communication between SageMaker Canvas and Amazon S3 is private.
For more information about configuring VPC endpoints for your VPC, see AWS PrivateLink.
The following are the VPC endpoints for each service you can use with SageMaker Canvas:
com.amazonaws.Region.sagemaker.runtime
com.amazonaws.Region.notebook
com.amazonaws.Region.forecastquery
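A sketch of creating one of these interface endpoints with the Amazon EC2 API follows. The VPC, subnet, and security group IDs are placeholders, and the Region in the service name must be replaced with your own.

import boto3

ec2 = boto3.client("ec2")

# Placeholder network identifiers; repeat this call for each required endpoint.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.sagemaker.runtime",
    SubnetIds=["subnet-0abc1234", "subnet-0def5678"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)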
You must also add the following endpoint policy for Amazon S3 to control AWS principal access to your
VPC endpoint. For information about how to update your VPC endpoint policy, see Control access to VPC
endpoints using endpoint policies.
{
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:CreateBucket",
        "s3:GetBucketCors",
        "s3:GetBucketLocation"
    ],
    "Resource": [
        "arn:aws:s3:::*SageMaker*",
        "arn:aws:s3:::*Sagemaker*",
        "arn:aws:s3:::*sagemaker*"
    ]
},
{
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:ListAllMyBuckets"
    ],
    "Resource": "*"
}
The SageMaker Canvas user must have the necessary AWS Identity and Access Management permissions
to allow connection to the VPC endpoints. The IAM role to which you give permissions must be the
same one you used when onboarding to Amazon SageMaker Domain. You can attach the SageMaker
managed AmazonSageMakerFullAccess policy to the IAM role for the user to give the user the
required permissions. If you require more restrictive IAM permissions and use custom policies instead,
then give the user’s role the ec2:DescribeVpcEndpointServices permission. SageMaker Canvas
requires these permissions to verify the existence of the required VPC endpoints for standard build jobs.
If it detects these VPC endpoints, then standard build jobs run by default in your VPC. Otherwise, they
will run in the default AWS managed VPC.
For instructions on how to attach the AmazonSageMakerFullAccess IAM policy to your user’s IAM
role, see Adding and removing IAM identity permissions.
To grant your user’s IAM role the granular ec2:DescribeVpcEndpointServices permission, use the
following procedure.
1. Sign in to the AWS Management Console and open the IAM console.
2. In the navigation pane, choose Roles.
3. In the list, choose the name of the role to which you want to grant permissions.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ec2:DescribeVpcEndpointServices",
            "Resource": "*"
        }
    ]
}
7. Choose Review policy, and then enter a Name for the policy (for example,
VPCEndpointPermissions).
8. Choose Create policy.
The user’s IAM role should now have permissions to access the VPC endpoints configured in your VPC.
If you are an administrator, you might want to apply different, user-specific VPC settings for each
user. When you override the default VPC's security group settings for a specific user, these
settings are passed on to the SageMaker Canvas application for that user.
You can override the security groups that a specific user has access to in your VPC when you set up a new
user profile in Studio. You can use the CreateUserProfile SageMaker API call (or create-user-profile with
the AWS CLI), and then in the UserSettings, specify the SecurityGroups for the user.
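For example, a sketch of a CreateUserProfile call that overrides the security groups for one user might look like the following. The Domain ID, user profile name, and security group ID are placeholders.

import boto3

sm = boto3.client("sagemaker")

# Placeholder identifiers; the SecurityGroups here override the Domain defaults
# for this user's SageMaker Canvas application only.
sm.create_user_profile(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="canvas-user-1",
    UserSettings={
        "SecurityGroups": ["sg-0123456789abcdef0"],
    },
)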
Canvas integrates with existing AWS services, such as Amazon Textract, Amazon Rekognition, and
Amazon Comprehend, to analyze your data and make predictions or extract insights. You can use the
predictive power of these services from within the Canvas application to get high quality predictions for
your data.
Each Ready-to-use model addresses a different use case:

• Sentiment analysis – Detect sentiment in lines of text, which can be positive, negative, neutral, or mixed. Currently, you can only do sentiment analysis for English language text. Input: Text (CSV or plain text)
• Entities extraction – Extract entities, which are real-world objects such as people, places, and commercial items, or units such as dates and quantities, from text. Input: Text (CSV or plain text)
• Personal information detection – Detect personal information that could be used to identify an individual, such as addresses, bank account numbers, and phone numbers, from text. Input: Text (CSV or plain text)
• Object detection in images – Detect objects, concepts, scenes, and actions in your images. Input: Image (JPG, PNG)
• Text detection in images – Detect text in your images. Input: Image (JPG, PNG)
• Expense analysis – Extract information from invoices and receipts, such as date, number, item prices, total amount, and payment terms. Input: Document (PDF, JPG, PNG, TIFF)
• Identity document analysis – Extract information from passports, driver licenses, and other identity documentation issued by the US Government. Input: Document (PDF, JPG, PNG, TIFF)
• Document analysis – Analyze documents and forms for relationships among detected text. Input: Document (PDF, JPG, PNG, TIFF)
Get started
To get started with Ready-to-use models, review the following information.
Prerequisites
To use Ready-to-use models in Canvas, you must turn on the Canvas Ready-to-use models
configuration permissions when setting up your Amazon SageMaker Domain. The Canvas Ready-
to-use models configuration attaches the AmazonSageMakerCanvasAIServicesAccess policy to your
Canvas user's AWS Identity and Access Management (IAM) execution role. If you encounter any issues
with granting permissions, see the topic Troubleshooting issues with granting permissions through the
SageMaker console (p. 393).
If you’ve already set up your Domain, you can edit your Domain settings and turn on the permissions. For
instructions on how to edit your Domain settings, see View and Edit Domains. When editing the settings
for your Domain, go to the Canvas settings and turn on the Enable Canvas Ready-to-use models
option.
1. (Optional) Import your data. You can import a tabular, image, or document dataset to generate
batch predictions, or a dataset of predictions, with Ready-to-use models. To get started with
importing a dataset, see Import data for Ready-to-use models (p. 291).
2. Generate predictions. You can generate single or batch predictions with your chosen Ready-
to-use model. To get started with making predictions, see Make predictions with Ready-to-use
models (p. 292).
You can use Canvas Ready-to-use models to get predictions for an entire dataset. You only have to
import your data into Canvas.
• Text data. Text data consists of text in a standard CSV format. The data should consist of at least one
column of plain text data.
• Image data. Image datasets consist of image files in JPG or PNG format.
• Document data. Document data consists of files in PDF, JPG, PNG, or TIFF format.
When you import your data into Canvas, you must make sure that it meets the input requirements. For
a table of requirements by data type, you can refer to the limits table on the Create a dataset (p. 301)
page for custom models.
You can import data into Canvas from a variety of data sources. For a table of all of the supported data
sources and what data types you can import from them, see the data sources table on the Import data
into Canvas page in the custom models documentation.
Use the following procedures to import datasets into Canvas that you can use with Ready-to-use models.
The procedures for importing text and image data are the same for Ready-to-use models and custom
models. You can refer to the custom model procedures for instructions on how to import these types of
datasets:
• To learn how to import text data, use the Import tabular data (p. 303) procedure within the custom
model documentation.
• To learn how to import image data, use the Import image data (p. 305) procedure within the custom
model documentation.
Note
You can only import image datasets from local file upload or an Amazon S3 bucket.
With document datasets, you can generate predictions for expense analysis, identity document
analysis, and document analysis Ready-to-use models. Review the limitations table in the Create a
dataset (p. 301) section to ensure that your document dataset meets the requirements for document
data.
Note
You can only import document datasets from local file upload or an Amazon S3 bucket.
While your dataset is importing into Canvas, you can see your datasets listed on the Datasets page. From
this page, you can View your dataset details (p. 306).
When the Status of your dataset shows as Ready, Canvas has successfully imported your data.
On the Datasets page, you can choose your dataset to preview it, which shows you up to the first 100
documents of your dataset.
Ready-to-use models support the following use cases for each data type:
• Text data: Sentiment analysis, entities extraction, language detection, personal information detection
• Image data: Object detection in images, text detection in images
• Document data: Expense analysis, identity document analysis, document analysis
The following screenshot shows you the landing page for Ready-to-use models, which showcases all of
the different solutions.
Each Ready-to-use model supports both Single predictions and Batch predictions for your dataset. A
Single prediction is when you only need to make one prediction. For example, you have one image from
which you want to extract text, or one paragraph of text for which you want to detect the dominant
language. A Batch prediction is when you’d like to make predictions for an entire dataset. For example,
you might have a CSV file of customer reviews for which you’d like to analyze the customer sentiment, or
you might have image files in which you’d like to detect objects.
When you have your data and have identified your use case, choose one of the following workflows to
make predictions for your data.
Single predictions
To make a single prediction for Ready-to-use models that accept text data, do the following:
1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For text data,
it should be one of the following: Sentiment analysis, Entities extraction, Language detection, or
Personal information detection.
3. On the Run predictions page for your chosen Ready-to-use model, choose Single prediction.
4. For Text field, enter the text for which you’d like to get a prediction.
5. Choose Generate prediction results to get your prediction.
In the right pane Prediction results, you receive an analysis of your text in addition to a Confidence
score for each result or label. For example, if you chose language detection and entered a passage of text
in French, you might get French with a 95% confidence score and traces of other languages, like English,
with a 5% confidence score.
The following screenshot shows the results for a single prediction using language detection where the
model is 100% confident that the passage is English.
Batch predictions
To make batch predictions for Ready-to-use models that accept text data, do the following:
1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For text data,
it should be one of the following: Sentiment analysis, Entities extraction, Language detection, or
Personal information detection.
3. On the Run predictions page for your chosen Ready-to-use model, choose Batch prediction.
4. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you are directed through the import data workflow.
5. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.
After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can Preview the output data. Then, you can choose Download CSV to download the results.
Single predictions
To make a single prediction for Ready-to-use models that accept image data, do the following:
1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For image
data, it should be one of the following: Object detection in images or Text detection in images.
3. On the Run predictions page for your chosen Ready-to-use model, choose Single prediction.
4. Choose Upload image.
5. You are prompted to select an image to upload from your local computer. Select the image from
your local files, and then the prediction results generate.
In the right pane Prediction results, you receive an analysis of your image in addition to a Confidence
score for each object or text detected. For example, if you chose object detection in images, you receive
a list of objects in the image along with a confidence score of how certain the model is that each object
was accurately detected, such as 93%.
The following screenshot shows the results for a single prediction using the object detection in images
solution, where the model predicts objects such as a clock tower and bus with 100% confidence.
Batch predictions
To make batch predictions for Ready-to-use models that accept image data, do the following:
1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For image
data, it should be one of the following: Object detection in images or Text detection in images.
3. On the Run predictions page for your chosen Ready-to-use model, choose Batch prediction.
4. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you are directed through the import data workflow.
5. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.
After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon,
you can choose View prediction results to preview the output data. Then, you can choose Download
prediction and download the results as a CSV or a ZIP file.
Single predictions
To make a single prediction for Ready-to-use models that accept document data, do the following:
1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For document
data, it should be one of the following: Expense analysis, Identity document analysis, or Document
analysis.
3. On the Run predictions page for your chosen Ready-to-use model, choose Single prediction.
4. If your Ready-to-use model is identity document analysis or document analysis, do the following (if
you’re doing expense analysis, skip this step and go to Step 5):
In the right pane Prediction results, you’ll receive an analysis of your document.
The following information describes the results for each type of solution:
• For expense analysis, the results are categorized into Summary fields, which include fields such as the
total on a receipt, and Line item fields, which include fields such as individual items on a receipt. The
identified fields are highlighted on the document image in the output.
• For identity document analysis, the output shows you the fields that the Ready-to-use model
identified, such as first and last name, address, or date of birth. The identified fields are highlighted on
the document image in the output.
• For document analysis, the results are categorized into Raw text, Forms, Tables, and Signatures. Raw
text includes all of the extracted text, while Forms, Tables, and Signatures only include information
on the form that falls into those categories. For example, Tables only includes information extracted
from tables in the document. The identified fields are highlighted on the document image in the
output.
The following screenshot shows the results for a single prediction using the document analysis solution.
Batch predictions
To make batch predictions for Ready-to-use models that accept document data, do the following:
1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For document
data, it should be one of the following: Expense analysis, Identity document analysis, or Document
analysis.
3. On the Run predictions page for your chosen Ready-to-use model, choose Batch prediction.
4. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you are directed through the import data workflow.
5. From the list of available datasets, select your dataset and choose Generate predictions. If your use
case is document analysis, continue to Step 6.
6. (Optional) If your use case is Document analysis, another dialog box called Select features to
include in batch prediction appears. You can select Forms, Tables, and Signatures to group the
results by those features. Then, choose Generate predictions.
After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can choose View prediction results to preview the analysis of your document data.
The following information describes the results for each type of solution:
• For expense analysis, the results are categorized into Summary fields, which include fields such as the
total on a receipt, and Line item fields, which include fields such as individual items on a receipt. The
identified fields are highlighted on the document image in the output.
• For identity document analysis, the output shows you the fields that the Ready-to-use model
identified, such as first and last name, address, or date of birth. The identified fields are highlighted on
the document image in the output.
• For document analysis, the results are categorized into Raw text, Forms, Tables, and Signatures. Raw
text includes all of the extracted text, while Forms, Tables, and Signatures only include information
on the form that falls into those categories. For example, Tables only includes information extracted
from tables in the document. The identified fields are highlighted on the document image in the
output.
After previewing your results, you can choose Download prediction and download the results as a ZIP
file.
You can train a Canvas custom model on tabular or image datasets. The following table shows the types
of custom models that you can build in Canvas, along with their supported data types and data sources.
• Single-label image prediction – Example use case: predicting types of manufacturing defects in images. Supported data types: Image (JPG, PNG). Supported data sources: local upload, Amazon S3.
• Multi-category text prediction – Example use case: predicting categories of products, such as clothing, electronics, or household goods, based on product descriptions. Supported data types: source column of Text, target column of Binary or Categorical. Supported data sources: local upload, Amazon S3.
Get started
To get started with building and generating predictions from a custom model, do the following:
• Determine your use case and type of model that you want to build. For more information about the
custom model types, see Build a custom model (p. 321). For more information about the data types
and sources supported for custom models, see Import data into Canvas (p. 299).
• Import your data into Canvas. You can build a custom model with any tabular or image dataset that
meets the input requirements. For more information about the input requirements, see Create a
dataset (p. 301).
To learn more about SageMaker-provided sample datasets you can experiment with, see Use sample
datasets.
• Build your custom model. You can do a Quick build to get your model and start making predictions
more quickly, or you can do a Standard build for greater accuracy.
For numeric, categorical, and time series forecasting model types, you can clean and prepare your data
with features such as advanced transforms and joins. For image prediction models, you can Edit an
image dataset (p. 328) to update your labels or add and delete images. Note that you can't use these
features for multi-category text prediction models.
• Evaluate your model's performance and determine how well it might perform on real-world data.
• (Optional) For certain model types, you can collaborate with data scientists in Amazon SageMaker
Studio who can help review and improve your model.
• Make single or batch predictions with your model.
Note
If you already have a trained model in Amazon SageMaker Studio that you’d like to share with
Canvas, you can bring your own model to SageMaker Canvas. Review the BYOM prerequisites to
determine whether your model is eligible for sharing.
Each use case for which you can build a custom model accepts different types of input. For example,
if you want to build a single-label image classification model, then you should import image data.
For more information about the different model types and the data they accept, see Build a custom
model (p. 321). You can import data and build custom models in SageMaker Canvas for tabular data
(categorical, numeric, text, and time series) and image data.
You can import data into Canvas from a variety of data sources, such as local file upload, Amazon S3,
Amazon Redshift, Amazon Athena, Snowflake, and SaaS platforms.
For instructions on how to import data and information regarding input data requirements, such as the
maximum file size for images, see Create a dataset (p. 301).
Canvas also provides several sample datasets in your application to help you get started. To learn more
about the SageMaker-provided sample datasets you can experiment with, see Use sample datasets.
After you import a dataset into Canvas, you can update the dataset at any time. You can do a manual
update or you can set up a schedule for automatic dataset updates. For more information, see Update a
dataset (p. 308).
For more information specific to each dataset type, see the following sections:
Tabular
To import data from an external data source (such as a Snowflake database or a SaaS platform), you
must authenticate and connect to the data source in the Canvas application. For more information, see
Connect to data sources (p. 310).
After creating datasets in Canvas, you can join multiple datasets into a single dataset. Joining datasets
is only supported for tabular datasets. As long as your data is arranged into tables, you can join datasets
from various sources, such as Amazon Redshift, Amazon Athena, or Snowflake. For information about
joining datasets, see Join data that you've imported into SageMaker Canvas (p. 317).
Image
For information about how to edit an image dataset and perform tasks such as assigning or reassigning
labels, adding images, or deleting images, see Edit an image dataset (p. 328).
Create a dataset
The following sections describe how to create a dataset in Amazon SageMaker Canvas. For custom
models, you can create datasets for tabular and image data. Choose your workflow based on the
following information:
• For categorical, numeric, text, and time series data, see Import tabular data (p. 303).
• For image data, see Import image data (p. 305).
Note
For information about how to import a document dataset for Ready-to-use models that accept
document data, see the Import document data (p. 292) workflow in the Ready-to-use models
documentation.
A dataset can consist of multiple files. For example, you might have multiple files of inventory data in
CSV format. You can upload these files together as a dataset as long as the schema (or column names
and data types) of the files match.
Canvas also supports managing multiple versions of your dataset. When you create a dataset, the first
version is labeled as V1. You can create a new version of your dataset by updating your dataset. You can
do a manual update, or you can set up an automated schedule for updating your dataset with new data.
For more information, see Update a dataset (p. 308).
When you import your data into Canvas, make sure that it meets the following requirements. The
limitations are specific to the type of model you're building.

Categorical, numeric, and time series data
• Supported file types: CSV (local upload, Amazon S3, or databases)
• Maximum file size: 5 GB (for all files in the dataset)

Text data
• Supported file types: CSV (local upload, Amazon S3, or databases)
• Maximum file size: 5 MB (for all files in the dataset)

Image data
• Supported file types: JPG, PNG
• Maximum file size: 30 MB per image

Document data*
• Supported file types: PDF, JPG, PNG, TIFF
• Maximum file size: 5 MB per document

*Document data is currently only supported for Ready-to-use models (p. 289) that accept document
data. You can't build a custom model with document data.
• For tabular data, CSV files must be comma delimited and not have newline characters except when
denoting a new row.
• For image data, if you have any unlabeled images, you must label them before building your model.
For information about how to assign labels to images within the Canvas application, see Edit an image
dataset (p. 328).
• If you set up automatic dataset updates or automatic batch prediction configurations, you can only
create a total of 20 configurations in your Canvas application. For more information, see Manage
automations (p. 375).
After you import a dataset, you can view your datasets on the Datasets page at any time.
7. (Optional) If you’re connecting to an Amazon Redshift or Snowflake database for the first time, a
dialog box appears to create a connection. Fill out the dialog box with your credentials and choose
Create connection. If you already have a connection, choose your connection.
8. From your data source, select your files to import. For local upload and importing from Amazon
S3, you can select files. For database sources, you can drag-and-drop data tables from the left
navigation pane.
9. (Optional) For tabular data sources that support SQL querying (such as Amazon Redshift, Amazon
Athena, or Snowflake), you can choose Edit in SQL to make SQL queries and join tables before
importing them. For more information, see Join data that you've imported into SageMaker
Canvas (p. 317).
The following screenshot shows the Edit SQL view for an Amazon Athena data source.
10. (Optional) You can choose Preview to preview your dataset before importing. For tabular datasets,
this shows you up to the first 100 rows of your dataset. The following screenshot shows you the
Import preview screen.
11. When you’re ready to import your data, choose Import data.
While your dataset is importing into Canvas, you can see your datasets listed on the Datasets page. From
this page, you can View your dataset details (p. 306).
When the Status of your dataset shows as Ready, Canvas successfully imported your data and you can
proceed with building a model.
If you have a connection to a data source, such as an Amazon Redshift database or a SaaS connector, you
can return to that connection. For Amazon Redshift and Snowflake, you can add another connection by
creating another dataset, returning to the Import data page, and choosing the Data Source tile for that
connection. From the dropdown menu, you can open the previous connection or choose Add connection.
Note
For SaaS platforms, you can only have one connection per data source.
With image datasets, you can build single-label image prediction custom models, which predict a label
for an image. Review the limitations in the preceding Import a dataset section to ensure that your image
dataset meets the requirements for image data.
Note
You can only import image datasets from local file upload or an Amazon S3 bucket. Also, for
image datasets, you must have at least 25 images per label.
8. From your computer or Amazon S3 bucket, select the images or folders of images that you want to
upload.
9. When you’re ready to import your data, choose Import data.
While your dataset is importing into Canvas, you can see your datasets listed on the Datasets page. From
this page, you can View your dataset details (p. 306).
When the Status of your dataset shows as Ready, Canvas successfully imported your data and you can
proceed with building a model.
When you are building your model, you can edit your image dataset, and you can assign or re-assign
labels, add images, or delete images from your dataset. For more information about how to edit your
image dataset, see Edit an image dataset (p. 328).
For each of your datasets, you can view all of the files in a dataset, the dataset’s version history, and any
auto update configurations for the dataset. From the Datasets page, you can also initiate actions such as
Update a dataset (p. 308) or Build a custom model (p. 321).
On the Data tab, you can see a preview of your data. If you choose Dataset details, you can see all of
the files that are part of your dataset. Choose a file to see only the data from that file in the preview. For
image datasets, the preview only shows you the first 100 images of your dataset.
On the Version history tab, you can see a list of all of the versions of your dataset. A new version
is made whenever you update a dataset. To learn more about updating a dataset, see Update a
dataset (p. 308). The following screenshot shows the Version history tab in the Canvas application.
On the Auto updates tab, you can enable auto updates for the dataset and set up a configuration to
update your dataset on a regular schedule. To learn more about setting up auto updates for a dataset,
see Configure automatic updates for a dataset (p. 309). The following screenshot shows the Auto
updates tab with auto updates turned on and a list of auto update jobs that have been performed on the
dataset.
Update a dataset
After importing your initial dataset into Amazon SageMaker Canvas, you might have additional data that
you want to add to your dataset. For example, you might get inventory data at the end of every week
that you want to add to your dataset. Instead of importing your data multiple times, you can update
your existing dataset and add or remove files from it.
Note
You can only update datasets that you have imported through local upload or Amazon S3.
You can update your dataset either manually or automatically. With automatic updates, you specify
a location that Canvas checks for new files at a frequency you choose. If you import new files during the
update, the schema of the files must match the existing dataset exactly.
Every time you update your dataset, Canvas creates a new version of your dataset. You can only use
the latest version of your dataset to build a model or generate predictions. For more information about
viewing the version history of your dataset, see View your dataset details (p. 306).
You can also use dataset updates with automated batch predictions, which starts a batch prediction job
whenever you update your dataset. For more information, see Make batch predictions (p. 360).
The following sections describe how to do manual and automatic updates to your dataset.
On the Datasets page, you can choose the Version history tab to see all of the versions of your dataset
and the history of both manual and automatic updates you’ve made.
With an automatic update, you set up a configuration for Canvas to update your dataset at a given frequency. We recommend this option if you regularly receive new files of data that you want to add to your dataset.
When you set up the auto update configuration, you specify an Amazon S3 location where you upload
your files and a frequency at which Canvas checks the location and imports files. Each instance of Canvas
updating your dataset is referred to as a job. For each job, Canvas imports all of the files in the Amazon
S3 location. If you have new files with the same names as existing files in your dataset, Canvas overwrites
the old files with the new files.
For automatic dataset updates, Canvas doesn't perform schema validation. If the schema of the files imported during an automatic update doesn't match the schema of the existing files, or if the files exceed the size limitations (see Import a dataset for a table of file size limitations), then you get errors when your jobs run.
Note
You can set up a maximum of 20 automatic update configurations in your Canvas application. Additionally, Canvas only performs automatic updates while you're logged in to your Canvas application. If you log out, automatic updates pause until you log back in.
9. When you’re ready to create the auto update configuration, choose Save.
Canvas begins the first job of your auto update cadence at the specified starting time.
For more information about viewing your auto update job history or making changes to your
auto update configuration through the Automations page in the Canvas application, see Manage
automations (p. 375).
The following sections describe how to view, update, and delete your automatic update configuration
through the Datasets page in the Canvas application.
To view the job history for your automatic dataset updates, on your dataset details page, choose the
Auto updates tab.
Each automatic update to a dataset shows as a job in the Auto updates tab under the Job history
section. For each job, you can see the following:
• Job created – The timestamp for when Canvas started updating the dataset.
• Files – The number of files in the dataset.
• Cells (Columns x Rows) – The number of columns and rows in the dataset.
• Status – The status of the dataset after the update. If the job was successful, the status is Ready. If the
job failed for any reason, the status is Failed, and you can hover over the status for more details.
You might want to make changes to your auto update configuration for a dataset, such as changing the
frequency of the updates. You might also want to turn off your automatic update configuration to pause
the updates to your dataset.
To make changes to your auto update configuration for a dataset, go to the Auto updates tab of your
dataset and choose Edit to make changes to the configuration.
To pause your dataset updates, turn off your automatic configuration. You can turn off auto updates by
going to the Auto updates tab of your dataset and turning the Enable auto updates toggle off. You can
turn this toggle back on at any time to resume the update schedule.
To learn how to delete your configuration, see Delete an automatic configuration (p. 377).
When you go through the Import workflow to import data in the Canvas application, you can choose
your data source and then select the data that you want to import. For certain data sources, like
Snowflake and Amazon Redshift, you must specify your credentials and add a connection to the data
source.
The following screenshot shows the data sources toolbar in the Import workflow, with all of the
available data sources highlighted. You can only import data from the data sources that are available to
you. Contact your administrator if your desired data source isn’t available.
The following sections provide information about importing data from AWS services (like Amazon
Redshift) and from SaaS platforms (such as Snowflake or Facebook Ads). Review the following section
first to determine what permissions you need to import data from your data source.
Permissions
Review the following information to ensure that you have the necessary permissions to import data from
your data source:
• Amazon S3: You can import data from any Amazon S3 bucket as long as your user has permissions to
access the bucket. For more information about using AWS IAM to control access to Amazon S3 buckets,
see Identity and access management in Amazon S3 in the Amazon S3 User Guide.
• Amazon Athena: If you have the AmazonSageMakerFullAccess policy and the
AmazonSageMakerCanvasFullAccess policy attached to your user’s execution role, then you’ll
be able to query your AWS Glue Data Catalog with Amazon Athena. If you’re part of an Athena
workgroup, make sure that the Canvas user has permissions to run Athena queries on the data. For
more information, see Using workgroups for running queries in the Amazon Athena User Guide.
• Amazon Redshift: To give yourself the necessary permissions to import data from Amazon Redshift,
see Grant Users Permissions to Import Amazon Redshift Data.
• SaaS platforms: If you have the AmazonSageMakerFullAccess policy and the
AmazonSageMakerCanvasFullAccess policy attached to your user’s execution role, then you’ll have
the necessary permissions to import data from SaaS platforms. See Use SaaS connectors with
Canvas (p. 316) for more information about connecting to a specific SaaS connector.
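For the Amazon S3 case, it can be useful to confirm access before you import. The following is a minimal boto3 sketch; the bucket name is hypothetical, so substitute your own.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Check whether the current credentials can reach the bucket
# ("example-canvas-import-bucket" is a hypothetical name).
try:
    s3.head_bucket(Bucket="example-canvas-import-bucket")
    print("Bucket is accessible")
except ClientError as err:
    print("Access check failed:", err.response["Error"]["Code"])
```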
You can create multiple connections to Amazon Redshift. For Amazon Athena, you can access any databases in your AWS Glue Data Catalog, as long as you have permissions through your Amazon Athena workgroup. For Amazon S3, you can import data from a bucket as long as you have the necessary permissions.
To import data from an Amazon S3 bucket, or to run queries and import data tables with Amazon
Athena, see Create a dataset (p. 301). You can only import tabular data from Amazon Athena, and you
can import tabular and image data from Amazon S3.
You can import data from Amazon Redshift, a data warehouse where your organization keeps its
data. Before you can import data from Amazon Redshift, the AWS IAM role you use must have the
AmazonRedshiftFullAccess managed policy attached. For instructions on how to attach this policy,
see Grant Users Permissions to Import Amazon Redshift Data (p. 281).
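If you manage permissions programmatically, the following is a minimal boto3 sketch of attaching the managed policy; the role name is hypothetical, so substitute the AWS IAM role that your Canvas user uses.

```python
import boto3

iam = boto3.client("iam")

# Attach the AmazonRedshiftFullAccess managed policy to the role
# ("CanvasUserRole-example" is a hypothetical role name).
iam.attach_role_policy(
    RoleName="CanvasUserRole-example",
    PolicyArn="arn:aws:iam::aws:policy/AmazonRedshiftFullAccess",
)
```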
You can use the Amazon Redshift editor to drag datasets onto the import pane and import them into
SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:
• SQL queries
• Joins
SQL queries give you the ability to customize how you import the values in the dataset. For example, you
can specify the columns returned in the dataset or the range of values for a column.
You can use joins to combine multiple datasets from Amazon Redshift into a single dataset. You can drag
your datasets from Amazon Redshift into the panel that gives you the ability to join the datasets.
You can use the SQL editor to edit the dataset that you've joined and convert the joined dataset into a
single node. You can join another dataset to the node. You can import the data that you've selected into
SageMaker Canvas.
• Warehouse
• Database
• Schema
9. Choose Import data.
The following image shows an example of fields specified for an Amazon Redshift connection.
The following image shows the page used to join datasets in Amazon Redshift.
The following image shows an SQL query being used to edit a join in Amazon Redshift.
Note
You can only import tabular data, such as data tables, from SaaS platforms.
Snowflake is a data storage and analytics service, and you can import your data from Snowflake into
SageMaker Canvas. For more information about Snowflake, see the Snowflake documentation.
You can import data from your Snowflake account in the following ways.
You can use the Snowflake editor to drag datasets onto the import pane and import them into
SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:
• SQL queries
• Joins
SQL queries give you the ability to customize how you import the values in the dataset. For example, you
can specify the columns returned in the dataset or the range of values for a column.
You can join multiple Snowflake datasets into a single dataset before you import into Canvas, using either SQL or the Canvas interface. You can drag your datasets from Snowflake into the panel that gives you the ability to join them, or you can edit the joins in SQL and convert the SQL into a single node. You can join other nodes to the converted node, combine the joined datasets into a single node, and join that node to a different Snowflake dataset. Finally, you can import the data that you've selected into Canvas.
Use the following procedure to import data from Snowflake to Amazon SageMaker Canvas.
• Warehouse
• Database
• Schema
The following image shows an example of fields specified for a Snowflake connection.
The following image shows the page used to add context to a connection.
The following image shows the page used to join datasets in Snowflake.
The following image shows a SQL query being used to edit a join in Snowflake.
Before you can import data from a SaaS platform, your administrator must authenticate and create a
connection to the data source. For more information about how administrators can create a connection
with a SaaS platform, see Managing Amazon AppFlow connections in the Amazon AppFlow User Guide.
If you’re an administrator getting started with Amazon AppFlow for the first time, see Getting started in
the Amazon AppFlow User Guide.
To import data from a SaaS platform, you can follow the standard Import tabular data (p. 303)
procedure, which shows you how to import tabular datasets into Canvas.
You can use Amazon SageMaker Canvas to join multiple datasets into a single dataset. A join combines two or more datasets. By default, SageMaker Canvas automatically joins datasets on their matching column names. Combining multiple datasets can give you more insight from the models that you build.
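Canvas performs the join for you in the application. Purely as an illustration of the default behavior, the following pandas sketch joins two hypothetical CSV files on their matching column names.

```python
import pandas as pd

# Hypothetical input files with at least one column name in common.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Join on all column names the two datasets share, similar to the
# automatic join that Canvas applies by default.
common_columns = [col for col in orders.columns if col in customers.columns]
joined = orders.merge(customers, on=common_columns)
print(joined.head())
```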
Sample datasets
The following datasets are the samples that SageMaker Canvas provides by default. These datasets
cover use cases such as predicting house prices, loan defaults, and readmission for diabetic patients;
forecasting sales; predicting machine failures to streamline predictive maintenance in manufacturing
units; and generating supply chain predictions for transportation and logistics. The datasets are stored in
the sample_dataset folder in the default Amazon S3 bucket that SageMaker creates for your account
in a Region.
If you no longer wish to use the sample datasets, you can delete them from the Datasets page of your
SageMaker Canvas application. However, these datasets are still stored in the default SageMaker-created
Amazon S3 bucket for your account, so you can always access them later.
The default Amazon S3 bucket name where the datasets are stored follows the pattern
sagemaker-{region}-{account ID}. You can find the sample datasets in the directory path
Canvas/sample_dataset.
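As an illustration, the following boto3 sketch lists the sample dataset files in that default bucket; it assumes your credentials resolve to the same account and Region that Canvas uses.

```python
import boto3

session = boto3.session.Session()
region = session.region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]

s3 = boto3.client("s3")

# List objects under Canvas/sample_dataset in the default bucket.
response = s3.list_objects_v2(
    Bucket=f"sagemaker-{region}-{account_id}",
    Prefix="Canvas/sample_dataset/",
)
for obj in response.get("Contents", []):
    print(obj["Key"])
```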
If you delete a sample dataset from your SageMaker Canvas application and want to access the sample
dataset again, use the following procedure.
When you begin building a model, Canvas automatically recommends one or more model types. Model
types fall into one of the following categories:
• Numeric prediction – This is known as regression in machine learning. Use the numeric prediction
model type when you want to make predictions for numeric data. For example, you might want to
predict the price of houses based on features such as the house’s square footage.
• Categorical prediction – This is known as classification in machine learning. When you want to
categorize data into groups, use the categorical prediction model types:
• 2 category prediction – Use the 2 category prediction model type (also known as binary
classification in machine learning) when you have two categories that you want to predict for your
data. For example, you might want to determine whether a customer is likely to churn.
• 3+ category prediction – Use the 3+ category prediction model type (also known as multi-class
classification in machine learning) when you have three or more categories that you want to predict
for your data. For example, you might want to predict a customer's loan status based on features
such as previous payments.
• Time series forecasting – Use time series forecasts when you want to make predictions over a period
of time. For example, you might want to predict the number of items you’ll sell in the next quarter. For
information about time series forecasts, see Time Series Forecasts in Amazon SageMaker Canvas.
• Image prediction – Use the single-label image prediction model type (also known as single-label
image classification in machine learning) when you want to assign labels to images. For example, you
might want to classify different types of manufacturing defects in images of your product.
• Text prediction – Use the multi-category text prediction model type (also known as multi-class text
classification in machine learning) when you want to assign labels to passages of text. For example,
you might have a dataset of customer reviews for a product, and you want to determine whether
customers liked or disliked the product. You might have your model predict whether a given passage
of text is Positive, Negative, or Neutral.
For a table of the supported input data types for each model type, see Use custom models (p. 297).
For each tabular data model that you build (which includes numeric, categorical, time series forecasting,
and text prediction models), you choose the Target column. The Target column is the column that
contains the information that you want to predict. For example, if you're building a model to predict
whether people have cancelled their subscriptions, the Target column contains data points that are
either a yes or a no about someone's cancellation status.
For image prediction models, you build the model with a dataset of images that have been assigned
labels. For the unlabeled images that you provide, the model predicts a label. For example, if you’re
building a model to predict whether an image is a cat or a dog, you provide images labeled as cats or
dogs when building the model. Then, the model can accept unlabeled images and predict them as either
cats or dogs.
To build your model, you can choose either a Quick build or a Standard build. The Quick build has a shorter build time, but the Standard build generally has a higher accuracy. The following list outlines the average build times for each model and build type.
• Quick build time: 2-20 minutes for numeric, categorical, and time series forecasting models; 15-30 minutes for image and text prediction models.
• Standard build time: 2-4 hours for numeric, categorical, and time series forecasting models; 2-5 hours for image and text prediction models.
If you log out while a Quick build is running, your build might be interrupted until you log in again. When you log back in, Canvas resumes the Quick build.
Canvas predicts values by using the information in the rest of the dataset, depending on the model type:
• For categorical prediction, Canvas puts each row into one of the categories listed in the Target
column.
• For numeric prediction, Canvas uses the information in the dataset to predict the numeric values in the
Target column.
• For time series forecasting, Canvas uses historical data to predict values for the Target column in the
future.
• For image prediction, Canvas uses images that have been assigned labels to predict labels for
unlabeled images.
• For text prediction, Canvas analyzes text data that has been assigned labels to predict labels for
passages of unlabeled text.
Before building your model, you can filter your data or prepare it using advanced transforms. For
more information about preparing your data for model building, see Prepare data with advanced
transformations (p. 338).
You can also use visualization and analytics to explore your data and determine which features are best
to include in your model. For more information, see Explore and analyze your data.
To learn more about additional features such as previewing your model, validating your dataset, and
changing the size of the random sample used to build your model, see Preview your model (p. 326).
For tabular datasets with multiple columns (such as datasets for building categorical, numeric, or time series forecasting model types), you might have rows with missing data points. While Canvas builds the model, it automatically imputes missing values by using the values in your dataset to perform a mathematical approximation of the missing values. For the highest model accuracy, we recommend filling in the missing data if you can find it. Note that missing value imputation is not supported for text prediction or image prediction models.
Get started
To get started with building a custom model, see Build a model (p. 323) and follow the procedure for
the type of model that you want to build.
Build a model
The following sections show you how to build a model for each of the main types of custom models.
• To build numeric prediction, 2 category prediction, or 3+ category prediction models, see Build a
custom numeric or categorical prediction model (p. 323).
• To build single-label image prediction models, see Build a custom image prediction model (p. 324).
• To build multi-category text prediction models, see Build a custom text prediction model (p. 325).
• To get started with time-series forecasting models, see Time Series Forecasts in Amazon SageMaker
Canvas.
Note
If you encounter an error during post-building analysis that tells you to increase your quota for
ml.m5.2xlarge instances, see Request a Quota Increase.
Numeric and categorical prediction models support both Quick builds and Standard builds.
10. (Optional) Use the visualization and analytics tools that Canvas provides to visualize your data and
determine which features you might want to include in your model. For more information, see
Explore and analyze your data.
11. (Optional) Use data transformations to clean, transform, and prepare your data for model building.
For more information, see Prepare your data with advanced transformations. You can view and
remove your transforms by choosing Model recipe to open the Model recipe side panel.
12. (Optional) For additional features such as previewing the accuracy of your model, validating your
dataset, and changing the size of the random sample that Canvas takes from your dataset, see
Preview your model (p. 326).
13. After reviewing your data and making any changes to your dataset, choose Quick build or Standard
build to begin a build for your model. The following screenshot shows the Build page and the Quick
build and Standard build options.
After your model begins building, you can leave the page. When the model shows as Ready on the My
models page, it’s ready for analysis and predictions.
Single-label image prediction models support both Quick builds and Standard builds.
You can also perform tasks when you Edit an image dataset (p. 328), such as renaming labels and adding images to the dataset.
9. After reviewing your data and making any changes to your dataset, choose Quick build or Standard
build to begin a build for your model. The following screenshot shows the Build page of an image
prediction model that is ready to be built.
After your model begins building, you can leave the page. When the model shows as Ready on the My
models page, it’s ready for analysis and predictions.
Multi-category text prediction models support both Quick builds and Standard builds.
After your model begins building, you can leave the page. When the model shows as Ready on the My
models page, it’s ready for analysis and predictions.
SageMaker Canvas provides you with tools to preview your model and validate your data before you begin building. These features include previewing the accuracy of your model, validating your dataset to prevent issues while building the model, and changing the size of the random sample used for your model.
Preview a model
With Amazon SageMaker Canvas, you can get insights from your data before you build a model by
choosing Preview model. For example, you can see how the data in each column is distributed. For
models built using categorical data, you can also choose Preview model to generate an Estimated
accuracy prediction of how well the model might analyze your data. The accuracy of a Quick build or a
Standard build represents how well the model can perform on real data and is generally higher than the
Estimated accuracy.
Amazon SageMaker Canvas automatically handles missing values in your dataset while it builds the
model. It infers the missing values by using adjacent values that are present in the dataset.
Validate data
Before you build your model, SageMaker Canvas checks your dataset for issues that will cause your build
to fail. If SageMaker Canvas finds any issues, then it warns you on the Build page before you attempt to
build a model.
You can choose Validate data to see a list of the issues with your dataset. You can then use the
SageMaker Canvas data preparation features (p. 338), or your own tools, to fix your dataset before
starting a build. If you don’t fix the issues with your dataset, then your build fails.
If you make changes to your dataset to fix the issues, you have the option to re-validate your dataset
before attempting a build. We recommend that you re-validate your dataset before building.
The following list shows the issues that SageMaker Canvas checks for in your dataset and how to resolve them.
• Wrong model type for your data: Try another model type or use a different dataset.
• Missing values in your target column: Replace the missing values, drop rows with missing values, or use a different dataset.
• Too many unique labels in your target column: Verify that you've used the correct column for your target column, or use a different dataset.
• Too many non-numeric values in your target column: Choose a different target column, select another model type, or use a different dataset.
• One or more column names contain double underscores: Rename the columns to remove any double underscores, and try again.
• None of the rows in your dataset are complete: Replace the missing values, or use a different dataset.
• Too many unique labels for the number of rows in your data: Check that you're using the right target column, increase the number of rows in your dataset, consolidate similar labels, or use a different dataset.
Random sample
SageMaker Canvas uses the random sampling method to sample your dataset. The random sample
method means that each row has an equal chance of being picked for the sample. You can choose a
column in the preview to get summary statistics for the random sample, such as the mean and the mode.
By default, SageMaker Canvas uses a random sample size of 20,000 rows from your dataset for datasets
with more than 20,000 rows. For datasets smaller than 20,000 rows, the default sample size is the
number of rows in your dataset. You can increase or decrease the sample size by choosing Random
sample in the Build tab of the SageMaker Canvas application. You can use the slider to select your
desired sample size, and then choose Update to change the sample size. The maximum sample size you
can choose for a dataset is 40,000 rows, and the minimum sample size is 500 rows. If you choose a large
sample size, the dataset preview and summary statistics might take a few moments to reload.
The Build page shows a preview of 100 rows from your dataset. If the sample size is the same size as
your dataset, then the preview uses the first 100 rows of your dataset. Otherwise, the preview uses the
first 100 rows of the random sample.
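Canvas samples your dataset for you in the application. The following pandas sketch only illustrates the concept, with a hypothetical file name and column.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical dataset

# Default sample size for datasets larger than 20,000 rows; each row
# has an equal chance of being picked.
sample = df.sample(n=min(len(df), 20_000), random_state=0)

# The Build page previews the first 100 rows of the sample, and you
# can compute summary statistics such as the mean for a column.
preview = sample.head(100)
print(sample["price"].mean())  # "price" is a hypothetical column
```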
To begin editing your image dataset, go to the Build tab while building your single-label image prediction model.
A new page opens that shows the images in your dataset along with their labels. This page categorizes
your image dataset into Total images, Labeled images, and Unlabeled images. You can also review the
Dataset preparation guide for best practices on building a more accurate image prediction model.
The following screenshot shows the page for editing your image dataset.
To view an individual image, you can search for it by file name in the search bar. Then, choose the image to open the full view. You can view the image properties and reassign the image's label. Choose Save when you're done viewing the image.
Canvas lists the labels for your dataset in the left navigation pane. You can add new labels to the dataset
by entering a label in the Add label text field.
To rename or delete a label from your dataset, choose the More options icon next to the label and
select either Rename or Delete. If you rename the label, you can enter the new label name and choose
Confirm. If you delete the label, the label is removed from all images in your dataset that have that
label. Any images with that label will be unlabeled.
To view the unlabeled images in your dataset, choose Unlabeled in the left navigation pane. To label an image, select it, open the dropdown titled Unlabeled, and choose a label to assign to the image. You can also select multiple images at once, and all selected images are assigned the label you choose.
You can reassign labels to images by selecting one or more images and opening the dropdown titled with the current label. Select your desired label, and the selected images are updated with the new label.
You can view all the images for a given label by choosing the label in the left navigation pane.
You can add more images to your dataset by choosing Add images in the top navigation pane. You’ll be
taken through the workflow to import more images. The images you import are added to your existing
dataset.
You can delete images from your dataset by selecting them and then choosing Delete in the top
navigation pane.
Note
After making any changes to your dataset, choose Save dataset to make sure that you don’t lose
your changes.
In Amazon SageMaker Canvas, you can explore the variables in your dataset using in-application visualizations and analytics. You can use these explorations to uncover relationships between your variables before building your model.
For more information about visualization techniques in Canvas, see Explore your data using visualization
techniques (p. 329).
For more information about analytics in Canvas, see Explore your data using analytics (p. 336).
With Amazon SageMaker Canvas, you can explore and visualize your data to gain advanced insights into
your data before building your ML models. You can visualize using scatter plots, bar charts, and box
plots, which can help you understand your data and discover the relationships between features that
could affect the model accuracy.
In the Build tab of the SageMaker Canvas application, choose Data visualizer to begin creating your
visualizations.
You can change the visualization sample size to adjust the size of the random sample taken from your
dataset. A sample size that is too large might affect the performance of your data visualizations, so we
recommend that you choose an appropriate sample size. To change the sample size, use the following
procedure.
Note
Certain visualization techniques require columns of a specific data type. For example, you can
only use numeric columns for the x and y-axes of scatter plots.
Scatter plot
To create a scatter plot with your dataset, choose Scatter plot in the Visualization panel. Then, choose the features that you want to plot on the x and y-axes from the Columns section. You can drag and drop columns onto the axes, or you can choose a column for each axis from the list of supported columns.
You can use Color by to color the data points on the plot with a third feature. You can also use Group by
to group the data into separate plots based on a fourth feature.
The following image shows a scatter plot that uses Color by and Group by. In this example, each data
point is colored by the MaritalStatus feature, and grouping by the Department feature results in a
scatter plot for the data points of each department.
Bar chart
To create a bar chart with your dataset, choose Bar chart in the Visualization panel. Then, choose the features that you want to plot on the x and y-axes from the Columns section. You can drag and drop columns onto the axes, or you can choose a column for each axis from the list of supported columns.
You can use Group by to group the bar chart by a third feature. You can use Stack by to vertically shade
each bar based on the unique values of a fourth feature.
The following image shows a bar chart that uses Group by and Stack by. In this example, the bar chart
is grouped by the MaritalStatus feature and stacked by the JobLevel feature. For each JobRole on
the x axis, there is a separate bar for the unique categories in the MaritalStatus feature, and every bar
is vertically stacked by the JobLevel feature.
Box plot
To create a box plot with your dataset, choose Box plot in the Visualization panel. Then, choose the features that you want to plot on the x and y-axes from the Columns section. You can drag and drop columns onto the axes, or you can choose a column for each axis from the list of supported columns.
You can use Group by to group the box plots by a third feature.
The following image shows a box plot that uses Group by. In this example, the x and y-axes show
JobLevel and JobSatisfaction, respectively, and the colored box plots are grouped by the
Department feature.
With analytics in Amazon SageMaker Canvas, you can explore your dataset and gain insight into all of
your variables before building a model. You can determine the relationships between features in your
dataset using correlation matrices. You can use this technique to summarize your dataset into a matrix
that shows the correlations between two or more values. This helps you identify and visualize patterns in
a given dataset for advanced data analysis.
The matrix shows the correlation between each feature as positive, negative, or neutral. You might want
to include features that have a high correlation with each other when building your model. Features that
have little to no correlation might be irrelevant to your model, and you can drop those features when
building your model.
To get started with correlation matrices in SageMaker Canvas, see the following section.
You can create a correlation matrix when you are preparing to build a model in the Build tab of the
SageMaker Canvas application.
For instructions on how to begin creating a model, see Build a model (p. 323).
After you’ve started preparing a model in the SageMaker Canvas application, do the following:
You should see a visualization similar to the following screenshot, which shows up to 15 columns of the
dataset organized into a correlation matrix.
After you’ve created the correlation matrix, you can customize it by doing the following:
For Columns, you can select the columns that you want to include in the matrix. You can compare up to
15 columns from your dataset.
Note
You can use numeric, categorical, or binary column types for a correlation matrix. The
correlation matrix doesn’t support datetime or text data column types.
To add or remove columns from the correlation matrix, select and deselect columns from the Columns
panel. You can also drag and drop columns from the panel directly onto the matrix. If your dataset has a
lot of columns, you can search for the columns you want in the Search columns bar.
To filter the columns by data type, choose the dropdown menu and select All, Numeric, or Categorical.
Selecting All shows you all of the columns from your dataset, whereas the Numeric and Categorical
filters only show you the numeric or categorical columns in your dataset. Note that binary column types
are included in the numeric or categorical filters.
For the best data insights, include your target column in the correlation matrix. When you include your
target column in the correlation matrix, it appears as the last feature on the matrix with a target symbol.
To change the correlation type, use the Columns filter mentioned in the preceding section to filter
for your desired column type and columns. You should see the Correlation type in the side panel.
For numeric comparisons, you have the option to select either Pearson or Spearman. For categorical
comparisons, the correlation type is set as MI. For categorical and mixed comparisons, the correlation
type is set as Spearman & MI.
For matrices that only compare numeric columns, the correlation type is either Pearson or Spearman.
The Pearson measure evaluates the linear relationship between two continuous variables. The Spearman
measure evaluates the monotonic relationship between two variables. For both Pearson and Spearman,
the scale of correlation ranges from -1 to 1, with either end of the scale indicating a perfect correlation
(a direct 1:1 relationship) and 0 indicating no correlation. You might want to select Pearson if your data
has more linear relationships (as revealed by a scatter plot visualization). If your data is not linear, or
contains a mixture of linear and monotonic relationships, then you might want to select Spearman.
For matrices that only compare categorical columns, the correlation type is set to Mutual Information
Classification (MI). The MI value is a measure of the mutual dependence between two random variables.
The MI measure is on a scale of 0 to 1, with 0 indicating no correlation and 1 indicating a perfect
correlation.
For matrices that compare a mix of numeric and categorical columns, the correlation type Spearman &
MI is a combination of the Spearman and MI correlation types. For correlations between two numeric
columns, the matrix shows the Spearman value. For correlations between a numeric and categorical
column or two categorical columns, the matrix shows the MI value.
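If you want to reproduce similar measures outside of Canvas, the following sketch uses scipy and scikit-learn; Canvas's exact MI computation isn't documented here, so the normalized score below is only an approximation of the same idea. The data is made up.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import normalized_mutual_info_score

df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],  # numeric
    "weight": [52, 61, 70, 95, 98],       # numeric
    "size": ["S", "S", "M", "L", "L"],    # categorical
    "fit": ["slim", "regular", "regular", "loose", "loose"],
})

# Pearson (linear) and Spearman (monotonic), both on a -1 to 1 scale.
print(pearsonr(df["height"], df["weight"]))
print(spearmanr(df["height"], df["weight"]))

# A normalized mutual information score for two categorical columns,
# on a 0 to 1 scale.
print(normalized_mutual_info_score(df["size"], df["fit"]))
```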
Lastly, remember that correlation does not necessarily indicate causation. A strong correlation value only
indicates that there is a relationship between two variables, but the variables might not have a causal
relationship. Carefully review your columns of interest to avoid bias when building your model.
For Spearman and Pearson comparisons, you can set the Filter correlations range anywhere from -1 to
1, with 0 meaning that there is no correlation. -1 and 1 mean that the variables have a strong negative or
positive correlation, respectively.
For MI comparisons, the correlation range only goes from 0 to 1, with 0 meaning that there is no
correlation and 1 meaning that the variables have a strong correlation, either positive or negative.
Each feature has a perfect correlation (1) with itself. Therefore, you might notice that the top row of the correlation matrix is always 1. If you want to exclude these values, you can use the filter to set the Max to a value less than 1.
Keep in mind that if your matrix compares a mix of numeric and categorical columns and uses the
Spearman & MI correlation type, then the categorical x numeric and categorical x categorical correlations
(which use the MI measure) are on a scale of 0 to 1, whereas the numeric x numeric correlations (which
use the Spearman measure) are on a scale of -1 to 1. Review your correlations of interest carefully to
ensure that you know the correlation type being used to calculate each value.
In the side panel, you can use Visualize by to change the visualization method of the matrix. The Numeric visualization method shows the correlation value (Pearson, Spearman, or MI), whereas the Size visualization method represents each correlation with a differently sized and colored dot. If you choose Size, you can hover over a specific dot on the matrix to see the actual correlation value.
In the side panel, you can use Color selection to change the color palette used for the scale of negative
to positive correlation in the matrix. Select one of the alternative color palettes to change the colors
used in the matrix.
Your machine learning dataset might require data preparation before you build your model. You
might want to clean your data due to various issues, which might include missing values or outliers,
and perform feature engineering to improve the accuracy of your model. Amazon SageMaker Canvas
provides ML data transforms with which you can clean, transform, and prepare your data for model
building. You can use these transforms on your datasets without any code. SageMaker Canvas adds the
transforms you use to the Model recipe, which is a record of the data preparation done on your data
before building the model. Any data transforms you use only modify the input data for model building
and do not modify your original data source.
The following transforms are available in SageMaker Canvas for you to prepare your data for building.
Note
The preview of your dataset shows the first 100 rows of the dataset. If your dataset has more
than 20,000 rows, Canvas takes a random sample of 20,000 rows and previews the first 100
rows from that sample. You can only search for and specify values from the previewed rows, and
the filter functionality only filters the previewed rows and not the entire dataset.
You can use mathematical functions and operators to explore and transform your data. You can use the SageMaker Canvas supported functions, or create your own formula from your existing data and save the result of the formula to a new column. For example, you can add the corresponding values of two columns and save the result to a new column.
You can nest statements to create more complex functions. The following are some examples of nested
functions that you might use.
• To calculate BMI, you could use the function weight / (height ^ 2).
• To classify ages, you could use the function Case(age < 18, 'child', age < 65, 'adult',
'senior').
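Outside of Canvas, the same two examples could be expressed with pandas and NumPy; this rough sketch assumes columns named weight, height, and age.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "weight": [70, 85, 60],        # kilograms
    "height": [1.75, 1.80, 1.60],  # meters
    "age": [12, 35, 70],
})

# Equivalent of weight / (height ^ 2)
df["bmi"] = df["weight"] / (df["height"] ** 2)

# Equivalent of Case(age < 18, 'child', age < 65, 'adult', 'senior')
df["age_group"] = np.select(
    [df["age"] < 18, df["age"] < 65],
    ["child", "adult"],
    default="senior",
)
print(df)
```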
You can specify functions in the data preparation stage before you build your model. To use a function,
do the following.
1. In the Build tab of the SageMaker Canvas app, choose Functions to open the Functions panel.
2. In the Functions panel, choose a Formula to add to your Model Recipe. Each formula is applied to all of the values in the columns you specify. For formulas that accept two or more columns as arguments, use columns with matching data types; otherwise, you get an error or null values in the new column.
3. After you've specified a Formula, add a column name in the New Column Name field. SageMaker Canvas uses this name for the new column that is created.
4. To add the function to your Model Recipe, choose Add.
SageMaker Canvas saves the result of your function to a new column using the name you specified in
New Column Name. You can view or remove functions from the Model Recipe panel.
SageMaker Canvas supports the following operators for functions. You can use either the text format or
the in-line format to specify your function.
• Add (numeric): Returns the sum of the values. Text format: Add(sales1, sales2). In-line format: sales1 + sales2
SageMaker Canvas also supports aggregate operators, which can perform operations such as calculating
the sum of all the values or finding the minimum value in a column. You can use aggregate operators
in combination with standard operators in your functions. For example, to calculate the difference of
values from the mean, you could use the function Abs(height – avg(height)). SageMaker Canvas
supports the following aggregate operators.
• approx_count_distinct: Returns the approximate number of distinct items in a column. Example: approx_count_distinct(c1)
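As a point of comparison, the deviation-from-mean example above maps naturally onto pandas; this sketch assumes a numeric height column, and pandas counts distinct values exactly rather than approximately.

```python
import pandas as pd

df = pd.DataFrame({"height": [150, 160, 170, 180, 190]})

# Equivalent of Abs(height - avg(height)): the absolute deviation
# of each value from the column mean.
df["height_deviation"] = (df["height"] - df["height"].mean()).abs()

# Analogous to approx_count_distinct(height), computed exactly here.
print(df["height"].nunique())
```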
Datetime extraction
With the datetime extraction transform, you can extract values from a datetime column to a separate
column. For example, if you have a column containing dates of purchases, you can extract the month
value to a separate column and use the new column when building your model. You can also extract
multiple values to separate columns with a single transform.
Your datetime column must use a supported timestamp format. For a list of the formats that SageMaker
Canvas supports, see Time Series Forecasts in Amazon SageMaker Canvas (p. 367). If your dataset does
not use one of the supported formats, update your dataset to use a supported timestamp format and re-
import it to Amazon SageMaker Canvas before building your model.
SageMaker Canvas creates a new column in the dataset for each of the values you extract. Except for
Year values, SageMaker Canvas uses a 0-based encoding for the extracted values. For example, if you
extract the Month value, January is extracted as 0, and February is extracted as 1.
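The following pandas sketch mirrors this behavior for a hypothetical purchase_date column, including the 0-based month encoding.

```python
import pandas as pd

df = pd.DataFrame({"purchase_date": ["2023-01-15", "2023-02-20", "2023-12-01"]})
dates = pd.to_datetime(df["purchase_date"])

df["year"] = dates.dt.year        # Year values are not 0-based
df["month"] = dates.dt.month - 1  # January = 0, February = 1, ...
print(df)
```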
You can see the transform listed in the Model recipe section. If you remove the transform from the
Model recipe section, the new columns are removed from the dataset.
Drop columns
You can exclude a column from your model build by dropping it in the Build tab of the SageMaker
Canvas application. Deselect the column you want to drop, and it isn't included when building the model.
Note
If you drop columns and then make batch predictions (p. 358) with your model, SageMaker
Canvas adds the dropped columns back to the .csv file available for you to download. However,
SageMaker Canvas does not add the dropped columns back for time series models.
Rename columns
With the rename columns transform, you can rename columns in your data. When you rename a column,
SageMaker Canvas changes the column name in the model input.
You can rename a column in your dataset by double-clicking on the column name in the Build tab of the
SageMaker Canvas application and entering a new name. Pressing the Enter key submits the change, and
clicking anywhere outside the input cancels the change. You can also rename a column by clicking the More options icon, located at the end of the row in list view or at the end of the header cell in grid view, and choosing Rename.
Your column name can’t be longer than 32 characters or have double underscores (__), and you can’t
rename a column to the same name as another column. You also can’t rename a dropped column.
The following screenshot shows how to rename a column by double-clicking the column name.
When you rename a column, SageMaker Canvas adds the transform in the Model recipe section. If you
remove the transform from the Model recipe section, the column reverts to its original name.
Remove rows
This transform removes rows of data from the dataset where values in a specific column meet conditions
that you specify. You can remove rows that have missing values, contain outliers, or meet custom
conditions in a column you choose. These rows are not used when building your model.
To remove rows that contain missing values in a specified column, do the following.
1. In the Build tab of the SageMaker Canvas application, choose Remove rows by.
2. Choose the Column you want to check for missing values.
3. For the Operation, choose Is missing.
4. Choose Add to add the transform to the Model recipe.
SageMaker Canvas drops rows that contain missing values in the Column you selected. After removing
the rows from the dataset, SageMaker Canvas adds the transform in the Model recipe section. If you
remove the transform from the Model recipe section, the rows return to your dataset.
Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy
and lead to longer building times. With SageMaker Canvas, you can detect and remove rows that contain
outliers in numeric columns. You can choose to define outliers with either standard deviations or a
custom range.
1. In the Build tab of the SageMaker Canvas application, choose Remove rows by.
2. Choose the Column you want to check for outliers.
3. For the Operation, choose Is outlier.
4. Set the Outlier range to either Standard deviation or Custom range.
5. If you choose Standard deviation, specify a SD (standard deviation) value from 1–3. If you choose
Custom range, select either Percentile or Number, and then specify the Min and Max values.
6. Choose Add to add the transform to the Model recipe.
The Standard deviation option detects and removes outliers in numeric columns using the mean and
standard deviation. You specify the number of standard deviations a value must vary from the mean to
be considered an outlier. For example, if you specify 3 for SD, a value must fall more than 3 standard
deviations from the mean to be considered an outlier.
The Custom range option detects and removes outliers in numeric columns using minimum and
maximum values. Use this method if you know your threshold values that delimit outliers. You can set
the Type of the range to either Percentile or Number. If you choose Percentile, the Min and Max values
should be the minimum and maximum of the percentile range (0–100) that you want to allow. If you
choose Number, the Min and Max values should be the minimum and maximum numeric values that you
want to allow in the data.
After removing the rows from the dataset, SageMaker Canvas adds the transform in the Model recipe
section. If you remove the transform from the Model recipe section, the rows return to your dataset.
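Conceptually, the two outlier definitions work like the following pandas sketch, which assumes a hypothetical numeric price column; Canvas applies the transform for you without code.

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 500, 9, 14]})

# Standard deviation: keep rows within 3 standard deviations of the mean.
mean, sd = df["price"].mean(), df["price"].std()
kept_by_sd = df[(df["price"] - mean).abs() <= 3 * sd]

# Custom range (percentile): keep rows between the 5th and 95th percentiles.
low, high = df["price"].quantile([0.05, 0.95])
kept_by_range = df[df["price"].between(low, high)]
print(kept_by_sd, kept_by_range, sep="\n")
```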
You can remove rows with values that meet custom conditions. For example, you might want to exclude
all of the rows with a price value greater than 100 when building your model. With this transform, you
can create a rule that removes all rows that exceed the threshold you set.
1. In the Build tab of the SageMaker Canvas application, choose Remove rows by.
2. Choose the Column you want to check.
3. Select the type of Operation you want to use, and then specify the values for the selected condition.
4. Choose Add to add the transform to the Model recipe.
For the Operation, you can choose one of the following options. Note that the available operations depend on the data type of the column you choose. For example, you can't create an Is greater than operation for a column containing text values.
• Is equal to (binary, numeric, text, categorical): Removes rows where the value in Column equals the values you specify.
• Is not equal to (binary, numeric, text, categorical): Removes rows where the value in Column doesn't equal the values you specify.
• Is less than (numeric): Removes rows where the value in Column is less than the value you specify.
• Is less than or equal to (numeric): Removes rows where the value in Column is less than or equal to the value you specify.
• Is greater than or equal to (numeric): Removes rows where the value in Column is greater than or equal to the value you specify.
• Starts with (text, categorical): Removes rows where the value in Column begins with a value you specify.
• Ends with (text, categorical): Removes rows where the value in Column ends with a value you specify.
After removing the rows from the dataset, SageMaker Canvas adds the transform in the Model recipe
section. If you remove the transform from the Model recipe section, the rows return to your dataset.
Replace values
This transform replaces values in your dataset where the values in a specific column meet conditions that
you specify. You can replace missing values or outliers. SageMaker Canvas uses the replaced values when
building your model but doesn’t change your original dataset. Note that if you've dropped a column
from your dataset using the Drop columns (p. 342) transform, you can't replace values in that column.
Missing values are a common occurrence in machine learning datasets and can impact model accuracy.
You can choose to drop rows that have missing values, but your model is more accurate if you choose
to replace the missing values instead. With this transform, you can replace missing values in numeric
columns with the mean or median of the data in a column, or you can also specify a custom value with
which to replace missing values. For non-numeric columns, you can replace missing values with the mode
(most common value) of the column or a custom value.
Use this transform if you want to replace the null or empty values in certain columns. To replace missing
values in a specified column, do the following.
• If your column is numeric, then select Mean, Median, or Custom. Mean replaces missing values
with the mean for the column, and Median replaces missing values with the median for the
column. If you choose Custom, then you must specify a custom value that you want to use to
replace missing values.
• If your column is non-numeric, then select Mode or Custom. Mode replaces missing values with the mode, or the most common value, for the column. If you choose Custom, specify a custom value that you want to use to replace missing values.
6. Choose Add to add the transform to the Model recipe.
After replacing the missing values in the dataset, SageMaker Canvas adds the transform in the Model
recipe section. If you remove the transform from the Model recipe section, the missing values return to
the dataset.
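For illustration, the replacement strategies correspond to the following pandas sketch; the column names are hypothetical, and Canvas applies the transform for you without code.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 14.0, 12.0],      # numeric column
    "color": ["red", "blue", None, "red"],  # non-numeric column
})

df["price"] = df["price"].fillna(df["price"].mean())      # numeric: mean
# df["price"] = df["price"].fillna(df["price"].median())  # numeric: median
df["color"] = df["color"].fillna(df["color"].mode()[0])   # non-numeric: mode
print(df)
```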
Replace outliers
Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy
and lead to longer building times. SageMaker Canvas enables you to detect outliers in numeric columns
and replace the outliers with values that lie within an accepted range in your data. You can choose to
define outliers with either standard deviations or a custom range, and you can replace outliers with the
minimum and maximum values in the accepted range.
The Standard deviation option detects outliers in numeric columns using the mean and standard
deviation. You specify the number of standard deviations a value must vary from the mean to be
considered an outlier. For example, if you specify 3 for SD, a value must fall more than 3 standard
deviations from the mean to be considered an outlier. SageMaker Canvas replaces outliers with the
minimum value or maximum value in the accepted range. For example, if you configure the standard
deviations to only include values from 200–300, then SageMaker Canvas changes a value of 198 to 200
(the minimum).
The Custom Range option detects outliers in numeric columns using minimum and maximum values.
Use this method if you know your threshold values that delimit outliers. You can set the Type of the
custom range to either Percentile or Number. If you choose Percentile, the Min and Max values should
be the minimum and maximum of the percentile range (0–100) that you want to allow. If you choose
Number, the Min and Max values should be the minimum and maximum numeric values that you want
to allow. SageMaker Canvas replaces any values that fall outside of the minimum and maximum to
the minimum and maximum values. For example, if your range only allows values from 1–100, then
SageMaker Canvas changes a value of 102 to 100 (the maximum).
After replacing the values in the dataset, SageMaker Canvas adds the transform in the Model recipe
section. If you remove the transform from the Model recipe section, the original values return to the
dataset.
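Conceptually, replacing outliers is a clipping operation, as in the following pandas sketch with a hypothetical price column and an accepted range of 200-300.

```python
import pandas as pd

df = pd.DataFrame({"price": [198, 250, 260, 302, 275]})

# Values outside 200-300 are set to the nearest boundary, so 198
# becomes 200 (the minimum) and 302 becomes 300 (the maximum).
df["price"] = df["price"].clip(lower=200, upper=300)
print(df)
```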
Filter rows
The filter functionality filters the previewed rows (the first 100 rows of your dataset) according to
conditions that you specify. Filtering rows creates a temporary preview of the data and does not impact
the model building. You can filter to preview rows that have missing values, contain outliers, or meet
custom conditions in a column you choose.
Missing values are a common occurrence in machine learning datasets. If you have rows with null or
empty values in certain columns, you might want to filter for and preview those rows.
1. In the Build tab of the SageMaker Canvas application, choose Filter by rows.
2. Choose the Column you want to check for missing values.
3. For the Operation, choose Is missing.
SageMaker Canvas filters for rows that contain missing values in the Column you selected and provides a
preview of the filtered rows.
Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy
and lead to longer building times. SageMaker Canvas enables you to detect and filter rows that contain
outliers in numeric columns. You can choose to define outliers with either standard deviations or a
custom range.
1. In the Build tab of the SageMaker Canvas application, choose Filter by rows.
2. Choose the Column you want to check for outliers.
3. For the Operation, choose Is outlier.
4. Set the Outlier range to either Standard deviation or Custom range.
5. If you choose Standard deviation, specify a SD (standard deviation) value from 1–3. If you choose
Custom range, select either Percentile or Number, and then specify the Min and Max values.
The Standard deviation option detects and filters for outliers in numeric columns using the mean and
standard deviation. You specify the number of standard deviations a value must vary from the mean to
be considered an outlier. For example, if you specify 3 for SD, a value must fall more than 3 standard
deviations from the mean to be considered an outlier.
The Custom range option detects and filters for outliers in numeric columns using minimum and
maximum values. Use this method if you know the threshold values that delimit outliers. You can set
the Type of the range to either Percentile or Number. If you choose Percentile, the Min and Max values
should be the minimum and maximum of the percentile range (0–100) that you want to allow. If you
choose Number, the Min and Max values should be the minimum and maximum numeric values that you
want to allow; values that fall outside this range are treated as outliers.
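Canvas performs this filtering in the application, but as a rough sketch of both detection methods, the following pandas code flags outliers using the standard deviation and percentile approaches; the column name and thresholds are hypothetical:

import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 250, 9, 11]})
col = df["price"]

# Standard deviation method with SD = 2: an outlier is any value more than
# 2 standard deviations from the mean (250 is flagged here).
is_outlier_sd = (col - col.mean()).abs() > 2 * col.std()

# Percentile method: an outlier is any value outside the 5th-95th
# percentile range (both 9 and 250 are flagged here).
lo, hi = col.quantile(0.05), col.quantile(0.95)
is_outlier_pct = ~col.between(lo, hi)

print(df[is_outlier_sd])   # rows flagged by the standard deviation method
print(df[is_outlier_pct])  # rows flagged by the percentile method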
You can filter for rows with values that meet custom conditions. For example, you might want to preview
rows that have a price value greater than 100 before removing them. With this functionality, you can
filter rows that exceed the threshold you set and preview the filtered data.
1. In the Build tab of the SageMaker Canvas application, choose Filter by rows ( ).
2. Choose the Column you want to check.
3. Select the type of Operation you want to use, and then specify the values for the selected condition.
For the Operation, you can choose one of the following options. Note that the available operations
depend on the data type of the column you choose. For example, you cannot use an Is greater
than operation for a column containing text values.
• Is equal to (binary, numeric, text, categorical) – Filters rows where the value in Column equals the values you specify.
• Is not equal to (binary, numeric, text, categorical) – Filters rows where the value in Column doesn't equal the values you specify.
• Is less than (numeric) – Filters rows where the value in Column is less than the value you specify.
• Is less than or equal to (numeric) – Filters rows where the value in Column is less than or equal to the value you specify.
• Is greater than (numeric) – Filters rows where the value in Column is greater than the value you specify.
• Is greater than or equal to (numeric) – Filters rows where the value in Column is greater than or equal to the value you specify.
• Contains (text, categorical) – Filters rows where the value in Column contains a value you specify.
• Starts with (text, categorical) – Filters rows where the value in Column begins with a value you specify.
• Ends with (text, categorical) – Filters rows where the value in Column ends with a value you specify.
After you set the filter operation, SageMaker Canvas updates the preview of the dataset to show you the
filtered data.
The section Evaluate your model's performance (p. 351) describes how to view your model’s accuracy
score, broken down by model type. For each model, there is an Overview tab, which gives you a general
overview of the model’s performance, depending on the model type. There is also a Scoring tab, which
shows visualizations that you can use to get more insights into your model's performance beyond the
overall accuracy metric.
The Advanced metrics for your model contain information that you can use for a deeper understanding
of your model's performance. For information about how to view metrics you can use to quantify your
model’s accuracy, see Use advanced metrics in your analyses (p. 355).
The following sections describe how to interpret the scoring for each model type.
The Overview tab shows you the column impact for each column. Column impact is a percentage score
that indicates how much weight a column has in making predictions in relation to the other columns. For
example, a column impact of 25% means that the column contributes 25% of the weight to the
prediction, while the remaining columns contribute the other 75%.
The following screenshot shows the Overview tab for a 2 category prediction model.
The Scoring tab for a categorical prediction model gives you the ability to visualize all the predictions.
Line segments extend from the left of the page, indicating all the predictions the model has made. In the
middle of the page, the line segments converge on a perpendicular segment to indicate the proportion
of each prediction to a single category. From the predicted category, the segments branch out to the
actual category. You can get a visual sense of how accurate the predictions were by following each line
segment from the predicted category to the actual category.
The following image gives you an example Scoring section for a 2 category prediction model.
The following image gives you an example Scoring section for a 3+ category prediction model.
The Overview tab shows you the column impact for each column. Column impact is a percentage score
that indicates how much weight a column has in making predictions in relation to the other columns. For
example, a column impact of 25% means that the column contributes 25% of the weight to the
prediction, while the remaining columns contribute the other 75%.
The Scoring tab for numeric prediction shows a line that indicates the model's predicted value in relation
to the data used to make predictions. The width of the purple band around the line indicates the RMSE
(root mean squared error) range; the values that the model predicts typically fall within +/- the RMSE of
the actual value.
The following image shows the Scoring section for numeric prediction.
On the Analyze page for time series forecasting models, you can see an overview of the model’s metrics.
You can hover over each metric for more information, or you can see Use advanced metrics in your
analyses (p. 355).
In the Column impact section, you can see the score for each column. Column impact is a percentage
score that indicates how much weight a column has in making predictions in relation to the other
columns. For example, a column impact of 25% means that the column contributes 25% of the weight to
the prediction, while the remaining columns contribute the other 75%.
The Overview tab shows you the Per label performance, which gives you an overall accuracy score
for the images predicted for each label. You can choose a label to see more specific details, such as the
Correctly predicted and Incorrectly predicted images for the label.
You can turn on the Heatmap toggle to see a heatmap for each image. The heatmap shows you the areas
of interest that have the most impact when your model is making predictions. For more information
about heatmaps and how to use them to improve your model, choose the More info icon next to the
Heatmap toggle.
The Scoring tab for single-label image prediction models shows you a comparison of what the model
predicted as the label versus what the actual label was. You can select up to 10 labels at a time. You
can change the labels in the visualization by choosing the labels dropdown menu and selecting or
deselecting labels.
You can also view insights for individual labels or groups of labels, such as the three labels with the
highest or lowest accuracy, by choosing the View scores for dropdown menu in the Model accuracy
insights section.
The following screenshot shows the Scoring information for a single-label image prediction model.
The Overview tab shows you the Per label performance, which gives you an overall accuracy score for
the passages of text predicted for each label. You can choose a label to see more specific details, such as
the Correctly predicted and Incorrectly predicted passages for the label.
The Scoring tab for multi-category text prediction models shows you a comparison of what the model
predicted as the label versus what the actual label was.
In the Model accuracy insights section, you can see the Most frequent category, which tells you the
category that the model predicted most frequently and how accurate those predictions were. For
example, if your model predicts a label of Positive correctly 99% of the time, then you can be fairly
confident that your model is good at predicting positive sentiment in text.
The following screenshot shows the Scoring information for a multi-category text prediction model.
Numeric prediction refers to the mathematical concept of regression. When your Target column has
values that can be measured, such as yearly revenue or the number of items sold by a department store,
Canvas builds a model on your data using regression.
Categorical prediction, such as 2 category prediction or 3 category prediction, refers to the mathematical
concept of classification. Categorical prediction can be performed on data that can be sorted into a
limited number of categories.
Image prediction, such as single-label image prediction, refers to using computer vision to identify and
classify information in images. For example, you can use image prediction to predict whether an image is
of a dog or a cat.
Text prediction, such as multi-category text prediction, refers to using natural language processing
(NLP) to analyze language data. You can use multi-category text prediction on text data to analyze the
sentiment of text, or the overall mood of a text, such as Positive, Negative, Neutral, or Mixed.
Time series forecasting refers to making predictions that vary over time. You can perform time series
forecasts on data with timestamps that correlate to a value you want to predict. For example, you can
make a time series forecast that takes daily sales data and makes sales predictions for the next month.
SageMaker Canvas uses confusion matrices to help you visualize when a model makes predictions
correctly. In a confusion matrix, your results are arranged to compare the predicted values against the
actual values. The following example explains how a confusion matrix works for a 2 category prediction
model that predicts positive and negative labels:
• True positive – The model correctly predicted positive when the true label was positive.
• True negative – The model correctly predicted negative when the true label was negative.
• False positive – The model incorrectly predicted positive when the true label was negative.
• False negative – The model incorrectly predicted negative when the true label was positive.
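As a minimal sketch of how these four counts are derived, the following Python code tallies them from hypothetical actual and predicted labels (the labels and values are illustrative, not Canvas output):

from collections import Counter

# Hypothetical actual and predicted labels from a 2 category model.
actual    = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]

counts = Counter(zip(predicted, actual))
tp = counts[("positive", "positive")]  # true positives
tn = counts[("negative", "negative")]  # true negatives
fp = counts[("positive", "negative")]  # false positives
fn = counts[("negative", "positive")]  # false negatives

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=2 TN=1 FP=1 FN=1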
The following defines the advanced metrics for numeric prediction in Amazon SageMaker Canvas and
gives you information about how you can use them.
• R2 – The percentage of the variance in the target column that can be explained by the input columns.
• MAE – Mean absolute error. On average, the prediction for the target column is +/- {MAE} from the
actual value.
• MAPE – Mean absolute percent error. On average, the prediction for the target column is +/- {MAPE}%
from the actual value.
• RMSE – Root mean square error. The standard deviation of the errors.
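As a minimal sketch of how these metrics are computed (with hypothetical values, not Canvas's internal implementation):

import numpy as np

# Hypothetical actual and predicted values for a numeric (regression) model.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 145.0, 190.0, 260.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                  # mean absolute error: 8.75
mape = np.mean(np.abs(errors / y_true)) * 100  # mean absolute percent error: ~5.6%
rmse = np.sqrt(np.mean(errors ** 2))           # root mean square error: ~9.0
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # ~0.97

print(f"MAE={mae:.2f} MAPE={mape:.1f}% RMSE={rmse:.2f} R2={r2:.3f}")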
The following image shows a graph of the residuals or errors. The horizontal line indicates an error of 0
or a perfect prediction. The blue dots are the errors. Their distance from the horizontal line indicates the
magnitude of the errors.
• Missing – A missing value contains no content or is non-existent. Missing values are automatically
inferred.
• Mismatched – A mismatched value has a different data type from the type specified for its column.
SageMaker Canvas categorizes these values as missing and infers values for them.
• Unique – The number and percentage of values that are unique.
• Target correlation – A value between -1 and 1 that represents the strength of the linear relationship
between a column and the target column. 0 represents no detectable relationship, 1 represents a
strong positive relationship, and -1 represents a strong negative relationship.
• Column impact – Identifies the relative impact of the column in predicting the target column.
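As an illustration, the following sketch computes the Pearson correlation of each input column with a hypothetical sales target column, which is one common way to measure this kind of linear relationship (the column names and values are hypothetical):

import pandas as pd

# Hypothetical dataset; "sales" is the target column.
df = pd.DataFrame({
    "price":       [9.0, 8.5, 8.0, 7.5, 7.0],
    "temperature": [20, 24, 27, 30, 33],
    "sales":       [100, 120, 145, 160, 180],
})

# Pearson correlation of each input column with the target, between -1 and 1.
print(df.drop(columns="sales").corrwith(df["sales"]))
# price approaches -1 (strong negative); temperature approaches 1 (strong positive)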
The following is a list of available metrics for 3+ category prediction, image prediction, and text
prediction.
The following defines the advanced metrics for time series forecasts in Amazon SageMaker Canvas and
gives you information about how you can use them.
• Average Weighted Quantile Loss (wQL) – Evaluates the forecast by averaging the accuracy at the P10,
P50, and P90 quantiles. A lower value indicates a more accurate model.
• Weighted Absolute Percent Error (WAPE) – The sum of the absolute error normalized by the sum of
the absolute target, which measures the overall deviation of forecasted values from observed values. A
lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
• Root Mean Square Error (RMSE) – The square root of the average squared errors. A lower RMSE
indicates a more accurate model, where RMSE = 0 is a model with no errors.
• Mean Absolute Percent Error (MAPE) – The percentage error (percent difference of the mean
forecasted value versus the actual value) averaged over all time points. A lower value indicates a more
accurate model, where MAPE = 0 is a model with no errors.
• Mean Absolute Scaled Error (MASE) – The mean absolute error of the forecast normalized by the mean
absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model,
where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse
than the baseline.
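As a rough sketch, the following computes WAPE and MASE for hypothetical values, using a naive previous-value forecast as the baseline for MASE (the exact baseline method Canvas uses isn't specified here):

import numpy as np

# Hypothetical observed demand and forecasted values over 5 time points.
actual   = np.array([120.0, 130.0, 128.0, 140.0, 150.0])
forecast = np.array([118.0, 135.0, 125.0, 138.0, 155.0])

abs_err = np.abs(actual - forecast)

# WAPE: sum of absolute errors normalized by the sum of absolute actuals.
wape = abs_err.sum() / np.abs(actual).sum()

# MASE: forecast MAE normalized by the MAE of a naive one-step baseline
# (predicting each value with the previous observed value).
naive_mae = np.abs(np.diff(actual)).mean()
mase = abs_err.mean() / naive_mae

print(f"WAPE={wape:.3f} MASE={mase:.2f}")  # MASE < 1 beats the naive baseline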
Numeric and categorical prediction, image prediction, and text prediction custom models support
making the following types of predictions for your data:
• Single predictions – A single prediction is when you only need to make one prediction. For example,
you have one image or passage of text that you want to classify.
• Batch predictions – A batch prediction is when you'd like to make predictions for an entire dataset.
For example, you have a CSV file of customer reviews for which you'd like to predict the customer
sentiment, or you have a folder of image files that you'd like to classify. You should make predictions
with a dataset whose schema matches your input dataset. Canvas provides you with the ability to do
manual batch predictions, or you can configure automatic batch predictions that run whenever a
specified dataset is updated in Canvas.
For each prediction or set of predictions, SageMaker Canvas returns the following:
Get started
Choose one of the following workflows to make predictions with your custom model:
After generating predictions with your model, you can also do the following:
• Update your model by creating a new version. If you want to try to improve the prediction accuracy
of your model, you can build new versions of your model. You can update your data or change any
advanced transformations you used, and then you can review and compare the versions of your model
to choose the best one.
• Register a model version in the SageMaker model registry (p. 373). You can register versions of your
model to the SageMaker model registry, which is a feature for tracking and managing the status of
model versions and machine learning pipelines. A data scientist or MLOps team user with access to
the SageMaker model registry can review your model versions and approve or reject them before
deploying them to production.
• Send your batch predictions to Amazon QuickSight. In Amazon QuickSight, you can build and publish
dashboards with your batch prediction datasets. This can help you analyze and share results generated
by your custom model.
To make a single prediction for a numeric or categorical prediction model, do the following:
In the right Prediction pane, you'll see the prediction result. You can Copy the prediction result chart,
or you can choose Download to download the prediction result chart as an image or the values and
prediction as a CSV file.
To make a single prediction for a single-label image prediction model, do the following:
In the right Prediction results pane, the model lists the possible labels for the image along with a
Confidence score for each label. For example, the model might predict the label Sea for an image with
a confidence score of 96%, and the label Glacier with a confidence score of only 4%. In that case, you
can determine that your model is fairly confident in predicting images of the sea.
To make a single prediction for a multi-category text prediction model, do the following:
In the right Prediction results pane, you receive an analysis of your text in addition to a Confidence
score for each possible label. For example, if you entered a good review for a product, you might get
Positive with a confidence score of 85%, while the confidence score for Neutral might be 10% and the
confidence score for Negative only 5%.
• Manual batch predictions are when you have a dataset for which you want to make one-time
predictions.
• Automatic batch predictions are when you set up a configuration that runs a batch prediction
whenever a specific dataset is updated. For example, if you’ve configured weekly updates to a
SageMaker Canvas dataset of inventory data, you can set up automatic batch predictions that run
whenever you update the dataset. After setting up an automated batch predictions workflow, see
Manage automations (p. 375) for more information about viewing and editing the details of your
configuration. For more information about setting up automatic dataset updates, see Configure
automatic updates for a dataset (p. 309).
Note
You can only set up automatic batch predictions for datasets imported through local upload or
Amazon S3. Additionally, automatic batch predictions can only run while you’re logged in to the
Canvas application. If you log out of Canvas, automatic batch prediction jobs resume when you
log back in.
To get started, review the following section for batch prediction dataset requirements, and then
choose one of the following manual or automatic batch prediction workflows.
For batch predictions, make sure that your datasets meet the requirements outlined in Create a
dataset (p. 301).
You might not be able to make predictions on some datasets because they have incompatible schemas. A
schema is an organizational structure. For a tabular dataset, the schema is the names of the columns and
the data type of the data in the columns. An incompatible schema might occur for one of the following
reasons:
• The dataset that you're using to make predictions has fewer columns than the dataset that you used
to build the model.
• The data types in the columns that you used to build the model are different from the data types in
the dataset that you're using to make predictions.
• The dataset that you're using to make predictions and the dataset that you've used to build the model
have column names that don't match. The column names are case sensitive. Column1 is not the same
as column1.
To ensure that you can successfully generate batch predictions, match the schema of your batch
predictions dataset to the dataset you used to train the model.
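As an illustration, the following sketch checks a prediction dataset against the training dataset's schema before you generate batch predictions; the function and file names are hypothetical:

import pandas as pd

def check_schema(train_df: pd.DataFrame, predict_df: pd.DataFrame) -> list:
    """Return a list of schema problems that could block batch predictions."""
    problems = []
    # Column names are case sensitive: Column1 is not the same as column1.
    missing = set(train_df.columns) - set(predict_df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    # Data types must match between the training and prediction datasets.
    for col in set(train_df.columns) & set(predict_df.columns):
        if train_df[col].dtype != predict_df[col].dtype:
            problems.append(f"dtype mismatch in {col!r}")
    return problems

# Example: train = pd.read_csv("training.csv"); pred = pd.read_csv("batch.csv")
# print(check_schema(train, pred) or "schemas are compatible")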
Note
For batch predictions, if you dropped any columns when building your model, Canvas adds the
dropped columns back to the prediction results. However, Canvas does not add the dropped
columns to your batch predictions for time series models.
Make manual batch predictions with numeric and categorical prediction models
To make manual batch predictions for a numeric or categorical prediction model, do the following:
6. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.
After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon ( ), you
can choose Preview to preview the output data. You can see the input data matched to the prediction
and the probability that the prediction is correct. Then, you can choose Download CSV to download the
results as a CSV file.
To make manual batch predictions for a single-label image prediction model, do the following:
After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon ( ), you
can choose View prediction results to see the output data. You can see the images along with their
predicted labels and confidence scores. Then, you can choose Download prediction to download the
results as a CSV or a ZIP file.
To make manual batch predictions for a multi-category text prediction model, do the following:
After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon ( ), you
can choose Preview to see the output data. You can see the passages of text along with their predicted
labels and confidence scores. Then, you can choose Download CSV to download the results.
Canvas runs a batch predictions job for the dataset after you set up the configuration. Then, every time
you Update a dataset (p. 308), either manually or automatically, another batch predictions job runs.
After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon ( ), you
can choose Preview to preview the output data. You can see the input data matched to the prediction
and the probability that the prediction is correct. Then, you can choose Download to download the
results.
The following sections describe how to view, update, and delete your automatic batch prediction
configuration through the Datasets page in the Canvas application. You can only set up a maximum
of 20 automatic configurations in Canvas. For more information about viewing your automated batch
predictions job history or making changes to your automatic configuration through the Automations
page, see Manage automations (p. 375).
To view the job history for your automatic batch predictions, go to the Predict tab of your model.
Under Predictions, you can see the All jobs and Configuration tabs:
• All jobs – In this tab, you can see all of the batch prediction jobs for this model. You can filter the
jobs by configuration name. For each job, you can see fields such as the Input dataset, which includes
the version of the dataset, and the Prediction type, such as whether the predictions were automatic
or manual. If you choose the More options icon ( ), you can choose View prediction or Download
prediction.
• Configuration – In this tab, you can see all of the automatic batch prediction configurations you’ve
created for this model. For each configuration, you can see fields such as the timestamp for when it
was Created, the Input dataset it tracks for updates, and the Next job scheduled. If you choose the
More options icon ( ), you can choose View all jobs to see the job history and in progress jobs for the
configuration.
You might want to make changes to your automatic batch prediction configuration, or you might want
to turn off the configuration to pause the predictions. When you edit a batch prediction configuration,
you can change the target dataset but not the frequency, because automatic batch predictions run
whenever the dataset is updated.
To pause your automatic batch predictions, turn off your automatic configuration by doing the following:
Automatic batch predictions are now paused. You can turn the toggle back on at any time to resume the
update schedule.
To learn how to delete your automatic batch prediction configuration, see Delete an automatic
configuration (p. 377).
Once you generate batch predictions with custom tabular models in SageMaker Canvas, you can send
those predictions as CSV files to Amazon QuickSight, which is a business intelligence (BI) service to build
and publish predictive dashboards.
For example, if you built a 2 category prediction model to determine whether a customer will churn, you
can create a visual, predictive dashboard in QuickSight to show the percentage of customers that are
expected to churn. To learn more about Amazon QuickSight, see the Amazon QuickSight User Guide.
The following sections show you how to send your batch predictions to QuickSight for analysis.
Your user must have the necessary AWS Identity and Access Management (IAM) permissions to send your
predictions to QuickSight. Your administrator can set up the IAM permissions for your user. For more
information, see Grant Your Users Permissions to Send Predictions to Amazon QuickSight (p. 283).
Your QuickSight account must contain the default namespace, which is set up when you first create
your QuickSight account. Contact your administrator to help you get access to QuickSight. For more
information, see Setting up for Amazon QuickSight in the Amazon QuickSight User Guide.
Your QuickSight account must be created in the same Region as your Canvas application. If your
QuickSight account’s home Region differs from your Canvas application’s Region, you must either
close and recreate your QuickSight account, or set up a Canvas application in the same Region as your
QuickSight account. You can check your QuickSight home Region by doing the following (assuming you
already have a QuickSight account):
You must know the usernames of the QuickSight users to whom you want to send your predictions. You
can send predictions to yourself or other users who have the right permissions. Any users to whom you
send predictions must be in the default namespace of your QuickSight account and have the Author
or Admin role in QuickSight.
Additionally, QuickSight must have access to the SageMaker default Amazon S3 bucket for your Domain,
which is named with the following format: sagemaker-{REGION}-{ACCOUNT_ID}. The Region should
be the same as your QuickSight account's home Region and your Canvas application’s Region. To learn
how to give QuickSight access to the batch predictions stored in your Amazon S3 bucket, see the topic I
can’t connect to Amazon S3 in the Amazon QuickSight User Guide.
Before sending your predictions, check that the data format of your batch predictions is compatible with
QuickSight.
• To learn more about the accepted data formats for timeseries data, see Supported date formats in the
Amazon QuickSight User Guide.
• To learn more about data values that might prevent you from sending to QuickSight, see Unsupported
values in data in the Amazon QuickSight User Guide.
Also note that Amazon QuickSight uses the character " as a text qualifier, so if your Canvas data contains
any " characters, make sure that you close all matching quotes. Any mismatched quotes can cause
issues with sending your dataset to QuickSight.
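As a quick sanity check before sending, the following sketch flags CSV lines with an odd number of " characters, which indicates an unclosed text qualifier (the file name is hypothetical):

# Flag CSV lines with an odd number of " characters.
with open("predictions.csv", encoding="utf-8") as f:  # hypothetical file name
    for line_number, line in enumerate(f, start=1):
        if line.count('"') % 2 != 0:
            print(f"line {line_number}: mismatched quotes")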
Alternatively, you can preview your predictions by choosing the More options icon ( ) and then
View prediction results. From the dataset preview, you can choose Send to Amazon QuickSight.
The following screenshot shows you the Send to Amazon QuickSight button in a dataset preview.
a. For QuickSight users, enter the names of the QuickSight users to whom you want to send your
predictions. If you want to send them to yourself, enter your own username. You can only send
predictions to users in the default namespace of the QuickSight account, and the users must
have the Author or Admin role in QuickSight.
b. Choose Send.
The following screenshot shows the Send to Amazon QuickSight dialog box:
After you send your batch predictions, the QuickSight field for the datasets you sent shows as Sent.
In the confirmation box that confirms your predictions were sent, you can choose Open Amazon
QuickSight to open your QuickSight application. If you’re done using Canvas, you should log out of the
Canvas application.
The QuickSight users that you’ve sent datasets to can open their QuickSight application and view the
Canvas datasets that have been shared with them. Then, they can create predictive dashboards with the
data. For more information, see Getting started with Amazon QuickSight data analysis in the Amazon
QuickSight User Guide.
By default, all of the users to whom you send predictions have owner permissions for the dataset in
QuickSight. Owners can create analyses, refresh, edit, delete, and re-share datasets. The changes
that owners make to a dataset change the dataset for all users with access. To change the permissions,
go to the dataset in QuickSight and manage its permissions. For more information, see Viewing and
editing the permissions of users that a dataset is shared with in the Amazon QuickSight User Guide.
Amazon SageMaker Canvas gives you the ability to use machine learning for time series forecasting, so
you can make predictions that vary with time.
You can make a time series forecast for the following examples:
To make a time series forecast, your dataset must have the following:
The datetime values in the timestamp column must use one of the following formats:
• YYYY-MM-DD HH:MM:SS
• YYYY-MM-DDTHH:MM:SSZ
• YYYY-MM-DD
• MM/DD/YY
• MM/DD/YY HH:MM
• MM/DD/YYYY
• YYYY/MM/DD HH:MM:SS
• YYYY/MM/DD
• DD/MM/YYYY
• DD/MM/YY
• DD-MM-YY
• DD-MM-YYYY
The timestamps in your dataset must occur at one of the following intervals:
• 1 min
• 5 min
• 15 min
• 30 min
• 1 hour
• 1 day
• 1 week
• 1 month
• 1 year
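As an illustration, the following sketch checks whether a timestamp column parses with at least one of the accepted formats before you import the dataset; the function and sample values are hypothetical:

import pandas as pd

# Accepted formats from the list above, expressed as strftime patterns.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d",
           "%m/%d/%y", "%m/%d/%y %H:%M", "%m/%d/%Y",
           "%Y/%m/%d %H:%M:%S", "%Y/%m/%d", "%d/%m/%Y", "%d/%m/%y",
           "%d-%m-%y", "%d-%m-%Y"]

def matches_accepted_format(values: pd.Series) -> bool:
    """True if every value parses with at least one accepted format."""
    for fmt in FORMATS:
        parsed = pd.to_datetime(values, format=fmt, errors="coerce")
        if parsed.notna().all():
            return True
    return False

print(matches_accepted_format(pd.Series(["2023-01-15", "2023-02-01"])))  # True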
For higher prediction accuracy, your dataset can also include additional columns with data that helps
explain the variation in the target column. Using these additional explanatory columns might help you
forecast future values in the target column more accurately.
For example, you can forecast the amount of ice cream sold by a grocery store. To make a forecast, you
must have a timestamp column and a column that indicates how much ice cream the grocery store sold.
For a more accurate forecast, your dataset can also include the price, the ambient temperature, the flavor
of the ice cream, or a unique identifier for the ice cream.
Ice cream sales might increase when the weather is warmer. A decrease in the price of the ice cream
might result in more units sold. Having a column with ambient temperature data and a column with
pricing data can improve your ability to forecast the number of units of ice cream the grocery store sells.
You might have missing data for different reasons. The reason for your missing data might inform
how you want Amazon SageMaker Canvas to impute it. For example, your organization might use an
automatic system that only tracks when a sale happens. If you're using a dataset that comes from this
type of automatic system, you have missing values in the target column.
For missing values in the dataset, SageMaker Canvas imputes the missing values for you.
Important
If you have missing values in the target column, we recommend using a dataset that doesn't
have them. SageMaker Canvas uses the target column to forecast future values. Missing values
in the target column can greatly reduce the accuracy of the forecast.
You can make the following types of forecasts:
• Single item
• All items
For a forecast on all the items in your dataset, SageMaker Canvas returns a forecast for the future values
for each item in your dataset.
For a single item forecast, you specify the item and SageMaker Canvas returns a forecast for the future
values. The forecast includes a line graph that plots the predicted values over time.
Topics
• Gain additional insights from your forecast (p. 369)
You can gain additional insights from your forecast by using the following:
• Group column
• Holiday schedule
• What-if scenario
You can specify a column in your dataset as a Group column. Amazon SageMaker Canvas groups the
forecast by each value in the column. For example, you can group the forecast on columns containing
price data or unique item identifiers. Grouping a forecast by a column lets you make more specific
forecasts. For example, if you group a forecast on a column containing item identifiers, you can see the
forecast for each item.
Overall sales of items might be impacted by the presence of holidays. For example, in the United States,
the number of items sold in both November and December might differ greatly from the number of
items sold in January. If you use the data from November and December to forecast the sales in January,
your results might be inaccurate. Using a holiday schedule prevents you from getting inaccurate results.
You can use a holiday schedule for 251 countries.
For a forecast on a single item in your dataset, you can use what-if scenarios. A what-if scenario gives
you the ability to change values in your data and see how the forecast changes. For example, you can
use a what-if scenario to answer questions such as, "What if I lowered prices? How would that affect the
number of items sold?"
• Item ID column – The column that contains unique identifiers for each item in your dataset. For
example, an SKU number uniquely identifies an item.
• Optional: Group column – Groups the time series forecast by values in the column. For example,
you can group your forecast for an item by store.
• Time stamp column – The column containing the time stamps in your dataset. For a list of the
supported datetime formats for this column, see Time Series Forecasts in Amazon SageMaker
Canvas (p. 367).
• Future timestamp – A timestamp that indicates a future forecast time. SageMaker Canvas
forecasts values up to the point in time specified by the timestamp.
• Optional: Holiday schedule – Activate the holiday schedule to use a country's holiday schedule.
Use it to make your forecasts with holiday data more accurate.
Missing future values are missing values in the target column. SageMaker Canvas uses the values in the
target column to forecast the values in the future. If you have missing values in the target column, your
forecast might be less accurate. We highly recommend updating the dataset.
Missing values are values that are missing in any column other than the target column. With missing
values that aren't in the target column, it's helpful to note the following:
• They generally don't reduce the accuracy of your forecast as much as missing future values.
• SageMaker Canvas automatically imputes the missing values.
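Canvas imputes missing values for you automatically. As an illustration of the concept only (not Canvas's actual algorithm), the following sketch fills gaps in a hypothetical explanatory column by linear interpolation:

import pandas as pd

# Hypothetical explanatory column with gaps (not the target column).
df = pd.DataFrame({"temperature": [21.0, None, 24.0, None, 27.0]})

# One common imputation approach: linear interpolation between known values.
df["temperature"] = df["temperature"].interpolate()
print(df["temperature"].tolist())  # [21.0, 22.5, 24.0, 25.5, 27.0]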
You can evaluate the model by seeing how close the predictions are to the actual values. You can also
use the Column impact metric to determine the direction and magnitude of a column's impact on the
model's predictions. For example, in the following image, holidays had the largest positive impact on the
forecast for demand, while price had the largest negative impact.
After you've built a model, you can make the following types of forecasts:
• Single item – Make a forecast for a single item in a dataset and a line graph of the values that
SageMaker Canvas forecasts. For example, you can see how sales of an item vary over time.
• All items – Make a forecast for all items in a dataset.
• What-if scenario – See how changing values in the dataset can affect the overall forecast for a single
item.
The following image shows a single item forecast with a what-if scenario. In a what-if scenario, you have
the ability to change values that can vary with time. You can see how changing the values affects the
forecast.
The points connected by the solid blue line are the values that the model forecasts. The points
connected by the dashed lines show the what-if scenario.
Each model that you build has a version number. The first model is Version 1, or V1. You can use
model versions to see changes in prediction accuracy when you update your data or use advanced
transformations.
Note
Text prediction and image prediction models only support one model version.
For new versions of a model, you can only choose datasets that have the same target column as the
target column in Version 1. You must build at least one version of a model to add a new version, and you
can delete versions that aren’t useful to you anymore.
You can also see Register a model version in the SageMaker model registry (p. 373) to help you track
your versions over time and collaborate with Studio users who can approve or reject your model versions.
Use the following procedure to add a new model version or to view all of the versions for your model.
4. After choosing your model, the Versions page opens, listing all of the versions of your model.
5. Choose Add version.
The following image shows the Versions page for a model, on which you can view your model versions
and add new versions.
On the Versions page, you can view the following information for each of your model versions:
• Status – This field tells you whether your model is currently building (In building), done building
(Ready), failed to build (Failed), or still being edited (In draft).
• Model score, F1, Precision, Recall, and AUC – If you turn on the Show advanced metrics toggle on
this page, you can see these model metrics. These metrics indicate the accuracy and performance of
your model. For more information, see Evaluate your model.
• Shared – This field tells you whether or not you’ve shared the model version with SageMaker Studio
users.
• Model registry – This field tells you whether or not you’ve registered the version to a model registry.
For more information, see Register a model version in the SageMaker model registry (p. 373).
After you choose a new version, you start the process of building another model. The process for
building a new version of a model is almost the same as the process for building a model for the first
time. For new versions of a model, you can only choose datasets that have the same target column
as the target column in Version 1. For more information about building a model, see Build a custom
model (p. 321).
The following topics describe how you can use features within Canvas to use a Canvas-built model in
production.
Topics
• Register a model version in the SageMaker model registry (p. 373)
After you’ve built a model that you feel confident about, you might want to evaluate its performance
and have it reviewed by a data scientist or MLOps engineer in your organization before using it in
production. To do this, you can register your model versions to the SageMaker model registry. The
SageMaker model registry is a repository that data scientists or engineers can use to catalog machine
learning (ML) models and manage model versions and their associated metadata, such as training
metrics. They can also manage and log the approval status of a model.
After you register your model versions to the SageMaker model registry, a data scientist or your MLOps
team can access the SageMaker model registry through SageMaker Studio, which is a web-based
integrated development environment (IDE) for working with machine learning models. In the SageMaker
model registry interface in Studio, the data scientist or MLOps team can evaluate your model and
update its approval status. If the model doesn’t perform to their requirements, the data scientist or
MLOps team can update the status to Rejected. If the model does perform to their requirements,
then the data scientist or MLOps team can update the status to Approved. Then, they can deploy your
model to an endpoint or automate model deployment with CI/CD pipelines. You can use the SageMaker
model registry feature to seamlessly integrate models built in Canvas with the MLOps processes in your
organization.
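As a sketch of what this review step can look like outside of Studio's interface, the following uses the AWS SDK for Python (Boto3) to list the versions in a model group and update a version's approval status; the model group name is hypothetical, and this is an illustration rather than a required part of the Canvas workflow:

import boto3

sm = boto3.client("sagemaker")

# List the versions registered to a model group (hypothetical group name).
versions = sm.list_model_packages(ModelPackageGroupName="canvas-churn-model")
for pkg in versions["ModelPackageSummaryList"]:
    print(pkg["ModelPackageArn"], pkg["ModelApprovalStatus"])

# A data scientist or MLOps user can update the approval status after review.
sm.update_model_package(
    ModelPackageArn=versions["ModelPackageSummaryList"][0]["ModelPackageArn"],
    ModelApprovalStatus="Approved",  # or "Rejected" / "PendingManualApproval"
)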
The following diagram summarizes an example of registering a model version built in Canvas to the
SageMaker model registry for integration into an MLOps workflow.
You can register tabular, image, and text model versions to the SageMaker model registry.
Note
Currently, registration of time series forecasting or BYOM model versions built in Canvas to the
SageMaker model registry isn’t supported.
The following sections show you how to register a model version to the SageMaker model registry from
Canvas.
Permissions management
By default, you have permissions to register model versions to the SageMaker model registry.
SageMaker grants these permissions for all new and existing Canvas user profiles through the
AmazonSageMakerCanvasFullAccess policy, which is attached to the AWS IAM execution role for the
SageMaker Domain that hosts your Canvas application.
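If an administrator manages these permissions programmatically rather than through the console, attaching the managed policy to the execution role might look like the following sketch (the role name is hypothetical):

import boto3

iam = boto3.client("iam")

# Attach the Canvas managed policy to the Domain's execution role
# ("CanvasExecutionRole" is a hypothetical role name).
iam.attach_role_policy(
    RoleName="CanvasExecutionRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerCanvasFullAccess",
)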
If your Canvas administrator is setting up a new Domain or user profile, when they're setting up the
Domain and following the prerequisite instructions in the Getting started guide, SageMaker turns on the
model registration permissions through the ML Ops permissions configuration option, which is enabled
by default.
The Canvas administrator can manage model registration permissions at the user profile level as well. For
example, if the administrator wants to grant model registration permissions to some user profiles but
remove permissions for others, they can edit the permissions for a specific user. The following procedure
shows how to turn off model registration permissions for a specific user profile:
The SageMaker model registry uses a model group to track all of the model versions that you build to
solve a particular problem. When you build a SageMaker Canvas model and register it to the SageMaker
model registry, the model is added to a model group as a new model version. For example, if you build
and register four versions of your model, then a data scientist or MLOps team working in the SageMaker
model registry interface can view the model group and review all four versions of the model in one place.
When you register a Canvas model to the SageMaker model registry, Canvas automatically creates a
model group named after your Canvas model. Optionally, you can rename the model group to a name of
your choice, or you can use an existing model group in the SageMaker model registry. For more
information about creating a model group, see Create a Model Group.
Note
Currently, you can only register models built in Canvas to the SageMaker model registry in the
same account.
To register a model version to the SageMaker model registry from the Canvas application, use the
following procedure:
a. (Optional) In the SageMaker Studio model group section, for the Model group name field,
enter the name of the model group to which you want to register your version. You can specify
the name for a new model group that SageMaker creates for you, or you can specify an existing
model group. If you don’t specify this field, Canvas registers your version to a default model
group with the same name as your model.
b. Choose Add.
Your model version should now be registered to the model group in the SageMaker model registry. When
you register a model version to a model group in the SageMaker model registry, all subsequent versions
of the Canvas model are registered to the same model group (if you choose to register them). If you want
to register your versions to a different model group, you need to go to the SageMaker model registry and
delete the model group. Then, you can re-register your model versions to the new model group.
To view the status of your models, you can return to the Versions page for your model in the Canvas
application. This page shows you the Model Registry status of each version. If the status is Registered,
then the model has been successfully registered.
If you want to view the details of your registered model version, for the Model Registry status, you can
hover over the Registered field to see the Model registry details pop-up box. These details contain more
information, such as the following:
• The Model package group name is the model group that your version is registered to in the
SageMaker model registry.
• The Approval status, which can be Pending Approval, Approved, or Rejected. If a Studio user
approves or rejects your version in the SageMaker model registry, then this status is updated on your
model versions page when you refresh the page.
The following screenshot shows the Model registry details box, along with an Approval status of
Approved for this particular model version.
Manage automations
In SageMaker Canvas, you can create automations that update your dataset or generate predictions from
your model on a schedule. For example, you might receive new shipping data on a daily basis. You can
set up an automatic update for your dataset and automatic batch predictions that run whenever the
dataset is updated. Using these features, you can set up an automated workflow and reduce the amount
of time you spend manually updating datasets and making predictions.
Note
You can only set up a maximum of 20 automatic configurations in your Canvas application.
Automations are only active while you’re logged in to the Canvas application. If you log out of
Canvas, your automatic jobs pause until you log back in.
The following sections describe how to view, edit, and delete configurations for existing automations. To
learn how to set up automations, see the following topics:
• All jobs – You can see every instance of a Dataset update or Batch prediction job that Canvas has
run. For each job, you can see fields such as the associated Input dataset, the Configuration name of
the associated auto update configuration, and the Status showing whether the job was successful.
You can filter the jobs by configuration name:
• For dataset update jobs, you can choose the latest version of the dataset, or the most recent job, to
preview the dataset.
• For batch prediction jobs, you can choose the More options icon ( ) to view or download the
predictions for that job.
• Configuration – You can see all of the Dataset update and Batch prediction configurations you’ve
created. For each configuration, you can see fields such as the associated Input dataset and the
Frequency of the jobs. You can also turn off or turn on the Auto update toggle to pause or resume
automatic updates. If you choose the More options icon ( ) for a specific configuration, you can
choose to View all jobs for the configuration, Update configuration, or Delete configuration.
The following sections show you how to update each type of configuration.
Note
You can’t change the frequency for automatic batch predictions because automatic batch
predictions run every time the target dataset is updated.
You might want to make changes to your auto update configuration for a dataset, such as changing the
frequency of the updates. You might also want to turn off your automatic update configuration to pause
the updates to your dataset.
To make changes to your auto update configuration for a dataset, do the following:
To pause your dataset updates, turn off your automatic configuration. One way to turn off auto updates
is by doing the following:
Automatic updates for your dataset are now paused. You can turn this toggle back on at any time to
resume the update schedule.
When you edit a batch prediction configuration, you can change the target dataset but not the frequency
(since automatic batch predictions occur whenever the dataset is updated).
To pause your automatic batch predictions, turn off your automatic configuration. Use the following
procedure to turn off your configuration:
Automatic batch predictions for your dataset are now paused. You can turn this toggle back on at any
time to resume the update schedule.
To delete a configuration for automatic dataset updates or automatic batch predictions, do the
following:
With Amazon SageMaker Canvas, business analysts using Canvas and data scientists using Amazon
SageMaker Studio can collaborate while working in their own environments, sharing ML models, domain
knowledge, and expert input to improve models.
Using SageMaker Canvas collaboration, you can share Standard build models from Canvas with data
scientists in Studio, who can review them, update them, and share them back with Canvas users. Users in
Canvas can share one version of a model with up to 23 Studio users.
• In the Canvas application, a business analyst shares their model with a Studio user.
• The Studio user receives the shared model in the Studio application. They can choose to share
feedback with the analyst, make updates to the model, or share an alternate model version.
• The business analyst receives the feedback or updated model in Canvas and can generate predictions
in view-only mode.
To collaborate, the Canvas user and Studio user must be in the same Amazon SageMaker Domain. For
more information about setting up your Domain and users, see the SageMaker Canvas Prerequisites.
Note
Model collaboration is different from Bring your own model to SageMaker Canvas (p. 384),
where you can bring a model that you’ve trained anywhere and import it into Canvas for
generating predictions.
Prerequisites
Before a Canvas user and Studio user can collaborate on models, the users' AWS Identity and Access
Management (IAM) roles must have permissions to share models. If you haven't already set up
permissions, see Grant Users Permissions to Collaborate with Studio (p. 282).
The Canvas user must also have a Standard build model trained in Canvas and ready to share.
Note
Collaboration does not support Quick build models.
You should also have the user profile name of the Studio user with whom you want to collaborate. The
Studio user must be in the same Amazon SageMaker Domain as your Canvas user. You can find a user’s
profile name by using the following procedure:
Keep the user profile name ready for the first step of the following tutorial.
To share your Canvas model with Studio users, use the following procedure.
a. From the Choose a model version to share dropdown list, select the model version for which
you want feedback.
b. From the SageMaker Studio users dropdown list, select Studio users by their profile names. You
can add up to 23 Studio users.
c. For the Add a note field, you can enter a quick note that accompanies your model when you
send it to the Studio users.
d. Choose Share.
e. In the Share Model confirmation box that appears, choose Share.
You have now shared your model with the Studio users, and the users receive a notification in Studio that
a model has been shared with them.
Choose View shared models to open the Shared models and notebooks page in Studio. If you miss the
notification, you can find the Shared models and notebooks page by doing the following:
On the Shared models and notebooks page, select the filter Shared with me. You should see the
Canvas model that has been shared with you in the list of shared models. Choose View model on the
shared model, which opens the model details page in Autopilot. The opened model should have a banner
at the top that looks similar to the following screenshot.
From this page, you can see the model details, as well as any notes about the model shared with you by
the Canvas user. In the Canvas banner at the top, you can choose the following actions:
For more information on the preceding actions, see the following sections.
Share feedback
You might want to send a comment or feedback to the Canvas user without making any changes to the
model.
After giving feedback, you can view the feedback you sent in the Canvas banner at the top of the model
details page. The Canvas user receives the feedback in the Canvas application and can make changes
based on your feedback.
You might want to make changes to the model that the Canvas user shared with you. For example, you
might want to use advanced data transformations such as one-hot encoding to improve the accuracy of
the model. You can update the model with Amazon SageMaker Data Wrangler and Amazon SageMaker
Autopilot in Studio, which are features that help you make data transformations and train your model.
Warning
If you exit the following workflow at any time, your model updates are not saved, and you must
restart the workflow.
To update the model and send the updated model to the Canvas user, use the following procedure:
1. On the model details page, in the Canvas banner, choose Update model.
2. In the banner’s dropdown list, choose Update data transformations.
3. The workflow opens your model in Amazon SageMaker Data Wrangler, where you can choose to edit
the data transformations used for the model. Make your data transformations in the Data Wrangler
interface. For more information about Data Wrangler and the data transformations you can use, see
the Data Wrangler documentation.
4. After you’ve finished your data transformations, choose Retrain model on the Canvas banner to
open the Export data and train a model with SageMaker Autopilot page in the Data Wrangler
interface.
5. Verify the fields on the Export data and train a model with SageMaker Autopilot page, and then
choose Export and train to export your data transformations to Amazon SageMaker Autopilot.
6. The workflow opens the Create an Autopilot experiment page in Autopilot, where you can create
an Autopilot experiment and retrain the model with the updated data transformations. Fill out the
fields for each of the Create an Autopilot experiment pages.
For more information about Autopilot and Autopilot experiments, see Create an experiment in the
Autopilot documentation.
7. After you’ve finished configuring your Autopilot experiment and reviewed the final settings, choose
Create experiment in the Autopilot interface to begin training the model. The model trains, during
which you can choose Stop training in the Autopilot interface at any time.
8. After the model has trained, the Canvas banner at the top of the page compares the metrics of the
old model with the updated model. The Best model summary lists the metrics, such as Recall and
Precision, and whether the new model improves the metrics or not. Review the metrics and decide
whether you would like to share the updated model or not. For more information about Autopilot
metrics, see Metrics and validation.
9. If you decide that you want to share the updated model with the Canvas user, choose Share in the
banner.
a. For the Select a model to share dropdown list, the best model from your Autopilot experiment
should already be selected and marked with a label Best Candidate. If the model version that
you want to share is not selected, open the dropdown and select the correct version.
b. For the Add feedback field, you can enter a note for the Canvas user.
c. Choose Share to share the updated model and note with the Canvas user.
After sharing the model, you receive a notification similar to the following screenshot confirming that
your model was shared successfully.
You can choose View shared models in the banner to return to the Shared models and notebooks page.
From this page, you can see the updated model that you shared with the Canvas user under the Shared
by me label.
When SageMaker Canvas builds a model, Amazon SageMaker Autopilot trains multiple versions of
the model and selects the best one. You might decide that an alternate version of the model better suits
your needs. You can share an alternate Autopilot version of the model with the Canvas user instead of
making changes to the one they sent. For more information about Autopilot, see the Autopilot
documentation.
1. On the model details page, in the Canvas banner, choose Update model.
2. In the banner’s dropdown list, choose Recommend an alternate Auto ML candidate.
3. The page for the Autopilot job opens where you can review all of the trained model versions. When
you're ready to share an alternate version, in the Canvas banner at the top of the page, choose
Share.
4. In the Share dialog box, do the following:
a. For the Select a model to share dropdown list, the best model from the Autopilot experiment
is selected and marked with the label Best Candidate. Open the dropdown and select the
alternate model version that you want to share.
b. For the Add feedback field, you can enter a note for the Canvas user.
c. Choose Share to share the alternate model version and note with the Canvas user.
After sharing the model, you receive a notification similar to the following screenshot confirming that
your alternate model was shared successfully.
You can choose View shared models in the banner to return to the Shared models and notebooks page.
From this page, you can see the updated model that you shared with the Canvas user under the Shared
by me label.
In the Canvas app, the notification looks like the following screenshot.
You can choose View update to see the updated model, or you can go to the Models page in the Canvas
application and select the shared model to view it.
Note
Canvas users can’t edit a model that has been shared with them by a Studio user. Models
imported from Studio are view and predict only.
A model on which a Studio user has collaborated looks like the following card on the Models page.
The model import from Studio can take up to 20 minutes, during which the model shows as Importing.
After importing the model, you can view its metrics and generate predictions with it.
The following screenshot shows the Analyze tab, where you can evaluate the model accuracy
and metrics. For more information, see Evaluate Your Model's Performance in Amazon SageMaker
Canvas (p. 351).
The following screenshot shows the Predict tab, where you can generate predictions with the model. For
more information on generating predictions in Canvas, see Make predictions for your data (p. 358).
On both the Analyze and Predict tabs, you can see the Shared History panel, which shows you the
model versions and comments shared with you by Studio users.
Business analysts can benefit from ML models already built by data scientists to solve business problems
instead of creating a new model in Amazon SageMaker Canvas. However, it might be difficult to use
these models outside the environments in which they are built due to technical requirements, rigidity of
tools, and manual processes to import models. This often forces users to rebuild ML models, resulting in
the duplication of effort and additional time and resources.
SageMaker Canvas removes these limitations so you can generate predictions in Canvas with models that
you’ve trained anywhere. You can register ML models in SageMaker Model Registry, which is a metadata
store for ML models, and import them into SageMaker Canvas. Additionally, you can generate predictions
with models that data scientists have trained in Amazon SageMaker Autopilot or SageMaker JumpStart.
Canvas users can then analyze and generate predictions from any model that has been shared with them.
After you’ve satisfied the Prerequisites (p. 385), see the following sections for instructions on how to
bring your own models into Canvas and generate predictions. The workflow begins in Studio, where a
Studio user shares a model with a Canvas user. Then, the Canvas user signs in to their Canvas app to
receive the shared model and generate predictions with it.
Important
You can only share models trained with tabular data. Also, you can't share time series models.
Prerequisites
To bring your model into SageMaker Canvas, complete the following prerequisites:
• You must have an Amazon SageMaker Studio user who has onboarded to an Amazon SageMaker Domain.
The Studio user must be in the same Domain as the Canvas user. Model sharing occurs when a Studio
user shares a model with a Canvas user from within Studio. If you don’t already have a Studio user set
up, see the Studio documentation and Onboard to Amazon SageMaker Domain.
• You must have a trained model from SageMaker Autopilot, SageMaker JumpStart, or SageMaker
Model Registry. For any model that you’ve built outside of SageMaker, you must register your model
in Model Registry before importing it into Canvas. For more information, see the Model Registry
documentation.
• The Canvas user with whom you want to share your model must have permission to access the Amazon
S3 bucket in which you store your datasets and model artifacts. For instructions on how admins
can give Canvas users the permissions they need, see Grant Users Permissions to Collaborate with
Studio (p. 282).
• You should also have the user profile name of the Canvas user with whom you want to collaborate.
The Canvas user must be in the same Amazon SageMaker Domain as your Studio user. You can find a
user’s profile name in the SageMaker console by choosing Domains, selecting your Domain, and
reviewing the User profiles list. Keep the user profile name ready for the first step of the following
tutorial.
If your SageMaker Canvas app is running in a private customer VPC, any Autopilot models shared from
Studio must use Autopilot HPO mode to support generating predictions in Canvas. For more information
about HPO mode, see Training modes and algorithm support in the Autopilot documentation.
Note
If you want feedback from data scientists on a model built inside Canvas, see Collaborate with
data scientists (p. 377), where a Canvas user shares a model with a Studio user, and the Studio
user shares feedback or model updates.
Autopilot
You can share a model to Canvas from Amazon SageMaker Autopilot in Studio. Autopilot is a feature that
enables you to train and deploy your models in SageMaker.
You need to have a Studio user and a trained model ready to share from Autopilot. For more information
on how to set up Studio, see the Studio documentation. For more information about Autopilot, see the
Autopilot documentation.
a. For the Add Canvas users field, enter the Canvas user’s profile name. You can enter up to 23
Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it, you can't
enter the profile name.
b. For the Add a note field, add a description or note for the Canvas user when they receive the
model.
c. Choose Share to share the model.
You have now shared the model with the Canvas user.
JumpStart
You can share a model to Canvas from SageMaker JumpStart in Studio. With JumpStart, you can access
and tune pretrained models before deploying them.
You need to have a Studio user and a successfully completed training job in JumpStart. For more
information about how to set up Studio, see the Studio documentation. For more information about
JumpStart, see the JumpStart documentation.
Note
You can only share tabular models to Canvas. Trying to share a model that is not tabular
throws an Unsupported data type error.
8. In the Share to Canvas dialog box, do the following:
a. For the Add Canvas users to share field, enter the Canvas user’s profile name. You can enter up
to 23 Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it,
you can't enter the profile name.
b. For the Add a note field, add a description or note for the Canvas user when they receive the
model.
c. Choose Share to share the model.
You have now shared the model with the Canvas user.
Model Registry
You can share a model to Canvas from SageMaker Model Registry in Studio. With Model Registry, you can
register models that you bring from outside of SageMaker and integrate them with your ML pipelines.
You need to have a Studio user and a model version saved in the Model Registry. For more information
about how to set up Studio, see the Studio documentation. If you don’t have a model version in the
Model Registry, create a model group and register a version to it. For more information about Model
Registry, see the Model Registry documentation.
To share a model version from Model Registry to Canvas, use the following procedure.
• To share a model version from the model group page, complete the following steps:
1. Choose Versions, and check the box next to the model version you want to share with the
Canvas user. You can only share one model version at a time.
2. In the Actions dropdown menu, choose Share model artifacts.
• To share a model version from the model version page, complete the following steps:
1. Choose Versions, and select the name of the model version you want to share with the
Canvas user. You can only share one model version at a time.
2. In the Actions dropdown menu, choose Share model artifacts.
7. In the Share model dialog box, do the following:
a. For the Add Canvas users to share field, enter the Canvas user’s profile name. You can enter up
to 23 Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it,
you can't enter the profile name.
b. For Add model details, do the following:
i. For the Training dataset field, enter the Amazon S3 path for your training dataset.
ii. For the Validation dataset field, enter the Amazon S3 path for your validation dataset.
iii. For Target column, either select Use the first column if the first column in your dataset
is the target, or select Specify the target column name to set the target as a different
column in your dataset.
iv. For Column headers, select one of the following options:
A. Select Use the first row if the first row of your dataset contains the column headers.
B. Select Specify a different dataset in S3 for column headers if you have a file stored
in Amazon S3 containing headers that can be mapped to your dataset. The headers file
must have the same number of columns as your dataset.
C. Select Automatically generate if you don’t already have column headers and would
like SageMaker to generate generic column names for your dataset.
v. From the Problem type dropdown list, select your model type.
vi. If you selected the Binary classification or Multi-class problem types, the Configure model
outputs option appears.
If you already have a file stored in Amazon S3 that maps default target column class names
to your desired class names, then turn on Model output names and enter the Amazon
S3 path to the mapping file. If you don't have a mapping file, then turn off Model output
names and manually enter the Number of model outputs (the number of target column
classes in your data). Then, enter your desired class names to replace the default class
names.
c. (Optional) For the Add a note field, add a description or note for the Canvas user when they
receive the model.
d. Choose Share to share the model version.
You have now shared the model with the Canvas user.
On the Shared models and notebooks page in Amazon SageMaker Studio, you can view the models
that you've shared and that have been shared with you. This page gives you a central place to view and
manage all of your models in Studio.
You need to have a Studio user and a model ready to share from Autopilot, JumpStart, or Model Registry.
For more information on how to set up Studio, see the Studio documentation. For more information
about the Shared models and notebooks page, see the Shared models and notebooks documentation.
The following example walks you through sharing an Amazon SageMaker Autopilot model, but you can
use the sharing feature on the Shared models and notebooks page to share models from any of the
other features in the previous sections, such as JumpStart and Model Registry.
To share an Autopilot model from the Shared models and notebooks page, use the following procedure.
a. For the Add Canvas users to share field, enter the Canvas user’s profile name. You can enter up
to 23 Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it,
you can't enter the profile name.
b. For the Add a note field, add a description or note for the Canvas user when they receive the
model.
c. Choose Share to share the model.
You have now shared the model with the Canvas user.
After you share the model, you receive a notification popup in Studio similar to the following screenshot.
You can choose View model to open the Shared models and notebooks page in Studio. You can also
view your shared models at any time from the Shared models and notebooks page.
From this page, you can see the models that you’ve shared with the Canvas user under the Shared by me
label, as shown in the following screenshot.
Models that you’ve shared to Canvas have text on the card similar to the following example: Shared
to: 12 Canvas users.
You can choose View update to see the shared model, or you can go to the Models page in the Canvas
application to discover all of the models that have been shared with you.
Note
Canvas users can’t edit a model that has been shared with them by a Studio user. Models
imported from Studio are view and predict only.
A model that has been shared by a Studio user looks like the following card on the Models page. This
is different from Collaborate with data scientists (p. 377), where a Canvas user shares a model and a
Studio user shares updates or feedback with the Canvas user.
The model import from Studio can take up to 20 minutes, during which the model shows as Importing.
After importing the model, you can view its metrics and generate predictions with it. SageMaker Canvas
uses Amazon SageMaker Serverless Inference resources to generate model analysis and predictions for
shared models. You might see costs associated with Serverless Inference in your AWS account.
The following screenshot shows the Analyze tab in the Canvas application for a shared model, where
you can evaluate the model accuracy and metrics. For more information, see Evaluate Your Model's
Performance in Amazon SageMaker Canvas (p. 351).
The following screenshot shows the Predict tab, where you can generate predictions with the model. For
more information on generating predictions in Canvas, see Make predictions for your data (p. 358).
On both the Analyze and Predict tabs, you can see the Shared History panel, which shows you the
model versions and comments shared with you by Studio users.
Logging out
When you log out, your models and datasets aren't affected, but SageMaker Canvas cancels any Quick
build tasks. If you log out of SageMaker Canvas while running a Quick build, your build might be
interrupted until you log back in. When you log back in, SageMaker Canvas automatically restarts the
build.
To log out, choose the Log out button on the left panel of the SageMaker Canvas app.
You can also log out from the SageMaker Canvas app by closing your browser tab and then deleting the
app (p. 284) in the console.
After you log out, SageMaker Canvas tells you to relaunch in a different tab. Logging in takes between
3 minutes and 8 minutes. If you have an administrator who set up SageMaker Canvas for you, use the
instructions they gave you to log back in. If you don't have an administrator, see the procedure for
accessing SageMaker Canvas in Prerequisites for setting up Amazon SageMaker Canvas (p. 260).
Limitations and troubleshooting

To edit the trust relationship for your IAM execution role, update the role's trust policy so that
SageMaker is the only trusted service, as in the following policy document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
You can also update this policy document using the IAM CLI. For more information, see update-trust in
the IAM Command Line Reference.
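As an alternative, the following is a minimal boto3 sketch that applies the same trust policy
programmatically. The role name is a placeholder.

import json

import boto3

iam = boto3.client("iam")

# Trust policy that allows only SageMaker to assume the execution role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.update_assume_role_policy(
    RoleName="MyCanvasExecutionRole",  # placeholder: your execution role name
    PolicyDocument=json.dumps(trust_policy),
)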
Your user should now have an execution role with only one trusted service (SageMaker). You can retry
granting the Canvas base permissions or the Ready-to-use models permissions to your user.
3. Manually attach the AWS managed policy to the execution role instead of
using the toggle in the SageMaker Domain settings.
Instead of using the toggle in the Domain or user profile settings, you can manually attach the AWS
managed policies that grant a user the correct permissions.
To grant a user Canvas base permissions, attach the AmazonSageMakerCanvasFullAccess policy. To grant
a user Ready-to-use models permissions, attach the AmazonSageMakerCanvasAIServicesAccess policy.
Use the following procedure to attach an AWS managed policy to your role:
a. To grant the Canvas base permissions, search for and select the
AmazonSageMakerCanvasFullAccess policy.
b. To grant the Ready-to-use models permissions, search for and select the
AmazonSageMakerCanvasAIServicesAccess policy.
7. Choose Add permissions to attach the policy to the role.
After attaching an AWS managed policy to the user’s role through the IAM console, your user should now
have the Canvas base permissions or Ready-to-use models permissions.
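If you prefer to script this step instead of using the IAM console, the following is a minimal boto3
sketch that attaches the same managed policies; the role name is a placeholder.

import boto3

iam = boto3.client("iam")
role_name = "MyCanvasExecutionRole"  # placeholder: the user's execution role

# Canvas base permissions.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerCanvasFullAccess",
)

# Ready-to-use models permissions.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerCanvasAIServicesAccess",
)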
• You can only share successfully trained models from Canvas to Studio. Similarly, you can only share
models that have been successfully trained in Studio back to Canvas.
• You can’t share Quick build models from Canvas to Studio. You can only share Standard build models.
• You can only share one version of a Standard build model trained in Canvas. You can train additional
versions of your model within Canvas, but you can't share them to Studio.
• From Studio, you can only share feedback or share an updated model with Canvas. You can’t perform
both actions at the same time.
• The length limitation for comments shared from Studio to Canvas and Canvas to Studio is 1024
characters.
• You can only share your Canvas or Studio models with a different user profile. You can’t share models
between Canvas and Studio within your own user profile.
• You can't share from a Canvas user to a Canvas user, or from a Studio user to a Studio user.
There are also limitations that apply depending on the type of model you want to share. See the
following sections for limitations on time series forecasting models and numeric and categorical
prediction models.
• You can’t make predictions with time series forecasting models in Studio through an automated Share
button. However, you can create a Jupyter notebook and write your own code.
• For time series forecasting models, you can’t change the model recipe or data transformations in
Studio. You can only make the following updates to time series forecasting models in Studio:
• You can update the length of the forecast horizon.
• You can update the item's metadata field, which groups your data by a certain column.
• You can update other dimension fields, such as specifying a holiday schedule.
• When updating or training models in Studio, if you close the tab with the collaboration banner at
the top, it ends the share model workflow and you lose your progress. In that case, you must restart
the share model workflow from the Shared With Me section on the Shared Models page. For more
information, see Collaborate with data scientists.
• When updating models in Studio, you can’t change the target column if you want to share the model
updates back to Canvas. If you want to change the target column and re-train the model, train the
model and then use the Share button to share to Canvas. For more information about sharing a new
model to Canvas, see Bring your own model to SageMaker Canvas.
• When updating models in the Amazon SageMaker Data Wrangler Recipe interface in Studio, only some
of the changes that a Studio user can apply are supported by Canvas:
• You can only share a model to Canvas that has been trained from the last node in a Data Wrangler
linear data flow.
• Only transformation nodes are supported.
• You can’t perform operations on the Target column.
• You can’t update the data type of columns.
• You can’t update the data source or add a new data source.
• When sharing an alternative candidate to Canvas from the Studio Autopilot page, you can’t select the
model from the leaderboard. You must choose the shared model from the banner and then select an
alternative from the list. For more information, see Share an alternate model with the Canvas user in
the Canvas documentation.
• Only models that are compatible with SageMaker Neo can be shared back to Canvas successfully.
Compatible models are Autopilot models that use XGBoost or MLP algorithms. Incompatible models
include Autopilot models that use the linear learner algorithm.
• For custom formula transforms using Spark SQL, Canvas only supports Unary operations, Aggregate
functions, the String concatenation operation, and the Power operation. Other operations are not
supported.
• When a model is shared from Studio to Canvas, the Canvas user cannot update or view details on the
dataset that was used to build the model.
• When a Canvas user wants to run a single prediction on an imported model, there are no data type
restrictions when updating column values. You must manually make sure that when you update values
for single predictions, you match the data type of the existing values.
• When a Canvas user wants to run batch predictions on an imported model, Canvas assumes that you
(the Canvas user) know what the expected input dataset should look like. You should have a dataset
with columns and data types that match the dataset that was used to train the model. If not, consult
with the user who shared the model with you and import a dataset that you can use for running batch
predictions.
• The Canvas application internally uses a serverless endpoint to run predictions and generate model
metrics. The model shared to Canvas must be compatible with serverless endpoints (see the boto3
sketch after this list):
• The maximum memory size is 6144 MB.
• When configuring the inference input response keys in your container, use the following
configuration:
INFERENCE_INPUT_RESPONSE_KEYS = {
"BINARY": ["predicted_label", "probability"],
"MULTI_CLASS": ["predicted_label", "probability", "probabilities", "labels"],
}
• You can choose either a SageMaker-provided inference container or bring your own inference
container image to be used for the endpoint. SageMaker provides containers for its built-in
algorithms and prebuilt Docker images for some of the most common machine learning frameworks. If
you are bringing your own container, you must modify it to work with SageMaker. For more information
about bringing your own container, see Adapting Your Own Inference Container.
• The Feature exclusions for serverless endpoints also apply.
• To share a model from Studio to Canvas successfully, Canvas accepts model inference outputs in the
following formats:

TEXT/CSV
• Regression: The model inference response should be a byte string where each of the output
predictions are separated by \n:
b'-0.0007884334772825241\n-0.015136942267417908\n0.050063662230968475\n0.02891816757619381\n'
• Classification: The model inference response should be a byte string where each of
predicted_label, predicted_probability, probabilities, and labels is separated by
\n.
APPLICATION/JSON
• Regression: The model inference response should be a JSON string which contains the prediction
key, and its value should be the list of output predictions:
let response = {
    "predictions": [
        // First instance prediction.
        1.75,
        // Second instance prediction.
        3.25
    ]
}
• Classification: The model inference response should be a JSON string which contains the
probabilities key, and its value should be the list of probabilities. The following example is for
binary classification:

let response = {
    "probabilities": [
        // First instance prediction.
        [0.9, 0.1],
        // Second instance prediction.
        [0.2, 0.8]
    ]
}

The following example is for multi-class classification:

let response = {
    "probabilities": [
        // First instance prediction.
        [0.7, 0.2, 0.1],
        // Second instance prediction.
        [0.2, 0.5, 0.3]
    ]
}
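Returning to the serverless endpoint requirements above, the following is a minimal boto3 sketch of
an endpoint configuration that stays within the 6144 MB limit. The model and endpoint config names
are placeholders.

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="my-shared-model-config",  # placeholder
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-shared-model",  # placeholder: an existing SageMaker model
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,  # maximum memory size compatible with Canvas
                "MaxConcurrency": 1,
            },
        }
    ],
)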
There are also limitations that apply depending on the type of model you want to bring:
• The following are the supported algorithms for which you can import models into Canvas. For more
details, see the SageMaker JumpStart documentation.
• Tabular classification: LightGBM, CatBoost, XGBoost, AutoGluon-Tabular, TabTransformer, Linear
Learner
• Tabular regression: LightGBM, CatBoost, XGBoost, AutoGluon-Tabular, TabTransformer, Linear
Learner
• In SageMaker JumpStart, the Share button is only turned on if the model is ready to share to Canvas.
If your trained model does not have a Share to SageMaker Canvas button, your model is not supported
for BYOM.
• You must provide training and validation datasets when training the SageMaker JumpStart model. The
datasets should be stored in Amazon S3, and your Studio and Canvas users' execution role must have
access to the Amazon S3 location. You can use the same Amazon S3 URIs to share the training and
validation datasets with Canvas, or you can share different datasets with the same data schema.
Your training or validation data file should look like the following (in CSV format). You should index
your files with the first column as the target.
3,1,22,1,1,0,4,4
0,0,38,0,0,1,3,4
1,0,67,0,1,0,1,6
1,0,67,0,0,2,2,6
0,0,40,0,0,2,6,6
2,0,56,1,0,1,2,6
• By default, SageMaker JumpStart uses the first column of the training and validation datasets as the
target when training a model. The target column (or by default, the first column) of the datasets is
shared to Canvas.
• You must provide the column headers of the training and validation datasets when training the
SageMaker JumpStart model. By default, SageMaker JumpStart only accepts datasets without column
headers, so you must add the column headers as a file while training your model. The Amazon S3 URI
for the column headers file is shared to Canvas as well. Your column headers file should look like the
following example (in CSV format), where the column names are illustrative and the first column is
the target:

y,x1,x2,x3,x4,x5,x6,x7
• The training job in SageMaker JumpStart must be Complete before you can share with Canvas.
• For classification problems (or categorical prediction in Canvas), original class names need to be
provided in the Configure model output section when sharing to Canvas. The order of the class names
must match the indexing used in the model. Your mapping relation file should look like the following
example in CSV format, where index 0 (the first index) is mapped to the class name A:
A,B,C,D
When the Canvas user views the model metrics in the Canvas application, they can only see the index
of each class (0, 1, 2). However, the user can see the class names when viewing the results for a single
prediction.
• You can only share models to Canvas that you’ve successfully trained from an AutoML job with
Ensembling, HPO, or Auto mode (for Auto mode, Autopilot chooses Ensembling or HPO mode based
on the training dataset size). The currently supported Autopilot problem types are Regression, Multi-
class classification, and Binary classification.
• For each Autopilot job, you can choose any model (the Best model or any other candidates) to share to
Canvas one at a time. You only need to choose the Share model button and then specify the Canvas
users with whom you’d like to share the model and a note.
• AutoGluon-Tabular models that use Data Wrangler transformers for inference cannot be shared to
Canvas. This is because Data Wrangler transformers cause the model to use more than one container.
• HPO models that aren’t compatible with SageMaker Neo can’t be shared to Canvas successfully.
Compatible models are Autopilot models that use XGBoost or MLP algorithms. Incompatible models
include Autopilot models that use the linear learner algorithm.
• Unlike the Share button provided by SageMaker JumpStart, Model Registry doesn’t provide model
validation, so it’s possible that a registered model shared successfully from Studio can fail while
importing to Canvas due to model incompatibility. Review the following tips before sharing to Canvas
from Model Registry:
• Use a single inference container for your model. You can register models with multiple containers
within the AdditionalInferenceSpecifications field, but Canvas is only optimized for one inference
container per model. For example, when you use an inference pipeline and register multiple
containers in the AdditionalInferenceSpecifications field with multiple data preprocessing
containers and an inference container, by default the first container is selected for model inference
in Canvas. Evaluate whether this works for your use case if you're using machine learning pipelines.
• Use a SageMaker built-in tabular algorithm with compatible inference formats. Tested sample
algorithms with compatible inference outputs are Autogluon-Tabular, CatBoost, LightGBM,
TabTransformer and XGBoost. Algorithms like Factorization Machines don't accept CSV as file input,
and the inference output formats for algorithms like Linear Learner and K-NN are not supported by
Canvas.
• You can also bring your own image container and share to Canvas, or modify pre-built SageMaker
containers.
• If you are bringing your own container, you must modify it to work with SageMaker. For more
information about bringing your own container, see Adapting Your Own Inference Container.
• For detailed formatting for your inference output formats, see Limitations for bring your own
model (BYOM) (p. 396).
• When registering your model in a model package group, remember to provide the following attributes
with your inference container (see the sketch after this list). Note that Image takes the container
image URI and ModelDataUrl takes the Amazon S3 path to the model artifact:
• Environment
• Image:
"<account-id>.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1"
• ModelDataUrl:
"s3://sagemaker-us-west-2-<account-id>/model-regression-abalone-2022-10-14-23-02-45/model.tar.gz"
• You must provide training and validation datasets when sharing the model from Model Registry to
Canvas. The datasets should be stored in Amazon S3, and the Studio and Canvas users' execution
role must have access to the Amazon S3 location. You can use the same Amazon S3 URIs to share
the training and validation datasets with Canvas, or you can share different datasets with the same
data schema. The datasets must have the exact input formatting that feeds your model’s inference
container.
• You must provide the target column to Canvas, or the first column of your training/validation dataset
is used by default.
• In the Add model details section when sharing to Canvas, you can provide the first row of your
training and validation datasets as the headers, or you can specify the headers as a different file.
• For classification problems (or categorical prediction in Canvas), original class names need to be
provided when sharing to SageMaker Canvas through the Configure model outputs option. The order
of the class names must match the indexing used with the shared model. The mapping can be either a
CSV file in Amazon S3, or you can manually input the class names.
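For illustration, the following is a minimal boto3 sketch of registering a model version with a
single inference container that carries these attributes. The group name is a placeholder, and the
image and model data URIs reuse the example values above.

import boto3

sm = boto3.client("sagemaker")

sm.create_model_package(
    ModelPackageGroupName="my-model-group",  # placeholder
    InferenceSpecification={
        "Containers": [
            {
                "Image": "<account-id>.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1",
                "ModelDataUrl": "s3://sagemaker-us-west-2-<account-id>/model-regression-abalone-2022-10-14-23-02-45/model.tar.gz",
                "Environment": {},  # container environment variables, if any
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    ModelApprovalStatus="Approved",
)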
Manage billing and cost

• Workspace instance charges – You are charged for the number of hours that you are logged in to or
using SageMaker Canvas.
• AWS service charges – You are charged for building and making predictions with custom models, or for
making predictions with Ready-to-use models:
• Training charges – You are charged for the resources used to build a custom model.
• Prediction charges – You are charged for the resources used to generate predictions, depending on
the type of custom model that you built or the type of Ready-to-use model you used.
The Ready-to-use models (p. 289) in Canvas leverage other AWS services to generate predictions. When
you use a Ready-to-use model, you are charged by the respective service, and their pricing conditions
apply:
• For sentiment analysis, entities extraction, language detection, and personal information detection,
you’re charged according to Amazon Comprehend pricing.
• For object detection in images and text detection in images, you’re charged according to Amazon
Rekognition pricing.
• For expense analysis, identity document analysis, and document analysis, you’re charged according to
Amazon Textract pricing.
To help you track your costs in Billing and Cost Management, you can assign custom tags to your
SageMaker Canvas app and users. You can track the costs your apps incur, and by tagging individual user
profiles, you can track costs based on the user profile. For more information about tags, see Using Cost
Allocation Tags.
You can add tags to your SageMaker Canvas app and users by doing the following:
• If you are setting up your Amazon SageMaker Domain and SageMaker Canvas for the first time,
follow the Getting Started instructions and add tags when creating your Domain or users. You can
add tags either through the General settings in the Domain console setup, or through the APIs
(CreateDomain or CreateUserProfile). SageMaker adds the tags specified in your Domain or UserProfile
to any SageMaker Canvas apps or users you create after you create the Domain.
• If you want to add tags to apps in an existing Domain, you must add tags to either the Domain or the
UserProfile. You can add tags through either the console or the AddTags API (see the sketch after
this list). If you add tags through the console, then you must delete and relaunch your SageMaker
Canvas app in order for the tags to propagate to the app. If you use the API, the tags are added
directly to the app. For more information about deleting and relaunching a SageMaker Canvas app, see
Manage apps.
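The following is a minimal sketch of tagging an existing user profile with the AddTags API; the ARN
and tag values are placeholders.

import boto3

sm = boto3.client("sagemaker")

# Tag an existing user profile so that Canvas costs can be tracked per user.
sm.add_tags(
    ResourceArn="arn:aws:sagemaker:us-west-2:111122223333:user-profile/d-xxxxxxxx/my-user",  # placeholder
    Tags=[{"Key": "cost-center", "Value": "analytics-team"}],  # placeholder tag
)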
After you add tags to your Domain, it might take up to 24 hours for the tags to appear in the AWS Billing
and Cost Management console for activation. After they appear in the console, it takes another 24 hours
for the tags to activate.
On the Cost explorer page, you can group and filter your costs by tags and usage types to separate your
Workspace instance (Session-Hrs) charges from your Training charges.
Amazon SageMaker geospatial capabilities

You can use SageMaker geospatial capabilities to make predictions on geospatial data faster than
do-it-yourself solutions. SageMaker geospatial capabilities make it easier to access geospatial data
from your existing customer data lakes, open-source datasets, and other SageMaker geospatial data
providers. SageMaker geospatial capabilities minimize the need for building custom infrastructure
and data preprocessing functions by offering purpose-built algorithms for efficient data preparation,
model training, and inference. You can also create and share custom visualizations and data with your
company from Amazon SageMaker Studio. SageMaker geospatial capabilities offer pre-trained models
for common uses in agriculture, real estate, insurance, and financial services.
Note
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region.
To view Amazon SageMaker geospatial capabilities, choose the name of the currently displayed
Region in the navigation bar of the console. Then choose the US West (Oregon) Region.
You can access SageMaker geospatial capabilities in two ways:

• Through the SageMaker geospatial UI, as a part of the Amazon SageMaker Studio UI.
• Through SageMaker notebooks with a SageMaker geospatial image.
Geospatial data represents features or objects on the earth’s surface. The first type of geospatial data is
vector data, which uses two-dimensional geometry such as points, lines, or polygons to represent objects
such as roads and land boundaries. The second type of geospatial data is raster data, such as imagery
captured by satellite, aerial platforms, or remote sensing data. This data type uses a matrix of pixels to
define where features are located, and you can use raster formats for storing data that varies
continuously over space. A third type of geospatial data is geo-tagged location data. It includes points
of interest (for example, the Eiffel Tower), location-tagged social media posts, latitude and longitude
coordinates, or different styles and formats of street addresses. SageMaker offers several geospatial
capabilities, which are described in the topics that follow.
Along with this, you can access data from a catalog of geospatial data providers. Currently, the data
collections available include:
• USGS Landsat
• Sentinel-2
Topics
• Getting Started with Amazon SageMaker geospatial capabilities (p. 402)
• Earth Observation Jobs (p. 405)
• Vector Enrichment Jobs (p. 412)
• Visualization Using SageMaker geospatial capabilities (p. 413)
• Amazon SageMaker geospatial Map SDK (p. 418)
• SageMaker geospatial capabilities FAQ (p. 423)
• SageMaker geospatial Security and Permissions (p. 424)
Getting Started with Amazon SageMaker geospatial capabilities

To use SageMaker geospatial capabilities, you need to have an AWS account. If you already have an AWS
account, skip this step.
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.
When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.
AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view
your current account activity and manage your account by going to https://aws.amazon.com/ and
choosing My Account.
1. Sign in to the AWS Management Console as the account owner by choosing Root user and entering
your AWS account email address. On the next page, enter your password.
For help signing in by using root user, see Signing in as the root user in the AWS Sign-In User Guide.
2. Turn on multi-factor authentication (MFA) for your root user.
For instructions, see Enable a virtual MFA device for your AWS account root user (console) in the IAM
User Guide.
• For your daily administrative tasks, grant administrative access to an administrative user in AWS IAM
Identity Center (successor to AWS Single Sign-On).
For instructions, see Getting started in the AWS IAM Identity Center (successor to AWS Single Sign-On)
User Guide.
• To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email
address when you created the IAM Identity Center user.
For help signing in using an IAM Identity Center user, see Signing in to the AWS access portal in the
AWS Sign-In User Guide.
Within the Studio UI, choose Geospatial under Data from the left navigation panel on the Home menu.
1. From the Launcher, choose Change environment under Notebooks and compute resources.
2. Next, the Change environment dialog opens.
3. Select the Image dropdown and choose Geospatial 1.0. The Instance type should be
ml.geospatial.interactive. Do not change the default values for other settings.
4. Choose Select.
5. Choose Create notebook.
Earth Observation Jobs

The instance type is determined by the operation that you run. The following table shows the instance
type for each operation.
Operations      Instance
Resampling      ml.geospatial.jobs
Geomosaic       ml.geospatial.jobs
You are charged different rates for each type of compute instance you use. See Geospatial ML with
Amazon SageMaker for more information on pricing.
Topics
• Create an Earth Observation Job Using an Amazon SageMaker Studio Notebook with a SageMaker
geospatial Image (p. 405)
• Types of Operations (p. 409)
• Data Collections (p. 411)
1. From the Launcher, choose Change environment under Notebooks and compute resources.
2. Next, the Change environment dialog opens.
3. Select the Image dropdown and choose Geospatial 1.0. The Instance type should be
ml.geospatial.interactive. Do not change the default values for other settings.
4. Choose Select.
You can initiate an EOJ using an Amazon SageMaker Studio notebook with a SageMaker geospatial image,
using the code provided below.
import boto3
import sagemaker
import sagemaker_geospatial_map
session = boto3.Session()
execution_role = sagemaker.get_execution_role()
sg_client = session.client(service_name="sagemaker-geospatial")
The following is an example showing how to create an EOJ in the US West (Oregon) Region.
tci_urls = []
data_manifests = []
# search_rdc_args holds the raster data collection Arn and the
# RasterDataCollectionQuery (area of interest, time range, and filters).
while search_rdc_args.get("NextToken", True):
    search_result = sg_client.search_raster_data_collection(**search_rdc_args)
    if search_result.get("NextToken"):
        data_manifests.append(search_result)
    for item in search_result["Items"]:
        tci_url = item["Assets"]["visual"]["Href"]
        print(tci_url)
        tci_urls.append(tci_url)
    search_rdc_args["NextToken"] = search_result.get("NextToken")
# Perform land cover segmentation on images returned from the sentinel dataset.
eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [
                        [
                            [-114.529, 36.142],
                            [-114.373, 36.142],
                            [-114.373, 36.411],
                            [-114.529, 36.411],
                            [-114.529, 36.142],
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2021-01-01T00:00:00Z",
            "EndTime": "2022-07-10T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 1}}}],
            "LogicalOperator": "AND",
        },
    }
}
eoj_config = {"LandCoverSegmentationConfig": {}}

response = sg_client.start_earth_observation_job(
    Name="lake-mead-landcover",
    InputConfig=eoj_input_config,
    JobConfig=eoj_config,
    ExecutionRoleArn=execution_role,
)
After your EOJ is created, the Arn is returned to you. You use the Arn to identify
a job and perform further operations. To get the status of a job, you can run
sg_client.get_earth_observation_job(Arn = response['Arn']).
The following example shows how to query the status of an EOJ.
eoj_arn = response["Arn"]
job_details = sg_client.get_earth_observation_job(Arn=eoj_arn)
{k: v for k, v in job_details.items() if k in ["Arn", "Status", "DurationInSeconds"]}
# List all jobs in the account
sg_client.list_earth_observation_jobs()["EarthObservationJobSummaries"]
After the EOJ is completed, you can visualize the EOJ outputs directly in the notebook. The following
example shows you how an interactive map can be rendered.
map = sagemaker_geospatial_map.create_map({"is_raster": True})
map.set_sagemaker_geospatial_client(sg_client)

# Render the map.
map.render()
The following example shows how the map can be centered on an area of interest and the input and
output of the EOJ can be rendered as separate layers within the map.
# Visualize input.
time_range_filter = {
    "start_date": "2022-07-01T00:00:00Z",
    "end_date": "2022-07-10T23:59:59Z",
}
config = {"label": "Input"}
input_layer = map.visualize_eoj_input(
    Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)

# Visualize output. The EOJ needs to be in the completed status.
time_range_filter = {
    "start_date": "2022-07-01T00:00:00Z",
    "end_date": "2022-07-10T23:59:59Z",
}
config = {"preset": "singleBand", "band_name": "mask"}
output_layer = map.visualize_eoj_output(
    Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)
You can use the export_earth_observation_job function to export the EOJ results to your Amazon
S3 bucket. The export function makes it convenient to share results across teams. SageMaker also
simplifies dataset management. You can share the EOJ results using the job ARN instead of crawling
thousands of files in the S3 bucket. Each EOJ becomes an asset in the data catalog, as results can
be grouped by the job ARN. The following example shows how you can export the results of an EOJ.
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()  # Replace with your own bucket if needed
s3_bucket = session.resource("s3").Bucket(s3_bucket_name)
prefix = "eoj_lakemead"  # Replace with the S3 prefix desired
export_bucket_and_key = f"s3://{s3_bucket_name}/{prefix}/"

# Start the export, which ships the EOJ results to the Amazon S3 location above.
sg_client.export_earth_observation_job(
    Arn=eoj_arn,
    ExecutionRoleArn=execution_role,
    OutputConfig={"S3Data": {"S3Uri": export_bucket_and_key}},
)
You can monitor the status of your export job using the following snippet.
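A minimal sketch, assuming the export call above succeeded; the export status is reported on the EOJ
itself:

job_details = sg_client.get_earth_observation_job(Arn=eoj_arn)
job_details["ExportStatus"]  # IN_PROGRESS, SUCCEEDED, or FAILED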
You are not charged the storage fees after you delete the EOJ.
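For example, assuming the job ARN from earlier:

# Deleting the EOJ also removes its stored results.
sg_client.delete_earth_observation_job(Arn=eoj_arn)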
For an example that showcases how to run an EOJ, see this blog post.
For more example notebooks on SageMaker geospatial capabilities, see this GitHub repository.
Types of Operations
When you create an EOJ, you select an operation based on your use case. Amazon SageMaker geospatial
capabilities provide a combination of purpose-built operations and pre-trained models. You can use
these operations to understand the impact of environmental changes and human activities over time or
identify cloud and cloud-free pixels.
Cloud Masking
Identifying clouds in satellite images is an essential pre-processing step in producing high-quality
geospatial data. Ignoring cloud pixels can lead to errors in analysis, and over-detection of cloud
pixels can decrease the number of valid observations. Cloud masking identifies cloudy and cloud-free
pixels in satellite images. An accurate cloud mask helps you select usable satellite images for
processing and improves downstream data generation. The following is the class map for cloud masking.
{
0: "No_cloud",
1: "cloud"
}
Cloud Removal
Cloud removal for Sentinel-2 data uses an ML-based semantic segmentation model to identify clouds
in the image. Cloudy pixels can be replaced with pixels from other timestamps. USGS Landsat data
contains Landsat metadata that is used for cloud removal.
Temporal Statistics
Temporal statistics calculate statistics for geospatial data through time. The temporal statistics
currently supported include mean, median, and standard deviation. You can calculate these statistics
by using GROUPBY and setting it to either all or yearly. You can also specify the TargetBands.
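As a sketch, a temporal statistics job configuration might look like the following; the field values
shown are assumptions chosen to illustrate the shape of the request:

eoj_config = {
    "TemporalStatisticsConfig": {
        "GroupBy": "YEARLY",      # or "ALL"
        "Statistics": ["MEAN"],   # MEAN, MEDIAN, STANDARD_DEVIATION
        "TargetBands": ["red"],   # optional: restrict to specific bands
    }
}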
Zonal Statistics
Zonal statistics performs statistical operations over a specified area on the image.
Resampling
Resampling is used to upscale and downscale the resolution of a geospatial image. The value attribute
in resampling represents the length of a side of the pixel.
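A sketch of a resampling configuration, with assumed values:

eoj_config = {
    "ResamplingConfig": {
        # Side length of the output pixel, in the given unit.
        "OutputResolution": {"UserDefined": {"Value": 20.0, "Unit": "METERS"}},
    }
}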
Geomosaic

Geomosaic stitches multiple geospatial images together into a single mosaic image.
Band Stacking
Band stacking takes more than one image band as input and stacks them into a single GeoTIFF. The
OutputResolution attribute determines the resolution of the output image. Based on the resolutions
of the input images, you can set it to lowest, highest or average.
Band Math
Band Math, also known as Spectral Index, is a process of transforming the observations from multiple
spectral bands to a single band, indicating the relative abundance of features of interest. For instance,
Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) are helpful for
observing the presence of green vegetation features.
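For example, NDVI is computed per pixel from the near-infrared (NIR) and red bands as
NDVI = (NIR − Red) / (NIR + Red), yielding values between −1 and 1, where higher values indicate
denser green vegetation.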
Land Cover Segmentation

Land Cover segmentation is a semantic segmentation model that has the capability to identify the
physical material, such as vegetation, water, and bare ground, at the earth's surface. Having an accurate
way to map the land cover patterns helps you understand the impact of environmental change and
human activities over time. Land Cover segmentation is often used for region planning, disaster
response, ecological management, and environmental impact assessment. The following is the class map
for Land Cover segmentation.
{
0: "No_data",
1: "Saturated_or_defective",
2: "Dark_area_pixels",
3: "Cloud_shadows",
4: "Vegetation",
5: "Not_vegetated",
6: "Water",
7: "Unclassified",
8: "Cloud_medium_probability",
9: "Cloud_high_probability",
10: "Thin_cirrus",
11: "Snow_ice"
}
Operation: Land Cover Segmentation
Description: Identify land cover types such as vegetation and water in satellite imagery.
Interfaces: UI, Notebook
Data Collections
Amazon SageMaker geospatial provides the following data collections to create an EOJ.
• USGS Landsat
• Sentinel-2
The image band information for these data collections is provided below.
USGS Landsat

[Band table: band name, wavelength range (nm), units, valid range, fill value, spatial resolution]

Sentinel-2

[Band table: band name, wavelength range (nm), scale, valid range, fill value, spatial resolution]
Vector Enrichment Jobs

Reverse Geocoding

With a reverse geocoding VEJ, you can convert geographic coordinates (latitude, longitude) to human-
readable addresses powered by Amazon Location Service. When you upload a CSV file containing
the longitude and latitude coordinates, it returns the address number, country, label, municipality,
neighborhood, postal code, and region of that location. The output file consists of your input data
along with columns containing these values appended at the end. These jobs are optimized to accept
tens of thousands of GPS traces.
Map Matching
Map matching allows you to snap GPS coordinates to road segments. The input should be a CSV file
containing the trace ID (route), longitude, latitude, and timestamp attributes. There can be multiple
GPS coordinates per route, and the input can contain multiple routes. The output is a GeoJSON file
that contains links of the predicted route. It also has the snap points provided in the input. These jobs
are optimized to accept tens of thousands of drives in one request. Map matching is supported by
OpenStreetMap. Map matching fails if the names in the input source field don't match the ones in
MapMatchingConfig. The error message you receive contains the field names present in the input
file and the expected field name that is not found in MapMatchingConfig.
While you need to use an Amazon SageMaker Studio notebook to execute a VEJ, you can view all the
jobs you create using the UI. To use the visualization in the notebook, you first need to export your
output to your S3 bucket. The VEJ actions you can perform are as follows (a sketch of starting a job
follows this list).
• StartVectorEnrichmentJob
• GetVectorEnrichmentJob
• ListVectorEnrichmentJobs
• StopVectorEnrichmentJob
• DeleteVectorEnrichmentJob
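As a sketch, starting a reverse geocoding VEJ from a notebook might look like the following. The S3
URI and the attribute names are assumptions; match them to your input file.

response = sg_client.start_vector_enrichment_job(
    Name="reverse-geocode-gps-points",  # placeholder name
    ExecutionRoleArn=execution_role,
    InputConfig={
        "DocumentType": "CSV",
        "DataSourceConfig": {"S3Data": {"S3Uri": "s3://amzn-s3-demo-bucket/gps-points.csv"}},  # placeholder
    },
    JobConfig={
        "ReverseGeocodingConfig": {
            "XAttributeName": "longitude",  # assumption: your longitude column name
            "YAttributeName": "latitude",   # assumption: your latitude column name
        }
    },
)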
Visualization Using SageMaker geospatial capabilities
You can use the left navigation panel to add data, layers, filters, and columns. You can also make
modifications to how you interact with the map.
Dataset
The source of data used for visualization is called a Dataset. To add data for visualization, choose Add
Data in the left navigation panel. You can either upload the data from your Amazon S3 bucket or your
local machine. The data formats supported are CSV, JSON and GeoJSON. You can add multiple datasets
to your map. After you upload the dataset, you can see it loaded on the map screen.
Layers
In the layer panel, a layer is created and populated automatically when you add a dataset. If your map
consists of more than one dataset, you can select which dataset belongs to a layer. You can create
new layers and group them. SageMaker geospatial capabilities support various layer types, including
point, arc, icon, and polygon.
You can choose any data point in a layer to have an Outline. You can also further customize the data
points. For example, you can choose the layer type as Point and then Fill Color based on any column of
your dataset. You can also change the radius of the points.
The following image shows the layers panel supported by SageMaker geospatial capabilities.
Columns
You can view the columns present in your dataset by using the Columns tab in the left navigation panel.
Filters
You can use filters to limit the data points that display on the map.
Interactions
In the Interactions panel, you can customize how you interact with the map. For example, you can
choose what metrics to display when you hover the tooltip over a data point.
Base map
You can have a Single Map, Dual Maps or Swipe Maps. With Dual Maps, you can compare the same
map side-by-side using different layers. Use Swipe Maps to overlay two maps on each other and use
the sliding separator to compare them. You can choose the split map mode by choosing the Split Mode
button on the top right corner of your map.
Spectral Index
When you visualize the output for an EOJ that uses the spectral index operation, you can map the
category based on the color from the legend as shown.
Cloud Masking
When you visualize the output for an EOJ that uses the cloud masking operation, you can map the
category based on the color from the legend as shown.
When you visualize the output for an EOJ that uses the Land Cover Segmentation operation, you can
map the category based on the color from the legend as shown.
Amazon SageMaker geospatial Map SDK

You can use the APIs provided by the SageMaker geospatial map SDK to visualize your geospatial data,
including the input, output, and AoI for EOJ.
Topics
• add_dataset API (p. 418)
• update_dataset API (p. 419)
• add_layer API (p. 420)
• update_layer API (p. 421)
• visualize_eoj_aoi API (p. 422)
• visualize_eoj_input API (p. 422)
• visualize_eoj_output API (p. 423)
add_dataset API
Adds a raster or vector dataset object to the map.
Request syntax
Request =
add_dataset(
    self,
    dataset: Union[Dataset, Dict, None] = None,
    *,
    auto_create_layers: bool = True,
    center_map: bool = True,
    **kwargs: Any,
) -> Optional[Dataset]
Request parameters
Positional arguments
Keyword arguments
Response
This API returns the Dataset object that was added to the map.
update_dataset API
Updates an existing dataset's settings.
Request syntax
Request =
update_dataset(
self,
dataset_id: str,
Request parameters
Positional arguments
Keyword arguments
Response
This API returns the updated dataset object for interactive maps, or None for non-interactive HTML
environments.
add_layer API
Adds a new layer to the map. This function requires at least one valid layer configuration.
Request syntax
Request =
add_layer(
self,
layer: Union[LayerCreationProps, dict, None] = None,
**kwargs: Any
) -> Layer
Request parameters
Arguments
Response
update_layer API
Update an existing layer with given values.
Request syntax
Request =
update_layer(
self,
layer_id: str,
values: Union[LayerUpdateProps, dict, None],
**kwargs: Any
) -> Layer
Request parameters
Arguments
Keyword arguments
Response
visualize_eoj_aoi API
Visualize the AoI of the given job ARN.
Request parameters
Arguments
Response
visualize_eoj_input API
Visualize the input of the given EOJ ARN.
Request parameters
Arguments
Response
visualize_eoj_output API
Visualize the output of the given EOJ ARN.
Request parameters
Arguments
Response
To learn more about visualizing your geospatial data, refer to Visualization Using Amazon SageMaker
geospatial.
SageMaker geospatial capabilities FAQ

1. In which AWS Regions are SageMaker geospatial capabilities available?

Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region. To
view SageMaker geospatial, choose the name of the currently displayed Region in the navigation bar
of the console. Then choose the US West (Oregon) Region.
2. How can I set up a user role and an execution role to get started with SageMaker geospatial
capabilities?
As a managed service, SageMaker geospatial capabilities perform operations on your behalf on the
AWS hardware managed by SageMaker. It can only perform the operations that the user permits. To
work with SageMaker geospatial capabilities, you need to setup a user role and an execution role. See
SageMaker geospatial capabilities roles to learn more.
3. Can I use SageMaker geospatial capabilities through my VPC environment?
No, currently SageMaker geospatial capabilities only support a public internet environment.
4. Why can't I see the SageMaker geospatial UI link when I navigate to Amazon SageMaker Studio?

Verify that you are launching Amazon SageMaker Studio in the US West (Oregon) Region and that you
are not in a VPC-only environment or in a shared space environment.
5. How do I create a notebook job in Studio?
See Schedule a notebook job to learn how to create and manage your notebook jobs. Make sure you
are using the latest JupyterLab version.
6. Which bands are supported for the various raster data collections?
Use the GetRasterDataCollection API response and refer to the ImageSourceBands field to find
the bands supported for that particular data collection.
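For example, with the AWS CLI (a sketch; the data collection ARN is a placeholder, and the --query path assumes ImageSourceBands is a top-level response field):

aws sagemaker-geospatial get-raster-data-collection \
    --arn "arn:aws:sagemaker-geospatial:us-west-2:aws:raster-data-collection/public/example" \
    --query "ImageSourceBands"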
7. Can I use SageMaker geospatial capabilities if my browser does not have an internet connection?
No. You cannot access the list of EOJs or VEJs, or the map visualization, from the UI if your browser does not have an internet connection.
Security and Permissions
For more information about IAM users and roles, see Identities (Users, Groups, and Roles) in the IAM User
Guide.
To learn more about using IAM with SageMaker, see Identity and Access Management for Amazon
SageMaker (p. 3048).
Topics
• Configuration and Vulnerability Analysis in SageMaker geospatial (p. 425)
• Security Best Practices for SageMaker geospatial capabilities (p. 425)
• Use Amazon SageMaker geospatial capabilities in Your Amazon Virtual Private Cloud (p. 426)
• Use AWS KMS Permissions for Amazon SageMaker geospatial capabilities (p. 427)
Security Best Practices for SageMaker geospatial capabilities
Amazon SageMaker geospatial capabilities provide granular access policies for applications using IAM roles. We recommend that the roles be granted only the minimum set of privileges required by the job. We also recommend auditing the jobs' permissions on a regular basis and upon any change to your application.
Administrators should strictly control Role-based access control (RBAC) permissions for Amazon
SageMaker geospatial capabilities.
Where possible, use temporary credentials instead of long-term credentials, such as access keys.
For scenarios in which you need IAM users with programmatic access and long-term credentials, we
recommend that you rotate access keys. Regularly rotating long-term credentials helps you familiarize
yourself with the process. This is useful in case you are ever in a situation where you must rotate
credentials, such as when an employee leaves your company. We recommend that you use IAM access
last used information to rotate and remove access keys safely. For more information, see Rotating access
keys and Security best practices in IAM.
AWS CloudTrail records API calls made in your AWS account. Calls are logged whenever anyone uses the Amazon SageMaker geospatial capabilities API, the Amazon SageMaker geospatial capabilities console, or Amazon SageMaker geospatial capabilities AWS CLI commands. Enable logging
and specify an Amazon S3 bucket to store the logs.
Configuration and Vulnerability Analysis in SageMaker geospatial
Your trust, privacy, and the security of your content are our highest priorities. We implement responsible
and sophisticated technical and physical controls designed to prevent unauthorized access to, or
disclosure of, your content and ensure that our use complies with our commitments to you. For more
information, see AWS Data Privacy FAQ.
Use Amazon SageMaker geospatial capabilities in Your Amazon Virtual Private Cloud
You can change this behavior so that SageMaker sends all traffic over your specified Amazon VPC. If VPC only was chosen as the network access mode during SageMaker Domain creation, the following requirements must be met to allow the use of SageMaker Studio notebooks within the created SageMaker Domain.
1. You must use private subnets only. You cannot use public subnets in VpcOnly mode.
2. Ensure your subnets have the required number of IP addresses needed. The expected number
of IP addresses needed per user can vary based on use case. We recommend between 2 and 4 IP
addresses per user. The total IP address capacity for a Studio domain is the sum of available IP
addresses for each subnet provided when the domain is created. Ensure that your estimated IP
address usage does not exceed the capacity supported by the number of subnets you provide.
Additionally, using subnets distributed across many availability zones can aid in IP address
availability. For more information, see VPC and subnet sizing for IPv4.
Note
You can configure only subnets with a default tenancy VPC in which your instance runs on
shared hardware. For more information on the tenancy attribute for VPCs, see Dedicated
Instances.
3. Set up one or more security groups with inbound and outbound rules that together allow the
following traffic:
• NFS traffic over TCP on port 2049 between the domain and the Amazon EFS volume.
• TCP traffic within the security group. This is required for connectivity between the JupyterServer
app and the KernelGateway apps. You must allow access to at least ports in the range
8192-65535.
4. If you want to allow internet access, you must use a NAT gateway with access to the internet, for
example through an internet gateway.
5. If you don't want to allow internet access, create interface VPC endpoints (AWS PrivateLink) to
allow Studio to access the following services with the corresponding service names. You must also
associate the security groups for your VPC with these endpoints.
Note
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon)
Region.
If you use the SageMaker Python SDK to run remote training jobs, you must also create the
following Amazon VPC endpoints.
Note
For a customer working within VPC mode, company firewalls can cause connection issues with
SageMaker Studio or between JupyterServer and the KernelGateway. Make the following checks
if you encounter one of these issues when using SageMaker Studio from behind a firewall.
Use AWS KMS Permissions for Amazon SageMaker geospatial capabilities
For more information, see Customer managed keys in the AWS Key Management Service Developer Guide.
Follow the steps for Creating symmetric encryption KMS keys in the AWS Key Management Service
Developer Guide.
Key policy
Key policies control access to your customer managed key. Every customer managed key must have
exactly one key policy, which contains statements that determine who can use the key and how they can
use it. When you create your customer managed key, you can specify a key policy. For more information,
see Determining access to AWS KMS keys in the AWS Key Management Service Developer Guide.
To use your customer managed key with your SageMaker geospatial capabilities resources, the following
API operations must be permitted in the key policy. The principal for these operations should be the
Execution Role you provide in the SageMaker geospatial capabilities request. SageMaker geospatial
capabilities assumes the provided Execution Role in the request to perform these KMS operations.
• kms:CreateGrant
• kms:GenerateDataKey
• kms:Decrypt
• kms:GenerateDataKeyWithoutPlaintext
The following are policy statement examples you can add for SageMaker geospatial capabilities:
CreateGrant
"Statement" : [
{
"Sid" : "Allow access to Amazon SageMaker geospatial capabilities",
"Effect" : "Allow",
"Principal" : {
"AWS" : "<Customer provided Execution Role ARN>"
},
"Action" : [
"kms:CreateGrant",
"kms:Decrypt",
"kms:GenerateDataKey",
"kms:GenerateDataKeyWithoutPlaintext"
],
"Resource" : "*",
},
]
For more information about specifying permissions in a policy, see AWS KMS permissions in the AWS Key
Management Service Developer Guide. For more information about troubleshooting, see Troubleshooting
key access in the AWS Key Management Service Developer Guide.
If your key policy does not name your account root as a key administrator, you must add the same KMS permissions to your execution role ARN. Here is a sample policy you can add to the execution role:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"kms:CreateGrant",
"kms:Decrypt",
"kms:GenerateDataKey",
"kms:GenerateDataKeyWithoutPlaintext"
],
"Resource": [
"<KMS key Arn>"
],
"Effect": "Allow"
}
]
}
Select a tab in the following table to see examples of AWS CloudTrail events to monitor KMS operations
called by SageMaker geospatial capabilities to access data encrypted by your customer managed key.
CreateGrant
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAIGDTESTANDEXAMPLE:SageMaker-Geospatial-StartEOJ-KMSAccess",
        "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole/SageMaker-Geospatial-StartEOJ-KMSAccess",
        "accountId": "111122223333",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE3",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AKIAIOSFODNN7EXAMPLE3",
                "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole",
                "accountId": "111122223333",
                "userName": "SageMakerGeospatialCustomerRole"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-17T18:02:06Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "arn:aws:iam::111122223333:root"
    },
    "eventTime": "2023-03-17T18:02:06Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "CreateGrant",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "172.12.34.56",
    "userAgent": "ExampleDesktop/1.0 (V1; OS)",
    "requestParameters": {
        "retiringPrincipal": "sagemaker-geospatial.us-west-2.amazonaws.com",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE",
        "operations": [
            "Decrypt"
        ],
        "granteePrincipal": "sagemaker-geospatial.us-west-2.amazonaws.com"
    },
    "responseElements": {
        "grantId": "0ab0ac0d0b000f00ea00cc0a0e00fc00bce000c000f0000000c0bc0a0000aaafSAMPLE",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
    },
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": false,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
GenerateDataKey
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AWSService",
        "invokedBy": "sagemaker-geospatial.amazonaws.com"
    },
    "eventTime": "2023-03-24T00:29:45Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "GenerateDataKey",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "sagemaker-geospatial.amazonaws.com",
    "userAgent": "sagemaker-geospatial.amazonaws.com",
    "requestParameters": {
        "encryptionContext": {
            "aws:s3:arn": "arn:aws:s3:::axis-earth-observation-job-378778860802/111122223333/napy9eintp64/output/consolidated/32PPR/2022-01-04T09:58:03Z/S2B_32PPR_20220104_0_L2A_msavi.tif"
        },
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE",
        "keySpec": "AES_256"
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
Decrypt
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AWSService",
        "invokedBy": "sagemaker-geospatial.amazonaws.com"
    },
    "eventTime": "2023-03-28T22:04:24Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "sagemaker-geospatial.amazonaws.com",
    "userAgent": "sagemaker-geospatial.amazonaws.com",
    "requestParameters": {
        "encryptionAlgorithm": "SYMMETRIC_DEFAULT",
        "encryptionContext": {
            "aws:s3:arn": "arn:aws:s3:::axis-earth-observation-job-378778860802/111122223333/napy9eintp64/output/consolidated/32PPR/2022-01-04T09:58:03Z/S2B_32PPR_20220104_0_L2A_msavi.tif"
        }
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
GenerateDataKeyWithoutPlainText
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAIGDTESTANDEXAMPLE:SageMaker-Geospatial-StartEOJ-KMSAccess",
        "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole/SageMaker-Geospatial-StartEOJ-KMSAccess",
        "accountId": "111122223333",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE3",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AKIAIOSFODNN7EXAMPLE3",
                "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole",
                "accountId": "111122223333",
                "userName": "SageMakerGeospatialCustomerRole"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-17T18:02:06Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "arn:aws:iam::111122223333:root"
    },
    "eventTime": "2023-03-28T22:09:16Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "GenerateDataKeyWithoutPlaintext",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "172.12.34.56",
    "userAgent": "ExampleDesktop/1.0 (V1; OS)",
    "requestParameters": {
        "keySpec": "AES_256",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}
RStudio on Amazon SageMaker
RStudio allows customers to create data science insights using an R environment. With RStudio
integration, you can launch an RStudio environment in the Domain to run your RStudio workflows on
SageMaker resources. For more information about RStudio, see the RStudio website.
Topics
• Region availability (p. 433)
RStudio on Amazon SageMaker provides the following benefits:
• R developers use the RStudio IDE interface with popular developer tools from the R ecosystem. Users can launch new RStudio sessions, write R code, install dependencies from RStudio Package Manager, and publish Shiny apps using RStudio Connect.
• R developers can quickly scale underlying compute resources to run large scale data processing and
statistical analysis.
• Platform administrators can set up user identities, authorization, networking, storage, and security
for their data science teams through AWS IAM Identity Center (successor to AWS Single Sign-On) and
AWS Identity and Access Management integration. This includes connection to private Amazon Virtual
Private Cloud (Amazon VPC) resources and internet-free mode with AWS PrivateLink.
• Integration with AWS License Manager.
For information on the onboarding steps to create a Domain with RStudio enabled, see Onboard to
Amazon SageMaker Domain (p. 37).
Region availability
The following table gives information about the AWS Regions that RStudio on SageMaker is supported
in.
RStudio components
• RStudioServerPro: The RStudioServerPro app is a multiuser app that is a shared resource among
all user profiles in the Domain. Once an RStudio app is created in a Domain, the admin can give
permissions to users in the Domain.
• RStudio user: RStudio users are users within the Domain that are authorized to use the RStudio license.
• RStudio admin: An RStudio on Amazon SageMaker admin can access the RStudio administrative
dashboard. RStudio on Amazon SageMaker admins differ from "stock" RStudio Workbench admins
because they do not have root access to the instance running the RStudioServerPro app and can't
modify the RStudio configuration file.
• RStudio Server: The RStudio Server instance is responsible for serving the RStudio UI to all authorized
Users. This instance is launched on an Amazon SageMaker instance.
• RSession: An RSession is a browser-based interface to the RStudio IDE running on an Amazon
SageMaker instance. Users can create and interact with their RStudio projects through the RSession.
• RSessionGateway: The RSessionGateway app is used to support an RSession.
• RStudio administrative dashboard: This dashboard gives information on the RStudio users in the
Amazon SageMaker Domain and their sessions. This dashboard can only be accessed by users that have
RStudio admin authorization.
• When using RStudio on SageMaker, users don’t have access to the RStudio configuration files. Amazon
SageMaker manages the configuration file and sets defaults. You can modify the RStudio Connect and
RStudio Package Manager URLs when creating your RStudio-enabled Amazon SageMaker Domain.
• Project sharing, realtime collaboration, and Job Launcher are not currently supported when using
RStudio on Amazon SageMaker.
• When using RStudio on SageMaker, the RStudio IDE runs on Amazon SageMaker instances for on-
demand containerized compute resources.
• RStudio on SageMaker only supports the RStudio IDE and does not support other IDEs supported by
an RStudio Workbench installation.
• RStudio on SageMaker only supports the RStudio version specified in Upgrade the RStudio Version
(p. 436).
For information about creating a Amazon SageMaker Domain with RStudio enabled, see Onboard to
Amazon SageMaker Domain (p. 37).
Manage RStudio on SageMaker
For information about the AWS Regions that RStudio on SageMaker is supported in, see Supported
Regions and Quotas (p. 33).
Topics
• RStudio license (p. 435)
• Upgrade the RStudio Version (p. 436)
• Network and Storage (p. 437)
• RStudioServerPro instance type (p. 437)
• RStudio Connect URL (p. 438)
• RStudio Package Manager (p. 438)
• Create an Amazon SageMaker Domain with RStudio using the AWS CLI (p. 439)
• Add RStudio support to an existing Domain (p. 443)
• Bring your own image to RStudio on SageMaker (p. 446)
• Manage users (p. 459)
• RStudio administrative dashboard (p. 460)
• Shut down and restart RStudio (p. 461)
• Manage billing and cost (p. 462)
• Diagnose issues and get support (p. 462)
RStudio license
RStudio on Amazon SageMaker is a paid product and requires that each user is appropriately licensed.
Licenses for RStudio on Amazon SageMaker may be obtained from RStudio PBC directly, or by
purchasing a subscription to RStudio Workbench on AWS Marketplace. For existing customers of RStudio
Workbench Enterprise, licenses are issued at no additional cost.
To use an RStudio license with Amazon SageMaker, you must first have a valid RStudio license registered
with AWS License Manager. Subscriptions to RStudio Workbench on AWS Marketplace automatically
trigger license creation with AWS License Manager. For licenses purchased directly through RStudio PBC, a license grant must be created for your AWS account. Contact RStudio for direct license purchases or to
enable existing licenses in AWS License Manager. For more information about registering a license with
AWS License Manager, see Seller issued licenses in AWS License Manager.
The following steps show how to acquire and validate a license granted by RStudio PBC.
1. If you don't have an RStudio license, you may purchase one from the AWS Marketplace or from
RStudio PBC directly.
• To purchase a subscription from the AWS Marketplace, complete the steps in Subscribing to an
AMI product with contract pricing public offer by searching for Posit Workbench.
• To purchase from RStudio PBC directly, navigate to RStudio Pricing or contact [email protected].
When buying or updating an RStudio license, you must provide the AWS Account that will host
your Amazon SageMaker Domain.
If you have an existing RStudio license, contact your RStudio Sales representative or
[email protected] to add RStudio on Amazon SageMaker to your existing RStudio Workbench
Enterprise license, or to convert your RStudio Workbench Standard license. The RStudio Sales
representative will send you the appropriate electronic order form.
2. RStudio grants an RStudio Workbench license to your AWS Account through AWS License Manager in
the US East (N. Virginia) Region. Although the RStudio license is granted in the US East (N. Virginia)
Region, your license can be consumed in any AWS Region that RStudio on Amazon SageMaker is
supported in. You can expect the license grant process to complete within three business days after
you share your AWS account ID with RStudio.
3. When this license is granted, you receive an email from your RStudio Sales representative with
instructions to accept your license grant.
1. Log into the AWS License Manager console in the same region as your Amazon SageMaker Domain.
If you are using AWS License Manager for the first time, AWS License Manager prompts you to grant
permission to use AWS License Manager.
2. Select Start using AWS License Manager.
3. Select I grant AWS License Manager the required permissions and select Grant
Permissions.
4. Navigate to Granted Licenses on the left panel.
5. Select the license grant with RSW-SageMaker as the Product name and select View.
6. From the license detail page, select Accept & activate license.
You can use the RStudio administrative dashboard to see the number of users on the license following
the steps in RStudio administrative dashboard (p. 460).
Upgrade the RStudio Version
All newly created SageMaker Domains with RStudio and new RSessions use this version (2022.02.2-485.pro2). To use this version with existing SageMaker Domains with RStudio, you must relaunch your RStudioServerPro application. For more information about the changes in this release, see the RStudio Release Notes.
Upgrade Scenarios
All new RStudio applications are created using the 2022.02.2-485.pro2 release.
The RStudioServerPro application is deployed when the domain is created, and it persists unless it's
deleted. If you have an existing RStudio-enabled domain, you must upgrade the RStudioServerPro
application to support end-to-end encryption with new RSessions. If you create a new domain, you don't
need to upgrade.
• If you create a new domain with RStudio Enabled: RStudio applications for newly created domains
are created using the 2022.02.2-485.pro2 release and support end-to-end encryption. No further
action is required.
• You have a pre-existing SageMaker domain with RStudio and update the RStudioServerPro
App: This enables end-to-end encryption and requires no further changes for new RSessions. You must
delete and re-create your existing RSession applications.
• You have a pre-existing SageMaker domain with RStudio and do not update the RStudioServerPro
App: If you don’t update your application, there is a version mismatch with all new RSessions.
There may be functionality issues because of the version mismatch. Traffic encryption between
RStudioServerPro and RSession is also not available. We recommend that you update your
RStudioServerPro application to the new version.
Network and Storage
RStudio in Amazon SageMaker supports AWS PrivateLink integration. With this integration, you can
use RStudio on SageMaker in VPC-only mode without direct access to the internet. When you use
RStudio in VPC-only mode, your security groups are automatically managed by the service. This includes
connectivity between your RServer and your RSessions.
The following are required to use RStudio in VPC-only mode. For more information about selecting a VPC, see Choose an Amazon VPC (p. 46).
• A private subnet with either access to the internet (to call Amazon SageMaker and AWS License Manager) or Amazon VPC endpoints for both Amazon SageMaker and AWS License Manager.
• The Domain cannot have more than two associated security groups.
• A Security Group ID for use with the Domain in Domain Settings. This must allow all outbound access.
• A Security Group ID for use with the Amazon VPC endpoint. This security group must allow inbound
traffic from the Domain Security Group ID.
• Amazon VPC Endpoint for sagemaker.api and AWS License Manager. This must be in the same
Amazon VPC as the private subnet.
RStudioServerPro instance type
You can select the instance type to use for your RStudioServerPro app depending on the workload across all users. The following are the available instance types to use for your RStudioServerPro. For pricing information about these instances, see Amazon SageMaker Pricing.
• ml.t3.medium: This instance type is recommended for Domains with low UI use and is free to use.
• ml.c5.4xlarge: This instance type is recommended for Domains with moderate UI use.
• ml.c5.9xlarge: This instance type is recommended for Domains with heavy UI use.
To change the instance type of your RStudioServerPro, pass the new instance type as part of a call to
the update-domain CLI command. You then need to delete the existing RStudioServerPro app using
the delete-app CLI command and create a new RStudioServerPro app using the create-app CLI
command.
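A sketch of that sequence with the AWS CLI (the DefaultResourceSpec shape under RStudioServerProDomainSettingsForUpdate, the domain-shared profile, and the app name default are assumptions based on the defaults described later in this section):

aws sagemaker update-domain \
    --domain-id <DOMAIN_ID> \
    --domain-settings-for-update "RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<ROLE_ARN>,DefaultResourceSpec={InstanceType=ml.c5.4xlarge}}"

aws sagemaker delete-app \
    --domain-id <DOMAIN_ID> \
    --user-profile-name domain-shared \
    --app-type RStudioServerPro \
    --app-name default

aws sagemaker create-app \
    --domain-id <DOMAIN_ID> \
    --user-profile-name domain-shared \
    --app-type RStudioServerPro \
    --app-name default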
RStudio Connect URL
When you onboard to an RStudio-enabled Amazon SageMaker Domain, an RStudio Connect server is not created. You can create an RStudio Connect server on an Amazon EC2 instance to use Connect with your Amazon SageMaker Domain. For information about how to set up your RStudio Connect server, see Host RStudio Connect and Package Manager for ML development in RStudio on Amazon SageMaker.
If you have an RStudio Connect URL, you can update the default URL so that your RStudio Users can
publish to it.
CLI
You can set a default RStudio Connect URL when you create your Domain. The only way to update
your RStudio Connect URL from the AWS CLI is to delete your Domain and create a new one with the
updated RStudio Connect URL.
RStudio Package Manager
You can update the Package Manager URL used for your RStudio-enabled Domain as follows.
CLI
The only way to update your Package Manager URL from the AWS CLI is to delete your Domain and
create a new one with the updated Package Manager URL.
Create an Amazon SageMaker Domain with RStudio using the AWS CLI
Prerequisites
• Install and configure AWS CLI version 2.
• Configure the AWS CLI with IAM credentials.
The following procedure shows how to create the DomainExecution role with the AWS CLI.
1. Create a file (for example, assume-role-policy.json) that contains the following trust policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            }
        }
    ]
}
2. Create the DomainExecution role. <REGION> should be the AWS Region to launch your Domain in.
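A sketch of that call, using the trust policy file from step 1 (the file name is illustrative):

aws iam create-role \
    --region <REGION> \
    --role-name DomainExecution \
    --assume-role-policy-document file://assume-role-policy.json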
3. Create a file named domain-setting-policy.json with the following content. This policy
allows the RStudioServerPro app to access necessary resources and allows Amazon SageMaker
to automatically launch an RStudioServerPro app when the existing RStudioServerPro app is in a
Deleted or Failed status.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"license-manager:ExtendLicenseConsumption",
"license-manager:ListReceivedLicenses",
"license-manager:GetLicense",
"license-manager:CheckoutLicense",
"license-manager:CheckInLicense",
"logs:CreateLogDelivery",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:DeleteLogDelivery",
"logs:Describe*",
"logs:GetLogDelivery",
"logs:GetLogEvents",
"logs:ListLogDeliveries",
"logs:PutLogEvents",
"logs:PutResourcePolicy",
"logs:UpdateLogDelivery",
"sagemaker:CreateApp"
],
"Resource": "*"
}
]
}
4. Create the Domain setting policy and attach it to the DomainExecution role. Note the PolicyArn in the response; you will need that ARN in the following steps.
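A sketch of creating and attaching the policy (the policy name is illustrative):

aws iam create-policy \
    --policy-name domain-setting-policy \
    --policy-document file://domain-setting-policy.json

aws iam attach-role-policy \
    --role-name DomainExecution \
    --policy-arn <POLICY_ARN>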
The creation of an Amazon SageMaker Domain differs based on the authentication method and the network type. These options must be used together, with one authentication method and one network connection type selected. For more information about the requirements to create a new Domain, see CreateDomain.
• IAM Auth
• SSO Auth
• PublicInternet
• VPCOnly
Authentication methods
The following shows how to create an Amazon SageMaker Domain with RStudio enabled using IAM authentication, as shown in the sketch after this list. For more information about AWS Identity and Access Management, see What is IAM?.
• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-
separated list of subnet IDs. For information about vpc-id and subnet-ids, see VPCs and subnets.
• RStudioPackageManagerUrl and RStudioConnectUrl are optional and should be set to the URLs
of your RStudio Package Manager and RStudio Connect server, respectively.
• app-network-access-type should be either PublicInternetOnly or VPCOnly.
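The original command for this example isn't reproduced here; the following is a sketch assembled from the parameters above (values in angle brackets are placeholders):

aws sagemaker create-domain \
    --domain-name <DOMAIN_NAME> \
    --auth-mode IAM \
    --app-network-access-type PublicInternetOnly \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --default-user-settings "ExecutionRole=<EXECUTION_ROLE_ARN>" \
    --domain-settings "RStudioServerProDomainSettings={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>,RStudioPackageManagerUrl=<PACKAGE_MANAGER_URL>,RStudioConnectUrl=<CONNECT_URL>}"

For SSO authentication, the same call applies with --auth-mode SSO.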
The following shows how to create an Amazon SageMaker Domain with RStudio enabled using SSO authentication. AWS IAM Identity Center (successor to AWS Single Sign-On) must be enabled in the Region where the Domain is launched. For more information about IAM Identity Center, see What is AWS IAM Identity Center (successor to AWS Single Sign-On)?.
• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-
separated list of subnet IDs. For information about vpc-id and subnet-ids, see VPCs and subnets.
• RStudioPackageManagerUrl and RStudioConnectUrl are optional and should be set to the URLs
of your RStudio Package Manager and RStudio Connect server, respectively.
Connection types
The following shows how to create an Amazon SageMaker Domain with RStudio enabled and the PublicInternetOnly network access type.
• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-
separated list of subnet IDs. For information about vpc-id and subnet-ids, see VPCs and subnets.
• RStudioPackageManagerUrl and RStudioConnectUrl are optional and should be set to the URLs
of your RStudio Package Manager and RStudio Connect server, respectively.
• auth-mode should be either SSO or IAM.
VPCOnly mode
The following shows how to launch an Amazon SageMaker Domain with RStudio enabled and the VPCOnly network access type; see the sketch after this list. For more information about using the VPCOnly network access type, see Connect SageMaker Studio Notebooks in a VPC to External Resources (p. 3209).
• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-separated list of subnet IDs. Your private subnet must either be able to access the internet (to call Amazon SageMaker and AWS License Manager) or have Amazon VPC endpoints for both Amazon SageMaker and AWS License Manager. For information about Amazon VPC endpoints, see Interface Amazon VPC endpoints. For information about vpc-id and subnet-ids, see VPCs and subnets.
• SecurityGroups must allow outbound access to the Amazon SageMaker and AWS License Manager
endpoints.
• auth-mode should be either SSO or IAM.
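A sketch of the VPCOnly variant, assembled from the parameters above (placeholders in angle brackets):

aws sagemaker create-domain \
    --domain-name <DOMAIN_NAME> \
    --auth-mode IAM \
    --app-network-access-type VPCOnly \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --default-user-settings "ExecutionRole=<EXECUTION_ROLE_ARN>,SecurityGroups=[<SECURITY_GROUP_IDS>]" \
    --domain-settings "RStudioServerProDomainSettings={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>}"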
Note
When using Amazon Virtual Private Cloud endpoints, the security group attached to your
Amazon Virtual Private Cloud endpoints must allow inbound traffic from the security group you
pass as part of the domain-setting parameter of the create-domain CLI call.
With RStudio, Amazon SageMaker manages security groups for you. This means that Amazon SageMaker
manages security group rules to ensure RSessions can access RStudioServerPro Apps. Amazon
SageMaker creates one security group rule per user profile.
Note: The RStudioServerPro app is launched by a special user profile named domain-shared. As a result, this app is not returned in ListApps API calls for any other user profile.
You may have to increase the Amazon VPC quota in your account to increase the number of users. For
more information, see Amazon VPC quotas.
Prerequisites
You must complete the following steps before you update your current Domain to add support for
RStudio on SageMaker.
You can use the DefaultUserSettings parameter of the CreateDomain API to add SecurityGroups that are
inherited by all the user profiles created in the Domain. You can also provide additional security groups
for a specific user as part of the UserSettings parameter of the CreateUserProfile API. If you have
added security groups this way, you must ensure that the total number of security groups per user
profile doesn’t exceed the maximum quota of 2 in VPCOnly mode and 4 in PublicInternetOnly
mode. If the resulting total number of security groups for any user profile exceeds the quota, you can
combine multiple security groups’ rules into one security group.
To add support for RStudio in your Domain, SageMaker must update the underlying security groups for
all existing user profiles. To complete this, you must delete and recreate all existing apps in the Domain.
Step 1 - Delete all of the apps in the Domain
The following procedure shows how to delete all of the apps.
aws sagemaker \
list-apps \
--domain-id-equals <DOMAIN_ID>
// JupyterServer apps
aws sagemaker \
delete-app \
--domain-id <DOMAIN_ID> \
--user-profile-name <USER_PROFILE> \
--app-type JupyterServer \
--app-name <APP_NAME>
// KernelGateway apps
aws sagemaker \
delete-app \
--domain-id <DOMAIN_ID> \
--user-profile-name <USER_PROFILE> \
--app-type KernelGateway \
--app-name <APP_NAME>
Step 2 - Update all user profiles with the new list of security groups
This is a one-time action that you must complete for all of the existing user profiles in your Domain
when you have refactored your existing security groups. This prevents you from hitting the quota for the
maximum number of security groups. The UpdateUserProfile API call fails if the user has any apps
that are in InService status. Delete all apps, then call UpdateUserProfile API to update the security
groups.
Note
The following requirement for VPCOnly mode outlined in Connect Amazon SageMaker Studio
Notebooks in a VPC to External Resources is no longer needed when adding RStudio support
because AppSecurityGroupManagement is managed by the SageMaker service:
“TCP traffic within the security group. This is required for connectivity between the
JupyterServer app and the KernelGateway apps. You must allow access to at least ports in the
range 8192-65535.”
aws sagemaker \
update-user-profile \
--domain-id <DOMAIN_ID>\
--user-profile-name <USER_PROFILE> \
--user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\", \"<SECURITY_GROUP>\"]}"
Step 3 - Update the Domain to add RStudio support
1. Call the UpdateDomain API to add support for RStudio on SageMaker. The default-user-settings parameter is only needed if you have refactored the default security groups for your user profiles.
aws sagemaker \
update-domain \
--domain-id <DOMAIN_ID> \
--app-security-group-management Service \
--domain-settings-for-update
RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>}
\
--default-user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\",
\"<SECURITY_GROUP>\"]}"
aws sagemaker \
update-domain \
--domain-id <DOMAIN_ID> \
--domain-settings-for-update
RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
--default-user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\",
\"<SECURITY_GROUP>\"]}"
2. Verify that the Domain status is InService. After the Domain status is InService, support for
RStudio on SageMaker is added.
aws sagemaker \
describe-domain \
--domain-id <DOMAIN_ID>
3. Verify that the RStudioServerPro app’s status is InService using the following command.
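A minimal sketch of such a check (the domain-shared profile is described earlier in this section; the app name default is an assumption):

aws sagemaker describe-app \
    --domain-id <DOMAIN_ID> \
    --user-profile-name domain-shared \
    --app-type RStudioServerPro \
    --app-name default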
As part of the update in Step 3, SageMaker marks the RStudio AccessStatus of all existing user profiles
in the Domain as DISABLED by default. This prevents exceeding the number of users allowed by your
current license. To add access for existing users, there is a one-time opt-in step. Perform the opt-in by
calling the UpdateUserProfile API with the following RStudioServerProAppSettings:
• AccessStatus = ENABLED
• Optional - UserGroup = R_STUDIO_USER or R_STUDIO_ADMIN
aws sagemaker \
update-user-profile \
--domain-id <DOMAIN_ID>\
--user-profile-name <USER_PROFILE> \
--user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"ENABLED\"}}"
Note
By default, the number of users that can have access to RStudio is 60.
Unless otherwise specified when calling UpdateDomain, RStudio support is added by default for all
new user profiles created after you have added support for RStudio on SageMaker. To deactivate access
for a new user profile, you must explicitly set the AccessStatus parameter to DISABLED as part of
the CreateUserProfile API call. If the AccessStatus parameter is not specified as part of the
CreateUserProfile API, the default access status is ENABLED.
aws sagemaker \
create-user-profile \
--domain-id <DOMAIN_ID>\
--user-profile-name <USER_PROFILE> \
--user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"DISABLED\"}}"
Bring your own image to RStudio on SageMaker
The process to bring your own image to use with RStudio on SageMaker takes three steps:
1. Build a custom image from a Dockerfile and push it to a repository in Amazon Elastic Container
Registry (Amazon ECR).
2. Create a SageMaker image that points to a container image in Amazon ECR and attach it to your
Amazon SageMaker Domain.
3. Launch a new session in RStudio with your custom image.
You can create images and image versions, and attach image versions to your Domain, using the
SageMaker control panel, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS
CLI). You can also create images and image versions using the SageMaker console, even if you haven't
onboarded to a Domain.
The following topics show how to bring your own image to RStudio on SageMaker by creating, attaching,
and launching a custom image.
Key terminology
The following section defines key terms for bringing your own image to use with RStudio on SageMaker.
• Dockerfile: A Dockerfile is a file that identifies the language packages and other dependencies for your
Docker image.
• Docker image: The Docker image is a built Dockerfile. This image is checked into Amazon ECR and
serves as the basis of the SageMaker image.
• SageMaker image: A SageMaker image is a holder for a set of SageMaker image versions based on
Docker images.
• Image version: An image version of a SageMaker image represents a Docker image that is compatible
with RStudio and stored in an Amazon ECR repository. Each image version is immutable. These image
versions can be attached to a domain and used with RStudio on SageMaker.
Prerequisites
You must complete the following prerequisites before bringing your own image to use with RStudio on
Amazon SageMaker.
• If you have an existing Domain with RStudio that was created before April 7, 2022, you must delete
your RStudioServerPro application and recreate it. For information about how to delete an application,
see Shut down and Update SageMaker Studio (p. 199).
• Install the Docker application. For information about setting up Docker, see Orientation and setup.
• Create a local copy of an RStudio-compatible Dockerfile that works with SageMaker. For information
about creating a sample RStudio dockerfile, see Use a custom image to bring your own development
environment to RStudio on Amazon SageMaker.
• Use an AWS Identity and Access Management execution role that has the AmazonSageMakerFullAccess
policy attached. If you have onboarded to Domain, you can get the role from the Domain Summary
section of the SageMaker control panel.
Add the following permissions to access the Amazon Elastic Container Registry (Amazon ECR) service
to your execution role.
{
"Version":"2012-10-17",
"Statement":[
{
"Sid": "VisualEditor0",
"Effect":"Allow",
"Action":[
"ecr:CreateRepository",
"ecr:BatchGetImage",
"ecr:CompleteLayerUpload",
"ecr:DescribeImages",
"ecr:DescribeRepositories",
"ecr:UploadLayerPart",
"ecr:ListImages",
"ecr:InitiateLayerUpload",
"ecr:BatchCheckLayerAvailability",
"ecr:PutImage"
],
"Resource": "*"
}
]
}
• Install and configure AWS CLI with the following (or higher) version. For information about installing
the AWS CLI, see Installing or updating the latest version of the AWS CLI.
Custom RStudio image specifications
Your custom RStudio image must satisfy the requirements of both RStudio Workbench and the SageMaker platform. If either of these sets of requirements isn't satisfied, your custom image won't function properly.
RStudio PBC requirements are laid out in the Using Docker images with RStudio Workbench / RStudio
Server Pro, Launcher, and Kubernetes article. Follow the instructions in this article to create the base of
your custom RStudio image.
For instructions about how to install multiple R versions in your custom image, see Installing multiple
versions of R on Linux.
Amazon SageMaker Studio imposes the following set of installation requirements for your RStudio
image.
• You must use an RStudio base image of at least 2022.02.2-485.pro2. For more information, see
Upgrade the RStudio Version (p. 436).
• You must install the following packages:
The following general specifications apply to the image that is represented by an RStudio image version.
Running the image
ENTRYPOINT and CMD instructions are overridden so that the image is run as an RSession
application.
Stopping the image
The DeleteApp API issues the equivalent of a docker stop command. Other processes in the
container won’t get the SIGKILL/SIGTERM signals.
File system
The /opt/.sagemakerinternal and /opt/ml directories are reserved. Any data in these
directories might not be visible at runtime.
User data
Each user in a SageMaker domain gets a user directory on a shared Amazon Elastic File System
volume in the image. The location of the current user’s directory on the Amazon Elastic File System
volume is /home/sagemaker-user.
GPU
On a GPU instance, the image is run with the --gpus option. Only the CUDA toolkit should be
included in the image, not the NVIDIA drivers. For more information, see NVIDIA User Guide.
Metrics and logging
Logs from the RSession process are sent to Amazon CloudWatch in the customer’s account. The
name of the log group is /aws/sagemaker/studio. The name of the log stream is $domainID/
$userProfileName/RSession/$appName.
Image size
Image size is limited to 25 GB. To view the size of your image, run docker image ls.
Create a custom RStudio image
When you create an image, SageMaker also creates an initial image version. The image version
represents a container image in Amazon Elastic Container Registry (ECR). The container image must
satisfy the requirements to be used in RStudio. For more information, see Custom RStudio image
specifications (p. 447).
For information about testing your image locally and resolving common issues, see the SageMaker
Studio Custom Image Samples repo.
Topics
• Add a SageMaker-compatible RStudio Docker container image to Amazon ECR (p. 449)
• Create a SageMaker image from the console (p. 450)
• Create an image from the AWS CLI (p. 451)
Add a SageMaker-compatible RStudio Docker container image to Amazon ECR
Note
The Amazon ECR repository must be in the same AWS Region as your domain.
1. Create an Amazon ECR repository using the AWS CLI. To create the repository using the Amazon ECR
console, see Creating a repository.
aws ecr create-repository \
    --repository-name rstudio-custom \
    --image-scanning-configuration scanOnPush=true
Response:
{
"repository": {
"repositoryArn": "arn:aws:ecr:us-east-2:acct-id:repository/rstudio-custom",
"registryId": "acct-id",
"repositoryName": "rstudio-custom",
"repositoryUri": "acct-id.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom",
...
}
}
2. Authenticate to Amazon ECR using the repository URI returned as a response from the create-
repository command. Make sure that the Docker application is running. For more information, see
Registry Authentication.
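The original command isn't reproduced here; the standard Amazon ECR login sequence is:

aws ecr get-login-password --region region | \
    docker login --username AWS --password-stdin acct-id.dkr.ecr.region.amazonaws.com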
Response:
Login Succeeded
3. Build the Docker image. Run the following command from the directory that includes your
Dockerfile.
docker build .
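4. Tag the Docker image so that it can be pushed; a minimal sketch, using the repository URI from step 1 (the image ID and tag are placeholders):

docker tag <image-id> acct-id.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom:<tag>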
5. Push the container image to the Amazon ECR repository. For more information, see ImagePush and
Pushing an image.
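A sketch of the push, using the same URI and tag:

docker push acct-id.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom:<tag>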
Create a SageMaker image from the console
To create an image
4. Enter the registry path to your container image in Amazon ECR. The path must be in the following format:
acct-id.dkr.ecr.region.amazonaws.com/repo-name[:tag] or [@digest]
5. Choose Next.
6. Under Image properties, enter the following:
• Image name – The name must be unique to your account in the current AWS Region.
• (Optional) Image display name – The name displayed in the domain user interface. When not
provided, Image name is displayed.
• (Optional) Description – A description of the image.
• IAM role – The role must have the AmazonSageMakerFullAccess policy attached. Use the
dropdown menu to choose one of the following options:
• Create a new role – Specify any additional Amazon Simple Storage Service (Amazon S3) buckets
that you want your notebooks users to access. If you don't want to allow access to additional
buckets, choose None.
SageMaker attaches the AmazonSageMakerFullAccess policy to the role. The role allows
your notebook users to access the Amazon S3 buckets listed next to the check marks.
• Enter a custom IAM role ARN – Enter the Amazon Resource Name (ARN) of your IAM role.
• Use existing role – Choose one of your existing roles from the list.
• (Optional) Image tags – Choose Add new tag. You can add up to 50 tags. Tags are searchable
using the SageMaker console or the SageMaker Search API.
7. Under Image type, select RStudio image.
8. Choose Submit.
The new image is displayed in the Custom images list and briefly highlighted. After the image has been
successfully created, you can choose the image name to view its properties or choose Create version to
create another version.
To use the custom image in RStudio, you must attach it to your domain. For more information, see
Attach a custom SageMaker image (p. 453).
Create an image from the AWS CLI
This section shows how to create a custom Amazon SageMaker image using the AWS CLI. The process includes the following steps:
• Create an Image.
• Create an ImageVersion.
• Create a configuration file.
• Create an AppImageConfig.
1. Create a SageMaker image. The role ARN must have at least the
AmazonSageMakerFullAccessPolicy policy attached.
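A minimal sketch of the call (the image name matches the ARN in the response below):

aws sagemaker create-image \
    --image-name rstudio-custom-image \
    --role-arn <ROLE_ARN>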
Response:
{
"ImageArn": "arn:aws:sagemaker:us-east-2:acct-id:image/rstudio-custom-image"
}
2. Create a SageMaker image version from the image. Pass the unique tag value that you chose when
you pushed the image to Amazon ECR.
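A minimal sketch of the call (the base-image URI follows the Amazon ECR path from the previous section):

aws sagemaker create-image-version \
    --image-name rstudio-custom-image \
    --base-image acct-id.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom:<tag>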
Response:
{
"ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/rstudio-
image/1"
}
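3. Verify the image version's status; a minimal check, assuming describe-image-version:

aws sagemaker describe-image-version \
    --image-name rstudio-custom-image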
Response:
{
"ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/rstudio-
custom-image/1",
"ImageVersionStatus": "CREATED"
}
Note
If the response is "ImageVersionStatus": "CREATED_FAILED", the response
also includes the failure reason. A permissions issue is a common cause of failure. You
also can check your Amazon CloudWatch Logs. The name of the log group is /aws/
sagemaker/studio. The name of the log stream is $domainID/$userProfileName/
KernelGateway/$appName.
4. Create a configuration file named app-image-config-input.json. The app image config is used to configure running a SageMaker image as a Kernel Gateway application.
{
"AppImageConfigName": "rstudio-custom-config"
}
5. Create the AppImageConfig using the file that you created in the previous step.
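A sketch of that call, using the configuration file from step 4:

aws sagemaker create-app-image-config \
    --cli-input-json file://app-image-config-input.json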
Response:
{
"AppImageConfigArn": "arn:aws:sagemaker:us-east-2:acct-id:app-image-config/r-image-
config"
}
Attach a custom SageMaker image
To use a custom SageMaker image, you must attach your custom RStudio image to your Domain. When you attach an image version, it appears in the RStudio Launcher and is available in the Select image dropdown list. You use the dropdown to change the image used by RStudio.
There is a limit to the number of image versions that you can attach. After you reach the limit, you must
first detach a version so that you can attach a different version of the image.
Topics
• Attach an image version to your Domain using the console (p. 453)
• Attach an existing image version to your Domain using the AWS CLI (p. 454)
Attach an image version to your Domain using the console
You can attach a custom SageMaker image version to your Domain using the SageMaker console's
control panel. You can also create a custom SageMaker image, and an image version, and then attach
that version to your Domain.
If you select Existing image, choose an image from the Amazon SageMaker image store.
If you select New image, provide the Amazon ECR registry path for your Docker image. The path
must be in the same AWS Region as the Domain. The Amazon ECR repo must be in the same account
as your Domain, or cross-account permissions for SageMaker must be enabled.
7. Choose an existing image from the list.
8. Choose a version of the image from the list.
9. Choose Next.
10. Enter values for Image name, Image display name, and Description.
11. Choose the IAM role. For more information, see Create a custom RStudio image (p. 449).
12. (Optional) Add tags for the image.
13. (Optional) Choose Add new tag, then add a configuration tag.
14. For Image type, select RStudio Image.
15. Choose Submit.
Wait for the image version to be attached to the Domain. After the version is attached, it appears in the
Custom images list and is briefly highlighted.
Attach an existing image version to your Domain using the AWS CLI
Two methods are presented to attach the image version to your Domain using the AWS CLI. In the first
method, you create a new Domain with the version attached. This method is simpler but you must
specify the Amazon Virtual Private Cloud (Amazon VPC) information and execution role that's required to
create the Domain.
If you have already onboarded to the Domain, you can use the second method to attach the image
version to your current Domain. In this case, you don't need to specify the Amazon VPC information and
execution role. After you attach the version, delete all of the applications in your Domain and relaunch
RStudio.
To use this method, you must specify an execution role that has the AmazonSageMakerFullAccess policy
attached.
Use the following steps to create the Domain and attach the custom SageMaker image:
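1. Get your VPC ID; a minimal sketch, assuming you use your account's default VPC:

aws ec2 describe-vpcs \
    --filters Name=isDefault,Values=true \
    --query "Vpcs[0].VpcId" \
    --output text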
Response:
vpc-xxxxxxxx
2. Get your default subnet IDs using the VPC ID from the previous step.
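A sketch of such a query:

aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=vpc-xxxxxxxx \
    --query "Subnets[*].SubnetId"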
Response:
[
"subnet-b55171dd",
"subnet-8a5f99c6",
"subnet-e88d1392"
]
3. Create a configuration file named create-domain-input.json. Insert the VPC ID, subnet IDs,
ImageName, and AppImageConfigName from the previous steps. Because ImageVersionNumber
isn't specified, the latest version of the image is used, which is the only version in this case. Your
execution role must satisfy the requirements in Prerequisites (p. 447).
{
"DomainName": "domain-with-custom-r-image",
"VpcId": "<vpc-id>",
"SubnetIds": [
"<subnet-ids>"
],
"DomainSettings": {
"RStudioServerProDomainSettings": {
"DomainExecutionRoleArn": "<execution-role>"
}
},
"DefaultUserSettings": {
"ExecutionRole": "<execution-role>",
"RSessionAppSettings": {
"CustomImages": [
{
"AppImageConfigName": "rstudio-custom-config",
"ImageName": "rstudio-custom-image"
}
]
}
},
"AuthMode": "IAM"
}
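4. Create the Domain from the configuration file; a minimal sketch:

aws sagemaker create-domain \
    --cli-input-json file://create-domain-input.json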
Response:
{
"DomainArn": "arn:aws:sagemaker:region:acct-id:domain/domain-id",
"Url": "https://domain-id.studio.region.sagemaker.aws/..."
}
Attach the SageMaker image to your current Domain
This method assumes that you've already onboarded to a Domain. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
Note
You must delete all of the applications in your Domain to update the Domain with the new
image version. For information about deleting these applications, see Delete an Amazon
SageMaker Domain (p. 116).
Use the following steps to add the SageMaker image to your current Domain.
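To begin, get your Domain's current default user settings and save them to a file named update-domain-input.json (the file name matches the later cleanup steps); a minimal sketch:

aws sagemaker describe-domain \
    --domain-id <DOMAIN_ID>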
Response:
{
"DomainId": "d-xxxxxxxxxxxx",
"DefaultUserSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
],
...
}
}
}
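Next, edit update-domain-input.json so that DefaultUserSettings includes your custom image under RSessionAppSettings, as in the following example: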
{
"DefaultUserSettings": {
"RSessionAppSettings": {
"CustomImages": [
{
"ImageName": "rstudio-custom-image",
"AppImageConfigName": "rstudio-custom-config"
}
]
}
}
}
9. Use the Domain ID and default user settings file to update your Domain.
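A sketch of that call; the file's shape (DomainId plus DefaultUserSettings) matches the update-domain input, so it can be passed directly:

aws sagemaker update-domain \
    --cli-input-json file://update-domain-input.json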
Response:
{
"DomainArn": "arn:aws:sagemaker:region:acct-id:domain/domain-id"
}
10. Delete the RStudioServerPro application. You must restart the RStudioServerPro domain-
shared application for the RStudio Launcher UI to pick up the latest changes.
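A minimal sketch of the deletion (the app name default is an assumption):

aws sagemaker delete-app \
    --domain-id <DOMAIN_ID> \
    --user-profile-name domain-shared \
    --app-type RStudioServerPro \
    --app-name default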
11. Create a new RStudioServerPro application. You must create this application using the AWS CLI.
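A minimal sketch of that call, mirroring the deletion above:

aws sagemaker create-app \
    --domain-id <DOMAIN_ID> \
    --user-profile-name domain-shared \
    --app-type RStudioServerPro \
    --app-name default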
Detach a custom SageMaker image
To remove a custom image from your Domain, you complete the following steps:
• Detach the image and image versions from your Amazon SageMaker Domain.
• Delete the image, image version, and app image config.
After you've completed these steps, you can delete the container image and repository from Amazon
ECR. For more information about how to delete the container image and repository, see Deleting a
repository.
When you detach an image from a Domain, all versions of the image are detached. When an image is
detached, all users of the Domain lose access to the image versions.
To detach an image
4. Choose Environment.
5. Under Custom images attached to domain, choose the image and then choose Detach.
6. (Optional) To delete the image and all versions from SageMaker, select Also delete the selected
images .... This does not delete the associated images from Amazon ECR.
7. Choose Detach.
To clean up resources
1. Detach the image and image versions from your Domain by passing an empty custom image list to
the Domain. Open the update-domain-input.json file that you created in Attach the SageMaker
image to your current domain (p. 177).
2. Delete the RSessionAppSettings custom images and then save the file. Do not modify the
KernelGatewayAppSettings custom images.
{
"DomainId": "d-xxxxxxxxxxxx",
"DefaultUserSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
],
...
},
"RSessionAppSettings": {
"CustomImages": [
],
"DefaultResourceSpec": {
}
...
}
}
}
3. Use the Domain ID and default user settings file to update your Domain.
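A sketch, reusing the update-domain-input.json shape shown above:

aws sagemaker update-domain \
    --cli-input-json file://update-domain-input.json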
Response:
{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}
5. Delete the SageMaker image, which also deletes all image versions. The container images in Amazon
ECR that are represented by the image versions are not deleted.
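A minimal sketch of the deletion (the image name follows the earlier examples):

aws sagemaker delete-image \
    --image-name rstudio-custom-image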
Manage users
After your RStudio-enabled Amazon SageMaker Domain is running, you can add user profiles
(UserProfiles) to the Domain. The following topics show how to create user profiles that are authorized
to use RStudio, as well as update an existing user profile. For information on how to delete an RStudio
App, UserProfile, or Domain, follow the steps in Delete an Amazon SageMaker Domain.
Note
The limit for the total number of UserProfiles in an Amazon SageMaker Domain is 60.
If a user is authorized, they can be given one of the following levels of access to RStudio.
• RStudio User: This is a standard RStudio user and can access RStudio.
• RStudio Admin: The admin of your Amazon SageMaker Domain has the ability to create users, add
existing users, and update the permissions of existing users. Admins can also access the RStudio
Administrative dashboard. However, this admin is not able to update parameters that are managed by
Amazon SageMaker.
To create a user in your RStudio-enabled Amazon SageMaker Domain from the console, complete the
steps in Add user profiles (p. 119).
The following command shows how to add users to an Amazon SageMaker Domain with IAM authentication. A user can belong to either the R_STUDIO_USER or R_STUDIO_ADMIN user group.
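A sketch of such a call (the UserGroup key under RStudioServerProAppSettings follows the opt-in example earlier in this section):

aws sagemaker create-user-profile \
    --domain-id <DOMAIN_ID> \
    --user-profile-name <USER_PROFILE> \
    --user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"ENABLED\", \"UserGroup\": \"R_STUDIO_USER\"}}"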
The following command shows how to add users to an Amazon SageMaker Domain with authentication
using IAM Identity Center. A user can belong to either the R_STUDIO_USER or R_STUDIO_ADMIN User
group.
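A sketch using placeholder IDs; the single sign-on identifier and value map the profile to a user in IAM Identity Center.
aws sagemaker create-user-profile \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name rstudio-admin \
    --single-sign-on-user-identifier UserName \
    --single-sign-on-user-value admin-user-name \
    --user-settings '{"RStudioServerProAppSettings": {"AccessStatus": "ENABLED", "UserGroup": "R_STUDIO_ADMIN"}}'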
To delete an app
1. From the list of running apps, identify the app you want to delete.
2. Choose the Delete app button for the app that you are deleting.
You cannot delete a user if the user is running any apps. Delete all apps before attempting to delete a
user.
To delete a user
1. From the User Profile page, select Edit. This opens a new General settings page.
2. Under Delete user, select Delete user.
To open the RStudio workspaces page or the RStudio administrative dashboard, navigate to one of the
following URLs.
https://<DOMAIN-ID>.studio.us-east-2.sagemaker.aws/rstudio/default/s/<SESSION-ID>/workspaces
https://<DOMAIN-ID>.studio.us-east-2.sagemaker.aws/rstudio/default/s/<SESSION-ID>/admin
Dashboard tab
This tab gives an overview of your RStudio Server instance utilization, as well as information on the
number of active RSessions.
Sessions tab
This tab gives information on the active RSessions, such as the user that launched the RSessions, the
time that the RSessions have been running, and their resource utilization.
Users tab
This tab gives information on the RStudio authorized users in the Domain, such as the time that the
last RSession was launched and their resource utilization. The following procedure shows how to get
information about the user's historical resource utilization.
1. From the list of users, select the user that you want to view information for. This opens a new page
that is specific to the user.
2. To view the user's historical resource utilization, select the Stats tab. This tab gives information
about the historical CPU and memory usage, as well as the number of active RSessions.
3. To view Amazon CloudWatch Logs specific to the user, select the Logs tab.
Stats tab
This tab gives information on the historical utilization of your RStudio Server instance.
Logs tab
This tab displays Amazon CloudWatch Logs for the RStudio Server instance. For more information about
logging events with Amazon CloudWatch Logs, see What is Amazon CloudWatch Logs?.
Any unsaved notebook information is lost in the process. The user data in the Amazon EFS volume isn't
impacted.
Note
If you are using a custom image with RStudio, ensure that your docker image is using an RStudio
version that is compatible with the version of RStudio Workbench being used by SageMaker
after you restart your RStudioServerPro app.
The following topics show how to shut down the RSessionGateway and RStudioServerPro apps and
restart them.
1. From the RStudio Launcher, identify the RSession that you want to suspend.
2. Select Suspend for the session.
3. Repeat this for all RSessions.
1. From the RStudio Launcher, identify the RSession that you want to delete.
2. Select Quit for the session. This opens a new Quit Session window.
3. From the Quit Session window, select Force Quit to end all child processes in the session.
4. Select Quit Session to confirm deletion of the session.
5. Repeat this for all RSessions.
The following describes components required to run RStudio on Amazon SageMaker and how each
component factors into billing for your RStudio instance.
• RStudio License – You must purchase an RStudio license. There is no additional charge for using your
RStudio license with Amazon SageMaker. For more information about your RStudio license, see
RStudio license (p. 435).
• RSession – These are RStudio working sessions launched by end users. You are charged while the
RSession is running.
• RStudio Server – A multi-tenant server that manages all the RSessions. You can choose the instance
type to run RStudio Server on, and pay the related costs. The default instance, "system", is free, but
you can choose to pay for higher tiers. For more information about the available instance types for
your RStudio Server, see RStudioServerPro instance type (p. 437).
To track billing at the user level using Cost Allocation Tags, see Using Cost Allocation Tags.
Use RStudio on Amazon SageMaker
You can view metrics and logs directly from the RStudio administrative dashboard.
Amazon CloudWatch monitors your AWS resources and the applications that you run on AWS in real
time. You can use Amazon CloudWatch to collect and track metrics, which are variables that you can
measure for your resources and applications. To ensure that your RStudio apps have permissions for
Amazon CloudWatch, you must include the permissions described in Onboard to Amazon SageMaker
Domain (p. 37). You don’t need to do any setup to gather Amazon CloudWatch Logs.
The following steps show how to view Amazon CloudWatch Logs for your RSession. These logs are in
the /aws/sagemaker/studio log group in the Amazon CloudWatch console. The RStudio Server writes to
the following log stream:
<DomainId>/domain-shared/rstudioserverpro/default
For information about the onboarding steps to create an Amazon SageMaker Domain with RStudio
enabled, see Onboard to Amazon SageMaker Domain (p. 37).
For information about the AWS Regions that RStudio on SageMaker is supported in, see Supported
Regions and Quotas (p. 33).
Topics
• Collaborate in RStudio (p. 464)
• Base R image (p. 464)
• Open RStudio Launcher and launch RSessions (p. 464)
• Publish to RStudio Connect (p. 465)
• Access Amazon SageMaker features with RStudio on Amazon SageMaker (p. 465)
Collaborate in RStudio
To share your RStudio project, you can connect RStudio to your Git repo. For information on setting this
up, see Version Control with Git and SVN.
Note: Project sharing and real-time collaboration are not currently supported when using RStudio on
Amazon SageMaker.
Base R image
When launching your RStudio instance, the Base R image serves as the basis of your instance. This image
extends the r-session-complete Docker image and includes the following:
• R v4.0 or higher
• awscli, sagemaker, and boto3 Python packages
• Reticulate package for R SDK integration
The procedure to open the RStudio Launcher using the AWS CLI differs depending on the method used
to manage your users.
IAM Identity Center
1. Use the AWS access portal to open your Amazon SageMaker Domain.
2. Modify the URL path to “/rstudio/default” as follows.
#Studio URL
https://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/lab
#modified URL
https://<domain-id>.studio.<region>.sagemaker.aws/rstudio/default
IAM
To open the RStudio Launcher from the AWS CLI in IAM mode, complete the following procedure.
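A sketch: generate a presigned Domain URL with placeholder IDs, then modify the path of the returned URL to /rstudio/default (as shown above) before opening it in a browser.
aws sagemaker create-presigned-domain-url \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name rstudio-user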
Launch RSessions
After you’ve launched the RStudio Launcher, you can create a new RSession.
For more information on RStudio Connect, see the RStudio Connect User Guide.
Your Amazon SageMaker Studio JupyterLab and RStudio instances share the same Amazon EFS file
system. This means that files that you import and create using JupyterLab can be accessed using RStudio
and vice versa. This allows you to work on the same files using both JupyterLab and RStudio without
having to move your files between the two. For more information on this workflow, see the Announcing
Fully Managed RStudio on Amazon SageMaker for Data Scientists blog.
The reticulate package is used as an R interface to the Amazon SageMaker Python SDK to make API calls
to Amazon SageMaker. The reticulate package translates between R and Python objects, and Amazon
SageMaker provides a serverless data science environment to train and deploy machine learning (ML)
models at scale. For general information about the reticulate package, see R Interface to Python.
For a blog that outlines how to use the reticulate package with Amazon SageMaker, see Using R with
Amazon SageMaker.
The following examples show how to use reticulate for specific use cases.
• For a notebook that describes how to use reticulate to do batch transform to make predictions, see
Batch Transform Using R with Amazon SageMaker.
• For a notebook that describes how to use reticulate to conduct hyperparameter tuning and generate
predictions, see Hyperparameter Optimization Using R with Amazon SageMaker.
Amazon SageMaker Autopilot
You can use Autopilot in different ways: on autopilot (hence the name) or with various degrees of
human guidance, without code through Amazon SageMaker Studio or with code using one of the AWS
SDKs. Autopilot currently supports regression and binary and multiclass classification problem types. It
supports tabular data formatted as CSV or Parquet files in which each column contains a feature with
a specific data type and each row contains an observation. The column data types accepted include
numerical, categorical, text, and time series that consist of strings of comma-separated numbers.
Autopilot supports building machine learning models on large datasets up to hundreds of GBs.
Autopilot also helps explain how models make predictions using a feature attribution approach
developed for Amazon SageMaker Clarify. Autopilot automatically generates a report that indicates
the importance of each feature for the predictions made by the best candidate. This explainability
functionality can make machine learning models more understandable to AWS customers. The model
governance report generated can be used to inform risk and compliance teams and external regulators.
You get full visibility into how the data was wrangled and how the models were selected, trained, and
tuned for each of the candidates tested. This is provided by notebooks that Autopilot generates for each
trial, which contain the code used to explore the data and find the best candidate. The notebooks also
provide educational tools to help you learn about and conduct your own ML experiments. You can learn
about the impact of various inputs and trade-offs made in experiments by examining the various data
exploration and candidate definition notebooks exposed by Autopilot. You can also conduct further
experiments on the higher performing candidates by making your own modifications to the notebooks
and rerunning them.
The following graphic outlines the principal tasks of an AutoML process managed by Autopilot.
With Amazon SageMaker, you pay only for what you use. You pay for the underlying compute and
storage resources within SageMaker or other AWS services, based on your usage. For more information
about the cost of using SageMaker, see Amazon SageMaker Pricing.
Topics
• Get started with Amazon SageMaker Autopilot (p. 468)
Get started
Topics
• Samples: Explore modeling with Amazon SageMaker Autopilot (p. 468)
• Videos: Use Autopilot to automate and explore the machine learning process (p. 469)
• Tutorials: Get started with Amazon SageMaker Autopilot (p. 470)
• Direct marketing with Amazon SageMaker Autopilot: This notebook demonstrates how to use the Bank
Marketing Data Set to predict whether a customer will enroll for a term deposit at a bank. You can
use Autopilot on this dataset to get the most accurate ML pipeline by exploring options contained
in various candidate pipelines. Autopilot generates each candidate in a two-step procedure. The first
step performs automated feature engineering on the dataset. The second step trains and tunes an
algorithm to produce a model. The notebook contains instructions on how to train the model and how
to deploy the model to perform batch inference using the best candidate.
• Customer Churn Prediction with Amazon SageMaker Autopilot: This notebook describes using
machine learning for the automated identification of unhappy customers, also known as customer
churn prediction. The sample shows how to analyze a publicly available dataset and perform feature
engineering on it. Next it shows how to tune a model by selecting the best performing pipeline along
with the optimal hyperparameters for the training algorithm. Finally, it shows how to deploy the
model to a hosted endpoint and how to evaluate its predictions against ground truth. However, ML
models rarely give perfect predictions. That's why this notebook also shows how to incorporate the
relative costs of prediction mistakes when determining the financial outcome of using ML.
• Top Candidates Customer Churn Prediction with Amazon SageMaker Autopilot and Batch Transform
(Python SDK): This notebook also describes using machine learning for the automated identification
of unhappy customers, also known as customer churn prediction. This notebook demonstrates how
to configure the model to obtain the inference probability, select the top N models, and run a batch
transform on a hold-out test set for evaluation.
Note
This notebook works with SageMaker Python SDK >= 1.65.1 released on 6/19/2020.
• Bringing your own data processing code to Amazon SageMaker Autopilot: This notebook demonstrates
how to incorporate and deploy custom data processing code when using Amazon SageMaker
Autopilot. It adds a custom feature selection step to remove irrelevant variables to an Autopilot job. It
then shows how to deploy both the custom processing code and models generated by Autopilot on a
real-time endpoint and, alternatively, for batch processing.
Topics
• Start an AutoML job with Amazon SageMaker Autopilot (p. 469)
• Review data exploration and feature engineering automated in Autopilot. (p. 469)
• Tune models to optimize performance (p. 469)
• Choose and deploy the best model (p. 469)
• Amazon SageMaker Autopilot tutorial (p. 469)
optimized using auto-generated notebooks. We also look at the top candidates with Amazon SageMaker
Experiments. Finally, we deploy the top candidate (based on XGBoost), and configure data capture with
SageMaker Model Monitor.
• Create a machine learning model automatically with Autopilot: You assume the role of a developer
working at a bank in this tutorial. You have been asked to develop a machine learning model to predict
if a customer will enroll for a certificate of deposit (CD). This is a binary classification problem. The
model is trained on the marketing dataset that contains information on customer demographics,
responses to marketing events, and external factors.
You can use a user interface (Amazon SageMaker Studio UI) to help you populate the input, output,
target, and parameters to run and evaluate an Autopilot experiment, or you can use the SageMaker API Reference.
The UI has descriptions, toggle switches, dropdown menus, radio buttons, and more to help you navigate
creating your model candidates. You can also view statistics while the experiment is running. After it
runs, you can compare trials and delve into the details of the pre-processing steps, algorithms, and
hyperparameter ranges of each model. You also have the option to download their explainability and
performance reports. Use the provided notebooks to see the results of the automated data exploration
or the candidate model definitions.
The following instructions show how to create an Amazon SageMaker Autopilot job as a pilot experiment
using Studio UI or SageMaker API reference. You name your experiment, provide locations for the input
and output data, and specify which target data to predict. Optionally, you can also specify the type of
machine learning problem that you want to solve, choose your modeling strategy (stacked ensembles or
hyperparameter optimization), select the list of algorithms used by the Autopilot job to train the data,
and more.
Create an Autopilot experiment using Studio
3. On the Home tab, choose the AutoML card. This opens a new AutoML tab.
4. Choose Create an AutoML experiment. This opens a new Create experiment tab.
5. In the Experiment and data details section, enter the following information:
a. Experiment name – Must be unique to your account in the current AWS Region and contain a
maximum of 63 alphanumeric characters. Can include hyphens (-) but not spaces.
b. Input data – Provide the Amazon Simple Storage Service (Amazon S3) bucket location of your
input data. This S3 bucket must be in your current AWS Region. The URL must be in an s3://
format where Amazon SageMaker has write permissions. The file must be in CSV or Parquet
format and contain at least 500 rows. Select Browse to scroll through available paths and
Preview to see a sample of your input data.
c. Is your S3 input a manifest file? – A manifest file includes metadata with your input data. The
metadata specifies the location of your data in Amazon S3. It also specifies how the data is
formatted and which attributes from the dataset to use when training your model. You can use
a manifest file as an alternative to preprocessing when your labeled data is being streamed in
Pipe mode.
d. Auto split data? – Autopilot can split your data into an 80-20% split for training and validation
data. If you prefer a custom split, choose Specify split ratio. To use a custom dataset for
validation, choose Provide a validation set.
e. Output data location (S3 bucket) – The name of the S3 bucket location where you want to
store the output data. The URL for this bucket must be in an Amazon S3 format where Amazon
SageMaker has write permissions. The S3 bucket must be in the current AWS Region. Autopilot
can also create this for you in the same location as your input data.
6. Choose Next: Target and features. The Target and features tab opens.
7. In the Target and features section:
For more information on the training modes and the available algorithms, see the Autopilot
training modes section in the Training modes and algorithms page.
10. Choose Next: Deployment and advanced settings to open the Deployment and advanced settings
tab. Settings include auto display endpoint name, machine learning problem type, and additional
choices for running your experiment.
a. Deployment settings – Autopilot can automatically create an endpoint and deploy your model
for you.
If you imported your data from Amazon SageMaker Data Wrangler, you have additional options to
auto deploy the best model with or without the transforms from Data Wrangler.
Note
If your Data Wrangler flow contains multi-row operations such as groupby, join,
or concatenate, you won't be able to auto deploy with these transforms. For more
information, see Automatically Train Models on Your Data Flow.
b. Advanced settings (optional) – Autopilot provides additional controls to manually set
experimental parameters such as defining your problem type, time constraints on your
Autopilot job and trials, security, and encryption settings.
• Machine learning problem type – Autopilot can automatically select the machine learning
problem type. If you prefer to choose it manually, use the Select the machine learning
problem type dropdown menu.
A. Auto – Autopilot infers the problem type from the values of the attribute that you
want to predict. In some cases, SageMaker is unable to infer accurately. When that
happens, you must provide the value for the job to succeed.
B. Binary classification – Binary classification is a type of supervised learning that assigns
an individual to one of two predefined and mutually exclusive classes, based on their
attributes. For example, medical diagnosis based on results of diagnostic tests that
determine if someone has a disease.
C. Regression – Regression estimates the values of a dependent target variable based
on one or more variables or attributes that are correlated with it. For example, house
prices based on features, such as square footage and number of bathrooms.
D. Multiclass classification – Multiclass classification is a type of supervised learning that
assigns an individual to one of several classes based on their attributes. For example,
the prediction of the topic most relevant to a text document, such as politics, finance,
or philosophy.
c. Choose Next: Review and create to get a summary of your Autopilot experiment before you
create it.
11. Select Create experiment. The creation of the experiment starts an Autopilot job in SageMaker.
Autopilot provides status on the course of the experiment, information on the data exploration
process and model candidates in notebooks, a list of generated models and their reports, and the
job profile used to create them.
For information on the notebooks generated by an Autopilot job, see Amazon SageMaker Autopilot
notebooks generated to manage AutoML tasks (p. 509). For information on the details of
each model candidate and their reports, see Models generated by Amazon SageMaker Autopilot
(p. 497).
Note
To avoid incurring unnecessary charges: If you deploy a model that is no longer needed, delete
the endpoints and resources that were created during that deployment. Information about
pricing instances by Region is available at Amazon SageMaker Pricing.
For information on how this API action translates into a function in the language of your choice, see the
See Also section of CreateAutoMLJob and choose an SDK.
As an example, for Python users, see the full request syntax of create_auto_ml_job in AWS SDK for
Python.
Required parameters
When using CreateAutoMLJob to create an AutoML job, you must provide the following four values:
• AutoMLJobName – The name of your AutoML job.
• InputDataConfig – The data source channels, including the TargetAttributeName that identifies the column to predict.
• OutputDataConfig – The Amazon S3 output path for the job's artifacts.
• RoleArn – The ARN of the role that Amazon SageMaker uses to access your data.
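A minimal boto3 sketch supplying only these required values; the job name, bucket paths, target column, and role ARN are placeholders.
import boto3

sm = boto3.client("sagemaker")
sm.create_auto_ml_job(
    AutoMLJobName="my-autopilot-job",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://DOC-EXAMPLE-BUCKET/input/",
        }},
        "TargetAttributeName": "target",  # the column to predict
    }],
    OutputDataConfig={"S3OutputPath": "s3://DOC-EXAMPLE-BUCKET/output/"},
    RoleArn="arn:aws:iam::111122223333:role/sagemaker-execution-role",
)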
Optional parameters
The following sections provide details of some additional parameters that you can pass to your AutoML
job.
You can select the training method with the AutoMLJobConfig.Mode parameter. If you keep it blank (or null), the Mode is inferred based on the size of your dataset.
For information on Autopilot's stacked ensembles and hyperparameter optimization training methods,
see Training modes and algorithm support (p. 476).
You can select the features used in training by providing the FeatureSpecificationS3Uri attribute within
CandidateGenerationConfig, as follows:
{
    "AutoMLJobConfig": {
        "CandidateGenerationConfig": {
            "FeatureSpecificationS3Uri": "string"
        }
    }
}
Selected features should be contained within a JSON file in the following format:
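A sketch of the expected contents, assuming the FeatureAttributeNames key used by FeatureSpecificationS3Uri; the column names are placeholders.
{
    "FeatureAttributeNames": ["col1", "col2", "col3"]
}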
The values listed in ["col1", "col2", ...] are case sensitive. They should be a list of strings
containing unique values that are subsets of the column names in the input data.
Note
The list of columns provided as features cannot include the target column.
Algorithms selection
By default, your Autopilot job runs a pre-defined list of algorithms on your dataset to train
model candidates. The list of algorithms depends on the training mode (ENSEMBLING or
HYPERPARAMETER_TUNING) used by the job.
You can provide a subset of the default selection of algorithms by adding the AlgorithmsConfig
attribute and its nested AutoMLAlgorithms field to CandidateGenerationConfig within the
CreateAutoMLJob API.
For the list of available algorithms per training Mode, see AutoMLAlgorithms. For details on each
algorithm, see Training modes and algorithm support (p. 476).
{
    "AutoMLJobConfig": {
        "CandidateGenerationConfig": {
            "AlgorithmsConfig": [
                {"AutoMLAlgorithms": ["xgboost", "fastai", "catboost"]}
            ]
        },
        "Mode": "ENSEMBLING"
    }
}
You can provide your own validation dataset and custom data split ratio, or let Autopilot split the
dataset automatically. Each AutoMLChannel object (see the required parameter InputDataConfig) has
a ChannelType, which can be set to either training or validation values that specify how the data
is to be used when building a machine learning model. At least one data source must be provided and a
maximum of two data sources is allowed: one for training data and one for validation data.
How you split the data into training and validation datasets depends on whether you have one or two
data sources.
• If you only have one data source, the ChannelType is set to training by default and must have this
value.
• If the ValidationFraction value in AutoMLDataSplitConfig is not set, 0.2 (20%) of the data
from this source is used for validation by default.
• If the ValidationFraction is set to a value between 0 and 1, the dataset is split based on the
value specified, where the value specifies the fraction of the dataset used for validation.
• If you have two data sources, the ChannelType of one of the AutoMLChannel objects must be set to
training, the default value. The ChannelType of the other data source must be set to validation.
The two data sources must have the same format, either CSV or Parquet, and the same schema. You
must not set the value for the ValidationFraction in this case because all of the data from each
source is used for either training or validation. Setting this value causes an error.
For information on split and cross-validation in Autopilot, see Cross-validation in Autopilot (p. 481).
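The following sketch illustrates an InputDataConfig with two AutoMLChannel objects, one for training and one for validation; the bucket names and target column are placeholders.
input_data_config = [
    {
        "ChannelType": "training",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://DOC-EXAMPLE-BUCKET/train/",
        }},
        "TargetAttributeName": "target",
    },
    {
        "ChannelType": "validation",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://DOC-EXAMPLE-BUCKET/validation/",
        }},
        "TargetAttributeName": "target",
    },
]
# Do not set ValidationFraction in AutoMLDataSplitConfig when two sources are used.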
You can set the type of problem on an AutoML job with the CreateAutoMLJob ProblemType
parameter. This limits the kind of preprocessing and algorithms that Autopilot tries.
After the job is finished, if you had set the ProblemType, then the
ResolvedAttributes.ProblemType matches the ProblemType you set. If you keep it blank (or null),
the ProblemType is inferred on your behalf.
Note
In some cases, Autopilot is unable to infer the ProblemType with high enough confidence, in
which case you must provide the value for the job to succeed.
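For example, a sketch that sets the problem type explicitly; it reuses the placeholder names from the earlier snippets.
import boto3

sm = boto3.client("sagemaker")
sm.create_auto_ml_job(
    AutoMLJobName="my-autopilot-job",
    InputDataConfig=input_data_config,  # defined in the previous example
    OutputDataConfig={"S3OutputPath": "s3://DOC-EXAMPLE-BUCKET/output/"},
    RoleArn="arn:aws:iam::111122223333:role/sagemaker-execution-role",
    # Accepted values: "BinaryClassification", "MulticlassClassification", "Regression"
    ProblemType="BinaryClassification",
)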
You can add a sample weights column to your tabular dataset and then pass it to your AutoML job to
request dataset rows to be weighted during training and evaluation.
To set sample weights when creating an experiment (see CreateAutoMLJob), you can pass the name of
your sample weights column in the SampleWeightAttributeName parameter of the AutoMLChannel
object. This ensures that your objective metric uses the weights for the training, evaluation, and selection
of model candidates.
Support for sample weights is available in ensembling mode only. Your weights should be numeric and
non-negative. Data points with invalid or no weight value are excluded. For more information on the
available objective metrics, see Autopilot weighted metrics (p. 480).
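For example, a sketch of a training channel with a sample weights column; the column name weight is a placeholder.
input_data_config = [{
    "ChannelType": "training",
    "DataSource": {"S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": "s3://DOC-EXAMPLE-BUCKET/train/",
    }},
    "TargetAttributeName": "target",
    # Weights must be numeric and non-negative; rows with invalid
    # or missing weights are excluded (ensembling mode only).
    "SampleWeightAttributeName": "weight",
}]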
Topics
• Autopilot datasets, data types, and formats (p. 475)
• Amazon SageMaker Autopilot problem types (p. 475)
Autopilot datasets, data types, and formats
• CSV (comma-separated values) is a row-based file format that stores data in human-readable
plaintext. It is a popular choice for data exchange because it is supported by a wide range of
applications.
• Parquet is a column-based file format in which data is stored and processed more efficiently than in
row-based file formats. This makes it a better option for big data problems.
The data types accepted for columns include numerical, categorical, text, and time series that consist
of strings of comma-separated numbers. If Autopilot detects that it is dealing with time series
sequences, it processes them through specialized feature transformers provided by the tsfresh library.
This library takes the time series as an input and outputs features such as the highest absolute value of
the time series or descriptive statistics on autocorrelation. These output features are then used as
inputs to one of the three problem types.
Autopilot supports building machine learning models on large datasets up to hundreds of GBs.
For details on the default resource limits for input datasets and how to increase them, see Amazon
SageMaker Autopilot quotas (p. 522).
Amazon SageMaker Autopilot problem types
Regression
Regression estimates the values of a dependent target variable based on one or more other variables or
attributes that are correlated with it. An example is the prediction of house prices using features like the
number of bathrooms and bedrooms, square footage of the house and garden. Regression analysis can
create a model that takes one or more of these features as an input and predicts the price of a house.
Binary classification
Binary classification is a type of supervised learning that assigns an individual to one of two predefined
and mutually exclusive classes based on their attributes. It is supervised because the models are trained
using examples where the attributes are provided with correctly labelled objects. A medical diagnosis
for whether an individual has a disease or not based on the results of diagnostic tests is an example of
binary classification.
Multiclass classification
Multiclass classification is a type of supervised learning that assigns an individual to one of several
classes based on their attributes. It is supervised because the models are trained using examples where
the attributes are provided with correctly labelled objects. An example is the prediction of the topic most
relevant to a text document. A document may be classified as being about, say, religion or politics or
finance, or about one of several other predefined topic classes.
Training modes
SageMaker Autopilot can automatically select the training method based on the dataset size, or you can
select it manually. The choices are as follows:
• Ensembling – Autopilot uses the AutoGluon library to train several base models. To find the best
combination for your dataset, ensemble mode runs 10 trials with different model and meta parameter
settings. Then Autopilot combines these models using a stacking ensemble method to create an
optimal predictive model. For a list of algorithms that Autopilot supports in ensembling mode, see the
following Algorithm support section.
• Hyperparameter optimization (HPO) – Autopilot finds the best version of a model by tuning
hyperparameters using Bayesian optimization or multi-fidelity optimization while running training jobs
on your dataset. HPO mode selects the algorithms that are most relevant to your dataset and selects
the best range of hyperparameters to tune your models. To tune your models, HPO mode runs up to
100 trials (default) to find the optimal hyperparameters settings within the selected range. If your
dataset size is less than 100 MB, Autopilot uses Bayesian optimization. Autopilot chooses multi-fidelity
optimization if your dataset is larger than 100 MB.
In multi-fidelity optimization, metrics are continuously emitted from the training containers. A trial
that is performing poorly against a selected objective metric is stopped early. A trial that is performing
well is allocated more resources.
For a list of algorithms that Autopilot supports in HPO mode, see the following Algorithm support
section.
• Auto – Autopilot automatically chooses either ensembling mode or HPO mode based on your dataset
size. If your dataset is larger than 100 MB, Autopilot chooses HPO. Otherwise, it chooses ensembling
mode. Autopilot can fail to read the size of your dataset in the following cases.
• If you enable Virtual Private Cloud (VPC) mode for an AutoML job, but the S3 bucket containing the
dataset only allows access from the VPC.
• The input S3DataType of your dataset is a ManifestFile.
• The input S3Uri contains more than 1000 items.
If Autopilot is unable to read your dataset size, it defaults to choosing HPO mode.
Note
For optimal runtime and performance, use ensemble training mode for datasets that are smaller
than 100 MB.
Algorithm support
In HPO mode, Autopilot supports the following types of machine learning algorithms:
• Linear learner – A supervised learning algorithm that can solve either classification or regression
problems.
• XGBoost – A supervised learning algorithm that attempts to accurately predict a target variable by
combining an ensemble of estimates from a set of simpler and weaker models.
• Deep learning algorithm – A multilayer perceptron (MLP) and feedforward artificial neural network.
This algorithm can handle data that is not linearly separable.
Note
You don't need to specify an algorithm to use for your machine learning problem. Autopilot
automatically selects the appropriate algorithm to train.
In ensembling mode, Autopilot supports the following types of machine learning algorithms:
• LightGBM – An optimized framework that uses tree-based algorithms with gradient boosting. This
algorithm uses trees that grow in breadth, rather than depth, and is highly optimized for speed.
• CatBoost – A framework that uses tree-based algorithms with gradient boosting. Optimized for
handling categorical variables.
• XGBoost – A framework that uses tree-based algorithms with gradient boosting that grows in depth,
rather than breadth.
• Random Forest – A tree-based algorithm that uses several decision trees on random sub-samples of
the data with replacement. The trees are split into optimal nodes at each level. The decisions of each
tree are averaged together to prevent overfitting and improve predictions.
• Extra Trees – A tree-based algorithm that uses several decision trees on the entire dataset. The trees
are split randomly at each level. The decisions of each tree are averaged to prevent overfitting and to
improve predictions. Extra trees add a degree of randomization in comparison to the random forest
algorithm.
• Linear Models – A framework that uses a linear equation to model the relationship between two
variables in observed data.
• Neural network torch – A neural network model that's implemented using PyTorch.
• Neural network fast.ai – A neural network model that's implemented using fast.ai.
Metrics and validation
Autopilot metrics
The following list contains the names of the metrics that are currently available to measure model
performance within Autopilot.
Note
Autopilot supports sample weights. To learn more about sample weights and the available
objective metrics, see Autopilot weighted metrics (p. 480).
Accuracy
The ratio of the number of correctly classified items to the total number of (correctly and
incorrectly) classified items. It is used for both binary and multiclass classification. Accuracy
measures how close the predicted class values are to the actual values. Values for accuracy metrics
vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates perfect
inaccuracy.
AUC
The area under the curve (AUC) metric is used to compare and evaluate binary classification by
algorithms that return probabilities, such as logistic regression. To map the probabilities into
classifications, these are compared against a threshold value.
The relevant curve is the receiver operating characteristic curve (ROC curve). The ROC curve plots
the true positive rate (TPR) of predictions (or recall) against the false positive rate (FPR) as a function
of the threshold value, above which a prediction is considered positive. Increasing the threshold
results in fewer false positives, but more false negatives.
AUC is the area under this ROC curve. Therefore, AUC provides an aggregated measure of the model
performance across all possible classification thresholds. AUC scores vary between 0 and 1. A score
of 1 indicates perfect accuracy, and a score of one half (0.5) indicates that the prediction is not
better than a random classifier.
BalancedAccuracy
BalancedAccuracy is a metric that measures the ratio of accurate predictions to all predictions.
This ratio is calculated after normalizing true positives (TP) and true negatives (TN) by the total
number of positive (P) and negative (N) values. It is used in both binary and multiclass classification
and is defined as follows: 0.5*((TP/P)+(TN/N)), with values ranging from 0 to 1. BalancedAccuracy
gives a better measure of accuracy when the number of positives or negatives differ greatly from
each other in an imbalanced dataset, such as when only 1% of email is spam.
F1
The F1 score is the harmonic mean of the precision and recall, defined as follows: F1 = 2 * (precision
* recall) / (precision + recall). It is used for binary classification into classes traditionally referred to
as positive and negative. Predictions are said to be true when they match their actual (correct) class,
and false when they do not.
Precision is the ratio of the true positive predictions to all positive predictions, and it includes the
false positives in a dataset. Precision measures the quality of the prediction when it predicts the
positive class.
Recall (or sensitivity) is the ratio of the true positive predictions to all actual positive instances.
Recall measures how completely a model predicts the actual class members in a dataset.
F1 scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0
indicates the worst.
F1macro
The F1macro score applies F1 scoring to multiclass classification problems. It does this by
calculating the precision and recall, and then taking their harmonic mean to calculate the F1 score
for each class. Lastly, the F1macro averages the individual scores to obtain the F1macro score.
F1macro scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0
indicates the worst.
InferenceLatency
Inference latency is the approximate amount of time between making a request for a model
prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is
measured in seconds and is only available in ensembling mode.
LogLoss
Log loss, also known as cross-entropy loss, is a metric used to evaluate the quality of the probability
outputs, rather than the outputs themselves. It is used in both binary and multiclass classification
and in neural nets. It is also the cost function for logistic regression. Log loss is an important metric
to indicate when a model makes incorrect predictions with high probabilities. Values range from 0 to
infinity. A value of 0 represents a model that perfectly predicts the data.
MAE
The mean absolute error (MAE) is a measure of how different the predicted and actual values are,
when they're averaged over all values. MAE is commonly used in regression analysis to understand
model prediction error. In linear regression, MAE represents the average distance from
a predicted line to the actual values. MAE is defined as the sum of absolute errors divided by the
number of observations. Values range from 0 to infinity, with smaller numbers indicating a better
model fit to the data.
MSE
The mean squared error (MSE) is the average of the squared differences between the predicted
and actual values. It is used for regression. MSE values are always positive. The better a model is at
predicting the actual values, the smaller the MSE value is.
Precision
Precision measures how well an algorithm predicts the true positives (TP) out of all of the positives
that it identifies. It is defined as follows: Precision = TP/(TP+FP), with values ranging from zero (0) to
one (1), and is used in binary classification. Precision is an important metric when the cost of a false
positive is high. For example, the cost of a false positive is very high if an airplane safety system is
falsely deemed safe to fly. A false positive (FP) reflects a positive prediction that is actually negative
in the data.
PrecisionMacro
The precision macro computes precision for multiclass classification problems. It does this by
calculating precision for each class and averaging scores to obtain precision for several classes.
PrecisionMacro scores range from zero (0) to one (1). Higher scores reflect the model's ability
to predict true positives (TP) out of all of the positives that it identifies, averaged across multiple
classes.
R2
R2, also known as the coefficient of determination, is used in regression to quantify how much a
model can explain the variance of a dependent variable. Values range from one (1) to negative one
(-1). Higher numbers indicate a higher fraction of explained variability. R2 values close to zero (0)
indicate that very little of the dependent variable can be explained by the model. Negative values
indicate a poor fit and that the model is outperformed by a constant function. For linear regression,
this is a horizontal line.
Recall
Recall measures how well an algorithm correctly predicts all of the true positives (TP) in a dataset. A
true positive is a positive prediction that is also an actual positive value in the data. Recall is defined
as follows: Recall = TP/(TP+FN), with values ranging from 0 to 1. Higher scores reflect a better ability
of the model to predict true positives (TP) in the data. It is used in binary classification.
Recall is important when testing for cancer because it's used to find all of the true positives. A false
positive (FP) reflects a positive prediction that is actually negative in the data. It is often insufficient
to measure only recall, because predicting every output as a true positive yields a perfect recall
score.
RecallMacro
The RecallMacro computes recall for multiclass classification problems by calculating recall for
each class and averaging scores to obtain recall for several classes. RecallMacro scores range from
0 to 1. Higher scores reflect the model's ability to predict true positives (TP) in a dataset, whereas a
true positive reflects a positive prediction that is also an actual positive value in the data. It is often
insufficient to measure only recall, because predicting every output as a true positive will yield a
perfect recall score.
RMSE
Root mean squared error (RMSE) measures the square root of the average squared difference
between predicted and actual values. It is used in regression analysis to
understand model prediction error. It's an important metric to indicate the presence of large model
errors and outliers. Values range from zero (0) to infinity, with smaller numbers indicating a better
model fit to the data. RMSE is dependent on scale, and should not be used to compare datasets of
different sizes.
Metrics that are automatically calculated for a model candidate are determined by the type of problem
being addressed.
Autopilot weighted metrics
Users can add a sample weights column to their data to ensure that each observation used to train
a machine learning model is given a weight corresponding to its perceived importance. This is
especially useful in scenarios in which the observations in the dataset have varying degrees
of importance, or in which a dataset contains a disproportionate number of samples from one class
compared to others. Assigning a weight to each observation based on its importance, or assigning
greater weight to a minority class, can improve a model's overall performance and help ensure that the
model is not biased toward the majority class.
For information about how to pass sample weights when creating an experiment in the Studio UI, see
Step 7 in Create an Autopilot experiment using Studio.
For information about how to pass sample weights programmatically when creating an Autopilot
experiment using the API, see How to add sample weights to an AutoML job in Create an Autopilot
experiment programmatically.
Cross-validation in Autopilot
Cross-validation is used to reduce overfitting and bias in model selection. It is also used to assess how
well a model can predict the values of an unseen validation dataset, if the validation dataset is drawn
from the same population. This method is especially important when training on datasets that have a
limited number of training instances.
Autopilot uses cross-validation to build models in hyperparameter optimization (HPO) and ensemble
training mode. The first step in the Autopilot cross-validation process is to split the data into k-folds.
K-fold splitting
K-fold splitting is a method that separates an input training dataset into multiple training and validation
datasets. The dataset is split into k equally sized sub-samples called folds. Models are then trained
on k-1 folds and tested against the remaining kth fold, which is the validation dataset. The process is
repeated k times, using a different fold for validation each time.
The following image depicts k-fold splitting with k = 4 folds. Each fold is represented as a row. The dark-
toned boxes represent the parts of the data used in training. The remaining light-toned boxes indicate
the validation datasets.
Autopilot uses k-fold cross-validation for both hyperparameter optimization (HPO) mode and
ensembling mode.
You can deploy Autopilot models that are built using cross-validation like you would with any other
Autopilot or SageMaker model.
HPO mode
K-fold cross-validation uses the k-fold splitting method for cross-validation. In HPO mode, Autopilot
automatically implements k-fold cross-validation for small datasets with 50,000 or fewer training
instances. Performing cross-validation is especially important when training on small datasets because it
protects against overfitting and selection bias.
HPO mode uses a k value of 5 on each of the candidate algorithms that are used to model the dataset.
Multiple models are trained on different splits, and the models are stored separately. When training is
complete, validation metrics for each of the models are averaged to produce a single estimation metric.
Lastly, Autopilot combines the models from the trial with the best validation metric into an ensemble
model. Autopilot uses this ensemble model to make predictions.
The validation metric for the models trained by Autopilot is presented as the objective metric in
the model leaderboard. Autopilot uses the default validation metric for each problem type that it
handles, unless you specify otherwise. For the list of all metrics that Autopilot uses, see Autopilot
metrics (p. 478).
For example, the Boston Housing dataset contains only 861 samples. If you build a model to predict
house sale prices using this dataset without cross-validation, you risk training on a dataset that is not
representative of the Boston housing stock. If you split the data only once into training and validation
subsets, the training fold may only contain data mainly from the suburbs. As a result, you would train on
data that isn't representative of the rest of the city. In this example, your model would likely overfit on
this biased selection. K-fold cross-validation can reduce the risk of this kind of error by making full and
randomized use of the available data for both training and validation.
Cross-validation can increase training times by an average of 20%. Training times may also increase
significantly for complex datasets.
Note
In HPO mode, you can see the training and validation metrics from each fold in your /aws/
sagemaker/TrainingJobs CloudWatch Logs. For more information about CloudWatch Logs,
see Log Amazon SageMaker Events with Amazon CloudWatch (p. 3284).
Ensembling mode
Note
Autopilot supports sample weights in ensembling mode. For the list of available metrics
supporting sample weights, see Autopilot metrics (p. 478).
In ensembling mode, cross-validation is performed regardless of dataset size. Customers can either
provide their own validation dataset and custom data split ratio, or let Autopilot split the dataset
automatically into an 80-20% split ratio. The training data is then split into k-folds for cross-validation,
where the value of k is determined by the AutoGluon engine. An ensemble consists of multiple machine
learning models, where each model is known as the base model. A single base model is trained on (k-1)
folds and makes out-of-fold predictions on the remaining fold. This process is repeated for all k folds,
and the out-of-fold (OOF) predictions are concatenated to form a single set of predictions. All base
models in the ensemble follow this same process of generating OOF predictions.
The following image depicts k-fold validation with k = 4 folds. Each fold is represented as a row. The
dark-toned boxes represent the parts of the data used in training. The remaining light-toned boxes
indicate the validation datasets.
In the upper part of the image, in each fold, the first base model makes predictions on the validation
dataset after training on the training datasets. At each subsequent fold, the datasets change roles. A
dataset that was previously used for training is now used for validation, and this also applies in reverse.
At the end of k folds, all of the predictions are concatenated to form a single set of predictions called an
out-of-fold (OOF) prediction. This process is repeated for each of the n base models.
The OOF predictions for each base model are then used as features to train a stacking model. The
stacking model learns the importance weights for each base model. These weights are used to combine
the OOF predictions to form the final prediction. Performance on the validation dataset determines
which base or stacking model is the best, and this model is returned as the final model.
The validation datasets from each fold are also used for hyperparameter tuning of the base models and
the stacking model.
Model Deployment and Prediction
After you train your SageMaker Autopilot models, you can deploy them to get predictions in one of two
ways:
1. Use Real-time inferencing (p. 483) to set up an endpoint and obtain predictions interactively.
2. Use Batch inferencing (p. 490) to make predictions in parallel on batches of observations on an
entire dataset.
Note
To avoid incurring unnecessary charges: After the endpoints and resources that were created
from model deployment are no longer needed, you can delete them. For information about pricing
of instances by Region, see Amazon SageMaker Pricing.
Real-time inferencing
Real-time inference is ideal for inference workloads where you have real-time, interactive, low
latency requirements. This section shows how you can use real-time inferencing to obtain predictions
interactively from your model.
To deploy the model that produced the best validation metric in an Autopilot experiment, you have
several options. For example, when using Autopilot in SageMaker Studio, you can deploy the model
automatically or manually. You can also use SageMaker APIs to manually deploy an Autopilot model.
The following tabs show three options for deploying your model. These instructions assume that you
have already created a model in Autopilot. If you don't have a model, see Create an Amazon SageMaker
Autopilot experiment (p. 470). To see examples for each option, open each tab.
• Automatic Deployment: To automatically deploy the best model from an Autopilot experiment to an
endpoint
1. Create an experiment in SageMaker Studio.
2. Toggle the Auto deploy value to Yes.
Note
Automatic deployment will fail if either the default resource quota or your customer
quota for endpoint instances in a Region is too limited. In hyperparameter optimization
(HPO) mode, you are required to have at least two ml.m5.2xlarge instances. In ensembling
mode, you are required to have at least one ml.m5.12xlarge instance. If you encounter a
failure related to quotas, you can request a service limit increase for SageMaker endpoint
instances.
• Manual Deployment: To manually deploy the best model from an Autopilot experiment to an
endpoint
1. Create an experiment in SageMaker Studio.
2. Toggle the Auto deploy value to No.
3. Select the model that you want to deploy under Model name.
4. Select the orange Deployment and advanced settings button located on the right of the
leaderboard. This opens a new tab.
5. Configure the endpoint name, instance type, and other optional information.
6. Select the orange Deploy model to deploy to an endpoint.
7. Check the progress of the endpoint creation process in the SageMaker console at
https://console.aws.amazon.com/sagemaker/ by navigating to the Endpoints section. That section is
located in the Inference dropdown menu in the navigation panel.
8. After the endpoint status changes from Creating to InService, return to Studio and invoke the
endpoint.
For complete code examples for both AWS CLI commands and AWS SDK for Python (Boto3), open the
tabs directly following these steps.
1. Obtain the candidate container definitions from InferenceContainers. These candidate definitions are
used to create a SageMaker model.
The following example uses the DescribeAutoMLJob API to obtain candidate definitions for the best
model candidate. See the following AWS CLI command as an example.
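A sketch of such a command, using a placeholder job name.
aws sagemaker describe-auto-ml-job \
    --auto-ml-job-name test-automl-job \
    --query 'BestCandidate.InferenceContainers'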
2. List candidates
The following example uses the ListCandidatesForAutoMLJob API to list all candidates. See the
following AWS CLI command as an example.
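A sketch, with the job name as a placeholder.
aws sagemaker list-candidates-for-auto-ml-job \
    --auto-ml-job-name test-automl-job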
3. Use the container definitions from the previous steps to create a SageMaker model by using the
CreateModel API. See the following AWS CLI command as an example.
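A sketch, assuming the container definitions from step 1 are saved to a local file named inference-containers.json; the file name and ARNs are placeholders.
aws sagemaker create-model \
    --model-name test-model \
    --containers file://inference-containers.json \
    --execution-role-arn arn:aws:iam::111122223333:role/sagemaker-execution-role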
4. The following example uses the CreateEndpointConfig API to create an endpoint configuration. See
the following AWS CLI command as an example.
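A sketch, using the placeholder names from the previous steps.
aws sagemaker create-endpoint-config \
    --endpoint-config-name test-endpoint-config \
    --production-variants VariantName=variant1,ModelName=test-model,InitialInstanceCount=1,InstanceType=ml.m5.2xlarge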
5. The following AWS CLI example uses the CreateEndpoint API to create the endpoint.
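A sketch, using the placeholder names from the previous steps.
aws sagemaker create-endpoint \
    --endpoint-name test-endpoint \
    --endpoint-config-name test-endpoint-config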
Check the progress of your endpoint deployment by using the DescribeEndpoint API. See the
following AWS CLI command as an example.
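A sketch that reads the endpoint status.
aws sagemaker describe-endpoint \
    --endpoint-name test-endpoint \
    --query 'EndpointStatus'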
After the EndpointStatus changes to InService, the endpoint is ready to use for real-time
inference.
6. Invoke the endpoint
The following command structure invokes the endpoint for real-time inferencing.
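A sketch with a placeholder payload; depending on your AWS CLI version, you may also need to pass --cli-binary-format raw-in-base64-out for a text body.
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint \
    --content-type 'text/csv' \
    --body '1,2,3,4' \
    '/tmp/inference_output'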
The following tabs contain complete code examples for deploying a model with AWS SDK for Python
(Boto3) or the AWS CLI.
import boto3

# Create a SageMaker client and reference your completed AutoML job
sm_client = boto3.client('sagemaker')
job_name = 'test-automl-job'  # placeholder AutoML job name

describe_response = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
# Extract the best candidate definition from the DescribeAutoMLJob response
best_candidate = describe_response['BestCandidate']
# Extract the InferenceContainers definition from the candidate definition
inference_containers = best_candidate['InferenceContainers']

# Create the model from the best candidate's inference containers
model_name = 'test-model'
sagemaker_role = 'arn:aws:iam::444455556666:role/sagemaker-execution-role'
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=sagemaker_role,
    Containers=inference_containers
)
3. Create the endpoint configuration by using the following code example.
endpoint_config_name = 'test-endpoint-config'
instance_type = 'ml.m5.2xlarge'
# For all supported instance types, see
# https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html#sagemaker-Type-ProductionVariant-InstanceType

# Create the endpoint configuration
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1
        }
    ]
)
4. Create the endpoint and deploy the model with the following code example.
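A sketch that reuses the client and configuration names defined in the previous steps.
create_endpoint_response = sm_client.create_endpoint(
    EndpointName='test-endpoint',
    EndpointConfigName=endpoint_config_name)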
Check the status of the endpoint creation by using the following code example.
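A sketch that reads the endpoint status; the status variable is used in the next step.
describe_endpoint_response = sm_client.describe_endpoint(EndpointName='test-endpoint')
status = describe_endpoint_response['EndpointStatus']
print(status)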
5. Invoke the endpoint for real-time inferencing by using the following command structure.
# Once the endpoint status is InService, you can invoke the endpoint for inferencing
if status == "InService":
    sm_runtime = boto3.Session().client('sagemaker-runtime')
    inference_result = sm_runtime.invoke_endpoint(EndpointName='test-endpoint',
                                                  ContentType='text/csv',
                                                  Body='1,2,3,4,class')
The create model command will return a response in the following format.
{
"ModelArn": "arn:aws:sagemaker:us-west-2:1234567890:model/test-sagemaker-model"
}
The create endpoint configuration command will return a response in the following format.
{
"EndpointConfigArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint-config/
test-endpoint-config"
}
The create endpoint command will return a response in the following format.
{
"EndpointArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint/test-endpoint"
}
Check the progress of the endpoint deployment by using the following describe-endpoint CLI
code example.
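A sketch, using the placeholder endpoint name.
aws sagemaker describe-endpoint \
    --endpoint-name test-endpoint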
The previous progress check will return a response in the following format.
{
"EndpointName": "test-endpoint",
"EndpointArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint/test-endpoint",
"EndpointConfigName": "test-endpoint-config",
"EndpointStatus": "Creating",
"CreationTime": 1660251167.595,
"LastModifiedTime": 1660251167.595
}
After the EndpointStatus changes to InService, the endpoint is ready for use in real-time
inference.
5. Invoke the endpoint for real-time inferencing by using the following command structure.
aws sagemaker-runtime invoke-endpoint \
--endpoint-name 'test-endpoint' \
--body '1,51,3.5,1.4,0.2' \
--content-type 'text/csv' \
'/tmp/inference_output'
To assume the role in the generating account, you must grant permission to the deploying account.
This allows the deploying account to describe Autopilot jobs in the generating account.
The following example uses a generating account with a trusted sagemaker-role entity. The
example shows how to give a deploying account with the ID 111122223333 permission to assume the
role of the generating account.
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"sagemaker.amazonaws.com"
],
"AWS": [ "111122223333"]
},
"Action": "sts:AssumeRole"
}
The deploying account with the ID 111122223333 can now assume the role in the generating account.
Next, call the DescribeAutoMLJob API from the deploying account to obtain a description of the job
created by the generating account.
The following code example describes the model from the deploying account.
import sagemaker
import boto3
session = sagemaker.session.Session()
sts_client = boto3.client('sts')
role = 'arn:aws:iam::111122223333:role/sagemaker-role'
role_session_name = "role-session-name"
_assumed_role = sts_client.assume_role(RoleArn=role, RoleSessionName=role_session_name)
credentials = _assumed_role["Credentials"]
access_key = credentials["AccessKeyId"]
secret_key = credentials["SecretAccessKey"]
session_token = credentials["SessionToken"]
session = boto3.session.Session()
sm_client = session.client('sagemaker',
aws_access_key_id=access_key,
aws_secret_access_key=secret_key,
aws_session_token=session_token)
job_name = "test-job"
response = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
2. Grant the deploying account access to the model artifacts in the generating account.
The deploying account needs access only to the model artifacts in the generating account to deploy a model. These are located in the S3OutputPath that was specified in the original CreateAutoMLJob API call during model generation.
To give the deploying account access to the model artifacts, choose one of the following options:
a. Give the deploying account access to the ModelDataUrl from the generating account. Then give the deploying account permission to assume the role, and follow the real-time inferencing steps to deploy.
b. Copy the model artifacts from the generating account's original S3OutputPath to the deploying account.
To grant access to the model artifacts, you must define a best_candidate model and reassign the model containers to the new account.
The following example shows how to define a best_candidate model and reassign the
ModelDataUrl.
best_candidate = automl.describe_auto_ml_job()['BestCandidate']
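The reassignment itself might look like the following sketch; the S3 paths are placeholders, and each inference container generally has its own copied artifact.

# point each inference container at the artifacts copied into the
# deploying account; these S3 paths are placeholders
copied_artifact_urls = [
    's3://deploying-account-bucket/artifacts/container-0/model.tar.gz',
    's3://deploying-account-bucket/artifacts/container-1/model.tar.gz',
    's3://deploying-account-bucket/artifacts/container-2/model.tar.gz',
]
for container, url in zip(best_candidate['InferenceContainers'], copied_artifact_urls):
    container['ModelDataUrl'] = url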
After this assignment of containers, follow the steps in Deploy using SageMaker APIs (p. 484) to
deploy.
To build a payload in real-time inferencing, see the notebook example to define a test payload. To create
the payload from a CSV file and invoke an endpoint, see the Predict with your model section in Create a
machine learning model automatically.
Batch inferencing
Batch inferencing, also known as offline inferencing, generates model predictions on a batch of
observations. Batch inference is a good option for large datasets or if you don't need an immediate
response to a model prediction request.
You can make batch inferences from an Autopilot model using the SageMaker Python SDK, the Autopilot
user interface (UI), the AWS SDK for Python (Boto3), or the AWS Command Line Interface (AWS CLI).
The following tabs show three options for deploying your model: using APIs, using the Autopilot UI, or using APIs to deploy from different accounts. These instructions assume that you have already created a model in
Autopilot. If you don't have a model, see Create an Amazon SageMaker Autopilot experiment (p. 470).
To see examples for each option, open each tab.
The following steps show how to deploy a model from an Autopilot experiment for batch predictions.
The following example shows how to use the DescribeAutoMLJob API to obtain candidate definitions
for the best model candidate. See the following AWS CLI command as an example.
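A sketch of the command; the job name and Region are placeholder values.

aws sagemaker describe-auto-ml-job \
--auto-ml-job-name 'test-automl-job' \
--region 'us-west-2'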
Use the ListCandidatesForAutoMLJob API to list all candidates. See the following AWS CLI command
as an example.
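A sketch of the command, using the same placeholder job name:

aws sagemaker list-candidates-for-auto-ml-job \
--auto-ml-job-name 'test-automl-job' \
--region 'us-west-2'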
To create a SageMaker model using the CreateModel API, use the container definitions from the
previous steps. See the following AWS CLI command as an example.
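A sketch of the command; the model name, the containers file holding the definitions from the previous steps, and the role ARN are placeholder values.

aws sagemaker create-model \
--model-name 'test-model' \
--containers file://inference-containers.json \
--execution-role-arn 'arn:aws:iam::1234567890:role/sagemaker-execution-role' \
--region 'us-west-2'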
The following example creates a SageMaker transform job with the CreateTransformJob API. See the
following AWS CLI command as an example.
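A sketch of the command; the job name, input and output locations, and instance settings are placeholder values that match the response shown later in this section.

aws sagemaker create-transform-job \
--transform-job-name 'test-transform-job' \
--model-name 'test-model' \
--transform-input '{"DataSource":{"S3DataSource":{"S3DataType":"S3Prefix","S3Uri":"s3://test-bucket/data.csv"}},"ContentType":"text/csv","SplitType":"Line"}' \
--transform-output '{"S3OutputPath":"s3://test-bucket/output/","AssembleWith":"Line"}' \
--transform-resources '{"InstanceType":"ml.m5.2xlarge","InstanceCount":1}' \
--region 'us-west-2'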
Check the progress of your transform job using the DescribeTransformJob API. See the following AWS
CLI command as an example.
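A sketch of the command, using the same placeholder transform job name:

aws sagemaker describe-transform-job \
--transform-job-name 'test-transform-job' \
--region 'us-west-2'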
After the job is finished, the predicted result will be available in <your-output-path>.
The output file name has the following format: <input_data_file_name>.out. As an example, if
your input file is text_x.csv, the output name will be text_x.csv.out.
The following tabs show code examples for SageMaker Python SDK, AWS SDK for Python (Boto3), and
the AWS CLI.
The following example uses the SageMaker Python SDK to make predictions in batches.
import sagemaker
from sagemaker.automl.automl import AutoML

sagemaker_session = sagemaker.session.Session()

# attach to an existing Autopilot job and select the best candidate;
# the job name is a placeholder
automl = AutoML.attach(auto_ml_job_name='test-automl-job')
best_candidate = automl.describe_auto_ml_job()['BestCandidate']
best_candidate_name = best_candidate['CandidateName']

# S3 locations for the batch input data and the prediction output; placeholders
input_data = 's3://test-bucket/data.csv'
output_path = 's3://test-bucket/output/'

# create model
model = automl.create_model(name=best_candidate_name,
candidate=best_candidate)
# create transformer
transformer = model.transformer(instance_count=1,
instance_type='ml.m5.2xlarge',
assemble_with='Line',
output_path=output_path)
# do batch transform
transformer.transform(data=input_data,
split_type='Line',
content_type='text/csv',
wait=True)
The following example uses AWS SDK for Python (Boto3) to make predictions in batches.
import sagemaker
import boto3

session = sagemaker.session.Session()
sm_client = boto3.client('sagemaker')

# the Autopilot job name and the execution role are placeholders
job_name = 'test-automl-job'
role = 'arn:aws:iam::444455556666:role/sagemaker-execution-role'

# S3 locations for the batch input data and the prediction output; placeholders
input_data = 's3://test-bucket/data.csv'
output_path = 's3://test-bucket/output/'

best_candidate = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)['BestCandidate']
best_candidate_containers = best_candidate['InferenceContainers']
best_candidate_name = best_candidate['CandidateName']

# create model
response = sm_client.create_model(
ModelName = best_candidate_name,
ExecutionRoleArn = role,
Containers = best_candidate_containers
)
# create transform job
response = sm_client.create_transform_job(
TransformJobName=f'{best_candidate_name}-transform-job',
ModelName=best_candidate_name,
TransformInput={
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': input_data
}
},
'ContentType': "text/csv",
'SplitType': 'Line'
},
TransformOutput={
'S3OutputPath': output_path,
'AssembleWith': 'Line',
},
TransformResources={
'InstanceType': 'ml.m5.2xlarge',
'InstanceCount': 1,
},
)
The create transform job request returns a response in the following format.

{'TransformJobArn': 'arn:aws:sagemaker:us-west-2:1234567890:transform-job/test-transform-job',
'ResponseMetadata': {'RequestId': '659f97fc-28c4-440b-b957-a49733f7c2f2',
'HTTPStatusCode': 200,
'HTTPHeaders': {'x-amzn-requestid': '659f97fc-28c4-440b-b957-a49733f7c2f2',
'content-type': 'application/x-amz-json-1.1',
'content-length': '96',
'date': 'Thu, 11 Aug 2022 22:23:49 GMT'},
'RetryAttempts': 0}}
1. Obtain the candidate definitions by using the describe-auto-ml-job and list-candidates-for-auto-ml-job commands shown earlier.
2. Create the model by using the create-model command with the container definitions from the previous step, as in the following partial example.
"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
"SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
"SAGEMAKER_INFERENCE_SUPPORTED": "predicted_label,probability,probabilities"
}
}, {
"Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:2.5-1-cpu-py3",
"ModelDataUrl": "s3://test-bucket/out/test-job1/data-processor-models/test-job1-dpp0-1-e569ff7ad77f4e55a7e549a/output/model.tar.gz",
"Environment": {
"AUTOML_TRANSFORM_MODE": "inverse-label-transform",
"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
"SAGEMAKER_INFERENCE_INPUT": "predicted_label",
"SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
"SAGEMAKER_INFERENCE_SUPPORTED":
"predicted_label,probability,labels,probabilities",
"SAGEMAKER_PROGRAM": "sagemaker_serve",
"SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code"
}
}]' \
--execution-role-arn 'arn:aws:iam::1234567890:role/sagemaker-execution-role' \
--region 'us-west-2'
3. Create the transform job by using the create-transform-job command shown earlier.
4. Check the progress of the transform job by using the describe-transform-job command shown earlier. The response has the following format.
{
"TransformJobName": "test-transform-job",
"TransformJobArn": "arn:aws:sagemaker:us-west-2:1234567890:transform-job/test-transform-job",
"TransformJobStatus": "InProgress",
"ModelName": "test-model",
"TransformInput": {
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "s3://test-bucket/data.csv"
}
},
"ContentType": "text/csv",
"CompressionType": "None",
"SplitType": "Line"
},
"TransformOutput": {
"S3OutputPath": "s3://test-bucket/output/",
"AssembleWith": "Line",
"KmsKeyId": ""
},
"TransformResources": {
"InstanceType": "ml.m5.2xlarge",
"InstanceCount": 1
},
"CreationTime": 1662495635.679,
"TransformStartTime": 1662495847.496,
"DataProcessing": {
"InputFilter": "$",
"OutputFilter": "$",
"JoinSource": "None"
}
}
After the TransformJobStatus changes to Completed, you can check the inference result in
the S3OutputPath.
Explainability
You can use explanations for auditing and meeting regulatory requirements, building trust in the model,
supporting human decision-making, and debugging and improving model performance.
For additional information on Shapley values and baselines, see Feature Attributions that Use Shapley
Values (p. 2094) and SHAP Baselines for Explainability (p. 2095).
For a guide to the Amazon SageMaker Clarify documentation, see Guide to the SageMaker Clarify
Documentation (p. 10).
Models generated
Prerequisites
Before you begin this procedure, you must have created and run an Autopilot experiment. For
instructions, see Create an Amazon SageMaker Autopilot experiment (p. 470).
To share the model in the Autopilot user interface using a button, see the following section View model
details. The Share Model button is discussed in Step 6.
For more information about how to share a model, see Bring Your Own Model Into Canvas.
• A plot of the aggregated SHAP values that indicate the importance of each feature. This helps explain your model's predictions.
• The summary statistics for various training and validation metrics, including the objective metric.
• A list of the hyperparameters used to train and tune the model.
To view model details after running an Autopilot job, follow these steps:
1.
Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker
Studio navigation menu.
2. Select the AutoML card from the main working area. This opens a new Autopilot tab.
3. In the Name section, select the Autopilot job that has the details that you want to examine. This
opens a new Autopilot job tab.
4. The Autopilot job panel lists the metric values including the Objective metric for each model
under Model name. The Best model is listed at the top of the list under Model name and is also
highlighted in the Models tab.
• To review model details, select the model that you are interested in and select View model
details. This opens a new Model Details tab.
5. The Model Details tab is divided into four subsections.
1. The top of the Explainability tab contains a plot of aggregated SHAP values that indicate the
importance of each feature. Following that are the metrics and hyperparameter values for this
model.
2. The Performance tab contains metric statistics and a confusion matrix.
3. The Artifacts tab contains information about model inputs, outputs, and intermediate results.
4. The Network tab summarizes your network isolation and encryption choices.
Note
Feature importance and the information in the Performance tab are only generated for the Best model.
For more information about how the SHAP values help explain predictions based on feature
importance, see the whitepaper Understanding the model explainability. Additional information
is also available in the Amazon SageMaker Clarify Model Explainability (p. 2093) topic in the
SageMaker Developer Guide.
6. To share your Autopilot model with another SageMaker Canvas user, choose Share Model. That
button is located at the top right of the Model Details tab.
• In the Add Canvas users section, use the down arrow to select a SageMaker Canvas user.
Model Performance Report
For example, in classification problems, the model quality report includes the following:
• Confusion matrix
• Area under the receiver operating characteristic curve (AUC)
• Information to understand false positives and false negatives
• Tradeoffs between true positives and false positives
• Tradeoffs between precision and recall
Autopilot also provides performance metrics for all of your candidate models. These metrics are
calculated using all of the training data and are used to estimate model performance. The main working
area includes these metrics by default. The type of metric is determined by the type of problem being
addressed.
Each performance metric is associated with a corresponding problem type.
You can sort your model candidates with the relevant metric to help you select and deploy the model
that addresses your business needs. For definitions of these metrics, see the Autopilot candidate metrics
topic.
1.
Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker
Studio navigation menu.
2. Select the AutoML card from the main working area. This opens a new Autopilot tab.
3. In the Name section, select the Autopilot job that has the details that you want to examine. This
opens a new Autopilot job tab.
4. The Autopilot job panel lists the metric values including the Objective metric for each model under
Model name. The Best model is listed at the top of the list under Model name and it is highlighted
in the Models tab.
• To review model details, select the model that you are interested in and select View model details. This opens a new Model Details tab.
5. Choose the Performance tab, located between the Explainability and Artifacts tabs.
a. On the top right section of the tab, select the down arrow on the Download Performance
Reports button.
b. The down arrow provides two options to view Autopilot performance metrics:
i. You can download a PDF of the performance report to view the metrics graphically.
ii. You can view metrics as raw data and download it as a JSON file.
For instructions on how to create and run an AutoML job in SageMaker Studio, see Create an Amazon
SageMaker Autopilot experiment (p. 470).
The performance report contains two sections. The first contains details about the Autopilot job that
produced the model. The second section contains a model quality report.
Metrics tables
The first part of the model quality report contains metrics tables. These are appropriate for the type of
problem that the model addressed.
The following image is an example of a metrics table that Autopilot generates for a regression problem.
It shows the metric name, value, and standard deviation.
The following image is an example of a metrics table generated by Autopilot for a multiclass
classification problem. It shows the metric name, value, and standard deviation.
The area under the receiver operating characteristic curve (AUC ROC curve)
The AUC ROC curve represents the trade-off between true positive and false positive rates. The AUC ROC
curve is an industry-standard accuracy metric used for binary classification models. AUC measures the ability of the model to predict a higher score for positive examples, as compared to negative examples. The
AUC metric provides an aggregated measure of the model performance across all possible classification
thresholds.
The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate that the machine
learning model is highly accurate. Values near 0.5 indicate that the model is performing no better than
guessing at random. AUC values close to 0 indicate that the model has learned the correct patterns, but
is making predictions that are as inaccurate as possible. Values near zero can indicate a problem with the
data. For more information about the AUC metric, see the Receiver operating characteristic article on
Wikipedia.
The following is an example of an AUC ROC curve graph used to evaluate predictions made by a binary classification model. The thin dashed line represents the ROC curve of a model that classifies no better than random guessing, with an AUC score of 0.5. The curves of more accurate
classification models lie above this random baseline, where the rate of true positives exceeds the rate of
false positives. The AUC ROC curve representing the performance of the binary classification model is
the thicker solid line.
The graph's components, the false positive rate (FPR) and the true positive rate (TPR), are defined as follows.
• Correct predictions
• True positive (TP): The predicted value is 1, and the true value is 1.
• True negative (TN): The predicted value is 0, and the true value is 0.
• Erroneous predictions
• False positive (FP): The predicted value is 1, but the true value is 0.
• False negative (FN): The predicted value is 0, but the true value is 1.
The false positive rate (FPR) measures the fraction of actual negatives that were falsely predicted as positives (FP), over the sum of FP and TN. The range is 0 to 1. A smaller value indicates better predictive accuracy.
• FPR = FP/(FP+TN)
The true positive rate (TPR) measures the fraction of actual positives that were correctly predicted as positives (TP), over the sum of TP and false negatives (FN). The range is 0 to 1. A larger value indicates better predictive accuracy.
• TPR = TP/(TP+FN)
Confusion matrix
A confusion matrix provides a way to visualize the accuracy of the predictions made by a model for binary and multiclass classification problems. The confusion matrix in the model quality report contains the following.
• The number and percentage of correct and incorrect predictions for the actual labels
• The number and percentage of accurate predictions on the diagonal from the upper-left to the lower-
right corner
• The number and percentage of inaccurate predictions on the diagonal from the upper-right to the
lower-left corner
The following screenshot is an example of a confusion matrix for a binary classification problem. It
contains the following information:
• The vertical axis is divided into two rows containing true and false actual labels.
• The horizontal axis is divided into two columns containing true and false labels that were predicted by
the model.
• The color bar assigns a darker tone to a larger number of samples to visually indicate the number of
values that were classified in each category.
In this example, the model correctly predicted 2817 actual false values and 353 actual true values. The model incorrectly predicted 130 actual true values to be false and 33 actual false values to be true. The difference in tone indicates that the dataset is not balanced. The imbalance is because there are many more actual false labels than actual true labels.
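These counts also give a concrete instance of the FPR and TPR formulas defined earlier; the following arithmetic is our own illustration using the values in this example.
• FPR = FP/(FP+TN) = 33/(33+2817) ≈ 0.012
• TPR = TP/(TP+FN) = 353/(353+130) ≈ 0.731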
The following screenshot is an example of a confusion matrix for a multi-class classification problem. The
confusion matrix in the model quality report contains the following.
• The vertical axis is divided into three rows containing three different actual labels.
• The horizontal axis is divided into three columns containing labels that were predicted by the model.
• The color bar assigns a darker tone to a larger number of samples to visually indicate the number of
values that were classified in each category.
In the example below, the model correctly predicted 354 actual values for label f, 1094 values for label i, and 852 values for label m. The difference in tone indicates that the dataset is not balanced, because there are many more labels for the value i than for f or m.
The confusion matrix in the model quality report can accommodate a maximum of 15 labels for multiclass classification problem types. If a row corresponding to a label shows a NaN value, it means that the validation dataset used to check model predictions does not contain data with that label.
Gain curve
In binary classification, a gain curve predicts the cumulative benefit of using a percentage of the dataset
to find a positive label. The gain value is calculated during training by dividing the cumulative number
of positive observations by the total number of positive observations in the data, at each decile. If the
classification model created during training is representative of the unseen data, you can use the gain
curve to predict the percentage of data that you must target to obtain a percentage of positive labels.
The greater the percentage of the dataset used, the higher the percentage of positive labels found.
In the following example graph, the gain curve is the line with changing slope. The straight line is the percentage of positive labels found by selecting a percentage of data from the dataset at random. Upon targeting 20% of the dataset, you would expect to find more than 40% of the positive labels. As an example, you might consider using a gain curve to determine your efforts in a marketing campaign. Using our gain curve example, to reach 83% of the people in a neighborhood who are likely to purchase cookies, you would send an advertisement to about 60% of the neighborhood.
Lift curve
In binary classification, the lift curve illustrates the uplift of using a trained model to predict the
likelihood of finding a positive label compared to a random guess. The lift value is calculated during
training using the ratio of percentage gain to the ratio of positive labels at each decile. If the model
created during training is representative of the unseen data, use the lift curve to predict the benefit of
using the model over randomly guessing.
In the following example graph, the lift curve is the line with changing slope. The straight line is the
lift curve associated with selecting the corresponding percentage randomly from the dataset. Upon
targeting 40% of the dataset with your model's classification labels, you would expect to find about 1.7
times the number of the positive labels that you would have found by randomly selecting 40% of the
unseen data.
Precision-recall curve
The precision-recall curve represents the tradeoff between precision and recall for binary classification
problems.
Precision measures the fraction of positive predictions that are actual positives (TP), out of all positive predictions (TP and false positives). The range is 0 to 1. A larger value indicates better accuracy in the predicted values.
• Precision = TP/(TP+FP)
Recall measures the fraction of actual positives that are correctly predicted as positive (TP), out of all actual positives (TP and false negatives). This is also known as sensitivity and as the true positive rate. The range is 0 to 1. A larger value indicates better detection of positive values from the sample.
• Recall = TP/(TP+FN)
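As a concrete illustration, using the counts from the earlier binary confusion matrix example (TP = 353, FP = 33, FN = 130), the following arithmetic is our own addition.
• Precision = 353/(353+33) ≈ 0.915
• Recall = 353/(353+130) ≈ 0.731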
The objective of a classification problem is to correctly label as many elements as possible. A system with
high recall but low precision returns a high percentage of false positives.
The following graphic depicts a spam filter that marks every email as spam. It has high recall, but low
precision, because recall doesn't measure false positives.
Give more weight to recall over precision if your problem has a low penalty for false positive values, but
a high penalty for missing a true positive result. For example, detecting an impending collision in a self-
driving vehicle.
By contrast, a system with high precision, but low recall, returns a high percentage of false negatives.
A spam filter that marks every email as desirable (not spam) has high precision but low recall because
precision doesn't measure false negatives.
If your problem has a low penalty for false negative values, but a high penalty for missing a true negative result, give more weight to precision over recall. For example, flagging a suspicious filing for a tax audit.
The following graphic depicts a spam filter that has high precision but low recall, because precision
doesn't measure false negatives.
A model that makes predictions with both high precision and high recall produces a high number of
correctly labeled results. For more information, see the Precision and recall article on Wikipedia.
For binary classification problems, Amazon SageMaker Autopilot includes a graph of the area under
the precision-recall curve (AUPRC). The AUPRC metric provides an aggregated measure of the model
performance across all possible classification thresholds and uses both precision and recall. AUPRC
does not take the number of true negatives into account. Therefore, it can be useful to evaluate model
performance in cases where there's a large number of true negatives in the data. For example, to model a
gene containing a rare mutation.
The following graphic is an example of an AUPRC graph. At the upper left of the graph, precision is at its highest value (1) and recall is 0. In the lower right corner of the graph, recall is at its highest value (1) and precision is 0. In between these two points, the AUPRC curve illustrates the tradeoff between precision and recall at different thresholds.
Actual against predicted plot
The actual against predicted plot shows the difference between actual and predicted model values. In
the following example graph, the solid line is a linear line of best fit. If the model were 100% accurate,
each predicted point would equal its corresponding actual point and lie on this line of best fit. The
distance away from the line of best fit is a visual indication of model error. The larger the distance away
from the line of best fit, the higher the model error.
Standardized residual plot
residual
A (raw) residual shows the difference between an actual value and the value predicted by your model. The larger
the difference, the larger the residual value.
standard deviation
The standard deviation is a measure of how values vary from an average value. A high standard
deviation indicates that many values are very different from their average value. A low standard
deviation indicates that many values are close to their average value.
standardized residual
A standardized residual divides the raw residuals by their standard deviation. Standardized residuals
have units of standard deviation and are useful in identifying outliers in data regardless of the
difference in scale of the raw residuals. If a standardized residual is much smaller or larger than the
other standardized residuals, it indicates that the model is not fitting these observations well.
The standardized residual plot measures the strength of the difference between observed and expected values. The predicted value is displayed on the x axis. A point with an absolute value larger than 3 is commonly regarded as an outlier.
The following example graph shows that a large number of standardized residuals are clustered around
0 on the horizontal axis. The values close to zero indicate that the model is fitting these points well. The
points towards the top and bottom of the plot are not predicted well by the model.
Residual histogram
residual
A (raw) residual shows the difference between an actual value and the value predicted by your model. The larger
the difference, the larger the residual value.
standard deviation
The standard deviation is a measure of how much values vary from an average value. A high
standard deviation indicates that many values are very different from their average value. A low
standard deviation indicates that many values are close to their average value.
standardized residual
A standardized residual divides the raw residuals by their standard deviation. Standardized residuals
have units of standard deviation. These are useful in identifying outliers in data regardless of the
difference in scale of the raw residuals. If a standardized residual is much smaller or larger than the
other standardized residuals, it would indicate that the model is not fitting these observations well.
histogram
The residual histogram shows the distribution of standardized residual values. A histogram distributed
in a bell shape and centered at zero indicates that the model does not systematically overpredict or
underpredict any particular range of target values.
In the following graphic, the standardized residual values indicate that the model is fitting the data well.
If the graph showed values far away from the center value, it would indicate that those values don't fit
the model well.
Notebooks generated
The AutoML job creates three notebook-based reports that describe the plan that Autopilot follows to
generate candidate models. A candidate model consists of a (pipeline, algorithm) pair. First, there’s a
data exploration notebook that describes what Autopilot learned about the data that you provided.
Second, there’s a candidate definition notebook, which uses the information about the data to generate
candidates. Third, there's a model insights report that can help detail the performance characteristics of the best model in the leaderboard of an Autopilot experiment.
You can run these notebooks in Amazon SageMaker, or locally if you have installed the Amazon SageMaker Python SDK. You can share the notebooks just like any other SageMaker Studio notebook. The notebooks are created for you to conduct experiments by editing them and rerunning them.
Modifications to the candidate definition notebook are encouraged as a learning tool. With this
capability, you learn how decisions made during the machine learning process impact your results.
Note
When you run the notebooks in your default instance, you incur baseline costs. However, when
you run HPO jobs from the candidate notebook, these jobs use additional compute resources
that incur additional costs.
Data exploration report
There are issues with customer-provided datasets that cannot be fixed automatically without the benefit
of some domain knowledge. Large outlier values in the target column for regression problems, for
example, may cause suboptimal predictions for the non-outlier values. Outliers may need to be removed
depending on the modeling objective. If a target column is included by accident as one of the input
features, the final model will validate well, but be of little value for future predictions.
To help customers discover these sorts of issues, Autopilot provides a data exploration report that
contains insights into potential issues with their data. The report also suggests how to handle the issues.
A data exploration notebook containing the report is generated for every Autopilot job. The report is stored in an Amazon S3 bucket and can be accessed from your output path.
The location of the data exploration notebook can be obtained from the Autopilot API using the
DescribeAutoMLJob operation response, which is stored in DataExplorationNotebookLocation.
When running Autopilot from SageMaker Studio, you can open the data exploration report using the
following steps:
1.
Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker
Studio navigation menu.
2. Select the AutoML card from the main working area. This opens a new Autopilot tab.
3. In the Name section, select the Autopilot job that has the data exploration notebook that you want
to examine. This opens a new Autopilot job tab.
4. Select Open data exploration notebook from the top right section of the Autopilot job tab.
The data exploration report is generated from your data before the training process begins. This allows
you to stop Autopilot jobs that might lead to meaningless results. Likewise, you can address any issues
or improvements with your dataset before rerunning Autopilot. This way, you can use your domain
expertise to improve the data quality manually, before you train a model on a better-curated dataset.
The data report contains only static markdown and can be opened in any Jupyter environment. The
notebook that contains the report can be converted to other formats, such as PDF or HTML. For more
information about conversions, see Using the nbconvert script to convert Jupyter notebooks to other
formats.
Topics
• Dataset Summary (p. 511)
• Target Analysis (p. 511)
• Data Sample (p. 513)
• Duplicate rows (p. 514)
• Cross column correlations (p. 514)
• Anomalous Rows (p. 515)
• Missing values, cardinality, and descriptive statistics (p. 516)
Dataset Summary
This Dataset Summary provides key statistics characterizing your dataset, including the number of rows, columns, percent of duplicate rows, and missing target values. It is intended to provide you with a quick alert when there are issues with your dataset that Amazon SageMaker Autopilot has detected and that are likely to require your intervention. The insights are surfaced as warnings that are classified as being of either “high” or “low” severity. The classification depends on the level of confidence that the issue will adversely impact the performance of the model.
The high and low severity insights appear in the summary as pop-ups. For most of the insights,
recommendations are offered for how to confirm that there is an issue with the dataset that requires
your attention. Proposals are also provided for how to resolve the issues.
Autopilot provides additional statistics about missing or not valid target values in your dataset to help
you detect other issues that may not be captured by high severity insights. An unexpected number of
columns of a particular type might indicate that some columns that you want to use may be missing
from the dataset. It could also indicate that there was an issue with how the data was prepared or stored.
Fixing these data problems brought to your attention by Autopilot can improve the performance of the
machine learning models trained on your data.
High severity insights are shown in the summary section and in other relevant sections in the report.
Examples of high and low severity insights are given in the relevant sections of the data report.
Target Analysis
Various high and low severity insights are shown in this section related to the distribution of values in the target column. Check that the target column contains the correct values. Incorrect values in the target column will likely result in a machine learning model that doesn't serve the intended business purpose.
Several data insights of high and low severity are present in this section. Here are several examples.
• Outlier target values - Skewed or unusual target distribution for regression, such as heavy tailed
targets.
• High or low target cardinality - Infrequent number of class labels or a large number of unique classes
for classification.
For both regression and classification problem types, not valid values, such as numeric infinity, NaN, or empty space in the target column, are surfaced. Depending on the problem type, different dataset statistics
are presented. A distribution of target column values for a regression problem allows you to verify if the
distribution is what you expected.
The following screenshot shows an Autopilot data report, which includes statistics such as the mean, median, minimum, and maximum, and the percentage of outliers in your dataset. The screenshot also includes a
histogram showing the distribution of labels in the target column. The histogram shows Target Column
Values on the horizontal axis and Count on the vertical axis. A box highlights the Outliers Percentage
section of the screenshot to indicate where this statistic appears.
Multiple statistics are shown regarding target values and their distribution. If any of the outliers,
not valid values, or missing percentages are greater than zero, these values are surfaced so you can
investigate why your data contains unusable target values. Some unusable target values are highlighted
as a low severity insight warning.
In the following screenshot, a ` symbol was added accidentally to the target column, which
prevented the numeric value of the target from being parsed. A Low severity insight: "Invalid
target values" warning appears. The warning in this example states "0.14% of the labels in the
target column could not be converted to numeric values. The most common non-numeric values are:
["-3.8e-05","-9-05","-4.7e-05","-1.4999999999999999e-05","-4.3e-05"]. That usually indicates that there
are problems with data collection or processing. Amazon SageMaker Autopilot ignores all observations
with invalid target label."
Autopilot also provides a histogram showing the distribution of labels for classification.
The following screenshot shows an example of statistics given for your target column including the
number of classes, missing or not valid values. A histogram with Target Label on the horizontal axis and
Frequency on the vertical axis shows the distribution of each label category.
Note
You can find definitions of all the terms presented in this and other sections in the Definitions section at the bottom of the report notebook.
Data Sample
Autopilot presents an actual sample of your data to help you spot issues with your dataset. The sample
table scrolls horizontally. Inspect the sample data to verify that all the necessary columns are present in
the dataset.
Autopilot also calculates a measure of prediction power, which can be used to identify a linear or nonlinear relationship between a feature and the target variable. A value of 0 indicates that the feature has no predictive value in predicting the target variable. A value of 1 indicates the highest predictive power for
the target variable. For more information on predictive power, see the Definitions section.
Note
It is not recommended that you use prediction power as a substitute for feature importance.
Only use it if you're certain that prediction power is an appropriate measure for your use case.
The following screenshot shows an example data sample. The top row contains the prediction power of
each column in your dataset. The second row contains the column data type. Subsequent rows contain
the labels. The columns contain the target column followed by each feature column. Each feature
column has an associated prediction power, highlighted in this screenshot with a box. In this example,
the column containing the feature x51 has a predictive power of 0.68 for the target variable y. The
feature x55 is slightly less predictive with a prediction power of 0.59.
Duplicate rows
If duplicate rows are present in the dataset, Amazon SageMaker Autopilot displays a sample of them.
Note
It is not recommended to balance a dataset by up-sampling before providing it to Autopilot.
This may result in inaccurate validation scores for the models trained by Autopilot, and the
models that are produced may be unusable.
Cross column correlations
You can use the information in the correlation matrix to remove highly correlated features. A smaller
number of features reduces chances of overfitting a model and can reduce the costs of production in
two ways. It lessens the Autopilot runtime needed and, for some applications, can make data collection
procedures cheaper.
The following screenshot shows an example of a correlation matrix between 7 features. Each feature
is displayed in a matrix on both the horizontal and vertical axes. The Pearson's correlation coefficient is
displayed at the intersection between two features. Each feature intersection has a color tone associated
with it. The higher the correlation, the darker the tone. The darkest tones occupy the diagonal of the
matrix, where each feature is correlated with itself, representing perfect correlation.
Anomalous Rows
Amazon SageMaker Autopilot detects which rows in your dataset might be anomalous. It then assigns an
anomaly score to each row. Rows with negative anomaly scores are considered anomalous.
The following screenshot shows the output from an Autopilot analysis for rows containing anomalies. A
column containing an anomalous score appears next to the dataset columns for each row.
Missing values, cardinality, and descriptive statistics
Autopilot calculates several statistics on the categorical values in columns that contain them. These
include the number of unique entries and, for text, the number of unique words.
Autopilot calculates several standard statistics on the numerical values in columns that contain them.
The following image depicts these statistics, including the mean, median, minimum and maximum
values, and the percentages of numerical types and of outlier values.
Candidate definition notebook
You can choose which candidate to train and tune in two ways: by running sections of the notebook, or by running the entire notebook to optimize all candidates and identify the best one. If you run the entire notebook, only the best candidate is displayed after job completion.
To run Autopilot from SageMaker Studio, open the candidate definition notebook by following these
steps:
1.
Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker
Studio navigation menu.
2. Select the AutoML card from the main working area. This opens a new Autopilot tab.
3. In the Name section, select the Autopilot job that has the candidate definition notebook that you
want to examine. This opens a new Autopilot job tab.
4. Choose Open candidate generation notebook from the top right section of the Autopilot job tab.
This opens a new read-only preview of the Amazon SageMaker Autopilot Candidate Definition
Notebook.
1. Choose Import notebook at the top right of the Amazon SageMaker Autopilot Candidate
Definition Notebook tab. This opens a tab to set up a new notebook environment to run the
notebook.
2. Select an existing SageMaker Image or use a Custom Image.
Configure inference output
You can list inference container definitions with the ListCandidatesForAutoMLJob API.
The list of inference container definitions that represent the best candidate is also available in the
DescribeAutoMLJob response.
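For example, the following minimal sketch uses the AWS SDK for Python (Boto3) to list the candidates for a job and read the inference containers of the first one; the job name is a placeholder.

import boto3

sm_client = boto3.client('sagemaker')

# list the candidates for an Autopilot job; the job name is a placeholder
candidates = sm_client.list_candidates_for_auto_ml_job(
    AutoMLJobName='test-automl-job')['Candidates']

# inference container definitions for the first candidate in the list
inference_containers = candidates[0]['InferenceContainers']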
Inference responses
• predicted_label: The label with the highest probability of predicting the correct label, as
determined by Autopilot.
• probability:
• HPO models: The probability of the True class for binary classification. The probability of the
predicted_label for multiclass classification.
• Ensemble models: The probability of the predicted_label for binary and multiclass
classification.
• probabilities: The list of probabilities for all corresponding classes.
• labels: The list of all labels.
For example, for a binary classification problem, if you pass the inference response keys
['predicted_label', 'probability', 'probabilities', 'labels'] and the output
response appears as [1, 0.1, "[0.9, 0.1]", "['1', '0']"], you should interpret it as follows:
1. predicted_label equals 1 because label "1" has a higher probability (0.9 in this case).
2. For HPO models, probability equals 0.1, which is the probability of the positive_class (0 in this case) selected by Autopilot.
For Ensemble models, probability equals 0.9, which is the probability of the predicted_label.
3. probabilities lists the probability of each label in labels.
4. labels are the unique labels in the dataset, where the second label ("0" in this case) is the
positive_class selected by Autopilot.
By default, inference containers are configured to generate only the predicted_label. To select additional inference content, you can update the inference_response_keys parameter in the SageMaker Python SDK, or the following three container environment variables:
• SAGEMAKER_INFERENCE_SUPPORTED: This is set to provide hints to you about what content each
container supports.
• SAGEMAKER_INFERENCE_INPUT: This should be set to the keys that the container expects in input
payload.
• SAGEMAKER_INFERENCE_OUTPUT: This should be populated with the set of keys that the container
outputs.
To choose the inference response content in HPO mode: Add the SAGEMAKER_INFERENCE_INPUT and
SAGEMAKER_INFERENCE_OUTPUT variables to the second and third containers that are generated in
HPO mode for classification problems.
The keys supported by the second container (algorithm) are predicted_label, probability, and
probabilities. Note that labels is deliberately not added to SAGEMAKER_INFERENCE_SUPPORTED.
The keys supported by the third classification model container are predicted_label, labels,
probability, and probabilities. Therefore, the SAGEMAKER_INFERENCE_SUPPORTED environment
includes the names of these keys.
To update the definition of the inference containers to receive predicted_label and probability,
use the following code example.
containers[1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,
probability'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label,
probability'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,
probability'})
The following code example updates the definition of the inference containers to receive predicted_label, probabilities, and labels. Do not pass labels to the second container (the algorithm container), because labels is generated by the third container independently.
containers[1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT':
'predicted_label,probabilities'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT':
'predicted_label,probabilities'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,
probabilities,labels'})
The following collapsible sections provide code examples for AWS SDK for Python (Boto3) and for
SageMaker SDK for Python. Each section shows how to select the content of the inference responses in
HPO mode for the respective code example.
import boto3

sm_client = boto3.client('sagemaker')

# obtain the containers of the best candidate; the job name and the role
# are placeholders
best_candidate = sm_client.describe_auto_ml_job(AutoMLJobName='test-automl-job')['BestCandidate']
best_candidate_containers = best_candidate['InferenceContainers']
role = 'arn:aws:iam::444455556666:role/sagemaker-execution-role'

best_candidate_containers[1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})
best_candidate_containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label, probability'})
best_candidate_containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})

# create model
response = sm_client.create_model(
ModelName = '<Model Name>',
ExecutionRoleArn = role,
Containers = best_candidate_containers
)
# create a transform job with the model; the job name and input location
# are placeholders
input_data = 's3://test-bucket/data.csv'
output_path = 's3://test-bucket/output/'
response = sm_client.create_transform_job(
TransformJobName='test-transform-job',
ModelName='<Model Name>',
TransformInput={
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': input_data
}
},
'ContentType': 'text/csv',
'SplitType': 'Line'
},
TransformOutput={
'S3OutputPath': output_path,
'AssembleWith': 'Line',
},
TransformResources={
'InstanceType': 'ml.m4.xlarge',
'InstanceCount': 1,
},
)
The following HPO code example uses the SageMaker SDK for Python; 'aml_best_model' is assumed to be the model created from the best candidate, for example with automl.create_model(...).

aml_transformer = aml_best_model.transformer(accept='text/csv',
assemble_with='Line',
instance_type='ml.m5.xlarge',
instance_count=1,)
In ensembling mode, to choose the content of the inference response, update the
SAGEMAKER_INFERENCE_OUTPUT environment variable.
To update the inference container definition to receive predicted_label and probability, refer to
the following code example.
containers[0]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,
probability'})
The following collapsible section provides a code example for selecting the content of the inference
responses in ensembling mode. The example uses AWS SDK for Python (Boto3).
import boto3

sm_client = boto3.client('sagemaker')

# obtain the single inference container of the best candidate (ensembling
# mode); the job name and the role are placeholders
best_candidate = sm_client.describe_auto_ml_job(AutoMLJobName='test-automl-job')['BestCandidate']
best_candidate_containers = best_candidate['InferenceContainers']
role = 'arn:aws:iam::444455556666:role/sagemaker-execution-role'
best_candidate_containers[0]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})
# create model
response = sm_client.create_model(
ModelName = '<Model Name>',
ExecutionRoleArn = role,
Containers = best_candidate_containers
)
The following collapsible section provides a code example that is identical to the SageMaker SDK for
Python example for HPO. It is included for your convenience.
The following HPO code example uses SageMaker SDK for Python.
aml_transformer = aml_best_model.transformer(accept='text/csv',
assemble_with='Line',
instance_type='ml.m5.xlarge',
instance_count=1,)
Quotas
Topics
• Quotas that you can increase (p. 522)
• Resource quotas (p. 523)
Quotas that you can increase
Resource limits
Note
*This 2 GB size limit is for a single compressed Parquet file. You can provide a Parquet dataset
that includes multiple compressed Parquet files. After the files are decompressed, they may
each expand to a larger size.
**Autopilot automatically subsamples input datasets that are larger than the target dataset size
while accounting for class imbalance and preserving rare class labels.
1. Open the AWS Support Center page, sign in if necessary, and then choose Create case.
2. On the Create case page, choose Service limit increase.
3. In the Case details panel, select SageMaker AutoML for the Limit Type.
4. On the Requests panel for Request 1, select the Region, the resource Limit to increase, and the New
Limit value that you are requesting. If you have additional requests for quota increases, select Add
another request.
Resource quotas
The following table contains the runtime resource limits for an Amazon SageMaker Autopilot job in an
AWS Region.
API reference
For information on the full set of SageMaker REST APIs and the available SDKs, see API and SDK Reference. If your language of choice is Python, you can refer directly to the Amazon SageMaker Python SDK or the AWS SDK for Python (Boto3).
Actions
This list details the operations available in the API reference to manage AutoML jobs programmatically.
• CreateAutoMLJob
• DescribeAutoMLJob
• ListAutoMLJobs
• ListCandidatesForAutoMLJob
• StopAutoMLJob
Data Types
This list details the AutoML API objects used by the preceding actions as inbound requests or outbound responses.
• AutoMLAlgorithmConfig
• AutoMLCandidate
• AutoMLCandidateGenerationConfig
• AutoMLCandidateStep
• AutoMLChannel
• AutoMLContainerDefinition
• AutoMLDataSource
• AutoMLDataSplitConfig
• AutoMLJobArtifacts
• AutoMLJobCompletionCriteria
• AutoMLJobConfig
• AutoMLJobObjective
• AutoMLJobStepMetadata
• AutoMLJobSummary
• AutoMLOutputDataConfig
• AutoMLPartialFailureReason
• AutoMLS3DataSource
• AutoMLSecurityConfig
• CandidateArtifactLocations
• CandidateProperties
• FinalAutoMLJobObjectiveMetric
• MetricDatum
• ModelDeployConfig
• ModelDeployResult
• ResolvedAttributes
• TuningJobCompletionCriteria
Label Data
To train a machine learning model, you need a large, high-quality, labeled dataset. You can label your
data using Amazon SageMaker Ground Truth. Choose from one of the Ground Truth built-in task types or
create your own custom labeling workflow. To improve the accuracy of your data labels and reduce the
total cost of labeling your data, use Ground Truth enhanced data labeling features like automated data
labeling and annotation consolidation.
Topics
• Use Amazon SageMaker Ground Truth to Label Data (p. 526)
• Use Amazon SageMaker Ground Truth Plus to Label Data (p. 844)
• Use Amazon SageMaker Ground Truth Synthetic Data to Generate and Label Data (p. 855)
• Create and Manage Workforces (p. 863)
• Crowd HTML Elements Reference (p. 889)
Depending on your ML application, you can choose from one of the Ground Truth built-in task types
to have workers generate specific types of labels for your data. You can also build a custom labeling
workflow to provide your own UI and tools to workers labeling your data. To learn more about the
Ground Truth built in task types, see Built-in Task Types (p. 704). To learn how to create a custom
labeling workflow, see Creating Custom Labeling Workflows (p. 671).
In order to automate labeling your training dataset, you can optionally use automated data labeling,
a Ground Truth process that uses machine learning to decide which data needs to be labeled by
humans. Automated data labeling may reduce the labeling time and manual effort required. For more
information, see Automate Data Labeling (p. 807).
Use either pre-built or custom tools to assign the labeling tasks for your training dataset. A labeling UI
template is a webpage that Ground Truth uses to present tasks and instructions to your workers. The
SageMaker console provides built-in templates for labeling data. You can use these templates to get
started, or you can build your own tasks and instructions by using our HTML 2.0 components. For more
information, see Creating Custom Labeling Workflows (p. 671).
Use the workforce of your choice to label your dataset. You can choose your workforce from:
• The Amazon Mechanical Turk workforce of over 500,000 independent contractors worldwide.
• A private workforce that you create from your employees or contractors for handling data within your
organization.
• A vendor company that you can find in the AWS Marketplace that specializes in data labeling services.
For more information, see Create and Manage Workforces (p. 863).
You store your datasets in Amazon S3 buckets. The buckets contain three things: The data to be labeled,
an input manifest file that Ground Truth uses to read the data files, and an output manifest file. The
output file contains the results of the labeling job. For more information, see Use Input and Output
Data (p. 734).
Events from your labeling jobs appear in Amazon CloudWatch under the /aws/sagemaker/
LabelingJobs group. CloudWatch uses the labeling job name as the name for the log stream.
Are You a First-time User of Ground Truth?
1. Read Getting started (p. 527)—This section walks you through setting up your first Ground Truth
labeling job.
2. Explore other topics—Depending on your needs, do the following:
• Explore built-in task types— Use built-in task types to streamline the process of creating a labeling
job. See Built-in Task Types (p. 704) to learn more about Ground Truth built-in task types.
• Manage your labeling workforce—Create new work teams and manage your existing workforce.
For more information, see Create and Manage Workforces (p. 863).
• Learn about streaming labeling jobs— Create a streaming labeling job and send new dataset
objects to workers in real time using a perpetually running labeling job. Workers continuously
receive new data objects to label as long as the labeling job is active and new objects are being sent
to it. To learn more, see Ground Truth Streaming Labeling Jobs (p. 738).
3. See the Reference—This section describes operations to automate Ground Truth operations.
Getting started
This video shows you how to setup and use Amazon SageMaker Ground Truth. (Length: 9:37)
To get started using Amazon SageMaker Ground Truth, follow the instructions in the following sections.
The sections here explain how to use the console to create a labeling job, assign a public or private
workforce, and send the labeling job to your workforce. You can also learn how to monitor the progress
of a labeling job.
If you want to create a custom labeling workflow, see Creating Custom Labeling Workflows (p. 671) for
instructions.
Before you create a labeling job, you must upload your dataset to an Amazon S3 bucket. For more
information, see Use Input and Output Data (p. 734).
Topics
• Step 1: Before You Begin (p. 528)
• Step 2: Create a Labeling Job (p. 528)
• Step 3: Select Workers (p. 529)
• Step 4: Configure the Bounding Box Tool (p. 531)
Step 1: Before You Begin
1. Save two images at publicly available HTTP URLs. The images are used when creating instructions
for completing a labeling task. The images should have an aspect ratio of around 2:1. For this
exercise, the content of the images is not important.
2. Create an Amazon S3 bucket to hold the input and output files. The bucket must be in the same
Region where you are running Ground Truth. Make a note of the bucket name because you use it
during step 2.
Ground Truth requires that all S3 buckets that contain labeling job input image data have a CORS
policy attached. To learn more about this change, see CORS Permission Requirement (p. 816).
3. Create an IAM role, or let SageMaker create one for you, with the AmazonSageMakerFullAccess IAM
policy attached. For instructions, refer to Creating IAM roles. Then assign the following permissions
policy to the user who is creating the labeling job (a Boto3 sketch of steps 2 and 3 follows this
procedure):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "sagemakergroundtruth",
            "Effect": "Allow",
            "Action": [
                "cognito-idp:CreateGroup",
                "cognito-idp:CreateUserPool",
                "cognito-idp:CreateUserPoolDomain",
                "cognito-idp:AdminCreateUser",
                "cognito-idp:CreateUserPoolClient",
                "cognito-idp:AdminAddUserToGroup",
                "cognito-idp:DescribeUserPoolClient",
                "cognito-idp:DescribeUserPool",
                "cognito-idp:UpdateUserPool"
            ],
            "Resource": "*"
        }
    ]
}
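The following Boto3 sketch shows one way to perform steps 2 and 3 programmatically. The bucket name,
user name, and policy file path are placeholders, and the CORS rule shown is a permissive example; see
CORS Permission Requirement (p. 816) for the policy your task type requires.
import boto3

s3 = boto3.client('s3')
iam = boto3.client('iam')

# Step 2: attach a CORS policy so the worker UI can fetch the input images.
s3.put_bucket_cors(
    Bucket='example-ground-truth-bucket',
    CORSConfiguration={
        'CORSRules': [{'AllowedMethods': ['GET'], 'AllowedOrigins': ['*']}]
    }
)

# Step 3: attach the Amazon Cognito permissions policy shown above to the user
# creating the labeling job.
with open('sagemaker-groundtruth-policy.json') as f:
    policy_document = f.read()

iam.put_user_policy(
    UserName='example-labeling-user',
    PolicyName='sagemakergroundtruth',
    PolicyDocument=policy_document
)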
Next
Step 2: Create a Labeling Job (p. 528)
Step 2: Create a Labeling Job
• Job name – Give the labeling job a name that describes the job. This name is shown in your job
list. The name must be unique in your account in an AWS Region.
• Label attribute name – Leave this unchecked as the default value is the best option for this
introductory job.
• Input data setup – Select Automated data setup. This option allows you to automatically connect
to your input data in S3.
• S3 location for input datasets – Enter the S3 location where you added the images in step 1.
• S3 location for output datasets – The location where your output data is written in S3.
• Data type – Use the drop-down menu to select Image. Ground Truth will use all images found in
the S3 location for input datasets as input for your labeling job.
• IAM role – Create or choose an IAM role with the AmazonSageMakerFullAccess IAM policy
attached.
5. In the Task type section, for the Task category field, choose Image.
6. In the Task selection section, choose Bounding box.
7. Choose Next to move on to configuring your labeling job.
Next
Step 3: Select Workers (p. 529)
Step 3: Select Workers
If you add yourself to the private workforce, you will receive an email that looks similar to the following.
Amazon, Inc. is replaced by the organization you enter in step 3 of the preceding procedure. Select the
link in the email to log in using the temporary password provided. If prompted, change your password.
When you successfully log in, you see the worker portal where your labeling tasks appear.
Tip
You can find the link to your private workforce's worker portal in the Labeling workforces
section of the Ground Truth area of the SageMaker console. To see the link, select the Private
tab. The link is under the Labeling portal sign-in URL header in Private workforce summary.
If you choose to use the Amazon Mechanical Turk workforce to label the dataset, you are charged for
labeling tasks completed on the dataset.
You understand and agree that the Amazon Mechanical Turk workforce consists of independent
contractors located worldwide and that you should not share confidential information, personal
information or protected health information with this workforce.
Next
Step 4: Configure the Bounding Box Tool (p. 531)
Step 4: Configure the Bounding Box Tool
1. In the Task description field, type brief instructions for the task. For example:
Replace objects with the name of an object that appears in your images.
2. In the Labels field, type a category name for the objects that the worker should draw a bounding
box around. For example, if you are asking the worker to draw boxes around football players, you
could use "Football Player" in this field.
3. The Short instructions section enables you to create instructions that are displayed on the page
with the image that your workers are labeling. We suggest that you include an example of a
correctly drawn bounding box and an example of an incorrectly drawn box. To create your own
instructions, use these steps:
a. Select the text between GOOD EXAMPLE and the image placeholder. Replace it with the
following text:
Don't make the bounding box too large or cut into the object.
e. Select the second image placeholder and delete it.
f. Choose the image button and then enter the HTTPS URL of the other image that you created in
step 1.
4. Select Preview to preview the worker UI. The preview opens in a new tab, so if your browser
blocks pop-ups, you may need to allow the tab to open. When you add one or more
annotations to the preview and then select Submit, you can see a preview of the output data that
your annotations would create.
5. After you have configured and verified your instructions, select Create to create the labeling job.
If you used a private workforce, you can navigate to the worker portal that you logged into in Step 3:
Select Workers (p. 529) of this tutorial to see your labeling tasks. The tasks may take a few minutes to
appear.
Next
Step 5: Monitoring Your Labeling Job (p. 532)
Step 5: Monitoring Your Labeling Job
• Name – The name that you assigned the job when you created it.
• Status – The completion status of the job. The status can be one of Complete, Failed, In progress, or
Stopped.
• Labeled objects/total – Shows the total number of objects in the labeling job and how many of them
have been labeled.
• Creation time – The date and time that you created the job.
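You can read the same fields programmatically with the DescribeLabelingJob operation. A minimal
Boto3 sketch, with a placeholder job name:
import boto3

sagemaker = boto3.client('sagemaker')

response = sagemaker.describe_labeling_job(LabelingJobName='example-labeling-job')
counters = response['LabelCounters']

print(response['LabelingJobStatus'])   # For example, InProgress
print(counters['TotalLabeled'], 'labeled,', counters['Unlabeled'], 'remaining')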
You can also clone, chain, or stop a job. Select a job and then select one of the following from the
Actions menu:
• Clone – Creates a new labeling job with the configuration copied from the selected job. You can clone
a job when you want to make changes to the job and run it again. For example, you can clone a job that was
sent to a private workforce so that you can send it to the Amazon Mechanical Turk workforce. Or you
can clone a job to rerun it against a new dataset stored in the same location as the original job.
• Chain – Creates a new labeling job that can build upon the data and models (if any) of a stopped,
failed, or completed job. For more information about the use cases and how to use it, see Chaining
Labeling Jobs (p. 813).
• Stop – Stops a running job. You cannot restart a stopped job. You can clone a job to start over or chain
the job to continue from where it left off. Labels for any already labeled objects are written to the
output file location. For more information, see Output Data (p. 776).
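These console actions map to API operations. For example, a minimal Boto3 sketch of stopping a job,
with a placeholder job name:
import boto3

sagemaker = boto3.client('sagemaker')

# Stopping is permanent; clone or chain the job to continue later.
sagemaker.stop_labeling_job(LabelingJobName='example-labeling-job')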
Label Images
Use Ground Truth to label images. Select one of the following built-in task types to learn more about
that task type. Each page includes instructions to help you create a labeling job using that task type.
Tip
To learn more about supported file types and input data quotas, see Input Data (p. 734).
Topics
• Bounding Box (p. 532)
• Image Semantic Segmentation (p. 538)
• Auto-Segmentation Tool (p. 541)
• Image Classification (Single Label) (p. 545)
• Image Classification (Multi-label) (p. 547)
• Image Label Verification (p. 551)
Bounding Box
The images used to train a machine learning model often contain more than one object. To classify and
localize one or more objects within images, use the Amazon SageMaker Ground Truth bounding box
labeling job task type. In this context, localization means the pixel-location of the bounding box.
You create a bounding box labeling job using the Ground Truth section of the Amazon SageMaker
console or the CreateLabelingJob operation.
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and up to 50
labels that workers can choose from.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Pre-annotation Lambda functions for this task type end with PRE-BoundingBox. To find the pre-
annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-BoundingBox. To find
the annotation-consolidation Lambda ARN for your Region, see AnnotationConsolidationLambdaArn.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.
response = client.create_labeling_job(
LabelingJobName='example-bounding-box-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
BoundingBox',
'TaskKeywords': [
'Bounding Box',
],
'TaskTitle': 'Bounding Box task',
'TaskDescription': 'Draw bounding boxes around objects in an image',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-BoundingBox'
}
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header. Upload this template to S3, and provide the S3 URI for this file in
UiTemplateS3Uri.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-bounding-box
name="boundingBox"
src="{{ task.input.taskObject | grant_read_access }}"
header="please draw box"
labels="{{ task.input.labels | to_json | escape }}"
>
<short-instructions>
<h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
<p>Enter description of a correct bounding box label and add images</p>
<h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3>
<p>Enter description of an incorrect bounding box label and add images</p>
</short-instructions>
</crowd-bounding-box>
</crowd-form>
For example, the output manifest file of a successfully completed single-class bounding box task will
contain the following:
[
    {
        "boundingBox": {
            "boundingBoxes": [
                {
                    "height": 2832,
                    "label": "bird",
                    "left": 681,
                    "top": 599,
                    "width": 1364
                }
            ],
            "inputImageProperties": {
                "height": 3726,
                "width": 2662
            }
        }
    }
]
The boundingBoxes parameter identifies the location of the bounding box drawn around an object
identified as a "bird" relative to the top-left corner of the image, which is taken to be the (0,0) pixel
coordinate. In the previous example, left and top identify the location of the pixel in the top-left
corner of the bounding box relative to the top-left corner of the image. The dimensions of the bounding
box are identified with height and width. The inputImageProperties parameter gives the pixel
dimensions of the original input image.
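To make the coordinate system concrete, the following Python sketch converts the box above to corner
coordinates and checks them against the image dimensions:
box = {"height": 2832, "label": "bird", "left": 681, "top": 599, "width": 1364}
image = {"height": 3726, "width": 2662}

# (left, top) is the box's top-left corner, measured from the image's (0, 0) pixel.
x_min, y_min = box["left"], box["top"]
x_max = box["left"] + box["width"]    # right edge
y_max = box["top"] + box["height"]    # bottom edge

assert x_max <= image["width"] and y_max <= image["height"]
print(box["label"], (x_min, y_min), (x_max, y_max))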
When you use the bounding box task type, you can create single- and multi-class bounding box labeling
jobs. The output manifest file of a successfully completed multi-class bounding box job will contain the
following:
[
    {
        "boundingBox": {
            "boundingBoxes": [
                {
                    "height": 938,
                    "label": "squirrel",
                    "left": 316,
                    "top": 218,
                    "width": 785
                },
                {
                    "height": 825,
                    "label": "rabbit",
                    "left": 1930,
                    "top": 2265,
                    "width": 540
                },
                {
                    "height": 1174,
                    "label": "bird",
                    "left": 748,
                    "top": 2113,
                    "width": 927
                },
                {
                    "height": 893,
                    "label": "bird",
                    "left": 1333,
                    "top": 847,
                    "width": 736
                }
            ],
            "inputImageProperties": {
                "height": 3726,
                "width": 2662
            }
        }
    }
]
To learn more about the output manifest file that results from a bounding box labeling job, see
Bounding Box Job Output (p. 782).
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
Image Semantic Segmentation
Images that contain large numbers of objects that need to be segmented require more time. To help
workers (from a private or vendor workforce) label these objects in less time and with greater accuracy,
Ground Truth provides an AI-assisted auto-segmentation tool. For information, see Auto-Segmentation
Tool (p. 541).
You create a semantic segmentation labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Pre-annotation Lambda functions for this task type end with PRE-SemanticSegmentation. To find
the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
SemanticSegmentation. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.
response = client.create_labeling_job(
LabelingJobName='example-semantic-segmentation-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
SemanticSegmentation',
'TaskKeywords': [
'Semantic Segmentation',
],
'TaskTitle': 'Semantic segmentation task',
'TaskDescription': 'For each category provided, segment out each relevant object
using the color associated with that category',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-SemanticSegmentation'
        }
    },
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.
Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-semantic-segmentation
name="crowd-semantic-segmentation"
src="{{ task.input.taskObject | grant_read_access }}"
header="Please segment out all pedestrians."
labels="{{ task.input.labels | to_json | escape }}"
>
<full-instructions header="Segmentation instructions">
<ol><li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to understand
more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits an object and paint
that object using the tools provided.</li></ol>
</full-instructions>
<short-instructions>
<h2><span style="color: rgb(0, 138, 0);">Good example</span></h2>
<p>Enter description to explain a correctly done segmentation</p>
<p><br></p><h2><span style="color: rgb(230, 0, 0);">Bad example</span></h2>
<p>Enter description of an incorrectly done segmentation</p>
</short-instructions>
</crowd-semantic-segmentation>
</crowd-form>
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
To see an example of an output manifest file for a semantic segmentation labeling job, see 3D Point
Cloud Semantic Segmentation Output (p. 792).
Auto-Segmentation Tool
Image segmentation is the process of dividing an image into multiple segments, or sets of labeled
pixels. In Amazon SageMaker Ground Truth, the process of identifying all pixels that fall under a given
label involves applying a colored filler, or "mask", over those pixels. Some labeling job tasks contain
images with a large number of objects that need to be segmented. To help workers label these
objects in less time and with greater accuracy, Ground Truth provides an auto-segmentation tool for
segmentation tasks assigned to private and vendor workforces. This tool uses a machine learning model
to automatically segment individual objects in the image with minimal worker input. Workers can refine
the mask generated by the auto-segmentation tool using other tools found in the worker console. This
helps workers complete image segmentation tasks faster and more accurately, resulting in lower cost
and higher label quality.
Note
The auto-segmentation tool is available for segmentation tasks that are sent to a private
workforce or vendor workforce. It isn't available for tasks sent to the public workforce (Amazon
Mechanical Turk).
Tool Preview
When workers are assigned a labeling job that provides the auto-segmentation tool, they are provided
with detailed instructions in the worker console on how to use the tool.
Workers can use View full instructions to learn how to use the tool. Workers need to place a point
on the four extreme points (top-most, bottom-most, left-most, and right-most points) of the object of
interest, and the tool automatically generates a mask for the object. Workers can further refine the
mask using the other tools provided, or by using the auto-segment tool on smaller portions of the object
that were missed.
Tool Availability
The auto-segmentation tool automatically appears in your workers' consoles if you create a semantic
segmentation labeling job using the Amazon SageMaker console. While creating a semantic
segmentation job in the SageMaker console, you will be able to preview the tool while creating worker
instructions. To learn how to create a semantic segmentation labeling job in the SageMaker console, see
Getting started (p. 527).
If you are creating a custom instance segmentation labeling job in the SageMaker console or creating
an instance- or semantic-segmentation labeling job using the Ground Truth API, you need to create a
custom task template to design your worker console and instructions. To include the auto-segmentation
tool in your worker console, ensure that the following conditions are met in your custom task template:
• For semantic segmentation labeling jobs created using the API, the <crowd-semantic-
segmentation> tag is present in the task template. For custom instance segmentation labeling jobs, the
<crowd-instance-segmentation> tag is present in the task template.
• The task is assigned to a private workforce or vendor workforce.
• The images to be labeled are Amazon Simple Storage Service (Amazon S3) objects that have been
pre-signed for workers so that they can access them. This is true if the task template includes the
grant_read_access filter. For information about the grant_read_access filter, see Adding
automation with Liquid (p. 675).
The following is an example of a custom task template for a custom instance segmentation labeling job,
which includes the <crowd-instance-segmentation/> tag and the grant_read_access Liquid
filter.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-instance-segmentation
name="crowd-instance-segmentation"
src="{{ task.input.taskObject | grant_read_access }}"
    labels="['Car','Road']"
  >
  <full-instructions header="Segmentation instructions">
    Segment each instance of each class of objects in the image.
  </full-instructions>
  <short-instructions>
    <p>Segment each instance of each class of objects in the image.</p>
  </short-instructions>
</crowd-instance-segmentation>
</crowd-form>
Image Classification (Single Label)
You can create an image classification labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Pre-annotation Lambda functions for this task type end with PRE-ImageMultiClass. To find the
pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
ImageMultiClass. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.
response = client.create_labeling_job(
LabelingJobName='example-image-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
ImageMultiClass',
        'TaskKeywords': [
            'Image classification',
        ],
        'TaskTitle': 'Image classification task',
'TaskDescription': 'Carefully inspect the image and classify it by selecting one
label from the categories provided.',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-ImageMultiClass'
        }
    },
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.
Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
name="crowd-image-classifier"
src="{{ task.input.taskObject | grant_read_access }}"
header="please classify"
categories="{{ task.input.labels | to_json | escape }}"
>
<full-instructions header="Image classification instructions">
<ol><li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to understand
more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits the image.</li></
ol>
</full-instructions>
<short-instructions>
<h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
<p>Enter description to explain the correct label to the workers</p>
<h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3><p>Enter description
of an incorrect label</p>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
To see an example of an output manifest file from an image classification labeling job, see Classification
Job Output (p. 780).
Image Classification (Multi-label)
When working on a multi-label image classification task, workers should choose all applicable labels,
but must choose at least one. When creating a job using this task type, you can provide up to 50 label
categories.
When creating a labeling job in the console, Ground Truth doesn't provide a "none" category for when
none of the labels applies to an image. To provide this option to workers, include a label similar to
"none" or "other" when you create a multi-label image classification job.
To restrict workers to choosing a single label for each image, use the Image Classification (Single
Label) (p. 545) task type.
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create a labeling
job in the console, you specify instructions to help workers complete the job and labels that workers can
choose from.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Pre-annotation Lambda functions for this task type end with PRE-ImageMultiClassMultiLabel.
To find the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
ImageMultiClassMultiLabel. To find the annotation-consolidation Lambda ARN for your Region,
see AnnotationConsolidationLambdaArn.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.
response = client.create_labeling_job(
LabelingJobName='example-multi-label-image-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
ImageMultiClassMultiLabel',
'TaskKeywords': [
'Image Classification',
],
'TaskTitle': 'Multi-label image classification task',
'TaskDescription': 'Select all labels that apply to the images shown',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-ImageMultiClassMultiLabel'
        }
    },
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.
Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier-multi-select
name="crowd-image-classifier-multi-select"
src="{{ task.input.taskObject | grant_read_access }}"
header="Please identify all classes in image"
categories="{{ task.input.labels | to_json | escape }}"
>
<full-instructions header="Multi Label Image classification instructions">
<ol><li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to understand
more about the labels.</li>
<li><strong>Choose</strong> the appropriate labels that best suit the image.</li></
ol>
</full-instructions>
<short-instructions>
<h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
<p>Enter description to explain the correct label to the workers</p>
    <h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3>
    <p>Enter description of an incorrect label</p>
  </short-instructions>
 </crowd-image-classifier-multi-select>
</crowd-form>
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
To see an example of an output manifest file for a multi-label image classification labeling job, see Multi-
label Classification Job Output (p. 781).
Image Label Verification
You can use an Amazon SageMaker Ground Truth image label verification task to direct workers
to review a dataset's labels and improve label accuracy. Workers can indicate if the existing labels
are correct or rate label quality. They can also add comments to explain their reasoning. Amazon
SageMaker Ground Truth supports label verification for Bounding Box (p. 532) and Image Semantic
Segmentation (p. 538) labels.
You create an image label verification labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.
Ground Truth provides a worker console similar to the following for labeling tasks. When you create the
labeling job with the console, you can modify the images and content that are shown. To learn how to
create a labeling job using the Ground Truth console, see Create a Labeling Job (Console) (p. 706).
You can create a label verification labeling job using the SageMaker console or API. To learn how to
create a labeling job using the Ground Truth API operation CreateLabelingJob, see Create a Labeling
Job (API) (p. 709).
Label Text
Topics
• Named Entity Recognition (p. 552)
• Text Classification (Single Label) (p. 556)
• Text Classification (Multi-label) (p. 559)
Named Entity Recognition
When tasked with a named entity recognition labeling job, workers apply your labels to specific words
or phrases within a larger text block. They choose a label, then apply it by using the cursor to highlight
the part of the text to which the label applies. The Ground Truth named entity recognition tool supports
overlapping annotations, in-context label selection, and multi-label selection for a single highlight. Also,
workers can use their keyboards to quickly select labels.
You can create a named entity recognition labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.
Important
If you manually create an input manifest file, use "source" to identify the text that you want
labeled. For more information, see Input Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Pre-annotation Lambda functions for this task type end with PRE-NamedEntityRecognition. To
find the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
NamedEntityRecognition. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.
• You must provide the following ARN for HumanTaskUiArn:
arn:aws:sagemaker:aws-region:394669845002:human-task-ui/NamedEntityRecognition
Replace aws-region with the AWS Region you use to create the labeling job. For example, use us-
west-1 if you create a labeling job in US West (N. California). A sketch of this substitution follows this
list.
• Provide worker instructions in the label category configuration file using the instructions
parameter. You can use a string, or HTML markup language in the shortInstruction and
fullInstruction fields. For more details, see Provide Worker Instructions in a Label Category
Configuration File (p. 555).
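As an example of the Region substitution described in this list, the following Python sketch builds the
HumanTaskUiArn value:
region = 'us-west-1'  # The Region where you create the labeling job.

human_task_ui_arn = (
    f'arn:aws:sagemaker:{region}:394669845002:human-task-ui/NamedEntityRecognition'
)
print(human_task_ui_arn)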
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.
response = client.create_labeling_job(
LabelingJobName='example-ner-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'HumanTaskUiArn': 'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/
NamedEntityRecognition'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
NamedEntityRecognition',
'TaskKeywords': [
'Named entity Recognition',
],
'TaskTitle': 'Named entity Recognition task',
'TaskDescription': 'Apply the labels provided to specific words or phrases within
the larger text block.',
'NumberOfHumanWorkersPerDataObject': 1,
'TaskTimeLimitInSeconds': 28800,
'TaskAvailabilityLifetimeInSeconds': 864000,
'MaxConcurrentTaskCount': 1000,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-NamedEntityRecognition'
        }
    },
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
You must provide worker instructions in the label category configuration file you identify with the
LabelCategoryConfigS3Uri parameter in CreateLabelingJob. You can use these instructions to
provide details about the task you want workers to perform and help them use the tool efficiently.
You provide short and long instructions using shortInstruction and fullInstruction in the
instructions parameter, respectively. To learn more about these instruction types, see Creating
Instruction Pages (p. 704).
The following is an example of a label category configuration file with instructions that can be used for a
named entity recognition labeling job.
{
    "document-version": "2018-11-28",
    "labels": [
        {
            "label": "label1",
            "shortDisplayName": "L1"
        },
        {
            "label": "label2",
            "shortDisplayName": "L2"
        },
        {
            "label": "label3",
            "shortDisplayName": "L3"
        },
        {
            "label": "label4",
            "shortDisplayName": "L4"
        },
        {
            "label": "label5",
            "shortDisplayName": "L5"
        }
    ],
    "instructions": {
        "shortInstruction": "<p>Enter description of the labels that workers have to choose from</p><br><p>Add examples to help workers understand the label</p>",
        "fullInstruction": "<ol><li><strong>Read</strong> the text carefully.</li><li><strong>Highlight</strong> words, phrases, or sections of the text.</li><li><strong>Choose</strong> the label that best matches what you have highlighted.</li></ol>"
    }
}
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
Text Classification (Single Label)
You create a text classification labeling job using the Ground Truth section of the Amazon SageMaker
console or the CreateLabelingJob operation.
Important
If you manually create an input manifest file, use "source" to identify the text that you want
labeled. For more information, see Input Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Pre-annotation Lambda functions for this task type end with PRE-TextMultiClass. To find the pre-
annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
TextMultiClass. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.
response = client.create_labeling_job(
LabelingJobName='example-text-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
TextMultiClass',
        'TaskKeywords': [
            'Text classification',
        ],
        'TaskTitle': 'Text classification task',
'TaskDescription': 'Carefully read and classify this text using the categories
provided.',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-TextMultiClass'
        }
    },
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.
Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="crowd-classifier"
categories="{{ task.input.labels | to_json | escape }}"
header="classify text"
>
<classification-target style="white-space: pre-wrap">
{{ task.input.taskObject }}
</classification-target>
<full-instructions header="Classifier instructions">
<ol><li><strong>Read</strong> the text carefully.</li>
<li><strong>Read</strong> the examples to understand more about the options.</li>
<li><strong>Choose</strong> the appropriate labels that best suit the text.</li></ol>
</full-instructions>
<short-instructions>
<p>Enter description of the labels that workers have to choose from</p>
<p><br></p><p><br></p><p>Add examples to help workers understand the label</p>
<p><br></p><p><br></p><p><br></p><p><br></p><p><br></p>
</short-instructions>
</crowd-classifier>
</crowd-form>
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
To see an example of an output manifest file from a text classification labeling job, see Classification
Job Output (p. 780).
Text Classification (Multi-label)
When working on a multi-label text classification task, workers should choose all applicable labels,
but must choose at least one. When creating a job using this task type, you can provide up to 50 label
categories.
Amazon SageMaker Ground Truth doesn't provide a "none" category for when none of the labels applies.
To provide this option to workers, include a label similar to "none" or "other" when you create a multi-
label text classification job.
To restrict workers to choosing a single label for each document or text selection, use the Text
Classification (Single Label) (p. 556) task type.
Important
If you manually create an input manifest file, use "source" to identify the text that you want
labeled. For more information, see Input Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Pre-annotation Lambda functions for this task type end with PRE-TextMultiClassMultiLabel. To
find the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
TextMultiClassMultiLabel. To find the annotation-consolidation Lambda ARN for your Region,
see AnnotationConsolidationLambdaArn.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.
response = client.create_labeling_job(
LabelingJobName='example-multi-label-text-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/custom-worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
TextMultiClassMultiLabel',
'TaskKeywords': [
'Text Classification',
],
'TaskTitle': 'Multi-label text classification task',
'TaskDescription': 'Select all labels that apply to the text shown',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-TextMultiClassMultiLabel'
        }
    },
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.
Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier-multi-select
name="crowd-classifier-multi-select"
categories="{{ task.input.labels | to_json | escape }}"
header="Please identify all classes in the below text"
>
<classification-target style="white-space: pre-wrap">
{{ task.input.taskObject }}
</classification-target>
<full-instructions header="Classifier instructions">
<ol><li><strong>Read</strong> the text carefully.</li>
<li><strong>Read</strong> the examples to understand more about the options.</li>
<li><strong>Choose</strong> the appropriate labels that best suit the text.</li></ol>
</full-instructions>
<short-instructions>
<p>Enter description of the labels that workers have to choose from</p>
<p><br></p>
<p><br></p><p>Add examples to help workers understand the label</p>
<p><br></p><p><br></p><p><br></p><p><br></p><p><br></p>
</short-instructions>
</crowd-classifier-multi-select>
</crowd-form>
To learn how to create a custom template, see Creating Custom Labeling Workflows (p. 671).
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
To see an example of output manifest files for a multi-label text classification labeling job, see Multi-label
Classification Job Output (p. 781).
Label Videos and Video Frames
You can use Ground Truth built-in task types to label videos and video frames:
• Video clip classification – Enable workers to classify videos into categories you specify. For example,
you can use this task type to have workers categorize videos into topics like sports, comedy, music, and
education. To learn more, see Video Classification (p. 562).
• Video frame labeling jobs – Enable workers to annotate video frames extracted from a video using
bounding boxes, polylines, polygons, or keypoint annotation tools. Ground Truth offers two built-in
task types to label video frames:
• Video frame object detection: Enable workers to identify and locate objects in video frames.
• Video frame object tracking: Enable workers to track the movement of objects across video frames.
• Video frame adjustment jobs: Have workers adjust labels, label category attributes, and frame
attributes from a previous video frame object detection or object tracking labeling job.
• Video frame verification jobs: Have workers verify labels, label category attributes, and frame
attributes from a previous video frame object detection or object tracking labeling job.
If you have video files, you can use the Ground Truth automatic frame extraction tool to extract video
frames from your videos. To learn more, see Video Frame Input Data (p. 770).
Tip
To learn more about supported file types and input data quotas, see Input Data (p. 734).
Topics
• Video Classification (p. 562)
• Label Video Frames (p. 567)
• Worker Instructions (p. 579)
Video Classification
Use an Amazon SageMaker Ground Truth video classification labeling task when you need workers to
classify videos using predefined labels that you specify. Workers are shown videos and are asked to
choose one label for each video.
You create a video classification labeling job using the Ground Truth section of the Amazon SageMaker
console or the CreateLabelingJob operation.
Your video files must be encoded in a format that is supported by the browser used by the work team
that labels your data. We recommend that you use the worker UI preview to verify that all video file
formats in your input manifest file display correctly. You can communicate supported browsers to your
workers using worker instructions. To see supported file formats, see Supported Data Formats (p. 737).
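As a quick sanity check before you launch the job, you might scan your input manifest for video file
extensions; this sketch assumes a local copy of the manifest and assumes MP4 and WebM are the
formats your work team's browsers play:
import json

PLAYABLE = ('.mp4', '.webm')  # Assumption: formats your work team's browsers support.

with open('manifest-with-input-data.json') as manifest:
    for line in manifest:
        uri = json.loads(line)['source-ref']
        if not uri.lower().endswith(PLAYABLE):
            print('Check encoding for:', uri)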
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each video file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).
Ground Truth provides a worker UI similar to the following for labeling tasks. When you create a labeling
job in the console, you specify instructions to help workers complete the job and labels from which
workers can choose.
Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:
• Use a pre-annotation Lambda function that ends with PRE-VideoClassification. To find the pre-
annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Use an annotation-consolidation Lambda function that ends with ACS-
VideoClassification. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region.
response = client.create_labeling_job(
LabelingJobName='example-video-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
VideoClassification',
'TaskKeywords': [
'Video Classification',
],
'TaskTitle': 'Video classification task',
'TaskDescription': 'Select a label to classify this video',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-VideoClassification'
        }
    },
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template by modifying the short-instructions, full-
instructions, and header. Upload this template to Amazon S3, and provide the Amazon S3 URI to
this file in UiTemplateS3Uri.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="crowd-classifier"
categories="{{ task.input.labels | to_json | escape }}"
header="Please classify video"
>
<classification-target>
<video width="100%" controls>
<source src="{{ task.input.taskObject | grant_read_access }}"
type="video/mp4"/>
<source src="{{ task.input.taskObject | grant_read_access }}"
type="video/webm"/>
<source src="{{ task.input.taskObject | grant_read_access }}"
type="video/ogg"/>
Your browser does not support the video tag.
</video>
</classification-target>
<full-instructions header="Video classification instructions">
<ol><li><strong>Read</strong> the task carefully and inspect the
video.</li>
<li><strong>Read</strong> the options and review the examples
provided to understand more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits
the video.</li></ol>
</full-instructions>
<short-instructions>
<h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
<p>Enter description to explain the correct label to the workers</
p>
<p><img src="https://d7evko5405gb7.cloudfront.net/
fe4fed9b-660c-4477-9294-2c66a15d6bbe/src/images/quick-instructions-example-placeholder.png"
style="max-width:100%"></p>
<h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3>
<p>Enter description of an incorrect label</p>
<p><img src="https://d7evko5405gb7.cloudfront.net/
fe4fed9b-660c-4477-9294-2c66a15d6bbe/src/images/quick-instructions-example-placeholder.png"
style="max-width:100%"></p>
</short-instructions>
</crowd-classifier>
</crowd-form>
To learn more about the output manifest file generated by Ground Truth and the file structure that
Ground Truth uses to store your output data, see Output Data (p. 776).
To see an example of output manifest files for video classification labeling jobs, see Classification Job
Output (p. 780).
Label Video Frames
If you do not have video frames, you can provide video files (MP4 files) and use the Ground
Truth automated frame extraction tool to extract video frames. To learn more, see Provide Video
Files (p. 772).
You can use the following built-in video task types to create video frame labeling jobs using the Amazon
SageMaker console, API, and language-specific SDKs.
• Video frame object detection – Use this task type when you want workers to identify and locate
objects in sequences of video frames. You provide a list of categories, and workers can select one
category at a time and annotate the objects to which that category applies in all frames. For example, you
can use this task to ask workers to identify and localize various objects in a scene, such as cars, bikes,
and pedestrians.
• Video frame object tracking – Use this task type when you want workers to track the movement of
instances of objects across sequences of video frames. When a worker adds an annotation to a single
frame, that annotation is associated with a unique instance ID. The worker adds annotations associated
with the same ID in all other frames to identify the same object or person. For example, a worker
can track the movement of a vehicle across a sequence of video frames by drawing bounding boxes
associated with the same ID around the vehicle in each frame that it appears.
Use the following topics to learn more about these built-in task types and how to create a labeling
job using each task type. See Task Types (p. 576) to learn more about the annotation tools (bounding
boxes, polylines, polygons, and keypoints) available for these task types.
Before you create a labeling job, we recommend that you review Video Frame Labeling Job
Overview (p. 575).
Topics
• Video Frame Object Detection (p. 567)
• Video Frame Object Tracking (p. 571)
• Video Frame Labeling Job Overview (p. 575)
Video Frame Object Detection
Use this task type when you want workers to identify the location of objects in sequences of video
frames using bounding box, polyline, polygon, or keypoint annotation tools. The tool you choose defines
the video frame task type you create. For example, you can use a bounding box video frame object
detection task type to ask workers to identify and localize various objects in a series of video frames,
such as cars, bikes, and pedestrians.
You can create a video frame object detection labeling job using the Amazon SageMaker Ground Truth
console, the SageMaker API, and language-specific AWS SDKs. To learn more, see Create a Video Frame
Object Detection Labeling Job (p. 568) and select your preferred method. See Task Types (p. 576) to
learn more about the annotation tools you can choose from when you create a labeling job.
Ground Truth provides a worker UI and tools to complete your labeling job tasks: Preview the Worker
UI (p. 568).
You can create a job to adjust annotations created in a video object detection labeling job using the
video object detection adjustment task type. To learn more, see Create Video Frame Object Detection
Adjustment or Verification Labeling Job (p. 571).
Ground Truth provides workers with a web user interface (UI) to complete your video frame object
detection annotation tasks. You can preview and interact with the worker UI when you create a labeling
job in the console. If you are a new user, we recommend that you create a labeling job through the
console using a small input dataset to preview the worker UI and ensure your video frames, labels, and
label attributes appear as expected.
The UI provides workers with the following assistive labeling tools to complete your object detection
tasks:
• For all tasks, workers can use the Copy to next and Copy to all features to copy an annotation to the
next frame or to all subsequent frames respectively.
• For tasks that include the bounding box tools, workers can use a Predict next feature to draw a
bounding box in a single frame, and then have Ground Truth predict the location of boxes with the
same label in all other frames. Workers can then make adjustments to correct predicted box locations.
You can create a video frame object detection labeling job using the SageMaker console or the
CreateLabelingJob API operation.
This section assumes that you have reviewed the Video Frame Labeling Job Overview (p. 575) and have
chosen the type of input data and the input dataset connection you are using.
You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a
video frame object detection job in the SageMaker console. In step 10, choose Video - Object detection
from the Task category dropdown list. Select the task type you want by selecting one of the cards in
Task selection.
You create an object detection labeling job using the SageMaker API operation CreateLabelingJob.
This operation is available in all AWS SDKs. To see a list of language-specific SDKs supported for
this operation, review the See Also section of CreateLabelingJob.
Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region.
import boto3

client = boto3.client('sagemaker', region_name='us-east-1')

# Replace the bucket, role, and placeholder values (such as 123 and 'string')
# with values for your own account and job.
response = client.create_labeling_job(
    LabelingJobName='example-video-od-labeling-job',
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': 's3://DOC-EXAMPLE-BUCKET/path/video-frame-sequence-input-manifest.json'
            }
        },
        'DataAttributes': {
            'ContentClassifiers': [
                'FreeOfPersonallyIdentifiableInformation',
                'FreeOfAdultContent',
            ]
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://DOC-EXAMPLE-BUCKET/prefix/file-to-store-output-data',
        'KmsKeyId': 'string'
    },
    RoleArn='arn:aws:iam::*:role/*',
    LabelCategoryConfigS3Uri='s3://bucket/prefix/label-categories.json',
    StoppingConditions={
        'MaxHumanLabeledObjectCount': 123,
        'MaxPercentageOfInputDatasetLabeled': 123
    },
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-east-1:*:workteam/private-crowd/*',
        'UiConfig': {
            'HumanTaskUiArn': 'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/VideoObjectDetection'
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-VideoObjectDetection',
        'TaskKeywords': [
            'Video Frame Object Detection',
        ],
        'TaskTitle': 'Video frame object detection task',
        'TaskDescription': 'Classify and identify the location of objects and people in video frames',
        'NumberOfHumanWorkersPerDataObject': 123,
        'TaskTimeLimitInSeconds': 123,
        'TaskAvailabilityLifetimeInSeconds': 123,
        'MaxConcurrentTaskCount': 123,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-VideoObjectDetection'
        }
    },
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ]
)
You can create an adjustment and verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).
When you create a video frame object detection labeling job, tasks are sent to workers. When these
workers complete their tasks, labels are written to the Amazon S3 output location you specified when
you created the labeling job. To learn about the video frame object detection output data format, see
Video Frame Object Detection Output (p. 788). If you are a new user of Ground Truth, see Output
Data (p. 776) to learn more about the Ground Truth output data format.
Video Frame Object Tracking
Use this task type when you want workers to track the movement of object instances across sequences
of video frames using bounding box, polyline, polygon, or keypoint annotation tools. The tool you
choose defines the video frame task type you create. For example, you can use a bounding box video
frame object tracking task type to ask workers to track the movement of objects, such as cars, bikes,
and pedestrians, by drawing boxes around them.
You provide a list of categories, and each annotation that a worker adds to a video frame is identified as
an instance of that category using an instance ID. For example, if you provide the label category car, the
first car that a worker annotates will have the instance ID car:1. The second car the worker annotates will
have the instance ID car:2. To track an object's movement, the worker adds annotations associated with
the same instance ID around the object in all frames.
You can create a video frame object tracking labeling job using the Amazon SageMaker Ground Truth
console, the SageMaker API, and language-specific AWS SDKs. To learn more, see Create a Video Frame
Object Tracking Labeling Job (p. 572) and select your preferred method. See Task Types (p. 576) to
learn more about the annotation tools you can choose from when you create a labeling job.
Ground Truth provides a worker UI and tools to complete your labeling job tasks: Preview the Worker
UI (p. 572).
You can create a job to adjust annotations created in a video object tracking labeling job using the
video object tracking adjustment task type. To learn more, see Create Video Frame Object Tracking
Adjustment or Verification Labeling Job (p. 574).
Ground Truth provides workers with a web user interface (UI) to complete your video frame object
tracking annotation tasks. You can preview and interact with the worker UI when you create a labeling
job in the console. If you are a new user, we recommend that you create a labeling job through the
console using a small input dataset to preview the worker UI and ensure your video frames, labels, and
label attributes appear as expected.
The UI provides workers with the following assistive labeling tools to complete your object tracking
tasks:
• For all tasks, workers can use the Copy to next and Copy to all features to copy an annotation with the
same unique ID to the next frame or to all subsequent frames respectively.
• For tasks that include the bounding box tools, workers can use a Predict next feature to draw a
bounding box in a single frame, and then have Ground Truth predict the location of boxes with the
same unique ID in all other frames. Workers can then make adjustments to correct predicted box
locations.
You can create a video frame object tracking labeling job using the SageMaker console or the
CreateLabelingJob API operation.
This section assumes that you have reviewed the Video Frame Labeling Job Overview (p. 575) and have
chosen the type of input data and the input dataset connection you are using.
You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a
video frame object tracking job in the SageMaker console. In step 10, choose Video - Object tracking
from the Task category dropdown list. Select the task type you want by selecting one of the cards in
Task selection.
You create an object tracking labeling job using the SageMaker API operation CreateLabelingJob.
This operation is available in all AWS SDKs. To see a list of language-specific SDKs supported for
this operation, review the See Also section of CreateLabelingJob.
Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region.
import boto3

client = boto3.client('sagemaker', region_name='us-east-1')

# Replace the bucket, role, and placeholder values (such as 123 and 'string')
# with values for your own account and job.
response = client.create_labeling_job(
    LabelingJobName='example-video-ot-labeling-job',
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': 's3://DOC-EXAMPLE-BUCKET/path/video-frame-sequence-input-manifest.json'
            }
        },
        'DataAttributes': {
            'ContentClassifiers': [
                'FreeOfPersonallyIdentifiableInformation',
                'FreeOfAdultContent',
            ]
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://DOC-EXAMPLE-BUCKET/prefix/file-to-store-output-data',
        'KmsKeyId': 'string'
    },
    RoleArn='arn:aws:iam::*:role/*',
    LabelCategoryConfigS3Uri='s3://bucket/prefix/label-categories.json',
    StoppingConditions={
        'MaxHumanLabeledObjectCount': 123,
        'MaxPercentageOfInputDatasetLabeled': 123
    },
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-east-1:*:workteam/private-crowd/*',
        'UiConfig': {
            'HumanTaskUiArn': 'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/VideoObjectTracking'
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-VideoObjectTracking',
        'TaskKeywords': [
            'Video Frame Object Tracking',
        ],
        'TaskTitle': 'Video frame object tracking task',
        'TaskDescription': 'Tracking the location of objects and people across video frames',
        'NumberOfHumanWorkersPerDataObject': 123,
        'TaskTimeLimitInSeconds': 123,
        'TaskAvailabilityLifetimeInSeconds': 123,
        'MaxConcurrentTaskCount': 123,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-VideoObjectTracking'
        }
    },
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ]
)
You can create an adjustment and verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).
When you create a video frame object tracking labeling job, tasks are sent to workers. When these
workers complete their tasks, labels are written to the Amazon S3 output location you specified when
you created the labeling job. To learn about the video frame object tracking output data format, see
Video Frame Object Tracking Output (p. 790). If you are a new user of Ground Truth, see Output
Data (p. 776) to learn more about the Ground Truth output data format.
Video Frame Labeling Job Overview
Video frame labeling jobs have the following unique features:
• You can either provide data objects that are ready to be annotated (video frames), or you can provide
video files and have Ground Truth automatically extract video frames.
• Workers have the ability to save work as they go.
• You cannot use the Amazon Mechanical Turk workforce to complete your labeling tasks.
• Ground Truth provides a worker UI, as well as assistive and basic labeling tools, to help workers
complete your tasks. You do not need to provide a worker task template.
Topics
• Input Data (p. 576)
• Job Completion Times (p. 576)
• Task Types (p. 576)
• Workforces (p. 577)
• Worker User Interface (UI) (p. 577)
• Video Frame Job Permission Requirements (p. 579)
Input Data
The video frame labeling job uses sequences of video frames. A single sequence is a series of images that
have been extracted from a single video. You can either provide your own sequences of video frames, or
have Ground Truth automatically extract video frame sequences from your video files. To learn more, see
Provide Video Files (p. 772).
Ground Truth uses sequence files to identify all images in a single sequence. All of the sequences that
you want to include in a single labeling job are identified in an input manifest file. Each sequence is
used to create a single worker task. You can automatically create sequence files and an input manifest
file using Ground Truth automatic data setup. To learn more, see Automated Video Frame Input Data
Setup (p. 773).
To learn how to manually create sequence files and an input manifest file, see Create a Video Frame
Input Manifest File (p. 775).
Job Completion Times
Video and video frame labeling jobs can take workers hours to complete. You can set the total amount of
time that workers can work on each task when you create a labeling job. The maximum time you can set
for workers to work on tasks is 7 days. The default value is 3 days.
We strongly recommend that you create tasks that workers can complete within 12 hours. Workers must
keep the worker UI open while working on a task. They can save work as they go and Ground Truth saves
their work every 15 minutes.
When using the SageMaker CreateLabelingJob API operation, set the total time a task is available to
workers in the TaskTimeLimitInSeconds parameter of HumanTaskConfig.
When you create a labeling job in the console, you can specify this time limit when you select your
workforce type and your work team.
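As a sketch, the relevant HumanTaskConfig fields look like the following; the values shown here are
illustrative, not defaults you must use:

human_task_config = {
    # Time a worker can spend on one task: default 3 days, maximum 7 days (604800 seconds)
    'TaskTimeLimitInSeconds': 259200,
    # How long tasks remain available to the work team before expiring (illustrative value)
    'TaskAvailabilityLifetimeInSeconds': 864000
}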
Task Types
When you create a video object tracking or video object detection labeling job, you specify the type of
annotation that you want workers to create while working on your labeling task. The annotation type
determines the type of output data Ground Truth returns and defines the task type for your labeling job.
If you are creating a labeling job using the API operation CreateLabelingJob, you specify the task
type using the label category configuration file parameter annotationType. To learn more, see Create
a Labeling Category Configuration File with Label Category and Frame Attributes (p. 719).
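The following is a hypothetical label category configuration file for a bounding box task; the labels,
attributes, and instructions are placeholders, the annotationType value corresponds to the task types
described below, and the complete schema is described in the topic referenced above:

{
    "document-version": "2020-08-25",
    "frameAttributes": [
        {
            "name": "count",
            "description": "How many objects do you see in this frame?",
            "type": "number"
        }
    ],
    "labels": [
        {
            "label": "car",
            "attributes": [
                {
                    "name": "occluded",
                    "type": "string",
                    "enum": ["partial", "completely", "no"]
                }
            ]
        },
        {
            "label": "pedestrian"
        }
    ],
    "instructions": {
        "shortInstruction": "Draw a box around each object.",
        "fullInstruction": "Full instructions go here."
    },
    "annotationType": "BoundingBox"
}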
The following task types are available for both video object tracking or video object detection labeling
jobs:
• Bounding box – Workers are provided with tools to create bounding box annotations. A bounding box
is a box that a worker draws around an object to identify the pixel-location and label of that object in
the frame.
• Polyline – Workers are provided with tools to create polyline annotations. A polyline is defined by a
series of ordered x, y coordinates. Each point added to the polyline is connected to the previous point
by a line. The polyline does not have to be closed (the start point and end point do not have to be the
same) and there are no restrictions on the angles formed between lines.
• Polygon – Workers are provided with tools to create polygon annotations. A polygon is a closed shape
defined by a series of ordered x, y coordinates. Each point added to the polygon is connected to the
previous point by a line and there are no restrictions on the angles formed between lines. Two lines
(sides) of the polygon cannot cross. The start and end point of a polygon must be the same.
• Keypoint – Workers are provided with tools to create keypoint annotations. A keypoint is a single point
associated with an x, y coordinate in the video frame.
Workforces
When you create a video frame labeling job, you need to specify a work team to complete your
annotation tasks. You can choose a work team from a private workforce of your own workers, or from a
vendor workforce that you select in the AWS Marketplace. You cannot use the Amazon Mechanical Turk
workforce for video frame labeling jobs.
To learn more about vendor workforces, see Managing Vendor Workforces (p. 867).
To learn how to create and manage a private workforce, see Use a Private Workforce (p. 868).
Worker User Interface (UI)
Ground Truth provides a worker user interface (UI), tools, and assistive labeling features to help workers
complete your video labeling tasks. You can preview the worker UI when you create a labeling job in the
console.
When you create a labeling job using the API operation CreateLabelingJob, you must provide an ARN
provided by Ground Truth in the parameter HumanTaskUiArn to specify the worker UI for your task
type. You can use HumanTaskUiArn with the SageMaker RenderUiTemplate API operation to preview
the worker UI.
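For example, a minimal Boto3 sketch of this preview; the role and task input shown are placeholder
values, and the ARN is the built-in video frame object detection UI used elsewhere on this page:

import boto3

sagemaker = boto3.client('sagemaker', region_name='us-east-1')

# Render the pre-built worker UI to an HTML file for local preview.
response = sagemaker.render_ui_template(
    HumanTaskUiArn='arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/VideoObjectDetection',
    Task={'Input': '{"taskObject": "example-task-input"}'},
    RoleArn='arn:aws:iam::111122223333:role/example-ground-truth-role'
)

with open('worker-ui-preview.html', 'w') as f:
    f.write(response['RenderedContent'])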
You provide worker instructions, labels, and optionally, attributes that workers can use to provide more
information about labels and video frames. These attributes are referred to as label category attributes
and frame attributes respectively. They are all displayed in the worker UI.
When you create a video object tracking or video object detection labeling job, you can add one or more
label category attributes and frame attributes:
• Label category attribute – A list of options (strings), a free form text box, or a numeric field associated
with one or more labels. It is used by workers to provide metadata about a label.
• Frame attribute – A list of options (strings), a free form text box, or a numeric field that appears on
each video frame a worker is sent to annotate. It is used by workers to provide metadata about video
frames.
Additionally, you can use label and frame attributes to have workers verify labels in a video frame label
verification job.
Use the following sections to learn more about these attributes. To learn how to add label category and
frame attributes to a labeling job, use the Create Labeling Job sections on the task type page (p. 567)
of your choice.
For example, if you add the label category car, you might also want to capture additional data about
your labeled cars, such as if they are occluded or the size of the car. You can capture this metadata using
label category attributes. In this example, if you added the attribute occluded to the car label category,
you can assign the values partial, completely, and no to the occluded attribute and enable workers to
select one of these options.
When you create a label verification job, you add label category attributes to each label you want
workers to verify.
For example, you can add a number-frame attribute to have workers identify the number of objects they
see in a particular frame.
In another example, you may want to provide a free-form text box to give workers the ability to provide
an answer to a question.
When you create a label verification job, you can add one or more frame attributes to ask workers to
provide feedback on all labels in a video frame.
Worker Instructions
You can provide worker instructions to help your workers complete your video frame labeling tasks. For
example, you might want to describe the labels workers should apply and show examples of correct and
incorrect annotations.
You can add your worker instructions using the SageMaker console while creating a labeling job. If you
create a labeling job using the API operation CreateLabelingJob, you specify worker instructions in
your label category configuration file.
In addition to your instructions, Ground Truth provides a link to help workers navigate and use the
worker portal. View these instructions by selecting the task type on Worker Instructions (p. 579).
Declining Tasks
Workers are able to decline tasks.
Workers decline a task if the instructions are not clear, input data is not displaying correctly, or
if they encounter some other issue with the task. If the number of workers specified in
NumberOfHumanWorkersPerDataObject decline the task, the data object is marked as expired and
is not sent to additional workers.
Video Frame Job Permission Requirements
When you create a video frame labeling job, in addition to the permission requirements found in Assign
IAM Permissions to Use Ground Truth (p. 817), you must add a CORS policy to your S3 bucket that
contains your input manifest file.
When you create a video frame labeling job, you specify buckets in S3 where your input data and
manifest file are located and where your output data will be stored. These buckets may be the same. You
must attach the following Cross-origin resource sharing (CORS) policy to your input and output buckets.
If you use the Amazon S3 console to add the policy to your bucket, you must use the JSON format.
JSON
[
{
"AllowedHeaders": [
"*"
],
"AllowedMethods": [
"GET",
"HEAD",
"PUT"
],
"AllowedOrigins": [
"*"
],
"ExposeHeaders": [
"Access-Control-Allow-Origin"
],
"MaxAgeSeconds": 3000
}
]
XML
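The following is the equivalent CORS configuration in XML format, reconstructed from the JSON policy
above:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <CORSRule>
        <AllowedOrigin>*</AllowedOrigin>
        <AllowedMethod>GET</AllowedMethod>
        <AllowedMethod>HEAD</AllowedMethod>
        <AllowedMethod>PUT</AllowedMethod>
        <AllowedHeader>*</AllowedHeader>
        <ExposeHeader>Access-Control-Allow-Origin</ExposeHeader>
        <MaxAgeSeconds>3000</MaxAgeSeconds>
    </CORSRule>
</CORSConfiguration>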
To learn how to add a CORS policy to an S3 bucket, see How do I add cross-domain resource sharing
with CORS? in the Amazon Simple Storage Service User Guide.
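You can also attach the policy programmatically. The following is a minimal sketch using the AWS
Python SDK (Boto3); the bucket name is a placeholder:

import boto3

s3 = boto3.client('s3')

# Attach the CORS policy required for video frame labeling jobs
# to the input (and, if different, output) bucket.
s3.put_bucket_cors(
    Bucket='DOC-EXAMPLE-BUCKET',
    CORSConfiguration={
        'CORSRules': [
            {
                'AllowedHeaders': ['*'],
                'AllowedMethods': ['GET', 'HEAD', 'PUT'],
                'AllowedOrigins': ['*'],
                'ExposeHeaders': ['Access-Control-Allow-Origin'],
                'MaxAgeSeconds': 3000
            }
        ]
    }
)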
Worker Instructions
This topic provides an overview of the Ground Truth worker portal and the tools available to complete
your video frame labeling task. First, select the type of task you are working on from Topics.
Important
It is recommended that you complete your task using a Google Chrome or Firefox web browser.
For adjustment jobs, select the original labeling job task type that produced the labels you are adjusting.
Review and adjust the labels in your task as needed.
Topics
• Work on Video Frame Object Tracking Tasks (p. 580)
• Work on Video Frame Object Detection Tasks (p. 587)
Work on Video Frame Object Tracking Tasks
You can use the worker UI to navigate between video frames and use the tools provided to identify
unique objects and track their movement from one frame to the next. Use this page to learn how to
navigate your worker UI, use the tools provided, and complete your task.
It is recommended that you complete your task using a Google Chrome or Firefox web browser.
Important
If you see annotations have already been added to one or more video frames when you open
your task, adjust those annotations and add additional annotations as needed.
Topics
• Your Task (p. 580)
• Navigate the UI (p. 582)
• Bulk Edit Label and Frame Attributes (p. 582)
• Tool Guide (p. 583)
• Icons Guide (p. 585)
• Shortcuts (p. 586)
• Release, Stop and Resume, and Decline Tasks (p. 587)
• Saving Your Work and Submitting (p. 587)
Your Task
When you work on a video frame object tracking task, you need to select a category from the Label
category menu on the right side of your worker portal to start annotating. After you've chosen a
category, use the tools provided to annotate the objects that the category applies to. This annotation
will be associated with a unique label ID that should only be used for that object. Use this same label ID
to create additional annotations for the same object in all of the video frames that it appears in. Refer to
Tool Guide (p. 583) to learn more about the tools provided.
After you've added a label, you may see a downward pointing arrow next to the label in the Labels
menu. Select this arrow and then select one option for each label attribute you see to provide more
information about that label.
You may see frame attributes under the Labels menu. These attributes will appear on each frame in your
task. Use these attribute prompts to enter additional information about each frame.
After you've added a label, you can quickly add and edit a label category attribute value by using the
downward pointing arrow next to the label in the Labels menu. If you select the pencil icon next to the
label in the Labels menu, the Edit instance menu will appear. You can edit the label ID, label category,
and label category attributes using this menu.
To edit an annotation, select the label of the annotation that you want to edit in the Labels menu or
select the annotation in the frame. When you edit or delete an annotation, the action will only modify
the annotation in a single frame.
If you are working on a task that includes a bounding box tool, use the predict next icon to predict the
location of all bounding boxes that you have drawn in a frame in the next frame. If you select a single
box and then select the predict next icon, only that box will be predicted in the next frame. If you have
not added any boxes to the current frame, you will receive an error. You must add at least one box to the
frame before using this feature.
After you've used the predict next icon, review the location of each box in the next frame and make
adjustments to the box location and size if necessary.
For all other tools, you can use the Copy to next and Copy to all tools to copy your annotations to the
next or all frames respectively.
Navigate the UI
You can navigate between video frames using the navigation bar in the bottom-left corner of your UI.
Use the play button to automatically move through the entire sequence of frames.
Use the next frame and previous frame buttons to move forward or back one frame at a time. You can
also input a frame number to navigate to that frame.
You can zoom in to and out of all video frames. Once you have zoomed into a video frame, you can move
around in that frame using the move icon. When you set a new view in a single video frame by zooming
and moving within that frame, all video frames are set to the same view. You can reset all video frames
to their original view using the fit screen icon. For additional view options, see Icons Guide (p. 585).
When you are in the worker UI, you see the following menus:
• Instructions – Review these instructions before starting your task. Additionally, select More
instructions and review these instructions.
• Shortcuts – Use this menu to view keyboard shortcuts that you can use to navigate video frames and
use the tools provided.
• Help – Use this option to refer back to this documentation.
Bulk Edit Label and Frame Attributes
When you bulk edit an attribute, you specify one or more ranges of frames that you want to apply the
edit to. The attribute you select is edited in all frames in that range, including the start and end frames
you specify. When you bulk edit label attributes, the range you specify must contain the label that the
label attribute is attached to. If you specify frames that do not contain this label, you will receive an
error.
To bulk edit an attribute you must specify the desired value for the attribute first. For example, if you
want to change an attribute from Yes to No, you must select No, and then perform the bulk edit.
You can also specify a new value for an attribute that has not been filled in and then use the bulk edit
feature to fill in that value in multiple frames. To do this, select the desired value for the attribute and
complete the following procedure.
1. Use your mouse to right click the attribute you want to bulk edit.
2. Specify the range of frames you want to apply the bulk edit to using a dash (-) in the text box. For
example, if you want to apply the edit to frames one through ten, enter 1-10. If you want to apply
the edit to frames two to five, eight to ten, and twenty, enter 2-5,8-10,20.
3. Select Confirm.
If you get an error message, verify that you entered a valid range and that the label associated with the
label attribute you are editing (if applicable) exists in all frames specified.
You can quickly add a label to all previous or subsequent frames using the Duplicate to previous frames
and Duplicate to next frames options in the Label menu at the top of your screen.
Tool Guide
Your task will include one or more tools. The tool provided dictates the type of annotations you will
create to identify and track objects. Use the following table to learn more about each tool provided.
Polygon – A polygon is a closed shape defined by a series of points that you place. Each point added to
the polygon is connected to the previous point by a line and there are no restrictions on the angles
formed between lines. The start and end point must be the same.
Icons Guide
Use this table to learn about the icons you see in your UI. You can automatically select some of these
icons using the keyboard shortcuts found in the Shortcuts menu.
brightness – Choose this icon to adjust the brightness of all video frames.
contrast – Choose this icon to adjust the contrast of all video frames.
zoom in – Choose this icon to zoom in to all of the video frames.
zoom out – Choose this icon to zoom out of all of the video frames.
move screen – After you've zoomed into a video frame, choose this icon to move around in that video
frame. You can move around the video frame using your mouse by clicking and dragging the frame in
the direction you want it to move. This changes the view in all video frames.
undo – Undo an action. You can use this icon to remove a bounding box that you just added, or to undo
an adjustment you made to a bounding box.
redo – Redo an action that was undone using the undo icon.
delete label – Delete a label. This deletes the bounding box associated with the label in a single frame.
show or hide label – Select this icon to show a label that has been hidden. If this icon has a slash
through it, select it to hide the label.
edit label – Select this icon to open the Edit instance menu. Use this menu to edit a label category, ID,
and to add or edit label attributes.
Shortcuts
The keyboard shortcuts listed in the Shortcuts menu can help you quickly select icons, undo and redo
annotations, and use tools to add and edit annotations. For example, once you add a bounding box, you
can use P to quickly predict the location of that box in subsequent frames.
Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands.
Release, Stop and Resume, and Decline Tasks
When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:
• Decline task: You should only decline a task if something is wrong with the task, such as unclear video
frame images or an issue with the UI. If you decline a task, you will not be able to return to the task.
• Release task: Use this option to release a task and allow others to work on it. When you release a task,
you lose all work done on that task and other workers on your team can pick it up. If enough workers
pick up the task, you may not be able to return to it. When you select this button and then select
Confirm, you are returned to the worker portal. If the task is still available, its status will be Available.
If other workers pick it up, it will disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.
Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must
be completed. If you do not return to and complete this task within that time limit, it will expire and
your work will not be submitted. Contact your administrator for more information.
Saving Your Work and Submitting
You should periodically save your work using the Save button. Ground Truth automatically saves your
work every 15 minutes.
When you open a task, you must complete your work on it before pressing Submit.
Work on Video Frame Object Detection Tasks
You can use the worker UI to navigate between video frames and create annotations to identify objects
of interest. Use the sections on this page to learn how to navigate your worker UI, use the tools provided,
and complete your task.
It is recommended that you complete your task using a Google Chrome web browser.
Important
If you see annotations have already been added to one or more video frames when you open
your task, adjust those annotations and add additional annotations as needed.
Topics
• Your Task (p. 588)
• Navigate the UI (p. 589)
• Bulk Edit Label and Frame Attributes (p. 589)
• Tool Guide (p. 590)
• UI Icon Guide (p. 593)
• Shortcuts (p. 594)
• Release, Stop and Resume, and Decline Tasks (p. 594)
• Saving Your Work and Submitting (p. 595)
Your Task
When you work on a video frame object detection task, you need to select a category from the Label
category menu on the right side of your worker portal to start annotating. After you've chosen a
category, draw annotations around objects that this category applies to. To learn more about the tools
you see in your worker UI, refer to the Tool Guide (p. 590).
After you've added a label, you may see a downward pointing arrow next to the label in the Labels
menu. Select this arrow and then select one option for each label attribute you see to provide more
information about that label.
You may see frame attributes under the Labels menu. These attributes will appear on each frame in your
task. Use these attribute prompts to enter additional information about each frame.
To edit an annotation, select the label of the annotation that you want to edit in the Labels menu or
select the annotation in the frame. When you edit or delete an annotation, the action will only modify
the annotation in a single frame.
If you are working on a task that includes a bounding box tool, use the predict next icon to predict the
location of all bounding boxes that you have drawn in a frame in the next frame. If you select a single
box and then select the predict next icon, only that box will be predicted in the next frame. If you have
not added any boxes to the current frame, you will receive an error. You must add at least one box to the
frame before using this feature.
Note
The predict next feature will not overwrite manually created annotations. It will only add
annotations. If you use predict next and as a result have more than one bounding box around a
single object, delete all but one box. Each object should only be identified with a single box.
After you've used the predict next icon, review the location of each box in the next frame and make
adjustments to the box location and size if necessary.
For all other tools, you can use the Copy to next and Copy to all tools to copy your annotations to the
next or all frames respectively.
Navigate the UI
You can navigate between video frames using the navigation bar in the bottom-left corner of your UI.
Use the next frame and previous frame buttons to move forward or back one frame at a time. You can
also input a frame number to navigate to that frame.
You can zoom in to and out of all video frames. Once you have zoomed into a video frame, you can move
around in that frame using the move icon. When you navigate to a new view in a single video frame by
zooming and moving within that frame, all video frames are set to the same view. You can reset all video
frames to their original view using the fit screen icon. To learn more, see UI Icon Guide (p. 593).
When you are in the worker UI, you see the following menus:
• Instructions – Review these instructions before starting your task. Additionally, select More
instructions and review these instructions.
• Shortcuts – Use this menu to view keyboard shortcuts that you can use to navigate video frames and
use the annotation tools provided.
• Help – Use this option to refer back to this documentation.
Bulk Edit Label and Frame Attributes
When you bulk edit an attribute, you specify one or more ranges of frames that you want to apply the
edit to. The attribute you select is edited in all frames in that range, including the start and end frames
you specify. When you bulk edit label attributes, the range you specify must contain the label that the
label attribute is attached to. If you specify frames that do not contain this label, you will receive an
error.
To bulk edit an attribute you must specify the desired value for the attribute first. For example, if you
want to change an attribute from Yes to No, you must select No, and then perform the bulk edit.
You can also specify a new value for an attribute that has not been filled in and then use the bulk edit
feature to fill in that value in multiple frames. To do this, select the desired value for the attribute and
complete the following procedure.
1. Use your mouse to right click the attribute you want to bulk edit.
2. Specify the range of frames you want to apply the bulk edit to using a dash (-) in the text box. For
example, if you want to apply the edit to frames one through ten, enter 1-10. If you want to apply
the edit to frames two to five, eight to ten, and twenty, enter 2-5,8-10,20.
3. Select Confirm.
If you get an error message, verify that you entered a valid range and that the label associated with the
label attribute you are editing (if applicable) exists in all frames specified.
You can quickly add a label to all previous or subsequent frames using the Duplicate to previous frames
and Duplicate to next frames options in the Label menu at the top of your screen.
Tool Guide
Your task will include one or more tools. The tool provided dictates the type of annotations you will
create to identify and label objects. Use the following table to learn more about the tool or tools you
may see in your worker UI.
Polygon – A polygon is a closed shape defined by a series of points that you place. Each point added to
the polygon is connected to the previous point by a line and there are no restrictions on the angles
formed between lines. Two lines (sides) of the polygon cannot cross. A line becomes red if it violates
this condition. The start and end point must be the same.
UI Icon Guide
Use this table to learn about the icons you see in your worker task portal. You can automatically select
these icons using the keyboard shortcuts found in the Shortcuts menu.
brightness – Choose this icon to adjust the brightness of all video frames.
contrast – Choose this icon to adjust the contrast of all video frames.
zoom in – Choose this icon to zoom in to all of the video frames.
zoom out – Choose this icon to zoom out of all of the video frames.
move screen – After you've zoomed into a video frame, choose this icon to move around in that video
frame. You can move around in the video frame using your mouse by clicking and dragging the frame
in the direction you want it to move. This changes the view in all video frames.
undo – Undo an action. You can use this icon to remove a bounding box that you just added, or to undo
an adjustment you made to a bounding box.
redo – Redo an action that was undone using the undo icon.
delete label – Delete a label. This deletes the bounding box associated with the label in a single frame.
show or hide label – Select this icon to show a label that has been hidden. If this icon has a slash
through it, select it to hide the label.
Shortcuts
The keyboard shortcuts listed in the Shortcuts menu can help you quickly select icons, undo and redo
annotations, and use tools to add and edit annotations. For example, once you add a bounding box, you
can use P to quickly predict the location of that box in subsequent frames.
Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands.
Release, Stop and Resume, and Decline Tasks
When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:
• Decline task: You should only decline a task if something is wrong with the task, such as unclear video
frame images or an issue with the UI. If you decline a task, you will not be able to return to the task.
• Release task: Use this option to release a task and allow others to work on it. When you release a task,
you lose all work done on that task and other workers on your team can pick it up. If enough workers
pick up the task, you may not be able to return to it. When you select this button and then select
Confirm, you are returned to the worker portal. If the task is still available, its status will be Available.
If other workers pick it up, it will disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.
Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must
be completed. If you do not return to and complete this task within that time limit, it will expire and
your work will not be submitted. Contact your administrator for more information.
Saving Your Work and Submitting
You should periodically save your work. Ground Truth automatically saves your work every 15 minutes.
When you open a task, you must complete your work before pressing Submit.
3D Point Clouds
Point clouds are made up of three-dimensional (3D) visual data that consists of points. Each point is
described using three coordinates, typically x, y, and z. To add color or variations in point intensity to
the point cloud, points may be described with additional attributes, such as i for intensity or values for
the red (r), green (g), and blue (b) 8-bit color channels. When you create a Ground Truth 3D point cloud
labeling job, you can provide point cloud and, optionally, sensor fusion data.
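For example, in a plain-text (ASCII) point cloud file, each line typically describes one point. The
following hypothetical lines show points described by x, y, and z coordinates plus an intensity value i;
see Accepted Raw 3D Data Formats (p. 746) for the formats Ground Truth actually accepts:

-2.291 3.186 -0.362 22
-2.301 3.191 -0.360 18
-2.315 3.201 -0.357 25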
The following image shows a single, 3D point cloud scene rendered by Ground Truth and displayed in the
semantic segmentation worker UI.
LiDAR
A Light Detection and Ranging (LiDAR) sensor is a common type of sensor used to collect measurements
that are used to generate point cloud data. LiDAR is a remote sensing method that uses light in the form
of a pulsed laser to measure the distances of objects from the sensor. You can provide 3D point cloud
data generated from a LiDAR sensor for a Ground Truth 3D point cloud labeling job using the raw data
formats described in Accepted Raw 3D Data Formats (p. 746).
Sensor Fusion
Ground Truth 3D point cloud labeling jobs include a sensor fusion feature that supports video camera
sensor fusion for all task types. Some sensors come with multiple LiDAR devices and video cameras that
capture images and associate them with a LiDAR frame. To help annotators visually complete your tasks
with high confidence, you can use the Ground Truth sensor fusion feature to project annotations (labels)
from a 3D point cloud to 2D camera images, and vice versa, using the extrinsic matrix of the 3D scanner
(such as LiDAR) and the extrinsic and intrinsic matrices of the cameras. To learn more, see Sensor
Fusion (p. 763).
The following demonstrates how a worker would use the Ground Truth worker portal and tools to
annotate a 3D point cloud for an object detection task. For similar visual examples of other task types,
see 3D Point Cloud Task types (p. 599).
Next Steps
You can create six types of tasks when you use Ground Truth 3D point cloud labeling jobs. Use the topics
in 3D Point Cloud Task types (p. 599) to learn more about these task types and to learn how to create a
labeling job using the task type of your choice.
The 3D point cloud labeling job is different from other Ground Truth labeling modalities. Before
creating a labeling job, we recommend that you read 3D Point Cloud Labeling Jobs Overview (p. 630).
Additionally, review input data quotas in 3D Point Cloud and Video Frame Labeling Job Quotas (p. 744).
For an end-to-end demo that uses the SageMaker API and the AWS Python SDK (Boto3) to create a 3D
point cloud labeling job, see create-3D-pointcloud-labeling-job.ipynb in the SageMaker Examples
notebook tab.
Important
If you use a notebook instance created before June 5th, 2020 to run this notebook, you must
stop and restart that notebook instance for the notebook to work.
Topics
• 3D Point Cloud Task types (p. 599)
• 3D Point Cloud Labeling Jobs Overview (p. 630)
• Worker Instructions (p. 634)
3D Point Cloud Task types
• 3D point cloud object detection – Use this task type when you want workers to locate and classify
objects in a 3D point cloud by adding and fitting 3D cuboids around objects.
• 3D point cloud object tracking – Use this task type when you want workers to add and fit 3D cuboids
around objects to track their movement across a sequence of 3D point cloud frames. For example, you
can use this task type to ask workers to track the movement of vehicles across multiple point cloud
frames.
• 3D point cloud semantic segmentation – Use this task type when you want workers to create a point-
level semantic segmentation mask by painting objects in a 3D point cloud using different colors where
each color is assigned to one of the classes you specify.
• 3D point cloud adjustment task types – Each of the task types above has an associated adjustment task
type that you can use to audit and adjust annotations generated from a 3D point cloud labeling job.
Refer to the task type page of the associated type to learn how to create an adjustment labeling job
for that task.
3D Point Cloud Object Detection
Use this task type when you want workers to locate and classify objects in a 3D point cloud by adding
and fitting 3D cuboids around objects.
For this task type, the data object that workers label is a single point cloud frame. Ground Truth renders
a 3D point cloud using point cloud data you provide. You can also provide camera data to give workers
more visual information about scenes in the frame, and to help workers draw 3D cuboids around objects.
Ground Truth provides workers with tools to annotate objects with 9 degrees of freedom
(x,y,z,rx,ry,rz,l,w,h) in three dimensions in both 3D scene and projected side views (top, side, and back).
If you provide sensor fusion information (like camera data), when a worker adds a cuboid to identify an
object in the 3D point cloud, the cuboid shows up and can be modified in the 2D images. After a cuboid
has been added, all edits made to that cuboid in the 2D or 3D scene are projected into the other view.
You can create a job to adjust annotations created in a 3D point cloud object detection labeling job using
the 3D point cloud object detection adjustment task type.
If you are a new user of the Ground Truth 3D point cloud labeling modality, we recommend you review
3D Point Cloud Labeling Jobs Overview (p. 630). This labeling modality is different from other Ground
Truth task types, and this page provides an overview of important details you should be aware of when
creating a 3D point cloud labeling job.
Topics
• View the Worker Task Interface (p. 600)
• Create a 3D Point Cloud Object Detection Labeling Job (p. 604)
• Create a 3D Point Cloud Object Detection Adjustment or Verification Labeling Job (p. 605)
• Output Data Format (p. 606)
View the Worker Task Interface
Ground Truth provides workers with a web portal and tools to complete your 3D point cloud object
detection annotation tasks. When you create the labeling job, you provide the Amazon Resource Name
(ARN) for a pre-built Ground Truth worker UI in the HumanTaskUiArn parameter. When you create a
labeling job using this task type in the console, this worker UI is automatically used. You can preview
and interact with the worker UI when you create a labeling job in the console. If you are a new user, it
is recommended that you create a labeling job using the console to ensure your label attributes, point
cloud frames, and if applicable, images, appear as expected.
The following is a GIF of the 3D point cloud object detection worker task interface. If you provide camera
data for sensor fusion in the world coordinate system, images are matched up with scenes in the point
cloud frame. These images appear in the worker portal as shown in the following GIF.
Workers can navigate in the 3D scene using their keyboard and mouse. They can:
• Double click on specific objects in the point cloud to zoom into them.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the
keyboard keys W and S to zoom in and out.
Once a worker places a cuboid in the 3D scene, a side-view will appear with the three projected side
views: top, side, and back. These side-views show points in and around the placed cuboid and help
workers refine cuboid boundaries in that area. Workers can zoom in and out of each of those side-views
using their mouse.
The following video demonstrates movements around the 3D point cloud and in the side-view.
Additional view options and features are available in the View menu in the worker UI. See the worker
instruction page for a comprehensive overview of the Worker UI.
Ground Truth helps workers annotate 3D point clouds faster and more accurately using machine learning
and computer vision powered assistive labeling tools for 3D point cloud object detection tasks. The
following assistive labeling tools are available for this task type:
• Snapping – Workers can add a cuboid around an object and use a keyboard shortcut or menu option
to have Ground Truth's autofit tool snap the cuboid tightly around the object.
• Set to ground – After a worker adds a cuboid to the 3D scene, the worker can automatically snap the
cuboid to the ground. For example, the worker can use this feature to snap a cuboid to the road or
sidewalk in the scene.
• Multi-view labeling – After a worker adds a 3D cuboid to the 3D scene, a side panel displays front,
side, and top perspectives to help the worker adjust the cuboid tightly around the object. In all of
these views, the cuboid includes an arrow that indicates the orientation, or heading of the object.
When the worker adjusts the cuboid, the adjustment will appear in real time on all of the views (that is,
3D, top, side, and front).
• Sensor fusion – If you provide data for sensor fusion, workers can adjust annotations in the 3D scenes
and in 2D images, and the annotations will be projected into the other view in real time. Additionally,
workers will have the option to view the direction the camera is facing and the camera frustum.
• View options – Enables workers to easily hide or view cuboids, label text, a ground mesh, and
additional point attributes like color or intensity. Workers can also choose between perspective and
orthogonal projections.
Create a 3D Point Cloud Object Detection Labeling Job
You can create a 3D point cloud labeling job using the SageMaker console or the CreateLabelingJob
API operation. To create a labeling job for this task type you need the following:
• A single-frame input manifest file; a sketch of a manifest line follows this list. To learn how to create
this type of manifest file, see Create a Point Cloud Frame Input Manifest File (p. 748). If you are a new
user of Ground Truth 3D point cloud labeling modalities, you may also want to review Accepted Raw
3D Data Formats (p. 746).
• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk for 3D
point cloud labeling jobs. To learn how to create workforces and work teams, see Create and Manage
Workforces (p. 863).
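For reference, a single line of a single-frame input manifest might look like the following sketch; the
bucket, file name, timestamp, and format value are placeholders, and the exact schema is described in
Create a Point Cloud Frame Input Manifest File (p. 748):

{"source-ref": "s3://DOC-EXAMPLE-BUCKET/frames/frame1.txt", "source-ref-metadata": {"format": "text/xyzi", "unix-timestamp": 1566861644.759115}}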
Additionally, make sure that you have reviewed and satisfied the Assign IAM Permissions to Use Ground
Truth (p. 817).
Use one of the following sections to learn how to create a labeling job using the console or an API.
You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create
a 3D point cloud object detection labeling job in the SageMaker console. While you are creating your
labeling job, be aware of the following:
• Your input manifest file must be a single-frame manifest file. For more information, see Create a Point
Cloud Frame Input Manifest File (p. 748).
• Optionally, you can provide label category and frame attributes. Workers can assign one or more of
these attributes to annotations to provide more information about that object. For example, you might
want to use the attribute occluded to have workers identify when an object is partially obstructed.
• Automated data labeling and annotation consolidation are not supported for 3D point cloud labeling
tasks.
• 3D point cloud object detection labeling jobs can take multiple hours to complete. You can specify
a longer time limit for these labeling jobs when you select your work team (up to 7 days, or 604800
seconds).
This section covers details you need to know when you create a labeling job using the SageMaker
API operation CreateLabelingJob. This operation is available in all AWS SDKs. To see
a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.
Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:
You can create an adjustment or verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).
When you create an adjustment labeling job, your input data to the labeling job can include labels, and
yaw, pitch, and roll measurements from a previous labeling job or external source. In the adjustment job,
pitch and roll are visualized in the worker UI, but cannot be modified. Yaw is adjustable.
Ground Truth uses Tait-Bryan angles with the following intrinsic rotations to visualize yaw, pitch and roll
in the worker UI. First, rotation is applied to the vehicle according to the z-axis (yaw). Next, the rotated
vehicle is rotated according to the intrinsic y'-axis (pitch). Finally, the vehicle is rotated according to the
intrinsic x''-axis (roll).
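To reason about this rotation order outside of Ground Truth, you can reproduce it with SciPy; in the
following sketch, the uppercase 'ZYX' sequence requests intrinsic rotations, and the angle values are
placeholders:

import numpy as np
from scipy.spatial.transform import Rotation

# Placeholder yaw, pitch, and roll angles, converted to radians.
yaw, pitch, roll = np.deg2rad([30.0, 5.0, 2.0])

# Intrinsic z-y'-x'' Tait-Bryan rotation: yaw about z, then pitch about
# the rotated y-axis, then roll about the twice-rotated x-axis.
rotation = Rotation.from_euler('ZYX', [yaw, pitch, roll])
print(rotation.as_matrix())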
When you create a 3D point cloud object detection labeling job, tasks are sent to workers. When these
workers complete their tasks, labels are written to the Amazon S3 bucket you specified when you created
the labeling job. The output data format determines what you see in your Amazon S3 bucket when your
labeling job status (LabelingJobStatus) is Completed.
If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D point cloud object detection output data format, see 3D Point
Cloud Object Detection Output (p. 794).
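To check the job status and find the output location programmatically, you can use a sketch like the
following; the job name is a placeholder:

import boto3

sagemaker = boto3.client('sagemaker', region_name='us-east-1')

response = sagemaker.describe_labeling_job(
    LabelingJobName='example-3d-od-labeling-job'
)
print(response['LabelingJobStatus'])  # For example, 'InProgress' or 'Completed'

# When the job is complete, the output manifest location is returned here.
if response['LabelingJobStatus'] == 'Completed':
    print(response['LabelingJobOutput']['OutputDatasetS3Uri'])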
Use this task type when you want workers to add and fit 3D cuboids around objects to track their
movement across 3D point cloud frames. For example, you can use this task type to ask workers to track
the movement of vehicles across multiple point cloud frames.
For this task type, the data object that workers label is a sequence of point cloud frames. A sequence
is defined as a temporal series of point cloud frames. Ground Truth renders a series of 3D point cloud
visualizations using a sequence you provide and workers can switch between these 3D point cloud
frames in the worker task interface.
Ground Truth provides workers with tools to annotate objects with 9 degrees of freedom
(x, y, z, rx, ry, rz, l, w, h) in both the 3D scene and the projected side views (top, side, and back).
When a worker draws a cuboid around an object, that cuboid is given a unique ID, for example Car:1 for
one car in the sequence and Car:2 for another. Workers use that ID to label the same object in multiple
frames.
You can also provide camera data to give workers more visual information about scenes in the frame,
and to help workers draw 3D cuboids around objects. When a worker adds a 3D cuboid to identify an
object in either the 2D image or the 3D point cloud, the cuboid shows up in the other view.
You can adjust annotations created in a 3D point cloud object detection labeling job using the 3D point
cloud object tracking adjustment task type.
If you are a new user of the Ground Truth 3D point cloud labeling modality, we recommend you review
3D Point Cloud Labeling Jobs Overview (p. 630). This labeling modality is different from other Ground
Truth task types, and this page provides an overview of important details you should be aware of when
creating a 3D point cloud labeling job.
Topics
• View the Worker Task Interface (p. 606)
• Create a 3D Point Cloud Object Tracking Labeling Job (p. 614)
• Create a 3D Point Cloud Object Tracking Adjustment or Verification Labeling Job (p. 615)
• Output Data Format (p. 615)
Ground Truth provides workers with a web portal and tools to complete your 3D point cloud object
tracking annotation tasks. When you create the labeling job, you provide the Amazon Resource Name
(ARN) for a pre-built Ground Truth UI in the HumanTaskUiArn parameter. When you create a labeling
job using this task type in the console, this UI is automatically used. You can preview and interact with
the worker UI when you create a labeling job in the console. If you are a new user, it is recommended that
you create a labeling job using the console to ensure your label attributes, point cloud frames, and,
if applicable, images appear as expected.
The following is a GIF of the 3D point cloud object tracking worker task interface and demonstrates how
the worker can navigate the point cloud frames in the sequence. The annotating tools are a part of the
worker task interface. They are not available for the preview interface.
After workers add a single cuboid, that cuboid is replicated in all frames of the sequence with the same
ID. When workers adjust the cuboid in another frame, Ground Truth interpolates the movement of that
object and adjusts all cuboids between the manually adjusted frames. The following GIF demonstrates
this interpolation feature. In the navigation bar on the bottom-left, red areas indicate manually adjusted
frames.
If you provide camera data for sensor fusion, images are matched up with scenes in point cloud frames.
These images appear in the worker portal as shown in the following GIF.
Workers can navigate the 3D scene using their keyboard and mouse. They can:
• Double click on specific objects in the point cloud to zoom into them.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use both keyboard arrow keys and Q, E, A, and D keys to move Up, Down, Left, Right. Use keyboard
keys W and S to zoom in and out.
Once a worker places a cuboid in the 3D scene, a side-view will appear with the three projected side
views: top, side, and back. These side-views show points in and around the placed cuboid and help
workers refine cuboid boundaries in that area. Workers can zoom in and out of each of those side-views
using their mouse.
The following video demonstrates movements around the 3D point cloud and in the side-view.
Additional view options and features are available. See the worker instruction page for a comprehensive
overview of the Worker UI.
Worker Tools
Workers can navigate through the 3D point cloud by zooming in and out, and moving in all directions
around the cloud using the mouse and keyboard shortcuts. If workers click on a point in the point cloud,
the UI will automatically zoom into that area. Workers can use various tools to draw 3D cuboids around
objects. For more information, see Assistive Labeling Tools.
After workers have placed a 3D cuboid in the point cloud, they can adjust these cuboids to fit tightly
around cars using a variety of views: directly in the 3D point cloud, in a side-view featuring three zoomed-in
perspectives of the point cloud around the box, and if you include images for sensor fusion, directly in
the 2D image.
View options enable workers to easily hide or view label text, a ground mesh, and additional point
attributes. Workers can also choose between perspective and orthogonal projections.
Ground Truth helps workers annotate 3D point clouds faster and more accurately with assistive labeling
tools powered by machine learning and computer vision for 3D point cloud object tracking tasks.
The following assistive labeling tools are available for this task type:
• Label autofill – When a worker adds a cuboid to a frame, a cuboid with the same dimensions and
orientation is automatically added to all frames in the sequence.
• Label interpolation – After a worker has labeled a single object in two frames, Ground Truth uses
those annotations to interpolate the movement of that object between those two frames. Label
interpolation can be turned on and off.
• Bulk label and attribute management – Workers can add, delete, and rename annotations, label
category attributes, and frame attributes in bulk.
• Workers can manually delete annotations for a given object before or after a frame. For example,
a worker can delete all labels for an object after frame 10 if that object is no longer located in the
scene after that frame.
• If a worker accidentally bulk deletes all annotations for an object, they can add them back. For
example, if a worker deletes all annotations for an object before frame 100, they can bulk add them
to those frames.
• Workers can rename a label in one frame and all 3D cuboids assigned that label are updated with
the new name across all frames.
• Workers can use bulk editing to add or edit label category attributes and frame attributes in
multiple frames.
• Snapping – Workers can add a cuboid around an object and use a keyboard shortcut or menu option
to have Ground Truth's autofit tool snap the cuboid tightly around the object's boundaries.
• Fit to ground – After a worker adds a cuboid to the 3D scene, the worker can automatically snap the
cuboid to the ground. For example, the worker can use this feature to snap a cuboid to the road or
sidewalk in the scene.
• Multi-view labeling – After a worker adds a 3D cuboid to the 3D scene, a side panel displays front
and two side perspectives to help the worker adjust the cuboid tightly around the object. Workers can
annotate in the 3D point cloud or the side panel, and the adjustments appear in the other views in real
time.
• Sensor fusion – If you provide data for sensor fusion, workers can adjust annotations in the 3D scenes
and in 2D images, and the annotations will be projected into the other view in real time.
• Auto-merge cuboids – Workers can automatically merge two cuboids across all frames if they
determine that cuboids with different labels actually represent a single object.
• View options – Enables workers to easily hide or view label text, a ground mesh, and additional
point attributes like color or intensity. Workers can also choose between perspective and orthogonal
projections.
You can create a 3D point cloud labeling job using the SageMaker console or the API operation
CreateLabelingJob. To create a labeling job for this task type, you need the following:
• A sequence input manifest file. To learn how to create this type of manifest file, see Create a Point
Cloud Sequence Input Manifest (p. 754). If you are a new user of Ground Truth 3D point cloud
labeling modalities, we recommend that you review Accepted Raw 3D Data Formats (p. 746).
• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk for 3D
point cloud labeling jobs. To learn how to create workforces and work teams, see Create and Manage
Workforces (p. 863).
Additionally, make sure that you have reviewed and satisfied the requirements described in Assign IAM
Permissions to Use Ground Truth (p. 817).
To learn how to create a labeling job using the console or an API, see the following sections.
This section covers details you need to know when you create a labeling job using the SageMaker
API operation CreateLabelingJob. This operation is supported by all AWS SDKs. To see
a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.
Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:
• 3D point cloud object tracking labeling jobs can take multiple hours to complete. You can specify a
longer time limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or 604,800
seconds).
• Your input manifest file must be a sequence manifest file. For more information, see Create a Point
Cloud Sequence Input Manifest (p. 754).
• Optionally, you can provide label category attributes. Workers can assign one or more of these
attributes to annotations to provide more information about that object. For example, you might want
to use the attribute occluded to have workers identify when an object is partially obstructed.
• Automated data labeling and annotation consolidation are not supported for 3D point cloud labeling
tasks.
When you create an adjustment labeling job, your input data can include labels and yaw, pitch, and roll
measurements from a previous labeling job or external source. In the adjustment job, pitch and roll are
visualized in the worker UI but cannot be modified. Yaw is adjustable.
Ground Truth uses Tait-Bryan angles with the following intrinsic rotations to visualize yaw, pitch and roll
in the worker UI. First, rotation is applied to the vehicle according to the z-axis (yaw). Next, the rotated
vehicle is rotated according to the intrinsic y'-axis (pitch). Finally, the vehicle is rotated according to the
intrinsic x''-axis (roll).
If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D point cloud object tracking output data format, see 3D Point
Cloud Object Tracking Output (p. 796).
For this task type, the data object that workers label is a single point cloud frame. Ground Truth
generates a 3D point cloud visualization using point cloud data you provide. You can also provide camera
data to give workers more visual information about scenes in the frame, and to help workers paint
objects. When a worker paints an object in either the 2D image or the 3D point cloud, the paint shows up
in the other view.
You can adjust annotations created in a 3D point cloud semantic segmentation labeling job using the 3D
point cloud semantic segmentation adjustment task type.
If you are a new user of the Ground Truth 3D point cloud labeling modality, we recommend you review
3D Point Cloud Labeling Jobs Overview (p. 630). This labeling modality is different from other Ground
Truth task types, and this topic provides an overview of important details you should be aware of when
creating a 3D point cloud labeling job.
Topics
• View the Worker Task Interface (p. 616)
• Create a 3D Point Cloud Semantic Segmentation Labeling Job (p. 622)
• Create a 3D Point Cloud Semantic Segmentation Adjustment or Verification Labeling Job (p. 623)
• Output Data Format (p. 623)
Ground Truth provides workers with a web portal and tools to complete your 3D point cloud semantic
segmentation annotation tasks. When you create the labeling job, you provide the Amazon Resource
Name (ARN) for a pre-built Ground Truth UI in the HumanTaskUiArn parameter. When you create
a labeling job using this task type in the console, this UI is automatically used. You can preview and
interact with the worker UI when you create a labeling job in the console. If you are a new user, it is
recommended that you create a labeling job using the console to ensure your label attributes, point
cloud frames, and, if applicable, images appear as expected.
The following is a GIF of the 3D point cloud semantic segmentation worker task interface. If you provide
camera data for sensor fusion, images are matched with scenes in the point cloud frame. Workers can
paint objects in either the 3D point cloud or the 2D image, and the paint appears in the corresponding
location in the other medium. These images appear in the worker portal as shown in the following GIF.
Workers can navigate the 3D scene using their keyboard and mouse. They can:
• Double click on specific objects in the point cloud to zoom into them.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use both keyboard arrow keys and Q, E, A, and D keys to move Up, Down, Left, Right. Use keyboard
keys W and S to zoom in and out.
The following video demonstrates movements around the 3D point cloud. Workers can hide and re-
expand all side views and menus. In this GIF, the side-views and menus have been collapsed.
The following GIF demonstrates how a worker can label multiple objects quickly, refine painted objects
using the Unpaint option and then view only points that have been painted.
Additional view options and features are available. See the worker instruction page for a comprehensive
overview of the Worker UI.
Worker Tools
Workers can navigate through the 3D point cloud by zooming in and out, and moving in all directions
around the cloud using the mouse and keyboard shortcuts. When you create a semantic segmentation
job, workers have the following tools available to them:
• A paint brush to paint and unpaint objects. Workers paint objects by selecting a label category and
then painting in the 3D point cloud. Workers unpaint objects by selecting the Unpaint option from the
label category menu and using the paint brush to erase paint.
• A polygon tool that workers can use to select and paint an area in the point cloud.
• A background paint tool, which enables workers to paint behind objects they have already annotated
without altering the original annotations. For example, workers might use this tool to paint the road
after painting all of the cars on the road.
• View options that enable workers to easily hide or view label text, a ground mesh, and additional
point attributes like color or intensity. Workers can also choose between perspective and orthogonal
projections.
You can create a 3D point cloud labeling job using the SageMaker console or the API operation
CreateLabelingJob. To create a labeling job for this task type, you need the following:
• A single-frame input manifest file. To learn how to create this type of manifest file, see Create a Point
Cloud Frame Input Manifest File (p. 748). If you are a new user of Ground Truth 3D point cloud
labeling modalities, we recommend that you review Accepted Raw 3D Data Formats (p. 746).
• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk workers
for 3D point cloud labeling jobs. To learn how to create workforces and work teams, see Create and
Manage Workforces (p. 863).
• A label category configuration file. For more information, see Create a Labeling Category
Configuration File with Label Category and Frame Attributes (p. 719).
Additionally, make sure that you have reviewed and satisfied the requirements described in Assign IAM
Permissions to Use Ground Truth (p. 817).
Use one of the following sections to learn how to create a labeling job using the console or an API.
Follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create
a 3D point cloud semantic segmentation labeling job in the SageMaker console. While you are creating
your labeling job, be aware of the following:
• Your input manifest file must be a single-frame manifest file. For more information, see Create a Point
Cloud Frame Input Manifest File (p. 748).
• Automated data labeling and annotation consolidation are not supported for 3D point cloud labeling
tasks.
• 3D point cloud semantic segmentation labeling jobs can take multiple hours to complete. You can
specify a longer time limit for these labeling jobs when you select your work team (up to 7 days, or
604,800 seconds).
This section covers details you need to know when you create a labeling job using the SageMaker
API operation CreateLabelingJob. This operation is supported by all AWS SDKs. To see
a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.
The page Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob
operation. Follow those instructions, and keep the details in this section in mind while you configure
your request.
You can create an adjustment or verification labeling job using the Ground Truth console or the
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).
When you create a 3D point cloud semantic segmentation labeling job, tasks are sent to workers. When
these workers complete their tasks, their annotations are written to the Amazon S3 bucket you specified
when you created the labeling job. The output data format determines what you see in your Amazon S3
bucket when your labeling job status (LabelingJobStatus) is Completed.
If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D point cloud semantic segmentation output data format, see
3D Point Cloud Semantic Segmentation Output (p. 792).
Ground Truth provides workers with tools to annotate cuboids in a 3D point cloud and bounding boxes
in up to 8 cameras using the same annotation UI. Workers can also link various bounding boxes for
the same object across different cameras. For example, a bounding box in camera1 can be linked to a
bounding box in camera2. This lets you correlate an object across multiple cameras using a unique ID.
Note
Currently, SageMaker does not support creating a 3D-2D linking job using the console. To create
a 3D-2D linking job using the SageMaker API, see Create a Labeling Job (API) (p. 629).
Topics
• View the Worker Task Interface (p. 624)
• Input Data Format (p. 628)
• Create a 3D-2D Point Cloud Object Tracking Labeling Job (p. 629)
• Output Data (p. 630)
Ground Truth provides workers with a web portal and tools to complete your 3D-2D object tracking
annotation tasks. When you create the labeling job, you provide the Amazon Resource Name (ARN) for a
pre-built Ground Truth UI in the HumanTaskUiArn parameter. To use the UI when you create a labeling
job for this task type using the API, you need to provide the HumanTaskUiArn. You can preview and
interact with the worker UI when you create a labeling job through the API. The annotating tools are a
part of the worker task interface. They are not available for the preview interface. The following image
demonstrates the worker task interface used for the 3D-2D point cloud object tracking annotation task.
Interpolation is enabled by default. After a worker adds a single cuboid, that cuboid is replicated in
all frames of the sequence with the same ID. If the worker adjusts the cuboid in another frame, Ground
Truth interpolates the movement of that object and adjusts all cuboids between the manually adjusted
frames. Additionally, using the camera view section, a cuboid can be shown with a projection (using the B
button for "toggle labels" in the camera view) that provides the worker with a reference from the camera
images. The accuracy of the cuboid-to-image projection depends on the accuracy of the calibrations
captured in the extrinsic and intrinsic data.
If you provide camera data for sensor fusion, images are matched up with scenes in point cloud frames.
The camera data should be time-synchronized with the point cloud data to ensure an accurate mapping
between the point cloud and the imagery for each frame in the sequence, as shown in the following
image. The manifest file holds the extrinsic and intrinsic calibration data and the pose, which allow the
cuboid projection to be shown on the camera image using the P button.
Workers can navigate the 3D scene using their keyboard and mouse. They can:
• Double click on specific objects in the point cloud to zoom into them.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use both keyboard arrow keys and Q, E, A, and D keys to move Up, Down, Left, Right. Use keyboard
keys W and S to zoom in and out.
Once a worker places a cuboid in the 3D scene, a side-view appears with the three projected side views:
top, side, and front. These side-views show points in and around the placed cuboid and help workers
refine cuboid boundaries in that area. Workers can zoom in and out of each of those side-views using
their mouse.
The worker first selects the cuboid, and then draws a corresponding bounding box on any of the camera
views. This links the cuboid and the bounding box with a common name and unique ID.
The worker can also first draw a bounding box, select it, and then draw the corresponding cuboid to link them.
Additional view options and features are available. See the worker instruction page for a comprehensive
overview of the Worker UI.
Worker Tools
Workers can navigate through the 3D point cloud by zooming in and out, and moving in all directions
around the cloud using the mouse and keyboard shortcuts. If workers click on a point in the point cloud,
the UI automatically zooms into that area. Workers can use various tools to draw 3D cuboids around
objects. For more information, see Assistive Labeling Tools in the following discussion.
After workers have placed a 3D cuboid in the point cloud, they can adjust these cuboids to fit tightly
around cars using a variety of views: directly in the 3D point cloud, in a side-view featuring three
zoomed-in perspectives of the point cloud around the box, and if you include images for sensor fusion,
directly in the 2D image.
Additional view options enable workers to easily hide or view label text, a ground mesh, and additional
point attributes. Workers can also choose between perspective and orthogonal projections.
Ground Truth helps workers annotate 3D point clouds faster and more accurately using UX, machine
learning and computer vision powered assistive labeling tools for 3D point cloud object tracking tasks.
The following assistive labeling tools are available for this task type:
• Label autofill – When a worker adds a cuboid to a frame, a cuboid with the same dimensions,
orientation, and xyz position is automatically added to all frames in the sequence.
• Label interpolation – After a worker has labeled a single object in two frames, Ground Truth
uses those annotations to interpolate the movement of that object between all the frames. Label
interpolation can be turned on and off. It is on by default. For example, if a worker working with 5
frames adds a cuboid in frame 2, it is copied to all 5 frames. If the worker then makes adjustments
in frame 4, frames 2 and 4 act as two points through which a line is fit. The cuboid is then
interpolated in frames 1, 3, and 5.
• Bulk label and attribute management – Workers can add, delete, and rename annotations, label
category attributes, and frame attributes in bulk.
• Workers can manually delete annotations for a given object before and after a frame, or in all
frames. For example, a worker can delete all labels for an object after frame 10 if that object is no
longer located in the scene after that frame.
• If a worker accidentally bulk deletes all annotations for an object, they can add them back. For
example, if a worker deletes all annotations for an object before frame 100, they can bulk add them
to those frames.
• Workers can rename a label in one frame and all 3D cuboids assigned that label are updated with
the new name across all frames.
• Workers can use bulk editing to add or edit label category attributes and frame attributes in
multiple frames.
• Snapping – Workers can add a cuboid around an object and use a keyboard shortcut or menu option
to have Ground Truth's autofit tool snap the cuboid tightly around the object's boundaries.
• Fit to ground – After a worker adds a cuboid to the 3D scene, the worker can automatically snap the
cuboid to the ground. For example, the worker can use this feature to snap a cuboid to the road or
sidewalk in the scene.
• Multi-view labeling – After a worker adds a 3D cuboid to the 3D scene, a side panel displays front
and two side perspectives to help the worker adjust the cuboid tightly around the object. Workers can
annotate in the 3D point cloud or the side panel, and the adjustments appear in the other views in real
time.
• Sensor fusion – If you provide data for sensor fusion, workers can adjust annotations in the 3D scenes
and in 2D images, and the annotations are projected into the other view in real time. To learn more
about the data for sensor fusion, see Understand Coordinate Systems and Sensor Fusion.
• Auto-merge cuboids – Workers can automatically merge two cuboids across all frames if they
determine that cuboids with different labels actually represent a single object.
• View options – Enables workers to easily hide or view label text, a ground mesh, and additional
point attributes like color or intensity. Workers can also choose between perspective and orthogonal
projections.
You can create a 3D-2D object tracking job using the SageMaker API operation CreateLabelingJob.
To create a labeling job for this task type, you need the following:
• A sequence input manifest file. To learn how to create this type of manifest file, see Create a Point
Cloud Sequence Input Manifest (p. 754). If you are a new user of Ground Truth 3D point cloud
labeling modalities, we recommend that you review Accepted Raw 3D Data Formats (p. 746).
• You specify your labels, label category and frame attributes, and worker instructions in a label
category configuration file. To learn how to create this file, see Create a Labeling Category
Configuration File with Label Category and Frame Attributes. The following is an example of a label
category configuration file for creating a 3D-2D object tracking job.
{
    "document-version": "2020-03-01",
    "categoryGlobalAttributes": [
        {
            "name": "Occlusion",
            "description": "global attribute that applies to all label categories",
            "type": "string",
            "enum": [
                "Partial",
                "Full"
            ]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "attributes": [
                {
                    "name": "Type",
                    "type": "string",
                    "enum": [
                        "SUV",
                        "Sedan"
                    ]
                }
            ]
        },
        {
            "label": "Bus",
            "attributes": [
                {
                    "name": "Size",
                    "type": "string",
                    "enum": [
                        "Large",
                        "Medium",
                        "Small"
                    ]
                }
            ]
        }
    ],
    "instructions": {
        "shortIntroduction": "Draw a tight cuboid around objects after you select a category.",
        "fullIntroduction": "<p>Use this area to add more detailed worker instructions.</p>"
    },
    "annotationType": [
        {
            "type": "BoundingBox"
        },
        {
            "type": "Cuboid"
        }
    ]
}
Note
You need to provide BoundingBox and Cuboid as annotationType in the label category
configuration file to create a 3D-2D object tracking job.
• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk for 3D
point cloud labeling jobs. To learn how to create workforces and work teams, see Create and Manage
Workforces (p. 863).
• A CORS policy added to the S3 bucket that contains your input data. To set the required CORS
headers on the S3 bucket that contains your input images in the Amazon S3 console, follow the
directions detailed in CORS Permission Requirement.
• Additionally, make sure that you have reviewed and satisfied the requirements described in Assign
IAM Permissions to Use Ground Truth (p. 817).
To learn how to create a labeling job using the API, see the following sections.
Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow those instructions, and keep the details in this section in mind while you configure your request.
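For example, you might upload the label category configuration file shown earlier and reference it in the LabelCategoryConfigS3Uri parameter of your request; the bucket and key in this sketch are hypothetical.

import boto3

s3 = boto3.client("s3")

# Upload the label category configuration file (bucket and key are placeholders).
s3.upload_file(
    Filename="label-categories.json",
    Bucket="DOC-EXAMPLE-BUCKET",
    Key="config/label-categories.json",
)

# Reference this URI in the LabelCategoryConfigS3Uri parameter of CreateLabelingJob.
label_category_config_s3_uri = "s3://DOC-EXAMPLE-BUCKET/config/label-categories.json"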
Note
After you have successfully created a 3D-2D object tracking job, it shows up on the console
under labeling jobs. The task type for the job is displayed as Point Cloud Object Tracking.
Output Data
When you create a 3D-2D object tracking labeling job, tasks are sent to workers. When these workers
complete their tasks, their annotations are written to the Amazon S3 bucket you specified when you
created the labeling job. The output data format determines what you see in your Amazon S3 bucket
when your labeling job status (LabelingJobStatus) is Completed.
If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D-2D point cloud object tracking output data format, see 3D-2D
Point Cloud Object Tracking Output (p. 799).
Ground Truth uses two types of input manifest files for 3D point cloud labeling jobs:
• A frame input manifest file that has a single point cloud frame on each line.
• A sequence input manifest file that has a single sequence on each line. A sequence is defined as a
temporal series of point cloud frames.
For both types of manifest files, job pre-processing time (that is, the time before Ground Truth starts
sending tasks to your workers) depends on the total number and size of point cloud frames you provide
in your input manifest file. For frame input manifest files, this is the number of lines in your manifest
file. For sequence manifest files, this is the number of frames in each sequence multiplied by the total
number of sequences, or lines, in your manifest file.
Additionally, the number of points per point cloud and the number of fused sensor data objects (like
images) factor into job pre-processing times. On average, Ground Truth can pre-process 200 point cloud
frames in approximately 5 minutes. If you create a 3D point cloud labeling job with a large number of
point cloud frames, you might experience longer job pre-processing times. For example, if you create a
sequence input manifest file with 4 point cloud sequences, and each sequence contains 200 point clouds,
Ground Truth pre-processes 800 point clouds and so your job pre-processing time might be around 20
minutes. During this time, your labeling job status is InProgress.
While your 3D point cloud labeling job is pre-processing, you receive CloudWatch
messages notifying you of the status of your job. To identify these messages, search for
3D_POINT_CLOUD_PROCESSING_STATUS in your labeling job logs.
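One way to search for these messages, sketched with boto3 under the assumption that your labeling job logs to the standard /aws/sagemaker/LabelingJobs log group:

import boto3

logs = boto3.client("logs")

# Search labeling job logs for 3D point cloud pre-processing status messages.
response = logs.filter_log_events(
    logGroupName="/aws/sagemaker/LabelingJobs",
    filterPattern="3D_POINT_CLOUD_PROCESSING_STATUS",
)
for event in response["events"]:
    print(event["message"])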
For frame input manifest files, your CloudWatch logs will have a message similar to the following:
{
    "labeling-job-name": "example-point-cloud-labeling-job",
    "event-name": "3D_POINT_CLOUD_PROCESSING_STATUS",
    "event-log-message": "datasetObjectId from: 0 to 10, status: IN_PROGRESS"
}
The event log message, datasetObjectId from: 0 to 10, status: IN_PROGRESS identifies the
number of frames from your input manifest that have been processed. You receive a new message every
time a frame has been processed. For example, after a single frame has processed, you receive another
message that says datasetObjectId from: 1 to 10, status: IN_PROGRESS.
For sequence input manifest files, your CloudWatch logs will have a message similar to the following:
{
    "labeling-job-name": "example-point-cloud-labeling-job",
    "event-name": "3D_POINT_CLOUD_PROCESSING_STATUS",
    "event-log-message": "datasetObjectId: 0, status: IN_PROGRESS"
}
The event log message, datasetObjectId: 0, status: IN_PROGRESS, identifies the number
of sequences from your input manifest that have been processed. You receive a new message every
time a sequence has been processed. For example, after a single sequence has processed, you receive
a message that says datasetObjectId: 1, status: IN_PROGRESS as the next sequence
begins processing.
It is strongly recommended that you create tasks that workers can complete within 12 hours. Workers
must keep the worker UI open while working on a task. They can save work as they go and Ground Truth
will save their work every 15 minutes.
When using the SageMaker CreateLabelingJob API operation, set the total time a task is available to
workers in the TaskTimeLimitInSeconds parameter of HumanTaskConfig.
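For example, a HumanTaskConfig fragment that sets the maximum time limit might look like the following sketch (other required fields are omitted):

human_task_config = {
    # ... other required HumanTaskConfig fields ...
    "TaskTimeLimitInSeconds": 604800,  # 7 days, the maximum for 3D point cloud tasks
}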
When you create a labeling job in the console, you can specify this time limit when you select your
workforce type and your work team.
Workforces
When you create a 3D point cloud labeling job, you need to specify a work team that will complete
your point cloud annotation tasks. You can choose a work team from a private workforce of your own
workers, or from a vendor workforce that you select in the AWS Marketplace. You cannot use the Amazon
Mechanical Turk workforce for 3D point cloud labeling jobs.
To learn more about vendor workforce, see Managing Vendor Workforces (p. 867).
To learn how to create and manage a private workforce, see Use a Private Workforce (p. 868).
You can preview the worker UI when you create a labeling job in the console.
When you create a labeling job using the API operation CreateLabelingJob, you must provide an ARN
provided by Ground Truth in the parameter HumanTaskUiArn to specify the worker UI for your task
type. You can use HumanTaskUiArn with the SageMaker RenderUiTemplate API operation to preview
the worker UI.
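A minimal boto3 preview sketch follows; the ARNs are hypothetical placeholders, and the shape of the Task input depends on your task type and manifest, so treat this only as an outline.

import boto3

sagemaker = boto3.client("sagemaker")

# Render the pre-built worker UI with sample task input (all values are placeholders).
response = sagemaker.render_ui_template(
    HumanTaskUiArn="arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/PointCloudObjectTracking",
    Task={"Input": "{\"sample\": \"task input matching your manifest format\"}"},
    RoleArn="arn:aws:iam::111122223333:role/GroundTruthExecutionRole",
)

print(response["RenderedContent"][:500])  # First part of the rendered HTML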
You provide worker instructions, labels, and optionally, label category attributes that are displayed in the
worker UI.
• Label category attribute – A list of options (strings), a free-form text box, or a numeric field associated
with one or more labels. It is used by workers to provide metadata about a label.
• Frame attribute – A list of options (strings), a free-form text box, or a numeric field that appears on
each point cloud frame a worker is sent to annotate. It is used by workers to provide metadata about
frames.
Additionally, you can use label and frame attributes to have workers verify labels in a 3D point cloud
label verification job.
Use the following sections to learn more about these attributes. To learn how to add label category and
frame attributes to a labeling job, use the Create Labeling Job section on the task type page of your
choice.
For example, if you add the label category car, you might also want to capture additional data about
your labeled cars, such as whether they are occluded or their size. You can capture this metadata using
label category attributes. In this example, if you added the attribute occluded to the car label category,
you could assign the values partial, completely, and no to the occluded attribute and enable workers to
select one of these options.
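In the label category configuration file, that example might be declared with a fragment like the following sketch; see Create a Labeling Category Configuration File with Label Category and Frame Attributes (p. 719) for the authoritative schema.

{
    "document-version": "2020-03-01",
    "labels": [
        {
            "label": "car",
            "attributes": [
                {
                    "name": "occluded",
                    "type": "string",
                    "enum": ["partial", "completely", "no"]
                }
            ]
        }
    ]
}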
When you create a label verification job, you add label category attributes to each label you want
workers to verify.
Frame Attributes
Add frame attributes to give workers the ability to provide more information about individual point
cloud frames. You can specify up to 10 frame attributes, and these attributes will appear on all frames.
For example, you can add a frame attribute that allows workers to enter a number. You may want to use
this attribute to have workers identify the number of objects they see in a particular frame.
In another example, you may want to provide a free-form text box to give workers the ability to provide
a free-form answer to a question.
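Both of those frame attributes might be declared with a fragment like the following sketch (the attribute names are hypothetical); again, see the label category configuration file documentation for the authoritative schema.

{
    "frameAttributes": [
        {
            "name": "object count",
            "description": "How many objects do you see in this frame?",
            "type": "number"
        },
        {
            "name": "comments",
            "description": "Describe anything unusual about this frame.",
            "type": "string"
        }
    ]
}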
When you create a label verification job, you can add one or more frame attributes to ask workers to
provide feedback on all labels in a point cloud frame.
Worker Instructions
You can provide worker instructions to help your workers complete your point cloud labeling tasks.
You can add your worker instructions using the SageMaker console while creating a labeling job. If you
create a labeling job using the API operation CreateLabelingJob, you specify worker instructions in
your label category configuration file.
In addition to your instructions, Ground Truth provides a link to help workers navigate and use the
worker portal. View these instructions by selecting the task type on Worker Instructions (p. 634).
Declining Tasks
Workers are able to decline tasks.
Workers decline a task if the instructions are not clear, input data is not displaying correctly, or
if they encounter some other issue with the task. If the number of workers per dataset object
(NumberOfHumanWorkersPerDataObject) all decline the task, the data object is marked as expired
and is not sent to additional workers.
The following examples show a CORS configuration, in JSON and XML formats, that you can add to the
S3 bucket that contains your input data.
JSON
[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "HEAD",
            "PUT"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [
            "Access-Control-Allow-Origin"
        ],
        "MaxAgeSeconds": 3000
    }
]
XML
<CORSConfiguration>
 <CORSRule>
  <AllowedOrigin>*</AllowedOrigin>
  <AllowedMethod>GET</AllowedMethod>
  <AllowedMethod>HEAD</AllowedMethod>
  <AllowedMethod>PUT</AllowedMethod>
  <MaxAgeSeconds>3000</MaxAgeSeconds>
  <ExposeHeader>Access-Control-Allow-Origin</ExposeHeader>
  <AllowedHeader>*</AllowedHeader>
 </CORSRule>
</CORSConfiguration>
To learn how to add a CORS policy to an S3 bucket, see How do I add cross-domain resource sharing with
CORS? in the Amazon Simple Storage Service User Guide.
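You can also attach the policy programmatically. The following boto3 sketch applies the JSON rules shown above to a hypothetical bucket.

import boto3

s3 = boto3.client("s3")

# Apply the CORS rules to the bucket that contains your input data (name is a placeholder).
s3.put_bucket_cors(
    Bucket="DOC-EXAMPLE-BUCKET",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "HEAD", "PUT"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)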
Worker Instructions
This topic provides an overview of the Ground Truth worker portal and the tools available to complete
your 3D Point Cloud labeling task. First, select the type of task you are working on from Topics.
For adjustment jobs, select the original labeling job task type that produced the labels you are adjusting.
Review and adjust the labels in your task as needed.
Important
It is recommended that you complete your task using a Google Chrome or Firefox web browser.
Topics
• 3D Point Cloud Semantic Segmentation (p. 634)
• 3D Point Cloud Object Detection (p. 643)
• 3D Point Cloud Object Tracking (p. 653)
Topics
• Your Task (p. 634)
• Navigate the UI (p. 639)
• Icon Guide (p. 641)
• Shortcuts (p. 642)
• Release, Stop and Resume, and Decline Tasks (p. 642)
• Saving Your Work and Submitting (p. 643)
Your Task
When you work on a 3D point cloud semantic segmentation task, you need to select a category from the
Annotations menu on the right side of your worker portal using the Label Categories drop-down menu.
After you've selected a category, use the paint brush and polygon tools to paint each object in the 3D
point cloud that this category applies to. For example, if you select the category Car, you would use
these tools to paint all of the cars in the point cloud. The following video demonstrates how to use the
paint brush tool to paint an object.
If you see one or more images in your worker portal, you can paint in the images or paint in the 3D point
cloud and the paint will show up in the other medium.
You may see frame attributes under the Labels menu. Use these attribute prompts to enter additional
information about the point cloud.
Important
If you see that objects have already been painted when you open the task, adjust those
annotations.
The following video includes an image that can be annotated. You may not see an image in your task.
After you've painted one or more objects using a label category, you can select that category from the
Label Category menu on the right to only view points painted for that category.
Navigate the UI
You can navigate in the 3D scene using your keyboard and mouse. You can:
• Double click on specific objects in the point cloud to zoom into them.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use both keyboard arrow keys and Q, E, A, and D keys to move Up, Down, Left, Right. Use keyboard
keys W and S to zoom in and out.
The following video demonstrates movements around the 3D point cloud and in the side-view. You can
hide and re-expand all side views using the full screen icon. In this GIF, the side-views and menus have
been collapsed.
When you are in the worker UI, you see several menus and icon-based tools, described in the following
sections.
When you open a task, the move scene icon is on, and you can move around the point cloud using your
mouse and the navigation buttons in the point cloud area of the screen. To return to the original view
you see when you first opened the task, choose the reset scene icon.
After you select the paint icon, you can add paint to the point cloud and images (if included). You must
select the move scene icon again to move to another area in the 3D point cloud or image.
To collapse all panels on the right and make the 3D point cloud full screen, select the full screen icon.
Additional view options are available for the camera images and side panels.
Icon Guide
Use this table to learn about the icons available in your worker task portal.
brush – Choose this icon to turn on the brush tool. To use this tool, choose and move over the objects
that you want to paint with your mouse. After you choose it, everything you paint will be associated
with the category you chose.
polygon – Choose this icon to use the polygon paint tool. Use this tool to draw polygons around objects
that you want to paint. After you choose it, everything you draw a polygon around will be associated
with the category you have chosen.
reset scene – Choose this icon to reset the view of the point cloud, side panels, and, if applicable, all
images to their original position when the task was first opened.
move scene – Choose this icon to move the scene. By default, this icon will be selected when you first
start a task.
full screen – Choose this icon to make the 3D point cloud visualization full screen, and to collapse all
side panels.
measure distance – When you select this icon, you can place the starting point (first marker) anywhere
in the point cloud by selecting it with your mouse. The tool will automatically use interpolation to place
a marker on the closest point within threshold distance to the location you select; otherwise, the marker
will be placed on the ground. If you place a starting point by mistake, you can use the Escape key to
revert marker placement.
After you place the first marker, you see a dotted line and a dynamic label that indicates the distance
you have moved away from the first marker. Click somewhere else on the point cloud to place a second
marker. When you place the second marker, the dotted line becomes solid, and the distance is set.
Shortcuts
The shortcuts listed in the Shortcuts menu can help you navigate the 3D point cloud and use the paint
tool.
Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands.
When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:
• Decline task: You should only decline a task if something is wrong with the task, such as an issue with
the 3D point cloud, images or the UI. If you decline a task, you will not be able to return to the task.
• Release Task: If you release a task, you lose all work done on that task. When the task is released,
other workers on your team can pick it up. If enough workers pick up the task, you may not be able
to return to it. When you select this button and then select Confirm, you are returned to the worker
portal. If the task is still available, its status will be Available. If other workers pick it up, it will
disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.
Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must
be completed. If you do not return to and complete this task within that time limit, it will expire and
your work will not be submitted. Contact your administrator for more information.
You should periodically save your work. Ground Truth automatically saves your work every 15 minutes.
When you open a task, you must complete your work on it before pressing Submit.
Topics
• Your Task (p. 643)
• Navigate the UI (p. 645)
• Icon Guide (p. 651)
• Shortcuts (p. 652)
• Release, Stop and Resume, and Decline Tasks (p. 652)
• Saving Your Work and Submitting (p. 653)
Your Task
When you work on a 3D point cloud object detection task, you need to select a category from the
Annotations menu on the right side of your worker portal using the Label Categories menu. After you've
chosen a category, use the add cuboid and fit cuboid tools to fit a cuboid around objects in the 3D point
cloud that this category applies to. After you place a cuboid, you can modify its dimensions, location, and
orientation directly in the point cloud, and the three panels shown on the right.
If you see one or more images in your worker portal, you can also modify cuboids in the images or in the
3D point cloud and the edits will show up in the other medium.
If you see cuboids have already been added to the 3D point cloud when you open your task, adjust those
cuboids and add additional cuboids as needed.
To edit a cuboid, including moving, re-orienting, and changing cuboid dimensions, you must use
shortcut keys. You can see a full list of shortcut keys in the Shortcuts menu in your UI. The following are
important key combinations that you should become familiar with before starting your labeling task.
Individual labels may have one or more label attributes. If a label has a label attribute associated with it,
it will appear when you select the downward pointing arrow next to the label from the Label Id menu.
Fill in required values for all label attributes.
You may see frame attributes under the Labels menu. Use these attribute prompts to enter additional
information about each frame.
Navigate the UI
You can navigate in the 3D scene using your keyboard and mouse. You can:
• Double click on specific objects in the point cloud to zoom into them.
• You can use the [ and ] keys on your keyboard to zoom into and move from one label to the next. If no
label is selected, when you select [ or ], the UI will zoom into the first label in the Label Id list.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use both keyboard arrow keys and Q, E, A, and D keys to move Up, Down, Left, Right. Use keyboard
keys W and S to zoom in and out.
Once you place a cuboid in the 3D scene, a side-view will appear with three projected views: top, side,
and back. These side-views show points in and around the placed cuboid and help you refine cuboid
boundaries in that area. You can zoom in and out of each of those side-views using your mouse.
The following video demonstrates movements around the 3D point cloud and in the side-view.
When you are in the worker UI, you see several menus and icon-based tools, described in the following
sections.
When you open a task, the move scene icon is on, and you can move around the point cloud using your
mouse and the navigation buttons in the point cloud area of the screen. To return to the original view
you see when you first opened the task, choose the reset scene icon. Resetting the view will not modify
your annotations.
After you select the add cuboid icon, you can add cuboids to the 3D point cloud visualization. Once
you've added a cuboid, you can adjust it in the three views (top, side, and front) and in the images (if
included).
You must choose the move scene icon again to move to another area in the 3D point cloud or image.
To collapse all panels on the right and make the 3D point cloud full-screen, choose the full screen icon.
If camera images are included, you may have the following view options:
The following video demonstrates how to use these view options. The F option is used to view the field
of view of the camera (the gray area), the C option shows the direction the camera is facing and the
angle of the camera (blue lines), and the B option is used to view the cuboid.
Icon Guide
Use this table to learn about the icons you see in your worker task portal.
add cuboid – Choose this icon to add a cuboid. Each cuboid you add is associated with the category you
chose.
edit cuboid – Choose this icon to edit a cuboid. After you have added a cuboid, you can edit its
dimensions, location, and orientation. After a cuboid is added, it automatically switches to edit cuboid
mode.
measure distance – When you select this icon, you can place the starting point
(first marker) anywhere in the point cloud by selecting
it with your mouse. The tool will automatically use
interpolation to place a marker on the closest point within
threshold distance to the location you select, otherwise the
marker will be placed on ground. If you place a starting point
by mistake, you can use the Escape key to revert marker
placement.
After you place the first marker, you see a dotted line and a
dynamic label that indicates the distance you have moved
away from the first marker. Click somewhere else on the
point cloud to place a second marker. When you place the
second marker, the dotted line becomes solid, and the
distance is set.
reset scene – Choose this icon to reset the view of the point cloud, side panels, and, if applicable, all
images to their original position when the task was first opened.
move scene – Choose this icon to move the scene. By default, this icon is chosen when you first start a
task.
full screen – Choose this icon to make the 3D point cloud visualization full screen, and to collapse all
side panels.
Shortcuts
The shortcuts listed in the Shortcuts menu can help you navigate the 3D point cloud and use tools to
add and edit cuboids.
Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands. You need to use some of the 3D cuboid controls to edit your cuboid.
When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:
• Decline task: You should only decline a task if something is wrong with the task, such as an issue with
the 3D point cloud, images or the UI. If you decline a task, you will not be able to return to the task.
• Release Task: If you release a task, you lose all work done on that task. When the task is released,
other workers on your team can pick it up. If enough workers pick up the task, you may not be able
to return to it. When you select this button and then select Confirm, you are returned to the worker
portal. If the task is still available, its status will be Available. If other workers pick it up, it will
disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.
Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must
be completed. If you do not return to and complete this task within that time limit, it will expire and
your work will not be submitted. Contact your administrator for more information.
You should periodically save your work. Ground Truth automatically saves your work every 15 minutes.
When you open a task, you must complete your work on it before pressing Submit.
Topics
• Your Task (p. 653)
• Navigate the UI (p. 657)
• Bulk Edit Label Category and Frame Attributes (p. 661)
• Icon Guide (p. 662)
• Shortcuts (p. 663)
• Release, Stop and Resume, and Decline Tasks (p. 663)
• Saving Your Work and Submitting (p. 664)
Your Task
When you work on a 3D point cloud object tracking task, you need to select a category from the
Annotations menu on the right side of your worker portal using the Label Categories menu. After you've
selected a category, use the add cuboid and fit cuboid tools to fit a cuboid around objects in the 3D point
cloud that this category applies to. After you place a cuboid, you can modify its location, dimensions, and
orientation directly in the point cloud, and the three panels shown on the right. If you see one or more
images in your worker portal, you can also modify cuboids in the images or in the 3D point cloud and the
edits will show up in the other medium.
Important
If you see cuboids have already been added to the 3D point cloud frames when you open your
task, adjust those cuboids and add additional cuboids as needed.
To edit a cuboid, including moving, re-orienting, and changing cuboid dimensions, you must use
shortcut keys. You can see a full list of shortcut keys in the Shortcuts menu in your UI. The following are
important key combinations that you should become familiar with before starting your labeling task.
When you open your task, two frames will be loaded. If your task includes more than two frames, you
need to use the navigation bar in the lower-left corner, or the load frames icon to load additional frames.
You should annotate and adjust labels in all frames before submitting.
After you fit a cuboid tightly around the boundaries of an object, navigate to another frame using the
navigation bar in the lower-left corner of the UI. If that same object has moved to a new location, add
another cuboid and fit it tightly around the boundaries of the object. Each time you manually add a
cuboid, you see the frame sequence bar in the lower-left corner of the screen turn red where that frame
is located temporally in the sequence.
Your UI automatically infers the location of that object in all other frames after you've placed a cuboid.
This is called interpolation. You can see the movement of that object, and the inferred and manually
created cuboids using the arrows. Adjust inferred cuboids as needed. The following video demonstrates
how to navigate between frames. The following video shows how, if you add a cuboid in one frame, and
then adjust it in another, your UI will automatically infer the location of the cuboid in all of the frames
in-between.
Tip
You can turn off the automatic cuboid interpolation across frames using the 3D Point Cloud
menu item. Select 3D Point Cloud from the top-menu, and then select Interpolate Cuboids
Across Frames. This will uncheck this option and stop cuboid interpolation. You can reselect this
item to turn cuboid interpolation back on.
Turning cuboid interpolation off will not impact cuboids that have already been interpolated
across frames.
Individual labels may have one or more label attributes. If a label has a label attribute associated with it,
it will appear when you select the downward pointing arrow next to the label from the Label Id menu.
Fill in required values for all label attributes.
You may see frame attributes under the Label Id menu. These attributes will appear on each frame in
your task. Use these attribute prompts to enter additional information about each frame.
Navigate the UI
You can navigate in the 3D scene using your keyboard and mouse. You can:
• Double click on specific objects in the point cloud to zoom into them.
• Use the [ and ] keys on your keyboard to zoom into and move from one label to the next. If no label is selected when you press [ or ], the UI zooms into the first label in the Label Id list.
• Use a mouse scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the keyboard keys W and S to zoom in and out.
Once you place a cuboid in the 3D scene, a side-view will appear with three projected views: top, side,
and back. These side-views show points in and around the placed cuboid and help workers refine cuboid
boundaries in that area. Workers can zoom in and out of each of those side-views using their mouse.
The following video demonstrates movements around the 3D point cloud and in the side-view.
When you are in the worker UI, you see the following menus:
When you open a task, the move scene icon is on, and you can move around the point cloud using your
mouse and the navigation buttons in the point cloud area of the screen. To return to the original view
you see when you first opened the task, choose the reset scene icon.
After you select the add cuboid icon, you can add cuboids to the point cloud and images (if included).
You must select the move scene icon again to move to another area in the 3D point cloud or image.
To collapse all panels on the right and make the 3D point cloud full-screen, choose the full screen icon.
If camera images are included, you may have the following view options:
The following video demonstrates how to use these view options. The F option is used to view the field of view of the camera (the gray area), the C option shows the direction the camera is facing and the angle of the camera (blue lines), and the B option is used to view the cuboid.
Delete Cuboids
You can select a cuboid or label ID and delete it from one frame, from a range of frames, or from all frames. A common use case for cuboid deletion is when the object leaves the scene.
You can use one or more of the following options to delete both manually placed and interpolated cuboids with the same label ID:
• To delete all cuboids before or after the frame you are currently on, select the cuboid, select the Label menu at the top of the UI, and then select either Delete in previous frames or Delete in next frames. Use the Shortcuts menu to see the shortcut keys you can use for these options.
• To delete a label in all frames, select Delete in all frames from the Label menu, or use the shortcut Shift + Delete on your keyboard.
• To delete an individual cuboid from a single frame, select the cuboid and either select the trashcan
icon ( ) next to that label ID in the Label ID sidebar on the right or use the Delete key on your
keyboard to delete that cuboid.
If you have manually placed more than one cuboid with the same label in different frames, when
you delete one of the manually placed cuboids, all interpolated cuboids adjust. This adjustment
happens because the UI uses manually placed cuboids as anchor points when calculating the locations of interpolated cuboids. When you remove one of these anchor points, the UI must recalculate the position
of interpolated cuboids.
If you delete a cuboid from a frame, but later decide that you want to get it back, you can use the
Duplicate to previous frames or Duplicate to next frames options in the Label menu to copy the cuboid
into all the previous or all of the following frames, respectively.
When you bulk edit an attribute, you specify one or more ranges of frames that you want to apply the
edit to. The attribute you select is edited in all frames in that range, including the start and end frames
you specify. When you bulk edit label attributes, the range you specify must contain the label that the
label attribute is attached to. If you specify frames that do not contain this label, you will receive an
error.
To bulk edit an attribute, you must first specify the desired value for the attribute. For example, if you want to change an attribute from Yes to No, you must select No, and then perform the bulk edit.
You can also specify a new value for an attribute that has not been filled in and then use the bulk edit
feature to fill in that value in multiple frames. To do this, select the desired value for the attribute and
complete the following procedure.
1. Use your mouse to right-click the attribute you want to bulk edit.
2. Specify the range of frames you want to apply the bulk edit to using a dash (-) in the text box. For example, if you want to apply the edit to frames one through ten, enter 1-10. If you want to apply the edit to frames two to five, eight to ten, and twenty, enter 2-5,8-10,20.
3. Select Confirm.
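In other words, the range syntax is a comma-separated list of segments, where each segment is either a single frame number or an inclusive start-end pair. If it helps to see the rule spelled out, here is a minimal Python sketch (illustrative only, not part of the worker UI) that expands a range string into the frames it covers:

def expand_frame_ranges(spec):
    """Expand a range string such as "2-5,8-10,20" into frame numbers."""
    frames = []
    for segment in spec.split(","):
        if "-" in segment:
            # Both the start and end frames are included in the edit.
            start, end = segment.split("-")
            frames.extend(range(int(start), int(end) + 1))
        else:
            frames.append(int(segment))
    return frames

print(expand_frame_ranges("2-5,8-10,20"))  # [2, 3, 4, 5, 8, 9, 10, 20]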
If you get an error message, verify that you entered a valid range and that the label associated with the
label attribute you are editing (if applicable) exists in all frames specified.
You can quickly add a label to all previous or subsequent frames using the Duplicate to previous frames
and Duplicate to next frames options in the Label menu at the top of your screen.
Icon Guide
Use this table to learn about the icons you see in your worker task portal.
Icon Description
add cuboid Choose this icon to add a cuboid. Each cuboid you add is
associated with the category you chose.
edit cuboid Choose this icon to edit a cuboid's dimensions, location, and orientation. The tool automatically switches to edit cuboid mode after you add a cuboid.
When you select this icon, you can place the starting point
(first marker) anywhere in the point cloud by selecting
it with your mouse. The tool automatically uses interpolation to place a marker on the closest point within a threshold distance of the location you select; otherwise, the marker is placed on the ground. If you place a starting point by mistake, you can use the Escape key to revert the marker placement.
After you place the first marker, you see a dotted line and a
dynamic label that indicates the distance you have moved
away from the first marker. Click somewhere else on the
point cloud to place a second marker. When you place the
second marker, the dotted line becomes solid, and the
distance is set.
reset scene Choose this icon to reset the view of the point cloud, side
panels, and if applicable, all images to their original position
when the task was first opened.
move scene Choose this icon to move the scene. By default, this icon is
chosen when you first start a task.
full screen Choose this icon to make the 3D point cloud visualization
full screen and to collapse all side panels.
delete labels Delete a label. This option can only be used to delete labels
you have manually created or adjusted.
Shortcuts
The shortcuts listed in the Shortcuts menu can help you navigate the 3D point cloud and use tools to
add and edit cuboids.
Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands. You need to use some of the 3D cuboid controls to edit your cuboid.
When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:
• Decline task: You should only decline a task if something is wrong with the task, such as an issue with the 3D point clouds, images, or the UI. If you decline a task, you will not be able to return to the task.
• Release task: Use this option to release a task and allow others to work on it. When you release a task, you lose all work done on that task and other workers on your team can pick it up. If enough workers pick up the task, you may not be able to return to it. When you select this button and then select Confirm, you are returned to the worker portal. If the task is still available, its status will be Available. If other workers pick it up, it will disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.
Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must be completed. If you do not return to and complete this task within that time limit, it will expire and your work will not be submitted. Contact your administrator for more information.
You should periodically save your work. Ground Truth will automatically save your work every 15 minutes.
When you open a task, you must complete your work on it before pressing Submit.
Verify and Adjust Labels
• Label verification — Workers indicate if the existing labels are correct, or rate their quality, and can add comments to explain their reasoning. Workers will not be able to modify or adjust labels.
• Label adjustment — Workers adjust prior annotations and, if applicable, label category and frame attributes to correct them.
If you create a 3D point cloud or video frame label adjustment or verification job, you can choose to make label category attributes (not supported for 3D point cloud semantic segmentation) and frame attributes editable by workers.
The following Ground Truth built-in task types support adjustment and verification labeling jobs:
• Bounding box
• Semantic segmentation
• 3D point cloud object detection, 3D point cloud object tracking, and 3D point cloud semantic
segmentation
• All video frame object detection and video frame object tracking task types — bounding box, polyline, polygon, and keypoint
Tip
For 3D point cloud and video frame labeling verification jobs, it is recommended that you add new label category attributes or frame attributes to the labeling job. Workers can use these attributes to verify individual labels or the entire frame. To learn more about label category and
frame attributes, see Worker User Interface (UI) (p. 631) for 3D point cloud and Worker User
Interface (UI) (p. 577) for video frame.
You can start label verification and adjustment jobs using the SageMaker console or the API.
Topics
• Requirements to Create Verification and Adjustment Labeling Jobs (p. 665)
• Create a Label Verification Job (Console) (p. 665)
• Create a Label Adjustment Job (Console) (p. 667)
• Start a Label Verification or Adjustment Job (API) (p. 668)
• Label Verification and Adjustment Data in the Output Manifest (p. 670)
• Cautions and Considerations (p. 671)
Requirements to Create Verification and Adjustment Labeling Jobs
• For non-streaming labeling jobs: The input manifest file you use must contain the label attribute name (LabelAttributeName) of the labels that you want adjusted. When you chain a successfully completed labeling job, the output manifest file is used as the input manifest file for the new, chained job. To learn more about the format of the output manifest file Ground Truth produces for each task type, see Output Data (p. 776).
For streaming labeling jobs: The Amazon SNS message you send to the Amazon SNS input topic of the adjustment or verification labeling job must contain the label attribute name of the labels you want adjusted or verified (a sketch follows this list). To see an example of how you can create an adjustment or verification labeling job with streaming labeling jobs, see this Jupyter Notebook example in GitHub.
• The task type of the verification or adjustment labeling job must be the same as the task type of the
original job unless you are using the Image Label Verification (p. 551) task type to verify bounding
box or semantic segmentation image labels. See the next bullet point for more details about the video
frame task type requirements.
• For video frame annotation verification and adjustment jobs, you must use the same annotation task
type used to create the annotations from the previous labeling job. For example, if you create a video
frame object detection job to have workers draw bounding boxes around objects, and then you create
a video object detection adjustment job, you must specify bounding boxes as the annotation task type.
To learn more about video frame annotation task types, see Task Types (p. 576).
• The task type you select for the adjustment or verification labeling job must support an audit workflow. The following Ground Truth built-in task types support adjustment and verification labeling jobs: bounding box, semantic segmentation, 3D point cloud object detection, 3D point cloud object tracking, and 3D point cloud semantic segmentation, as well as all video frame object detection and video frame object tracking task types — bounding box, polyline, polygon, and keypoint.
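As referenced in the streaming bullet above, the SNS message must carry the original job's label attribute name. The following is a hedged boto3 sketch of publishing such a message; the topic ARN, label attribute name (original-job), and label data are placeholders, and real label data should be copied from the output manifest of the original labeling job:

import json
import boto3

sns = boto3.client("sns")

# Hypothetical ARN -- use your adjustment job's SNS input topic.
input_topic_arn = "arn:aws:sns:us-west-2:111122223333:ExampleInputTopic"

# The message must include the label attribute name of the labels you want
# adjusted or verified ("original-job" here), copied from the output
# manifest of the original labeling job.
data_object = {
    "source-ref": "s3://DOC-EXAMPLE-BUCKET/images/bird1.jpg",
    "original-job": {"annotations": []},
    "original-job-metadata": {"type": "groundtruth/object-detection"},
}

sns.publish(TopicArn=input_topic_arn, Message=json.dumps(data_object))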
Topics
• Create an Image Label Verification Job (Console) (p. 665)
• Create a Point Cloud or Video Frame Label Verification Job (Console) (p. 666)
You can add new labels that workers choose from to verify labels. For example, you can ask workers
to verify the image quality, and provide the labels Clear and Blurry. Workers will also have the option
to add a comment to explain their selection.
9. Choose See preview to check that the tool is displaying the prior labels correctly and presents the
label verification task clearly.
10. Select Create. This will create and start your labeling job.
You cannot modify or add new labels. You can remove, modify, and add new label category attributes or frame attributes. It is recommended that you add new label category attributes or frame attributes to the labeling job. Workers can use these attributes to verify individual labels or the entire frame.
By default, preexisting label category attributes and frame attributes will not be editable by workers.
If you want to make a label category or frame attribute editable, select the Allow workers to edit
this attribute check box for that attribute.
To learn more about label category and frame attributes, see Worker User Interface (UI) (p. 631)
for 3D point cloud and Worker User Interface (UI) (p. 577) for video frame.
11. Choose See preview to check that the tool is displaying the prior labels correctly and presents the
label verification task clearly.
12. Select Create. This will create and start your labeling job.
Topics
• Create an Image Label Adjustment Job (Console) (p. 667)
• Create a Point Cloud or Video Frame Label Adjustment Job (Console) (p. 668)
You cannot remove or modify existing labels, but you can add new labels. You can remove, modify, and add new label category attributes or frame attributes.
By default, preexisting label category attributes and frame attributes will be editable by workers. If you want to make a label category or frame attribute uneditable, deselect the Allow workers to edit this attribute check box for that attribute.
To learn more about label category and frame attributes, see Worker User Interface (UI) (p. 631)
for 3D point cloud and Worker User Interface (UI) (p. 577) for video frame.
8. Choose See preview to check that the tool shows the prior labels correctly and presents the task
clearly.
9. Select Create. This will create and start your labeling job.
When you create an adjustment or verification labeling job using the Ground Truth API, you must use a
different LabelAttributeName than the original labeling job. The original labeling job is the job used
to create the labels you want adjusted or verified.
Important
The label category configuration file you identify for an adjustment or verification job in
LabelCategoryConfigS3Uri of CreateLabelingJob must contain the same labels used in
the original labeling job. You can add new labels. For 3D point cloud and video frame jobs, you
can add new label category and frame attributes to the label category configuration file.
To create a bounding box or semantic segmentation label verification or adjustment job, use the
following guidelines to specify API attributes for the CreateLabelingJob operation.
• Use the LabelAttributeName parameter to specify the output label name that you want to use for
verified or adjusted labels. You must use a different LabelAttributeName than the one used for the
original labeling job.
• If you are chaining the job, the labels from the previous labeling job to be adjusted or verified will be
specified in the custom UI template. To learn how to create a custom template, see Create Custom
Worker Task Templates (p. 2995).
Identify the location of the UI template in the UiTemplateS3Uri parameter. SageMaker provides
widgets that you can use in your custom template to display old labels. Use the initial-value
attribute in one of the following crowd elements to extract the labels that need verification or
adjustment and include them in your task template:
• crowd-semantic-segmentation (p. 948)—Use this crowd element in your custom UI task template
to specify semantic segmentation labels that need to be verified or adjusted.
• crowd-bounding-box (p. 894)—Use this crowd element in your custom UI task template to specify
bounding box labels that need to be verified or adjusted.
• The LabelCategoryConfigS3Uri parameter must contain the same label categories as the previous
labeling job.
• Use the bounding box or semantic segmentation adjustment or verification lambda ARNs for
PreHumanTaskLambdaArn and AnnotationConsolidationLambdaArn:
• For bounding box, the adjustment labeling job lambda function ARNs end with
AdjustmentBoundingBox and the verification lambda function ARNs end with
VerificationBoundingBox.
• For semantic segmentation, the adjustment labeling job lambda function ARNs end with
AdjustmentSemanticSegmentation and the verification lambda function ARNs end with
VerificationSemanticSegmentation.
To create a 3D point cloud or video frame label verification or adjustment job, use the following guidelines to specify API attributes for the CreateLabelingJob operation.
• Use the LabelAttributeName parameter to specify the output label name that you want to use for verified or adjusted labels. You must use a different LabelAttributeName than the one used for the original labeling job.
• You must use the human task UI Amazon Resource Name (ARN) (HumanTaskUiArn) used for the
original labeling job. To see supported ARNs, see HumanTaskUiArn.
• In the label category configuration file, you must specify the label attribute name
(LabelAttributeName) of the previous labeling job that you use to create the adjustment or
verification labeling job in the auditLabelAttributeName parameter.
• You specify whether your labeling job is a verification or adjustment labeling job using
the editsAllowed parameter in your label category configuration file identified by the
LabelCategoryConfigS3Uri parameter.
• For verification labeling jobs, you must use the editsAllowed parameter to specify that none of the labels can be modified. editsAllowed must be set to "none" in each entry in labels. Optionally, you can specify whether or not label category attributes and frame attributes can be adjusted by workers. (A sketch of such a configuration file follows this list.)
• Optionally, for adjustment labeling jobs, you can use the editsAllowed parameter to specify
labels, label category attributes, and frame attributes that can or cannot be modified by workers.
If you do not use this parameter, all labels, label category attributes, and frame attributes will be
adjustable.
To learn more about the editsAllowed parameter and configuring your label category configuration
file, see Label Category Configuration File Schema (p. 719).
• Use the 3D point cloud or video frame adjustment lambda ARNs for PreHumanTaskLambdaArn and
AnnotationConsolidationLambdaArn for both adjustment and verification labeling jobs:
• For 3D point clouds, the adjustment and verification labeling job lambda function ARNs end with Adjustment3DPointCloudSemanticSegmentation, Adjustment3DPointCloudObjectTracking, and Adjustment3DPointCloudObjectDetection for 3D point cloud semantic segmentation, object tracking, and object detection, respectively.
• For video frames, the adjustment and verification labeling job lambda function ARNs end with
AdjustmentVideoObjectDetection and AdjustmentVideoObjectTracking for video frame
object detection and object tracking respectively.
Ground Truth stores the output data from a label verification or adjustment job in the S3 bucket that
you specified in the S3OutputPath parameter of the CreateLabelingJob operation. For more
information about the output data from a label verification or adjustment labeling job, see Label
Verification and Adjustment Data in the Output Manifest (p. 670).
The following example output manifest shows how label verification data appears:
{
    "source-ref": "S3 bucket location",
    "verify-bounding-box": "1",
    "verify-bounding-box-metadata": {
        "class-name": "bad",
        "confidence": 0.93,
        "type": "groundtruth/label-verification",
        "job-name": "verify-bounding-boxes",
        "human-annotated": "yes",
        "creation-date": "2018-10-18T22:18:13.527256",
        "worker-feedback": [
            {"comment": "The bounding box on the bird is too wide on the right side."},
            {"comment": "The bird on the upper right is not labeled."}
        ]
    }
}
The worker output of adjustment tasks resembles the worker output of the original task, except that
it contains the adjusted values and an adjustment-status property with the value of adjusted or
unadjusted to indicate whether an adjustment was made.
For more examples of the output of different tasks, see Output Data (p. 776).
• If you are using image data, verify that your manifest file contains hexadecimal RGB color information.
• To save money on processing costs, filter your data to ensure you are not including unwanted objects
in your labeling job input manifest.
• Add required Amazon S3 permissions to ensure your input data is processed correctly.
When you create an adjustment or verification labeling job using the Ground Truth API, you must use a
different LabelAttributeName than the original labeling job.
In prior iterations of the Semantic Segmentation tool, category color information wasn't output in
hexadecimal RGB format to the output manifest. That feature was introduced to the output manifest
at the same time the verification and adjustment workflows were introduced. Therefore, older output
manifests aren't compatible with this new workflow.
If you create a verification job using the console, you can use the filtering tools provided there. If you
create jobs using the API, make filtering your data part of your workflow where needed.
Creating Custom Labeling Workflows
Topics
• Step 1: Setting up your workforce (p. 672)
• Step 2: Creating your custom worker task template (p. 672)
• Step 3: Processing with AWS Lambda (p. 678)
• Demo Template: Annotation of Images with crowd-bounding-box (p. 692)
• Demo Template: Labeling Intents with crowd-classifier (p. 696)
• Custom Workflows via the API (p. 703)
For more information about creating custom labeling workflows, see Build a custom data labeling
workflow with Amazon SageMaker Ground Truth.
Step 1: Setting up your workforce
1. First choose an option from the Worker types. There are three types currently available:
You are also asked to set a price per task by using a drop-down menu. The menu recommends price
points based on how long it will take to complete the task.
The recommended method to determine this is to first run a short test of your task with a private
workforce. The test provides a realistic estimate of how long the task takes to complete. You can
then select the range your estimate falls within on the Price per task menu. If your average time is
more than 5 minutes, consider breaking your task into smaller units.
Next
Step 2: Creating your custom worker task template (p. 672)
Use the following topics to learn how you can create a worker task template. You can see a repository of
example Ground Truth worker task templates on GitHub.
1. Follow the instructions in Create a Labeling Job (Console) (p. 706) and select Custom for the labeling job Task type.
2. When you select Next, you will be able to access the template editor and base templates in the Custom labeling task setup section.
2. When you select Next, you will be able to access the template editor and base templates in the
Custom labeling task setup section.
3. (Optional) Select a base template from the drop-down menu under Templates. If you prefer to create a template from scratch, choose Custom from the drop-down menu for a minimal template skeleton.
Example
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
This loads the necessary code to render the custom HTML elements. Use this if you want to develop your
template's look and feel in your preferred editor rather than in the console.
Remember, though, this will not parse your variables. You may want to replace them with sample
content while developing locally.
Example
<script src="https://www.example.com/my-enhancment-script.js"></script>
<link rel="stylesheet" type="text/css" href="https://www.example.com/my-enhancement-
styles.css">
If you encounter errors, ensure that your originating server is sending the correct MIME type and
encoding headers with the assets.
For example, the MIME and encoding type for remote scripts is application/javascript;CHARSET=UTF-8.
The MIME and encoding type for remote stylesheets is text/css;CHARSET=UTF-8.
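If you host the assets yourself and want to sanity-check those headers locally, here is a small sketch using Python's built-in http.server (for local testing only, not production hosting):

from http.server import HTTPServer, SimpleHTTPRequestHandler

class AssetHandler(SimpleHTTPRequestHandler):
    # Map asset extensions to the MIME and encoding types described above.
    extensions_map = {
        ".js": "application/javascript;CHARSET=UTF-8",
        ".css": "text/css;CHARSET=UTF-8",
        "": "application/octet-stream",  # default for anything else
    }

if __name__ == "__main__":
    # Serves files from the current directory at http://localhost:8000.
    HTTPServer(("localhost", 8000), AssetHandler).serve_forever()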
If you base your template on one of the sample templates, make sure you're aware of the variables the template already uses. When you create your pre-annotation AWS Lambda script, its output will need to contain values for any of those variables you choose to keep.
The values you use for the variables can come from your manifest file. All the key-value pairs in your
data object are provided to your pre-annotation Lambda. If it's a simple pass-through script, matching
keys for values in your data object to variable names in your template is the easiest way to pass those values through to the task forms your workers see.
A simple sample
All tasks begin and end with the <crowd-form> </crowd-form> elements. Like standard HTML
<form> elements, all of your form code should go between them.
For a simple tweet-analysis task, use the <crowd-classifier> element. It requires the following
attributes:
• name - the variable name to use for the result in the form output.
• categories - a JSON formatted array of the possible answers.
• header - a title for the annotation tool
• <classification-target> - the text the worker will classify based on the options specified in the
categories attribute above.
• <full-instructions> - instructions that are available from the "View full instructions" link in the tool.
This can be left blank, but it is recommended that you give good instructions to get better results.
• <short-instructions> - a more brief description of the task that appears in the tool's sidebar. This can be
left blank, but it is recommended that you give good instructions to get better results.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="tweetFeeling"
categories="['positive','negative','neutral', 'unclear']"
header="Which term best describes this tweet?"
>
<classification-target>
My favorite football team won today!
Bring on the division finals!
</classification-target>
<short-instructions>
Pick the term best describing the sentiment
of the tweet.
</short-instructions>
</crowd-classifier>
</crowd-form>
You can copy and paste the code into the editor in the Ground Truth labeling job creation workflow to
preview the tool, or try out a demo of this code on CodePen.
The most common use of Liquid will be to parse the data coming from your pre-annotation Lambda
and pull out the relevant variables to create the task. The taskInput object returned by your Pre-
annotation Lambda (p. 679) will be available as the task.input object in your templates.
The properties in your manifest's data objects are passed into your Pre-annotation Lambda (p. 679)
as the event.dataObject. A simple pass-through script simply returns that object as the taskInput
object. You would represent values from your manifest as variables as follows.
{
  "source": "This is a sample text for classification",
  "labels": ["angry", "sad", "happy", "inconclusive"],
  "header": "What emotion is the speaker feeling?"
}
<crowd-classifier
  name='tweetFeeling'
  categories='{{ task.input.labels | to_json }}'
  header='{{ task.input.header }}'
>
  <classification-target>
    {{ task.input.source }}
  </classification-target>
</crowd-classifier>
Note the addition of " | to_json" to the labels property above. That's a filter to turn the array into a
JSON representation of the array. Variable filters are explained in the next section.
The following list includes two types of Liquid tags that you may find useful to automate template
input data processing. If you select one of the following tag-types, you will be redirected to the Liquid
documentation.
• Control flow: Includes programming logic operators like if/else, unless, and case/when.
• Iteration: Enables you to run blocks of code repeatedly using statements like for loops.
For an example of an HTML template that uses Liquid elements to create a for loop, see translation-
review-and-correction.liquid.html in GitHub.
Variable filters
In addition to the standard Liquid filters and actions, Ground Truth offers a few additional filters. Filters
are applied by placing a pipe (|) character after the variable name, then specifying a filter name. Filters
can be chained in the form of:
Example
{{ <content> | <filter> | <filter> }}
escape_once
escape_once ensures that if you've already escaped your code, it doesn't get re-escaped on top of that.
For example, so that &amp; doesn't become &amp;amp;.
skip_autoescape
skip_autoescape is useful when your content is meant to be used as HTML. For example, you might
have a few paragraphs of text and some images in the full instructions for a bounding box.
Use skip_autoescape sparingly
The best practice in templates is to avoid passing in functional code or markup with
skip_autoescape unless you are absolutely sure you have strict control over what's being
passed. If you're passing user input, you could be opening your workers up to a Cross Site
Scripting attack.
to_json
to_json will encode what you feed it to JSON (JavaScript Object Notation). If you feed it an object, it
will serialize it.
grant_read_access
grant_read_access takes an S3 URI and encodes it into an HTTPS URL with a short-lived access token
for that resource. This makes it possible to display to workers photo, audio, or video objects stored in S3
buckets that are not otherwise publicly accessible.
The text classification template is below with automation added. The key change is that the hard-coded tweet text is replaced with the {{ task.input.source }} variable.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="tweetFeeling"
categories="['positive', 'negative', 'neutral', 'cannot determine']"
header="Which term best describes this tweet?"
>
<classification-target>
{{ task.input.source }}
</classification-target>
<short-instructions>
Pick the term best describing the sentiment
of the tweet.
</short-instructions>
</crowd-classifier>
</crowd-form>
The tweet text that was in the prior sample is now replaced with an object. The entry.taskInput
object uses source (or another name you specify in your pre-annotation Lambda) as the property name
for the text and it is inserted directly in the HTML by virtue of being between double curly braces.
End-to-end demos
You can view the following end-to-end demos, which include sample Lambda functions:
• Demo Template: Annotation of Images with crowd-bounding-box (p. 692)
• Demo Template: Labeling Intents with crowd-classifier (p. 696)
Next
Step 3: Processing with AWS Lambda (p. 678)
• Pre-annotation Lambda: This function is initiated for, and pre-processes, each data object sent to your labeling job before the object is sent to workers.
• Post-annotation Lambda: This function processes the results once workers submit a task. If you specify
multiple workers per data object, this function may include logic to consolidate annotations.
If you are a new user of Lambda and Ground Truth, we recommend that you use the pages in this section
as follows:
1. First, review Pre-annotation and Post-annotation Lambda Function Requirements (p. 678).
2. Then, use the page Required Permissions To Use AWS Lambda With Ground Truth (p. 685) to learn
about security and permission requirements to use your pre-annotation and post-annotation Lambda
functions in a Ground Truth custom labeling job.
3. Next, you need to visit the Lambda console or use Lambda's APIs to create your functions. Use the
section Create Lambda Functions for a Custom Labeling Workflow (p. 689) to learn how to create
Lambda functions.
4. To learn how to test your Lambda functions, see Test Pre-Annotation and Post-Annotation Lambda
Functions (p. 689).
5. After you create pre-processing and post-processing Lambda functions, select them from the Lambda
functions section that comes after the code editor for your custom HTML in the Ground Truth console.
To learn how to use these functions in a CreateLabelingJob API request, see Create a Labeling Job
(API) (p. 709).
For a custom labeling workflow tutorial that includes example pre-annotation and post-annotation Lambda functions, see Demo Template: Annotation of Images with crowd-bounding-box (p. 692).
Topics
• Pre-annotation and Post-annotation Lambda Function Requirements (p. 678)
• Required Permissions To Use AWS Lambda With Ground Truth (p. 685)
• Create Lambda Functions for a Custom Labeling Workflow (p. 689)
• Test Pre-Annotation and Post-Annotation Lambda Functions (p. 689)
Pre-annotation Lambda
Before a labeling task is sent to the worker, your pre-annotation Lambda function is invoked.
Ground Truth sends your Lambda function a JSON-formatted request to provide details about the labeling job and the data object. The following code blocks contain the pre-annotation request schemas: one for data objects identified with source-ref, and one for data objects identified with source. Each parameter is described below.
{
    "version": "2018-10-16",
    "labelingJobArn": <labelingJobArn>,
    "dataObject": {
        "source-ref": <s3Uri>
    }
}

{
    "version": "2018-10-16",
    "labelingJobArn": <labelingJobArn>,
    "dataObject": {
        "source": <string>
    }
}
The following code blocks include examples of a pre-annotation request. Each parameter in these example requests is explained below.
{
    "version": "2018-10-16",
    "labelingJobArn": "arn:aws:sagemaker:<aws_region>:<aws_account_number>:labeling-job/<labeling_job_name>",
    "dataObject": {
        "source-ref": "s3://<input-data-bucket>/<data-object-file-name>"
    }
}
{
    "version": "2018-10-16",
    "labelingJobArn": "arn:aws:sagemaker:<aws_region>:<aws_account_number>:labeling-job/<labeling_job_name>",
    "dataObject": {
        "source": "Sue purchased 10 shares of the stock on April 10th, 2020"
    }
}
In return, Ground Truth requires your pre-annotation Lambda function to send a response in the following format:
{
    "taskInput": <json object>,
    "isHumanAnnotationRequired": <boolean> # Optional
}
In the previous example, the <json object> needs to contain all the data your custom worker task
template needs. If you're doing a bounding box task where the instructions stay the same all the time, it
may just be the HTTP(S) or Amazon S3 resource for your image file. If it's a sentiment analysis task and
different objects may have different choices, it is the object reference as a string and the choices as an
array of strings.
Implications of isHumanAnnotationRequired
This value is optional because it defaults to true. The primary use case for explicitly setting it is
when you want to exclude this data object from being labeled by human workers.
If you have a mix of objects in your manifest, with some requiring human annotation and some not needing it, you can include an isHumanAnnotationRequired value in each data object. You can add
logic to your pre-annotation Lambda to dynamically determine if an object requires annotation, and set
this boolean value accordingly.
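For example, the following sketch sets the flag from a hypothetical needs-review key in each data object. The key name is an assumption for illustration; use whatever field your manifest actually provides.

import json

def lambda_handler(event, context):
    data_object = event["dataObject"]
    # "needs-review" is a hypothetical manifest key used for illustration.
    # Objects marked false skip human annotation entirely.
    needs_human = data_object.get("needs-review", True)
    return {
        "taskInput": data_object,
        "isHumanAnnotationRequired": needs_human,
    }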
The following, basic pre-annotation Lambda function accesses the JSON object in dataObject from the
initial request, and returns it in the taskInput parameter.
import json

def lambda_handler(event, context):
    # Return the data object from the request as the task input.
    return {
        "taskInput": event["dataObject"]
    }
Assuming the input manifest file uses "source-ref" to identify data objects, the worker task template
used in the same labeling job as this pre-annotation Lambda must include a Liquid element like the
following to ingest dataObject:
{{ task.input.source-ref | grant_read_access }}
If the input manifest file used source to identify the data object, the worker task template can ingest
dataObject with the following:
{{ task.input.source }}
The following pre-annotation Lambda example includes logic to identify the key used in dataObject,
and to point to that data object using taskObject in the Lambda's return statement.
import json

def lambda_handler(event, context):
    # Event received
    print("Received event: " + json.dumps(event, indent=2))

    # The data object comes directly from the input manifest file.
    data_object = event["dataObject"]

    # Identify the key used in the manifest: prefer "source", then "source-ref".
    source = data_object.get("source")
    source_ref = data_object.get("source-ref")
    task_object = source if source is not None else source_ref

    # Build the response object.
    output = {
        "taskInput": {
            "taskObject": task_object
        },
        "humanAnnotationRequired": "true"
    }
    print(output)

    # If neither source nor source-ref specified, mark the annotation failed
    if task_object is None:
        print(" Failed to pre-process {} !".format(event["labelingJobArn"]))
        output["humanAnnotationRequired"] = "false"

    return output
Post-annotation Lambda
When all workers have annotated the data object or when TaskAvailabilityLifetimeInSeconds
has been reached, whichever comes first, Ground Truth sends those annotations to your post-annotation
Lambda. This Lambda is generally used for Consolidate Annotations (p. 806).
Tip
To see an example of a post-consolidation Lambda function, see
annotation_consolidation_lambda.py in the aws-sagemaker-ground-truth-recipe GitHub
repository.
The following code block contains the post-annotation request schema. Each parameter is described in
the following bulleted list.
{
    "version": "2018-10-16",
    "labelingJobArn": <string>,
    "labelCategories": [<string>],
    "labelAttributeName": <string>,
    "roleArn": <string>,
    "payload": {
        "s3Uri": <string>
    }
}
The following code block contains an example of a post-annotation request. Each parameter in this
example request is explained below the code block.
{
    "version": "2018-10-16",
    "labelingJobArn": "arn:aws:sagemaker:us-west-2:111122223333:labeling-job/labeling-job-name",
    "labelCategories": ["Ex Category1", "Ex Category2", "Ex Category3"],
    "labelAttributeName": "labeling-job-attribute-name",
    "roleArn": "arn:aws:iam::111122223333:role/role-name",
    "payload": {
        "s3Uri": "s3://DOC-EXAMPLE-BUCKET/annotations.json"
    }
}
Note
If no worker works on the data object and TaskAvailabilityLifetimeInSeconds has been
reached, the data object is marked as failed and not included as part of post-annotation Lambda
invocation.
The following code block contains the payload schema. This is the file that is indicated by the
s3Uri parameter in the post-annotation Lambda request payload JSON object. For example, if the
previous code block is the post-annotation Lambda request, the following annotation file is located at
s3://DOC-EXAMPLE-BUCKET/annotations.json.
[
    {
        "datasetObjectId": <string>,
        "dataObject": {
            "s3Uri": <string>,
            "content": <string>
        },
        "annotations": [{
            "workerId": <string>,
            "annotationData": {
                "content": <string>,
                "s3Uri": <string>
            }
        }]
    }
]
• datasetObjectId (string): Identifies a unique ID that Ground Truth assigns to each data object you
send to the labeling job.
• dataObject (JSON object): The data object that was labeled. If the data object is included in the
input manifest file and identified using the source key (for example, a string), dataObject includes a
content key, which identifies the data object. Otherwise, the location of the data object (for example,
a link or S3 URI) is identified with s3Uri.
• annotations (list of JSON objects): This list contains a single JSON object for each annotation
submitted by workers for that dataObject. A single JSON object contains a unique workerId
that can be used to identify the worker that submitted that annotation. The annotationData key
contains one of the following:
• content (string): Contains the annotation data.
• s3Uri (string): Contains an S3 URI that identifies the location of the annotation data.
The following examples show the content that you may find in payload for different types of annotation.
[
{
"datasetObjectId": "1",
"dataObject": {
"content": "Sift 3 cups of flour into the bowl."
},
"annotations": [
{
"workerId": "private.us-west-2.ef7294f850a3d9d1",
"annotationData": {
"content": "{\"crowd-entity-annotation\":{\"entities\":[{\"endOffset
\":4,\"label\":\"verb\",\"startOffset\":0},{\"endOffset\":6,\"label\":\"number
\",\"startOffset\":5},{\"endOffset\":20,\"label\":\"object\",\"startOffset\":15},
{\"endOffset\":34,\"label\":\"object\",\"startOffset\":30}]}}"
}
}
]
}
]
[
{
"datasetObjectId": "2",
"dataObject": {
"s3Uri": "s3://DOC-EXAMPLE-BUCKET/gt-input-data/images/bird3.jpg"
},
"annotations": [
{
"workerId": "private.us-west-2.ab1234c5678a919d0",
"annotationData": {
"content": "{\"crowd-semantic-segmentation\":{\"inputImageProperties\":
{\"height\":2000,\"width\":3020},\"labelMappings\":{\"Bird\":{\"color\":\"#2ca02c\"}},
\"labeledImage\":{\"pngImageData\":\"iVBOR...\"}}}"
}
}
]
}
]
[
{
"datasetObjectId": "0",
"dataObject": {
"s3Uri": "s3://DOC-EXAMPLE-BUCKET/gt-input-data/images/bird1.jpg"
},
"annotations": [
{
"workerId": "private.us-west-2.ab1234c5678a919d0",
"annotationData": {
"content": "{\"boundingBox\":{\"boundingBoxes\":[{\"height\":2052,\"label
\":\"Bird\",\"left\":583,\"top\":302,\"width\":1375}],\"inputImageProperties\":
{\"height\":2497,\"width\":3745}}}"
}
}
]
}
]
Your post-annotation Lambda function may contain logic similar to the following to
loop through and access all annotations contained in the request. For a full example, see
annotation_consolidation_lambda.py in the aws-sagemaker-ground-truth-recipe GitHub repository. In
this GitHub example, you must add your own annotation consolidation logic.
for i in range(len(annotations)):
    worker_id = annotations[i]["workerId"]
    annotation_content = annotations[i]["annotationData"].get("content")
    # "s3Uri" matches the annotationData key in the payload schema above.
    annotation_s3_uri = annotations[i]["annotationData"].get("s3Uri")
    annotation = annotation_content if annotation_s3_uri is None else s3_client.get_object_from_s3(annotation_s3_uri)
    annotation_from_single_worker = json.loads(annotation)
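Continuing from the loop above, the following sketch shows the shape of a trivial pass-through consolidation that keeps each object's first annotation. It assumes inline content (see the loop above for handling s3Uri) and is a placeholder for your own logic, such as majority voting across workers.

import json

def consolidate(event, dataset_objects):
    # dataset_objects is the parsed contents of the payload file;
    # event is the post-annotation request.
    consolidated = []
    for obj in dataset_objects:
        # Pass-through: keep the first worker's annotation unchanged.
        first = json.loads(obj["annotations"][0]["annotationData"]["content"])
        consolidated.append({
            "datasetObjectId": obj["datasetObjectId"],
            "consolidatedAnnotation": {
                "content": {event["labelAttributeName"]: first}
            },
        })
    return consolidated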
Tip
When you run consolidation algorithms on the data, you can use an AWS database service to
store results, or you can pass the processed results back to Ground Truth. The data you return
to Ground Truth is stored in consolidated annotation manifests in the S3 bucket specified for
output during the configuration of the labeling job.
[
    {
        "datasetObjectId": <string>,
        "consolidatedAnnotation": {
            "content": {
                "<labelattributename>": {
                    # ... the consolidated label content for this data object
                }
            }
        }
    }
]
At this point, all the data you're sending to your S3 bucket, other than the datasetObjectId, is in the
content object.
When you return annotations in content, this results in a corresponding entry in your job's output manifest.
Because of the potentially complex nature of a custom template and the data it collects, Ground Truth
does not offer further processing of the data.
• You need to grant an IAM role or user (collectively, an IAM entity) permission to create the pre-
annotation and post-annotation Lambda functions using AWS Lambda, and to choose them when
creating the labeling job.
• The IAM execution role specified when the labeling job is configured needs permission to invoke the
pre-annotation and post-annotation Lambda functions.
• The post-annotation Lambda functions may need permission to access Amazon S3.
Use the following sections to learn how to create the IAM entities and grant permissions described
above.
Topics
• Grant Permission to Create and Select an AWS Lambda Function (p. 686)
• Grant IAM Execution Role Permission to Invoke AWS Lambda Functions (p. 686)
• Grant Post-Annotation Lambda Permissions to Access Annotation (p. 687)
If you do not require granular permissions to develop pre-annotation and post-annotation Lambda
functions, you can attach the AWS managed policy AWSLambda_FullAccess to a user or role. This
policy grants broad permissions to use all Lambda features, as well as permission to perform actions in
other AWS services with which Lambda interacts.
To create a more granular policy for security-sensitive use cases, refer to the documentation Identity-based IAM policies for Lambda in the AWS Lambda Developer Guide to learn how to create an IAM policy that fits your use case.
If you want to grant an IAM entity permission to use the Lambda console, see Using the Lambda console
in the AWS Lambda Developer Guide.
Additionally, if you want the user to be able to access and deploy the Ground Truth starter pre-
annotation and post-annotation functions using the AWS Serverless Application Repository in the
Lambda console, you must specify the <aws-region> where you want to deploy the functions (this
should be the same AWS Region used to create the labeling job), and add the following policy to the IAM
role.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"serverlessrepo:ListApplicationVersions",
"serverlessrepo:GetApplication",
"serverlessrepo:CreateCloudFormationTemplate"
],
"Resource": "arn:aws:serverlessrepo:<aws-region>:838997950401:applications/aws-
sagemaker-ground-truth-recipe"
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "serverlessrepo:SearchApplications",
"Resource": "*"
}
]
}
To grant an IAM entity permission to view Lambda functions in the Ground Truth console when the user
is creating a custom labeling job, the entity must have the permissions described in Grant IAM Permission
to Use the Amazon SageMaker Ground Truth Console (p. 818), including the permissions described in
the section Custom Labeling Workflow Permissions (p. 821).
If you add the IAM managed policy AmazonSageMakerGroundTruthExecution to the IAM execution role
used to create the labeling job, this role has permission to list and invoke Lambda functions with one
of the following strings in the function name: GtRecipe, SageMaker, Sagemaker, sagemaker, or
LabelingFunction.
If the pre-annotation or post-annotation Lambda function names do not include one of the
terms in the preceding paragraph, or if you require more granular permission than those in the
AmazonSageMakerGroundTruthExecution managed policy, you can add a policy similar to the
following to give the execution role permission to invoke pre-annotation and post-annotation functions.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": [
                "arn:aws:lambda:<region>:<account-id>:function:<pre-annotation-lambda-name>",
                "arn:aws:lambda:<region>:<account-id>:function:<post-annotation-lambda-name>"
            ]
        }
    ]
}
As described in Post-annotation Lambda (p. 681), the post-annotation Lambda request includes the
location of the annotation data in Amazon S3. This location is identified by the s3Uri string in the
payload object. To process the annotations as they come in, even for a simple pass-through function, you need to assign the necessary permissions to the post-annotation Lambda execution role to read files from Amazon S3.
There are many ways that you can configure your Lambda to access annotation data in Amazon S3. Two
common ways are:
• Allow the Lambda execution role to assume the SageMaker execution role identified in roleArn in the
post-annotation Lambda request. This SageMaker execution role is the one used to create the labeling
job, and has access to the Amazon S3 output bucket where the annotation data is stored.
• Grant the Lambda execution role permission to access the Amazon S3 output bucket directly.
To allow a Lambda function to assume a SageMaker execution role, you must attach a policy to the
Lambda function's execution role, and modify the trust relationship of the SageMaker execution role to
allow Lambda to assume it.
1. Attach the following IAM policy to your Lambda function's execution role to assume the SageMaker execution role identified in Resource. Replace 222222222222 with an AWS account ID. Replace sm-execution-role with the name of the assumed role.
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::222222222222:role/sm-execution-role"
    }
}
2. Modify the trust policy of the SageMaker execution role to include the following Statement. Replace
222222222222 with an AWS account ID. Replace my-lambda-execution-role with the name of
the assumed role.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::222222222222:role/my-lambda-execution-role"
},
"Action": "sts:AssumeRole"
}
]
}
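With the policy and trust relationship in place, the post-annotation Lambda can assume the SageMaker execution role and read the annotation file. The following is a hedged sketch using standard boto3 STS and S3 calls; the session name is an arbitrary placeholder:

import boto3

def read_annotations(role_arn, s3_uri):
    # role_arn comes from the request's roleArn field;
    # s3_uri comes from the request's payload.s3Uri field.
    credentials = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="post-annotation-consolidation",
    )["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
    bucket, _, key = s3_uri.removeprefix("s3://").partition("/")
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")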
You can add a policy similar to the following to the post-annotation Lambda function execution role to
give it S3 read permissions. Replace DOC-EXAMPLE-BUCKET with the name of the output bucket you
specify when you create a labeling job.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
}
]
}
To add S3 read permissions to a Lambda execution role in the Lambda console, use the following
procedure.
• Search for and select AmazonS3ReadOnlyAccess to give the function permission to read all
buckets and objects in the account.
• If you require more granular permissions, select Create policy and use the policy example in
the preceding section to create a policy. Note that you must navigate back to the execution role
summary page after you create the policy.
If you created a new policy, navigate back to the Lambda execution role summary page and attach
the policy you just created.
• To learn how to create a Lambda function using the console, see Create a Lambda function with the
console.
• To learn how to create a Lambda function using the AWS CLI, see Using AWS Lambda with the AWS
Command Line Interface.
• Select the relevant section in the table of contents to learn more about working with Lambda in the
language of your choice. For example, select Working with Python to learn more about using Lambda
with the AWS SDK for Python (Boto3).
Ground Truth provides pre-annotation and post-annotation templates through an AWS Serverless
Application Repository (SAR) recipe. Use the following procedure to select the Ground Truth recipe in the
Lambda console.
Use the Ground Truth SAR recipe to create pre-annotation and post-annotation Lambda
functions:
Once the app deploys, two functions appear in the Functions section of the Lambda console: serverlessrepo-aws-sagema-GtRecipePreHumanTaskFunc-<id> and serverlessrepo-aws-sagema-GtRecipeAnnotationConsol-<id>.
6. Select one of these functions and add your custom logic in the Code section.
7. When you are finished making changes, select Deploy to deploy them.
You can use the sections on this page to learn how to test the Ground Truth pre-annotation and post-
annotation templates provided through an AWS Serverless Application Repository (SAR).
Topics
• Prerequisites (p. 690)
• Test the Pre-annotation Lambda Function (p. 690)
• Test the Post-Annotation Lambda Function (p. 691)
Prerequisites
You must do the following to use the tests described on this page.
• You need access to the Lambda console, and you need permission to create and invoke Lambda
functions. To learn how to set up these permissions, see Grant Permission to Create and Select an AWS
Lambda Function (p. 686).
• If you have not deployed the Ground Truth SAR recipe, use the procedure in Create Lambda Functions
for a Custom Labeling Workflow (p. 689) to do so.
• To test the post-annotation Lambda function, you must have a data file in Amazon S3 with sample annotation data. For a simple test, you can generate a placeholder file (see the sketch that follows this list), save it as sample-annotations.json, and upload the file to Amazon S3. Note the S3 URI of this file; you need this information to configure the post-annotation Lambda test.
• You must use the directions in Grant Post-Annotation Lambda Permissions to Access
Annotation (p. 687) to give your post-annotation Lambda function's execution role permission
to assume the SageMaker execution role you use to create the labeling job. The post-annotation
Lambda function uses the SageMaker execution role to access the annotation data file, sample-
annotations.json, in S3.
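The sample annotation data itself is not reproduced here. If you need a stand-in, the following sketch writes a minimal file that follows the payload schema from Post-annotation Lambda (p. 681); the worker ID and label are placeholders:

import json

sample_annotations = [
    {
        "datasetObjectId": "0",
        "dataObject": {"content": "Example text to classify."},
        "annotations": [
            {
                "workerId": "private.us-west-2.0123456789abcdef",
                "annotationData": {"content": json.dumps({"label": "positive"})},
            }
        ],
    }
]

with open("sample-annotations.json", "w") as f:
    json.dump(sample_annotations, f, indent=2)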
2. Select the pre-annotation function that was deployed from the Ground Truth SAR recipe. The name of this function is similar to serverlessrepo-aws-sagema-GtRecipePreHumanTaskFunc-<id>.
3. In the Code source section, select the arrow next to Test.
4. Select Configure test event.
5. Keep the Create new test event option selected.
6. Under Event template, select SageMaker Ground Truth PreHumanTask.
7. Give your test an Event name.
8. Select Create.
9. Select the arrow next to Test again and you should see that the test you created is selected, which is
indicated with a dot by the event name. If it is not selected, select it.
10. Select Test to run the test.
After you run the test, you can see the Execution results. In the Function logs, you should see that the
Lambda function's output matches the required pre-annotation response syntax:
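For example (the taskObject value here is illustrative; isHumanAnnotationRequired is optional in a pre-annotation response):

{
    "taskInput": {
        "taskObject": "s3://your-bucket/image.jpg"
    },
    "isHumanAnnotationRequired": "true"
}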
Test the Post-Annotation Lambda Function
Use the following procedure to test the post-annotation Lambda function created when you deployed
the Ground Truth AWS Serverless Application Repository (SAR) recipe. Configure a test event the same
way that you did for the pre-annotation function, this time using the post-annotation event template,
and then modify the default test event as follows:
• Replace the Amazon Resource Name (ARN) in roleArn with the ARN of the SageMaker execution
role you used to create the labeling job.
• Replace the S3 URI in s3Uri with the URI of the sample-annotations.json file you added to
Amazon S3.
After you make these modifications, your test should look similar to the following:
{
    "version": "2018-10-16",
    "labelingJobArn": "arn:aws:sagemaker:us-east-2:123456789012:labeling-job/example-job",
    "labelAttributeName": "example-attribute",
    "roleArn": "arn:aws:iam::222222222222:role/sm-execution-role",
    "payload": {
        "s3Uri": "s3://your-bucket/sample-annotations.json"
    }
}
9. Select Create.
10. Select the arrow next to Test again and you should see that the test you created is selected, which is
indicated with a dot by the event name. If it is not selected, select it.
11. Select Test to run the test.
After you run the test, you should see a -- Consolidated Output -- section in the Function Logs,
which contains a list of all annotations included in sample-annotations.json.
This demonstration works with the BoundingBox template. The demonstration also works with the AWS
Lambda functions needed for processing your data before and after the task. In the GitHub repository
above, to find templates that work with AWS Lambda functions, look for {{ task.input.<property
name> }} in the template.
Topics
• Starter Bounding Box custom template (p. 692)
• Your own Bounding Box custom template (p. 693)
• Your manifest file (p. 694)
• Your pre-annotation Lambda function (p. 695)
• Your post-annotation Lambda function (p. 695)
• The output of your labeling job (p. 696)
Starter Bounding Box custom template
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-bounding-box
name="boundingBox"
src="{{ task.input.taskObject | grant_read_access }}"
header="{{ task.input.header }}"
labels="{{ task.input.labels | to_json | escape }}"
>
<!-- The <full-instructions> tag is where you will define the full instructions of your
task. -->
<full-instructions header="Bounding Box Instructions" >
    <p>Use the bounding box tool to draw boxes around the requested target of interest:</p>
<ol>
<li>Draw a rectangle using your mouse over each instance of the target.</li>
<li>Make sure the box does not cut into the target, leave a 2 - 3 pixel margin</li>
<li>
When targets are overlapping, draw a box around each object,
include all contiguous parts of the target in the box.
Do not include parts that are completely overlapped by another object.
</li>
<li>
Do not include parts of the target that cannot be seen,
even though you think you can interpolate the whole shape of the target.
</li>
      <li>Avoid shadows; they're not considered a part of the target.</li>
<li>If the target goes off the screen, label up to the edge of the image.</li>
</ol>
</full-instructions>
<!-- The <short-instructions> tag allows you to specify instructions that are displayed
in the left hand side of the task interface.
It is a best practice to provide good and bad examples in this section for quick
reference. -->
<short-instructions>
Use the bounding box tool to draw boxes around the requested target of interest.
</short-instructions>
</crowd-bounding-box>
</crowd-form>
The custom templates use the Liquid template language, and each of the items between double
curly braces is a variable. The pre-annotation AWS Lambda function should provide an object named
taskInput and that object's properties can be accessed as {{ task.input.<property name> }} in
your template.
In the starter sample, there are three variables: taskObject, header, and labels.
• taskObject is an HTTP(S) URL or S3 URI for the photo to be annotated. The added |
grant_read_access is a filter that will convert an S3 URI to an HTTPS URL with short-lived access to
that resource. If you're using an HTTP(S) URL, it's not needed.
• header is the text above the photo to be labeled, something like "Draw a box around the bird in the
photo."
• labels is an array, represented as ['item1', 'item2', ...]. These are labels that can be
assigned by the worker to the different boxes they draw. You can have one or many.
Each of the variable names comes from the JSON object in the response from your pre-annotation
Lambda. The names above are merely suggestions. Use whatever variable names make sense to you and
will promote code readability among your team.
Only use variables when necessary
If a field will not change, you can remove that variable from the template and replace it with
that text; otherwise, you have to repeat that text as a value in each object in your manifest or
code it into your pre-annotation Lambda function.
Your own Bounding Box custom template
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-bounding-box
name="boundingBox"
labels="[ '{{ task.input.animal }}' ]"
src="{{ task.input.source-ref | grant_read_access }}"
header="Draw a box around the {{ task.input.animal }}."
>
<full-instructions header="Bounding Box Instructions" >
<p>Draw a bounding box around the {{ task.input.animal }} in the image. If
there is more than one {{ task.input.animal }} per image, draw a bounding
box around the largest one.</p>
<p>The box should be tight around the {{ task.input.animal }} with
no more than a couple of pixels of buffer around the
edges.</p>
    <p>If the image does not contain a {{ task.input.animal }}, check the
    <strong>Nothing to label</strong> box.</p>
  </full-instructions>
<short-instructions>
<p>Draw a bounding box around the {{ task.input.animal }} in each image. If
there is more than one {{ task.input.animal }} per image, draw a bounding
box around the largest one.</p>
</short-instructions>
</crowd-bounding-box>
</crowd-form>
Note the re-use of {{ task.input.animal }} throughout the template. If your manifest had
all of the animal names beginning with a capital letter, you could use {{ task.input.animal |
downcase }}, incorporating one of Liquid's built-in filters in sentences where it needs to appear in
lowercase.
Your pre-annotation Lambda function
When you're using the console, if you have AWS Lambda functions that are owned by your account, a
drop-down list of functions that meet the naming requirements is provided so that you can choose one.
In this very basic example, you're just passing the information from the manifest through without doing
any additional processing on it. This sample pre-annotation function is written for Python 3.7.
import json

def lambda_handler(event, context):
    # Pass the manifest data object through as the task input.
    return {
        "taskInput": event['dataObject']
    }
The JSON object from your manifest will be provided as a child of the event object. The properties
inside the taskInput object will be available as variables to your template, so simply setting the value
of taskInput to event['dataObject'] will pass all the values from your manifest object to your
template without having to copy them individually. If you wish to send more values to the template, you
can add them to the taskInput object.
Your post-annotation Lambda function

import json
import boto3
from urllib.parse import urlparse

def lambda_handler(event, context):
    # Download the consolidation request data from S3.
    parsed_url = urlparse(event['payload']['s3Uri'])
    s3 = boto3.client('s3')
    textFile = s3.get_object(Bucket=parsed_url.netloc, Key=parsed_url.path[1:])
    filecont = textFile['Body'].read()
    annotations = json.loads(filecont)

    # Build one consolidated label per dataset object.
    consolidated_labels = []
    for dataset in annotations:
        for annotation in dataset['annotations']:
            new_annotation = json.loads(annotation['annotationData']['content'])
            label = {
                'datasetObjectId': dataset['datasetObjectId'],
                'consolidatedAnnotation': {
                    'content': {
                        event['labelAttributeName']: {
                            'workerId': annotation['workerId'],
                            'boxesInfo': new_annotation,
                            'imageSource': dataset['dataObject']
                        }
                    }
                }
            }
            consolidated_labels.append(label)

    return consolidated_labels
The post-annotation Lambda will often receive batches of task results in the event object. That batch
will be the payload object the Lambda should iterate through. What you send back will be an object
meeting the API contract (p. 678).
The output of your labeling job
For a bounding box task, the output you find in the output manifest will look a bit like the demo below.
The example has been cleaned up for printing. The actual output will be a single line per record.
{
    "source-ref": "<URL>",
    "<label attribute name>": {
        "workerId": "<worker ID>",
        "imageSource": "<image URL>",
        "boxesInfo": "{\"boundingBox\":{\"boundingBoxes\":[{\"height\":878, \"label\":\"bird\", \"left\":208, \"top\":6, \"width\":809}], \"inputImageProperties\":{\"height\":924, \"width\":1280}}}"
    },
    "<label attribute name>-metadata": {
        "type": "groundTruth/custom",
        "job_name": "<Labeling job name>",
        "human-annotated": "yes"
    },
    "animal": "bird"
}
Note how the additional animal attribute from your original manifest is passed to the output manifest
on the same level as the source-ref and labeling data. Any properties from your input manifest,
whether they were used in your template or not, will be passed to the output manifest.
In this demonstration, you work with the Intent Detection template, which uses the crowd-
classifier (p. 903) element, and the AWS Lambda functions needed for processing your data
before and after the task.
Topics
• Starter Intent Detection custom template (p. 697)
• Your Intent Detection custom template (p. 697)
• Your pre-annotation Lambda function (p. 701)
• Your post-annotation Lambda function (p. 701)
• Your labeling job output (p. 702)
Starter Intent Detection custom template
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="intent"
categories="{{ task.input.labels | to_json | escape }}"
header="Pick the most relevant intention expressed by the below text"
>
<classification-target>
{{ task.input.utterance }}
</classification-target>
<short-instructions>
Pick the most relevant intention expressed by the text
</short-instructions>
</crowd-classifier>
</crowd-form>
The custom templates use the Liquid template language, and each of the items between double
curly braces is a variable. The pre-annotation AWS Lambda function should provide an object named
taskInput and that object's properties can be accessed as {{ task.input.<property name> }} in
your template.
Unless you need to offer different sets of labels with different utterances, avoiding a variable and
just using text will save processing time and create less possibility of error. The template used in this
demonstration will remove that variable, but variables and filters like to_json are explained in more
detail in the crowd-bounding-box demonstration article.
Two parts of these custom elements that sometimes get overlooked are the <full-instructions>
and <short-instructions> regions. Good instructions generate good results.
In the elements that include these regions, the <short-instructions> appear automatically in the
"Instructions" pane on the left of the worker's screen. The <full-instructions> are linked from the
"View full instructions" link near the top of that pane. Clicking the link opens a modal pane with more
detailed instructions.
Not only can you use HTML, CSS, and JavaScript in these sections, you are encouraged to do so if you
believe you can provide a strong set of instructions and examples that will help workers complete your
tasks with better speed and accuracy.
Try out an example <crowd-classifier> task. The example is rendered by JSFiddle, so all the
template variables are replaced with hard-coded values. Click the "View full instructions" link to see a set
of examples with extended CSS styling. You can fork the project to experiment with your own changes to
the CSS, adding sample images, or adding extended JavaScript functionality.
Your Intent Detection custom template
This uses the example <crowd-classifier> task, but with a variable for the <classification-
target>. If you are trying to keep a consistent CSS design among a series of different labeling jobs, you
can include an external stylesheet using a <link rel...> element the same way you'd do in any other
HTML document.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="intent"
categories="['buy', 'eat', 'watch', 'browse', 'leave']"
header="Pick the most relevant intent expressed by the text below"
>
<classification-target>
{{ task.input.source }}
</classification-target>
<short-instructions>
What is the speaker expressing they would like to do next?
</short-instructions>
</crowd-classifier>
</crowd-form>
<style>
greenbg {
background: #feee23;
display: block;
}
table {
*border-collapse: collapse; /* IE7 and lower */
border-spacing: 0;
}
th:first-child {
border-radius: 6px 0 0 0;
}
th:last-child {
border-radius: 0 6px 0 0;
699
Amazon SageMaker Developer Guide
Creating Custom Labeling Workflows
th:only-child{
border-radius: 6px 6px 0 0;
}
tfoot:first-child {
border-radius: 0 0 6px 0;
}
tfoot:last-child {
border-radius: 0 0 0 6px;
}
tfoot:only-child{
border-radius: 6px 6px;
}
td {
padding-left: 15px ;
padding-right: 15px ;
}
botchoice {
display: block;
height: 17px;
width: 490px;
overflow: hidden;
position: relative;
background: #fff;
padding-bottom: 20px;
}
botchoice:after {
position: absolute;
bottom: 0;
left: 0;
height: 100%;
width: 100%;
content: "";
background: linear-gradient(to top,
rgba(255,255,255, 1) 55%,
rgba(255,255,255, 0) 100%
);
pointer-events: none; /* so the text is still selectable */
}
</style>
If you are preparing your manifest file manually for a text-classification task like this, have your data
formatted in the following manner.
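For example, a manifest for this task might contain one JSON line per data object, similar to the following (the utterances are illustrative):

{"source": "Roses are red"}
{"source": "Violets are blue"}
{"source": "Sugar is sweet"}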
This differs from the manifest file used for the "Demo Template: Annotation of Images with crowd-
bounding-box (p. 692)" demonstration in that source-ref was used as the property name instead
of source. The use of source-ref designates S3 URIs for images or other files that must be converted
to HTTP. Otherwise, source should be used like it is with the text strings above.
Your pre-annotation Lambda function
This Lambda function is required to have one of the following four strings as part of the function name:
SageMaker, Sagemaker, sagemaker, or LabelingFunction.
When you're using the console, if you have Lambda functions that are owned by your account, a drop-down
list of functions that meet the naming requirements is provided so that you can choose one.
In this very basic sample, where you have only one variable, it's primarily a pass-through function. Here's
a sample pre-labeling Lambda using Python 3.7.
import json

def lambda_handler(event, context):
    # Pass the manifest data object through as the task input.
    return {
        "taskInput": event['dataObject']
    }
The dataObject property of the event contains the properties from a data object in your manifest.
In this demonstration, which is a simple pass through, you just pass that straight through as
the taskInput value. If you add properties with those values to the event['dataObject']
object, they will be available to your HTML template as Liquid variables with the format
{{ task.input.<property name> }}.
Your post-annotation Lambda function

import json
import boto3
from urllib.parse import urlparse

def lambda_handler(event, context):
    # Download the consolidation request data from S3.
    parsed_url = urlparse(event['payload']['s3Uri'])
    s3 = boto3.client('s3')
    textFile = s3.get_object(Bucket=parsed_url.netloc, Key=parsed_url.path[1:])
    filecont = textFile['Body'].read()
    annotations = json.loads(filecont)

    # Build one consolidated label per dataset object.
    consolidated_labels = []
    for dataset in annotations:
        for annotation in dataset['annotations']:
            new_annotation = json.loads(annotation['annotationData']['content'])
            label = {
                'datasetObjectId': dataset['datasetObjectId'],
                'consolidatedAnnotation': {
                    'content': {
                        event['labelAttributeName']: {
                            'workerId': annotation['workerId'],
                            'result': new_annotation,
                            'labeledContent': dataset['dataObject']
                        }
                    }
                }
            }
            consolidated_labels.append(label)

    return consolidated_labels
Your labeling job output
You'll find the output of the job in a folder named after your labeling job in the target S3 bucket you
specified. It will be in a subfolder named manifests.
For an intent detection task, the output in the output manifest will look a bit like the demo below. The
example has been cleaned up and spaced out to be easier for humans to read. The actual output will be
more compressed for machine reading.
[
    {
        "datasetObjectId": "<Number representing item's place in the manifest>",
        "consolidatedAnnotation": {
            "content": {
                "<name of labeling job>": {
                    "workerId": "private.us-east-1.XXXXXXXXXXXXXXXXXXXXXX",
                    "result": {
                        "intent": {
                            "label": "<label chosen by worker>"
                        }
                    },
                    "labeledContent": {
                        "content": "<text content that was labeled>"
                    }
                }
            }
        }
    },
    {
        "datasetObjectId": "<Number representing item's place in the manifest>",
        "consolidatedAnnotation": {
            "content": {
                "<name of labeling job>": {
                    "workerId": "private.us-east-1.6UDLPKQZHYWJQSCA4MBJBB7FWE",
                    "result": {
                        "intent": {
                            "label": "<label chosen by worker>"
                        }
                    },
                    "labeledContent": {
                        "content": "<text content that was labeled>"
                    }
                }
            }
        }
    },
    ...
]
This should help you create and use your own custom template.
Use the CreateLabelingJob action to configure your task. You'll use the location of a custom template
(Step 2: Creating your custom worker task template (p. 672)) stored in a <filename>.liquid.html
file on S3 as the value for the UiTemplateS3Uri field in the UiConfig object within the
HumanTaskConfig object.
For the AWS Lambda tasks described in Step 3: Processing with AWS Lambda (p. 678), the post-
annotation task's ARN will be used as the value for the AnnotationConsolidationLambdaArn field,
and the pre-annotation task's ARN will be used as the value for the PreHumanTaskLambdaArn.
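Schematically, those fields fit together as follows. The bucket, key, and function names below are placeholders:

HumanTaskConfig={
    'UiConfig': {
        # Your custom worker task template in S3
        'UiTemplateS3Uri': 's3://your-bucket/templates/my-template.liquid.html'
    },
    # Pre-annotation Lambda from Step 3
    'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:111122223333:function:sagemaker-pre-annotation',
    'AnnotationConsolidationConfig': {
        # Post-annotation Lambda from Step 3
        'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:111122223333:function:sagemaker-post-annotation'
    },
    # ... remaining task configuration ...
}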
Before you create a labeling job, it is recommended that you review the following pages, as applicable:
• You can specify your input data using automated data setup in the console, or an input manifest
file in either the console or the CreateLabelingJob API. For automated data setup, see
Automated Data Setup (p. 736). To learn how to create an input manifest file, see Use an Input
Manifest File (p. 735).
• Review labeling job input data quotas: Input Data Quotas (p. 742).
After you have chosen your task type, use the topics on this page to learn how to create a labeling job.
If you are a new Ground Truth user, we recommend that you start by walking through the demo in
Getting started (p. 527).
Important
Ground Truth requires all S3 buckets that contain labeling job input image data to have a CORS
policy attached. To learn more, see CORS Permission Requirement (p. 816).
Topics
• Built-in Task Types (p. 704)
• Creating Instruction Pages (p. 704)
• Create a Labeling Job (Console) (p. 706)
• Create a Labeling Job (API) (p. 709)
• Create a Streaming Labeling Job (p. 714)
• Create a Labeling Category Configuration File with Label Category and Frame Attributes (p. 719)
Built-in Task Types
• Label Images
• Label Text
• Label Videos and Video Frames
• Label 3D Point Clouds
Note
Each of the video frame and 3D point cloud task types has an adjustment task type that you use
to verify and adjust labels from a previous labeling job. Select a video frame or 3D point cloud
task type page above to learn how to adjust labels created using that task type.
Creating Instruction Pages
• Short instructions—instructions that are shown on the same webpage where the worker completes
their task. These instructions should provide an easy reference to show the worker the correct way to
label an object.
• Full instructions—instructions that are shown on a dialog box that overlays the page where the worker
completes their task. We recommend that you provide detailed instructions for completing the task
with multiple examples showing edge cases and other difficult situations for labeling objects.
Create instructions in the console when you are creating your labeling job. Start with the existing
instructions for the task and use the editor to modify them to suit your labeling job.
Note
Once you create your labeling job, it will automatically start and you will not be able to modify
your worker instructions. If you need to change your worker instructions, stop the labeling job
that you created, clone it, and modify your worker instructions before creating a new job.
You can clone a labeling job in the console by selecting the labeling job and then selecting
Clone in the Actions menu.
To clone a labeling job using the Amazon SageMaker API or your preferred Amazon SageMaker
SDK, make a new request to the CreateLabelingJob operation with the same specifications
as your original job after modifying your worker instructions.
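For example, with the AWS SDK for Python (Boto3), a minimal clone-and-modify sketch might look like the following. The job names and template location are placeholders, and this assumes the revised worker instructions live in a new worker task template:

import boto3

sm = boto3.client("sagemaker")

# Read the configuration of the job whose instructions you want to change.
job = sm.describe_labeling_job(LabelingJobName="original-job")

# Point the UI at a template that contains the revised worker instructions.
human_task_config = job["HumanTaskConfig"]
human_task_config["UiConfig"]["UiTemplateS3Uri"] = "s3://your-bucket/revised-template.html"

# Create a new job with the same specifications and a new name.
sm.create_labeling_job(
    LabelingJobName="original-job-v2",
    LabelAttributeName=job["LabelAttributeName"],
    InputConfig=job["InputConfig"],
    OutputConfig=job["OutputConfig"],
    RoleArn=job["RoleArn"],
    HumanTaskConfig=human_task_config,
)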
Short Instructions
Short instructions appear on the same web page that workers use to label your data object. For example,
the following is the editing page for a bounding box task. The short instructions panel is on the left.
Keep in mind that a worker will only spend seconds looking at the short instructions. Workers must be
able to scan and understand your information quickly. In all cases it should take less time to understand
the instructions than it takes to complete the task. Keep these points in mind:
• Pictures are better than words. Create a simple illustration of your task that your workers can
immediately understand.
• If you must use words, use short, concise examples.
• Your short instructions are more important than your full instructions.
The Amazon SageMaker Ground Truth console provides an editor so that you can create your short
instructions. Replace the placeholder text and images with instructions for your task. Preview the
worker's task page by choosing Preview. The preview opens in a new window; be sure to turn off pop-up
blocking so that the window appears.
Full Instructions
You can provide additional instructions for your workers in a dialog box that overlays the page where
workers label your data objects. Use full instructions to explain more complex tasks and to show workers
the proper way to label edge cases or other difficult objects.
You can create full instructions using an editor in the Ground Truth console. As with quick instructions,
keep the following in mind:
• Workers will want detailed instructions the first few times that they complete your task. Any information
that they must have should be in the quick instructions.
• Pictures are more important than words.
• Text should be concise.
• Full instructions should supplement the short instructions. Don't repeat information that appears in
the short instructions.
The Ground Truth console provides an editor so that you can create your full instructions. Replace the
placeholder text and images with instructions for your task. Preview the full instruction page by choosing
Preview. The preview opens in a new window; be sure to turn off pop-up blocking so that the window
appears.
• Place the cursor where the image should go in the instructions editor.
• Click the image icon in the editor toolbar.
• Enter the URL of your image.
Create a Labeling Job (Console)
You need to provide the following to create a labeling job in the SageMaker console:
• An input manifest file in Amazon S3. You can place your input dataset in Amazon S3 and automatically
generate a manifest file using the Ground Truth console (not supported for 3D point cloud labeling
jobs).
Alternatively, you can manually create an input manifest file. To learn how, see Input Data (p. 734).
• An Amazon S3 bucket to store your output data.
• An IAM role with permission to access your resources in Amazon S3 and with a SageMaker
execution policy attached. For a general solution, you can attach the managed policy,
AmazonSageMakerFullAccess, to an IAM role and include sagemaker in your bucket name.
For more granular policies, see the section called “IAM Permissions” (p. 817).
3D point cloud task types have additional security considerations. Learn more.
• A work team. You create a work team from a workforce made up of Amazon Mechanical Turk workers,
vendors, or your own private workers. To learn more, see Create and Manage Workforces (p. 863).
You cannot use the Mechanical Turk workforce for 3D point cloud or video frame labeling jobs.
• If you are using a custom labeling workflow, you must save a worker task template in Amazon S3 and
provide an Amazon S3 URI for that template. For more information, see Step 2: Creating your custom
worker task template (p. 672).
• (Optional) An AWS KMS key ARN if you want SageMaker to encrypt the output of your labeling job
using your own AWS KMS encryption key instead of the default Amazon S3 service key.
• (Optional) Existing labels for the dataset you use for your labeling job. Use this option if you want
workers to adjust, or approve and reject labels.
• If you want to create an adjustment or verification labeling job, you must have an output manifest file
in Amazon S3 that contains the labels you want adjusted or verified. This option is only supported for
bounding box and semantic segmentation image labeling jobs and 3D point cloud and video frame
labeling jobs. It is recommended that you use the instructions on Verify and Adjust Labels (p. 664) to
create a verification or adjustment labeling job.
Important
Your work team, input manifest file, output bucket, and other resources in Amazon S3 must be
in the same AWS Region you use to create your labeling job.
When you create a labeling job using the SageMaker console, you add worker instructions and labels to
the worker UI that Ground Truth provides. You can preview and interact with the worker UI while creating
your labeling job in the console. You can also see a preview of the worker UI on your built-in task type
page.
• Follow the instructions in Automated Video Frame Input Data Setup (p. 773) for video frame
labeling jobs.
• For Manual data setup:
• For Input dataset location, provide the location in Amazon S3 in which your input manifest file
is located. For example, if your input manifest file, manifest.json, is located in example-bucket,
enter s3://example-bucket/manifest.json.
• For Output dataset location, provide the location in Amazon S3 where you want Ground Truth
to store the output data from your labeling job.
7. For IAM Role, choose an existing IAM role or create an IAM role with permission to access your
resources in Amazon S3, to write to the output Amazon S3 bucket specified above, and with a
SageMaker execution policy attached.
8. (Optional) For Additional configuration, you can specify how much of your dataset you want
workers to label, and if you want SageMaker to encrypt the output data for your labeling job using
an AWS KMS encryption key. To encrypt your output data, you must have the required AWS KMS
permissions attached to the IAM role you provided in the previous step. For more details, see the
section called “IAM Permissions” (p. 817).
9. In the Task type section, under Task category, use the dropdown list to select your task category.
10. In Task selection, choose your task type.
11. (Optional) Provide tags for your labeling job to make it easier to find in the console later.
12. Choose Next.
13. In the Workers section, choose the type of workforce you would like to use. For more details about
your workforce options see Create and Manage Workforces (p. 863).
14. (Optional) After you've selected your workforce, specify the Task timeout. This is the maximum
amount of time a worker has to work on a task.
For 3D point cloud annotation tasks, the default task timeout is 3 days. The default timeout for text
and image classification and label verification labeling jobs is 5 minutes. The default timeout for all
other labeling jobs is 60 minutes.
15. (Optional) For bounding box, semantic segmentation, video frame, and 3D point cloud task types,
you can select Display existing labels if you want to display labels for your input data set for
workers to verify or adjust.
For bounding box and semantic segmentation labeling jobs, this will create an adjustment labeling
job.
• Select Adjustment to create an adjustment labeling job. When you select this option, you can add
new labels but you cannot remove or edit existing labels from the previous job. Optionally, you
can choose label category attributes and frame attributes that you want workers to edit. To make
an attribute editable, select the check box Allow workers to edit this attribute for that attribute.
Optionally, you can add new label category and frame attributes.
• Select Verification to create a verification labeling job. When you select this option, you cannot
add, modify, or remove existing labels from the previous job. Optionally, you can choose label
category attributes and frame attributes that you want workers to edit. To make an attribute
editable, select the check box Allow workers to edit this attribute for that attribute.
We recommend that you add new label category attributes to the labels that you want
workers to verify, or add one or more frame attributes to have workers provide information about
the entire frame.
For more information, see Verify and Adjust Labels (p. 664).
• If you are using a built-in task type, specify worker instructions and labels.
• For image classification and text classification (single and multi-label) you must specify at
least two label categories. For all other built-in task types, you must specify at least one label
category.
• (Optional) If you are creating a 3D point cloud or video frame labeling job, you can specify
label category attributes (not supported for 3D point cloud semantic segmentation) and frame
attributes. Label category attributes can be assigned to one or more labels. Frame attributes
will appear on each point cloud or video frame workers label. To learn more, see Worker User
Interface (UI) (p. 631) for 3D point cloud and Worker User Interface (UI) (p. 577) for video
frame.
• (Optional) Add Additional instructions to help your worker complete your task.
• If you are creating a custom labeling workflow, you must:
• Enter a custom template in the code box. Custom templates can be created using a combination
of HTML, the Liquid templating language and our pre-built web components. Optionally, you
can choose a base-template from the drop-down menu to get started.
• Specify pre-annotation and post-annotation Lambda functions. To learn how to create these
functions, see Step 3: Processing with AWS Lambda (p. 678).
17. (Optional) You can select See preview to preview your worker instructions and labels and to interact
with the worker UI. Make sure the pop-up blocker of the browser is disabled before generating the
preview.
18. Choose Create.
After you've successfully created your labeling job, you are redirected to the Labeling jobs page. The
status of the labeling job you just created is In progress. This status progressively updates as workers
complete your tasks. When all tasks are successfully completed, the status changes to Completed.
If an issue occurs while creating the labeling job, its status changes to Failed.
To view more details about the job, choose the labeling job name.
Next Steps
After your labeling job status changes to Completed, you can view your output data in the Amazon S3
bucket that you specified while creating that labeling job. For details about the format of your output
data, see Output Data (p. 776).
Create a Labeling Job (API)
• If you are using a custom labeling workflow, you can create a custom template and save the
template in your S3 bucket. To learn how to build a custom worker template, see Step 2: Creating
your custom worker task template (p. 672). For custom HTML elements that you can use to
customize your template, see Crowd HTML Elements Reference (p. 889). For a repository of demo
templates for a variety of labeling tasks, see Amazon SageMaker Ground Truth Sample Task UIs .
• An input manifest file that specifies your input data in Amazon S3. Specify the location of your
input manifest file in ManifestS3Uri. For information about creating an input manifest, see Input
Data (p. 734). If you create a streaming labeling job, this is optional. To learn how to create a
streaming labeling job, see Create a Streaming Labeling Job (p. 714).
• An Amazon S3 bucket to store your output data. You specify this bucket, and optionally, a prefix in
S3OutputPath.
• A label category configuration file. Each label category name must be unique. Specify the location
of this file in Amazon S3 using the LabelCategoryConfigS3Uri parameter. The format and label
categories for this file depend on the task type you use:
• For image classification and text classification (single and multi-label) you must specify at least two
label categories. For all other task types, the minimum number of label categories required is one.
• For named entity recognition tasks, you must provide worker instructions in this file. See Provide
Worker Instructions in a Label Category Configuration File (p. 555) for details and an example.
• For 3D point cloud and video frame task types, use the format in Create a Labeling Category
Configuration File with Label Category and Frame Attributes (p. 719).
• For all other built-in task types and custom tasks, your label category configuration file must be
a JSON file in the following format. Identify the labels you want to use by replacing label_1,
label_2,...,label_n with your label categories.
{
    "document-version": "2018-11-28",
    "labels": [
        {"label": "label_1"},
        {"label": "label_2"},
        ...
        {"label": "label_n"}
    ]
}
If your input or output bucket name does not contain sagemaker, you can attach a policy similar to
the following to the role that is passed to the CreateLabelingJob operation.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::my_input_bucket/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::my_output_bucket/*"
]
}
]
}
If you use the Amazon Mechanical Turk workforce, use the following ARN for WorkteamArn. Replace
region with the AWS Region you are using to create the labeling job:
arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default
If you use the Amazon Mechanical Turk workforce, use the ContentClassifiers parameter in
DataAttributes of InputConfig to declare that your content is free of personally identifiable
information and adult content.
Ground Truth requires that your input data is free of personally identifiable information (PII) if you
use the Mechanical Turk workforce. If you use Mechanical Turk and do not specify that your input
data is free of PII using the FreeOfPersonallyIdentifiableInformation flag, your labeling
job will fail. Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Amazon Mechanical Turk workers that can view your task if it
contains adult content.
To learn more about work teams and workforces, see Create and Manage Workforces (p. 863).
• If you use the Mechanical Turk workforce, you must specify the price you'll pay workers for performing
a single task in PublicWorkforceTaskPrice.
• To configure the task, you must provide a task description and title using TaskDescription and
TaskTitle respectively. Optionally, you can provide time limits that control how long the workers
have to work on an individual task (TaskTimeLimitInSeconds) and how long tasks remain in the
worker portal, available to workers (TaskAvailabilityLifetimeInSeconds).
• (Optional) For some task types, you can have multiple workers label a single data object by inputting
a number greater than one for the NumberOfHumanWorkersPerDataObject parameter. For more
information about annotation consolidation, see Consolidate Annotations (p. 806).
• (Optional) To create an automated data labeling job, specify one of the ARNs listed in
LabelingJobAlgorithmSpecificationArn in LabelingJobAlgorithmsConfig. This ARN identifies
the algorithm used in the automated data labeling job. The task type associated with this ARN must
match the task type of the PreHumanTaskLambdaArn and AnnotationConsolidationLambdaArn
you specify. Automated data labeling is supported for the following task types: image classification,
bounding box, semantic segmentation, and text classification. The minimum number of objects
allowed for automated data labeling is 1,250, and we strongly suggest providing a minimum of 5,000
objects. To learn more about automated data labeling jobs, see Automate Data Labeling (p. 807).
• (Optional) You can provide StoppingConditions that cause the labeling job to stop if one of the
conditions is met. You can use stopping conditions to control the cost of the labeling job.
Examples
The following code examples demonstrate how to create a labeling job using CreateLabelingJob.
For additional examples, we recommend you use one of the Ground Truth Labeling Jobs Jupyter
notebooks in the SageMaker Examples section of a SageMaker notebook instance. To learn how to use
a notebook example from the SageMaker Examples, see Example Notebooks (p. 220). You can also see
these example notebooks on GitHub in the SageMaker Examples repository.
The following is an example of an AWS Python SDK (Boto3) request to create a labeling job for a
built-in task type in the US East (N. Virginia) Region using a private workforce. Replace all
red-italicized text with your labeling job resources and specifications.
response = client.create_labeling_job(
LabelingJobName="example-labeling-job",
LabelAttributeName="label",
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': "s3://bucket/path/manifest-with-input-data.json"
}
},
'DataAttributes': {
'ContentClassifiers': [
"FreeOfPersonallyIdentifiableInformation"|"FreeOfAdultContent",
]
}
},
OutputConfig={
'S3OutputPath': "s3://bucket/path/file-to-store-output-data",
'KmsKeyId': "string"
},
RoleArn="arn:aws:iam::*:role/*",
LabelCategoryConfigS3Uri="s3://bucket/path/label-categories.json",
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': "arn:aws:sagemaker:region:*:workteam/private-crowd/*",
'UiConfig': {
'UiTemplateS3Uri': "s3://bucket/path/custom-worker-task-template.html"
},
    'PreHumanTaskLambdaArn': "arn:aws:lambda:us-east-1:432418664414:function:PRE-tasktype",
'TaskKeywords': [
"Images",
"Classification",
"Multi-label"
],
'TaskTitle': "Multi-label image classification task",
'TaskDescription': "Select all labels that apply to the images shown",
'NumberOfHumanWorkersPerDataObject': 1,
'TaskTimeLimitInSeconds': 3600,
'TaskAvailabilityLifetimeInSeconds': 21600,
'MaxConcurrentTaskCount': 1000,
    'AnnotationConsolidationConfig': {
        'AnnotationConsolidationLambdaArn': "arn:aws:lambda:us-east-1:432418664414:function:ACS-tasktype"
    }
},
Tags=[
    {
        'Key': "string",
'Value': "string"
},
]
)
AWS CLI
The following is an example of an AWS CLI request to create a labeling job for a built-in task type in
the US East (N. Virginia) Region using the Amazon Mechanical Turk workforce. For more information,
see create-labeling-job in the AWS CLI Command Reference. Replace all red-italicized text with
your labeling job resources and specifications.
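A sketch of such a request follows. The resource names, S3 paths, and template location are placeholders, and the PRE and ACS Lambda ARNs shown are for the us-east-1 Region and vary by task type:

aws sagemaker create-labeling-job \
    --labeling-job-name "example-labeling-job" \
    --label-attribute-name "label" \
    --role-arn "arn:aws:iam::123456789012:role/example-sagemaker-execution-role" \
    --label-category-config-s3-uri "s3://bucket/path/label-categories.json" \
    --input-config '{"DataSource": {"S3DataSource": {"ManifestS3Uri": "s3://bucket/path/manifest-with-input-data.json"}}, "DataAttributes": {"ContentClassifiers": ["FreeOfPersonallyIdentifiableInformation", "FreeOfAdultContent"]}}' \
    --output-config '{"S3OutputPath": "s3://bucket/path/output"}' \
    --human-task-config '{
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:394669845002:workteam/public-crowd/default",
        "UiConfig": {"UiTemplateS3Uri": "s3://bucket/path/worker-task-template.html"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-tasktype",
        "TaskTitle": "Example task",
        "TaskDescription": "Select the label that applies to the image shown",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        "PublicWorkforceTaskPrice": {"AmountInUsd": {"Dollars": 0, "Cents": 3, "TenthFractionsOfACent": 6}},
        "AnnotationConsolidationConfig": {"AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-tasktype"}
    }'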
For more information about this operation, see CreateLabelingJob. For information about how to use
other language-specific SDKs, see See Also in the CreateLabelingJob topic.
Use the following sections to create the resources that you need and can use to create a streaming
labeling job:
• Learn how to create SNS topics with the permissions required for Ground Truth streaming labeling jobs
by following the steps in Create Amazon SNS Input and Output Topics (p. 714). Your SNS topics must
be created in the same AWS Region as your labeling job.
• See Subscribe an Endpoint to Your Amazon SNS Output Topic (p. 716) to learn how to set up an
endpoint to receive labeling task output data at a specified endpoint each time a labeling task is
completed.
• To learn how to configure your Amazon S3 bucket to send notifications to your Amazon SNS input
topic, see Set up Amazon S3 Bucket Event Notifications (p. 717).
• Optionally, add data objects that you want to have labeled as soon as the labeling job starts to your
input manifest. For more information, see Create a Manifest File (Optional) (p. 717).
• There are other resources required to create a labeling job, such as an IAM role, Amazon S3 bucket, a
worker task template and label categories. These are described in the Ground Truth documentation on
creating a labeling job. For more information, see Create a Labeling Job (p. 703).
Important
When you create a labeling job you must provide an IAM execution role. Attach the AWS
managed policy AmazonSageMakerGroundTruthExecution to this role to ensure it has
required permissions to execute your labeling job.
When you submit a request to create a streaming labeling job, the state of your labeling job is
Initializing. Once the labeling job is active, the state changes to InProgress. Do not send new
data objects to your labeling job or attempt to stop your labeling job while it is in the Initializing
state. Once the state changes to InProgress, you can start sending new data objects using Amazon
SNS and the Amazon S3 configuration.
Topics
• Create Amazon SNS Input and Output Topics (p. 714)
• Set up Amazon S3 Bucket Event Notifications (p. 717)
• Create a Manifest File (Optional) (p. 717)
• Example: Use SageMaker API To Create Streaming Labeling Job (p. 717)
• Stop a Streaming Labeling Job (p. 718)
Create Amazon SNS Input and Output Topics
When you create an Amazon SNS topic to use in your streaming labeling job, note down the topic
Amazon Resource Name (ARN). The ARN will be the input values for the parameter SnsTopicArn in
InputConfig and OutputConfig when you create a labeling job.
Your input topic is used to send new data objects to Ground Truth. To create an input topic, follow the
instructions in Creating an Amazon SNS topic in the Amazon Simple Notification Service Developer
Guide.
Note down your input topic ARN and use it as input for the CreateLabelingJob parameter
SnsTopicArn in InputConfig.
If you provide an output topic, it is used to send notifications when a data object is labeled. When
you create a topic, you have the option to add an encryption key. Use this option to add a AWS Key
Management Service customer managed key to your topic to encrypt the output data of your labeling
job before it is published to your output topic.
To create an output topic, follow the instructions in Creating an Amazon SNS topic in the Amazon Simple
Notification Service Developer Guide.
If you add encryption, you must attach additional permission to the topic. See Add Encryption to Your
Output Topic (Optional) (p. 715) for more information.
Important
To add a customer managed key to your output topic while creating a topic in the console, do
not use the (Default) alias/aws/sns option. Select a customer managed key that you created.
Note down your output topic ARN and use it in your CreateLabelingJob request in the parameter
SnsTopicArn in OutputConfig.
Add Encryption to Your Output Topic (Optional)
To encrypt messages published to your output topic, you need to provide an AWS KMS customer
managed key to your topic. Modify the following policy and add it to your customer managed key to give
Ground Truth permission to encrypt output data before publishing it to your output topic.
Replace <account_id> with the ID of the account that you are using to create your topic. To learn how
to find your AWS account ID, see Finding Your AWS Account ID.
{
"Id": "key-console-policy",
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Enable IAM User Permissions",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<account_id>:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow access for Key Administrators",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<account_id>:role/Admin"
},
"Action": [
"kms:Create*",
"kms:Describe*",
"kms:Enable*",
"kms:List*",
"kms:Put*",
"kms:Update*",
"kms:Revoke*",
"kms:Disable*",
"kms:Get*",
"kms:Delete*",
"kms:TagResource",
"kms:UntagResource",
"kms:ScheduleKeyDeletion",
"kms:CancelKeyDeletion"
],
"Resource": "*"
}
]
}
Additionally, you must modify and add the following policy to the execution role that you use to create
your labeling job (the input value for RoleArn).
Replace <account_id> with the ID of the account that you are using to create your topic. Replace
<region> with the AWS Region you are using to create your labeling job. Replace <key_id> with your
customer managed key ID.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "sid1",
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:GenerateDataKey"
],
"Resource": "arn:aws:kms:<region>:<account_id>:key/<key_id>"
}
]
}
For more information on creating and securing keys, see Creating Keys and Using Key Policies in the AWS
Key Management Service Developer Guide.
Subscribe an Endpoint to Your Amazon SNS Output Topic
When a worker completes a labeling job task from a Ground Truth streaming labeling job, Ground Truth
uses your output topic to publish output data to one or more endpoints that you specify. To receive
notifications when a worker finishes a labeling task, you must subscribe an endpoint to your Amazon
SNS output topic.
To learn how to add endpoints to your output topic, see Subscribing to an Amazon SNS topic in the
Amazon Simple Notification Service Developer Guide.
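For example, you can subscribe an Amazon SQS queue to the output topic with the AWS SDK for Python (Boto3). The ARNs below are placeholders:

import boto3

sns = boto3.client("sns")

# Deliver a notification to the queue each time a labeling task is completed.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:your-sns-output-topic",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:your-output-queue",
)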
To learn more about the output data format that is published to these endpoints, see Output
Data (p. 776).
Important
If you do not subscribe an endpoint to your Amazon SNS output topic, you will not receive
notifications when new data objects are labeled.
Set up Amazon S3 Bucket Event Notifications
You decide the types of events that you want to send to your Amazon SNS topic. Ground Truth creates a
labeling task when you send object creation events.
The event structure sent to your Amazon SNS input topic must be a JSON message formatted using the
same structure found in Event message structure.
To see examples of how you can set up an event notification for your Amazon S3 bucket using the
Amazon S3 console, AWS SDK for .NET, and AWS SDK for Java, follow this walkthrough, Walkthrough:
Configure a bucket for notifications (SNS topic or SQS queue) in the Amazon Simple Storage Service User
Guide.
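As a sketch, the equivalent setup with the AWS SDK for Python (Boto3) looks like the following. The bucket name and topic ARN are placeholders:

import boto3

s3 = boto3.client("s3")

# Send every object-creation event in the bucket to the SNS input topic.
s3.put_bucket_notification_configuration(
    Bucket="your-input-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:your-sns-input-topic",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)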
Create a Manifest File (Optional)
If you want to provide initial objects to be labeled, create a manifest file that identifies these objects and
place it in Amazon S3. Specify the S3 URI of this manifest file in ManifestS3Uri within InputConfig.
To learn how to format your manifest file, see Input Data (p. 734). To use the SageMaker console to
automatically generate a manifest file (not supported for 3D point cloud task types), see Automated
Data Setup (p. 736).
Example: Use SageMaker API To Create Streaming Labeling Job
response = client.create_labeling_job(
LabelingJobName= 'example-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
},
'SnsDataSource': {
'SnsTopicArn': 'arn:aws:sns:us-east-1:123456789012:your-sns-input-topic'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string',
'SnsTopicArn': 'arn:aws:sns:us-east-1:123456789012:your-sns-output-topic'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:us-east-1:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/custom-worker-task-template.html'
},
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-tasktype',
'TaskKeywords': [
'Example key word',
],
'TaskTitle': 'Multi-label image classification task',
'TaskDescription': 'Select all labels that apply to the images shown',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-tasktype'
}
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
Stop a Streaming Labeling Job
If your labeling job remains idle for over 10 days, it is automatically stopped by Ground Truth. In this
context, a labeling job is considered idle if no objects are sent to the Amazon SNS input topic and no
objects remain in your Amazon SQS queue, waiting to be labeled. For example, if no data objects are fed
to the Amazon SNS input topic and all the objects fed to the labeling job are already labeled, Ground
Truth starts a timer. After the timer starts, if no items are received within a 10 day period, the labeling
job is stopped.
When a labeling job is stopped, its status is STOPPING while Ground Truth cleans up labeling job
resources and unsubscribes your Amazon SNS topic from your Amazon SQS queue. The Amazon SQS
queue is not deleted by Ground Truth because this queue may contain unprocessed data objects. You
should manually delete the queue if you want to avoid incurring additional charges from Amazon SQS.
To learn more, see Amazon SQS pricing.
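For example, with the AWS SDK for Python (Boto3); the queue name below is a placeholder that you can find in the Amazon SQS console:

import boto3

sqs = boto3.client("sqs")

# Look up and delete the queue that the stopped streaming labeling job used.
queue_url = sqs.get_queue_url(QueueName="your-ground-truth-queue")["QueueUrl"]
sqs.delete_queue(QueueUrl=queue_url)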
• You can provide label category attributes for video frame and 3D point cloud object tracking and
object detection task types. Workers can use one or more attributes to give more information about
an object. For example, you may want to use the attribute occluded to have workers identify when an
object is partially obstructed. You can either specify a label category attribute for a single label using
the categoryAttributes parameter, or for all labels using the categoryGlobalAttributes
parameter.
• You can provide frame attributes for video frame and 3D point cloud object tracking and object
detection task types using frameAttributes. When you create a frame attribute, it appears on each
frame or point cloud in the worker task. In video frame labeling jobs, these are attributes that workers
assign to an entire video frame. For 3D point cloud labeling jobs, these attributes are applied to a
single point cloud. Use frame attributes to have workers provide more information about the scene in
a specific frame or point cloud.
• For video frame labeling jobs, you use the label category configuration file to specify the task type
(bounding box, polyline, polygon, or keypoint) sent to workers.
For workers, specifying values for label category attributes and frame attributes will be optional.
Important
You should only provide a label attribute name in auditLabelAttributeName if
you are running an audit job to verify or adjust labels. Use this parameter to input the
LabelAttributeName used in the labeling job that generated the annotations you want your
worker to adjust. When you create a labeling job in the console, if you did not specify a label
attribute name, the Name of your job is used as the LabelAttributeName.
Topics
• Label Category Configuration File Schema (p. 719)
• Example: Label Category Configuration Files for 3D Point Cloud Labeling Jobs (p. 726)
• Example: Label Category Configuration Files for Video Frame Labeling Jobs (p. 730)
• Creating Worker Instructions (p. 733)
Label Category Configuration File Schema
categoryGlobalAttributes: Not required. A list of JSON objects. Required parameters in each JSON
object: name and type; minimum and maximum are required if type is "number". Optional parameters:
description, enum, editsAllowed, and isRequired. Use this parameter to create label category
attributes that are applied to all labels you specify in labels. See the third table in this section for
more information.

instructions: Not required. Use this parameter to provide worker instructions. Short instructions must
be under 255 characters, and full instructions must be under 2,048 characters.

auditLabelAttributeName: String. Required for adjustment and verification task types. Enter the
LabelAttributeName used in the labeling job whose annotations you want to adjust. Only use this
parameter if you are creating an adjustment job for video frame and 3D point cloud object detection,
object tracking, or 3D point cloud semantic segmentation.
The following table describes the parameters that you can and must use to create a list of Labels. Each
parameter should be included in a JSON object.
The following table describes the parameters that you can and must use to create frame attributes
using frameAttributes and label category attributes using the categoryGlobalAttributes and
categoryAttributes parameters.
If you specify "string" for type and do not provide an enum value, workers can enter free-form text.
You can specify up to 10 label category attributes per class. This 10-attribute quota includes global
label category attributes. For example, if you create four global label category attributes, and then
assign three label category attributes to label X, that label will have 4+3=7 label category attributes in
total. For all label category and label category attribute limits, refer to the following table.
Labels (labels): minimum 1, maximum 30
Frame attributes: minimum 0, maximum 10
Example: Label Category Configuration Files for 3D Point Cloud Labeling Jobs
Select a tab in the following tables to see examples of 3D point cloud label category configuration files
for object detection, object tracking, semantic segmentation, adjustment, and verification labeling jobs.
The following is an example of a label category configuration file that includes label category
attributes for a 3D point cloud object detection or object tracking labeling job. This example
includes two frame attributes, which will be added to all point clouds submitted to the labeling
job. The Car label will include four label category attributes—X, Y, Z, and the global attribute, W.
{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"],
            "isRequired": true
        }
    ],
    "categoryGlobalAttributes": [
        {
            "name": "W",
            "description": "label-attributes-for-all-labels",
            "type": "string",
            "enum": ["foo", "buzz", "biz"]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "categoryAttributes": [
                {
                    "name": "X",
                    "description": "enter a number",
                    "type": "number"
                },
                {
                    "name": "Y",
                    "description": "select an option",
                    "type": "string",
                    "enum": ["y1", "y2"]
                },
                {
                    "name": "Z",
                    "description": "submit a free-form response",
                    "type": "string"
                }
            ]
        },
        {
            "label": "Pedestrian",
            "categoryAttributes": [...]
        }
    ],
    "instructions": {"shortInstruction": "Draw a tight Cuboid", "fullInstruction": "<html markup>"}
}
The following is an example of a label category configuration file for a 3D point cloud semantic
segmentation labeling job.
Label category attributes are not supported for 3D point cloud semantic segmentation task types.
Frame attributes are supported. If you provide label category attributes for a semantic segmentation
labeling job, they will be ignored.
{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"]
        }
    ],
    "labels": [
        {
            "label": "Car"
        },
        {
            "label": "Pedestrian"
        },
        {
            "label": "Cyclist"
        }
    ],
    "instructions": {"shortInstruction": "Select the appropriate label and paint all objects in the point cloud that it applies to the same color", "fullInstruction": "<html markup>"}
}
Select a tab in the following table to see an example of a label category configuration file for 3D point
cloud verification or adjustment labeling jobs.
The following is an example of a label category configuration file for a 3D point cloud object
detection or object tracking adjustment labeling job. For 3D point cloud semantic segmentation
adjustment labeling jobs, categoryGlobalAttributes and categoryAttributes are not
supported.
You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the adjustment labeling job. Optionally, you can use the
editsAllowed parameter to specify whether or not a label or frame attribute can be edited.
{
"documentVersion": "2020-03-01",
"frameAttributes": [
{
"name":"count players",
"description":"How many players to you see in the scene?",
"type":"number"
},
{
"name":"select one",
"editsAllowed":"none",
"description":"describe the scene",
"type":"string",
"enum":["clear","blurry"]
}
],
"categoryGlobalAttributes": [
{
"name":"W",
"editsAllowed":"any",
"description":"label-attributes-for-all-labels",
"type":"string",
"enum": ["foo", "buzz", "biz"]
}
],
"labels": [
{
"label": "Car",
"editsAllowed":"any",
"categoryAttributes": [
{
"name":"X",
"description":"enter a number",
"type":"number"
},
{
"name":"Y",
"description":"select an option",
"type":"string",
"enum":["y1", "y2"],
"editsAllowed":"any"
},
{
"name":"Z",
"description":"submit a free-form response",
"type":"string",
"editsAllowed":"none"
}
]
},
{
"label": "Pedestrian",
"categoryAttributes": [...]
}
],
"instructions": {"shortInstruction":"Draw a tight Cuboid", "fullInstruction":"<html
markup>"},
// include auditLabelAttributeName for label adjustment jobs
"auditLabelAttributeName": "myPrevJobLabelAttributeName"
}
The following is an example of a label category configuration file you may use for a 3D point
cloud object detection or object tracking verification labeling job. For a 3D point cloud semantic
segmentation verification labeling job, categoryGlobalAttributes and categoryAttributes
are not supported.
You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the verification labeling job. Additionally, you must use the
editsAllowed parameter to specify that no labels can be edited.
{
"documentVersion": "2020-03-01",
"frameAttributes": [
{
"name":"count players",
"editsAllowed":"any",
"description":"How many players to you see in the scene?",
"type":"number"
},
{
"name":"select one",
"editsAllowed":"any",
"description":"describe the scene",
"type":"string",
"enum":["clear","blurry"]
}
],
"categoryGlobalAttributes": [
{
"name":"W",
"editsAllowed":"none",
"description":"label-attributes-for-all-labels",
"type":"string",
"enum": ["foo", "buzz", "biz"]
}
],
"labels": [
{
"label": "Car",
"editsAllowed":"none",
"categoryAttributes": [
{
"name":"X",
"description":"enter a number",
"type":"number",
"editsAllowed":"none"
},
{
"name":"Y",
"description":"select an option",
"type":"string",
"enum":["y1", "y2"],
"editsAllowed":"any"
},
{
"name":"Z",
"description":"submit a free-form response",
"type":"string",
"editsAllowed":"none"
}
]
},
{
"label": "Pedestrian",
"editsAllowed":"none",
"categoryAttributes": [...]
}
],
"instructions": {"shortInstruction":"Draw a tight Cuboid", "fullInstruction":"<html
markup>"},
// include auditLabelAttributeName for label verification jobs
"auditLabelAttributeName": "myPrevJobLabelAttributeName"
}
Example: Label Category Configuration Files for Video Frame Labeling Jobs
The annotation tools available to your workers and the task type used depend on the value you specify
for annotationType. For example, if you want workers to use key points to track changes in the pose of
specific objects across multiple frames, specify Keypoint for the annotationType. If you do not specify
an annotation type, BoundingBox is used by default.
The following is an example of a video frame keypoint label category configuration file with label
category attributes. This example includes two frame attributes, which will be added to all frames
submitted to the labeling job. The Car label will include four label category attributes—X, Y, Z, and the
global attribute, W.
{
"documentVersion": "2020-03-01",
"frameAttributes": [
{
"name":"count players",
"description":"How many players to you see in the scene?",
"type":"number"
},
{
"name":"select one",
"description":"describe the scene",
"type":"string",
"enum":["clear","blurry"]
}
],
"categoryGlobalAttributes": [
{
"name":"W",
"description":"label-attributes-for-all-labels",
"type":"string",
"enum": ["foo", "buz", "buz2"]
}
],
"labels": [
{
"label": "Car",
"categoryAttributes": [
{
"name":"X",
"description":"enter a number",
"type":"number",
},
{
"name":"Y",
"description":"select an option",
"type":"string",
"enum": ["y1", "y2"]
},
{
"name":"Z",
"description":"submit a free-form response",
"type":"string"
}
]
},
{
"label": "Pedestrian",
"categoryAttributes": [...]
}
],
"annotationType":"Keypoint",
"instructions": {"shortInstruction":"add example short instructions here",
"fullInstruction":"<html markup>"}
}
The following are examples of label category configuration files for video frame adjustment and
verification labeling jobs.
The following is an example of a label category configuration file you may use for a video frame
adjustment labeling job.
You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the adjustment labeling job. Optionally, you can use the
editsAllowed parameter to specify whether or not labels, label category attributes, or frame
attributes can be edited.
{
"documentVersion": "2020-03-01",
"frameAttributes": [
{
"name":"count players",
"editsAllowed":"none",
"description":"How many players to you see in the scene?",
"type":"number"
},
{
"name":"select one",
"description":"describe the scene",
"type":"string",
"enum":["clear","blurry"]
}
],
"categoryGlobalAttributes": [
{
"name":"W",
"editsAllowed":"any",
"description":"label-attributes-for-all-labels",
"type":"string",
"enum": ["foo", "buz", "buz2"]
}
],
"labels": [
{
"label": "Car",
"editsAllowed":"any",
"categoryAttributes": [
{
"name":"X",
"description":"enter a number",
"type":"number",
"editsAllowed":"any"
},
{
"name":"Y",
"description":"select an option",
"type":"string",
"enum": ["y1", "y2"],
"editsAllowed":"any"
},
{
"name":"Z",
"description":"submit a free-form response",
"type":"string",
"editsAllowed":"none"
}
]
},
{
"label": "Pedestrian",
"editsAllowed":"none",
"categoryAttributes": [...]
}
],
"annotationType":"Keypoint",
"instructions": {"shortInstruction":"add example short instructions here",
"fullInstruction":"<html markup>"},
// include auditLabelAttributeName for label adjustment jobs
"auditLabelAttributeName": "myPrevJobLabelAttributeName"
}
The following is an example of a label category configuration file for a video frame verification labeling job.
You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the verification labeling job. Additionally, you must use the
editsAllowed parameter to specify that no labels can be edited.
{
"documentVersion": "2020-03-01",
"frameAttributes": [
{
"name":"count players",
"editsAllowed":"none",
"description":"How many players to you see in the scene?",
"type":"number"
},
{
"name":"select one",
"editsAllowed":"any",
"description":"describe the scene",
"type":"string",
"enum":["clear","blurry"]
}
],
"categoryGlobalAttributes": [
{
"name":"W",
"editsAllowed":"none",
"description":"label-attributes-for-all-labels",
"type":"string",
"enum": ["foo", "buz", "buz2"]
}
],
"labels": [
{
"label": "Car",
"editsAllowed":"none",
"categoryAttributes": [
{
"name":"X",
"description":"enter a number",
"type":"number",
"editsAllowed":"any"
},
{
"name":"Y",
"description":"select an option",
"type":"string",
"enum": ["y1", "y2"],
"editsAllowed":"any"
},
{
"name":"Z",
"description":"submit a free-form response",
"type":"string",
"editsAllowed":"none"
}
]
},
{
"label": "Pedestrian",
"editsAllowed":"none",
"categoryAttributes": [...]
}
],
"annotationType":"Keypoint",
"instructions": {"shortInstruction":"add example short instructions here",
"fullInstruction":"<html markup>"},
// include auditLabelAttributeName for label verification jobs
"auditLabelAttributeName": "myPrevJobLabelAttributeName"
}
For 3D point cloud and video frame labeling jobs, you can add worker instructions to your label category
configuration file. You can use a single string to create instructions or you can add HTML markup to
customize the appearance of your instructions and add images. Make sure that any images you include in
your instructions are publicly available or, if your images are in Amazon S3, that your workers have
read access so that they can view them.
• Short instructions – These instructions are shown to workers when they select Instructions in the
worker UI menu. They should provide an easy reference to show the worker the correct way to label an
object.
• Full instructions – These instructions are shown in a pop-up window when workers select More
Instructions. We recommend that you provide detailed instructions for completing the task with
multiple examples showing edge cases and other difficult situations for labeling objects.
Use Input and Output Data
The output data is the result of your labeling job. The output data file, or augmented manifest file,
contains label data for each object you send to the labeling job and metadata about the label assigned
to data objects.
When you use image classification (single and multi-label), text classification (single and multi-label),
object detection, and semantic segmentation built-in task types to create a labeling job, you can use the
resulting augmented manifest file to launch a SageMaker training job. For a demonstration of how to use
an augmented manifest to train an object detection machine learning model with Amazon SageMaker,
see object_detection_augmented_manifest_training.ipynb. For more information, see Provide Dataset
Metadata to Training Jobs with an Augmented Manifest File (p. 2138).
Topics
• Input Data (p. 734)
• 3D Point Cloud Input Data (p. 746)
• Video Frame Input Data (p. 770)
• Output Data (p. 776)
Input Data
The input data are the data objects that you send to your workforce to be labeled. There are two ways to
send data objects to Ground Truth for labeling:
• Send a list of data objects that require labeling using an input manifest file.
• Send individual data objects in real time to a perpetually running, streaming labeling job.
If you have a dataset that needs to be labeled one time, and you do not require an ongoing labeling job,
create a standard labeling job using an input manifest file.
If you want to regularly send new data objects to your labeling job after it has started, create a
streaming labeling job. When you create a streaming labeling job, you can optionally use an input
manifest file to specify a group of data that you want labeled immediately when the job starts. You can
continuously send new data objects to a streaming labeling job as long as it is active.
Note
Streaming labeling jobs are only supported through the SageMaker API. You cannot create a
streaming labeling job using the SageMaker console.
The following task types have special input data requirements and options:
• For 3D point cloud labeling job input data requirements, see 3D Point Cloud Input Data (p. 746).
• For video frame labeling job input data requirements, see Video Frame Input Data (p. 770).
Topics
• Use an Input Manifest File (p. 735)
• Automated Data Setup (p. 736)
• Supported Data Formats (p. 737)
• Ground Truth Streaming Labeling Jobs (p. 738)
Use an Input Manifest File
Input data and the manifest file must be stored in Amazon Simple Storage Service (Amazon S3). Each
has specific storage and access requirements, as follows:
• The Amazon S3 bucket that contains the input data must be in the same AWS Region in which you
are running Amazon SageMaker Ground Truth. You must give Amazon SageMaker access to the data
stored in the Amazon S3 bucket so that it can read it. For more information about Amazon S3 buckets,
see Working with Amazon S3 buckets.
• The manifest file must be in the same AWS Region as the data files, but it doesn't need to be in the
same location as the data files. It can be stored in any Amazon S3 bucket that is accessible to the AWS
Identity and Access Management (IAM) role that you assigned to Ground Truth when you created the
labeling job.
Note
3D point cloud and video frame task types have different input manifest requirements and
attributes.
For 3D point cloud task types, refer to Create an Input Manifest File for a 3D Point Cloud
Labeling Job (p. 748).
For video frame task types, refer to Create a Video Frame Input Manifest File (p. 775).
The manifest is a UTF-8 encoded file in which each line is a complete and valid JSON object. Each line is
delimited by a standard line break, \n or \r\n. Because each line must be a valid JSON object, you can't
have unescaped line break characters. For more information about data format, see JSON Lines.
Each JSON object in the manifest file can be no larger than 100,000 characters. No single attribute
within an object can be larger than 20,000 characters. Attribute names can't begin with $ (dollar sign).
Each JSON object in the manifest file must contain one of the following keys: source-ref or source.
The value of the keys are interpreted as follows:
• source-ref – The source of the object is the Amazon S3 object specified in the value. Use this value
when the object is a binary object, such as an image.
• source – The source of the object is the value. Use this value when the object is a text value.
The following is an example of a manifest file for image files stored in an Amazon S3 bucket (the bucket
and file names are illustrative):
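{"source-ref": "s3://awsexamplebucket/images/image1.jpg"}
{"source-ref": "s3://awsexamplebucket/images/image2.jpg"}
{"source-ref": "s3://awsexamplebucket/images/image3.jpg"}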
Use the source-ref key for image files for bounding box, image classification (single and multi-label),
semantic segmentation, and video clips for video classification labeling jobs. 3D point cloud and video
frame labeling jobs also use the source-ref key but these labeling jobs require additional information
in the input manifest file. For more information see 3D Point Cloud Input Data (p. 746) and Video
Frame Input Data (p. 770).
The following is an example of a manifest file with the input text data stored directly in the manifest
(the text values are illustrative):
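{"source": "Lorem ipsum dolor sit amet, consectetur adipiscing elit"}
{"source": "Sed do eiusmod tempor incididunt ut labore et dolore"}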
Use the source key for single and multi-label text classification and named entity recognition labeling
jobs.
You can include other key-value pairs in the manifest file. These pairs are passed to the output file
unchanged. This is useful when you want to pass information between your applications. For more
information, see Output Data (p. 776).
Automated Data Setup
Before using the following procedure, ensure that your input images or files are correctly formatted:
• Image files – Image files must comply with the size and resolution limits listed in the tables found in
Input File Size Quota (p. 742).
• Text files – Text data can be stored in one or more .txt files. Each item that you want labeled must be
separated by a standard line break.
• CSV files – Text data can be stored in one or more .csv files. Each item that you want labeled must be
in a separate row.
• Videos – Video files can be any of the following formats: .mp4, .ogg, and .webm. If you want to
extract video frames from your video files for object detection or object tracking, see Provide Video
Files (p. 772).
• Video frames – Video frames are images extracted from a video. All images extracted from a single
video are referred to as a sequence of video frames. Each sequence of video frames must have a unique
prefix key in Amazon S3. See Provide Video Frames (p. 771). For this data type, see Automated
Video Frame Input Data Setup (p. 773).
Important
For video frame object detection and video frame object tracking labeling jobs, see Automated
Video Frame Input Data Setup (p. 773) to learn how to use the automated data setup.
Use these instructions to automatically set up your input dataset connection with Ground Truth.
1. Navigate to the Create labeling job page in the Amazon SageMaker console at https://
console.aws.amazon.com/sagemaker/.
This link puts you in the US East (N. Virginia) (us-east-1) AWS Region. If your input data is in an Amazon S3
bucket in another Region, switch to that Region. To change your AWS Region, on the navigation bar,
choose the name of the currently displayed Region.
2. Select Create labeling job.
3. Enter a Job name.
When you use the automated data setup for image data, Ground Truth creates a file, dataset-
YYMMDDTHHmmSS.manifest, in your Amazon S3 bucket (example-groundtruth-images in this example),
where YYMMDDTHHmmSS indicates the year (YY), month (MM), and day (DD), and the time in hours (HH),
minutes (mm), and seconds (SS), at which the input manifest file was created.
• Video Frame Object Detection, Video Frame Object Tracking (bounding boxes, polylines, polygons, or
keypoints) – Input data: video frames and video frame sequence files (for Object Tracking). Accepted
formats: video frames as .jpg, .jpeg, or .png; sequence files as .json. Refer to Create a Video Frame
Input Manifest File (p. 775).
• 3D Point Cloud Semantic Segmentation, 3D Point Cloud Object Detection, 3D Point Cloud Object
Tracking – Input data: point clouds and point cloud sequence files (for Object Tracking). Accepted
formats: binary pack format and ASCII. For more information, see Accepted Raw 3D Data
Formats (p. 746). Refer to Create an Input Manifest File for a 3D Point Cloud Labeling Job (p. 748).
Ground Truth Streaming Labeling Jobs
Use Ground Truth streaming labeling jobs to do the following:
• Send new dataset objects to workers in real time using a perpetually running labeling job. Workers
continuously receive new data objects to label as long as the labeling job is active and new objects are
being sent to it.
• Gain visibility into the number of objects that have been queued and are waiting to be labeled. Use
this information to control the flow of data objects sent to your labeling job.
• Receive label data for individual data objects in real time as workers finish labeling them.
Ground Truth streaming labeling jobs remain active until they are manually stopped or have been idle
for more than 10 days. You can intermittently send new data objects to workers while the labeling job is
active.
If you are a new user of Ground Truth streaming labeling jobs, it is recommended that you review How It
Works (p. 738).
Use Create a Streaming Labeling Job (p. 714) to learn how to create a streaming labeling job.
Note
Ground Truth streaming labeling jobs are only supported through the SageMaker API.
Topics
• How It Works (p. 738)
• Send Data to a Streaming Labeling Job (p. 738)
• Manage Labeling Requests with an Amazon SQS Queue (p. 740)
• Receive Output Data from a Streaming Labeling Job (p. 740)
• Duplicate Message Handling (p. 740)
How It Works
When you create a Ground Truth streaming labeling job, the job remains active until it is manually
stopped, remains idle for more than 10 days, or is unable to access input data sources. You can
intermittently send new data objects to workers while it is active. A worker can continue to receive
new data objects in real time as long as the total number of tasks currently available to the worker
is less than the value in MaxConcurrentTaskCount. Otherwise, the data object is sent to a queue
that Ground Truth creates on your behalf in Amazon Simple Queue Service (Amazon SQS) for later
processing. These tasks are sent to workers as soon as the total number of tasks currently available to
a worker falls below MaxConcurrentTaskCount. If a data object is not sent to a worker after 14 days,
it expires. You can view the number of tasks pending in the queue and adjust the number of objects
you send to the labeling job. For example, you may decrease the speed at which you send objects to the
labeling job if the backlog of pending objects moves above a threshold.
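For example, the following boto3 sketch reads the approximate backlog from the Amazon SQS queue that
Ground Truth creates for a streaming labeling job (see Manage Labeling Requests with an Amazon SQS
Queue (p. 740)); the job name is illustrative:

import boto3

sqs = boto3.client("sqs")

# Ground Truth names the queue GroundTruth-<labeling_job_name> (lowercase)
queue_url = sqs.get_queue_url(QueueName="GroundTruth-my-streaming-job")["QueueUrl"]

# ApproximateNumberOfMessages reflects objects queued and waiting to be labeled
attributes = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages"],
)
backlog = int(attributes["Attributes"]["ApproximateNumberOfMessages"])
print(f"Objects waiting to be labeled: {backlog}")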
Send Data to a Streaming Labeling Job
You can optionally submit input data to a streaming labeling job one time when you create the labeling
job using an input manifest file. Once the labeling job has started and the state is InProgress, you
can submit new data objects to your labeling job in real time using your Amazon SNS input topic and
Amazon S3 event notifications.
Submit Data Objects When you Start the Labeling Job (One Time):
• Use an Input Manifest File – You can optionally specify an input manifest file Amazon S3 URI in
ManifestS3Uri when you create the streaming labeling job. Ground Truth sends each data object in
the manifest file to workers for labeling as soon as the labeling job starts. To learn more, see Create a
Manifest File (Optional) (p. 717).
After you submit a request to create the streaming labeling job, its status will be Initializing.
Once the labeling job is active, the state changes to InProgress and you can start using the real-time
options to submit additional data objects for labeling.
• Send data objects using Amazon SNS messages – You can send Ground Truth new data objects to
label by sending an Amazon SNS message. You will send this message to an Amazon SNS input topic
that you create and specify when you create your streaming labeling job. For more information, see
Send Data Objects Using Amazon SNS (p. 739).
• Send data objects by placing them in an Amazon S3 bucket – Each time you add a new data object
to an Amazon S3 bucket, you can prompt Ground Truth to process that object for labeling. To do this,
you add an event notification to the bucket so that it notifies your Amazon SNS input topic each time
a new object is added to (or created in) that bucket. For more information, see Send Data Objects
using Amazon S3 (p. 740). This option is not available for text-based labeling jobs such as text
classification and named entity recognition.
Important
If you use the Amazon S3 configuration, do not use the same Amazon S3 location for your
input data configuration and your output data. You specify the S3 prefix for your output data
when you create a labeling job.
Send Data Objects Using Amazon SNS
You can send data objects to your streaming labeling job using Amazon Simple Notification Service
(Amazon SNS). Amazon SNS is a web service that coordinates and manages the delivery of messages
to and from endpoints (for example, an email address or AWS Lambda function). An Amazon SNS topic
acts as a communication channel between two or more endpoints. You use Amazon SNS to send, or
publish, new data objects to the topic specified in the CreateLabelingJob parameter SnsTopicArn in
InputConfig. The format of these messages is the same as a single line from an input manifest file.
For example, you may send a piece of text to an active text classification labeling job by publishing it to
your input topic. The message that you publish may look similar to the following:
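{"source": "Lorem ipsum dolor sit amet, consectetur adipiscing elit"}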
To send a new image object to an image classification labeling job, your message may look similar to the
following:
{"source-ref": "s3://awsexamplebucket/example-image.jpg"}
Note
You can also include custom deduplication IDs and deduplication keys in your Amazon SNS
messages. To learn more, see Duplicate Message Handling (p. 740).
When Ground Truth creates your streaming labeling job, it subscribes to your Amazon SNS input topic.
Send Data Objects Using Amazon S3
You can send one or more new data objects to a streaming labeling job by placing them in an Amazon
S3 bucket that is configured with an Amazon SNS event notification. You can set up an event to notify
your Amazon SNS input topic anytime a new object is created in your bucket. You must specify this same
Amazon SNS input topic in the CreateLabelingJob parameter SnsTopicArn in InputConfig.
Anytime you configure an Amazon S3 bucket to send notifications to Amazon SNS, Amazon S3
publishes a test event, "s3:TestEvent", to ensure that the topic exists and that the owner of the
Amazon S3 bucket specified has permission to publish to the specified topic. It is recommended that you
set up your Amazon S3 connection with Amazon SNS before starting a streaming labeling job. If you do
not, this test event may register as a data object and be sent to Ground Truth for labeling.
Important
If you use the Amazon S3 configuration, do not use the same Amazon S3 location for your input
data configuration and your output data. You specify the S3 prefix for your output data when
you create a labeling job.
For image-based labeling jobs, Ground Truth requires all S3 buckets to have a CORS policy
attached. To learn more, see CORS Permission Requirement (p. 816).
Once you have configured your Amazon S3 bucket and created your labeling job, you can add objects
to your bucket and Ground Truth either sends that object to workers or places it on your Amazon SQS
queue.
To learn more, see Set up Amazon S3 Bucket Event Notifications (p. 717).
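A hedged boto3 sketch of attaching such an event notification to a bucket (the bucket name and topic
ARN are illustrative):

import boto3

s3 = boto3.client("s3")

# Notify the SNS input topic whenever a new object is created in the bucket
s3.put_bucket_notification_configuration(
    Bucket="awsexamplebucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:111122223333:my-gt-input-topic",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)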
Important
This option is not available for text-based labeling jobs such as text classification and named
entity recognition.
Manage Labeling Requests with an Amazon SQS Queue
When Ground Truth creates your streaming labeling job, it creates an Amazon SQS queue in the AWS
account used to create the labeling job. The queue name is GroundTruth-labeling_job_name where
labeling_job_name is the name of your labeling job, in lowercase letters. When you send data objects
to your labeling job, Ground Truth either sends the data objects directly to workers or places the task in
your queue to be processed at a later time. If a data object is not sent to a worker after 14 days, it expires
and is removed from the queue. You can set up an alarm in Amazon SQS to detect when objects expire
and use this mechanism to control the volume of objects you send to your labeling job.
Important
Modifying, deleting, or sending objects directly to the Amazon SQS queue associated with your
streaming labeling job may lead to job failures.
Receive Output Data from a Streaming Labeling Job
Your Amazon S3 output bucket is periodically updated with new output data from your streaming
labeling job.
Optionally, you can specify an Amazon SNS output topic. Each time a worker submits a labeled object, a
notification with the output data is sent to that topic. You can subscribe an endpoint to your SNS output
topic to receive notifications or trigger events when you receive output data from a labeling task. Use an
Amazon SNS output topic if you want to chain to another streaming job in real time and receive an
Amazon SNS notification each time a data object is submitted by a worker.
To learn more, see Subscribe an Endpoint to Your Amazon SNS Output Topic (p. 716).
Duplicate Message Handling
For data objects sent in real time, Ground Truth guarantees idempotency by ensuring each unique object
is only sent for labeling once, even if the input message referring to that object is received multiple
times (duplicate messages). To do this, each data object sent to a streaming labeling job is assigned a
deduplication ID, which is identified with a deduplication key.
If you send your requests to label data objects directly through your Amazon SNS input topic using
Amazon SNS messages, you can optionally choose a custom deduplication key and deduplication IDs
for your objects. For more information, see Specify A Deduplication Key and ID in an Amazon SNS
Message (p. 741).
If you do not provide your own deduplication key, or if you use the Amazon S3 configuration to send
data objects to your labeling job, Ground Truth uses one of the following for the deduplication ID:
• For messages sent directly to your Amazon SNS input topic, Ground Truth uses the SNS message ID.
• For messages that come from an Amazon S3 configuration, Ground Truth creates a deduplication ID by
combining the Amazon S3 URI of the object with the sequencer token in the message.
When you send a data object to your streaming labeling job using an Amazon SNS message, you have
the option to specify your deduplication key and deduplication ID in one of the following ways. In all of
these scenarios, identify your deduplication key with dataset-objectid-attribute-name.
Create your own deduplication key and deduplication ID by configuring your Amazon SNS message as
follows. Replace byo-key with your key and UniqueId with the deduplication ID for that data object.
{
"source-ref":"s3://bucket/prefix/object1",
"dataset-objectid-attribute-name":"byo-key",
"byo-key":"UniqueId"
}
Your deduplication key can be up to 140 characters. Supported patterns include:
"^[$a-zA-Z0-9](-*[a-zA-Z0-9])*".
You can use an existing key in your message as the deduplication key. When you do this, the value
associated with that key is used for the deduplication ID.
For example, you can use the source-ref key as your deduplication key by formatting your
message as follows:
{
"source-ref":"s3://bucket/prefix/object1",
"dataset-objectid-attribute-name":"source-ref"
}
In this example, Ground Truth uses "s3://bucket/prefix/object1" for the deduplication ID.
You can see the deduplication key and ID in your output data. The deduplication key is identified by
dataset-objectid-attribute-name.
When you use your own custom deduplication key, your output contains something similar to the
following:
"dataset-objectid-attribute-name": "byo-key",
"byo-key": "UniqueId",
When you do not specify a key, you can find the deduplication ID that Ground Truth assigned to
your data object as follows. The $label-attribute-name-object-id parameter identifies your
deduplication ID.
{
"source-ref":"s3://bucket/prefix/object1",
"dataset-objectid-attribute-name":"$label-attribute-name-object-id"
"label-attribute-name" :0,
"label-attribute-name-metadata": {...},
"$label-attribute-name-object-id":"<service-generated-key>"
}
For <service-generated-key>, if the data object came through an Amazon S3 configuration, Ground
Truth adds a unique value used by the service and emits a new field keyed by $sequencer, which shows
the Amazon S3 sequencer used. If the object was sent to Amazon SNS directly, Ground Truth uses the SNS
message ID.
Note
Do not use the $ character in your label attribute name.
Input File Size Quota
Input image data for active and non-active learning labeling jobs must not exceed size and resolution
quotas. Active learning refers to labeling jobs that use automated data labeling. Non-active learning
refers to labeling jobs that don't use automated data labeling.
Additional quotas apply for label categories for all task types, and for input data and labeling category
attributes for 3D point cloud and video frame task types.
Input files can't exceed the following size quotas for both active and non-active learning labeling jobs.
There is no input file size quota for videos used in video classification labeling jobs.
Labeling Job Task Type       Input File Size Quota
Image classification         40 MB
Semantic segmentation        40 MB
Labeling Job Task Type                    Resolution Quota - Non-Active Learning    Resolution Quota - Active Learning
Bounding box (Object detection)           100 million pixels                        3840 x 2160 pixels (4K)
Semantic segmentation label adjustment    100 million pixels                        1920 x 1080 pixels (1080p)
The following label category limits apply to labeling jobs. Quotas for label categories depend on
whether you use the SageMaker API operation CreateLabelingJob or the console to create a labeling
job.
Labeling Job Task Type       Label Category Quota - API    Label Category Quota - Console
Video classification         30                            30
The following quotas apply for 3D point cloud and video frame labeling job input data.
Labeling Job Task Type             Quota
Video frame object detection       2,000 video frames (images) per sequence
Video frame object detection       10 video frame sequences per manifest file
Video frame object tracking        2,000 video frames (images) per sequence
Video frame object tracking        10 video frame sequences per manifest file
3D point cloud object detection    100,000 point cloud frames per labeling job
3D point cloud object tracking     100,000 point cloud frame sequences per labeling job
3D point cloud object tracking     500 point cloud frames in each sequence file
When you create a video frame or 3D point cloud labeling job, you can add one or more label category
attributes to each label category that you specify to have workers provide more information about an
annotation.
Each label category attribute has a single label category attribute name, and a list of one or more
options (values) to choose from. To learn more, see Worker User Interface (UI) (p. 631) for 3D point
cloud labeling jobs and Worker User Interface (UI) (p. 577) for video frame labeling jobs.
The following quotas apply to the number of label category attributes names and values you can specify
for labeling jobs.
Labeling Job Task Type       Label Category Attribute (name) Quota    Label Category Attribute Values Quota
The following options are available in the Labeling jobs section of the SageMaker console after selecting
Create labeling job. To learn how to create a labeling job in the console, see Getting started (p. 527).
To configure the dataset that you use for labeling, in the Job overview section, choose Additional
configuration.
After you have specified the percentage of data objects that you want to include in the sample, choose
Create subset. SageMaker randomly picks the data objects for your labeling job. After the objects are
selected, choose Use this subset.
SageMaker creates a manifest file for the selected data objects. It also modifies the value in the Input
dataset location field to point to the new manifest file.
Specify a Subset
You can specify a subset of your data objects using an Amazon S3 SELECT query on the object file
names.
The SELECT statement of the SQL query is defined for you. You provide the WHERE clause to specify
which data objects should be returned.
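For example, assuming the console exposes the object file name as a key column (an assumption; follow
the query template shown in the console), a WHERE clause like the following would select only objects
whose file names end in .jpg:

WHERE key LIKE '%.jpg'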
For more information about the Amazon S3 SELECT statement, see Selecting Content from Objects.
Choose Create subset to start the selection, and then choose Use this subset to use the selected data.
SageMaker creates a manifest file for the selected data objects. It also updates the value in the Input
dataset location field to point to the new manifest file.
3D Point Cloud Input Data
Use your labeling job task type to choose a topic in Create an Input Manifest File for a 3D Point Cloud
Labeling Job (p. 748) to learn about the formatting requirements for each line of your input manifest
file.
Topics
• Accepted Raw 3D Data Formats (p. 746)
• Create an Input Manifest File for a 3D Point Cloud Labeling Job (p. 748)
• Understand Coordinate Systems and Sensor Fusion (p. 761)
Accepted Raw 3D Data Formats
For each frame, Ground Truth supports Compact Binary Pack Format (.bin) and ASCII (.txt) files. These
files contain information about the location (x, y, and z coordinates) of all points that make up that
frame, and, optionally, information about the pixel color of each point for colored point clouds. When
you create a 3D point cloud labeling job input manifest file, you can specify the format of your raw data
in the format parameter.
The following table lists elements that Ground Truth supports in point cloud frame files to describe
individual points.
Symbol    Value
x         The x coordinate of the point.
y         The y coordinate of the point.
z         The z coordinate of the point.
i         The intensity of the point.
r         The red color channel component of the point.
g         The green color channel component of the point.
b         The blue color channel component of the point.
Compact Binary Pack Format
The Compact Binary Pack Format represents a point cloud as an ordered stream of points.
Each point in the stream is an ordered binary pack of 4-byte float values in some variant of the form
xyzirgb. The x, y, and z elements are required, and additional information about that point can be
included in a variety of ways using i, r, g, and b.
To use a binary file to input point cloud frame data to a Ground Truth 3D point cloud labeling job, enter
binary/<elements> in the format parameter for your input manifest file, replacing <elements> with the
order of elements in each binary pack. For example, you may enter one of the following for the format
parameter.
• binary/xyzi – When you use this format, your point element stream would be in the following
order: x1y1z1i1x2y2z2i2...
• binary/xyzrgb – When you use this format, your point element stream would be in the following
order: x1y1z1r1g1b1x2y2z2r2g2b2...
• binary/xyzirgb – When you use this format, your point element stream would be in the following
order: x1y1z1i1r1g1b1x2y2z2i2r2g2b2...
When you use a binary file for your point cloud frame data, if you do not enter a value for format, the
default pack format binary/xyzi is used.
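As a minimal sketch of producing a binary/xyzi frame file with numpy (the file name and point values
are illustrative, and little-endian byte order is an assumption that the documentation does not specify):

import numpy as np

# N x 4 array of x, y, z, i values for one point cloud frame (illustrative data)
points = np.array(
    [[1.0, 2.0, 0.5, 0.9],
     [3.0, 0.1, 1.2, 0.4]],
    dtype=np.float32,
)

# binary/xyzi: a stream of 4-byte floats in the order x1 y1 z1 i1 x2 y2 z2 i2 ...
# "<f4" assumes little-endian byte order
points.astype("<f4").tofile("frame1.bin")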
ASCII Format
The ASCII format uses a text file to represent a point cloud, where each line in the ASCII point cloud file
represents a single point. Each point is a line in the text file that contains whitespace-separated values,
each of which is an ASCII representation of a 4-byte float value. The x, y, and z elements are required for
each point, and additional information about that point can be included in a variety of ways using i, r,
g, and b.
To use a text file to input point cloud frame data to a Ground Truth 3D point cloud labeling job, enter
text/<elements> in the format parameter for your input manifest file, replacing <elements> with the
order of point elements on each line.
For example, if you enter text/xyzi for format, your text file for each point cloud frame should look
similar to the following:
x1 y1 z1 i1
x2 y2 z2 i2
...
...
If you enter text/xyzrgb, your text file should look similar to the following:
x1 y1 z1 r1 g1 b1
x2 y2 z2 r2 g2 b2
...
...
When you use a text file for your point cloud frame data, if you do not enter a value for format, the
default format text/xyzi will be used.
Ground Truth does not have a resolution limit for 3D point cloud frames. However, we recommend that
you limit each point cloud frame to 500K points for optimal performance. When Ground Truth renders
the 3D point cloud visualization, it must be viewable on your workers' computers, which depends on
workers' computer hardware. Point cloud frames that are larger than 1 million points may not render on
standard machines, or may take too long to load.
Create an Input Manifest File for a 3D Point Cloud Labeling Job
• If you are creating a 3D point cloud object detection or semantic segmentation labeling job, each
line in your input manifest file contains information about a single 3D point cloud frame. This is called
a point cloud frame input manifest. To learn more, see Create a Point Cloud Frame Input Manifest
File (p. 748).
• If you are creating a 3D point cloud object tracking labeling job, each line of your input manifest file
contains a sequence of 3D point cloud frames and associated data. This is called a point cloud sequence
input manifest. To learn more, see Create a Point Cloud Sequence Input Manifest (p. 754).
Ground Truth supports point cloud and video camera sensor fusion in the world coordinate
system (p. 761) for all modalities. If you can obtain your 3D sensor extrinsic (like a LiDAR extrinsic),
we recommend that you transform 3D point cloud frames into the world coordinate system using the
extrinsic. For more information, see Sensor Fusion (p. 763).
However, if you cannot obtain a point cloud in world coordinate system, you can provide coordinates
in the original coordinate system that the data was captured in. If you are providing camera data for
sensor fusion, it is recommended that you provide LiDAR sensor and camera pose in the world coordinate
system.
Create a Point Cloud Frame Input Manifest File
To create a single-frame input manifest file, you will identify the location of each point cloud frame that
you want workers to label using the source-ref key. Additionally, you must use the source-ref-
metadata key to identify the format of your dataset, a timestamp for that frame, and, optionally, sensor
fusion data and video camera images.
The following example demonstrates the syntax used for an input manifest file for a single-frame point
cloud labeling job. The example includes two point cloud frames. For details about each parameter, see
the table following this example.
Important
Each line in your input manifest file must be in JSON Lines format. The following code block
shows an input manifest file with two JSON objects. Each JSON object is used to point to and
provide details about a single point cloud frame. The JSON objects have been expanded for
readability, but you must minimize each JSON object to fit on a single line when creating an
input manifest file. An example is provided under this code block.
{
"source-ref": "s3://awsexamplebucket/examplefolder/frame1.bin",
"source-ref-metadata":{
"format": "binary/xyzi",
"unix-timestamp": 1566861644.759115,
"ego-vehicle-pose":{
"position": {
"x": -2.7161461413869947,
"y": 116.25822288149078,
"z": 1.8348751887989483
},
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
}
},
"prefix": "s3://awsexamplebucket/lidar_singleframe_dataset/someprefix/",
"images": [
{
"image-path": "images/frame300.bin_camera0.jpg",
"unix-timestamp": 1566861644.759115,
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,
"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera-model": "pinhole"
}]
}
}
{
"source-ref": "s3://awsexamplebucket/examplefolder/frame2.bin",
"source-ref-metadata":{
"format": "binary/xyzi",
"unix-timestamp": 1566861632.759133,
"ego-vehicle-pose":{
"position": {
"x": -2.7161461413869947,
"y": 116.25822288149078,
"z": 1.8348751887989483
},
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
}
},
"prefix": "s3://awsexamplebucket/lidar_singleframe_dataset/someprefix/",
"images": [
{
"image-path": "images/frame300.bin_camera0.jpg",
"unix-timestamp": 1566861644.759115,
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,
"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera-model": "pinhole"
}]
}
}
When you create an input manifest file, you must collapse your JSON objects to fit on a single line. For
example, the code block above would appear as follows in an input manifest file:
{"source-ref":"s3://awsexamplebucket/examplefolder/frame1.bin","source-ref-metadata":
{"format":"binary/xyzi","unix-timestamp":1566861644.759115,"ego-vehicle-pose":{"position":
{"x":-2.7161461413869947,"y":116.25822288149078,"z":1.8348751887989483},"heading":
{"qx":-0.02111296123795955,"qy":-0.006495469416730261,"qz":-0.008024565904865688,"qw":0.999718119229808
awsexamplebucket/lidar_singleframe_dataset/someprefix/","images":
[{"image-path":"images/frame300.bin_camera0.jpg","unix-
timestamp":1566861644.759115,"fx":847.7962624528487,"fy":850.0340893791985,"cx":576.2129134707038,"cy":
{"x":-2.2722515189268138,"y":116.86003310568965,"z":1.454614668542299},"heading":
{"qx":0.7594754093069037,"qy":0.02181790885672969,"qz":-0.02461725233103356,"qw":-0.6496916273040025},"
model":"pinhole"}]}}
{"source-ref":"s3://awsexamplebucket/examplefolder/frame2.bin","source-ref-metadata":
{"format":"binary/xyzi","unix-timestamp":1566861632.759133,"ego-vehicle-pose":{"position":
{"x":-2.7161461413869947,"y":116.25822288149078,"z":1.8348751887989483},"heading":
{"qx":-0.02111296123795955,"qy":-0.006495469416730261,"qz":-0.008024565904865688,"qw":0.999718119229808
awsexamplebucket/lidar_singleframe_dataset/someprefix/","images":
[{"image-path":"images/frame300.bin_camera0.jpg","unix-
timestamp":1566861644.759115,"fx":847.7962624528487,"fy":850.0340893791985,"cx":576.2129134707038,"cy":
{"x":-2.2722515189268138,"y":116.86003310568965,"z":1.454614668542299},"heading":
{"qx":0.7594754093069037,"qy":0.02181790885672969,"qz":-0.02461725233103356,"qw":-0.6496916273040025},"
model":"pinhole"}]}}
The following table shows the parameters you can include in your input manifest file:
s3://<bucket-name>/<folder-name>/point-cloud-frame-file
Use the ego-vehicle location to provide information about the location of the vehicle used to capture
point cloud data. Ground Truth uses this information to compute the LiDAR extrinsic matrix.
Ground Truth uses extrinsic matrices to project labels to and from the 3D scene and 2D images. For more
information, see Sensor Fusion (p. 763).
The following table provides more information about the position and orientation (heading)
parameters that are required when you provide ego-vehicle information.
If you want to include video camera data with a frame, use the following parameters to provide
information about each image. The Required column below applies when the images parameter is
included in the input manifest file under source-ref-metadata. You are not required to include
images in your input manifest file.
If you include camera images, you must include information about the camera position and heading
used to capture the images in the world coordinate system.
If your images are distorted, Ground Truth can automatically undistort them using information you
provide about the image in your input manifest file, including distortion coefficients (k1, k2, k3, k4, p1,
p2), the camera model, and the camera intrinsic matrix. The intrinsic matrix is made up of the focal
length (fx, fy) and the principal point (cx, cy). See Intrinsic Matrix (p. 765) to learn how Ground Truth
uses the camera intrinsic matrix. If distortion coefficients are not included, Ground Truth does not
undistort an image.
"pinhole"
753
Amazon SageMaker Developer Guide
Use Input and Output Data
You can include up to 100,000 point cloud frames in your input manifest file. 3D point cloud labeling jobs
have longer pre-processing times than other Ground Truth task types. For more information, see Job Pre-
processing Time (p. 630).
Create a Point Cloud Sequence Input Manifest
The manifest is a UTF-8 encoded file in which each line is a complete and valid JSON object. Each line
is delimited by a standard line break, \n or \r\n. Because each line must be a valid JSON object, you
can't have unescaped line break characters. In the point cloud sequence input manifest file, each line
in the manifest contains a sequence of point cloud frames. The point cloud data for each frame in the
sequence can either be stored in binary or ASCII format. For more information, see Accepted Raw 3D
Data Formats (p. 746). This is the manifest file formatting required for 3D point cloud object tracking.
Optionally, you can also provide point attribute and camera sensor fusion data for each point cloud
frame. When you create a sequence input manifest file, you must provide LiDAR and video camera sensor
fusion data in a world coordinate system (p. 761).
The following example demonstrates the syntax used for an input manifest file when each line in the
manifest is a sequence file. Each line in your input manifest file must be in JSON Lines format.
{"source-ref": "s3://awsexamplebucket/example-folder/seq1.json"}
{"source-ref": "s3://awsexamplebucket/example-folder/seq2.json"}
The data for each sequence of point cloud frames needs to be stored in a JSON data object. The
following is an example of the format you use for a sequence file. Information about each frame is
included as a JSON object and is listed in the frames list. This is an example of a sequence file with two
point cloud frame files, frame300.bin and frame303.bin. The ... is used to indicate where you
should include information for additional frames. Add a JSON object for each frame in the sequence.
The following code block includes a JSON object for a single sequence file. The JSON object has been
expanded for readability.
{
"seq-no": 1,
"prefix": "s3://awsexamplebucket/example_lidar_sequence_dataset/seq1/",
"number-of-frames": 100,
"frames":[
{
"frame-no": 300,
"unix-timestamp": 1566861644.759115,
"frame": "example_lidar_frames/frame300.bin",
"format": "binary/xyzi",
"ego-vehicle-pose":{
"position": {
"x": -2.7161461413869947,
"y": 116.25822288149078,
"z": 1.8348751887989483
},
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
}
},
"images": [
{
"image-path": "example_images/frame300.bin_camera0.jpg",
"unix-timestamp": 1566861644.759115,
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,
"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera-model": "pinhole"
}]
},
{
"frame-no": 303,
"unix-timestamp": 1566861644.759115,
"frame": "example_lidar_frames/frame303.bin",
"format": "text/xyzi",
"ego-vehicle-pose":{...},
"images":[{...}]
},
...
]
}
The following table provides details about the top-level parameters of a sequence file. For detailed
information about the parameters required for individual frames in the sequence file, see Parameters for
Individual Point Cloud Frames (p. 756).
The following table shows the parameters you can include in your input manifest file.
Use the ego-vehicle location to provide information about the pose of the vehicle used to capture point
cloud data. Ground Truth uses this information to compute LiDAR extrinsic matrices.
Ground Truth uses extrinsic matrices to project labels to and from the 3D scene and 2D images. For more
information, see Sensor Fusion (p. 763).
The following table provides more information about the position and orientation (heading)
parameters that are required when you provide ego-vehicle information.
If you want to include color camera data with a frame, use the following parameters to provide
information about each image. The Required column in the following table applies when the images
parameter is included in the input manifest file. You are not required to include images in your input
manifest file.
If you include camera images, you must include information about the position and orientation
(heading) of the camera used to capture the images.
If your images are distorted, Ground Truth can automatically undistort them using information you
provide about the image in your input manifest file, including distortion coefficients (k1, k2, k3, k4, p1,
p2), the camera model, the focal length (fx, fy), and the principal point (cx, cy). To learn more about
these coefficients and undistorting images, see Camera calibration With OpenCV. If distortion
coefficients are not included, Ground Truth does not undistort an image.
You can include up to 100,000 point cloud frame sequences in your input manifest file. You can include
up to 500 point cloud frames in each sequence file.
Keep in mind that 3D point cloud labeling jobs have longer pre-processing times than other Ground Truth
task types. For more information, see Job Pre-processing Time (p. 630).
Understand Coordinate Systems and Sensor Fusion
This section explains the following:
• When you are required to provide input data in a world coordinate system or global frame of reference.
• What a world coordinate is and how you can convert point cloud data to a world coordinate system.
• How you can use your sensor and camera extrinsic matrices to provide pose data when using sensor
fusion.
If your point cloud data was collected in a local coordinate system, you can use an extrinsic matrix of the
sensor used to collect the data to convert it to a world coordinate system or a global frame of reference.
If you cannot obtain an extrinsic for your point cloud data and, as a result, cannot obtain point clouds in
a world coordinate system, you can provide point cloud data in a local coordinate system for 3D point
cloud object detection and semantic segmentation task types.
For object tracking, you must provide point cloud data in a world coordinate system. This is because
when you are tracking objects across multiple frames, the ego vehicle itself is moving in the world and so
all of the frames need a point of reference.
If you include camera data for sensor fusion, it is recommended that you provide camera poses in the
same world coordinate system as the 3D sensor (such as a LiDAR sensor).
This section explains what a world coordinate system (WCS), also referred to as a global frame of
reference, is and how you can provide point cloud data in a world coordinate system.
A WCS or global frame of reference is a fixed universal coordinate system in which vehicle and sensor
coordinate systems are placed. For example, if multiple point cloud frames are located in different
coordinate systems because they were collected from two sensors, a WCS can be used to translate
all of the coordinates in these point cloud frames into a single coordinate system, where all frames
have the same origin, (0,0,0). This transformation is done by translating the origin of each frame to
the origin of the WCS using a translation vector, and rotating the three axes (typically x, y, and z) to
the right orientation using a rotation matrix. This rigid body transformation is called a homogeneous
transformation.
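As a minimal numpy sketch of this homogeneous transformation (the rotation, translation, and point
values are illustrative):

import numpy as np

# Illustrative rotation matrix (identity) and translation vector for one frame
R = np.eye(3)
T = np.array([10.0, -5.0, 0.0])

# Build the 4 x 4 homogeneous (rigid body) transformation matrix [R T; 0 0 0 1]
transform = np.eye(4)
transform[:3, :3] = R
transform[:3, 3] = T

# points: N x 3 array of x, y, z coordinates in the frame's local coordinate system
points = np.array([[1.0, 2.0, 0.5],
                   [3.0, 0.0, 1.2]])

# Append a 1 to each point (homogeneous coordinates) and apply the transformation
homogeneous = np.hstack([points, np.ones((len(points), 1))])
points_wcs = (transform @ homogeneous.T).T[:, :3]
print(points_wcs)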
A world coordinate system is important in global path planning, localization, mapping, and driving
scenario simulations. Ground Truth uses the right-handed Cartesian world coordinate system such as the
one defined in ISO 8855, where the x axis is forward toward the car’s movement, y axis is left, and the z
axis points up from the ground.
The global frame of reference depends on the data. Some datasets use the LiDAR position in the first
frame as the origin. In this scenario, all the frames use the first frame as a reference and device heading
and position will be near the origin in the first frame. For example, KITTI datasets have the first frame as
a reference for world coordinates. Other datasets use a device position that is different from the origin.
Note that this is not the GPS/IMU coordinate system, which is typically rotated by 90 degrees along
the z-axis. If your point cloud data is in a GPS/IMU coordinate system (such as OxTS in the open source
AV KITTI dataset), then you need to transform the origin to a world coordinate system (typically the
vehicle's reference coordinate system). You apply this transformation by multiplying your data with
transformation matrices (the rotation matrix and translation vector). This will transform the data
from its original coordinate system to a global reference coordinate system. Learn more about this
transformation in the next section.
Ground Truth assumes that your point cloud data has already been transformed into a reference
coordinate system of your choice. For example, you can choose the reference coordinate system of the
sensor (such as LiDAR) as your global reference coordinate system. You can also take point clouds from
various sensors and transform them from the sensor's view to the vehicle's reference coordinate system
view. You use a sensor's extrinsic matrix, made up of a rotation matrix and translation vector, to
convert your point cloud data to a WCS or global frame of reference.
Collectively, the translation vector and rotation matrix can be used to make up an extrinsic matrix, which
can be used to convert data from a local coordinate system to a WCS. For example, your LiDAR extrinsic
matrix may be composed as follows, where R is the rotation matrix and T is the translation vector:
LiDAR_extrinsic = [R T;0 0 0 1]
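As a minimal numpy sketch of this composition (R and T below are hypothetical example values), note that the inverse of the resulting matrix maps world coordinates back into the sensor's local frame:

import numpy as np

# Hypothetical rotation matrix (90 degrees about z) and translation vector.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([[2.0], [0.0], [1.2]])  # sensor position in the WCS

# LiDAR_extrinsic = [R T; 0 0 0 1]
lidar_extrinsic = np.block([[R, T],
                            [np.zeros((1, 3)), np.ones((1, 1))]])

# The inverse maps world coordinates back into the LiDAR frame.
world_to_lidar = np.linalg.inv(lidar_extrinsic)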
For example, the autonomous driving KITTI dataset includes a rotation matrix and translation vector for the LiDAR extrinsic transformation matrix for each frame. The pykitti Python module can be used to load the KITTI data; in the dataset, dataset.oxts[i].T_w_imu gives the LiDAR extrinsic transform for the i-th frame, which can be multiplied with points in that frame to convert them to a world frame: np.matmul(lidar_transform_matrix, points). Multiplying a point in the LiDAR frame by the LiDAR extrinsic matrix transforms it into world coordinates. Multiplying a point in the world frame by the camera inverse extrinsic matrix gives the point coordinates in the camera's frame of reference.
The following code example demonstrates how you can convert point cloud frames from the KITTI
dataset into a WCS.
import pykitti
import numpy as np

basedir = '/Users/nameofuser/kitti-data'
date = '2011_09_26'
drive = '0079'

# The 'frames' argument is optional - default: None, which loads the whole dataset.
# Calibration, timestamps, and IMU data are read automatically.
# Camera and velodyne data are available via properties that create generators
# when accessed, or through getter methods that provide random access.
data = pykitti.raw(basedir, date, drive, frames=range(0, 50, 5))

# i is frame number
i = 0

# The remainder of this example is a sketch of the intended logic: load the
# LiDAR extrinsic and the raw point cloud, then transform each point into
# the WCS.
lidar_extrinsic_matrix = data.oxts[i].T_w_imu

# Velodyne point cloud in the LiDAR scanner's own coordinate system
points = data.get_velo(i)

def generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix):
    tps = []
    for point in points:
        transformed_points = np.matmul(
            lidar_extrinsic_matrix,
            np.array([point[0], point[1], point[2], 1], dtype=np.float32).reshape(4, 1)
        ).tolist()
        if len(point) > 3 and point[3] is not None:
            tps.append([transformed_points[0][0], transformed_points[1][0],
                        transformed_points[2][0], point[3]])
    return tps

# Transform points from the LiDAR frame to the world frame
transformed_pcl = generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix)
Sensor Fusion
Ground Truth supports sensor fusion of point cloud data with up to 8 video camera inputs. This feature allows human labelers to view the 3D point cloud frame side by side with the synchronized video frame. In addition to providing more visual context for labeling, sensor fusion allows workers to adjust annotations in the 3D scene and in 2D images, and those adjustments are projected into the other view. (The online version of this section includes a video demonstrating a 3D point cloud labeling job with LiDAR and camera sensor fusion.)
For best results, when using sensor fusion, your point cloud should be in a WCS. Ground Truth uses your
sensor (such as LiDAR), camera, and ego vehicle pose information to compute extrinsic and intrinsic
matrices for sensor fusion.
Extrinsic Matrix
Ground Truth uses sensor (such as LiDAR) extrinsic matrices and camera extrinsic and intrinsic matrices to project labels between the point cloud data's frame of reference and the camera's frame of reference.
For example, in order to project a label from the 3D point cloud to the camera image plane, Ground Truth transforms 3D points from the LiDAR's own coordinate system to the camera's coordinate system. This is typically done by first transforming 3D points from the LiDAR's own coordinate system to a world coordinate system (or a global reference frame) using the LiDAR extrinsic matrix. Ground Truth then uses the camera inverse extrinsic (which converts points from a global frame of reference to the camera's frame of reference) to transform the 3D points from the world coordinate system obtained in the previous step onto the camera image plane. The LiDAR extrinsic matrix can also be used to transform 3D data into a world coordinate system. If your 3D data is already transformed into the world coordinate system, the first transformation doesn't have any impact on label translation, and label translation depends only on the camera inverse extrinsic. A view matrix is used to visualize projected labels. To learn more about these transformations and the view matrix, see Ground Truth Sensor Fusion Transformations (p. 769).
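The following sketch illustrates this chain of transformations. It assumes you already have 4x4 LiDAR and camera extrinsic matrices; the function name and arguments are illustrative, not a Ground Truth API.

import numpy as np

def lidar_point_to_camera_frame(point_lidar, lidar_extrinsic, camera_extrinsic):
    # Both extrinsics are assumed to be 4x4 matrices that map their
    # sensor's local coordinates into the world coordinate system.
    p = np.append(point_lidar, 1.0)                # homogeneous coordinates
    p_world = lidar_extrinsic @ p                  # LiDAR frame -> world frame
    camera_inverse_extrinsic = np.linalg.inv(camera_extrinsic)
    p_camera = camera_inverse_extrinsic @ p_world  # world frame -> camera frame
    return p_camera[:3]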
Ground Truth computes these extrinsic matrices by using the LiDAR and camera pose data that you provide: heading (in quaternions: qx, qy, qz, and qw) and position (x, y, z). For the vehicle, the heading and position are typically described in the vehicle's reference frame in a world coordinate system and are called an ego vehicle pose. For each camera extrinsic, you can add pose information for that camera. For more information, see Pose (p. 766).
Intrinsic Matrix
Ground Truth uses the camera extrinsic and intrinsic matrices to compute view matrices to transform labels to and from the 3D scene and camera images. Ground Truth computes the camera intrinsic matrix using the camera focal length (fx, fy) and optical center coordinates (cx, cy) that you provide. For more information, see Intrinsic and Distortion (p. 769).
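For illustration, an intrinsic matrix built from these four parameters takes the standard pinhole form shown in the following sketch; the numeric values are assumptions, not defaults.

import numpy as np

# Example focal lengths and optical center from a hypothetical calibration.
fx, fy = 847.7, 850.1
cx, cy = 616.4, 368.2

# Standard pinhole camera intrinsic calibration matrix.
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])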
Image Distortion
Image distortion can occur for a variety of reasons. For example, images may be distorted due to barrel or fisheye effects. Ground Truth uses intrinsic parameters along with distortion coefficients to undistort images you provide when creating 3D point cloud labeling jobs. If a camera image has already been undistorted, all distortion coefficients should be set to 0.
For more information about the transformations Ground Truth performs to undistort images, see
Camera Calibrations: Extrinsic, Intrinsic and Distortion (p. 769).
Ego Vehicle
To collect data for autonomous driving applications, the measurements used to generate point cloud data are taken from sensors mounted on a vehicle, called the ego vehicle. To project label adjustments to and from the 3D scene and 2D images, Ground Truth needs your ego vehicle pose in a world coordinate system. The ego vehicle pose consists of position coordinates and an orientation quaternion.
Ground Truth uses your ego vehicle pose to compute rotation and transformation matrices. Rotations in 3 dimensions can be represented by a sequence of 3 rotations around a sequence of axes. In theory, any three axes spanning the 3D Euclidean space are enough. In practice, the axes of rotation are chosen to be the basis vectors. The three rotations are expected to be in a global frame of reference (extrinsic). Ground Truth does not support a body-centered frame of reference (intrinsic), which is attached to, and moves with, the object under rotation. To track objects, Ground Truth needs to measure from a global reference in which all vehicles are moving. When using Ground Truth 3D point cloud labeling jobs, z specifies the axis of rotation (extrinsic rotation) and yaw Euler angles are in radians (rotation angle).
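As a sketch, an extrinsic rotation about the z axis by a yaw angle in radians can be expressed as a quaternion using scipy (in scipy, lowercase axis letters select extrinsic rotations):

import numpy as np
from scipy.spatial.transform import Rotation as R

yaw = np.pi / 4  # rotation angle in radians about the global z axis

r = R.from_euler('z', yaw)    # lowercase 'z' denotes an extrinsic rotation
qx, qy, qz, qw = r.as_quat()  # scipy returns quaternions as [x, y, z, w]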
Pose
Ground Truth uses pose information for 3D visualizations and sensor fusion. Pose information you input
through your manifest file is used to compute extrinsic matrices. If you already have an extrinsic matrix,
you can use it to extract sensor and camera pose data.
For example, in the autonomous driving KITTI dataset, the pykitti Python module can be used to load the KITTI data. In the dataset, dataset.oxts[i].T_w_imu gives the LiDAR extrinsic transform for the i-th frame, and it can be multiplied with the points to get them in a world frame: matmul(lidar_transform_matrix, points). This transform can be converted into the position (translation vector) and heading (in quaternion) of the LiDAR for the input manifest file's JSON format. The camera extrinsic transform for cam0 in the i-th frame can be calculated by inv(matmul(dataset.calib.T_cam0_velo, inv(dataset.oxts[i].T_w_imu))), and this can be converted into heading and position for cam0.
import numpy as np
from scipy.spatial.transform import Rotation as R

# The rotation values below are assumed for illustration; they match the
# 4x4 extrinsic transformation example later in this section.
rotation = [[ 9.96714314e-01, -8.09890350e-02,  1.16333982e-03],
            [ 8.09967396e-02,  9.96661051e-01, -1.03090934e-02],
            [-3.24531964e-04,  1.03694477e-02,  9.99946183e-01]]

origin = [1.71104606e+00,
          5.80000039e-01,
          9.43144935e-01]

# position is the origin translation vector; heading is the rotation
# expressed as a quaternion in the WCS.
position = origin
heading = R.from_matrix(np.asarray(rotation)).as_quat()
Position
In the input manifest file, position refers to the position of the sensor with respect to the world frame. If you are unable to put the device position in a world coordinate system, you can use LiDAR data with local coordinates. Similarly, for mounted video cameras, you can specify the position and heading in a world coordinate system. For cameras, if you do not have position information, use (0, 0, 0).
{
    "position": {
        "y": -152.77584902657554,
        "x": 311.21505956090624,
        "z": -10.854137529636024
    }
}
Heading
In the input manifest file, heading is an object that represents the orientation of a device with respect to the world frame. Heading values should be quaternions. A quaternion is a representation of orientation consistent with geodesic spherical properties. If you are unable to put the sensor heading in world coordinates, use the identity quaternion (qx = 0, qy = 0, qz = 0, qw = 1). Similarly, for cameras, specify the heading in quaternions. If you are unable to obtain extrinsic camera calibration parameters, also use the identity quaternion.
{
    "heading": {
        "qy": -0.7046155108831117,
        "qx": 0.034278837280808494,
        "qz": 0.7070617895701465,
        "qw": -0.04904659893885366
    }
}
To learn more, see Compute Orientation Quaternions and Position (p. 767).
Ground Truth requires that all orientation, or heading, data be given in quaternions. A quaternion is a representation of orientation consistent with geodesic spherical properties that can be used to approximate a rotation. Compared to Euler angles, quaternions are simpler to compose and avoid the problem of gimbal lock. Compared to rotation matrices, they are more compact, more numerically stable, and more efficient.
If you have a rotation matrix (made up of the axis rotations) and a translation vector (or origin) in the world coordinate system instead of a single 4x4 rigid transformation matrix, you can directly use the rotation matrix and translation vector to compute quaternions. Libraries like scipy and pyquaternion can help. The following code block shows an example using scipy to compute a quaternion from a rotation matrix.
import numpy as np
from scipy.spatial.transform import Rotation as R

# The rotation matrix shown here is assumed from the 4x4 extrinsic
# example that follows.
rotation = [[ 9.96714314e-01, -8.09890350e-02,  1.16333982e-03],
            [ 8.09967396e-02,  9.96661051e-01, -1.03090934e-02],
            [-3.24531964e-04,  1.03694477e-02,  9.99946183e-01]]

origin = [1.71104606e+00,
          5.80000039e-01,
          9.43144935e-01]

# position is the origin translation vector
position = origin
r = R.from_matrix(np.asarray(rotation))
# heading in WCS using scipy
heading = r.as_quat()
print(f"position:{position}\nheading: {heading}")
If you have a 4x4 extrinsic transformation matrix, note that the transformation matrix is in the form [R T; 0 0 0 1] where R is the rotation matrix and T is the origin translation vector. That means you can extract the rotation matrix and translation vector from the transformation matrix as follows.
import numpy as np

transformation = np.array([
    [ 9.96714314e-01, -8.09890350e-02,  1.16333982e-03,  1.71104606e+00],
    [ 8.09967396e-02,  9.96661051e-01, -1.03090934e-02,  5.80000039e-01],
    [-3.24531964e-04,  1.03694477e-02,  9.99946183e-01,  9.43144935e-01],
    [ 0,               0,               0,               1]])

# Index rows and columns together; transformation[0:3][0:3] would select
# rows only, which is a common bug.
rotation = transformation[0:3, 0:3]
translation = transformation[0:3, 3]
With your own setup, you can compute an extrinsic transformation matrix using the GPS/IMU position and orientation (latitude, longitude, altitude and roll, pitch, yaw) with respect to the LiDAR sensor on the ego vehicle. For example, you can compute pose from KITTI raw data using pose = convertOxtsToPose(oxts) to transform the OxTS data into local Euclidean poses, specified by 4x4 rigid transformation matrices. You can then transform this pose transformation matrix to a global reference frame using the reference frame's transformation matrix in the world coordinate system.
If you have Euler angles (yaw, pitch, roll) instead, you can convert them to a quaternion directly; the following C++ example shows one common conversion.

#include <cmath>

struct Quaternion
{
    double w, x, y, z;
};

Quaternion ToQuaternion(double yaw, double pitch, double roll) // yaw (Z), pitch (Y), roll (X)
{
    // Abbreviations for the various angular functions
    double cy = cos(yaw * 0.5);
    double sy = sin(yaw * 0.5);
    double cp = cos(pitch * 0.5);
    double sp = sin(pitch * 0.5);
    double cr = cos(roll * 0.5);
    double sr = sin(roll * 0.5);

    Quaternion q;
    q.w = cr * cp * cy + sr * sp * sy;
    q.x = sr * cp * cy - cr * sp * sy;
    q.y = cr * sp * cy + sr * cp * sy;
    q.z = cr * cp * sy - sr * sp * cy;
    return q;
}
LiDAR Extrinsic
In order to project to and from a 3D LiDAR scene to a 2D camera image, Ground Truth computes rigid transformation projection matrices using the ego vehicle pose and heading. Ground Truth computes the rotation and translation of world coordinates into the 3D plane by doing a simple sequence of rotations and translations.
Ground Truth computes the rotation matrix from the heading quaternion. The standard quaternion-to-rotation-matrix formula it follows is:

R = | 1 - 2(qy² + qz²)    2(qx·qy - qz·qw)    2(qx·qz + qy·qw) |
    | 2(qx·qy + qz·qw)    1 - 2(qx² + qz²)    2(qy·qz - qx·qw) |
    | 2(qx·qz - qy·qw)    2(qy·qz + qx·qw)    1 - 2(qx² + qy²) |

Here, [qx, qy, qz, qw] corresponds to the parameters in the heading JSON object. Ground Truth computes the translation column vector as T = [poseX, poseY, poseZ]. The extrinsic matrix is then simply as follows:
LiDAR_extrinsic = [R T; 0 0 0 1]
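A minimal sketch of the same computation using scipy, with assumed pose values in the manifest format described in the Pose section:

import numpy as np
from scipy.spatial.transform import Rotation as R

# Example heading and position, in the input manifest format.
heading = {"qx": 0.0, "qy": 0.0, "qz": 0.3826834, "qw": 0.9238795}
position = {"x": 311.21, "y": -152.77, "z": -10.85}

# scipy expects quaternions in [x, y, z, w] order.
rot = R.from_quat([heading["qx"], heading["qy"], heading["qz"], heading["qw"]])

# LiDAR_extrinsic = [R T; 0 0 0 1]
lidar_extrinsic = np.eye(4)
lidar_extrinsic[0:3, 0:3] = rot.as_matrix()
lidar_extrinsic[0:3, 3] = [position["x"], position["y"], position["z"]]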
Camera Extrinsic
If the camera pose is given, then Ground Truth computes the camera extrinsic based on a rigid transformation from the 3D plane into the camera plane. The calculation is the same as the one used for the LiDAR Extrinsic (p. 769), except that Ground Truth uses the camera pose (position and heading) and computes the inverse extrinsic.
There are two types of distortion Ground Truth can correct for: radial distortion and tangential
distortion.
Radial distortion occurs when light rays bend more near the edges of a lens than they do at its optical center. The smaller the lens, the greater the distortion. The presence of radial distortion manifests in the form of the barrel or fisheye effect, and Ground Truth uses Formula 1 to undistort it.
Formula 1 (reconstructed here as the standard radial model, with normalized image coordinates (x, y) and r² = x² + y²):

x' = x (1 + k1 r² + k2 r⁴ + k3 r⁶)
y' = y (1 + k1 r² + k2 r⁴ + k3 r⁶)

Tangential distortion occurs because the lenses used to take the images are not perfectly parallel to the imaging plane. This can be corrected with Formula 2.
Formula 2 (again following the standard model):

x' = x + (2 p1 x y + p2 (r² + 2 x²))
y' = y + (p1 (r² + 2 y²) + 2 p2 x y)
In the input manifest file, you can provide distortion coefficients and Ground Truth will undistort your
images. All distortion coefficients are floats.
• k1, k2, k3, k4 – Radial distortion coefficients. Supported for both fisheye and pinhole camera models.
• p1, p2 – Tangential distortion coefficients. Supported for pinhole camera models.
If images are already undistorted, all distortion coefficients should be 0 in your input manifest.
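For illustration, the following sketch applies the standard pinhole radial and tangential model described by Formulas 1 and 2 to normalized image coordinates. It is an interpretation of those formulas, not Ground Truth's internal implementation.

def apply_pinhole_distortion(x, y, k1, k2, k3, p1, p2):
    # Radial term (Formula 1) and tangential term (Formula 2),
    # for normalized image coordinates (x, y).
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_d, y_d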
In order to correctly reconstruct the corrected image, Ground Truth does a unit conversion of the images based on the focal lengths. If a common focal length with a given aspect ratio for both axes (such as 1) is used, the formula has a single focal length. The matrix containing the four parameters (fx, fy, cx, cy) is referred to as the camera intrinsic calibration matrix.
While the distortion coefficients are the same regardless of the camera resolution used, the intrinsic parameters should be scaled from the calibrated resolution to the current resolution.
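A sketch of that scaling, under the assumption that the intrinsic parameters scale linearly with image resolution while the distortion coefficients stay fixed:

def scale_intrinsics(fx, fy, cx, cy, calibrated_wh, current_wh):
    # Scale focal lengths and optical center from the calibrated
    # resolution (width, height) to the current resolution.
    sx = current_wh[0] / calibrated_wh[0]
    sy = current_wh[1] / calibrated_wh[1]
    return fx * sx, fy * sy, cx * sx, cy * sy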
Ground Truth uses the camera extrinsic and camera intrinsic to compute view matrices that transform labels between the 3D scene and 2D images.
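A minimal sketch of how such a view matrix could be assembled from a 3x3 intrinsic matrix and a 4x4 camera extrinsic, under the conventions in this section; this is an illustration, not Ground Truth's internal code.

import numpy as np

def compute_view_matrix(intrinsic_3x3, camera_extrinsic_4x4):
    # Project world coordinates into the image plane: apply the camera's
    # inverse extrinsic (world -> camera), then the intrinsic matrix.
    world_to_camera = np.linalg.inv(camera_extrinsic_4x4)
    projection = np.hstack([intrinsic_3x3, np.zeros((3, 1))])  # 3x4
    return projection @ world_to_camera  # 3x4 view matrix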
Video Frame Input Data
For both of these options, you can use the Automated data setup option in the Ground Truth section
of the Amazon SageMaker console to set up a connection between Ground Truth and your input data in
Amazon S3 so that Ground Truth knows where to look for your input data when creating your labeling
tasks. This creates and stores an input manifest file in your Amazon S3 input dataset location. To learn
more, see Automated Video Frame Input Data Setup (p. 773).
Alternatively, you can manually create sequence files for each sequence of video frames that you want labeled and provide the Amazon S3 location of an input manifest file that references each of these sequence files using the source-ref key. To learn more, see Create a Video Frame Input Manifest File (p. 775).
Topics
• Choose Video Files or Video Frames for Input Data (p. 771)
• Input Data Setup (p. 772)
Provide Video Frames
Video frames are sequences of images extracted from a video file. You can create a Ground Truth
labeling job to have workers label multiple sequences of video frames. Each sequence is made up of
images extracted from a single video.
To create a labeling job using video frame sequences, you must store each sequence using a unique key
name prefix in Amazon S3. In the Amazon S3 console, key name prefixes are folders. So in the Amazon
S3 console, each sequence of video frames must be located in its own folder in Amazon S3.
For example, if you have two sequences of video frames, you might use the key name prefixes
sequence1/ and sequence2/ to identify your sequences. In this example, your sequences may be
located in s3://DOC-EXAMPLE-BUCKET/video-frames/sequence1/ and s3://DOC-EXAMPLE-
BUCKET/video-frames/sequence2/.
If you are using the Ground Truth console to create an input manifest file, all of the sequence key name
prefixes should be in the same location in Amazon S3. For example, in the Amazon S3 console, each
sequence could be in a folder in s3://DOC-EXAMPLE-BUCKET/video-frames/. In this example,
your first sequence of video frames (images) may be located in s3://DOC-EXAMPLE-BUCKET/video-
frames/sequence1/ and your second sequence may be located in s3://DOC-EXAMPLE-BUCKET/
video-frames/sequence2/.
Important
Even if you only have a single sequence of video frames that you want workers to label, that
sequence must have a key name prefix in Amazon S3. If you are using the Amazon S3 console,
this means that your sequence is located in a folder. It cannot be located in the root of your S3
bucket.
When creating worker tasks using sequences of video frames, Ground Truth uses one sequence per task.
In each task, Ground Truth orders your video frames using UTF-8 binary order.
For example, video frames might be stored in Amazon S3 in the following order: 0001.jpg, 0002.jpg, 0003.jpg, ..., 0011.jpg. They are arranged in the same order in the worker's task: 0001.jpg, 0002.jpg, 0003.jpg, ..., 0011.jpg.
Frames might instead use a naming convention like frame1.jpg, frame2.jpg, ..., frame10.jpg, frame11.jpg. In this case, frame10.jpg and frame11.jpg come before frame2.jpg in the worker task. Your worker sees your video frames in the following order: frame1.jpg, frame10.jpg, frame11.jpg, frame2.jpg, ..., frame9.jpg.
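You can reproduce this ordering with Python's default string sort, which matches UTF-8 binary order for file names like these. Zero-padding the frame numbers keeps binary order and numeric order aligned.

frames = ["frame1.jpg", "frame2.jpg", "frame10.jpg", "frame11.jpg"]
print(sorted(frames))
# ['frame1.jpg', 'frame10.jpg', 'frame11.jpg', 'frame2.jpg']

padded = ["frame0001.jpg", "frame0002.jpg", "frame0010.jpg", "frame0011.jpg"]
print(sorted(padded))
# ['frame0001.jpg', 'frame0002.jpg', 'frame0010.jpg', 'frame0011.jpg']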
Provide Video Files
You can use the Ground Truth frame splitting feature when creating a new labeling job in the console to extract video frames from video files (MP4 files). A series of video frames extracted from a single video file is referred to as a sequence of video frames.
You can either have Ground Truth automatically extract all frames, up to 2,000, from the video, or you can specify a frequency for frame extraction. For example, you can have Ground Truth extract every 10th frame from your videos.
You can provide up to 50 videos when you use automated data setup to extract frames; however, your input manifest file cannot reference more than 10 video frame sequence files when you create a video frame object tracking or video frame object detection labeling job. If you use the automated data setup console tool to extract video frames from more than 10 video files, you need to modify the manifest file the tool generates, or create a new one, to include 10 video frame sequence files or fewer. To learn more about these quotas, see 3D Point Cloud and Video Frame Labeling Job Quotas (p. 744).
To use the video frame extraction tool, see Automated Video Frame Input Data Setup (p. 773).
When all of your video frames have been successfully extracted from your videos, you will see the
following in your S3 input dataset location:
• A key name prefix (a folder in the Amazon S3 console) named after each video. Each of these prefixes
leads to:
• A sequence of video frames extracted from the video used to name that prefix.
• A sequence file used to identify all of the images that make up that sequence.
• An input manifest file with a .manifest extension. This identifies all of the sequence files that will be
used to create your labeling job.
All of the frames extracted from a single video file are used for a labeling task. If you extract video
frames from multiple video files, multiple tasks are created for your labeling job, one for each sequence
of video frames.
Ground Truth stores each sequence of video frames that it extracts in your Amazon S3 location for input
datasets using a unique key name prefix. In the Amazon S3 console, key name prefixes are folders.
Input Data Setup
• You can store your input data in Amazon S3 and have Ground Truth automatically detect the input
dataset used for your labeling job. See Automated Video Frame Input Data Setup (p. 773) to learn
more about this option.
• You can create an input manifest file and sequence files and upload them to Amazon S3. See Manual
Input Data Setup (p. 775) to learn more about this option.
Topics
• Automated Video Frame Input Data Setup (p. 773)
• Manual Input Data Setup (p. 775)
You can use the Ground Truth automated data setup to automatically detect video files in your Amazon
S3 bucket and extract video frames from those files. To learn how, see Provide Video Files (p. 772).
If you already have video frames in Amazon S3, you can use the automated data setup to use these video
frames in your labeling job. For this option, all video frames from a single video must be stored using a
unique prefix. To learn about the requirements to use this option, see Provide Video Frames (p. 771).
Select one of the following sections to learn how to set up your automatic input dataset connection with
Ground Truth.
Use the following procedure to connect your video files with Ground Truth and automatically extract
video frames from those files for video frame object detection and object tracking labeling jobs.
Note
If you use the automated data setup console tool to extract video frames from more than 10
video files, you will need to modify the manifest file the tool generates or create a new one to
include 10 video frame sequence files or fewer. To learn more, see Provide Video Files (p. 772).
Make sure your video files are stored in an Amazon S3 bucket in the same AWS Region that you perform
the automated data setup in.
Automatically connect your video files in Amazon S3 with Ground Truth and extract video
frames:
1. Navigate to the Create labeling job page in the Amazon SageMaker console: https://
console.aws.amazon.com/sagemaker/groundtruth.
Your input and output S3 buckets must be located in the same AWS Region that you create your
labeling job in. This link puts you in the North Virginia (us-east-1) AWS Region. If your input data is
in an Amazon S3 bucket in another Region, switch to that Region. To change your AWS Region, on
the navigation bar, choose the name of the currently displayed Region.
2. Select Create labeling job.
3. Enter a Job name.
4. In the section Input data setup, select Automated data setup.
5. Enter an Amazon S3 URI for S3 location for input datasets. An S3 URI looks like the following:
s3://DOC-EXAMPLE-BUCKET/path-to-files/. This URI should point to the Amazon S3 location
where your video files are stored.
6. Specify your S3 location for output datasets. This is where your output data is stored. You can choose to store your output data in the Same location as input dataset, or choose Specify a new location and enter the S3 URI of the location where you want to store your output data.
7. Choose Video Files for your Data type using the dropdown list.
8. Choose Yes, extract frames for object tracking and detection tasks.
9. Choose a method of Frame extraction.
• When you choose Use all frames extracted from the video to create a labeling task, Ground Truth extracts all frames from each video in your S3 location for input datasets, up to 2,000 frames. If a video in your input dataset contains more than 2,000 frames, the first 2,000 are extracted and used for that labeling task.
• When you choose Use every x frame from a video to create a labeling task, Ground Truth extracts every xth frame from each video in your S3 location for input datasets.
For example, if your video is 2 seconds long and has a frame rate of 30 frames per second, there are 60 frames in your video. If you specify 10 here, Ground Truth extracts every 10th frame from your video. This means the 1st, 10th, 20th, 30th, 40th, 50th, and 60th frames are extracted.
10. Choose or create an IAM execution role. Make sure that this role has permission to access your
Amazon S3 locations for input and output data specified in steps 5 and 6.
11. Select Complete data setup.
Use the following procedure to connect your sequences of video frames with Ground Truth for video
frame object detection and object tracking labeling jobs.
Make sure your video frames are stored in an Amazon S3 bucket in the same AWS Region that you
perform the automated data setup in. Each sequence of video frames should have a unique prefix.
For example, if you have two sequences stored in s3://DOC-EXAMPLE-BUCKET/video-frames/
sequences/, each should have a unique prefix like sequence1 and sequence2, and both should be located directly under the /sequences/ prefix. In this example, the locations of these two sequences are s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence1/ and s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence2/.
1. Navigate to the Create labeling job page in the Amazon SageMaker console: https://
console.aws.amazon.com/sagemaker/groundtruth.
Your input and output S3 buckets must be located in the same AWS Region that you create your
labeling job in. This link puts you in the North Virginia (us-east-1) AWS Region. If your input data is
in an Amazon S3 bucket in another Region, switch to that Region. To change your AWS Region, on
the navigation bar, choose the name of the currently displayed Region.
2. Select Create labeling job.
3. Enter a Job name.
4. In the section Input data setup, select Automated data setup.
5. Enter an Amazon S3 URI for S3 location for input datasets.
This should be the Amazon S3 location where your sequences are stored. For example, if you have
two sequences stored in s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence1/,
s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence2/, enter s3://DOC-
EXAMPLE-BUCKET/video-frames/sequences/ here.
6. Specify your S3 location for output datasets. This is where your output data is stored. You can choose to store your output data in the Same location as input dataset, or choose Specify a new location and enter the S3 URI of the location where you want to store your output data.
7. Choose Video frames for your Data type using the dropdown list.
8. Choose or create an IAM execution role. Make sure that this role has permission to access your
Amazon S3 locations for input and output data specified in steps 5 and 6.
9. Select Complete data setup.
These procedures create an input manifest in the Amazon S3 location for input datasets that you specified in step 5. If you are creating a labeling job using the SageMaker API, AWS CLI, or an AWS SDK, use the Amazon S3 URI for this input manifest file as input to the parameter ManifestS3Uri.
Choose the manual data setup option if you have created sequence files for each of your video frame
sequences, and a manifest file listing references to those sequence files.
Ground Truth uses the input manifest file to identify the location of your input dataset when creating
labeling tasks. For video frame object detection and object tracking labeling jobs, each line in the input
manifest file identifies the location of a video frame sequence file. Each sequence file identifies the
images included in a single sequence of video frames.
Use this page to learn how to create a video frame sequence file and an input manifest file for video
frame object tracking and object detection labeling jobs.
If you want Ground Truth to automatically generate your sequence files and input manifest file, see
Automated Video Frame Input Data Setup (p. 773).
In the video frame sequence input manifest file, each line in the manifest is a JSON object, with a
"source-ref" key that references a sequence file. Each sequence file identifies the location of a
sequence of video frames. This is the manifest file formatting required for all video frame labeling jobs.
The following example demonstrates the syntax used for an input manifest file:
{"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-folder/seq1.json"}
{"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-folder/seq2.json"}
The data for each sequence of video frames needs to be stored in a JSON data object. The following is an
example of the format you use for a sequence file. Information about each frame is included as a JSON
object and is listed in the frames list. The following JSON has been expanded for readability.
{
    "seq-no": 1,
    "prefix": "s3://mybucket/prefix/video1/",
    "number-of-frames": 3,
    "frames": [
        {"frame-no": 1, "unix-timestamp": 1566861644, "frame": "frame0001.jpg"},
        {"frame-no": 2, "unix-timestamp": 1566861644, "frame": "frame0002.jpg"},
        {"frame-no": 3, "unix-timestamp": 1566861644, "frame": "frame0003.jpg"}
    ]
}
The parameters shown in this code example are:
• seq-no – The ordinal number of the sequence.
• prefix – The Amazon S3 location where the video frames in the sequence are stored.
• number-of-frames – The total number of frames in the sequence.
• frames – A list of JSON objects, one for each frame in the sequence.
• frame-no – The frame's order in the sequence.
• unix-timestamp – The Unix timestamp of the frame.
• frame – The file name of the frame image.
Output Data
The output from a labeling job is placed in the Amazon S3 location that you specified in the console or in
the call to the CreateLabelingJob operation. Output data appears in this location when the workers have
submitted one or more tasks, or when tasks expire. Note that it may take a few minutes for output data
to appear in Amazon S3 after the worker submits the task or the task expires.
Each line in the output data file is identical to the manifest file with the addition of an attribute
and value for the label assigned to the input object. The attribute name for the value is defined in
the console or in the call to the CreateLabelingJob operation. You can't use -metadata in the
label attribute name. If you are running an image semantic segmentation, 3D point cloud semantic
segmentation, or 3D point cloud object tracking job, the label attribute must end with -ref. For any
other type of job, the attribute name can't end with -ref.
The output of the labeling job is the value of the key-value pair with the label. The label and its value overwrite any existing JSON data in the input file with the new value.
For example, the following is the output from an image classification labeling job where the input data files were stored in an Amazon S3 bucket named AWSDOC-EXAMPLE-BUCKET and the label attribute name was defined as sport. In this example, the JSON object is formatted for readability; in the actual output file, the JSON object is on a single line. For more information about the data format, see JSON Lines.
{
    "source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/image_example.png",
    "sport": 0,
    "sport-metadata": {
        "class-name": "football",
        "confidence": 0.00,
        "type": "groundtruth/image-classification",
        "job-name": "identify-sport",
        "human-annotated": "yes",
        "creation-date": "2018-10-18T22:18:13.527256"
    }
}
The value of the label can be any valid JSON. In this case the label's value is the index of the class in the
classification list. Other job types, such as bounding box, have more complex values.
Any key-value pair in the input manifest file other than the label attribute is unchanged in the output
file. You can use this to pass data to your application.
The output from a labeling job can be used as the input to another labeling job. You can use this when
you are chaining together labeling jobs. For example, you can send one labeling job to determine the
sport that is being played. Then you send another using the same data to determine if the sport is being
played indoors or outdoors. By using the output data from the first job as the manifest for the second
job, you can consolidate the results of the two jobs into one output file for easier processing by your
applications.
The output data file is written to the output location periodically while the job is in progress. These
intermediate files contain one line for each line in the manifest file. If an object is labeled, the label is
included. If the object hasn't been labeled, it is written to the intermediate output file identically to the
manifest file.
Output Directories
Ground Truth creates several directories in your Amazon S3 output path. These directories contain the
results of your labeling job and other artifacts of the job. The top-level directory for a labeling job is
given the same name as your labeling job; the output directories are placed beneath it. For example, if
you named your labeling job find-people, your output would be in the following directories:
s3://AWSDOC-EXAMPLE-BUCKET/find-people/activelearning
s3://AWSDOC-EXAMPLE-BUCKET/find-people/annotations
s3://AWSDOC-EXAMPLE-BUCKET/find-people/inference
s3://AWSDOC-EXAMPLE-BUCKET/find-people/manifests
s3://AWSDOC-EXAMPLE-BUCKET/find-people/training
The activelearning directory is only present when you are using automated data labeling. It contains
the input and output validation set for automated data labeling, and the input and output folder for
automatically labeled data.
Annotations Directory
The annotations directory contains all of the annotations made by the workforce. These are the responses from individual workers that have not been consolidated into a single label for the data object. There are three subdirectories in the annotations directory.
• The first, worker-response, contains the responses from individual workers. This contains
a subdirectory for each iteration, which in turn contains a subdirectory for each data object in
that iteration. The worker response data for each data object is stored in a timestamped JSON
file that contains the answers submitted by each worker for that data object, and if you use a
private workforce, metadata about those workers. To learn more about this metadata, see Worker
Metadata (p. 779).
• The second, consolidated-annotation, contains information required to consolidate the
annotations in the current batch into labels for your data objects.
• The third, intermediate, contains the output manifest for the current batch with any completed
labels. This file is updated as the label for each data object is completed.
Note
We recommend that you do not use files that are not mentioned in the documentation.
Inference Directory
The inference directory is only present when you are using automated data labeling. This directory
contains the input and output files for the SageMaker batch transform used while labeling data objects.
Manifest Directory
The manifest directory contains the output manifest from your labeling job. There is one subdirectory
in the manifest directory, output. The output directory contains the output manifest file for your
labeling job. The file is named output.manifest.
Training Directory
The training directory is only present when you are using automated data labeling. This directory
contains the input and output files used to train the automated data labeling model.
Confidence Score
When you have more than one worker annotate a single task, your label results from annotation
consolidation. Ground Truth calculates a confidence score for each label. A confidence score is a number
between 0 and 1 that indicates how confident Ground Truth is in the label. You can use the confidence
score to compare labeled data objects to each other, and to identify the least or most confident labels.
You should not interpret the value of a confidence score as an absolute value, or compare confidence
scores across labeling jobs. For example, if all of the confidence scores are between 0.98 and 0.998, you
should only compare the data objects with each other and not rely on the high confidence scores.
You should not compare the confidence scores of human-labeled data objects and auto-labeled data
objects. The confidence scores for humans are calculated using the annotation consolidation function
for the task, while the confidence scores for automated labeling are calculated using a model that
incorporates object features. The two models generally have different scales and average confidence.
For a bounding box labeling job, Ground Truth calculates a confidence score per box. You can compare
confidence scores within one image or across images for the same labeling type (human or auto). You
can't compare confidence scores across labeling jobs.
Worker Metadata
Ground Truth provides information that you can use to track individual workers in task output data. The
following data is located in the directories under the worker-response located in the Annotations
Directory (p. 778):
• The acceptanceTime is the time that the worker accepted the task. The format of this date and time
stamp is YYYY-MM-DDTHH:MM:SS.mmmZ for the year (YYYY), month (MM), day (DD), hour (HH), minute
(MM), second (SS) and millisecond (mmm). The date and time are separated by a T.
• The submissionTime is the time that the worker submitted their annotations using the Submit
button. The format of this date and time stamp is YYYY-MM-DDTHH:MM:SS.mmmZ for the year (YYYY),
month (MM), day (DD), hour (HH), minute (MM), second (SS) and millisecond (mmm). The date and time are
separated by a T.
• timeSpentInSeconds reports the total time, in seconds, that a worker actively worked on that task.
This metric does not include time when a worker paused or took a break.
• The workerId is unique to each worker.
• If you use a private workforce, in workerMetadata, you see the following.
• The identityProviderType is the service used to manage the private workforce.
• The issuer is the Cognito user pool or OIDC Identity Provider (IdP) issuer associated with the work
team assigned to this human review task.
• A unique sub identifier refers to the worker. If you create a workforce using Amazon Cognito, you can retrieve details about this worker (such as the name or user name) with this ID using Amazon Cognito. To learn how, see Managing and Searching for User Accounts in the Amazon Cognito Developer Guide.
The following is an example of the output you may see if you use Amazon Cognito to create a private
workforce. This is identified in the identityProviderType.
"submissionTime": "2020-12-28T18:59:58.321Z",
"acceptanceTime": "2020-12-28T18:59:15.191Z",
"timeSpentInSeconds": 40.543,
"workerId": "a12b3cdefg4h5i67",
"workerMetadata": {
"identityData": {
"identityProviderType": "Cognito",
"issuer": "https://cognito-idp.aws-region.amazonaws.com/aws-region_123456789",
"sub": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
}
}
The following is an example of the workerMetadata you may see if you use your own OIDC IdP to
create a private workforce:
"workerMetadata": {
"identityData": {
"identityProviderType": "Oidc",
"issuer": "https://example-oidc-ipd.com/adfs",
"sub": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
}
}
To learn more about using private workforces, see Use a Private Workforce (p. 868).
Output Metadata
The output from each job contains metadata about the label assigned to data objects. These elements
are the same for all jobs with minor variations. The following example shows the metadata elements:
"confidence": 0.00,
"type": "groundtruth/image-classification",
"job-name": "identify-animal-species",
"human-annotated": "yes",
"creation-date": "2020-10-18T22:18:13.527256"
• confidence – The confidence that Ground Truth has that the label is correct. For more information,
see Confidence Score (p. 778).
• type – The type of classification job. For job types, see Built-in Task Types (p. 704).
• job-name – The name assigned to the job when it was created.
• human-annotated – Whether the data object was labeled by a human or by automated data labeling.
For more information, see Automate Data Labeling (p. 807).
• creation-date – The date and time that the label was created.
In addition to the standard metadata elements, the metadata for a classification job includes the text
value of the label's class. For more information, see Image Classification - MXNet (p. 1506).
The red, italicized text in the examples below depends on labeling job specifications and output data.
{
"source-ref":"s3://AWSDOC-EXAMPLE-BUCKET/example_image.jpg",
"species":"0",
"species-metadata":
{
"class-name": "dog",
"confidence": 0.00,
"type": "groundtruth/image-classification",
"job-name": "identify-animal-species",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256"
}
}
{
"source":"The food was delicious",
"mood":"1",
"mood-metadata":
{
"class-name": "positive",
"confidence": 0.8,
"type": "groundtruth/text-classification",
"job-name": "label-sentiment",
"human-annotated": "yes",
"creation-date": "2020-10-18T22:18:13.527256"
    }
}
The label attribute name parameter (for example, image-label-attribute-name) contains an array
of all of the labels selected by at least one of the workers who completed this task. This array contains
integer keys (for example, [1,0,8]) that correspond to the labels found in class-map. In the multi-label image classification example, bicycle, person, and clothing were selected by at least one of the workers who completed the labeling task for the image, example_image.jpg.
The confidence-map shows the confidence score that Ground Truth assigned to each label selected by
a worker. To learn more about Ground Truth confidence scores, see Confidence Score (p. 778).
The red, italicized text in the examples below depends on labeling job specifications and output data.
{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.jpg",
"image-label-attribute-name":[1,0,8],
"image-label-attribute-name-metadata":
{
"job-name":"labeling-job/image-label-attribute-name",
"class-map":
{
"1":"bicycle","0":"person","8":"clothing"
},
"human-annotated":"yes",
"creation-date":"2020-02-27T21:36:25.000201",
"confidence-map":
{
"1":0.95,"0":0.77,"8":0.2
},
"type":"groundtruth/image-classification-multilabel"
}
}
The following is an example of a multi-label text classification output manifest file. In this example, approving, sad, and critical were selected by at least one of the workers who completed the labeling task for the object text_file.txt found in AWSDOC-EXAMPLE-BUCKET.
{
"source-ref": "AWSDOC-EXAMPLE-BUCKET/text_file.txt",
"text-label-attribute-name":[1,0,4],
"text-label-attribute-name-metadata":
{
"job-name":"labeling-job/text-label-attribute-name",
"class-map":
{
"1":"approving","0":"sad","4":"critical"
},
"human-annotated":"yes",
"creation-date":"2020-02-20T21:36:25.000201",
"confidence-map":
{
"1":0.95,"0":0.77,"4":0.2
},
"type":"groundtruth/text-classification-multilabel"
}
}
The class_id element is the index of the box's class in the list of available classes for the task. The
class-map metadata element contains the text of the class.
The metadata has a separate confidence score for each bounding box. The metadata also includes the
class-map element that maps the class_id to the text value of the class. For more information, see
Object Detection - MXNet (p. 1530).
The red, italicized text in the examples below depends on labeling job specifications and output data.
{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.png",
"bounding-box-attribute-name":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],
"annotations":
[
{"class_id": 0, "left": 111, "top": 134,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 20, "top": 20,
"width": 30, "height": 30}
]
},
"bounding-box-attribute-name-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-dogs-and-toys"
}
}
The output of a bounding box adjustment job looks like the following JSON. Note that the original JSON is kept intact and two new attributes are added, each with "adjusted-" prepended to the original attribute's name.
{
"source-ref": "S3 bucket location",
"bounding-box-attribute-name":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],
"annotations":
[
{"class_id": 0, "left": 111, "top": 134,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 20, "top": 20,
"width": 30, "height": 30}
]
},
"bounding-box-attribute-name-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-dogs-and-toys"
},
"adjusted-bounding-box":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],
"annotations":
[
{"class_id": 0, "left": 110, "top": 135,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 10, "top": 10,
"width": 30, "height": 30}
]
},
"adjusted-bounding-box-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"job-name": "adjust-bounding-boxes-on-dogs-and-toys",
"adjustment-status": "adjusted"
}
}
In this output, the job's type doesn't change, but an adjustment-status field is added. This field has
the value of adjusted or unadjusted. If multiple workers have reviewed the object and at least one
adjusted the label, the status is adjusted.
In the output manifest, the JSON object, annotations, includes a list of the labels (label categories)
that you provided.
Worker responses are in a list named entities. Each entity in this list is a JSON object that contains
a label value that matches one in the labels list, an integer startOffset value for the labeled span's
starting Unicode offset, and an integer endOffset value for the ending Unicode offset.
The metadata has a separate confidence score for each entity. If a single worker labeled each data object,
the confidence value for each entity will be zero.
The red, italicized text in the examples below depends on labeling job inputs and worker responses.
{
"source": "Amazon SageMaker is a cloud machine-learning platform that was launched
in November 2017. SageMaker enables developers to create, train, and deploy machine-
learning (ML) models in the cloud. SageMaker also enables developers to deploy ML models on
embedded systems and edge-devices",
"ner-labeling-job-attribute-name": {
"annotations": {
"labels": [
{
"label": "Date",
"shortDisplayName": "dt"
},
{
"label": "Verb",
"shortDisplayName": "vb"
},
{
"label": "Thing",
"shortDisplayName": "tng"
},
{
"label": "People",
"shortDisplayName": "ppl"
}
],
"entities": [
{
"label": "Thing",
"startOffset": 22,
"endOffset": 53
},
{
"label": "Thing",
"startOffset": 269,
"endOffset": 281
},
{
"label": "Verb",
"startOffset": 63,
"endOffset": 71
},
{
"label": "Verb",
"startOffset": 228,
"endOffset": 234
},
{
"label": "Date",
"startOffset": 75,
"endOffset": 88
},
{
"label": "People",
"startOffset": 108,
"endOffset": 118
},
{
"label": "People",
"startOffset": 214,
"endOffset": 224
}
]
}
},
"ner-labeling-job-attribute-name-metadata": {
"job-name": "labeling-job/example-ner-labeling-job",
"type": "groundtruth/text-span",
"creation-date": "2020-10-29T00:40:39.398470",
"human-annotated": "yes",
"entities": [
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
}
]
}
}
If human workers are verifying or adjusting prior bounding box labels, the output of a verification job
would look like the following JSON. The red, italicized text in the examples below depends on labeling
job specifications and output data.
{
"source-ref":"s3://AWSDOC-EXAMPLE-BUCKET/image_example.png",
"bounding-box-attribute-name":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],
"annotations":
[
{"class_id": 0, "left": 111, "top": 134,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 20, "top": 20,
"width": 30, "height": 30}
]
},
"bounding-box-attribute-name-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-dogs-and-toys"
},
"verify-bounding-box-attribute-name":"1",
"verify-bounding-box-attribute-name-metadata":
{
"class-name": "bad",
"confidence": 0.93,
"type": "groundtruth/label-verification",
"job-name": "verify-bounding-boxes",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"worker-feedback": [
{"comment": "The bounding box on the bird is too wide on the right side."},
{"comment": "The bird on the upper right is not labeled."}
]
}
}
Although the type on the original bounding box output was groundtruth/object-detection,
the new type is groundtruth/label-verification. Also note that the worker-feedback array
provides worker comments. If the worker doesn't provide comments, the empty fields are excluded
during consolidation.
In addition to the standard elements, the metadata for the label includes a color map that defines which
color is used to label the image, the class name associated with the color, and the confidence score for
each color. For more information, see Semantic Segmentation Algorithm (p. 1549).
The red, italicized text in the examples below depends on labeling job specifications and output data.
{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_city_image.png",
"city-streets-ref": "S3 bucket location",
"city-streets-ref-metadata": {
"internal-color-map": {
"0": {
"class-name": "BACKGROUND",
"confidence": 0.9,
"hex-color": "#ffffff"
},
"1": {
"class-name": "buildings",
"confidence": 0.9,
"hex-color": "#2acf59"
},
"2": {
"class-name": "road",
"confidence": 0.9,
"hex-color": "#f28333"
}
},
"type": "groundtruth/semantic-segmentation",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "label-city-streets",
},
"verify-city-streets-ref":"1",
"verify-city-streets-ref-metadata":
{
"class-name": "bad",
"confidence": 0.93,
"type": "groundtruth/label-verification",
"job-name": "verify-city-streets",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"worker-feedback": [
{"comment": "The mask on the leftmost building is assigned the wrong side of
the road."},
{"comment": "The curb of the road is not labeled but the instructions say
otherwise."}
]
}
}
Confidence is scored on a per-image basis. Confidence scores are the same across all classes within an
image.
The output of a semantic segmentation adjustment job looks similar to the following JSON.
{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_city_image.png",
"city-streets-ref": "S3 bucket location",
"city-streets-ref-metadata": {
"internal-color-map": {
"0": {
"class-name": "BACKGROUND",
"confidence": 0.9,
"hex-color": "#ffffff"
},
"1": {
"class-name": "buildings",
"confidence": 0.9,
"hex-color": "#2acf59"
},
"2": {
"class-name": "road",
"confidence": 0.9,
"hex-color": "#f28333"
}
},
"type": "groundtruth/semantic-segmentation",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "label-city-streets",
},
"adjusted-city-streets-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_city_image.png",
"adjusted-city-streets-ref-metadata": {
"internal-color-map": {
"0": {
"class-name": "BACKGROUND",
"confidence": 0.9,
"hex-color": "#ffffff"
},
"1": {
"class-name": "buildings",
"confidence": 0.9,
"hex-color": "#2acf59"
},
"2": {
"class-name": "road",
"confidence": 0.9,
"hex-color": "#f28333"
}
},
"type": "groundtruth/semantic-segmentation",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"job-name": "adjust-label-city-streets",
}
}
In addition to the standard elements, the metadata includes a class map that lists each class that has at least one label in the sequence. The metadata also includes job-name, which is the name you assigned to the labeling job. For adjustment tasks, if one or more bounding boxes were modified, there is an adjustment-status parameter in the metadata for audit workflows that is set to adjusted.
{
"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-path/input-manifest.json",
"CarObjectDetection-ref": "s3://AWSDOC-EXAMPLE-BUCKET/output/labeling-job-name/
annotations/consolidated-annotation/output/0/SeqLabel.json",
"CarObjectDetection-ref-metadata": {
"class-map": {
"0": "car",
"1": "bus"
},
"job-name": "labeling-job/labeling-job-name",
"human-annotated": "yes",
"creation-date": "2021-09-29T05:50:35.566000",
"type": "groundtruth/video-object-detection"
}
}
Ground Truth creates one output sequence file for each sequence of video frames that was labeled. Each
output sequence file contains the following:
• All annotations for all frames in a sequence in the detection-annotations list of JSON objects.
• For each frame that was annotated by a worker, the frame file name (frame), number (frame-no), a
list of JSON objects containing annotations (annotations), and if applicable, frame-attributes.
The name of this list is defined by the task type you use: polylines, polygons, keypoints, and for
bounding boxes, annotations.
Each JSON object contains information about a single annotation and associated label. The following
table outlines the parameters you'll see for each video frame task type.
In addition to task type specific values, you will see the following in each JSON object:
• Values of any label-category-attributes that were specified for that label.
• The class-id of the box. Use the class-map in the output manifest file to see which label
category this ID maps to.
The following is an example of a SeqLabel.json file from a bounding box video frame object
detection labeling job. This file will be located under s3://your-output-bucket/output-prefix/
annotations/consolidated-annotation/output/annotation-number/
{
"detection-annotations": [
{
"annotations": [
{
"height": 41,
"width": 53,
"top": 152,
"left": 339,
"class-id": "1",
"label-category-attributes": {
"occluded": "no",
"size": "medium"
}
},
{
"height": 24,
"width": 37,
"top": 148,
"left": 183,
"class-id": "0",
"label-category-attributes": {
"occluded": "no",
}
}
],
"frame-no": 0,
"frame": "frame_0000.jpeg",
"frame-attributes": {name: value, name: value}
},
{
"annotations": [
{
"height": 41,
"width": 53,
"top": 152,
"left": 341,
"class-id": "0",
"label-category-attributes": {}
},
{
"height": 24,
"width": 37,
"top": 141,
"left": 177,
"class-id": "0",
"label-category-attributes": {
"occluded": "no",
}
}
],
"frame-no": 1,
"frame": "frame_0001.jpeg",
"frame-attributes": {name: value, name: value}
}
]
}
In addition to the standard elements, the metadata includes a class map that lists each class that has at least one label in the sequence of frames. The metadata also includes job-name, which is the name you assigned to the labeling job. For adjustment tasks, if one or more bounding boxes were modified, there is an adjustment-status parameter in the metadata for audit workflows that is set to adjusted.
{
"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-path/input-manifest.json",
"CarObjectTracking-ref": "s3://AWSDOC-EXAMPLE-BUCKET/output/labeling-job-name/
annotations/consolidated-annotation/output/0/SeqLabel.json",
"CarObjectTracking-ref-metadata": {
"class-map": {
"0": "car",
"1": "bus"
},
"job-name": "labeling-job/labeling-job-name",
"human-annotated": "yes",
"creation-date": "2021-09-29T05:50:35.566000",
"type": "groundtruth/video-object-tracking"
}
}
Ground Truth creates one output sequence file for each sequence of video frames that was labeled. Each
output sequence file contains the following:
• All annotations for all frames in a sequence in the tracking-annotations list of JSON objects.
• For each frame that was annotated by a worker, the frame file name (frame), number (frame-no), a list of JSON objects containing annotations (annotations), and, if applicable, frame attributes (frame-attributes). The name of this list is determined by the task type you use: polylines, polygons, keypoints, or, for bounding boxes, annotations.
Each JSON object contains information about a single annotation and its associated label. The parameters that appear in each JSON object depend on the video frame task type you use.
In addition to task type specific values, you will see the following in each JSON object:
• Values of any label-category-attributes that were specified for that label.
• The class-id of the box. Use the class-map in the output manifest file to see which label
category this ID maps to.
• An object-id which identifies an instance of a label. This ID will be the same across frames if a
worker identified the same instance of an object in multiple frames. For example, if a car appeared in
multiple frames, all bounding boxes used to identify that car would have the same object-id.
• An object-name, which is the instance ID of that annotation.
The following is an example of a SeqLabel.json file from a bounding box video frame object
tracking labeling job. This file will be located under s3://your-output-bucket/output-prefix/
annotations/consolidated-annotation/output/annotation-number/
{
"tracking-annotations": [
{
"annotations": [
{
"height": 36,
"width": 46,
"top": 178,
"left": 315,
"class-id": "0",
"label-category-attributes": {
"occluded": "no"
},
"object-id": "480dc450-c0ca-11ea-961f-a9b1c5c97972",
"object-name": "car:1"
}
],
"frame-no": 0,
"frame": "frame_0001.jpeg",
"frame-attributes": {}
},
{
"annotations": [
{
"height": 30,
"width": 47,
"top": 163,
"left": 344,
"class-id": "1",
"label-category-attributes": {
"occluded": "no",
"size": "medium"
},
"object-id": "98f2b0b0-c0ca-11ea-961f-a9b1c5c97972",
"object-name": "bus:1"
},
{
"height": 28,
"width": 33,
"top": 150,
"left": 192,
"class-id": "0",
"label-category-attributes": {
"occluded": "partially"
},
"object-id": "480dc450-c0ca-11ea-961f-a9b1c5c97972",
"object-name": "car:1"
}
],
"frame-no": 1,
"frame": "frame_0002.jpeg",
"frame-attributes": {name: value, name: value}
}
]
}
In addition to the standard elements, the metadata for the label includes a color map that defines
which color is used to label the image, the class name associated with the color, and the confidence
score for each color. Additionally, there is an adjustment-status parameter in the metadata for
audit workflows that is set to adjusted if the color mask is modified. If you added one or more
frameAttributes to your label category configuration file, worker responses for frame attributes are
in the JSON object, dataset-object-attributes.
The your-label-attribute-ref parameter contains the location of a compressed file with a .zlib
extension. When you uncompress this file, it contains an array. Each index in the array corresponds to the
index of an annotated point in the input point cloud. The value of the array at a given index gives the
class of the point at the same index in the point cloud, based on the semantic color map found in the
color-map parameter of the metadata.
You can use Python code similar to the following to decompress a .zlib file:
import zlib
from array import array

# Read the compressed annotation file downloaded from Amazon S3
# (the file name is a placeholder).
compressed_binary_file = open("filename.zlib", "rb").read()

# Decompress the zlib-encoded content.
binary_content = zlib.decompress(compressed_binary_file)

# Load the decompressed bytes into an array of unsigned integers,
# one class ID per point in the input point cloud.
my_int_array_data = array("B", binary_content)

print(my_int_array_data)
The code block above produces output similar to the following. Each element of the printed array contains the class of the point at that index in the point cloud. For example, my_int_array_data[0] = 1 means point[0] in the input point cloud has class 1. In the following output manifest file example, class 0 corresponds with "Background", 1 with "Car", and 2 with "Pedestrian".
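For example, a sketch like the following counts how many points were assigned to each class, assuming my_int_array_data was produced by the code above and using class names taken from the color-map in the output manifest metadata.

from collections import Counter

# Class names taken from the color-map in the output manifest metadata.
class_names = {0: "Background", 1: "Car", 2: "Pedestrian", 3: "Tree"}

# Count the number of points carrying each class ID.
counts = Counter(my_int_array_data)
for class_id in sorted(counts):
    print(class_names.get(class_id, "unknown"), counts[class_id])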
The following is an example of a semantic segmentation 3D point cloud labeling job output manifest file.
The red, italicized text in the examples below depends on labeling job specifications and output data.
{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/examplefolder/frame1.bin",
"source-ref-metadata":{
"format": "binary/xyzi",
"unix-timestamp": 1566861644.759115,
"ego-vehicle-pose":{...},
"prefix": "s3://AWSDOC-EXAMPLE-BUCKET/lidar_singleframe_dataset/prefix",
"images": [{...}]
},
"lidar-ss-label-attribute-ref": "s3://your-output-bucket/labeling-job-name/annotations/
consolidated-annotation/output/dataset-object-id/filename.zlib",
"lidar-ss-label-attribute-ref-metadata": {
"color-map": {
"0": {
"class-name": "Background",
"hex-color": "#ffffff",
"confidence": 0.00
},
"1": {
"class-name": "Car",
"hex-color": "#2ca02c",
"confidence": 0.00
},
"2": {
"class-name": "Pedestrian",
"hex-color": "#1f77b4",
"confidence": 0.00
},
"3": {
"class-name": "Tree",
"hex-color": "#ff7f0e",
"confidence": 0.00
}
},
    "type": "groundtruth/point_cloud_single_frame_semantic_segmentation",
    "human-annotated": "yes",
    "creation-date": "2019-11-12T01:18:14.271944",
    "job-name": "labeling-job-name",
    //only present for adjustment audit workflow
    "adjustment-status": "adjusted" // "adjusted" means the label was adjusted
  }
}
• Each class, or label category, that you specify in your input manifest is associated with a class-id.
Use the class-map to identify the class associated with each class ID.
• These classes are used to give each 3D cuboid an object-name in the format <class>:<integer>
where integer is a unique number to identify that cuboid in the frame.
• center-x, center-y, and center-z are the coordinates of the center of the cuboid, in the same
coordinate system as the 3D point cloud input data used in your labeling job.
• length, width, and height describe the dimensions of the cuboid.
• yaw is used to describe the orientation (heading) of the cuboid in radians.
Note
yaw is now reported in the right-handed Cartesian system. This change was made on September 02, 2022 19:02:17 UTC. For output data produced before that date, you can convert the yaw measurement using the following relationship (all units are in radians):
old_yaw_in_output = pi - yaw
A conversion sketch in Python follows this list.
• In our definition, +x is to the right, +y is forward, and +z is up from the ground plane. The rotation order is x - y - z. Roll, pitch, and yaw are represented in the right-handed Cartesian system. In 3D space, roll is along the x-axis, pitch is along the y-axis, and yaw is along the z-axis. All three are counterclockwise.
• If you included label attributes in your input manifest file for a given class, a label-category-
attributes parameter is included for all cuboids for which workers selected label attributes.
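If you need to convert yaw values in output data produced before the September 2, 2022 change described in the note above, a helper like the following applies the stated relationship. This is a minimal sketch; the wrapping of the result into (-pi, pi] is an added normalization convenience, not part of the Ground Truth output format.

import math

def convert_old_yaw(old_yaw_in_output):
    # The note above states old_yaw_in_output = pi - yaw, so invert it.
    yaw = math.pi - old_yaw_in_output
    # Wrap the result into (-pi, pi] for convenience (an added step).
    return math.atan2(math.sin(yaw), math.cos(yaw))

print(convert_old_yaw(1.5))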
If one or more cuboids were modified, there is an adjustment-status parameter in the metadata for
audit workflows that is set to adjusted. If you added one or more frameAttributes to your label
category configuration file, worker responses for frame attributes are in the JSON object, dataset-
object-attributes.
The red, italicized text in the examples below depends on labeling job specifications and output data. The ellipses (...) denote a continuation of that list, where additional objects with the same format as the preceding object can appear.
{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/examplefolder/frame1.txt",
"source-ref-metadata":{
"format": "text/xyzi",
"unix-timestamp": 1566861644.759115,
"prefix": "s3://AWSDOC-EXAMPLE-BUCKET/lidar_singleframe_dataset/prefix",
"ego-vehicle-pose": {
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
},
"position": {
"x": -2.7161461413869947,
"y": 116.25822288149078,
"z": 1.8348751887989483
}
},
"images": [
{
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,
"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"unix-timestamp": 1566861644.759115,
"image-path": "images/frame_0_camera_0.jpg",
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera_model": "pinhole"
}
]
},
"3d-bounding-box":
{
"annotations": [
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -2.616382013657516,
"center-y": 125.04149850484193,
"center-z": 0.311272296465834,
"length": 2.993000265181146,
"width": 1.8355260519692056,
"height": 1.3233490884304047,
"roll": 0,
"pitch": 0,
"yaw": 1.6479308313703527
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:2",
"class-id": 0,
"center-x": -5.188984560617168,
"center-y": 99.7954483288783,
"center-z": 0.2226435567445657,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.6243170732068055
}
]
},
"3d-bounding-box-metadata":
{
"objects": [],
"class_map":
{
"0": "Car",
},
"type": "groundtruth/point_cloud_object_detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-3d-objects",
"adjustment-status": "adjusted",
"dataset-object-attributes": {name: value, name: value}
}
}
In addition to the standard elements, the metadata includes a class map that lists each class that has at
least one label in the sequence. If one or more cuboids were modified, there is an adjustment-status
parameter in the metadata for audit workflows that is set to adjusted.
{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/myfolder/seq1.json",
"lidar-label-attribute-ref": "s3://<CustomerOutputLocation>/<labelingJobName>/
annotations/consolidated-annotation/output/<datasetObjectId>/SeqLabel.json",
"lidar-label-attribute-ref-metadata": {
"objects":
[
{
"frame-no": 300,
"confidence": []
},
{
"frame-no": 301,
"confidence": []
},
...
],
    "class-map": {"0": "Car", "1": "Person"},
    "type": "groundtruth/point_cloud_object_tracking",
    "human-annotated": "yes",
    "creation-date": "2019-11-12T01:18:14.271944",
    "job-name": "identify-3d-objects",
"adjustment-status": "adjusted"
}
}
In the above example, the cuboid data for each frame in seq1.json is in SeqLabel.json in the Amazon S3 location, s3://<customerOutputLocation>/<labelingJobName>/annotations/consolidated-annotation/output/<datasetObjectId>/SeqLabel.json. The following is an example of this label sequence file.
For each frame in the sequence, you see the frame-number, frame-name, frame-attributes (if applicable), and a list of annotations. This list contains 3D cuboids that were drawn for that frame.
Each annotation includes the following information:
• An object-name in the format <class>:<integer> where class identifies the label category and
integer is a unique ID across the dataset.
• When workers draw a cuboid, it is associated with a unique object-id which is associated with all
cuboids that identify the same object across multiple frames.
• Each class, or label category, that you specified in your input manifest is associated with a class-id.
Use the class-map to identify the class associated with each class ID.
• center-x, center-y, and center-z are the coordinates of the center of the cuboid, in the same
coordinate system as the 3D point cloud input data used in your labeling job.
• length, width, and height describe the dimensions of the cuboid.
• yaw is used to describe the orientation (heading) of the cuboid in radians.
Note
yaw is now reported in the right-handed Cartesian system. This change was made on September 02, 2022 19:02:17 UTC. For output data produced before that date, you can convert the yaw measurement using the following relationship (all units are in radians):
old_yaw_in_output = pi - yaw
• In our definition, +x is to the right, +y is forward, and +z is up from the ground plane. The rotation order is x - y - z. Roll, pitch, and yaw are represented in the right-handed Cartesian system. In 3D space, roll is along the x-axis, pitch is along the y-axis, and yaw is along the z-axis. All three are counterclockwise.
• If you included label attributes in your input manifest file for a given class, a label-category-
attributes parameter is included for all cuboids for which workers selected label attributes.
{
"tracking-annotations": [
{
"frame-number": 0,
"frame-name": "0.txt.pcd",
"frame-attributes": {name: value, name: value},
"annotations": [
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": -2.2906369208300674,
"center-y": 103.73924823843463,
"center-z": 0.37634114027023313,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5827222214406014,
"object-id": "ae5dc770-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -2.6451293634707413,
"center-y": 124.9534455706848,
"center-z": 0.5020834081743839,
"length": 4,
"width": 2,
"height": 2.080488827301309,
"roll": 0,
"pitch": 0,
"yaw": -1.5963335581398077,
"object-id": "06efb020-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:2",
"class-id": 0,
"center-x": -5.205611313118477,
"center-y": 99.91731932137061,
"center-z": 0.22917217081212138,
"length": 3.8747142207671956,
"width": 1.9999999999999918,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5672228760316775,
"object-id": "26fad020-a782-11ea-b57d-67c51a0561a1"
}
]
},
{
"frame-number": 1,
"frame-name": "1.txt.pcd",
"frame-attributes": {},
"annotations": [
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": -2.2906369208300674,
"center-y": 103.73924823843463,
"center-z": 0.37634114027023313,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5827222214406014,
"object-id": "ae5dc770-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -2.6451293634707413,
"center-y": 124.9534455706848,
"center-z": 0.5020834081743839,
"length": 4,
"width": 2,
"height": 2.080488827301309,
"roll": 0,
"pitch": 0,
"yaw": -1.5963335581398077,
"object-id": "06efb020-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:2",
"class-id": 0,
"center-x": -5.221311072916759,
"center-y": 100.4639841045424,
"center-z": 0.22917217081212138,
"length": 3.8747142207671956,
"width": 1.9999999999999918,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5672228760316775,
"object-id": "26fad020-a782-11ea-b57d-67c51a0561a1"
}
]
}
]
}
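Because object-id is stable across frames, you can reconstruct the trajectory of each labeled object from a sequence label file like the one above. The following is a minimal sketch; the file path is a placeholder.

import json
from collections import defaultdict

# Placeholder path to a downloaded sequence label file.
with open("SeqLabel.json") as f:
    seq_label = json.load(f)

# Group cuboid centers by object-id, keyed to the frame they appear in.
trajectories = defaultdict(list)
for frame in seq_label["tracking-annotations"]:
    for annotation in frame["annotations"]:
        trajectories[annotation["object-id"]].append(
            (frame["frame-number"],
             annotation["center-x"],
             annotation["center-y"],
             annotation["center-z"])
        )

# Print each object's path, ordered by frame number.
for object_id, centers in trajectories.items():
    print(object_id, sorted(centers))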
In addition to the standard elements, the metadata includes a class map that lists each class that has at
least one label in the sequence. If one or more cuboids were modified, there is an adjustment-status
parameter in the metadata for audit workflows that is set to adjusted.
{
"source-ref": "s3://iad-groundtruth-lidar-test-bucket/artifacts/gt-point-cloud-demos/
sequences/seq2.json",
"source-ref-metadata": {
"json-paths": [
"number-of-frames",
"prefix",
"frames{frame-no, frame}"
]
},
"3D2D-linking-ref": "s3://iad-groundtruth-lidar-test-bucket/xyz/3D2D-linking/annotations/
consolidated-annotation/output/0/SeqLabel.json",
"3D2D-linking-ref-metadata": {
"objects": [
{
"frame-no": 0,
"confidence": []
},
{
"frame-no": 1,
"confidence": []
},
{
"frame-no": 2,
"confidence": []
},
{
"frame-no": 3,
"confidence": []
},
{
"frame-no": 4,
"confidence": []
},
{
"frame-no": 5,
"confidence": []
},
{
"frame-no": 6,
"confidence": []
},
{
"frame-no": 7,
"confidence": []
},
{
"frame-no": 8,
"confidence": []
},
{
"frame-no": 9,
"confidence": []
}
],
"class-map": {
"0": "Car"
},
"type": "groundtruth/point_cloud_object_tracking",
"human-annotated": "yes",
"creation-date": "2023-01-19T02:55:10.206508",
"job-name": "mcm-linking"
},
"3D2D-linking-chain-ref": "s3://iad-groundtruth-lidar-test-bucket/xyz/3D2D-linking-chain/
annotations/consolidated-annotation/output/0/SeqLabel.json",
"3D2D-linking-chain-ref-metadata": {
"objects": [
{
"frame-no": 0,
"confidence": []
},
{
"frame-no": 1,
"confidence": []
},
{
"frame-no": 2,
"confidence": []
},
{
"frame-no": 3,
"confidence": []
},
{
"frame-no": 4,
"confidence": []
},
{
"frame-no": 5,
"confidence": []
},
{
"frame-no": 6,
"confidence": []
},
{
"frame-no": 7,
"confidence": []
},
{
"frame-no": 8,
"confidence": []
},
{
"frame-no": 9,
"confidence": []
}
],
"class-map": {
"0": "Car"
},
"type": "groundtruth/point_cloud_object_tracking",
"human-annotated": "yes",
"creation-date": "2023-01-19T03:29:49.149935",
"job-name": "3d2d-linking-chain"
}
}
In the above example, the cuboid data for each frame in seq2.json is in SeqLabel.json in the
Amazon S3 location, s3://<customerOutputLocation>/<labelingJobName>/annotations/
consolidated-annotation/output/<datasetObjectId>/SeqLabel.json. The following is an
example of this label sequence file.
For each frame in the sequence, you see the frame-number, frame-name, frame-attributes (if applicable), and a list of annotations. This list contains 3D cuboids that were drawn for that frame.
Each annotation includes the following information:
• An object-name in the format <class>:<integer> where class identifies the label category and
integer is a unique ID across the dataset.
• When workers draw a cuboid, it is associated with a unique object-id which is associated with all
cuboids that identify the same object across multiple frames.
• Each class, or label category, that you specified in your input manifest is associated with a class-id.
Use the class-map to identify the class associated with each class ID.
• center-x, center-y, and center-z are the coordinates of the center of the cuboid, in the same
coordinate system as the 3D point cloud input data used in your labeling job.
• length, width, and height describe the dimensions of the cuboid.
• yaw is used to describe the orientation (heading) of the cuboid in radians.
Note
yaw is now reported in the right-handed Cartesian system. This change was made on September 02, 2022 19:02:17 UTC. For output data produced before that date, you can convert the yaw measurement using the following relationship (all units are in radians):
old_yaw_in_output = pi - yaw
• In our definition, +x is to the right, +y is forward, and +z is up from the ground plane. The rotation order is x - y - z. Roll, pitch, and yaw are represented in the right-handed Cartesian system. In 3D space, roll is along the x-axis, pitch is along the y-axis, and yaw is along the z-axis. All three are counterclockwise.
• If you included label attributes in your input manifest file for a given class, a label-category-
attributes parameter is included for all cuboids for which workers selected label attributes.
{
"lidar": {
"tracking-annotations": [
{
"frame-number": 0,
"frame-name": "0.txt.pcd",
"annotations": [
{
"label-category-attributes": {
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": 12.172361721602815,
"center-y": 120.23067521992364,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "505b39e0-97a4-11ed-8903-dd5b8b903715"
},
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": 17.192725195301094,
"center-y": 114.55705365827872,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
},
{
"frame-number": 1,
"frame-name": "1.txt.pcd",
"annotations": [
{
"label-category-attributes": {
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -1.6841480600695489,
"center-y": 126.20198882749516,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "505b39e0-97a4-11ed-8903-dd5b8b903715"
},
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": 17.192725195301094,
"center-y": 114.55705365827872,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
},
{
"frame-number": 2,
"frame-name": "2.txt.pcd",
"annotations": [
{
"label-category-attributes": {
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -1.6841480600695489,
"center-y": 126.20198882749516,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "505b39e0-97a4-11ed-8903-dd5b8b903715"
},
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": 17.192725195301094,
"center-y": 114.55705365827872,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
}
]
},
"camera-0": {
"tracking-annotations": [
{
"frame-no": 0,
"frame": "0.txt.pcd",
"annotations": [
{
"label-category-attributes": {
"Occlusion": "Partial"
},
"object-name": "Car:2",
"class-id": 0,
"width": 223,
"height": 164,
"top": 225,
"left": 486,
"object-id": "5229df60-97a4-11ed-8903-dd5b8b903715"
}
],
"frame-attributes": {}
},
{
"frame-no": 1,
"frame": "1.txt.pcd",
"annotations": [
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"width": 252,
"height": 246,
"top": 237,
"left": 473,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
}
]
}
}
The cuboid and bounding box for an object are linked through a common object-id.
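The following is a minimal sketch of such a join, assuming the SeqLabel.json shown above has been downloaded locally. Note that the lidar frames in this example use frame-number while the camera frames use frame-no.

import json

with open("SeqLabel.json") as f:  # placeholder path
    seq_label = json.load(f)

# Index the camera bounding boxes by (frame index, object-id).
camera_boxes = {}
for frame in seq_label["camera-0"]["tracking-annotations"]:
    for annotation in frame["annotations"]:
        camera_boxes[(frame["frame-no"], annotation["object-id"])] = annotation

# Pair each lidar cuboid with its camera box, when one exists in that frame.
for frame in seq_label["lidar"]["tracking-annotations"]:
    for annotation in frame["annotations"]:
        box = camera_boxes.get((frame["frame-number"], annotation["object-id"]))
        if box is not None:
            print(annotation["object-name"],
                  "cuboid center-x:", annotation["center-x"],
                  "box left:", box["left"])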
Ground Truth provides two features that help improve the accuracy of your data labels and reduce the
total cost of labeling your data:
• Annotation consolidation helps to improve the accuracy of your data object labels. It combines the
results of multiple workers' annotation tasks into one high-fidelity label.
• Automated data labeling uses machine learning to label portions of your data automatically without
having to send them to human workers.
Topics
• Control the Flow of Data Objects Sent to Workers (p. 805)
• Consolidate Annotations (p. 806)
• Automate Data Labeling (p. 807)
• Chaining Labeling Jobs (p. 813)
Control the Flow of Data Objects Sent to Workers
• For both types of labeling jobs, you can use MaxConcurrentTaskCount to control the total number
of data objects available to all workers at a given point in time when the labeling job is running.
• For streaming labeling jobs, you can control the flow of data objects to workers by monitoring and controlling the number of data objects sent to the Amazon SQS queue associated with your labeling job.
Use the following sections to learn more about these options. To learn more about streaming labeling
jobs, see Ground Truth Streaming Labeling Jobs (p. 738).
Topics
• Use MaxConcurrentTaskCount to Control the Flow of Data Objects (p. 805)
• Use Amazon SQS to Control the Flow of Data Objects to Streaming Labeling Jobs (p. 806)
Use MaxConcurrentTaskCount to Control the Flow of Data Objects
When you start a labeling job using an input manifest file, Ground Truth does the following:
1. For each data object listed in your input manifest file, one or more tasks are created, depending on
the value you specify for NumberOfHumanWorkersPerDataObject. For example, if you set the
number of workers per data object to 3, 3 tasks will be created for each dataset object. To be marked
as successfully labeled, at least one worker must label the object. Alternatively, the tasks can expire or
be declined.
2. If you are using the Mechanical Turk workforce, Ground Truth first sends a batch of 10 dataset objects
to your workers. It uses this small batch to set up the labeling job and to make sure that the job is
correctly configured.
3. Next, Ground Truth sends MaxConcurrentTaskCount number of dataset objects to workers. For
example, if you have 2,000 input data objects in your input manifest file and have set the number of
workers per data object to 3 and set MaxConcurrentTaskCount to 900, the first 900 data objects in
your input manifest are sent to workers, corresponding to 2,700 tasks (900 x 3). This is the first full-sized set of objects sent to workers (a sketch of this arithmetic follows this list).
4. What happens next depends on the type of labeling job you create. This step assumes one or more dataset objects in your input manifest file, or sent using an Amazon SNS input data source (in a streaming labeling job), were not included in the set sent to workers in step 3.
• Streaming labeling job: As long as the total number of objects available to workers is equal
to MaxConcurrentTaskCount, all remaining dataset objects on your input manifest file and
that you send in real time using Amazon SNS are placed on an Amazon SQS queue. When the
total number of objects available to workers falls below MaxConcurrentTaskCount minus
NumberOfHumanWorkersPerDataObject, a new data object from the queue is used to create NumberOfHumanWorkersPerDataObject tasks, which are sent to workers in real time.
• Non-streaming labeling job: As workers finish labeling one set of objects, up to
MaxConcurrentTaskCount times NumberOfHumanWorkersPerDataObject number of new
tasks will be sent to workers. This process is repeated until all data objects in the input manifest file
are labeled.
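The batch-size arithmetic in step 3 is straightforward to verify. The following is a minimal sketch using the example values from that step; the variable names mirror the CreateLabelingJob request fields.

number_of_data_objects = 2000
number_of_human_workers_per_data_object = 3
max_concurrent_task_count = 900

# The first full-sized set of tasks: 900 data objects x 3 workers = 2,700 tasks.
first_set_of_tasks = (
    min(number_of_data_objects, max_concurrent_task_count)
    * number_of_human_workers_per_data_object
)
print(first_set_of_tasks)  # prints 2700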
Use Amazon SQS to Control the Flow of Data Objects to Streaming Labeling
Jobs
When you create a streaming labeling job, an Amazon SQS queue is automatically created in your
account. Data objects are only added to the Amazon SQS queue when the total number of objects sent
to workers is above MaxConcurrentTaskCount. Otherwise, objects are sent directly to workers.
You can use this queue to manage the flow of data objects to your labeling job. To learn more, see
Manage Labeling Requests with an Amazon SQS Queue (p. 740).
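As a sketch of how you might monitor that queue, the following reads the ApproximateNumberOfMessages attribute with boto3. The queue name shown is a placeholder; look up the actual name of the queue that Ground Truth created in your account.

import boto3

sqs = boto3.client("sqs")

# Placeholder queue name; use the queue Ground Truth created for your job.
queue_url = sqs.get_queue_url(QueueName="GroundTruth-your-labeling-job-name")["QueueUrl"]

response = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages"],
)
print(response["Attributes"]["ApproximateNumberOfMessages"])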
Consolidate Annotations
An annotation is the result of a single worker's labeling task. Annotation consolidation combines the
annotations of two or more workers into a single label for your data objects. A label, which is assigned to
each object in the dataset, is a probabilistic estimate of what the true label should be. Each object in the
dataset typically has multiple annotations, but only one label or set of labels.
You decide how many workers annotate each object in your dataset. Using more workers can increase the
accuracy of your labels, but also increases the cost of labeling. To learn more about Ground Truth pricing,
see Amazon SageMaker Ground Truth pricing .
If you use the Amazon SageMaker console to create a labeling job, a default number of workers to annotate each object is set based on the task type.
When you use the CreateLabelingJob operation, you set the number of workers to annotate each
data object with the NumberOfHumanWorkersPerDataObject parameter. You can override the
default number of workers that annotate a data object using the console or the CreateLabelingJob
operation.
Ground Truth provides an annotation consolidation function for each of its predefined labeling tasks: bounding box, image classification, named entity recognition, semantic segmentation, and text classification. These are the functions:
• Multi-class annotation consolidation for image and text classification uses a variant of the Expectation
Maximization approach to annotations. It estimates parameters for each worker and uses Bayesian
inference to estimate the true class based on the class annotations from individual workers.
• Bounding box annotation consolidates bounding boxes from multiple workers. This function finds the
most similar boxes from different workers based on the Jaccard index, or intersection over union, of
the boxes and averages them.
• Semantic segmentation annotation consolidation treats each pixel in a single image as a multi-
class classification. This function treats the pixel annotations from workers as "votes," with more
information from surrounding pixels incorporated by applying a smoothing function to the image.
• Named entity recognition clusters text selections by Jaccard similarity and calculates selection
boundaries based on the mode, or the median if the mode isn't clear. The label resolves to the most
assigned entity label in the cluster, breaking ties by random selection.
You can use other algorithms to consolidate annotations. For information, see Create Your Own
Annotation Consolidation Function (p. 807).
If you want to use other algorithms to create annotation consolidation functions, you can find the
worker responses in the [project-name]/annotations/worker-response folder of the Amazon S3
bucket where you direct the job output.
Assess Similarity
To assess the similarity between labels, you can use one of the following strategies, or you can create your own that meets your data labeling needs:
• For label spaces that consist of discrete, mutually exclusive categories, such as multi-class
classification, assessing similarity can be straightforward. Discrete labels either match or do not match.
• For label spaces that don't have discrete values, such as bounding box annotations, find a broad measure of similarity. For bounding boxes, one such measure is the Jaccard index. This measures the ratio of the intersection of two boxes to the union of the boxes to assess how similar they are (a sketch follows this list). For example, if there are three annotations, then there can be a function that determines which annotations represent the same object and should be consolidated.
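As an illustration of the bounding box case, the following sketch computes the Jaccard index (intersection over union) of two axis-aligned boxes in the left, top, width, height format that appears in Ground Truth output data.

def jaccard_index(box_a, box_b):
    # Boxes are (left, top, width, height) tuples.
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    overlap_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0

# Two nearly identical boxes produce a Jaccard index close to 1.
print(jaccard_index((339, 152, 53, 41), (341, 152, 53, 41)))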
Some approaches attempt to estimate the accuracy of different annotators and weight their annotations
in proportion to the probability of correctness. An example of this is the Expectation Maximization
method, which is used in the default Ground Truth consolidation function for multi-class annotations.
For more information about creating an annotation consolidation function, see Step 3: Processing with
AWS Lambda (p. 678).
Automate Data Labeling
We recommend using automated data labeling on large datasets because the neural networks used with
active learning require a significant amount of data for every new dataset. Typically, as you provide more
data, the potential for high accuracy predictions goes up. Data will only be auto-labeled if the neural
network used in the auto-labeling model can achieve an acceptably high level of accuracy. Therefore,
with larger datasets, there is more potential to automatically label the data because the neural network
can achieve high enough accuracy for auto-labeling. Automated data labeling is most appropriate when
you have thousands of data objects. The minimum number of objects allowed for automated data
labeling is 1,250, but we strongly suggest providing a minimum of 5,000 objects.
Automated data labeling is available only for the following Ground Truth built-in task types: image classification, text classification, bounding box, and semantic segmentation.
To learn how to create a custom active learning workflow using your own model, see Set up an active
learning workflow with your own model (p. 813).
Input data quotas apply for automated data labeling jobs. See Input Data Quotas (p. 742) for information about dataset size, input data size, and resolution limits.
Note
Before you use the automated labeling model in production, you need to fine-tune or test
it, or both. You might fine-tune the model (or create and tune another supervised model of
your choice) on the dataset produced by your labeling job to optimize the model’s architecture
and hyperparameters. If you decide to use the model for inference without fine-tuning it,
we strongly recommend making sure that you evaluate its accuracy on a representative (for
example, randomly selected) subset of the dataset labeled with Ground Truth and that it
matches your expectations.
How it Works
You enable automated data labeling when you create a labeling job. This is how it works:
1. When Ground Truth starts an automated data labeling job, it selects a random sample of input data
objects and sends them to human workers. If more than 10% of these data objects fail, the labeling
job will fail. If the labeling job fails, in addition to reviewing any error message Ground Truth returns,
check that your input data is displaying correctly in the worker UI, instructions are clear, and that you
have given workers enough time to complete tasks.
2. When the labeled data is returned, it is used to create a training set and a validation set. Ground Truth
uses these datasets to train and validate the model used for auto-labeling.
3. Ground Truth runs a batch transform job, using the validated model for inference on the validation
data. Batch inference produces a confidence score and quality metric for each object in the validation
data.
4. The auto labeling component will use these quality metrics and confidence scores to create a
confidence score threshold that ensures quality labels.
5. Ground Truth runs a batch transform job on the unlabeled data in the dataset, using the same
validated model for inference. This produces a confidence score for each object.
6. The Ground Truth auto labeling component determines if the confidence score produced in step 5 for each object meets the required threshold determined in step 4. If the confidence score meets the threshold, the expected quality of automatic labeling exceeds the requested level of accuracy, and that object is considered auto-labeled.
7. Step 6 produces a dataset of unlabeled data with confidence scores. Ground Truth selects data points
with low confidence scores from this dataset and sends them to human workers.
8. Ground Truth uses the existing human-labeled data and this additional labeled data from human
workers to update the model.
9. The process is repeated until the dataset is fully labeled or until another stopping condition is met. For
example, auto-labeling stops if your human annotation budget is reached.
The preceding steps happen in iterations. Select each tab in the following table to see an example of the
processes that happen in each iteration for an object detection automated labeling job. The number of
data objects used in a given step in these images (for example, 200) is specific to this example. If there
are fewer than 5,000 objects to label, the validation set size is 20% of the whole dataset. If there are
more than 5,000 objects in your input dataset, the validation set size is 10% of the whole dataset. You
can control the number of human labels collected per active learning iteration by changing the value for
MaxConcurrentTaskCount when using the API operation CreateLabelingJob. This value is set to
1,000 when you create a labeling job using the console. In the active learning flow illustrated under the
Active Learning tab, this value is set to 200.
The tabs are titled Model Training, Automated Labeling, and Active Learning; each contains a diagram of the corresponding processes in one iteration.
The definition of accuracy depends on the built-in task type that you use with automated labeling. For
all task types, these accuracy requirements are pre-determined by Ground Truth and cannot be manually
configured.
• For image classification and text classification, Ground Truth uses logic to find a label-prediction
confidence level that corresponds to at least 95% label accuracy. This means Ground Truth expects the
accuracy of the automated labels to be at least 95% when compared to the labels that human labelers
would provide for those examples.
• For bounding boxes, the expected mean Intersection Over Union (IoU) of the auto-labeled images is
0.6. To find the mean IoU, Ground Truth calculates the mean IoU of all the predicted and missed boxes
on the image for every class, and then averages these values across classes.
• For semantic segmentation, the expected mean IoU of the auto-labeled images is 0.7. To find the
mean IoU, Ground Truth takes the mean of the IoU values of all the classes in the image (excluding the
background).
At every iteration of Active Learning (steps 3-6 in the list above), the confidence threshold is found using
the human-annotated validation set so that the expected accuracy of the auto-labeled objects satisfies
certain predefined accuracy requirements.
Create an Automated Data Labeling Job (Console)
1. Open the Ground Truth Labeling jobs section of the SageMaker console: https://
console.aws.amazon.com/sagemaker/groundtruth.
2. Using Create a Labeling Job (Console) (p. 706) as a guide, complete the Job overview and Task
type sections. Note that auto labeling is not supported for custom task types.
You can see your labeling job appear in the Labeling jobs section of the SageMaker console. Your
output data appears in the Amazon S3 bucket that you specified when creating the labeling job. For
more information about the format and file structure of your labeling job output data, see Output
Data (p. 776).
Create an Automated Data Labeling Job (API)
Specify the Amazon Resource Name (ARN) of the algorithm that you are using for automated data
labeling in the LabelingJobAlgorithmSpecificationArn parameter. Choose from one of the four Ground Truth built-in algorithms that are supported with automated labeling: image classification, text classification, bounding box, and semantic segmentation.
When an automated data labeling job finishes, Ground Truth returns the ARN of the model it used for
the automated data labeling job. Use this model as the starting model for similar auto-labeling job types
by providing the ARN, in string format, in the InitialActiveLearningModelArn parameter. To retrieve the
model's ARN, use an AWS Command Line Interface (AWS CLI) command similar to the following.
import boto3

sagemaker_client = boto3.client("sagemaker")

# Fetch the ARN of the model trained in the final iteration of the previous
# labeling job.
pretrained_model_arn = sagemaker_client.describe_labeling_job(
    LabelingJobName=job_name
)["LabelingJobOutput"]["FinalActiveLearningModelArn"]
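You can then pass that ARN when you create the new labeling job. The following sketch shows only the LabelingJobAlgorithmsConfig portion of a CreateLabelingJob request, using the pretrained_model_arn retrieved above; the algorithm specification ARN is written with placeholders, so substitute the values for your task type and Region.

labeling_job_algorithms_config = {
    # Built-in algorithm for your task type and Region (placeholder values).
    "LabelingJobAlgorithmSpecificationArn": (
        "arn:aws:sagemaker:<region>:<account>:labeling-job-algorithm-specification/object-detection"
    ),
    # Model trained in the final iteration of the previous labeling job.
    "InitialActiveLearningModelArn": pretrained_model_arn,
}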
To encrypt data on the storage volume attached to the ML compute instance(s) that are used in
automated labeling, include an AWS Key Management Service (AWS KMS) key in the VolumeKmsKeyId
parameter. For information about AWS KMS keys, see What is AWS Key Management Service? in the AWS
Key Management Service Developer Guide.
For an example that uses the CreateLabelingJob operation to create an automated data labeling job,
see the object_detection_tutorial example in the SageMaker Examples, Ground Truth Labeling Jobs
section of a SageMaker notebook instance. To learn how to create and open a notebook instance, see
Create a Notebook Instance (p. 209). To learn how to access SageMaker example notebooks, see Example
Notebooks (p. 220).
Each automated data labeling job type has an associated training instance type and inference instance type.
Ground Truth manages the instances that you use for automated data labeling jobs. It creates,
configures, and terminates the instances as needed to perform your job. These instances don't appear in
your Amazon EC2 instance dashboard.
You can also find this notebook in the SageMaker Examples repository. See Use Example Notebooks to
learn how to find an Amazon SageMaker example notebook.
Chaining Labeling Jobs
Cloning copies the setup of a prior labeling job and allows you to make additional changes before setting
it to run.
Chaining uses not only the setup of the prior job, but also the results. This allows you to continue an
incomplete job and add labels or data objects to a completed job. Chaining is a more complex operation.
• Cloning uses the prior job's input manifest, with optional modifications, as the new job's input
manifest.
• Chaining uses the prior job's output manifest as the new job's input manifest.
In Amazon SageMaker Ground Truth you can configure a chained labeling job with either the console or
the API.
In an output manifest, the label attribute name appears as the key of the key-value pair that contains the labeling data.
If you're creating a job in the console and don't explicitly set the label attribute name value, Ground
Truth uses the job name as the label attribute name for the job.
In the Job overview panel, a new Job name is set based on the title of the job from which you are
chaining this one. You can change it.
You may also specify a label attribute name different from the labeling job name.
If you're chaining from a completed job, the label attribute name uses the name of the new job you're
configuring. To change the name, select the check box.
If you're chaining from a stopped or failed job, the label attribute name uses the name of the job from which you're chaining. It's easy to see and edit the value because the name check box is checked.
Label attribute naming considerations
• The default uses the label attribute name Ground Truth has selected. All data objects without
data connected to that label attribute name are labeled.
• Using a label attribute name not present in the manifest causes the job to process all the
objects in the dataset.
The input dataset location in this case is automatically selected as the output manifest of the chained
job. The input field is not available, so you cannot change it.
Adding data objects to a labeling job
You cannot specify an alternate manifest file. Manually edit the output manifest from the
previous job to add new items before starting a chained job. The Amazon S3 URI helps you
locate where you are storing the manifest in your Amazon S3 bucket. Download the manifest
file from there, edit it locally on your computer, and then upload the new version to replace it.
Make sure you are not introducing errors during editing. We recommend you use a JSON linter to check your JSON. Many popular text editors and IDEs have linter plugins available. A sketch of this edit cycle follows this note.
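The following is a minimal sketch of that download, edit, and upload cycle with boto3. The bucket, keys, and the appended data object are placeholders.

import json
import boto3

s3 = boto3.client("s3")
bucket = "DOC-EXAMPLE-BUCKET"  # placeholder
manifest_key = "output/labeling-job-name/manifests/output/output.manifest"  # placeholder

# Download the prior job's output manifest.
body = s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read().decode("utf-8")
lines = [line for line in body.splitlines() if line]

# Append a new data object as a single JSON line (placeholder object).
lines.append(json.dumps({"source-ref": "s3://DOC-EXAMPLE-BUCKET/images/new-image.jpg"}))

# Validate that every line is well-formed JSON before uploading.
for line in lines:
    json.loads(line)

# Upload the edited manifest to replace the previous version.
s3.put_object(Bucket=bucket, Key=manifest_key, Body="\n".join(lines).encode("utf-8"))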
• Manifest location: Rather than use your original manifest from the prior job, the value for the
ManifestS3Uri in the DataSource should point to the Amazon S3 URI of the output manifest from
the prior labeling job.
• Label attribute name: Setting the correct LabelAttributeName value is important here. This is the
key portion of a key-value pair where labeling data is the value. Sample use cases include:
• Adding new or more specific labels to a completed job — Set a new label attribute name.
• Labeling the unlabeled items from a prior job — Use the label attribute name from the prior job.
If you're using the API, the instructions are the same as those for starting a chained job. However, be sure
to upload your manifest to an Amazon S3 bucket and use it instead of using the output manifest from a
prior job.
The Label attribute name value in the manifest has to conform to the naming considerations discussed
earlier.
Security and Permissions
If you are a new user and want to get started quickly, or if you do not require granular permissions, see
Use IAM Managed Policies with Ground Truth (p. 817).
For more information about IAM users and roles, see Identities (Users, Groups, and Roles) in the IAM User
Guide.
To learn more about using IAM with SageMaker, see Identity and Access Management for Amazon
SageMaker (p. 3048).
Topics
• CORS Permission Requirement (p. 816)
• Assign IAM Permissions to Use Ground Truth (p. 817)
• Using Amazon SageMaker Ground Truth in an Amazon Virtual Private Cloud (p. 828)
• Output Data and Storage Volume Encryption (p. 839)
• Workforce Authentication and Restrictions (p. 840)
CORS Permission Requirement
Starting with Chrome 89, AWS can no longer automatically prevent the rotation of images because the
web standards group W3C has decided that the ability to control rotation of images violates the web’s
Same-origin Policy. Therefore, to ensure human workers annotate your input images in a predictable
orientation when you submit requests to create a labeling job, you must add a CORS header policy to the
Amazon S3 buckets that contain your input images.
Important
If you do not add a CORS configuration to the Amazon S3 buckets that contain your input data,
labeling tasks for those input data objects will fail.
If you create a job through the Ground Truth console, CORS is enabled by default. If all of your input
data is not located in the same Amazon S3 bucket as your input manifest file, you must add a CORS
configuration to all Amazon S3 buckets that contain input data using the following instructions.
If you are using the CreateLabelingJob API to create a Ground Truth labeling job, you can add a CORS
policy to an Amazon S3 bucket that contains input data in the S3 console. To set the required CORS
headers on the Amazon S3 bucket that contain your input images in the Amazon S3 console, follow the
directions detailed in How do I add cross-domain resource sharing with CORS?. Use the following CORS
configuration code for the buckets that host your images. If you use the Amazon S3 console to add the
policy to your bucket, you must use the JSON format.
Important
If you create a 3D point cloud or video frame labeling job, you must add additional rules
to your CORS configuration. To learn more, see 3D Point Cloud Labeling Job Permission
Requirements (p. 633) and Video Frame Job Permission Requirements (p. 579) respectively.
JSON
[{
"AllowedHeaders": [],
"AllowedMethods": ["GET"],
"AllowedOrigins": ["*"],
"ExposeHeaders": ["Access-Control-Allow-Origin"]
}]
XML
<CORSConfiguration>
<CORSRule>
<AllowedOrigin>*</AllowedOrigin>
<AllowedMethod>GET</AllowedMethod>
<ExposeHeader>Access-Control-Allow-Origin</ExposeHeader>
</CORSRule>
</CORSConfiguration>
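If you prefer to apply the configuration programmatically rather than through the Amazon S3 console, a sketch like the following sets the same rule with boto3; the bucket name is a placeholder.

import boto3

s3 = boto3.client("s3")

# Apply the CORS rule shown above to a bucket that hosts your input images.
s3.put_bucket_cors(
    Bucket="DOC-EXAMPLE-BUCKET",  # placeholder
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": [],
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    },
)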
Assign IAM Permissions to Use Ground Truth
You can use the sections on this page to learn the following:
• How to create IAM policies that grant a user or role permission to create a labeling job. Administrators
can use IAM policies to restrict access to Amazon SageMaker and other AWS services that are specific
to Ground Truth.
• How to create a SageMaker execution role. An execution role is the role that you specify when you
create a labeling job. The role is used to start and manage your labeling job.
• If you are getting started using Ground Truth, or you do not require granular permissions for your use
case, it is recommended that you use the IAM managed policies described in Use IAM Managed Policies
with Ground Truth (p. 817).
• Learn about the permissions required to use the Ground Truth console in Grant IAM Permission to Use
the Amazon SageMaker Ground Truth Console (p. 818). This section includes policy examples that
grant an IAM entity permission to create and modify private work teams, subscribe to vendor work
teams, and create custom labeling workflows.
• When you create a labeling job, you must provide an execution role. Use Create a SageMaker Execution
Role for a Ground Truth Labeling Job (p. 822) to learn about the permissions required for this role.
• AmazonSageMakerFullAccess – Use this policy to give a user or role permission to create a labeling
job. This is a broad policy that grants an entity permission to use SageMaker features, as well as features
of necessary AWS services through the console and API. This policy gives the entity permission to
create a labeling job and to create and manage workforces using Amazon Cognito. To learn more, see
AmazonSageMakerFullAccess Policy.
To learn how to attach an AWS managed policy to a user or role, refer to Adding and removing IAM
identity permissions in the IAM User Guide.
Grant IAM Permission to Use the Amazon SageMaker Ground Truth Console
To use the Ground Truth area of the SageMaker console, you need to grant permission to an entity to
access SageMaker and other AWS services that Ground Truth interacts with. Required permissions to access other AWS services depend on your use case:
• Amazon S3 permissions are required for all use cases. These permissions must grant access to the
Amazon S3 buckets that contain input and output data.
• AWS Marketplace permissions are required to use a vendor workforce.
• Amazon Cognito permissions are required for private work team setup.
• AWS KMS permissions are required to view available AWS KMS keys that can be used for output data
encryption.
• IAM permissions are required to either list pre-existing execution roles, or to create a new one. Additionally, you must add a PassRole permission to allow SageMaker to use the execution role chosen to start the labeling job.
The following sections list policies you may want to grant to a role to use one or more functions of
Ground Truth.
Topics
• Ground Truth Console Permissions (p. 818)
• Custom Labeling Workflow Permissions (p. 821)
• Private Workforce Permissions (p. 822)
• Vendor Workforce Permissions (p. 822)
To grant permission to a user or role to use the Ground Truth area of the SageMaker console to create
a labeling job, attach the following policy to the user or role. The following policy gives an IAM role permission to create a labeling job using a built-in task type. If you want to create a custom
labeling workflow, add the policy in Custom Labeling Workflow Permissions (p. 821) to the following
policy. Each Statement included in the following policy is described below this code block.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "SageMakerApis",
"Effect": "Allow",
"Action": [
"sagemaker:*"
],
"Resource": "*"
},
{
"Sid": "KmsKeysForCreateForms",
"Effect": "Allow",
"Action": [
"kms:DescribeKey",
"kms:ListAliases"
],
"Resource": "*"
},
{
"Sid": "AccessAwsMarketplaceSubscriptions",
"Effect": "Allow",
"Action": [
"aws-marketplace:ViewSubscriptions"
],
"Resource": "*"
},
{
"Sid": "SecretsManager",
"Effect": "Allow",
"Action": [
"secretsmanager:CreateSecret",
"secretsmanager:DescribeSecret",
"secretsmanager:ListSecrets"
],
"Resource": "*"
},
{
"Sid": "ListAndCreateExecutionRoles",
"Effect": "Allow",
"Action": [
"iam:ListRoles",
"iam:CreateRole",
"iam:CreatePolicy",
"iam:AttachRolePolicy"
],
"Resource": "*"
},
{
"Sid": "PassRoleForExecutionRoles",
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PassedToService": "sagemaker.amazonaws.com"
}
}
},
{
"Sid": "GroundTruthConsole",
"Effect": "Allow",
"Action": [
"groundtruthlabeling:*",
"lambda:InvokeFunction",
"lambda:ListFunctions",
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketCors",
"s3:PutBucketCors",
"s3:ListAllMyBuckets",
"cognito-idp:AdminAddUserToGroup",
"cognito-idp:AdminCreateUser",
"cognito-idp:AdminDeleteUser",
"cognito-idp:AdminDisableUser",
"cognito-idp:AdminEnableUser",
"cognito-idp:AdminRemoveUserFromGroup",
"cognito-idp:CreateGroup",
"cognito-idp:CreateUserPool",
"cognito-idp:CreateUserPoolClient",
"cognito-idp:CreateUserPoolDomain",
"cognito-idp:DescribeUserPool",
"cognito-idp:DescribeUserPoolClient",
"cognito-idp:ListGroups",
"cognito-idp:ListIdentityProviders",
"cognito-idp:ListUsers",
"cognito-idp:ListUsersInGroup",
"cognito-idp:ListUserPoolClients",
"cognito-idp:ListUserPools",
"cognito-idp:UpdateUserPool",
"cognito-idp:UpdateUserPoolClient"
],
"Resource": "*"
}
]
}
This policy includes the following statements. You can scope down any of these statements by adding specific resources to the Resource list for that statement.
SageMakerApis
This statement includes sagemaker:*, which allows the user to perform all SageMaker API actions. You can reduce the scope of this policy by restricting users from performing actions that are not used to create and monitor a labeling job.
KmsKeysForCreateForms
You only need to include this statement if you want to grant a user permission to list and select AWS
KMS keys in the Ground Truth console to use for output data encryption. The policy above grants a user
permission to list and select any key in the account in AWS KMS. To restrict the keys that a user can list
and select, specify those key ARNs in Resource.
SecretsManager
This statement gives the user permission to describe, list, and create resources in AWS Secrets Manager
required to create the labeling job.
ListAndCreateExecutionRoles
This statement gives a user permission to list (ListRoles) and create (CreateRole) IAM roles
in your account. It also grants the user permission to create (CreatePolicy) policies and attach
(AttachRolePolicy) policies to entities. These are required to list, select, and if required, create an
execution role in the console.
If you have already created an execution role, and want to narrow the scope of this statement so
that users can only select that role in the console, specify the ARNs of the roles you want the user to
have permission to view in Resource and remove the actions CreateRole, CreatePolicy, and
AttachRolePolicy.
AccessAwsMarketplaceSubscriptions
These permissions are required to view and choose vendor work teams that you are already subscribed to when creating a labeling job. To give the user permission to subscribe to vendor work teams, add the statement in Vendor Workforce Permissions (p. 822) to the policy above.
PassRoleForExecutionRoles
This is required to give the labeling job creator permission to preview the worker UI and verify that input
data, labels, and instructions display correctly. This statement gives an entity permissions to pass the
IAM execution role used to create the labeling job to SageMaker to render and preview the worker UI.
To narrow the scope of this policy, add the role ARN of the execution role used to create the labeling job
under Resource.
GroundTruthConsole
• groundtruthlabeling – This allows a user to perform actions required to use certain features
of the Ground Truth console. These include permissions to describe the labeling job status
(DescribeConsoleJob), list all dataset objects in the input manifest file (ListDatasetObjects),
filter the dataset if dataset sampling is selected (RunFilterOrSampleDatasetJob), and to generate
input manifest files if automated data labeling is used (RunGenerateManifestByCrawlingJob).
These actions are only available when using the Ground Truth console and cannot be called directly
using an API.
• lambda:InvokeFunction and lambda:ListFunctions – These actions give users permission to list and invoke Lambda functions that are used to run a custom labeling workflow.
• s3:* – All Amazon S3 permissions included in this statement are used to view Amazon S3 buckets
for automated data setup (ListAllMyBuckets), access input data in Amazon S3 (ListBucket,
GetObject), check for and create a CORS policy in Amazon S3 if needed (GetBucketCors and
PutBucketCors), and write labeling job output files to S3 (PutObject).
• cognito-idp – These permissions are used to create, view, and manage a private workforce using Amazon Cognito. To learn more about these actions, refer to the Amazon Cognito API References.
Custom Labeling Workflow Permissions
{
"Sid": "GroundTruthConsoleCustomWorkflow",
"Effect": "Allow",
"Action": [
"lambda:InvokeFunction",
"lambda:ListFunctions"
],
"Resource": "*"
}
To learn how to give an entity permission to create and test pre-annotation and post-annotation Lambda
functions, see Required Permissions To Use Lambda With Ground Truth.
Private Workforce Permissions
{
"Effect": "Allow",
"Action": [
"cognito-idp:AdminAddUserToGroup",
"cognito-idp:AdminCreateUser",
"cognito-idp:AdminDeleteUser",
"cognito-idp:AdminDisableUser",
"cognito-idp:AdminEnableUser",
"cognito-idp:AdminRemoveUserFromGroup",
"cognito-idp:CreateGroup",
"cognito-idp:CreateUserPool",
"cognito-idp:CreateUserPoolClient",
"cognito-idp:CreateUserPoolDomain",
"cognito-idp:DescribeUserPool",
"cognito-idp:DescribeUserPoolClient",
"cognito-idp:ListGroups",
"cognito-idp:ListIdentityProviders",
"cognito-idp:ListUsers",
"cognito-idp:ListUsersInGroup",
"cognito-idp:ListUserPoolClients",
"cognito-idp:ListUserPools",
"cognito-idp:UpdateUserPool",
"cognito-idp:UpdateUserPoolClient"
],
"Resource": "*"
}
To learn more about creating private workforce using Amazon Cognito, see Create and Manage Amazon
Cognito Workforce (p. 869).
Vendor Workforce Permissions
{
"Sid": "AccessAwsMarketplaceSubscriptions",
"Effect": "Allow",
"Action": [
"aws-marketplace:Subscribe",
"aws-marketplace:Unsubscribe",
"aws-marketplace:ViewSubscriptions"
],
"Resource": "*"
}
Create a SageMaker Execution Role for a Ground Truth Labeling Job
This role must give Ground Truth permission to access the following:
• Amazon S3 to retrieve your input data and write output data to an Amazon S3 bucket. You can either
grant an IAM role permission to access an entire bucket by providing the bucket ARN, or you can
grant the role access to specific resources in a bucket. For example, the ARN for a bucket may
look similar to arn:aws:s3:::awsexamplebucket1 and the ARN of a resource in an Amazon S3
bucket may look similar to arn:aws:s3:::awsexamplebucket1/prefix/file-name.png. To
apply an action to all resources in an Amazon S3 bucket, you can use the wildcard *. For example,
arn:aws:s3:::awsexamplebucket1/prefix/*. For more information, see Amazon S3
Resources in the Amazon Simple Storage Service User Guide.
• CloudWatch to log worker metrics and labeling job statuses.
• AWS KMS for data encryption. (Optional)
• AWS Lambda for processing input and output data when you create a custom workflow.
Additionally, if you create a streaming labeling job, this role must have permission to access:
• Amazon SQS to create and interact with an SQS queue used to manage labeling requests.
• Amazon SNS to subscribe to and retrieve messages from your Amazon SNS input topic and to send
messages to your Amazon SNS output topic.
• Data and storage volume encryption of your Amazon S3 buckets. To learn how to configure these
permissions, see Encrypt Output Data and Storage Volume with AWS KMS (p. 827).
• Permission to select and invoke Lambda functions that do not include GtRecipe, SageMaker,
Sagemaker, sagemaker, or LabelingFunction in the function name.
• Amazon S3 buckets that do not include GroundTruth, Groundtruth, groundtruth,
SageMaker, Sagemaker, or sagemaker in the prefix or bucket name, or an object tag that includes
SageMaker in the name (case insensitive).
Topics
• Built-In Task Types (Non-streaming) Execution Role Requirements (p. 823)
• Built-In Task Types (Streaming) Execution Role Requirements (p. 824)
• Execution Role Requirements for Custom Task Types (p. 826)
• Automated Data Labeling Permission Requirements (p. 826)
The following policy grants permission to create a labeling job for a built-in task type. This execution
policy does not include permissions for AWS KMS data encryption or decryption. Replace each red,
italicized ARN with your own Amazon S3 ARNs.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3ViewBuckets",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::<input-bucket-name>",
"arn:aws:s3:::<output-bucket-name>"
]
},
{
"Sid": "S3GetPutObjects",
"Effect": "Allow",
"Action": [
"s3:AbortMultipartUpload",
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<input-bucket-name>/*",
"arn:aws:s3:::<output-bucket-name>/*"
]
},
{
"Sid": "CloudWatch",
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData",
"logs:CreateLogStream",
"logs:CreateLogGroup",
"logs:DescribeLogStreams",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
If you create a streaming labeling job, you must add a policy similar to the following to the execution
role you use to create the labeling job. To narrow the scope of the policy, replace the * in Resource with
specific AWS resources that you want to grant the IAM role permission to access and use.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:AbortMultipartUpload",
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<input-bucket-name>/*",
"arn:aws:s3:::<output-bucket-name>/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "*",
"Condition": {
"StringEqualsIgnoreCase": {
"s3:ExistingObjectTag/SageMaker": "true"
}
}
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<input-bucket-name>",
"arn:aws:s3:::<output-bucket-name>"
]
},
{
"Sid": "CloudWatch",
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData",
"logs:CreateLogStream",
"logs:CreateLogGroup",
"logs:DescribeLogStreams",
"logs:PutLogEvents"
],
"Resource": "*"
},
{
"Sid": "StreamingQueue",
"Effect": "Allow",
"Action": [
"sqs:CreateQueue",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl",
"sqs:ReceiveMessage",
"sqs:SendMessage",
"sqs:SendMessageBatch",
"sqs:SetQueueAttributes"
],
"Resource": "arn:aws:sqs:*:*:*GroundTruth*"
},
{
"Sid": "StreamingTopicSubscribe",
"Effect": "Allow",
"Action": "sns:Subscribe",
"Resource": [
"arn:aws:sns:<aws-region>:<aws-account-number>:<input-topic-name>",
"arn:aws:sns:<aws-region>:<aws-account-number>:<output-topic-name>"
],
"Condition": {
"StringEquals": {
"sns:Protocol": "sqs"
},
"StringLike": {
"sns:Endpoint": "arn:aws:sns:<aws-region>:<aws-account-
number>:*GroundTruth*"
}
}
},
{
"Sid": "StreamingTopic",
"Effect": "Allow",
"Action": [
"sns:Publish"
],
"Resource": [
"arn:aws:sns:<aws-region>:<aws-account-number>:<input-topic-name>",
"arn:aws:sns:<aws-region>:<aws-account-number>:<output-topic-name>"
]
},
{
"Sid": "StreamingTopicUnsubscribe",
"Effect": "Allow",
"Action": [
"sns:Unsubscribe"
],
"Resource": [
"arn:aws:sns:<aws-region>:<aws-account-number>:<input-topic-name>",
"arn:aws:sns:<aws-region>:<aws-account-number>:<output-topic-name>"
]
}
]
}
If you want to create a custom labeling workflow, add the following statement to an execution role
policy like the ones found in Built-In Task Types (Non-streaming) Execution Role Requirements (p. 823)
or Built-In Task Types (Streaming) Execution Role Requirements (p. 824).
This policy gives the execution role permission to invoke your pre-annotation and post-annotation
Lambda functions.
{
"Sid": "LambdaFunctions",
"Effect": "Allow",
"Action": [
"lambda:InvokeFunction"
],
"Resource": [
"arn:aws:lambda:<region>:<account-id>:function:<pre-annotation-lambda-name>",
"arn:aws:lambda:<region>:<account-id>:function:<post-annotation-lambda-name>"
]
}
If you want to create a labeling job with automated data labeling enabled, you must 1) add one policy to
the IAM policy attached to the execution role and 2) update the trust policy of the execution role.
The following statement allows the IAM execution role to be passed to SageMaker so that it can be
used to run the training and inference jobs used for active learning and automated data labeling
respectively. Add this statement to an execution role policy like the ones found in Built-In Task Types
(Non-streaming) Execution Role Requirements (p. 823) or Built-In Task Types (Streaming) Execution
Role Requirements (p. 824). Replace arn:aws:iam::<account-number>:role/<execution-role-name> with
the execution role ARN. You can find your IAM role ARN in the IAM console under Roles.
{
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": "arn:aws:iam::<account-number>:role/<execution-role-name>",
"Condition": {
"StringEquals": {
"iam:PassedToService": [
"sagemaker.amazonaws.com"
]
}
}
}
The following statement allows SageMaker to assume the execution role to create and manage the
SageMaker training and inference jobs. This policy must be added to the trust relationship of the
execution role. To learn how to add or modify an IAM role trust policy, see Modifying a role in the IAM
User Guide.
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Principal": {"Service": "sagemaker.amazonaws.com" },
"Action": "sts:AssumeRole"
}
}
This section describes the IAM policies you must attach to your customer managed key to enable output
data encryption and the policies you must attach to your customer managed key and execution role to
use storage volume encryption. To learn more about these options, see Output Data and Storage Volume
Encryption (p. 839).
If you specify an AWS KMS customer managed key to encrypt output data, you must add an IAM policy
similar to the following to that key. This policy gives the IAM execution role that you use to create your
labeling job permission to use this key to perform all of the actions listed in "Action". To learn more
about these actions, see AWS KMS permissions in the AWS Key Management Service Developer Guide.
To use this policy, replace the IAM service-role ARN in "Principal" with the ARN of the execution
role you use to create the labeling job. When you create a labeling job in the console, this is the
role you specify for IAM Role under the Job overview section. When you create a labeling job using
CreateLabelingJob, this is the ARN you specify for RoleArn.
{
"Sid": "AllowUseOfKmsKey",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::111122223333:role/service-role/example-role"
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:DescribeKey"
],
"Resource": "*"
}
If you specify a VolumeKmsKeyId to encrypt the storage volume attached to the ML compute instance
used for automated data labeling training and inference, you must do the following:
• Attach permissions described in Encrypt Output Data using KMS (p. 827) to the customer managed
key.
• Attach a policy similar to the following to the IAM execution role you use to create your labeling
job. This is the IAM role you specify for RoleArn in CreateLabelingJob. To learn more about the
"kms:CreateGrant" action that this policy permits, see CreateGrant in the AWS Key Management
Service API Reference.
{
"Version": "2012-10-17",
"Statement":
[
{
"Effect": "Allow",
"Action": [
"kms:CreateGrant"
],
"Resource": "*"
}
]
}
To learn more about Ground Truth storage volume encryption, see Use Your KMS Key to Encrypt
Automated Data Labeling Storage Volume (API Only) (p. 840).
This guide shows how you can use Ground Truth in an Amazon VPC in the following ways:
1. Run an Amazon SageMaker Ground Truth Labeling Job in an Amazon Virtual Private Cloud (p. 828)
2. Use Amazon VPC Mode from a Private Worker Portal (p. 835)
• You can use Amazon S3 bucket policies to control access to buckets from specific Amazon VPC
endpoints, or specific VPCs. If you launch a labeling job and your input data is located in an Amazon S3
bucket with access restricted to users in your VPC, you can add a bucket policy to also grant a Ground
Truth endpoint permission to access the bucket. To learn more, see Allow Ground Truth to Access VPC
Restricted Amazon S3 Buckets (p. 829).
• You can launch an automated data labeling job in your VPC. You use a VPC configuration to specify
VPC subnets and security groups. SageMaker uses this configuration to launch the training and
inference jobs used for automated data labeling in your VPC. To learn more, see Create an Automated
Data Labeling Job in a VPC (p. 833).
You may want to use these options in any of the following ways.
• You can use both of these methods to launch a labeling job using a VPC-protected Amazon S3 bucket
with automated data labeling enabled.
• You can launch a labeling job using any built-in task type using a VPC-protected bucket.
• You can launch a custom labeling workflow using a VPC-protected bucket. Ground Truth interacts with
your pre-annotation and post-annotation Lambda functions using an AWS PrivateLink endpoint.
We recommend that you review Prerequisites to Run a Ground Truth Labeling Job in a VPC (p. 829)
before you create a labeling job in an Amazon VPC.
Review the following prerequisites before you create a Ground Truth labeling job in an Amazon VPC.
• If you are a new user of Ground Truth, review Getting started to learn how to create a labeling job.
• If your input data is located in a VPC-protected Amazon S3 bucket, your workers must access the
worker portal from your VPC.
Note
When you launch a labeling job in your VPC, you must use a private work team. To learn more
about creating a private work team, see Use a Private Workforce.
• If you want to launch an automated data labeling job in your VPC, review the following prerequisites.
• Use the instructions in Create an Amazon S3 VPC Endpoint. Training and inference containers used
in the automated data labeling workflow use this endpoint to communicate with your buckets in
Amazon S3.
• Review Automate Data Labeling to learn more about this feature. Note that automated data
labeling is supported for the following built-in task types: Image Classification (Single Label), Image
Semantic Segmentation, Bounding Box, and Text Classification (Single Label). Streaming labeling
jobs do not support automated data labeling.
• Review the Ground Truth Security and Permissions section and ensure that you have met the following
conditions.
• The user creating the labeling job has all necessary permissions
• You have created an IAM execution role with required permissions. If you do not require fine-tuned
permissions for your use case, we recommend you use the IAM managed policies described in Grant
General Permissions To Get Started Using Ground Truth.
• Allow your VPC to have access to the sagemaker-labeling-data-region and
sm-bxcb-region-saved-task-states S3 buckets. These are system-owned, regionalized S3 buckets
that are accessed from the worker portal while a worker is working on a task. These buckets are
used to interact with system-managed data.
The following sections provide details about the permissions Ground Truth requires to launch labeling
jobs using Amazon S3 buckets that have access restricted to your VPC and VPC endpoints. To learn how
to restrict access to an Amazon S3 bucket to a VPC, see Controlling access from VPC endpoints with
bucket policies in the Amazon Simple Storage Service User Guide. To learn how to add a policy to
an S3 bucket, see Adding a bucket policy using the Amazon S3 console.
Note
Modifying policies on existing buckets can cause IN_PROGRESS Ground Truth jobs to fail. We
recommend you start new jobs using a new bucket. If you want to continue using the same
bucket, you can do one of the following.
You can restrict Amazon S3 bucket access to users in your VPC using an AWS PrivateLink endpoint.
For example, the following S3 bucket policy allows access to a specific bucket, <bucket-name>, from
<vpc> and the endpoint <vpc-endpoint> only. When you modify this policy, you must replace all red,
italicized text with your resources and specifications.
Note
The following policy denies all entities other than users within a VPC to perform the actions
listed in Action. If you do not include actions in this list, they are still accessible to any entity
that has access to this bucket and permission to perform those actions. For example, if a user
has permission to perform GetBucketLocation on your Amazon S3 bucket, the policy below
does not restrict the user from performing this action outside of your VPC.
{
"Version": "2012-10-17",
"Id": "Policy1415115909152",
"Statement": [
{
"Sid": "Access-to-specific-VPCE-only",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Effect": "Deny",
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/*"
],
"Condition": {
"StringNotEquals": {
"aws:sourceVpce": [
"<vpc-endpoint>",
"<vpc>"
]
}
}
}
]
}
Ground Truth must be able to perform the following Amazon S3 actions on the S3 buckets you use to
configure the labeling job.
"s3:AbortMultipartUpload",
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketLocation"
You can do this by adding a Ground Truth endpoint to the bucket policy like the one previously
mentioned. The following table includes Ground Truth service endpoints for each AWS Region. Add an
endpoint in the same AWS Region you use to run your labeling job to your bucket policy.
us-east-2 vpce-02569ba1c40aad0bc
us-east-1 vpce-08408e335ebf95b40
us-west-2 vpce-0ea07aa498eb78469
ca-central-1 vpce-0d46ea4c9ff55e1b7
eu-central-1 vpce-0865e7194a099183d
eu-west-2 vpce-0bccd56798f4c5df0
eu-west-1 vpce-0788e7ed8628e595d
ap-south-1 vpce-0d7fcda14e1783f11
ap-southeast-2 vpce-0b7609e6f305a77d4
ap-southeast-1 vpce-0e7e67b32e9efed27
ap-northeast-2 vpce-007893f89e05f2bbf
ap-northeast-1 vpce-0247996a1a1807dbd
For example, the following policy denies GetObject and PutObject actions on <bucket-name> unless
the request comes from your VPC (<vpc>), your VPC endpoint (<vpc-endpoint>), or the Ground Truth
endpoint (<ground-truth-endpoint>).
{
"Version": "2012-10-17",
"Id": "1",
"Statement": [
{
"Sid": "DenyAccessFromNonGTandCustomerVPC",
"Effect": "Deny",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/*"
],
"Condition": {
"ForAllValues:StringNotEquals": {
"aws:sourceVpce": [
"<vpc-endpoint>",
"<ground-truth-endpoint>"
],
"aws:SourceVpc": "<vpc>"
}
}
}
]
}
If you want a user to have permission to launch a labeling job using the Ground Truth console, you must
also add the user's ARN to the bucket policy using the aws:PrincipalArn condition. This user must
also have permission to perform the following Amazon S3 actions on the bucket you use to launch the
labeling job.
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketCors",
"s3:PutBucketCors",
"s3:ListAllMyBuckets",
The following code is an example of a bucket policy that restricts permission to perform the actions
listed in Action on the S3 bucket <bucket-name> to the following.
• <role-name>
• The VPC endpoints listed in aws:sourceVpce
• Users within the VPC named <vpc>
{
"Version": "2012-10-17",
"Id": "1",
"Statement": [
{
"Sid": "DenyAccessFromNonGTandCustomerVPC",
"Effect": "Deny",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<bucket-name>/*",
"arn:aws:s3:::<bucket-name>"
],
"Condition": {
"ForAllValues:StringNotEquals": {
"aws:sourceVpce": [
"<vpc-endpoint>",
"<ground-truth-endpoint>"
],
"aws:PrincipalArn": "arn:aws:iam::<aws-account-id>:role/<role-name>",
"aws:SourceVpc": "<vpc>"
}
}
}
]
}
Note
The Amazon VPC interface endpoints and the protected Amazon S3 buckets you use for input
and output data must be located in the same AWS Region that you use to create the labeling
job.
After you have granted Ground Truth permission to access your Amazon S3 buckets, you can use one
of the topics in Create a Labeling Job to launch a labeling job. Specify the VPC-restricted Amazon S3
buckets for your input and output data buckets.
Use the following procedures to learn how to add a VPC configuration to your labeling job request.
1. Follow the instructions in Create a Labeling Job (Console) and complete each step in the procedure,
up to step 15.
2. In the Workers section, select the checkbox next to Enable automated data labeling.
3. Expand the VPC configuration section of the console by selecting the arrow.
4. Specify the Virtual private cloud (VPC) that you want to use for your automated data labeling job.
5. Choose the dropdown list under Subnets and select one or more subnets.
6. Choose the dropdown list under Security groups and select one or more groups.
7. Complete all remaining steps of the procedure in Create a Labeling Job (Console).
To configure a labeling job using the Ground Truth API operation, CreateLabelingJob, follow the
instructions in Create an Automated Data Labeling Job (API) to configure your request. In addition
to the parameters described in this documentation, you must include a VpcConfig parameter in
LabelingJobResourceConfig to specify one or more subnets and security groups using the following
schema.
"LabelingJobAlgorithmsConfig": {
"InitialActiveLearningModelArn": "string",
"LabelingJobAlgorithmSpecificationArn": "string",
"LabelingJobResourceConfig": {
"VolumeKmsKeyId": "string",
"VpcConfig": {
"SecurityGroupIds": [ "string" ],
"Subnets": [ "string" ]
}
}
}
The following is an example of an AWS Python SDK (Boto3) request to create an automated data
labeling job in the US East (N. Virginia) Region using a private workforce. Replace all red-italicized
text with your labeling job resources and specifications. To learn more about the CreateLabelingJob
operation, see the Create a Labeling Job (API) tutorial and CreateLabelingJob API documentation.
import boto3

client = boto3.client(service_name='sagemaker')

response = client.create_labeling_job(
    LabelingJobName="example-labeling-job",
    LabelAttributeName="label",
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': "s3://bucket/path/manifest-with-input-data.json"
            }
        }
    },
    LabelingJobAlgorithmsConfig={
        'LabelingJobAlgorithmSpecificationArn': "arn:aws:sagemaker:us-east-1:027400017018:labeling-job-algorithm-specification/tasktype",
        'LabelingJobResourceConfig': {
            'VpcConfig': {
                'SecurityGroupIds': [ "sg-01233456789", "sg-987654321" ],
                'Subnets': [ "subnet-e0123456", "subnet-e7891011" ]
            }
        }
    },
    OutputConfig={
        'S3OutputPath': "s3://bucket/path/file-to-store-output-data",
        'KmsKeyId': "string"
    },
    RoleArn="arn:aws:iam::*:role/*",
    LabelCategoryConfigS3Uri="s3://bucket/path/label-categories.json",
    StoppingConditions={
        'MaxHumanLabeledObjectCount': 123,
        'MaxPercentageOfInputDatasetLabeled': 123
    },
    HumanTaskConfig={
        'WorkteamArn': "arn:aws:sagemaker:region:*:workteam/private-crowd/*",
        'UiConfig': {
            'UiTemplateS3Uri': "s3://bucket/path/custom-worker-task-template.html"
        },
        'PreHumanTaskLambdaArn': "arn:aws:lambda:us-east-1:432418664414:function:PRE-tasktype",
        'TaskKeywords': [
            "Images",
            "Classification",
            "Multi-label"
        ],
        'TaskTitle': "Add task title here",
        'TaskDescription': "Add description of task here for workers",
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 3600,
        'TaskAvailabilityLifetimeInSeconds': 21600,
        'MaxConcurrentTaskCount': 1000,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': "arn:aws:lambda:us-east-1:432418664414:function:ACS-tasktype"
        }
    },
    Tags=[
        {
            'Key': "string",
            'Value': "string"
        },
    ]
)
Point Cloud and video tasks do not support loading through a VPC.
This guide demonstrates how to satisfy the prerequisites and complete the steps necessary to add and
delete an Amazon VPC configuration for your workforce.
Prerequisites
To run a Ground Truth labeling job in Amazon VPC, review the following prerequisites.
• You have an Amazon VPC configured that you can use. If you have not configured a VPC, follow these
instructions for creating a VPC.
• Depending on how a Worker Task Template is written, labeling data stored in an Amazon S3 bucket
may be accessed directly from Amazon S3 during labeling tasks. In these cases, the VPC network must
be configured to allow traffic from the device used by the human labeler to the S3 bucket containing
labeling data.
• Follow View and update DNS attributes for your VPC to enable DNS hostnames and DNS resolution for
your VPC.
Note
There are two ways to configure your VPC for your workforce. You can do this through the
console or the AWS SageMaker CLI.
You can use the SageMaker console to add or remove a VPC configuration. You can also delete an
existing workforce.
After you have created your private workforce, add a VPC configuration to it.
b. Subnets
i. Ensure that your VPC has an existing subnet
c. Security groups
i. Note
You cannot select more than 5 security groups.
d. After filling in this information, choose Confirm.
5. After you choose Confirm, you are redirected back to the Private page under Labeling workforces.
You should see a green banner at the top that reads Your private workforce update with VPC
configuration was successfully initialized. The workforce status is Updating. Next to the Delete
workforce button is the Refresh button, which can be used to retrieve the latest Workforce status.
After the workforce status has changed to Active, the VPC endpoint ID is updated as well.
Use the following information to remove a VPC configuration from your workforce using the console.
Before you delete a workforce, make sure it has no work teams associated with it. You can delete a
workforce only if its status is Active or Failed.
Download the following files to use the new VpcConfig parameter with the SageMaker workforce CLI:
sagemaker-2017-07-24.normal.json
sagemaker-2017-07-24.paginators.json
sagemaker-2017-07-24.waiters-2.json
After downloading the files, run the following commands in your CLI:
cp ./sagemaker-2017-07-24.paginators.json ~/.aws/models/sagemaker/2017-07-24/paginators.json
cp ./sagemaker-2017-07-24.waiters-2.json ~/.aws/models/sagemaker/2017-07-24/waiters-2.json
You can now test your API changes using the AWS CLI. You can either create a new workforce with a VPC
configuration or update an existing workforce to add a VPC configuration. You can also remove a VPC
configuration from an existing workforce.
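For example, the following is a minimal sketch of adding a VPC configuration to an existing private workforce with the AWS SDK for Python (Boto3), assuming a recent SDK version that includes the WorkforceVpcConfig parameter. The workforce name, VPC ID, security group, and subnet IDs are placeholders.

import boto3

sagemaker = boto3.client("sagemaker")

# Add a VPC configuration to an existing private workforce.
# All resource IDs below are placeholders; replace them with your own.
response = sagemaker.update_workforce(
    WorkforceName="default",
    WorkforceVpcConfig={
        "VpcId": "vpc-0123456789abcdef0",
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
)

# The workforce status is "Updating" while the VPC endpoints are created.
print(response["Workforce"]["Status"])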
Navigate to the Amazon VPC console. Select Endpoints from the left panel. There should be two VPC
endpoints created in your account.
Update a VPC private workforce with an empty VPC configuration to remove VPC resources.
"CognitoConfig": {
"UserPool": "Pool_ID",
"ClientId": "app-client-id"
},
"CreateDate": 1622151252.451,
"Status": "Updating"
}
}
Navigate to your Amazon VPC console. Select Endpoints from the left panel. The two VPC endpoints
should be deleted.
Restrict public access to the worker portal while maintaining access through a VPC
Workers in a VPC or non-VPC worker portal are able to see the labeling job tasks assigned
to them. Tasks are assigned by adding workers to a work team through OIDC groups. It
is the customer’s responsibility to restrict access to their public worker portal by setting the
sourceIpConfig in their workforce.
Note
You can restrict access to the worker portal only through the SageMaker API. This cannot be
done through the console.
Use the following command to restrict public access to the worker portal.
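The original command is not reproduced here. As a minimal sketch, the equivalent call with the AWS SDK for Python (Boto3) looks like the following; the workforce name and CIDR range are placeholders.

import boto3

sagemaker = boto3.client("sagemaker")

# Allow worker portal sign-in only from the listed CIDR blocks.
# The workforce name and CIDR range are placeholders.
sagemaker.update_workforce(
    WorkforceName="default",
    SourceIpConfig={"Cidrs": ["203.0.113.0/24"]},
)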
After the sourceIpConfig is set on the workforce, workers can access the worker portal through your
VPC but not through the public internet.
Note
You cannot set the sourceIp restriction for a worker portal in a VPC.
Use the topics on this page to learn more about these Ground Truth security features.
If you don't provide a customer managed key, Amazon SageMaker uses the default AWS managed key for
Amazon S3 for your role's account to encrypt your output data.
If you provide a customer managed key, you must add the required permissions to the key described in
Encrypt Output Data and Storage Volume with AWS KMS (p. 827). When you use the API operation
CreateLabelingJob, you can specify your customer managed key ID using the parameter KmsKeyId.
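For example, in a Boto3 create_labeling_job request, the key is passed inside OutputConfig. The following fragment is a sketch; the bucket path and key ID are placeholders.

# Fragment of a create_labeling_job request. The customer managed key
# specified in KmsKeyId is used to encrypt the labeling job output data.
output_config = {
    'S3OutputPath': "s3://output-bucket-name/path/",
    'KmsKeyId': "1234abcd-12ab-34cd-56ef-1234567890ab"  # placeholder key ID
}
# Pass output_config as the OutputConfig parameter of create_labeling_job.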
See the following procedure to learn how to add a customer managed key when you create a labeling job
using the console.
1. Complete the first 7 steps in Create a Labeling Job (Console) (p. 706).
2. In step 8, select the arrow next to Additional configuration to expand this section.
3. For Encryption key, select the AWS KMS key that you want to use to encrypt output data.
4. Complete the rest of the steps in Create a Labeling Job (Console) (p. 706) to create a labeling job.
Use Your KMS Key to Encrypt Automated Data Labeling Storage Volume (API
Only)
When you create a labeling job with automated data labeling using the CreateLabelingJob API
operation, you have the option to encrypt the storage volume attached to the ML compute instances
that run the training and inference jobs. To add encryption to your storage volume, use the parameter
VolumeKmsKeyId to input an AWS KMS customer managed key. For more information about this
parameter, see LabelingJobResourceConfig.
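As a sketch, the following fragment shows where VolumeKmsKeyId sits in a Boto3 create_labeling_job request; the algorithm specification ARN and key ID are placeholders.

# Fragment of a create_labeling_job request with automated data labeling.
# VolumeKmsKeyId encrypts the storage volume attached to the ML compute
# instances that run the training and inference jobs; the key ID is a
# placeholder.
labeling_job_algorithms_config = {
    'LabelingJobAlgorithmSpecificationArn':
        "arn:aws:sagemaker:us-east-1:027400017018:"
        "labeling-job-algorithm-specification/image-classification",
    'LabelingJobResourceConfig': {
        'VolumeKmsKeyId': "1234abcd-12ab-34cd-56ef-1234567890ab"
    }
}
# Pass this dictionary as the LabelingJobAlgorithmsConfig parameter
# of create_labeling_job.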
If you specify a key ID or ARN for VolumeKmsKeyId, your SageMaker execution role must include
permissions to call kms:CreateGrant. To learn how to add this permission to an execution role, see
Create a SageMaker Execution Role for a Ground Truth Labeling Job (p. 822).
Note
If you specify an AWS KMS customer managed key when you create a labeling job in the
console, that key is only used to encrypt your output data. It is not used to encrypt the storage
volume attached to the ML compute instances used for automated data labeling.
A Ground Truth workforce maps to an Amazon Cognito user pool. A Ground Truth work team maps to
an Amazon Cognito user group. Amazon Cognito manages worker authentication. Amazon Cognito
supports OpenID Connect (OIDC), and customers can set up Amazon Cognito federation with their
own identity provider (IdP).
Ground Truth only allows one workforce per account per AWS Region. Each workforce has a dedicated
Ground Truth work portal login URL.
You can also restrict workers to a Classless Inter-Domain Routing (CIDR) block/IP address range. This
means annotators must be on a specific network to access the annotation site. You can add up to
ten CIDR blocks for one workforce. To learn more, see Manage Private Workforce Using the Amazon
SageMaker API (p. 885).
To learn how you can create a private workforce, see Create a Private Workforce (Amazon
Cognito) (p. 869).
To restrict an entity's permission to create a labeling job using one of these types or the work team
ARN, use the sagemaker:WorkteamType and/or the
sagemaker:WorkteamArn condition keys. For the sagemaker:WorkteamType condition key, use
string condition operators. For the sagemaker:WorkteamArn condition key, use Amazon Resource
Name (ARN) condition operators. If the user attempts to create a labeling job with a restricted work
team, SageMaker returns an access denied error.
The policies below demonstrate different ways to use the sagemaker:WorkteamType and
sagemaker:WorkteamArn condition keys with appropriate condition operators and valid condition
values.
The following example uses the sagemaker:WorkteamType condition key with the StringEquals
condition operator to restrict access to a public work team. It accepts condition values in the following
format: workforcetype-crowd, where workforcetype can equal public, private, or vendor.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RestrictWorkteamType",
"Effect": "Deny",
"Action": "sagemaker:CreateLabelingJob",
"Resource": "*",
"Condition": {
"StringEquals": {
"sagemaker:WorkteamType": "public-crowd"
}
}
}
]
}
The following policies show how to restrict access to a public work team using the
sagemaker:WorkteamArn condition key. The first shows how to use it with a valid IAM regex-variant
of the work team ARN and the ArnLike condition operator. The second shows how to use it with the
ArnEquals condition operator and the work team ARN.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RestrictWorkteamType",
"Effect": "Deny",
"Action": "sagemaker:CreateLabelingJob",
"Resource": "*",
"Condition": {
"ArnLike": {
"sagemaker:WorkteamArn": "arn:aws:sagemaker:*:*:workteam/public-crowd/
*"
}
}
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RestrictWorkteamType",
"Effect": "Deny",
"Action": "sagemaker:CreateLabelingJob",
"Resource": "*",
"Condition": {
"ArnEquals": {
"sagemaker:WorkteamArn": "arn:aws:sagemaker:us-
west-2:394669845002:workteam/public-crowd/default"
}
}
}
]
}
Once you create a rule, you can add a target to it. CloudWatch Events uses this target to invoke
another AWS service to process the event. For example, you can create a target using an Amazon Simple
Notification Service (Amazon SNS) topic to send a notification to your email when a labeling job status
changes.
Prerequisites:
To create a CloudWatch Events rule, you will need an AWS Identity and Access Management (IAM)
role with an events.amazonaws.com trust policy attached. The following is an example of an
events.amazonaws.com trust policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"Service": [
"events.amazonaws.com"
]
},
"Action": "sts:AssumeRole"
}
]
}
Topics
• Send Events to CloudWatch Events (p. 842)
• Set Up a Target to Process Events (p. 843)
• Labeling Job Expiration (p. 844)
• Declining Tasks (p. 844)
For example, you can create a rule that notifies you when a labeling job status changes to
Completed. When using the put-rule command, specify the following to receive
labeling job statuses:
• \"source\":[\"aws.sagemaker\"]
• \"detail-type\":[\"SageMaker Ground Truth Labeling Job State Change\"]
To configure a CloudWatch Events rule to watch for all status changes, use the following command and
replace the placeholder text. For example, replace "GTLabelingJobStateChanges" with a unique
CloudWatch Events rule name and "arn:aws:iam::111122223333:role/MyRoleForThisRule"
with the Amazon Resource Name (ARN) of an IAM role with an events.amazonaws.com trust policy
attached.
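The original CLI command is not reproduced here. As a sketch, the following Boto3 call creates an equivalent rule; the rule name and role ARN mirror the placeholders above.

import boto3
import json

events = boto3.client("events")

# Rule that matches every Ground Truth labeling job status change.
# The rule name and role ARN are placeholders.
events.put_rule(
    Name="GTLabelingJobStateChanges",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Ground Truth Labeling Job State Change"]
    }),
    RoleArn="arn:aws:iam::111122223333:role/MyRoleForThisRule",
)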
The following example creates a CloudWatch Events rule that notifies you when a labeling job in us-
west-2 (Oregon) changes to Completed.
The following example creates a CloudWatch Events rule that notifies you when a labeling job in us-
east-1 (Virginia) changes to Completed or Failed.
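Those commands are not reproduced here, but as a sketch, only the event pattern changes: a detail filter narrows the match to specific statuses, and the client's Region selects where the rule watches. The values below are placeholders.

# Event pattern that matches only labeling jobs that change to
# Completed or Failed. Use it as the EventPattern argument of put_rule
# on a client configured for the Region of your labeling jobs.
pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Ground Truth Labeling Job State Change"],
    "detail": {"LabelingJobStatus": ["Completed", "Failed"]}
}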
To learn more about the put-rule request, see Event Patterns in CloudWatch Events in the Amazon
CloudWatch Events User Guide.
{
"version": "0",
"id": "111e1111-11d1-111f-b111-1111b11dcb11",
"detail-type": "SageMaker Ground Truth Labeling Job State Change",
"source": "aws.sagemaker",
"account": "111122223333",
"time": "2018-10-06T12:26:13Z",
"region": "us-east-1",
"resources": [
"arn:aws:sagemaker:us-east-1:111122223333:labeling-job/test-labeling-job"
],
"detail": {
"LabelingJobStatus": "Completed"
}
}
To process events, you need to set up a target. For example, if you want to receive an email when your
labeling job status changes, use a procedure in Setting Up Amazon SNS Notifications in the Amazon
CloudWatch User Guide to set up an Amazon SNS topic and subscribe your email to it. Once you have
created a topic, you can use it to create a target.
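As a minimal Boto3 sketch, the target is attached to the rule as follows; the rule name, target ID, and topic ARN are placeholders.

import boto3

events = boto3.client("events")

# Attach an Amazon SNS topic as the rule's target so a notification is
# published on every matching labeling job status change.
events.put_targets(
    Rule="GTLabelingJobStateChanges",
    Targets=[{
        "Id": "LabelingJobStatusToSNS",  # placeholder target ID
        "Arn": "arn:aws:sns:us-east-1:111122223333:MyLabelingJobTopic"
    }],
)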
Declining Tasks
Workers are able to decline tasks.
Workers can decline a task if the instructions are not clear, the input data does not display correctly,
or if they encounter some other issue with the task. If a task is declined by a number of workers
equal to the number of workers per dataset object (NumberOfHumanWorkersPerDataObject), the
data object is marked as expired and is not sent to additional workers.
Amazon SageMaker Ground Truth Plus is a turnkey data labeling service that helps you create
high-quality training datasets without having to build labeling applications or manage
labeling workforces on your own. You can get started with Amazon SageMaker Ground Truth Plus by
uploading data along with the labeling requirements in Amazon S3.
To train a machine learning (ML) model, data scientists need large, high-quality, labeled datasets. As
ML adoption grows, labeling needs increase. This forces data scientists to spend weeks on building data
labeling workflows and managing a data labeling workforce. Unfortunately, this slows down innovation
and increases cost. To ensure data scientists can spend their time building, training, and deploying ML
models, data scientists typically task other in-house teams consisting of data operations managers and
program managers to produce high-quality training datasets. However, these teams typically don't have
access to the skills required to deliver high-quality training datasets, which affects ML results. As a
result, you may look for a data labeling partner that can help you create high-quality training datasets
at scale without consuming your in-house resources.
When you upload the data, SageMaker Ground Truth Plus sets up the data labeling workflows and
operates them on your behalf. From there, an expert workforce trained on a variety of machine learning
(ML) tasks performs data labeling. SageMaker Ground Truth Plus currently offers two types of expert
workforce: an Amazon employed workforce and a curated list of third-party vendors. SageMaker Ground
Truth Plus provides you with the flexibility to choose the labeling workforce. AWS experts select the best
labeling workforce based on your project requirements. For example, if you need people proficient in
labeling audio files, specify that in the guidelines provided to SageMaker Ground Truth Plus, and the
service automatically selects labelers with those skills.
Note
SageMaker Ground Truth Plus does not support PHI, PCI, or FedRAMP certified data, and you
should not provide this data to SageMaker Ground Truth Plus.
There are five main components to the SageMaker Ground Truth Plus workflow.
• Requesting a project
• Creating a project team
• Accessing the project portal to monitor progress of training datasets and review labeled data
• Creating a batch
• Receiving the labeled data
If you are a first-time user of SageMaker Ground Truth Plus, we recommend that you follow the
procedures outlined in the Getting Started with Amazon SageMaker Ground Truth Plus (p. 845)
section.
To get started using SageMaker Ground Truth Plus, review Set Up Amazon SageMaker Ground Truth Plus
Prerequisites (p. 845) and Core Components of Amazon SageMaker Ground Truth Plus (p. 846).
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.
When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.
AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view
your current account activity and manage your account by going to https://aws.amazon.com/ and
choosing My Account.
1. Sign in to the AWS Management Console as the account owner by choosing Root user and entering
your AWS account email address. On the next page, enter your password.
For help signing in by using root user, see Signing in as the root user in the AWS Sign-In User Guide.
2. Turn on multi-factor authentication (MFA) for your root user.
For instructions, see Enable a virtual MFA device for your AWS account root user (console) in the IAM
User Guide.
• For your daily administrative tasks, grant administrative access to an administrative user in AWS IAM
Identity Center (successor to AWS Single Sign-On).
For instructions, see Getting started in the AWS IAM Identity Center (successor to AWS Single Sign-On)
User Guide.
• To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email
address when you created the IAM Identity Center user.
For help signing in using an IAM Identity Center user, see Signing in to the AWS access portal in the
AWS Sign-In User Guide.
• Project: Each qualified engagement with an AWS expert results in a SageMaker Ground Truth Plus
project. A project can be in the pilot or production stage.
• Batch: A batch is a collection of similar recurring data objects such as images, video frames and text to
be labeled. A project can have multiple batches.
• Metrics: Metrics are data about your SageMaker Ground Truth Plus project for a specific date or over a
date range.
• Task type: SageMaker Ground Truth Plus supports five task types for data labeling: text, image,
video, audio, and 3D point cloud. You can also have a custom task type.
• Data objects: Individual items that are to be labeled.
Request a Project
You can request a pilot, free of cost, by creating a project.
To get started with an Amazon SageMaker Ground Truth Plus pilot, do the following.
a. Under General information, enter your First name, Last name, and Business email address. An
AWS expert uses this information to contact you to discuss the project after you submit the
request.
b. Under Project overview, enter your Project name and Project description. Choose the Task
type based on your data and use case. You can also indicate if your data contains personally
identifiable information (PII).
c. Create or select an IAM role that grants SageMaker Ground Truth Plus permissions to perform a
labeling job by choosing one of the options below.
i. You can Create an IAM role that provides access to any S3 bucket you specify.
ii. You can Enter a custom IAM role ARN.
iii. You can choose an existing role.
iv. If you use an existing role or a custom IAM role ARN, make sure you have the following IAM
role and trust policy.
IAM role
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
//Ex: "arn:aws:s3:::input-data-to-label/*"
]
}
]
}
Trust policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "sagemaker-ground-truth-plus.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
Once you create a project, you can find it on the SageMaker Ground Truth Plus page, under the Projects
section. The project status should be Review in progress.
Note
You cannot have more than 5 projects with the Review in progress status.
To add team members using Amazon Cognito, you have two options:
a. Enter an Amazon Cognito user group name. This name cannot be changed.
b. Enter the email addresses of up to 50 team members in the Email addresses field. The
addresses must be separated by a comma.
c. Choose Create project team.
d. Your team members receive an email inviting them to join the SageMaker Ground Truth Plus
project team as shown in the following image.
a. Choose a user pool that you have created. User pools require a domain and an existing user
group. If you get an error that the domain is missing, set it in the Domain name options on the
App integration page of the Amazon Cognito console for your group.
b. Choose an app client. We recommend using a client generated by Amazon SageMaker.
c. Choose a user group from your pool to import its members.
d. Choose Create project team.
You can view and manage the list of team members through the AWS console.
Once you have added members to your project team, you can open the project portal to access your
projects.
Each project consists of one or more batches. A batch is a collection of recurring similar data objects
(text, image, video frame, and point cloud) to be labeled. The project portal provides you with
transparency into the data labeling process. You can stay updated about a project, create batches within
a project, review the progress of the datasets across multiple projects, and analyze project metrics. The
project portal also allows you to review a subset of the labeled data and provide feedback. You can
configure the columns displayed in your project and batch table.
You can use the SageMaker Ground Truth Plus project portal to track the following details about your
project.
Status: A SageMaker Ground Truth Plus project has one of the following status types:
1. Review in progress: You have successfully submitted the project request form. An AWS expert is
currently reviewing your request.
2. Request approved: Your project request is approved. You can now share your data by creating a new
batch from the project portal.
3. Workflow design and setup progress: An AWS expert is setting up your project.
4. Pilot in-progress: Object labeling for the project in the pilot stage is currently in progress.
5. Pilot complete: Object labeling is complete and the labeled data is stored in your Amazon S3 bucket.
6. Pricing complete: An AWS expert shares the pricing for the production project with you.
7. Contract executed: The contract is complete.
8. Production in-progress: Labeling for the project in the production stage is in progress.
9. Production complete: Object labeling is complete and the labeled data is stored in your Amazon S3
bucket.
10.Paused: Project is currently paused at your request.
Task type: SageMaker Ground Truth Plus lets you label five types of tasks that include text, image, video,
audio, and point cloud.
Failed objects: Number of objects that cannot be labeled due to an issue with the input data.
Create a Batch
You can use the project portal to create batches for a project after the project status changes to
Request approved.
To create a batch successfully, make sure you meet the following criteria:
Note
You cannot create a batch before the project status changes to Request approved.
Review Metrics
Metrics are data about your SageMaker Ground Truth Plus project for a specific date or over a date range.
You can review metrics for all batches or choose a batch of your choice as shown in the following image.
Objects completed by day: Total number of objects labeled on a specific date or over a date range.
Labels completed by day: Total number of labels completed on a specific date or over a date range. An
object can have more than one label.
Review Batches
Every Amazon SageMaker Ground Truth Plus project consists of one or more batches. Each batch is made
up of data objects to be labeled. You can view all the batches for your project using the project portal as
shown in the following image.
You can use the SageMaker Ground Truth Plus project portal to track the following details about every
batch:
Status: A SageMaker Ground Truth Plus batch has one of the following status types:
Task type: SageMaker Ground Truth Plus lets you label five types of tasks that include text, image, video,
audio, and point cloud.
Failed objects: Number of objects that cannot be labeled due to an issue with the input data.
Objects to review: Number of objects that are ready for your review.
Objects with feedback: Number of objects that have received feedback from the team members.
SageMaker Ground Truth Plus lets you review a sample set of your labeled data (determined during the
initial consultation call) through the review UI shown in the following image.
The portal allows your project team members and you to review a small sample set of the labeled
objects for each batch. You can provide feedback for each labeled object within that subset through this
UI. The review UI allows you to navigate across the subset of labeled objects and provide feedback for
those labeled objects.
You can perform the following actions using the review UI.
• Use the arrow controls on the bottom left to navigate through the data objects.
• You can provide feedback for each object. The Feedback section is in the right panel. Choose Submit
to submit feedback for all images.
• Use the image controls in the bottom tray to zoom, pan, and control contrast.
• If you plan on returning to finish up your review, choose Stop and resume later on the top right.
• Choose Save to save your progress. Your progress is also autosaved every 15 minutes.
• To exit the review UI, choose Close on the upper right corner of the review UI.
• You can verify the Label attributes and Frame attributes on each frame using the panel on the right.
You cannot create new objects or modify existing objects in this task.
Accept or Reject Batches
If you accept a batch, the output from that labeling job is placed in the Amazon S3 bucket that you
specify. Once the data is delivered to your S3 bucket, the status of your batch changes from Accepted to
Data delivered.
If you reject a batch, you can provide feedback and explain your reasons for rejecting the batch.
SageMaker Ground Truth Plus allows you to provide feedback at the data object level as well as the
batch level. You can provide feedback for data objects through the review UI. You can use the project
portal to provide feedback for each batch. When you reject a batch, an AWS expert contacts you to
determine the rework process and the next steps for the batch.
Note
Accepting or rejecting a batch is a one-time action and cannot be undone. It is necessary to
either accept or reject every batch of the project.
Collecting and labeling data in dynamic environments with variations in object size, shape, color,
position, background, and lighting is often a time-consuming and expensive process. To effectively
train a model to operate in a dynamic environment, ML scientists must collect a large set of real-world
images to represent all possible scenarios, a process that can take months. For scenarios that don’t occur
frequently, such as rare product defects and faulty product placement, it can take years to capture a
sufficient number of images to train a computer vision (CV) model. To acquire images with product
defects, ML scientists
may intentionally damage products in order to acquire defective images. Ground Truth synthetic data
makes it faster and more cost effective for ML scientists to quickly acquire labeled images that represent
real-world scenarios, a core requirement for training CV models. ML scientists can use Ground Truth
synthetic data to generate thousands of synthetic images from 3D virtual environments representing
real world scenarios in hours instead of months. Ground Truth provides a synthetic image fidelity and
diversity report and a manifest file along with the labeled synthetic data. The synthetic image fidelity
and diversity report provides statistics and plots that help you better understand the generated synthetic
images. The manifest file contains information about the images and image labels that you can use to
train and test a model.
Note
Ground Truth synthetic data does not support PHI, PCI, or FedRAMP certified data, and you
should not provide this data to Ground Truth synthetic data.
If you are a first-time user of Ground Truth synthetic data, we recommend that you follow the
procedures outlined in the Getting Started with Amazon SageMaker Ground Truth Synthetic
Data (p. 856) section.
To get started using synthetic data, review Set Up Amazon SageMaker Ground Truth Synthetic
Data Prerequisites (p. 856) and Core Components of Amazon SageMaker Ground Truth Synthetic
Data (p. 857).
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.
When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.
AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view
your current account activity and manage your account by going to https://aws.amazon.com/ and
choosing My Account.
1. Sign in to the AWS Management Console as the account owner by choosing Root user and entering
your AWS account email address. On the next page, enter your password.
For help signing in by using root user, see Signing in as the root user in the AWS Sign-In User Guide.
2. Turn on multi-factor authentication (MFA) for your root user.
For instructions, see Enable a virtual MFA device for your AWS account root user (console) in the IAM
User Guide.
• For your daily administrative tasks, grant administrative access to an administrative user in AWS IAM
Identity Center (successor to AWS Single Sign-On).
For instructions, see Getting started in the AWS IAM Identity Center (successor to AWS Single Sign-On)
User Guide.
• To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email
address when you created the IAM Identity Center user.
For help signing in using an IAM Identity Center user, see Signing in to the AWS access portal in the
AWS Sign-In User Guide.
• Project: Each qualified engagement with an AWS expert results in a Ground Truth synthetic data
project.
• Batch: A batch is a collection of similar labeled images. A batch can be in the test or production
stage. A project can have multiple batches.
• Synthetic Image Fidelity and Diversity Report: Ground Truth synthetic data provides a metrics report
that helps you compare the generated synthetic images with your typical dataset.
Request a Project
To get started with Amazon SageMaker Ground Truth synthetic data, go to the SageMaker console and
complete the intake form.
Once you submit the intake form in the AWS console, an AWS expert from the Ground Truth synthetic
data team reaches out to discuss your data labeling project requirements and pricing.
Create an Amazon S3 bucket to share your project assets with Ground Truth synthetic data and store
your project’s output data.
1. Follow the instructions in Create a Bucket in the Amazon Simple Storage Service Console User Guide.
2. We recommend using the following naming convention while storing your data in an Amazon S3
bucket.
Note
If you have additional requirements for accessing your data in an Amazon S3 bucket, please
contact your AWS expert.
To share your project assets with the Ground Truth synthetic data team for project evaluation, work
estimation, and synthetic data generation, follow the steps in the Send Project Data to Ground Truth
Synthetic Data (p. 859) section below.
After receiving your intake form and project assets, we return a statement of work (SOW) within 5
business days. The SOW outlines your engagement with Ground Truth synthetic data generation and
labeling. After you approve the SOW, the Ground Truth synthetic data team produces a test batch
consisting of 50 synthetic images. An AWS expert meets with you to review the test batch, approve
or reject images, and complete the final production. The timeline for this is based on the responses in
your intake form.
1. Under the Project data transfers table in the project portal, choose Send project data.
2. Enter the name of your S3 bucket from which you would like to send project data as the Amazon S3
source location of the project data transfer.
3. Select an IAM role for the project data transfer. If you select Automatic, Ground Truth synthetic data
creates an IAM role in your account with the required permissions to run the project data transfer
and call other services on your behalf (recommended). If you select an existing IAM role in your
account, Ground Truth synthetic data uses that IAM role to run the project data transfer and call
other services on your behalf.
4. Choose Create to create and start the project data transfer.
After creating a project data transfer, you can view the status of the transfer in the Project data
transfers table on the project details page in the project portal. When the project data transfer status is
Completed, the project data is available to the Ground Truth synthetic data team.
Project Portal
Each project consists of one or more batches. A batch is a collection of similar generated and labeled
images. The project portal provides you access to the projects you have contracted with Ground Truth
synthetic data. You can view the status of your projects and access completed batches along with the
synthetic image fidelity and diversity report. You also review your batches to accept or reject them
through the project portal.
You can use the Ground Truth synthetic data project portal to track the following details about your
project:
Status: A Ground Truth synthetic data project has one of the following status types:
1. Request submitted: You have successfully submitted the project request form. Next, an AWS expert
schedules a call with you to discuss the details for your project.
2. Review in progress: We are reviewing your project. An AWS expert has been assigned to your project.
3. Production in progress: We are currently working on generating labeled data for your project.
4. Data ready for review: At least one batch is ready for your review.
5. Project complete: We have completed the generation of the required labeled images. The images are
stored in your Amazon S3 bucket.
Completed images: Number of labeled images generated across all accepted production batches.
Delete a Project
You can delete a project using the console if the project status is Request submitted or Project
complete. To delete a project with any other status, contact your AWS expert. Deleting a Ground Truth
synthetic data project does not delete your data from your Amazon S3 buckets, and you may continue
to incur charges for that data.
• Request submitted: Deleting a requested project deletes all the project and customer information
from the Ground Truth synthetic data database.
• Review in progress / Production in progress / Data ready for review: You can request your AWS
expert to delete a project with one of these statuses. Deleting a project deletes all the project and
personal information from the Ground Truth synthetic data database and S3 buckets.
• Project complete: Once a project is marked as Project complete, we delete all customer information
from the Ground Truth synthetic data database and S3 buckets. You can view the project and batches
as long as you want, or delete them using the console.
Note
Deleting a project does not delete the images from your S3 bucket. To learn more about
deleting images from your S3 bucket, refer to Deleting Amazon S3 objects.
Review Batches
Every Amazon SageMaker Ground Truth synthetic data project consists of one or more batches. Each
batch is made up of labeled synthetic images. Batches are of two types: test batches and production
batches. A test batch provides a small preview of how the synthetic images look using your 3D assets and
environment. Images in the test batch are not counted towards the total number of synthetic images you
contract. After you approve the test batch for a specific configuration of images, Ground Truth synthetic
data starts generating images for your production batch. Images in a production batch are counted
towards the total required images.
For every batch, Ground Truth synthetic data provides a Synthetic Image Fidelity and Diversity
Report. This report provides image and object level statistics and plots that help you make sense of
the generated synthetic images. The statistics are used to describe the diversity and the fidelity of the
synthetic images and compare them with real images. Examples of the statistics and plots provided are
the distributions of object classes, object sizes, image brightness, image contrast, as well as the plots
evaluating the indistinguishability between synthetic and real images. The raw data for all the computed
dataset statistics is also provided as CSV files to help you accelerate model debugging and enable further
analyses.
You can view all the batches for your project using the project portal.
You can use the Ground Truth synthetic data project portal to track the following details about every
batch:
Status: A Ground Truth synthetic data batch has one of the following status types:
1. In progress: We are currently generating labeled images for this batch. It will soon be ready for your
review.
2. Ready for review: A batch of labeled synthetic images is now ready for your review. Follow the steps
in the Transfer Batch Data to Your Amazon S3 bucket (p. 862) section to view the images and review
the batch.
3. Accepted: You have accepted this batch.
4. Rejected: You have rejected this batch and it needs to be reworked. When you reject a batch, an AWS
expert contacts you to discuss this further.
To transfer batch data to your Amazon S3 bucket:
1. On the batch details page in the project portal, choose Get batch data.
2. Under S3 destination location, enter the name of the S3 bucket where you would like to receive
your batch data.
3. Select an IAM role for the project data transfer. If you select Automatic, Ground Truth synthetic data
creates an IAM role in your account with the required permissions to run the project data transfer
and call other services on your behalf (recommended). If you select an existing IAM role in your
account, Ground Truth synthetic data uses that IAM role to run the project data transfer and call
other services on your behalf.
4. Choose Create to create and start the batch data transfer.
After creating a batch data transfer, you can view the status of the transfer in the Batch data transfers
table on the batch details page in the project portal. When the batch data transfer status is Completed,
the batch data is available in your S3 bucket, the batch images are viewable on the batch details page in
the project portal, and you can proceed to review the batch.
Accept or Reject Batches
When you accept a batch, your AWS expert continues or completes the project, based on the number of
remaining images you contracted.
When you reject a batch, an AWS expert contacts you to determine the rework process and the next
steps for the batch.
Accepting or rejecting a batch is a one-time action and can only be undone by contacting your AWS
expert. You must accept or reject every batch in the project.
Create and Manage Workforces
When you use a private workforce, you also create work teams, a group of workers from your workforce
that are assigned to specific jobs— Amazon SageMaker Ground Truth labeling jobs or Amazon
Augmented AI human review tasks. You can have multiple work teams and can assign one or more work
teams to each job.
You can use Amazon Cognito or your own private OpenID Connect (OIDC) Identity Provider (IdP) to
manage your private workforce and work teams. For more information about the permissions required to
manage your workforce this way, see Permissions Required to Use the Amazon SageMaker Ground Truth
Console (p. 3058).
Topics
• Using the Amazon Mechanical Turk Workforce (p. 863)
• Managing Vendor Workforces (p. 867)
• Use a Private Workforce (p. 868)
Using the Amazon Mechanical Turk Workforce
Any Amazon Mechanical Turk workforce billing is handled as part of your Ground Truth or Amazon
Augmented AI billing. You do not need to create a separate Mechanical Turk account to use the Amazon
Mechanical Turk workforce.
Important
You should not share confidential information, personal information, or protected health
information with this workforce. You should not use the Amazon Mechanical Turk workforce
when you use Amazon A2I in conjunction with AWS HIPAA-eligible services, such as Amazon
Textract and Amazon Rekognition, for workloads containing protected health information.
You can choose Mechanical Turk as your workforce when you create a Ground Truth labeling job or
Amazon A2I human review workflow (flow definition). You can create a labeling job and a human review
workflow using the SageMaker console and API.
When you use an API operation to create a labeling job or human review workflow, you use the following
ARN for the Amazon Mechanical Turk workforce for your WorkteamArn. Replace region with the AWS
Region you are using to create the labeling job or human loops. For example, if you create a labeling job
in US West (Oregon), replace region with us-west-2.
• arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default
Ground Truth and Amazon A2I require that your input data is free of personally identifiable information
(PII) when you use Mechanical Turk. If you use the Mechanical Turk workforce and do not specify that
your input data is free of PII, your Ground Truth labeling jobs and Augmented AI tasks will fail. You
specify that your input data is free of PII when you create a Ground Truth labeling job and when you
create an Amazon A2I human loop using a built-in integration or the StartHumanLoop operation.
Use the following sections to learn how to use Mechanical Turk with these services.
Topics
• Use Mechanical Turk with Ground Truth (p. 864)
• Use Mechanical Turk with Amazon A2I (p. 865)
• When is Mechanical Turk Not Supported? (p. 867)
Use Mechanical Turk with Ground Truth
When you create a labeling job, we recommend you adjust the number of workers that annotate each
data object based on the complexity of the job and the quality that you need. Amazon SageMaker
Ground Truth uses annotation consolidation to improve the quality of the labels. More workers can make
a difference in the quality of the labels for more complex labeling jobs, but might not make a difference
for simpler jobs. For more information, see Consolidate Annotations (p. 806). Note that annotation
consolidation is not supported for Amazon A2I human review workflows.
1. Use the following to create a labeling job using the Ground Truth area of the SageMaker console:
Create a Labeling Job (Console) (p. 706).
2. When you are selecting Worker types in the Workers section, select Amazon Mechanical Turk.
3. Specify the total amount of time workers have to complete a task using Task timeout.
4. Specify the total amount of time a task remains available to workers in Task expiration. This is how
long workers have to pick up a task before it fails.
5. Select the Price per task using the dropdown list. This is the amount of money a worker receives for
completing a single task.
6. (Optional) If applicable, select The dataset does not contain adult content. SageMaker may restrict
the Mechanical Turk workers that can view your task if it contains adult content.
7. You must read and confirm the following statement by selecting the check box to use the
Mechanical Turk workforce. If your input data contains confidential information, personal
information, or protected health information, you must select another workforce.
You understand and agree that the Mechanical Turk workforce consists of independent
contractors located worldwide and that you should not share confidential information, personal
information, or protected health information with this workforce.
8. (Optional) Select the check box next to Enable automated data labeling if you want to enable
automated data labeling. To learn more about this feature, see Automate Data Labeling (p. 807).
9. You can specify the Number of workers per dataset object under Additional configuration. For
example, if you enter 3 in this field, each data object will be labeled by 3 workers.
When you create your labeling job by selecting Create, your labeling tasks are sent to Mechanical Turk
workers.
1. Use the following to create a labeling job using the CreateLabelingJob operation: Create a
Labeling Job (API) (p. 709).
2. Use the following for the WorkteamArn. Replace region with the AWS Region you are using to
create the labeling job.
arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default
3. Use TaskTimeLimitInSeconds to specify the total amount of time workers have to complete a
task.
4. Use TaskAvailabilityLifetimeInSeconds to specify the total amount of time a task remains
available to workers. This is how long workers have to pick up a task before it fails.
5. Use NumberOfHumanWorkersPerDataObject to specify the number of workers per dataset
object.
6. Use PublicWorkforceTaskPrice to set the price per task. This is the amount of money a worker
receives for completing a single task.
7. Use DataAttributes to specify that your input data is free of confidential information, personal
information, or protected health information.
Ground Truth requires that your input data is free of personally identifiable information (PII) if you
use the Mechanical Turk workforce. If you use Mechanical Turk and do not specify that your input
data is free of PII using the FreeOfPersonallyIdentifiableInformation flag, your labeling
job will fail.
Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Mechanical Turk workers that can view your task if it contains
adult content.
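The following is a minimal sketch of such a request using the AWS SDK for Python (Boto3). The job
name, bucket, role, and template path are placeholder assumptions, and the pre-annotation and
consolidation Lambda ARNs shown are the ones published for a bounding box task in us-west-2; adjust
them for your task type and Region.

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

sagemaker.create_labeling_job(
    LabelingJobName="example-mturk-job",                      # hypothetical name
    LabelAttributeName="label",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://DOC-EXAMPLE-BUCKET/manifest.json"}
        },
        # Required declarations when using the Mechanical Turk workforce.
        "DataAttributes": {
            "ContentClassifiers": [
                "FreeOfPersonallyIdentifiableInformation",
                "FreeOfAdultContent",
            ]
        },
    },
    OutputConfig={"S3OutputPath": "s3://DOC-EXAMPLE-BUCKET/output/"},
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # hypothetical role
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-west-2:394669845002:workteam/public-crowd/default",
        "UiConfig": {"UiTemplateS3Uri": "s3://DOC-EXAMPLE-BUCKET/template.liquid.html"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:PRE-BoundingBox",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:ACS-BoundingBox"
        },
        "TaskTitle": "Draw a box around each object",
        "TaskDescription": "Draw a tight bounding box around every object in the image",
        "NumberOfHumanWorkersPerDataObject": 3,     # workers per data object
        "TaskTimeLimitInSeconds": 300,              # time a worker has to complete a task
        "TaskAvailabilityLifetimeInSeconds": 21600, # time a task remains available
        "PublicWorkforceTaskPrice": {
            "AmountInUsd": {"Dollars": 0, "Cents": 3, "TenthFractionsOfACent": 6}
        },
    },
)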
You can see examples of how to use this API in the following notebooks, found on GitHub: Ground
Truth Jupyter Notebook Examples. You can access these notebooks under the SageMaker Example
Notebooks (p. 220) in a notebook instance.
Use Mechanical Turk with Amazon A2I
You can choose Mechanical Turk as your workforce when you create a human review workflow using the
SageMaker console or the CreateFlowDefinition operation. When you use this human review workflow to
configure human loops, you must specify that your input data is free of PII.
To use Mechanical Turk when you create a human review workflow (console):
1. Use the following to create a human review workflow in the Augmented AI section of the SageMaker
console: Create a Human Review Workflow (Console) (p. 2967).
2. When you are selecting Worker types in the Workers section, select Amazon Mechanical Turk.
3. Select the Price per task using the dropdown list. This is the amount of money a worker receives for
completing a single task.
4. (Optional) You can specify the Number of workers per dataset object under Additional
configuration. For example, if you enter 3 in this field, each data object will be labeled by 3 workers.
5. (Optional) Specify the total amount of time workers have to complete a task using Task timeout.
6. (Optional) Specify the total amount of time a task remains available to workers in Task expiration.
This is how long workers have to pick up a task before it fails.
7. Once you have created your human review workflow, you can use it to configure a human loop by
providing its Amazon Resource Name (ARN) in the parameter FlowDefinitionArn. You configure
a human loop using one of the API operations of a built-in task type, or the Amazon A2I runtime API
operation, StartHumanLoop. To learn more, see Create and Start a Human Loop (p. 2985).
When you configure your human loop, you must specify that your input data is free of personally
identifiable information (PII) using the FreeOfPersonallyIdentifiableInformation content
classifier in DataAttributes. If you use Mechanical Turk and do not specify that your input data is
free of PII, your human review tasks will fail.
Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Mechanical Turk workers that can view your task if it contains
adult content.
To use Mechanical Turk when you create a human review workflow (API):
1. Use the following to create a human review workflow using the CreateFlowDefinition
operation: Create a Human Review Workflow (API) (p. 2969).
2. Use the following for the WorkteamArn. Replace region with the AWS Region you are using to
create the labeling job.
arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default
3. Use TaskTimeLimitInSeconds to specify the total amount of time workers have to complete a
task.
4. Use TaskAvailabilityLifetimeInSeconds to specify the total amount of time a task remains
available to workers. This is how long workers have to pick up a task before it fails.
5. Use TaskCount to specify the number of workers per dataset object. For example, if you specify 3
for this parameter, each data object will be labeled by 3 workers.
6. Use PublicWorkforceTaskPrice to set the price per task. This is the amount of money a worker
receives for completing a single task.
7. Once you have created your human review workflow, you can use it to configure a human loop by
providing its Amazon Resource Name (ARN) in the parameter FlowDefinitionArn. You configure
a human loop using one of the API operations of a built-in task type, or the Amazon A2I runtime API
operation, StartHumanLoop. To learn more, see Create and Start a Human Loop (p. 2985).
When you configure your human loop, you must specify that your input data is free of personally
identifiable information (PII) using the FreeOfPersonallyIdentifiableInformation content
classifier in DataAttributes. If you use Mechanical Turk and do not specify that your input data is
free of PII, your human review tasks will fail.
Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Mechanical Turk workers that can view your task if it contains
adult content.
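The following is a minimal sketch of these API steps using the AWS SDK for Python (Boto3). The flow
definition name, role, human task UI ARN, bucket, and input content are placeholder assumptions.

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

flow = sagemaker.create_flow_definition(
    FlowDefinitionName="example-mturk-flow",                  # hypothetical name
    RoleArn="arn:aws:iam::111122223333:role/A2IExecutionRole",
    HumanLoopConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-west-2:394669845002:workteam/public-crowd/default",
        "HumanTaskUiArn": "arn:aws:sagemaker:us-west-2:111122223333:human-task-ui/example-ui",
        "TaskTitle": "Review this prediction",
        "TaskDescription": "Review the model prediction and correct it if needed",
        "TaskCount": 3,                             # workers per data object
        "TaskAvailabilityLifetimeInSeconds": 21600,
        "TaskTimeLimitInSeconds": 300,
        "PublicWorkforceTaskPrice": {
            "AmountInUsd": {"Dollars": 0, "Cents": 3, "TenthFractionsOfACent": 6}
        },
    },
    OutputConfig={"S3OutputPath": "s3://DOC-EXAMPLE-BUCKET/a2i-output/"},
)

# Start a human loop with this flow definition; the PII declaration is
# required when using the Mechanical Turk workforce.
a2i = boto3.client("sagemaker-a2i-runtime", region_name="us-west-2")
a2i.start_human_loop(
    HumanLoopName="example-human-loop",
    FlowDefinitionArn=flow["FlowDefinitionArn"],
    HumanLoopInput={"InputContent": '{"example": "input"}'},
    DataAttributes={
        "ContentClassifiers": [
            "FreeOfPersonallyIdentifiableInformation",
            "FreeOfAdultContent",
        ]
    },
)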
You can see examples of how to use this API in the following notebooks, found on GitHub: Amazon A2I
Jupyter Notebook Examples.
When is Mechanical Turk Not Supported?
• This workforce is not supported for Ground Truth video frame labeling jobs and 3D point cloud
labeling jobs.
• You cannot use this workforce if your input data contains personally identifiable information (PII).
• Mechanical Turk is not available in some of the AWS special regions. If applicable, refer to the
documentation for your special region for more information.
Managing Vendor Workforces
Vendors make their services available via the AWS Marketplace. You can find details of the vendor's
services on their detail page, such as the number of workers and the hours that they work. You can use
these details to make estimates of how much the labeling job will cost and the amount of time that you
can expect the job to take. Once you have chosen a vendor you subscribe to their services using the AWS
Marketplace.
A subscription is an agreement between you and the vendor. The agreement spells out the details of the
agreement, such as price, schedule, or refund policy. You work directly with the vendor if there are any
issues with your labeling job.
You can subscribe to any number of vendors to meet your data annotation needs. When you create a
labeling job or human review workflow, you can specify that the job be routed to a specific vendor.
Important
Before you send sensitive data to a vendor, check the vendor's security and compliance
practices on their detail page and review the end user license agreement (EULA) that is part
of your subscription agreement. You are responsible for ensuring that the vendor meets your
compliance requirements for personal or confidential information. Do not share protected
health information with this workforce.
You must use the console to subscribe to a vendor workforce. Once you have a subscription, you can use
the ListSubscribedWorkteams operation to list your subscribed vendors.
To subscribe to a vendor workforce using the console:
• For Ground Truth labeling jobs, choose Labeling workforces, choose Vendor, and then choose
Find data labeling services.
• For Amazon A2I human review workflows, choose Human review workforces, choose Vendor, and
then choose Find human review services.
3. The console opens the AWS Marketplace, where you see a list of the vendor services available.
4. Choose a vendor. The AWS Marketplace shows detailed information about the data labeling or
human review service. Use this information to determine if the vendor meets your requirements for
your task.
5. If the vendor meets your requirements, choose Continue to subscribe.
6. Review the details of the subscription. If you agree to the terms, choose Subscribe to complete your
subscription to the service.
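After subscribing, the following is a minimal sketch of listing your subscribed vendors with the AWS SDK
for Python (Boto3), assuming your AWS credentials are already configured:

import boto3

sagemaker = boto3.client("sagemaker")

# List the vendor work teams you are subscribed to through AWS Marketplace.
response = sagemaker.list_subscribed_workteams()
for team in response["SubscribedWorkteams"]:
    print(team["WorkteamArn"])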
Use a Private Workforce
Each AWS account has access to a single private workforce per Region, and the owner can create
multiple private work teams within that workforce. A single private work team is used to complete a
labeling job or human review task (a job). You can assign each work team to a separate job or use a
single team for multiple jobs. A single worker can be in more than one work team.
Your private workforce can either be created and managed using Amazon Cognito or your own private
OpenID Connect (OIDC) Identity Provider (IdP).
If you are a new user of Amazon SageMaker Ground Truth or Amazon Augmented AI and do not require
your workers to be managed with your own IdP, it is recommended that you use Amazon Cognito to
create and manage your private workforce.
After you create a workforce, in addition to creating and managing work teams, you can perform the
tasks described in the following topics.
Note
Your private workforce is shared between Ground Truth and Amazon A2I. To create and manage
private work teams used by Augmented AI, use the Ground Truth section of the SageMaker
console.
Topics
• Create and Manage Amazon Cognito Workforce (p. 869)
• Create and Manage OIDC IdP Workforce (p. 876)
• Manage Private Workforce Using the Amazon SageMaker API (p. 885)
• Track Worker Performance (p. 886)
• Create and manage Amazon SNS topics for your work teams (p. 887)
Create and Manage Amazon Cognito Workforce
Topics
• Create a Private Workforce (Amazon Cognito) (p. 869)
• Manage a Private Workforce (Amazon Cognito) (p. 871)
You can create an Amazon Cognito workforce in one of the following ways:
• Create a new workforce while you are creating your labeling job. To learn how, see Create an Amazon
Cognito Workforce When Creating a Labeling Job (p. 870).
• Create a new workforce before you create your labeling job. To learn how, see Create an Amazon
Cognito Workforce Using the Labeling Workforces Page (p. 870).
• Import an existing workforce after creating a user pool in the Amazon Cognito console. To learn how,
see Create a Private Workforce (Amazon Cognito Console) (p. 871).
Once you create a private workforce, that workforce and all work teams and workers associated with it
are available to use for all Ground Truth labeling job tasks and Amazon Augmented AI human review
workflows tasks.
If you are new to Amazon SageMaker and want to test Ground Truth or Amazon A2I, we suggest that you
create a private work team consisting of people from your organization using the console. Use this work
team when creating labeling or human review workflows (flow definitions) to test your worker UI and job
workflow.
Topics
• Create a Private Workforce (Amazon SageMaker Console) (p. 869)
• Create a Private Workforce (Amazon Cognito Console) (p. 871)
Create a Private Workforce (Amazon SageMaker Console)
You can create a private workforce in the Amazon SageMaker console in one of two ways:
• When creating a labeling job in the Labeling jobs page of the Amazon SageMaker Ground Truth
section.
• Using the Labeling workforces page of the Amazon SageMaker Ground Truth section. If you are
creating a private workforce for an Amazon A2I human review workflow, use this method.
Both of these methods also create a default work team containing all of the members of the
workforce. This private workforce is available to use for both Ground Truth and Amazon Augmented AI
jobs.
When you create a private workforce using the console, SageMaker uses Amazon Cognito as an identity
provider for your workforce. If you want to use your own OpenID Connect (OIDC) Identity Provider (IdP)
to create and manage your private workforce, you must create a workforce using the SageMaker API
operation CreateWorkforce. To learn more, see Create a Private Workforce (OIDC IdP) (p. 876).
Create an Amazon Cognito Workforce When Creating a Labeling Job
If you haven't created a private workforce when you create your labeling job and you choose to use
private workers, you are prompted to create a work team. This will create a private workforce using
Amazon Cognito.
When you create the labeling job, an email is sent to each worker inviting them to join the workforce.
After creating the workforce, you can add, delete, and disable workers using the SageMaker console or
the Amazon Cognito console.
Create an Amazon Cognito Workforce Using the Labeling Workforces Page
To create and manage your private workforce using Amazon Cognito, you can use the Labeling
workforces page. When following the instructions below, you have the option to create a private
workforce by entering worker emails or by importing a pre-existing workforce from an Amazon Cognito
user pool. To import a workforce, see Create a Private Workforce (Amazon Cognito Console) (p. 871).
After you import your private workforce, refresh the page. On the Private workforce summary page,
you can see information about the Amazon Cognito user pool for your workforce, a list of work teams for
your workforce, and a list of all of the members of your private workforce.
Note
If you delete all of your private work teams, you have to repeat this process to use a private
workforce in that region.
Create a Private Workforce (Amazon Cognito Console)
To create a private workforce using Amazon Cognito, you must have an existing Amazon Cognito user
pool containing at least one user group. See Tutorial: Creating a User Pool to learn how to create a user
pool. See Adding Groups to a User Pool to learn how to add a user group to a pool.
Once your user pool has been created, follow the steps below to create a private workforce by importing
that user pool into Amazon SageMaker.
Important
After you create a workforce using an Amazon Cognito user pool, it should not be deleted
without first deleting all work teams associated with that pool in the SageMaker console.
After you import your private workforce, refresh the page to see the Private workforce summary page.
On this page, you can see information about the Amazon Cognito user pool for your workforce, a list of
work teams for your workforce, and a list of all of the members of your private workforce. This workforce
is now available to use in both Amazon Augmented AI and Amazon SageMaker Ground Truth for human
review tasks and data labeling jobs respectively.
Manage a Private Workforce (Amazon Cognito)
You can manage your private workforce using either the SageMaker console or the Amazon Cognito
console.
You can restrict access to tasks to workers at specific IP addresses using the SageMaker API. For more
information, see Manage Private Workforce Using the Amazon SageMaker API (p. 885).
Topics
• Manage a Workforce (Amazon SageMaker Console) (p. 872)
• Manage a Private Workforce (Amazon Cognito Console) (p. 874)
Manage a Workforce (Amazon SageMaker Console)
You can use the Amazon SageMaker console to create and manage the work teams and individual
workers that make up a private workforce.
Use a work team to assign members of your private workforce to a labeling or human review job. When
you create your workforce using the SageMaker console, there is a work team called Everyone-in-
private-workforce that enables you to assign your entire workforce to a job. Because an imported
Amazon Cognito user pool may contain members that you don't want to include in your work teams, a
similar work team is not created for Amazon Cognito user pools.
• You can create a work team in the SageMaker console and add members from your workforce to the
team.
• You can create a user group by using the Amazon Cognito console and then create a work team by
importing the user group. You can import more than one user group into each work team. You manage
the members of the work team by updating the user group in the Amazon Cognito console. See
Manage a Private Workforce (Amazon Cognito Console) (p. 874) for more information.
You can create a new Amazon Cognito user group or import an existing user group using the SageMaker
console, on the Labeling workforces page. For more information on creating a user group in the Amazon
Cognito console, see Manage a Private Workforce (Amazon Cognito Console) (p. 874).
• If you chose Create a team by adding workers to a new Amazon Cognito user group, select the
workers to add to the team.
• If you chose Create a team by importing existing Amazon Cognito user groups, choose the user
groups that are part of the new team.
6. If you select an SNS topic, all workers added to the team are subscribed to the Amazon SNS topic
and notified when new work items are available to the team. Select from a list of your existing
Ground Truth related Amazon SNS topics or select Create new topic to open a topic-creation dialog.
Amazon SNS notifications are supported by Ground Truth and are not supported by Augmented AI.
If you subscribe workers to receive SNS notifications, they only receive notifications about Ground
Truth labeling jobs. They do not receive notifications about Augmented AI tasks.
Workers in a work team subscribed to a topic receive notifications when a new Ground Truth labeling
job for that team becomes available and when one is about to expire.
Read Create and manage Amazon SNS topics for your work teams (p. 887) for more information about
using Amazon SNS topics.
Subscriptions
After you have created a work team, you can see more information about the team and change or set the
Amazon SNS topic to which its members are subscribed by visiting the Amazon Cognito console. If you
added any team members before you subscribed the team to a topic, you need to manually subscribe
those members to that topic. Read Create and manage Amazon SNS topics for your work teams for more
information on creating and managing the Amazon SNS topic.
A work team is a group of workers within your workforce to whom you can assign jobs. A worker can be
added to more than one work team. Once a worker has been added to a work team, that worker can be
disabled or removed.
Adding a worker to the workforce enables you to add that worker to any work team within that
workforce.
A worker must be added to the workforce before being added to a work team. To add a worker to a work
team, first navigate to the Private workforce summary page using the steps above.
To add a worker to a work team from the private workforce summary page
1. In the Private teams section, choose the team to which you want to add the workers.
2. Choose the Workers tab.
3. Choose Add workers to team and choose the boxes next to the workers that you want to add.
4. Choose Add workers to team.
Disabling a worker stops the worker from receiving a job. This action does not remove the worker from
the workforce, or from any work team with which the worker is associated. To disable or remove a worker
from a work team, first navigate to the private workforce summary page using the steps above.
1. In the Workers section, choose the worker that you would like to disable.
2. Choose Disable.
If desired, you can re-enable a worker after they have been disabled by choosing Enable.
You can remove workers from your private workforce directly in the SageMaker console if that worker
was added in this console. If you added the worker (user) in the Amazon Cognito console, see Manage
a Private Workforce (Amazon Cognito Console) (p. 874) to learn how to remove the worker in the
Amazon Cognito console.
1. In the Workers section, choose the worker that you would like to delete.
2. If the worker has not been disabled, choose Disable.
3. Select the worker and choose Delete.
Manage a Private Workforce (Amazon Cognito Console)
A private workforce corresponds to a single Amazon Cognito user pool. Private work teams correspond
to Amazon Cognito user groups within that user pool. Workers correspond to Amazon Cognito users
within those groups.
After your workforce has been created, you can add work teams and individual workers through the
Amazon Cognito console. You can also delete workers from your private workforce or remove them from
individual teams in the Amazon Cognito console.
Important
You can't delete work teams from the Amazon Cognito console. Deleting an Amazon Cognito user
group that is associated with an Amazon SageMaker work team will result in an error. To remove
work teams, use the SageMaker console.
You can create a new work team to complete a job by adding an Amazon Cognito user group to the user
pool associated with your private workforce. To add an Amazon Cognito user group to an existing user
pool, see Adding groups to a User Pool.
When you create a work team, you can optionally subscribe it to an Amazon SNS topic. Choose from
a list of your existing SNS topics related to SageMaker Ground Truth or Amazon Augmented AI or
choose Create new topic to create one.
Note
Amazon SNS notifications are supported by Ground Truth and are not supported by
Augmented AI. If you subscribe workers to receive SNS notifications, they only receive
notifications about Ground Truth labeling jobs. They do not receive notifications about
Augmented AI tasks.
Subscriptions
After you have created a work team, you can see more information about the team and change or set
the SNS topic to which its members are subscribed using the Amazon Cognito console. If you added
any team members before you subscribed the team to a topic, you need to manually subscribe those
members to that topic. For more information, see Create and manage Amazon SNS topics for your work
teams (p. 887).
When using the Amazon Cognito console to add workers to a work team, you must add a user to the user
pool associated with the workforce before adding that user to a user group. Users can be added to a user
pool in various ways. For more information, see Signing Up and Confirming User Accounts.
After a user has been added to a pool, the user can be associated with user groups inside of that pool.
After a user has been added to a user group, that user becomes a worker on any work team created using
that user group.
Disabling a worker stops the worker from receiving jobs. This action doesn't remove the worker from the
workforce, or from any work team the worker is associated with. To remove a user from a work team in
Amazon Cognito, you remove the user from the user group associated with that team.
Create and Manage OIDC IdP Workforce
If you are a new user of Ground Truth or Amazon A2I, you can test your worker UI and job workflow by
creating a private work team and adding yourself as a worker. Use this work team when you create a
labeling job or human review workflow. First, create a private OIDC IdP workforce using the instructions
in Create a Private Workforce (OIDC IdP) (p. 876). Next, refer to Manage a Private Workforce (OIDC
IdP) (p. 882) to learn how to create a work team.
Topics
• Create a Private Workforce (OIDC IdP) (p. 876)
• Manage a Private Workforce (OIDC IdP) (p. 882)
Create a Private Workforce (OIDC IdP)
To create a workforce using an OIDC IdP, your IdP must support groups because Ground Truth and
Amazon A2I use one or more groups that you specify to create work teams. You use work teams to
specify workers for your labeling jobs and human review tasks. Because groups are not a standard
claim, your IdP may have a different naming convention for a group of users (workers). Therefore,
you must identify one or more user groups to which a worker belongs using the custom claim
sagemaker:groups that is sent to Ground Truth or Amazon A2I from your IdP. To learn more, see Send
Required and Optional Claims to Ground Truth and Amazon A2I (p. 876).
You create an OIDC IdP workforce using the SageMaker API operation CreateWorkforce. Once
you create a private workforce, that workforce and all work teams and workers associated with it are
available to use for all Ground Truth labeling job tasks and Amazon A2I human review workflows tasks.
To learn more, see Create an OIDC IdP Workforce (p. 878).
Send Required and Optional Claims to Ground Truth and Amazon A2I
When you use your own IdP, Ground Truth and Amazon A2I use your Issuer, ClientId,
and ClientSecret to authenticate workers by obtaining an authentication CODE from your
AuthorizationEndpoint.
Ground Truth and Amazon A2I will use this CODE to obtain a custom claim from either your IdP's
TokenEndpoint or UserInfoEndpoint. You can either configure TokenEndpoint to return a JSON
web token (JWT) or UserInfoEndpoint to return a JSON object. The JWT or JSON object must contain
required and optional claims that you specify. A claim is a key-value pair that contains information about
a worker or metadata about the OIDC service. The following table lists the claims that must be included,
and that can optionally be included in the JWT or JSON object that your IdP returns.
Note
Some of the parameters in the following table can be specified using a : or a -. For example,
you can specify the groups a worker belongs to using sagemaker:groups or sagemaker-
groups in your claim.
sagemaker:groups or sagemaker-groups
Data type: String or array of strings. The user group or groups to which a worker belongs.
Allowable characters: Regex: [\p{L}\p{M}\p{S}\p{N}\p{P}]+
Quotas: 63 characters per group name
Required: Yes

sagemaker:client_id or sagemaker-client_id
Data type: String. A client ID. All tokens must be issued for this client ID.
Example: "00b600bb-1f00-05d0-bd00-00be00fbd0e0"
Quotas: 128 characters
Required: Yes
The following is an example of the JSON object syntax your UserInfoEndpoint can return.
{
    "sub":"122",
    "exp":"10000",
    "sagemaker-groups":["group1","group2"],
    "sagemaker-name":"name",
    "sagemaker-sub":"122",
    "sagemaker-client_id":"123456"
}
Ground Truth or Amazon A2I compares the groups listed in sagemaker:groups or sagemaker-groups
to verify that your worker belongs to the work team specified in the labeling job or human review task.
After the work team has been verified, labeling or human review tasks are sent to that worker.
Create an OIDC IdP Workforce
You can create a workforce using the SageMaker API operation CreateWorkforce and associated
language-specific SDKs. Specify a WorkforceName and information about your OIDC IdP in the
parameter OidcConfig. It is recommended that you configure your OIDC IdP with a placeholder redirect
URI, and then update the URI with the worker portal URL after you create the workforce. To learn more,
see Configure your OIDC IdP (p. 879).
The following shows an example of the request. See CreateWorkforce to learn more about each
parameter in this request.
CreateWorkforceRequest: {
#required fields
WorkforceName: "example-oidc-workforce",
OidcConfig: {
ClientId: "clientId",
ClientSecret: "secret",
Issuer: "https://example-oidc-idp.com/adfs",
AuthorizationEndpoint: "https://example-oidc-idp.com/adfs/oauth2/authorize",
TokenEndpoint: "https://example-oidc-idp.com/adfs/oauth2/token",
UserInfoEndpoint: "https://example-oidc-idp.com/adfs/oauth2/userInfo",
LogoutEndpoint: "https://example-oidc-idp.com/adfs/oauth2/log-out",
JwksUri: "https://example-oidc-idp.com/adfs/discovery/keys"
},
SourceIpConfig: {
Cidrs: ["string", "string"]
}
}
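The same request, expressed as a minimal sketch using the AWS SDK for Python (Boto3); the IdP
endpoints and CIDR range are placeholder assumptions:

import boto3

sagemaker = boto3.client("sagemaker")

response = sagemaker.create_workforce(
    WorkforceName="example-oidc-workforce",
    OidcConfig={
        "ClientId": "clientId",
        "ClientSecret": "secret",
        "Issuer": "https://example-oidc-idp.com/adfs",
        "AuthorizationEndpoint": "https://example-oidc-idp.com/adfs/oauth2/authorize",
        "TokenEndpoint": "https://example-oidc-idp.com/adfs/oauth2/token",
        "UserInfoEndpoint": "https://example-oidc-idp.com/adfs/oauth2/userInfo",
        "LogoutEndpoint": "https://example-oidc-idp.com/adfs/oauth2/log-out",
        "JwksUri": "https://example-oidc-idp.com/adfs/discovery/keys",
    },
    SourceIpConfig={"Cidrs": ["203.0.113.0/24"]},  # optional IP allow list
)
print(response["WorkforceArn"])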
Configure your OIDC IdP
How you configure your OIDC IdP depends on the IdP you use and your business requirements.
When you configure your IdP, you must specify a callback or redirect URI. After Ground Truth or
Amazon A2I authenticates a worker, this URI redirects the worker to the worker portal where the
worker can access labeling or human review tasks. To create a worker portal URL, you need to create a
workforce with your OIDC IdP details using the CreateWorkforce API operation. Specifically, you must
configure your OIDC IdP with the required custom sagemaker claims (see Send Required and Optional
Claims to Ground Truth and Amazon A2I (p. 876) for more details). Therefore, it is recommended that
you configure your OIDC IdP with a placeholder redirect URI, and then update the URI after you create
the workforce. See Create an OIDC IdP Workforce (p. 878) to learn how to create a workforce using this
API.
You can view your worker portal URL in the SageMaker Ground Truth console, or using the SageMaker
API operation, DescribeWorkforce. The worker portal URL is in the SubDomain parameter in the
response.
Important
Make sure you add the workforce subdomain to your OIDC IdP allow list. When you add the
subdomain to your allow list, it must end with /oauth2/idpresponse.
To view your worker portal URL after creating a private workforce (Console):
To view your worker portal URL after creating a private workforce (API):
When you create a private workforce using CreateWorkforce, you specify a WorkforceName. Use this
name to call DescribeWorkforce. The following are example requests using the AWS SDK for Python
(Boto3) and the AWS CLI.
response = client.describe_workforce(WorkforceName='string')
AWS CLI
aws sagemaker describe-workforce --workforce-name 'string'
Validate Your OIDC IdP Workforce Authentication Response
After you have created your OIDC IdP workforce, you can use the following procedure to validate its
authentication workflow using cURL. This procedure assumes you have access to a terminal and that
you have cURL installed.
1. Copy the following URI and modify it as described below:

{AUTHORIZE ENDPOINT}?client_id={CLIENT ID}&redirect_uri={REDIRECT URI}&response_type=code&scope={SCOPE}

a. Replace {AUTHORIZE ENDPOINT} with the authorize endpoint for your OIDC IdP.
b. Replace {CLIENT ID} with the Client ID from your OAuth client.
c. Replace {REDIRECT URI} with the worker portal URL. If it is not already present, you must add
/oauth2/idpresponse to the end of the URL.
d. If you have a custom scope, use it to replace {SCOPE}. If you do not have a custom scope,
replace {SCOPE} with openid.
The following is an example of a URI after the modifications above are made:
https://example.com/authorize?client_id=f490a907-9bf1-4471-97aa-6bfd159f81ac&redirect_uri=https%3A%2F%2Fexample.labeling.sagemaker.aws%2Foauth2%2Fidpresponse&response_type=code&scope=openid
2. Copy and paste the modified URI from step 1 into your browser and press Enter on your keyboard.
3. Authenticate using your IdP.
4. Copy the authentication code query parameter in the URI. This parameter begins with code=.
The following is an example of what the response might look like. In this example, copy
code=MCNYDB... and everything thereafter.
https://example.labeling.sagemaker.aws/oauth2/idpresponse?code=MCNYDB....
5. Open a terminal and enter the following command after making required modifications listed
below:
a. Replace {TOKEN ENDPOINT} with the token endpoint for your OIDC IdP.
b. Replace {CLIENT ID} with the Client ID from your OAuth client.
c. Replace {CLIENT SECRET} with the Client Secret from your OAuth client.
d. Replace {CODE} with the authentication code query parameter you copied in step 4.
e. Replace {REDIRECT URI} with the worker portal URL.
The following is a minimal sketch of the cURL request after making the modifications described above,
assuming a standard OAuth 2.0 authorization code exchange (your IdP may require additional
parameters):
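curl --request POST \
  --url '{TOKEN ENDPOINT}' \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data 'grant_type=authorization_code' \
  --data 'client_id={CLIENT ID}' \
  --data 'client_secret={CLIENT SECRET}' \
  --data 'code={CODE}' \
  --data 'redirect_uri={REDIRECT URI}'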
6. This step depends on the type of access_token your IdP returns, a plain text access token or a JWT
access token.
• If your IdP does not support JWT access tokens, access_token may be plain text (for example, a
UUID). The response you see may look similar to the following. In this case, move to step 7.
{
"access_token":"179c144b-fccb-4d96-a28f-eea060f39c13",
"token_type":"Bearer",
"expires_in":3600,
"refresh_token":"ef43e52e-9b4f-410c-8d4c-d5c5ee57631a",
"scope":"openid"
}
• If your IdP supports JWT access tokens, step 5 should generate an access token in JWT format. For
example, the response may look similar to the following:
{
"access_token":"eyJh...JV_adQssw5c",
"refresh_token":"i6mapTIAVSp2oJkgUnCACKKfZxt_H5MBLiqcybBBd04",
"refresh_token_expires_in":6327,
"scope":"openid",
"id_token":"eyJ0eXAiOiJK9...-rDaQzUHl6cQQWNiDpWOl_lxXjQEvQ"
}
Copy the JWT and decode it. You can use a Python script or a third-party website to decode it. For
example, you can go to the website https://jwt.io/ and paste the JWT into the Encoded box to
decode it.
7. Use cURL to request worker information from your IdP's user info endpoint, making the following
modifications:

a. Replace {USERINFO ENDPOINT} with the user info endpoint for your OIDC IdP.
b. Replace {ACCESS TOKEN} with the access token in the response you received in step 5. This is
the entry for the "access_token" parameter.
The following is a minimal sketch of the cURL request after making the modifications described above,
assuming your IdP accepts a standard Bearer authorization header:
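curl --request GET \
  --url '{USERINFO ENDPOINT}' \
  --header 'authorization: Bearer {ACCESS TOKEN}'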
8. The response to the final step in the procedure above may look similar to the following code block.
If the access_token returned in step 6 was plain text, you must verify that this response contains
required information. In this case, the response must contain the Required SageMaker claims in the
table found in Send Required and Optional Claims to Ground Truth and Amazon A2I (p. 876). For
example, sagemaker-groups, sagemaker-name.
{
    "sub":"122",
    "exp":"10000",
    "sagemaker-groups":["group1","group2"],
    "sagemaker-name":"name",
    "sagemaker-sub":"122",
    "sagemaker-client_id":"123456"
}
Next Steps
Once you've created a private workforce using your IdP and verified your IdP authentication response,
you can create work teams using your IdP groups. To learn more, see Manage a Private Workforce (OIDC
IdP) (p. 882).
You can restrict worker access to tasks to specific IP addresses, and update or delete your workforce
using the SageMaker API. To learn more, see Manage Private Workforce Using the Amazon SageMaker
API (p. 885).
Manage a Private Workforce (OIDC IdP)
To add workers to an Amazon SageMaker Ground Truth (Ground Truth) labeling job or Amazon
Augmented AI (Amazon A2I) human review task, you create work teams using 1-10 IdP groups and
assign that work team to the job or task. You assign a work team to a job or task by specifying that work
team when you create a labeling job (Ground Truth) or a human review workflow (Amazon A2I).
You can only assign one team to each labeling job or human review workflow. You can use the same
team to create multiple labeling jobs or human review tasks. You can also create multiple work teams to
work on different labeling jobs or human review tasks.
Prerequisites
To create and manage private work teams using your OIDC IdP groups, first you must create a workforce
using the SageMaker API operation CreateWorkforce. To learn more, see Create a Private Workforce
(OIDC IdP) (p. 876).
You can use the SageMaker console to create a private work team using your OIDC IdP workforce on the
Labeling workforces page under Ground Truth. If you are creating a Ground Truth labeling job, you can
also create a private work team while creating a labeling job.
Note
You create and manage work teams for Amazon A2I in the Ground Truth area of the SageMaker
console.
You can also use the SageMaker API and associated language-specific SDKs to create a private work
team.
Use the following procedures to learn how to create a private work team using the SageMaker console
and API.
To create a private work team while creating a Ground Truth labeling job (console)
The private team that you create is used for this labeling job, and is listed in the Labeling workforces
section of the SageMaker console.
You can create a private work team using the SageMaker API operation CreateWorkteam.
When you use this operation, list all user groups that you want included in the work team in the
OidcMemberDefinition parameter Groups.
Important
The group names you specify for Groups must match the group names specified in your OIDC
IdP.
For example, if your user group names are group1, group2, and group3 in your OIDC IdP, configure
OidcMemberDefinition as follows:
"OidcMemberDefinition": {
"Groups": ["group1", "group2", "group3"]
}
Additionally, you must give the work team a name using the WorkteamName parameter.
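The following is a minimal sketch of this request using the AWS SDK for Python (Boto3); the workforce
and work team names are placeholder assumptions:

import boto3

sagemaker = boto3.client("sagemaker")

response = sagemaker.create_workteam(
    WorkteamName="example-oidc-workteam",            # hypothetical name
    WorkforceName="example-oidc-workforce",          # hypothetical OIDC IdP workforce
    Description="Work team backed by OIDC IdP groups",
    MemberDefinitions=[
        # Group names must match the group names in your OIDC IdP.
        {"OidcMemberDefinition": {"Groups": ["group1", "group2", "group3"]}}
    ],
)
print(response["WorkteamArn"])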
After you've created a work team, you can use the SageMaker API to manage that work team. Use the
UpdateWorkteam operation to update the IdP user groups included in that work team.
• Use the WorkteamName parameter to identify the work team that you want to update.
• When you use this operation, list all user groups that you want included in the work team in the
OidcMemberDefinition parameter Groups. If a user group is associated with a work team and you
do not include it in this list, that user group is no longer associated with this work team.
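A minimal sketch of such an update with Boto3, assuming the hypothetical work team created above;
note that group2 is dropped from the team because it is omitted from the list:

import boto3

sagemaker = boto3.client("sagemaker")

# The list you pass replaces the work team's previous group membership.
sagemaker.update_workteam(
    WorkteamName="example-oidc-workteam",
    MemberDefinitions=[
        {"OidcMemberDefinition": {"Groups": ["group1", "group3"]}}
    ],
)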
You can delete a work team using the SageMaker console and SageMaker API.
You can delete a private work team using the SageMaker API operation DeleteWorkteam.
When you create a workforce using your own OIDC IdP, you cannot use Ground Truth or Amazon A2I to
manage individual workers.
• To add a worker to a work team, add that worker to a group associated with that work team.
• To remove a worker from a work team, remove that worker from all user groups associated with that
work team.
Manage Private Workforce Using the Amazon SageMaker API
You can use the following SageMaker API operations to manage your private workforce:
• UpdateWorkforce – You may want to update a workforce created using your own OIDC IdP to specify
a different authorization endpoint, token endpoint, or issuer. You can update any parameter found in
OidcConfig using this operation.
You can only update your OIDC IdP configuration when there are no work teams associated with your
workforce. To learn how to delete work teams, see Delete a work team (p. 884).
• DeleteWorkforce – Use this operation to delete your private workforce. If you have any work teams
associated with your workforce, you must delete those work teams before you delete your work force.
For more information, see Delete a work team (p. 884).
• DescribeWorkforce – Use this operation to list private workforce information, including workforce
name, Amazon Resource Name (ARN), and, if applicable, allowed IP address ranges (CIDRs).
If you created your workforce using your own OIDC IdP, you can find your workforce name in the Ground
Truth area of the SageMaker console.
After you have restricted your workforce to one or more CIDRs, the output of UpdateWorkforce lists all
allowable CIDRs. You can also use the DescribeWorkforce operation to view all allowable CIDRs for a
workforce.
You may want to delete your private workforce when:
• You want to create a workforce using a new Amazon Cognito user pool.
• You have already created a private workforce using Amazon Cognito and you want to create a
workforce using your own OpenID Connect (OIDC) Identity Provider (IdP).
To delete a private workforce, use the DeleteWorkforce API operation. If you have any work teams
associated with your workforce, you must delete those work teams before you delete your workforce.
You can delete a private work team using the DeleteWorkteam operation.
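A minimal Boto3 sketch of deleting the work teams and then the workforce, assuming the hypothetical
names used earlier in this section:

import boto3

sagemaker = boto3.client("sagemaker")

# Work teams must be deleted before the workforce that contains them.
sagemaker.delete_workteam(WorkteamName="example-oidc-workteam")
sagemaker.delete_workforce(WorkforceName="example-oidc-workforce")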
Track Worker Performance
Enable Tracking
During the set-up process for a new work team, the permissions for Amazon CloudWatch logging of
worker events are created. Since this feature was added in August 2019, work teams created prior to that
may not have the correct permissions. If all of your work teams were created before August 2019, create
a new work team. It does not need any members and may be deleted after creation, but by creating it,
you establish the permissions and apply them to all of your work teams, regardless of when they were
created.
Examine Logs
After tracking is enabled, the activity of your workers is logged. Open the Amazon CloudWatch
console and choose Logs in the navigation pane. You should see a log group named /aws/sagemaker/
groundtruth/WorkerActivity.
Each completed task is represented by a log entry, which contains information about the worker, their
team, the job, when the task was accepted, and when it was submitted.
{
    "worker_id": "cd449a289e129409",
    "cognito_user_pool_id": "us-east-2_IpicJXXXX",
    "cognito_sub_id": "d6947aeb-0650-447a-ab5d-894db61017fd",
    ...
}
A useful data point in each event is the cognito_sub_id. You can match that to an individual worker.
To get information about all of the team's members, use the ListUsers action (examples) in the Amazon
Cognito API.
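A minimal sketch of that lookup using Boto3, reusing the pool ID and sub from the example log entry
above:

import boto3

cognito = boto3.client("cognito-idp", region_name="us-east-2")

# Find the Amazon Cognito user whose sub matches a log entry's cognito_sub_id.
response = cognito.list_users(
    UserPoolId="us-east-2_IpicJXXXX",
    Filter='sub = "d6947aeb-0650-447a-ab5d-894db61017fd"',
)
for user in response["Users"]:
    print(user["Username"])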
To view metrics
Create and manage Amazon SNS topics for your work teams
Use the procedures in this topic to create and manage the Amazon SNS topics that notify your work
teams when new work items are available.
If you create a work team using the console, the console provides an option to create a new topic for the
team so that you don't have to perform these steps.
Important
The Amazon SNS feature is not supported by Amazon A2I. If you subscribe your work team to
an Amazon SNS topic, workers will only receive notifications about Ground Truth labeling jobs.
Workers will not receive notifications about new Amazon A2I human review tasks.
Add the following statement to your Amazon SNS topic's access policy so that SageMaker can publish
to the topic:

, {
      "Sid": "AwsSagemaker_SnsAccessPolicy",
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sns:Publish",
      "Resource": "arn:partition:sns:region:111122223333:MyTopic", # ARN of the topic you copied in the previous step
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": "arn:partition:sagemaker:region:111122223333:workteam/*" # Work team ARN
        },
        "StringEquals": {
          "aws:SourceAccount": "111122223333" # SNS topic account
        }
      }
  }
After you create the topic, it appears in your Topics summary screen. For more information about
creating topics, see Creating a Topic in the Amazon SNS Developer Guide.
If you subscribe a work team to a topic after you've already created the work team, the individual work
team members who were added to the team when the work team was created are not automatically
subscribed to the topic. For information about subscribing workers' email addresses to the topic, see
Subscribing an Endpoint to an Amazon SNS Topic in the Amazon SNS Developer Guide.
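A minimal Boto3 sketch of subscribing a single worker's email address to a topic; the topic ARN and
email address are placeholder assumptions:

import boto3

sns = boto3.client("sns")

# The worker receives a confirmation email and must confirm the subscription.
sns.subscribe(
    TopicArn="arn:aws:sns:us-west-2:111122223333:MyTopic",
    Protocol="email",
    Endpoint="worker@example.com",
)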
The only situation in which workers are automatically subscribed to your topic is when you create or
import an Amazon Cognito user group at the time that you create a work team and you set up the topic
subscription when you create that work team. For more information about creating and managing your
work teams with Amazon Cognito, see Create Work Teams (Amazon Cognito Console) (p. 874).
Crowd HTML Elements Reference
As a starting point, you can use a template built using Crowd HTML Elements from one of the sample
task UI repositories on GitHub. These repositories include templates designed for audio, image, text,
video, and other types of data labeling and annotation tasks.
For more information about how to implement custom templates in Amazon SageMaker Ground Truth,
see Creating Custom Labeling Workflows (p. 671). To learn more about custom templates in Amazon
Augmented AI, see Create Custom Worker Task Templates (p. 2995).
Topics
• crowd-alert (p. 890)
• crowd-badge (p. 891)
• crowd-button (p. 893)
• crowd-bounding-box (p. 894)
• crowd-card (p. 898)
• crowd-checkbox (p. 900)
• crowd-classifier (p. 903)
• crowd-classifier-multi-select (p. 904)
• crowd-entity-annotation (p. 906)
• crowd-fab (p. 910)
• crowd-form (p. 911)
• crowd-icon-button (p. 912)
• crowd-image-classifier (p. 913)
• crowd-image-classifier-multi-select (p. 917)
• crowd-input (p. 919)
• crowd-instance-segmentation (p. 921)
• crowd-instructions (p. 925)
• crowd-keypoint (p. 927)
• crowd-line (p. 931)
• crowd-modal (p. 934)
• crowd-polygon (p. 935)
• crowd-polyline (p. 940)
crowd-alert
A message that alerts the worker to a current situation.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-alert> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<div id="errorBox"></div>
<crowd-keypoint
src="{{ task.input.taskObject | grant_read_access }}"
labels="['Item A', 'Item B', 'Item C']"
header="Please locate the centers of each item."
name="annotatedResult">
<short-instructions>
Describe your task briefly here and give examples
</short-instructions>
<full-instructions>
Give additional instructions and good/bad examples here
</full-instructions>
</crowd-keypoint>
</crowd-form>
<script>
var num_obj = 1;
document.querySelector('crowd-form').onsubmit = function(e) {
const keypoints = document.querySelector('crowd-keypoint').value.keypoints ||
document.querySelector('crowd-keypoint')._submittableValue.keypoints;
const labels = keypoints.map(function(p) {
return p.label;
});
// Count how many times each label was used.
const labelCounts = {};
labels.forEach(function(label) {
labelCounts[label] = (labelCounts[label] || 0) + 1;
});
// Each label should be used exactly once per annotated object.
const goalNumSingleLabel = num_obj;
Object.entries(labelCounts).forEach(entry => {
if (entry[1] != goalNumSingleLabel) {
e.preventDefault();
errorBox.innerHTML = '<crowd-alert type="error" dismissible>You must use each label
only once.</crowd-alert>';
errorBox.scrollIntoView();
}
})
};
</script>
Attributes
The following attributes are supported by this element.
dismissible
A Boolean switch that, if present, allows the message to be closed by the worker.
type
A string that specifies the type of message to be displayed. The possible values are "info" (the default),
"success", "error", and "warning".
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-badge
An icon that floats over the top right corner of another element to which it is attached.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a template that uses the <crowd-badge> element. Copy the following
code and save it in a file with the extension .html. Open the file in any browser to preview and interact
with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
name="crowd-image-classifier"
src="https://unsplash.com/photos/NLUkAA-nDdE"
header="Choose the correct category for this image."
categories="['Person', 'Umbrella', 'Chair', 'Dolphin']"
>
<full-instructions header="Classification Instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</full-instructions>
<short-instructions id="short-instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<crowd-badge icon="star" for="short-instructions"/>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
Attributes
The following attributes are supported by this element.
for
A string that specifies the ID of the element to which the badge is attached.
icon
A string that specifies the icon to be displayed in the badge. The string must be either the name of an
icon from the open-source iron-icons set, which is pre-loaded, or the URL to a custom icon.
The following is an example of the syntax that you can use to add an iron-icon to a <crowd-badge>
HTML element. Replace icon-name with the name of the icon you'd like to use from this Icons set.
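<crowd-badge icon="icon-name" for="short-instructions"/>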
label
The text to display in the badge. Three characters or less is recommended because text that is too large
will overflow the badge area. An icon can be displayed instead of text by setting the icon attribute.
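For instance, a short text badge (the label value here is arbitrary) can be attached the same way:

<crowd-badge label="new" for="short-instructions"/>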
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-button
A styled button that represents some action.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a template that uses the <crowd-button> element. Copy the following
code and save it in a file with the extension .html. Open the file in any browser to preview and interact
with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
name="crowd-image-classifier"
src="https://unsplash.com/photos/NLUkAA-nDdE"
header="Please select the correct category for this image"
categories="['Person', 'Umbrella', 'Chair', 'Dolphin']"
>
<full-instructions header="Classification Instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</full-instructions>
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<crowd-button>
<iron-icon icon="question-answer"/>
</crowd-button>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
Attributes
The following attributes are supported by this element.
disabled
A Boolean switch that, if present, displays the button as disabled and prevents clicks.
form-action
A switch that either submits its parent crowd-form (p. 911) element, if set to "submit", or resets its
parent <crowd-form> element, if set to "reset".
href
The URL to an online resource. Use this property if you need a link styled as a button.
icon
A string that specifies the icon to be displayed next to the button's text. The string must be the name of
an icon from the open-source iron-icons set, which is pre-loaded. For example, to insert the search iron-
icon, use the following:
<crowd-button>
<iron-icon icon="search"/>
</crowd-button>
The icon is positioned to either the left or the right of the text, as specified by the icon-align attribute.
icon-align
The left or right position of the icon relative to the button's text. The default is "left".
icon-url
A URL to a custom image for the icon. A custom image can be used in place of a standard icon that is
specified by the icon attribute.
loading
A Boolean switch that, if present, displays the button as being in a loading state. This attribute has
precedence over the disabled attribute if both attributes are present.
target
When you use the href attribute to make the button act as a hyperlink to a specific URL, the target
attribute optionally targets a frame or window where the linked URL should load.
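For instance, a link-style usage might look like the following sketch; the URL is a placeholder:

<crowd-button href="https://example.com/instructions" target="_blank" icon="help">View help</crowd-button>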
variant
The general style of the button. Use "primary" for primary buttons, "normal" for secondary buttons,
"link" for tertiary buttons, or "icon" to display only the icon without text.
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-bounding-box
A widget for drawing rectangles on an image and assigning a label to the portion of the image that is
enclosed in each rectangle.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-bounding-box> element. Copy
the following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template. For more examples, see this GitHub repository.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-bounding-box
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Draw bounding boxes around all the cats and dogs in this image"
labels="['Cat', 'Dog']"
>
<full-instructions header="Bounding Box Instructions" >
      <p>Use the bounding box tool to draw boxes around the requested target of interest:</p>
<ol>
<li>Draw a rectangle using your mouse over each instance of the target.</li>
        <li>Make sure the box does not cut into the target; leave a 2 - 3 pixel margin.</li>
<li>
When targets are overlapping, draw a box around each object,
include all contiguous parts of the target in the box.
Do not include parts that are completely overlapped by another object.
</li>
<li>
Do not include parts of the target that cannot be seen,
even though you think you can interpolate the whole shape of the target.
</li>
<li>Avoid shadows, they're not considered as a part of the target.</li>
<li>If the target goes off the screen, label up to the edge of the image.</li>
</ol>
</full-instructions>
<short-instructions>
Draw boxes around the requested target of interest.
</short-instructions>
</crowd-bounding-box>
</crowd-form>
Attributes
The following attributes are supported by this element.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
initial-value
An array of JSON objects, each of which sets a bounding box when the component is loaded. Each
JSON object in the array contains the following properties. Bounding boxes set via the initial-
value property can be adjusted and whether or not a worker answer was adjusted is tracked via an
initialValueModified boolean in the worker answer output.
• height – The height of the box in pixels.
• label – The label assigned to the box. This text must match one of the labels defined in the labels attribute.
• left – Distance of the top-left corner of the box from the left of the image, measured in pixels.
• top – Distance of the top-left corner of the box from the top of the image, measured in pixels.
• width – The width of the box in pixels.
You can extract the bounding box initial value from a manifest file of a previous job in a custom
template using the Liquid templating language:
initial-value="[
{% for box in task.input.manifestLine.label-attribute-name-from-prior-job.annotations
%}
{% capture class_id %}{{ box.class_id }}{% endcapture %}
    {% assign label = task.input.manifestLine.label-attribute-name-from-prior-job-metadata.class-map[class_id] %}
{
label: {{label | to_json}},
left: {{box.left}},
top: {{box.top}},
width: {{box.width}},
height: {{box.height}},
},
{% endfor %}
]"
labels
A JSON formatted array of strings, each of which is a label that a worker can assign to the image portion
enclosed by a rectangle. Limit: 10 labels.
name
The name of this widget. It's used as a key for the widget's input in the form output.
src
The URL of the image on which to draw bounding boxes.
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are required by this element.
full-instructions
short-instructions
Output
The following output is supported by this element.
boundingBoxes
An array of JSON objects, each of which specifies a bounding box that has been created by the worker.
Each JSON object in the array contains the following properties.
• height – The height of the box in pixels.
• label – The label assigned to the box.
• left – Distance of the top-left corner of the box from the left of the image, measured in pixels.
• top – Distance of the top-left corner of the box from the top of the image, measured in pixels.
• width – The width of the box in pixels.
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.
• height – The height, in pixels, of the image.
• width – The width, in pixels, of the image.
The following are samples of outputs from common use scenarios for this element.
[
{
"annotatedResult": {
"boundingBoxes": [
{
"height": 401,
"label": "Dog",
"left": 243,
"top": 117,
"width": 187
}
],
"inputImageProperties": {
"height": 533,
"width": 800
}
}
}
]
[
{
"annotatedResult": {
"boundingBoxes": [
{
"height": 401,
"label": "Dog",
"left": 243,
"top": 117,
"width": 187
},
{
"height": 283,
"label": "Dog",
"left": 684,
"top": 120,
"width": 116
}
],
"inputImageProperties": {
"height": 533,
"width": 800
}
}
}
]
[
{
"annotatedResult": {
"boundingBoxes": [
{
"height": 395,
"label": "Dog",
"left": 241,
"top": 125,
"width": 158
},
{
"height": 298,
"label": "Cat",
"left": 699,
"top": 116,
"width": 101
}
],
"inputImageProperties": {
"height": 533,
"width": 800
}
}
}
]
You could have many labels available, but only the ones that are used appear in the output.
See Also
For more information, see the following.
crowd-card
A box with an elevated appearance for displaying information.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a template designed for sentiment analysis tasks that uses the <crowd-
card> element. Copy the following code and save it in a file with the extension .html. Open the file in
any browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<style>
h3 {
margin-top: 0;
}
crowd-card {
width: 100%;
}
.card {
margin: 10px;
}
.left {
width: 70%;
margin-right: 10px;
display: inline-block;
height: 200px;
}
.right {
width: 20%;
height: 200px;
display: inline-block;
}
</style>
<crowd-form>
<short-instructions>
Your short instructions here.
</short-instructions>
<full-instructions>
Your full instructions here.
</full-instructions>
<div class="left">
<h3>What sentiment does this text convey?</h3>
<crowd-card>
<div class="card">
Nothing is great.
</div>
</crowd-card>
</div>
  <div class="right">
    <h3>Select an option</h3>
    <!-- Illustrative response control; any form input works here. -->
    <select name="sentiment1" style="font-size: large" required>
      <option value="">(Please select)</option>
      <option>Positive</option>
      <option>Negative</option>
      <option>Neutral</option>
    </select>
  </div>

  <div class="left">
    <h3>What sentiment does this text convey?</h3>
    <crowd-card>
      <div class="card">
        Everything is great!
      </div>
    </crowd-card>
  </div>
  <div class="right">
    <h3>Select an option</h3>
    <select name="sentiment2" style="font-size: large" required>
      <option value="">(Please select)</option>
      <option>Positive</option>
      <option>Negative</option>
      <option>Neutral</option>
    </select>
  </div>
</crowd-form>
Attributes
The following attributes are supported by this element.
heading
The text to display at the top of the card.
image
A URL to an image to be displayed within the card.
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-checkbox
A UI component that can be checked or unchecked, allowing a user to select multiple options from a set.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-checkbox> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <!-- A minimal reconstruction: three check boxes that share one name
       produce a single object in the output. -->
  <p>Check all of the colors that appear in the image:</p>
  <crowd-checkbox name="colors" value="blue">Blue</crowd-checkbox>
  <crowd-checkbox name="colors" value="green">Green</crowd-checkbox>
  <crowd-checkbox name="colors" value="red">Red</crowd-checkbox>
</crowd-form>
Attributes
The following attributes are supported by this element.
checked
A Boolean switch that, if present, displays the check box as checked when it loads.
disabled
A Boolean switch that, if present, displays the check box as disabled and prevents it from being checked.
name
A string that is used to identify the answer submitted by the worker. This value will match a key in the
JSON object that specifies the answer.
required
A Boolean switch that, if present, requires the worker to select the check box before the form can be submitted.
value
A string used as the name for the check box state in the output. Defaults to "on" if not specified.
Element Hierarchy
This element has the following parent and child elements.
Output
Provides a JSON object. The name string is the object name and the value string is the property name
for a Boolean value based on the check box state: true if checked, false if not checked.
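For example, assuming the three color check boxes shown earlier, which share the name colors, a worker who checks Blue and Red would produce output like the following reconstructed sample.

[
  {
    "colors": {
      "blue": true,
      "green": false,
      "red": true
    }
  }
]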
Note that all three color values are properties of a single object.
See Also
For more information, see the following.
crowd-classifier
A widget for classifying non-image content, such as audio, video, or text.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of an HTML worker task template built using crowd-classifier. This
example uses the Liquid template language to automate the label categories (through the
task.input.labels variable) and the object to be classified (through the task.input.taskObject
variable). Copy the following code and save it in a file with the extension .html. Open the file in any
browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="category"
categories="{{ task.input.labels | to_json | escape }}"
header="What type of a document is this?"
>
<classification-target>
<iframe style="width: 100%; height: 600px;" src="{{ task.input.taskObject |
grant_read_access }}" type="application/pdf"></iframe>
</classification-target>
<short-instructions>
Please choose the correct category for the document
</short-instructions>
</crowd-classifier>
</crowd-form>
Attributes
The following attributes are supported by this element.
categories
A JSON formatted array of strings, each of which is a category that a worker can assign to the text. You
should include "other" as a category; otherwise, the worker may not be able to provide an answer.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
name
The name of this widget. It is used as a key for the widget's input in the form output.
Element Hierarchy
This element has the following parent and child elements.
• Child elements: classification-target (p. 904), full-instructions (p. 904), short-instructions (p. 904)
Regions
The following regions are supported by this element.
classification-target
The content to be classified by the worker. This can be plain text or HTML. Examples of how the HTML
can be used include but are not limited to embedding a video or audio player, embedding a PDF, or
performing a comparison of two or more images.
full-instructions
General instructions about how to do text classification.
short-instructions
Important task-specific instructions that are displayed in a prominent place.
Output
The output of this element is an object using the specified name value as a property name, and a string
from the categories as the property's value.
[
{
"<name>": {
"label": "<value>"
}
}
]
See Also
For more information, see the following.
crowd-classifier-multi-select
A widget for classifying various forms of content—such as audio, video, or text—into one or more
categories. The content to classify is referred to as an object.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of an HTML worker task template built using this element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier-multi-select
name="category"
categories="['Positive', 'Negative', 'Neutral']"
header="Select the relevant categories"
exclusion-category="{ text: 'None of the above' }"
>
<classification-target>
{{ task.input.taskObject }}
</classification-target>
<short-instructions>
Choose all categories that are expressed by the text.
</short-instructions>
</crowd-classifier-multi-select>
</crowd-form>
Attributes
The following attributes are supported by the crowd-classifier-multi-select element. Each
attribute accepts a string value or string values.
categories
Required. A JSON-formatted array of strings, each of which is a category that a worker can assign to the
object.
header
Required. The text to display above the image. This is typically a question or simple instruction for
workers.
name
Required. The name of this widget. In the form output, the name is used as a key for the widget's input.
exclusion-category
Optional. A JSON-formatted string with the following format: "{ text: 'default-value' }". This
attribute sets a default value that workers can choose if none of the labels applies to the object shown in
the worker UI.
Element Hierarchy
This element has the following parent and child elements:
Regions
This element uses the following regions.
classification-target
The content to be classified by the worker. Content can be plain text or an object that you specify in
the template using HTML. For example, you can use HTML elements to include a video or audio player,
embed a PDF file, or include a comparison of two or more images.
full-instructions
short-instructions
Output
The output of this element is an object that uses the specified name value as a property name, and a
list of one or more strings from categories as the property's value.
[
 {
   "<name>": {
      "labels": ["label_a", "label_b"]
   }
 }
]
See Also
For more information, see the following:
crowd-entity-annotation
A widget for labeling words, phrases, or character strings within a longer text. Workers select a label, and
highlight the text that the label applies to.
Important: Self-contained Widget
Do not use the <crowd-entity-annotation> element with the <crowd-form> element. It
contains its own form submission logic and Submit button.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a template that uses the <crowd-entity-annotation> element. Copy
the following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-entity-annotation
name="crowd-entity-annotation"
header="Highlight parts of the text below"
labels="[{'label': 'person', 'shortDisplayName': 'per', 'fullDisplayName': 'Person'},
{'label': 'date', 'shortDisplayName': 'dat', 'fullDisplayName': 'Date'}, {'label':
'company', 'shortDisplayName': 'com', 'fullDisplayName': 'Company'}]"
text="Amazon SageMaker Ground Truth helps you build highly accurate training datasets for
machine learning quickly."
>
<full-instructions header="Named entity recognition instructions">
<ol>
<li><strong>Read</strong> the text carefully.</li>
<li><strong>Highlight</strong> words, phrases, or sections of the text.</li>
      <li><strong>Choose</strong> the label that best matches what you have highlighted.</li>
<li>To <strong>change</strong> a label, choose highlighted text and select a new
label.</li>
<li>To <strong>remove</strong> a label from highlighted text, choose the X next to
the abbreviated label name on the highlighted text.</li>
      <li>You can select all of a previously highlighted text, but not a portion of it.</li>
</ol>
</full-instructions>
<short-instructions>
Apply labels to words or phrases.
</short-instructions>
<script>
  // "additionalQuestions" is a placeholder: create whatever extra form
  // field(s) you want to append to the widget's internal form.
  const additionalQuestions = document.createElement('crowd-input');
  additionalQuestions.setAttribute('name', 'additional-comments');
  additionalQuestions.setAttribute('label', 'Add any comments here');

  document.addEventListener('all-crowd-elements-ready', () => {
    document
      .querySelector('crowd-entity-annotation')
      .shadowRoot
      .querySelector('crowd-form')
      .form
      .appendChild(additionalQuestions);
  });
</script>
Attributes
The following attributes are supported by this element.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
initial-value
A JSON formatted array of objects, each of which defines an annotation to apply to the text at
initialization. Objects contain a label value that matches one in the labels attribute, an integer
startOffset value for labeled span's starting unicode offset, and an integer endOffset value for the
ending unicode offset.
Example
[
{
label: 'person',
startOffset: 0,
endOffset: 16
},
...
]
labels
A JSON formatted array of objects, each of which defines a label that workers can apply to the text.
Each object contains a label value, a shortDisplayName that is displayed on the highlighted text, and a
fullDisplayName that is displayed in the label selector.
Example
[
{
label: 'person',
shortDisplayName: 'per',
fullDisplayName: 'person'
}
]
name
Serves as the widget's name in the DOM. It is also used as the label attribute name in form output and
the output manifest.
text
The text to be annotated. The templating system escapes quotes and HTML strings by default. If your
code is already escaped or partially escaped, see Variable filters (p. 676) for more ways to control
escaping.
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are supported by this element.
full-instructions
short-instructions
Output
The following output is supported by this element.
entities
An array of JSON objects, each of which specifies an annotation made by the worker. Each object
contains the following properties.
• endOffset – The ending unicode offset of the annotated span.
• label – The label assigned by the worker.
• startOffset – The starting unicode offset of the annotated span.
{
"myAnnotatedResult": {
"entities": [
{
"endOffset": 54,
"label": "person",
"startOffset": 47
},
{
"endOffset": 97,
"label": "event",
"startOffset": 93
},
{
"endOffset": 219,
"label": "date",
"startOffset": 212
},
{
"endOffset": 271,
"label": "location",
"startOffset": 260
}
]
}
}
See Also
For more information, see the following.
crowd-fab
A floating button with an image in its center.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template designed for image classification that uses the
<crowd-fab> element. This template uses JavaScript to enable workers to report issues with the worker
UI. Copy the following code and save it in a file with the extension .html. Open the file in any browser
to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>If there is an issue with the image or tools, please select
<b>None of the Above</b>, describe the issue in the text box and click the
button below.</p>
<crowd-input label="Report an Issue" name="template-issues"></crowd-input>
<crowd-fab id="button1" icon="report-problem" title="Issue"/>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
<script>
[
button1,
].forEach(function(button) {
button.addEventListener('click', function() {
document.querySelector('crowd-form').submit();
});
});
</script>
Attributes
The following attributes are supported by this element.
disabled
A Boolean switch that, if present, displays the floating button as disabled and prevents clicks.
icon
A string that specifies the icon to be displayed in the center of the button. The string must be either the
name of an icon from the open-source iron-icons set, which is pre-loaded, or the URL to a custom icon.
The following is an example of the syntax that you can use to add an iron-icon to a <crowd-fab> HTML
element. Replace icon-name with the name of the icon you'd like to use from this Icons set.
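<crowd-fab id="button1" icon="icon-name" title="Issue"/>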
label
A string consisting of a single character that can be used instead of an icon. Emojis or multiple characters
may result in the button displaying an ellipsis instead.
title
A string that will display as a tool tip when the mouse hovers over the button.
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-form
The form wrapper for all custom tasks. Sets and implements important actions for the proper
submission of your form data.
If a crowd-button (p. 893) of type "submit" is not included inside the <crowd-form> element, it will
automatically be appended within the <crowd-form> element.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of an image classification template that uses the <crowd-form> element.
Copy the following code and save it in a file with the extension .html. Open the file in any browser to
preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
Element Hierarchy
This element has the following parent and child elements.
Element Events
The crowd-form element extends the standard HTML form element and inherits its events, such as
onclick and onsubmit.
See Also
For more information, see the following.
crowd-icon-button
A button with an image placed in the center. When the user touches the button, a ripple effect emanates
from the center of the button.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template designed for image classification that uses the
<crowd-icon-button> element. This template uses JavaScript to enable workers to report issues with
the worker UI. Copy the following code and save it in a file with the extension .html. Open the file in
any browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>If there is an issue with the image or tools, please select
<b>None of the Above</b>, describe the issue in the text box and click the
button below.</p>
      <crowd-input label="Report an Issue" name="template-issues"></crowd-input>
<crowd-icon-button id="button1" icon="report-problem" title="Issue"/>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
<script>
[
button1,
].forEach(function(button) {
button.addEventListener('click', function() {
document.querySelector('crowd-form').submit();
});
});
</script>
Attributes
The following attributes are supported by this element.
disabled
A Boolean switch that, if present, displays the button as disabled and prevents clicks.
icon
A string that specifies the icon to be displayed in the center of the button. The string must be either the
name of an icon from the open-source iron-icons set, which is pre-loaded, or the URL to a custom icon.
The following is an example of the syntax that you can use to add an iron-icon to a <crowd-icon-button>
HTML element. Replace icon-name with the name of the icon you'd like to use from this Icons set.
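<crowd-icon-button id="button1" icon="icon-name" title="Issue"/>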
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-image-classifier
A widget for classifying an image. Use one of the following supported image formats: APNG, BMP, GIF,
ICO, JPEG, PNG, SVG. Images do not have a size limit.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of an image classification template that uses the <crowd-image-
classifier> element. Copy the following code and save it in a file with the extension .html. Open the
file in any browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
Attributes
The following attributes are required by this element.
categories
A JSON formatted array of strings, each of which is a category that a worker can assign to the image. You
should include "other" as a category, so that the worker can provide an answer. You can specify up to 10
categories.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
name
The name of this widget. It is used as a key for the widget's input in the form output.
overlay
Information to be overlaid on the source image. This is for verification workflows of bounding-box,
semantic-segmentation, and instance-segmentation tasks.
It is a JSON object containing an object with the name of the task-type in camelCase as the key. That
key's value is an object that contains the labels and other necessary information from the previous task.
<crowd-image-classifier
name="boundingBoxClassification"
header="Rate the quality of the annotations based on the background section
in the instructions on the left hand side."
src="https://i.imgur.com/CIPKVJo.jpg"
categories="['good', 'bad', 'okay']"
overlay='{
"boundingBox": {
labels: ["bird", "cat"],
value: [
{
height: 284,
label: "bird",
left: 230,
top: 974,
width: 223
},
{
height: 69,
label: "bird",
left: 79,
top: 889,
width: 247
}
]
    }
}'
> ... </crowd-image-classifier>
A semantic segmentation verification task would use the overlay value as follows:
<crowd-image-classifier
name='crowd-image-classifier'
categories='["good", "bad"]'
src='URL of image to be classified'
header='Please classify'
overlay='{
"semanticSegmentation": {
"labels": ["Cat", "Dog", "Bird", "Cow"],
"labelMappings": {
"Bird": {
"color": "#ff7f0e"
},
"Cat": {
"color": "#2ca02c"
},
"Cow": {
"color": "#d62728"
},
"Dog": {
"color": "#2acf59"
}
      },
      "src": "URL of overlay image"
    }
}'
> ... </crowd-image-classifier>
src
The URL of the image to be classified.
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are used by this element.
full-instructions
short-instructions
worker-comment
Use this in verification workflows when you need workers to explain why they made the choice they
did. Use the text between the opening and closing tags to provide instructions for workers on what
information should be included in the comment.
header
A phrase with a call to action for leaving a comment. Used as the title text for a modal window where the
comment is added.
link-text
This text appears below the categories in the widget. When clicked, it opens a modal window where the
worker may add a comment.
placeholder
Example text in the comment text area that is overwritten when the worker begins to type. The
placeholder does not appear in the output if the worker leaves the field blank.
Output
The output of this element is a string that specifies one of the values defined in the categories attribute
of the <crowd-image-classifier> element.
[
  {
    "<name>": {
      "label": "<value>",
      "workerComment": "Comment - if no comment is provided, this field will not be present"
}
}
]
See Also
For more information, see the following.
crowd-image-classifier-multi-select
A widget for classifying an image into one or more categories. Use one of the following supported image
formats: APNG, BMP, GIF, ICO, JPEG, PNG, SVG. Images do not have a size limit.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of an HTML worker task template built using this crowd element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier-multi-select
name="animals"
categories="['Cat', 'Dog', 'Horse', 'Pig', 'Bird']"
    src="https://images.unsplash.com/photo-1509205477838-a534e43a849f?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1998&q=80"
header="Please identify the animals in this image"
exclusion-category="{ text: 'None of the above' }"
>
<full-instructions header="Classification Instructions">
<p>If more than one label applies to the image, select multiple labels.</p>
<p>If no labels apply, select <b>None of the above</b></p>
</full-instructions>
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label(s) that best suit the image.</p>
</short-instructions>
</crowd-image-classifier-multi-select>
</crowd-form>
Attributes
The following attributes are supported by the crowd-image-classifier-multi-select element.
Each attribute accepts a string value or string values.
categories
Required. A JSON-formatted array of strings, each of which is a category that a worker can assign to the
image. A worker must choose at least one category and can choose all categories.
header
Required. The text to display above the image. This is typically a question or simple instruction for
workers.
name
Required. The name of this widget. In the form output, the name is used as a key for the widget's input.
src
Required. The URL of the image to be classified.
exclusion-category
Optional. A JSON-formatted string with the following format: "{ text: 'default-value' }". This
attribute sets a default value that workers can choose if none of the labels applies to the image shown in
the worker UI.
Element Hierarchy
This element has the following parent and child elements:
Regions
This element uses the following regions.
full-instructions
short-instructions
Output
The output of this element is an object that contains, as a list, one or more of the values defined in the
categories attribute of the <crowd-image-classifier-multi-select> element.
[
{
"<name>": {
      "labels": ["label_a", "label_b"]
}
}
]
See Also
For more information, see the following:
crowd-input
A box that accepts input data.
Cannot be self-closing
Unlike the input element in the HTML standard, this element cannot be self-closed by putting
a slash before the ending bracket, e.g. <crowd-input ... />. It must be followed with a
</crowd-input> to close the element.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-input> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<img style="max-width: 35vw; max-height: 50vh" src="{{ task.input.taskObject |
grant_read_access }}">
<crowd-input name="tag1" label="Word/phrase 1" required></crowd-input>
<crowd-input name="tag2" label="Word/phrase 2" required></crowd-input>
<crowd-input name="tag3" label="Word/phrase 3" required></crowd-input>
<short-instructions>
Your custom quick instructions and examples
</short-instructions>
  <full-instructions>
    Your custom detailed instructions and examples
  </full-instructions>
</crowd-form>
Attributes
The following attributes are supported by this element.
allowed-pattern
A regular expression that is used with the auto-validate attribute to ignore non-matching characters as
the worker types.
auto-focus
When the value is set to true, the browser places focus inside the input area after loading. This way, the
worker can start typing without having to select it first.
auto-validate
A Boolean switch that, if present, turns on input validation. The behavior of the validator can be
modified by the error-message and allowed-pattern attributes.
disabled
A Boolean switch that, if present, displays the input area as disabled.
error-message
The text to be displayed below the input field, on the left side, if validation fails.
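For example, the following sketch (the name, label, and message text are illustrative) combines allowed-pattern, auto-validate, and error-message to accept digits only:

<crowd-input
  name="zipCode"
  label="ZIP code"
  allowed-pattern="[0-9]"
  auto-validate
  error-message="Please enter digits only"
></crowd-input>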
label
This text shrinks and rises up above a text field when the worker starts typing in the field or when the
value attribute is set.
max-length
A maximum number of characters the input will accept. Input beyond this limit is ignored.
min-length
The minimum number of characters required for the input.
name
Sets the name of the input to be used in the DOM and the output of the form.
placeholder
A string value that is used as placeholder text, displayed until the worker starts entering data into the
input. It is not used as a default value.
required
A Boolean switch that, if present, requires the worker to provide input in this field in order to submit
the form.
type
Takes a string to set the HTML5 input-type behavior for the input. Examples include file and date.
value
A preset that becomes the default if the worker does not provide input. The preset appears in a text field.
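As an illustration, the following hypothetical field uses type and value together to preset a date input:

<crowd-input name="visitDate" label="Date of visit" type="date" value="2020-01-01"></crowd-input>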
Element Hierarchy
This element has the following parent and child elements.
Output
Provides a name string as the property name, and the text that was entered in the field as its value.
[
{
"tag1": "blue",
"tag2": "red"
}
]
Inputs that the worker leaves blank do not appear in the output; in the sample above, tag3 was left
blank. This means any code built to parse these results should be able to handle the presence or
absence of each input in the answers.
See Also
For more information, see the following.
crowd-instance-segmentation
A widget for identifying individual instances of specific objects within an image and creating a colored
overlay for each labeled instance.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-instance-segmentation>
element. Copy the following code and save it in a file with the extension .html. Open the file in any
browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-instance-segmentation
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Please label each of the requested objects in this image"
labels="['Cat', 'Dog', 'Bird']"
>
<full-instructions header="Segmentation Instructions">
<ol>
<li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to
understand more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits the image.</li>
</ol>
</full-instructions>
<short-instructions>
<p>Use the tools to label all instances of the requested items in the image</p>
</short-instructions>
</crowd-instance-segmentation>
</crowd-form>
Use a template similar to the following to allow workers to add their own categories (labels).
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-instance-segmentation
id="annotator"
name="myTexts"
src="{{ task.input.taskObject | grant_read_access }}"
header="Click Instructions to add new labels."
labels="['placeholder']"
>
<short-instructions>
<h3>Add a label to describe each type of object in this image.</h3>
<h3>Cover each instance of each object with a segmentation mask.</h3>
<br>
<h3>
Add new label
</h3>
<crowd-input name="_customLabel" id="customLabel"></crowd-input>
<crowd-button id="addLabel">Add</crowd-button>
<br><br><br>
<h3>
Manage labels
</h3>
<div id="labelsSection"></div>
</short-instructions>
<full-instructions>
Describe your task in more detail here.
</full-instructions>
</crowd-instance-segmentation>
</crowd-form>
<script>
document.addEventListener('all-crowd-elements-ready', function(event) {
document.querySelector('crowd-instance-segmentation').labels = [];
});
function populateLabelsSection() {
labelsSection.innerHTML = '';
annotator.labels.forEach(function(label) {
const labelContainer = document.createElement('div');
labelContainer.innerHTML = label + ' <a href="javascript:void(0)">(Delete)</a>';
labelContainer.querySelector('a').onclick = function() {
annotator.labels = annotator.labels.filter(function(l) {
return l !== label;
});
populateLabelsSection();
};
labelsSection.appendChild(labelContainer);
});
}
addLabel.onclick = function() {
annotator.labels = annotator.labels.concat([customLabel.value]);
customLabel.value = null;
populateLabelsSection();
};
</script>
Attributes
The following attributes are supported by this element.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
labels
A JSON formatted array of strings, each of which is a label that a worker can assign to an instance of an
object in the image. Workers can generate different overlay colors for each relevant instance by selecting
"add instance" under the label in the tool.
name
The name of this widget. It is used as a key for the labeling data in the form output.
src
The URL of the image that is to be labeled.
initial-value
A JSON object containing the color mappings of a prior instance segmentation job and a link to the
overlay image output by the prior job. Include this when you want a human worker to verify the results
of a prior labeling job and adjust it if necessary.
initial-value='{
  "instances": [
    {
      "color": "#2ca02c",
      "label": "Cat"
    },
    {
      "color": "#1f77b4",
      "label": "Cat"
    },
    {
      "color": "#d62728",
      "label": "Dog"
    }
  ],
  "src": "{{ "S3 file URL for image" | grant_read_access }}"
}'
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are supported by this element.
full-instructions
short-instructions
Output
The following output is supported by this element.
labeledImage
A JSON object that contains a Base64-encoded PNG of the worker's overlays in its pngImageData property.
instances
A JSON Array containing objects with the instance labels and colors.
• color – The hexadecimal value of the label's RGB color in the labeledImage PNG.
• label – The label given to overlay(s) using that color. This value may repeat, because the different
instances of the label are identified by their unique color.
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.
[
{
"annotatedResult": {
"inputImageProperties": {
"height": 533,
"width": 800
},
"instances": [
        {
          "color": "#1f77b4",
          "label": "<Label 1>"
        },
        {
          "color": "#2ca02c",
          "label": "<Label 1>"
        },
        {
          "color": "#ff7f0e",
          "label": "<Label 3>"
        }
      ],
"labeledImage": {
"pngImageData": "<Base-64 Encoded Data>"
}
}
}
]
See Also
For more information, see the following.
crowd-instructions
An element that displays instructions on three tabbed pages, Summary, Detailed Instructions, and
Examples, when the worker clicks on a link or button.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-instructions> element. Copy
the following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-instructions link-text="View instructions" link-type="button">
<short-summary>
      <p>Given an image, write three words or short phrases that summarize its contents.</p>
</short-summary>
<detailed-instructions>
<p>Imagine that you are describing an image to a friend or tagging it for a news
website. Provide three specific words or short phrases that describe it.</p>
</detailed-instructions>
<positive-example>
<p><img src="https://s3.amazonaws.com/cv-demo-images/highway.jpg"/></p>
<p>
<ul>
<li>Highway</li>
<li>Cars</li>
<li>Gas station</li>
</ul>
</p>
</positive-example>
<negative-example>
<p><img src="https://s3.amazonaws.com/cv-demo-images/highway.jpg"/></p>
<p>
These are not specific enough:
<ol>
<li>Trees</li>
<li>Outside</li>
<li>Daytime</li>
</ol>
</p>
</negative-example>
</crowd-instructions>
<p><strong>Instructions: </strong>Given an image, write three words or short phrases
that summarize its contents.</p>
<p>If someone were to see these three words or phrases, they should understand the
subject and context of the image, as well as any important actions.</p>
<p>View the instructions for detailed instructions and examples.</p>
<p><img style="max-width: 100%; max-height: 100%" src="{{ task.input.taskObject |
grant_read_access }}"></p>
<crowd-input name="tag1" label="Word/phrase 1" required></crowd-input>
<crowd-input name="tag2" label="Word/phrase 2" required></crowd-input>
<crowd-input name="tag3" label="Word/phrase 3" required></crowd-input>
</crowd-form>
Attributes
The following attributes are supported by this element.
link-text
The text to display for opening the instructions. The default is Click for instructions.
link-type
A string that specifies the type of trigger for the instructions. The possible values are "link" (default) and
"button".
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are supported by this element.
detailed-instructions
Content that provides specific instructions for a task. This appears on the page of the "Detailed
Instructions" tab.
negative-example
Content that provides examples of inadequate task completion. This appears on the page of the
"Examples" tab. More than one example may be provided within this element.
positive-example
Content that provides examples of proper task completion. This appears on the page of the "Examples"
tab.
short-summary
A brief statement that summarizes the task to be completed. This appears on the page of the "Summary"
tab. More than one example may be provided within this element.
See Also
For more information, see the following.
crowd-keypoint
Generates a tool to select and annotate key points on an image.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-keypoint> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<div id="errorBox"></div>
<crowd-keypoint
src="{{ task.input.taskObject | grant_read_access }}"
labels="['Item A', 'Item B', 'Item C']"
header="Please locate the centers of each item."
name="annotatedResult">
<short-instructions>
Describe your task briefly here and give examples
</short-instructions>
<full-instructions>
Give additional instructions and good/bad examples here
</full-instructions>
</crowd-keypoint>
</crowd-form>
<script>
var num_obj = 1;
document.querySelector('crowd-form').onsubmit = function(e) {
const keypoints = document.querySelector('crowd-keypoint').value.keypoints ||
document.querySelector('crowd-keypoint')._submittableValue.keypoints;
const labels = keypoints.map(function(p) {
return p.label;
  });

  // Count how many times each label was used.
  const labelCounts = {};
  labels.forEach(function(label) {
    if (!labelCounts[label]) {
      labelCounts[label] = 0;
    }
    labelCounts[label]++;
  });
  // With one object (num_obj) per image, each label should be used exactly once.
  const goalNumSingleLabel = num_obj;
Object.entries(labelCounts).forEach(entry => {
if (entry[1] != goalNumSingleLabel) {
e.preventDefault();
errorBox.innerHTML = '<crowd-alert type="error" dismissible>You must use each label
only once.</crowd-alert>';
errorBox.scrollIntoView();
}
})
};
</script>
Attributes
The following attributes are supported by this element.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
initial-value
An array, in JSON format, of keypoints to be applied to the image on start. For example:
initial-value="[
{
'label': 'Left Eye',
'x': 1022,
'y': 429
},
{
'label': 'Beak',
'x': 941,
'y': 403
}
]"
Note
Label values used in this attribute must have a matching value in the labels attribute, or the point will
not be rendered.
labels
An array, in JSON format, of strings to be displayed as the available labels.
name
A string used to identify the answer submitted by the worker. This value will match a key in the JSON
object that specifies the answer.
src
The source URI of the image to be annotated.
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are required by this element.
full-instructions
short-instructions
Output
The following output is supported by this element.
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.
keypoints
An array of JSON objects containing the coordinates and label of a keypoint. Each object contains the
following properties.
• label – The label assigned to the keypoint.
• x – The x coordinate of the keypoint, in pixels.
• y – The y coordinate of the keypoint, in pixels.
Note
X and Y coordinates are based on 0,0 being the top left corner of the image.
[
{
"crowdKeypoint": {
"inputImageProperties": {
"height": 1314,
"width": 962
},
"keypoints": [
{
"label": "dog",
"x": 155,
"y": 275
},
{
"label": "cat",
"x": 341,
"y": 447
},
{
"label": "cat",
"x": 491,
"y": 513
},
{
"label": "dog",
"x": 714,
"y": 578
},
{
"label": "cat",
"x": 712,
"y": 763
},
{
"label": "cat",
"x": 397,
"y": 814
}
]
}
}
]
You may have many labels available, but only the ones that are used appear in the output.
See Also
For more information, see the following.
crowd-line
A widget for drawing lines on an image. Each line is associated with a label, and output data will report
the starting and ending points of each line.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-line> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template. For more examples, see this GitHub repository.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-line
name="crowdLine"
src="{{ task.input.taskObject | grant_read_access }}"
header="Add header here to describe the task"
labels="['car','pedestrian','street car']"
>
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
    <p>Draw a line on each object that the label applies to.</p>
</short-instructions>
<full-instructions>
<p>Read the task carefully and inspect the image.</p>
    <p>Choose the appropriate label that best suits the image.</p>
    <p>Draw a line along each object that the label applies to.
Make sure that the line does not extend beyond the boundaries
of the object.
</p>
<p>Each line is defined by a starting and ending point. Carefully
place the starting and ending points on the boundaries of the object.</p>
</full-instructions>
</crowd-line>
</crowd-form>
Attributes
The following attributes are supported by this element.
header
Optional. The text to display above the image. This is typically a question or simple instruction for the
worker.
initial-value
Optional. An array of JSON objects, each of which sets a line when the component is loaded. Each JSON
object in the array contains the following properties:
• label – The text assigned to the line as part of the labeling task. This text must match one of the labels
defined in the labels attribute of the <crowd-line> element.
• vertices – The x and y pixel coordinates of the start point and end point of the line, relative to the top-
left corner of the image.
initial-value="{
lines: [
{
label: 'sideline', // label of this line annotation
vertices:[ // an array of vertices which decide the position of the line
{
x: 84,
y: 110
},
{
x: 60,
y: 100
}
]
},
{
label: 'yardline',
vertices:[
{
x: 651,
y: 498
},
{
x: 862,
y: 869
}
]
}
]
}"
Lines set via the initial-value property can be adjusted. Whether or not a worker answer was
adjusted is tracked via an initialValueModified boolean in the worker answer output.
labels
Required. A JSON formatted array of strings, each of which is a label that a worker can assign to the line.
Limit: 10 labels
label-colors
Optional. An array of strings. Each string is a hexadecimal (hex) code for a label.
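For example, assuming the colors are paired with the labels in the order that the labels attribute defines them, two labels might be given hex colors like the following (the values are illustrative):

label-colors="['#1f77b4', '#2ca02c']"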
name
Required. The name of this widget. It's used as a key for the widget's input in the form output.
src
Required. The URL of the image on which to draw lines.
Regions
The following regions are required by this element.
full-instructions
short-instructions
Element Hierarchy
This element has the following parent and child elements.
Output
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.
lines
A JSON Array containing objects with the line labels and vertices.
{
"crowdLine": { //This is the name you set for the crowd-line
"inputImageProperties": {
"height": 1254,
"width": 2048
},
"lines": [
{
"label": "yardline",
"vertices": [
{
"x": 58,
"y": 295
},
{
"x": 1342,
"y": 398
}
]
},
{
"label": "sideline",
"vertices": [
{
"x": 472,
"y": 910
},
{
"x": 1480,
"y": 600
}
]
}
]
}
}
See Also
For more information, see the following.
crowd-modal
A small window that pops up on the display when it is opened.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of the syntax that you can use with the <crowd-modal> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-modal
link-text = "See Examples"
link-type = "button">
Example Modal Text</crowd-modal>
Attributes
The following attributes are supported by this element.
link-text
The text to display for opening the modal. The default is "Click to open modal".
link-type
A string that specifies the type of trigger for the modal. The possible values are "link" (default) and
"button".
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-polygon
A widget for drawing polygons on an image and assigning a label to the portion of the image that is
enclosed in each polygon.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-polygon> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-polygon
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Draw a polygon around each of the requested target(s) of interest"
labels="['Cat', 'Dog', 'Bird']"
>
<full-instructions header="Polygon instructions">
<ul>
<li>Make the polygon tight around the object</li>
<li>You need to select a label before starting a polygon</li>
<li>You will need to select a label again after completing a polygon</li>
<li>To select a polygon, you can click on its borders</li>
<li>You can start drawing a polygon from inside another polygon</li>
<li>You can undo and redo while you're drawing a polygon to go back and forth
between points you've placed</li>
<li>You are prevented from drawing lines that overlap other lines from the same
polygon</li>
</ul>
</full-instructions>
<short-instructions>
<p>Draw a polygon around each of the requested target(s) of interest</p>
<p>Make the polygon tight around the object</p>
</short-instructions>
</crowd-polygon>
</crowd-form>
Attributes
The following attributes are supported by this element.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
labels
A JSON formatted array of strings, each of which is a label that a worker can assign to the image portion
enclosed by a polygon.
name
The name of this widget. It's used as a key for the widget's input in the form output.
src
The URL of the image on which to draw polygons.
initial-value
An array of JSON objects, each of which defines a polygon to be drawn when the component is loaded.
Each JSON object in the array contains the following properties.
• label – The text assigned to the polygon as part of the labeling task. This text must match one of the
labels defined in the labels attribute of the <crowd-polygon> element.
• vertices – An array of JSON objects. Each object contains an x and y coordinate value for a point in the
polygon.
Example
An initial-value attribute might look something like this.
initial-value =
'[
{
"label": "dog",
"vertices":
[
{
"x": 570,
"y": 239
},
...
{
"x": 759,
"y": 281
}
]
}
]'
Because this will be within an HTML element, the JSON array must be enclosed in single or double
quotes. The example above uses single quotes to encapsulate the JSON and double quotes within the
JSON itself. If you must mix single and double quotes inside your JSON, replace them with their HTML
entity codes (&quot; for double quote, &#39; for single quote) to safely escape them.
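For example, if the attribute value is wrapped in single quotes and a label such as dog's toy itself contains a single quote, escape it with its entity code. The following is a minimal sketch; the label text and vertices are illustrative:

initial-value='[
  {
    "label": "dog&#39;s toy",
    "vertices": [
      {"x": 570, "y": 239},
      {"x": 759, "y": 281}
    ]
  }
]'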
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are required.
full-instructions
General instructions about how to draw polygons.
short-instructions
Important task-specific instructions that are displayed in a prominent place.
936
Amazon SageMaker Developer Guide
SageMaker Crowd HTML Elements
Output
The following output is supported by this element.
polygons
An array of JSON objects, each of which describes a polygon that has been created by the worker. Each
JSON object in the array contains the following properties.
• label – The text assigned to the polygon as part of the labeling task.
• vertices – An array of JSON objects. Each object contains an x and y coordinate value for a point in the
polygon. The top left corner of the image is 0,0.
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.
• height – The height, in pixels, of the image.
• width – The width, in pixels, of the image.
{
"annotatedResult":
{
"inputImageProperties": {
"height": 853,
"width": 1280
},
"polygons":
[
{
"label": "dog",
"vertices":
[
{
"x": 570,
"y": 239
},
{
"x": 603,
"y": 513
},
{
"x": 823,
"y": 645
},
{
"x": 901,
"y": 417
},
{
"x": 759,
"y": 281
}
]
}
]
}
}
The following is a sample of the output if the worker draws multiple polygons with the same label.

[
{
"annotatedResult": {
"inputImageProperties": {
"height": 853,
"width": 1280
},
"polygons": [
{
"label": "dog",
"vertices": [
{
"x": 570,
"y": 239
},
{
"x": 603,
"y": 513
},
{
"x": 823,
"y": 645
},
{
"x": 901,
"y": 417
},
{
"x": 759,
"y": 281
}
]
},
{
"label": "dog",
"vertices": [
{
"x": 870,
"y": 278
},
{
"x": 908,
"y": 446
},
{
"x": 1009,
"y": 602
},
{
"x": 1116,
"y": 519
},
{
"x": 1174,
"y": 498
},
{
"x": 1227,
"y": 479
},
{
"x": 1179,
"y": 405
},
{
"x": 1179,
"y": 337
}
]
}
]
}
}
]
The following is a sample of the output if the worker draws polygons with different labels.

[
{
"annotatedResult": {
"inputImageProperties": {
"height": 853,
"width": 1280
},
"polygons": [
{
"label": "dog",
"vertices": [
{
"x": 570,
"y": 239
},
{
"x": 603,
"y": 513
},
{
"x": 823,
"y": 645
},
{
"x": 901,
"y": 417
},
{
"x": 759,
"y": 281
}
]
},
{
"label": "cat",
"vertices": [
{
"x": 870,
"y": 278
},
{
"x": 908,
"y": 446
},
{
"x": 1009,
"y": 602
},
{
"x": 1116,
"y": 519
},
{
"x": 1174,
"y": 498
},
{
"x": 1227,
"y": 479
},
{
"x": 1179,
"y": 405
},
{
"x": 1179,
"y": 337
}
]
}
]
}
}
]
You could have many labels available, but only the ones that are used appear in the output.
See Also
For more information, see the following.
crowd-polyline
A widget for drawing polylines or lines on an image. Each polyline is associated with a label and can
include two or more vertices. A polyline can intersect itself and its starting and ending points can be
placed anywhere on the image.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-polyline> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template. For more examples, see this GitHub repository.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-polyline
name="crowdPolyline"
src="{{ task.input.taskObject | grant_read_access }}"
header="Add header here to describe the task"
labels="['car','pedestrian','street car']"
>
<full-instructions>
    </full-instructions>
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Review the tool guide to learn how to use the polyline tool.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>To draw a polyline, select a label that applies to an object of interest
and add a single point to the photo by clicking on that point. Continue to
draw the polyline around the object by adding additional points
around the object boundary.</p>
<p>After you place the final point on the polyline, press <b>Enter</b> on your
keyboard to complete the polyline.</p>
</short-instructions>
</crowd-polyline>
</crowd-form>
Attributes
The following attributes are supported by this element.
header
Optional. The text to display above the image. This is typically a question or simple instruction for the
worker.
initial-value
Optional. An array of JSON objects, each of which sets a polyline when the component is loaded. Each
JSON object in the array contains the following properties:
• label – The text assigned to the polyline as part of the labeling task. This text must match one of the
labels defined in the labels attribute of the <crowd-polyline> element.
• vertices – The x and y pixel coordinates of the vertices of a polyline, relative to the top-left corner of
the image.
initial-value= "{
polylines: [
{
label: 'sideline', // label of this line annotation
vertices:[ // an array of vertices which decide the position of the line
{
x: 84,
y: 110
},
{
x: 60,
y: 100
}
]
},
{
label: 'yardline',
vertices:[
{
x: 651,
y: 498
},
{
x: 862,
y: 869
},
{
x: 1000,
y: 869
}
]
}
]
}"
Polylines set via the initial-value property can be adjusted. Whether or not a worker answer was
adjusted is tracked via an initialValueModified boolean in the worker answer output.
labels
Required. A JSON formatted array of strings, each of which is a label that a worker can assign to the line.
Limit: 10 labels
label-colors
Optional. An array of strings. Each string is a hexadecimal (hex) code for a label.
name
Required. The name of this widget. It's used as a key for the widget's input in the form output.
src
Required. The URL of the image on which to draw polylines.
Regions
The following regions are required by this element.
full-instructions
General instructions about how to draw polylines.
short-instructions
Important task-specific instructions that are displayed in a prominent place.
Element Hierarchy
This element has the following parent and child elements.
Output
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.
• height – The height, in pixels, of the image.
• width – The width, in pixels, of the image.
polylines
A JSON Array containing objects with the polyline labels and vertices.
{
"crowdPolyline": { //This is the name you set for the crowd-polyline
"inputImageProperties": {
"height": 1254,
"width": 2048
},
"polylines": [
{
"label": "sideline",
"vertices": [
{
"x": 651,
"y": 498
},
{
"x": 862,
"y": 869
},
{
"x": 1449,
"y": 611
}
]
},
{
"label": "yardline",
"vertices": [
{
"x": 1148,
"y": 322
},
{
"x": 1705,
"y": 474
},
{
"x": 1755,
"y": 474
}
]
}
]
}
}
See Also
For more information, see the following.
crowd-radio-button
A button that can be either checked or unchecked. When radio buttons are inside a radio group, exactly
one radio button in the group can be checked at any time. The following is an example of how to
configure a crowd-radio-button element inside of a crowd-radio-group element.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of the syntax that you can use with the <crowd-radio-button> element.
Copy the following code and save it in a file with the extension .html. Open the file in any browser to
preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-radio-group>
<crowd-radio-button name="tech" value="tech">Technology</crowd-radio-button>
<crowd-radio-button name="politics" value="politics">Politics</crowd-radio-button>
</crowd-radio-group>
</crowd-form>
The previous example can be seen in a custom worker task template in this GitHub example: entity
recognition labeling job custom template.
Crowd HTML Element radio buttons do not support the HTML required attribute. To make a radio button
selection required, use <input type="radio"> elements to create radio buttons and add the
required attribute. The name attribute for all <input> elements that belong to the same group of radio
buttons must be the same. For example, the following template requires the user to select a radio button
in the animal-type group before submitting.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<p>Select an animal type:</p>
<img src="https://images.unsplash.com/photo-1537151608828-ea2b11777ee8?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1539&q=80" style="height:
500px; width: 400px;"/>
<br><br>
<div>
<input type="radio" id="cat" name="animal-type" value="cat" required>
<label for="cat">Cat</label>
</div>
<div>
<input type="radio" id="dog" name="animal-type" value="dog">
<label for="dog">Dog</label>
</div>
<div>
<input type="radio" id="unknown" name="animal-type" value="unknown">
<label for="unknown">Unknown</label>
</div>
<full-instructions header="Classification Instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</full-instructions>
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</short-instructions>
</crowd-form>
Attributes
The following attributes are supported by this element.
checked
A Boolean switch that, if present, displays the radio button as checked.
disabled
A Boolean switch that, if present, displays the button as disabled and prevents it from being checked.
name
A string that is used to identify the answer submitted by the worker. This value will match a key in the
JSON object that specifies the answer.
Note
If you use the buttons outside of a crowd-radio-group (p. 946) element, but with the same
name string and different value strings, the name object in the output will contain a Boolean
value for each value string. To ensure that only one button in a group is selected, make them
children of a crowd-radio-group (p. 946) element and use different name values.
value
A property name for the element's boolean value. If not specified, it uses "on" as the default, e.g.
{ "<name>": { "<value>": <true or false> } }.
Element Hierarchy
This element has the following parent and child elements.
Output
Outputs an object with the following pattern: { "<name>": { "<value>": <true or false> } }.
If you use the buttons outside of a crowd-radio-group (p. 946) element, but with the same name
string and different value strings, the name object will contain a Boolean value for each value
string. To ensure that only one in a group of buttons is selected, make them children of a crowd-radio-
group (p. 946) element and use different name values.
[
{
"btn1": {
"yes": true
},
"btn2": {
"no": false
}
}
]
See Also
For more information, see the following.
crowd-radio-group
A group of radio buttons. Only one radio button within the group can be selected. Choosing one radio
button clears any previously chosen radio button within the same group. For an example of a custom UI
template that uses the crowd-radio-group element, see this entity recognition labeling job custom
template.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of the syntax that you can use with the <crowd-radio-group> element.
Copy the following code and save it in a file with the extension .html. Open the file in any browser to
preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<style>
body {
padding-left: 20px;
margin-bottom: 20px;
}
#outer-container {
display: flex;
justify-content: space-around;
max-width: 900px;
margin-left: 100px;
}
.left-container {
margin-right: auto;
padding-right: 50px;
}
.right-container {
margin-left: auto;
padding-left: 50px;
}
#vertical-separator {
border: solid 1px #d5dbdb;
}
</style>
<crowd-form>
<div>
<h1>Instructions</h1>
Lorem ipsum...
</div>
<div>
<h2>Background</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
</div>
<div id="outer-container">
<span class="left-container">
<h2>Option 1</h2>
<p>Nulla facilisi morbi tempus iaculis urna. Orci dapibus ultrices in iaculis nunc sed
augue lacus.</p>
</span>
<span id="vertical-separator"></span>
<span class="right-container">
<h2>Option 2</h2>
<p>Ultrices vitae auctor eu augue ut. Pellentesque massa placerat duis ultricies lacus
sed turpis tincidunt id.</p>
</span>
</div>
<div>
<h2>Question</h2>
<p>Which do you agree with?</p>
<crowd-radio-group>
<crowd-radio-button name="option1" value="Option 1">Option 1</crowd-radio-button>
<crowd-radio-button name="option2" value="Option 2">Option 2</crowd-radio-button>
    </crowd-radio-group>
  </div>
</crowd-form>
Attributes
No special attributes are supported by this element.
Element Hierarchy
This element has the following parent and child elements.
Output
Outputs an array of objects representing the crowd-radio-button (p. 944) elements within it.
[
{
"btn1": {
"yes": true
},
"btn2": {
"no": false
}
}
]
See Also
For more information, see the following.
crowd-semantic-segmentation
A widget for segmenting an image and assigning a label to each image segment.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-semantic-segmentation>
element. Copy the following code and save it in a file with the extension .html. Open the file in any
browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-semantic-segmentation
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Please label each of the requested objects in this image"
labels="['Cat', 'Dog', 'Bird']"
>
<full-instructions header="Segmentation Instructions">
<ol>
<li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to
understand more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits the image.</li>
</ol>
</full-instructions>
<short-instructions>
<p>Use the tools to label the requested items in the image</p>
</short-instructions>
</crowd-semantic-segmentation>
</crowd-form>
Attributes
The following attributes are supported by this element.
header
The text to display above the image. This is typically a question or simple instruction for the worker.
initial-value
A JSON object containing the color mappings of a prior semantic segmentation job and a link to the
overlay image output by the prior job. Include this when you want a human worker to verify the results
of a prior labeling job and adjust it if necessary.
initial-value='{
"labelMappings": {
"Bird": {
"color": "#ff7f0e"
},
"Cat": {
"color": "#2ca02c"
},
"Cow": {
"color": "#d62728"
},
"Dog": {
"color": "#1f77b4"
}
},
"src": {{ "S3 file URL for image" | grant_read_access }}
}'
When using Ground Truth built-in task types with annotation consolidation (where more than one worker
labels a single image), label mappings are included in individual worker output records. However, the
overall result is represented as the internal-color-map in the consolidated results.
You can convert the internal-color-map to label-mappings in a custom template using the Liquid
templating language:
initial-value="{
'src' : '{{ task.input.manifestLine.label-attribute-name-from-prior-job|
grant_read_access }}',
'labelMappings': {
      {% for box in task.input.manifestLine.label-attribute-name-from-prior-job-metadata.internal-color-map %}
{% if box[1]['class-name'] != 'BACKGROUND' %}
{{ box[1]['class-name'] | to_json }}: {
'color': {{ box[1]['hex-color'] | to_json }}
},
{% endif %}
{% endfor %}
}
}"
labels
A JSON formatted array of strings, each of which is a label that a worker can assign to a segment of the
image.
name
The name of this widget. It is used as a key for the widget's input in the form output.
src
The URL of the image that is to be segmented.
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are supported by this element.
full-instructions
General instructions about how to do image segmentation.
short-instructions
Important task-specific instructions that are displayed in a prominent place.
Output
The following output is supported by this element.
labeledImage
A JSON object containing a Base64-encoded PNG of the labels.
labelMappings
A JSON object containing objects named with the segmentation labels.
• color – The hexadecimal value of the label's RGB color in the labeledImage PNG.
initialValueModified
A boolean representing whether the initial values have been modified. This is only included when the
output is from an adjustment task.
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.
• height – The height, in pixels, of the image.
• width – The width, in pixels, of the image.
[
{
"annotatedResult": {
"inputImageProperties": {
"height": 533,
"width": 800
},
"labelMappings": {
"<Label 2>": {
"color": "#ff7f0e"
},
"<label 3>": {
"color": "#2ca02c"
},
"<label 1>": {
"color": "#1f77b4"
}
},
"labeledImage": {
"pngImageData": "<Base-64 Encoded Data>"
}
}
}
]
See Also
For more information, see the following.
crowd-slider
A bar with a sliding knob that allows a worker to select a value from a range of values by moving
the knob. This makes it a great choice for settings that reflect intensity levels, such as volume,
brightness, or color saturation.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a survey template that uses the <crowd-slider> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-instructions link-text="View instructions" link-type="button">
<short-summary>
<p>Provide a brief instruction here</p>
</short-summary>
<detailed-instructions>
<h3>Provide more detailed instructions here</h3>
<p>Include additional information</p>
</detailed-instructions>
<positive-example>
<p>Provide an example of a good answer here</p>
<p>Explain why it's a good answer</p>
</positive-example>
<negative-example>
<p>Provide an example of a bad answer here</p>
<p>Explain why it's a bad answer</p>
</negative-example>
</crowd-instructions>
<div>
<p>What is your favorite color for a bird?</p>
<crowd-input name="favoriteColor" placeholder="example: pink" required></crowd-input>
</div>
<div>
<p>Check this box if you like birds</p>
<crowd-checkbox name="likeBirds" checked="true" required></crowd-checkbox>
</div>
<div>
<p>On a scale of 1-10, how much do you like birds?</p>
    <crowd-slider name="howMuch" min="1" max="10" step="1" pin="true" required></crowd-slider>
</div>
<div>
<p>Write a short essay describing your favorite bird</p>
    <crowd-text-area name="essay" rows="4" placeholder="Lorem ipsum..." required></crowd-text-area>
</div>
</crowd-form>
951
Amazon SageMaker Developer Guide
SageMaker Crowd HTML Elements
Attributes
The following attributes are supported by this element.
disabled
A Boolean switch that, if present, displays the slider as disabled.
editable
A Boolean switch that, if present, displays an up/down button that can be chosen to select the value.
Selecting the value via the up/down button is an alternative to selecting the value by moving the knob
on the slider. The knob on the slider will move synchronously with the up/down button choices.
max
A number that specifies the maximum value on the slider.
min
A number that specifies the minimum value on the slider.
name
A string that is used to identify the answer submitted by the worker. This value will match a key in the
JSON object that specifies the answer.
pin
A Boolean switch that, if present, displays the current value above the knob as the knob is moved.
required
A Boolean switch that, if present, requires the worker to provide input.
secondary-progress
When used with a crowd-slider-secondary-color CSS attribute, the progress bar is colored
to the point represented by the secondary-progress. For example, if this was representing the
progress on a streaming video, the value would represent where the viewer was in the video timeline.
The secondary-progress value would represent the point on the timeline to which the video had
buffered.
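For example, a slider that represents playback position with buffered content might look like the following. This is a minimal sketch; the name and numeric values are illustrative:

<crowd-slider name="videoPosition" min="0" max="100" value="35" secondary-progress="60" pin="true"></crowd-slider>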
step
A number that specifies the difference between selectable values on the slider.
value
A preset that becomes the default if the worker does not provide input.
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-tab
A component styled to look like a tab with information below.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example template that uses the <crowd-tab> element. Copy the following code and
save it in a file with the extension .html. Open the file in any browser to preview and interact with this
template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-tabs>
<crowd-tab header="Tab 1">
<h2>Image</h2>
<img
        src="https://images.unsplash.com/photo-1478382188900-5bb598fe27d3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1351&q=80"
style="max-width: 40%"
>
<h2>Text</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.
</p>
<p>
Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas sed
sed risus.
</p>
</crowd-tab>
</crowd-tabs>
<short-instructions>
<p>Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas
sed sed risus.</p>
</short-instructions>
</crowd-form>
Attributes
The following attributes are supported by this element.
header
The text appearing on the tab. This is usually some short descriptive name indicative of the information
contained below the tab.
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-tabs
A container for tabbed information.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example template that uses the <crowd-tabs> element. Copy the following code
and save it in a file with the extension .html. Open the file in any browser to preview and interact with
this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-tabs>
<crowd-tab header="Tab 1">
<h2>Image</h2>
<img
        src="https://images.unsplash.com/photo-1478382188900-5bb598fe27d3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1351&q=80"
style="max-width: 40%"
>
<h2>Text</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.
</p>
<p>
Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas sed
sed risus.
</p>
</crowd-tab>
</crowd-tabs>
<short-instructions>
<p>Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas
sed sed risus.</p>
</short-instructions>
</crowd-form>
Attributes
This element has no attributes.
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-text-area
A field for text input.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template designed to transcribe audio clips that uses the
<crowd-text-area> element. Copy the following code and save it in a file with the extension .html.
Open the file in any browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<audio controls>
<source src="{{ task.input.taskObject | grant_read_access }}" type="audio/mpeg">
Your browser does not support the audio element.
</audio>
<h3>Instructions</h3>
<p>Transcribe the audio</p>
<p>Ignore "umms", "hmms", "uhs" and other non-textual phrases</p>
<crowd-text-area name="transcription" rows="4"></crowd-text-area>
</crowd-form>
Attributes
The following attributes are supported by this element.
allowed-pattern
A regular expression that is used with the auto-validate attribute to ignore non-matching characters as
the worker types.
auto-focus
A Boolean switch that, if present, puts the cursor in this element on-load so that users can immediately
begin typing without having to click inside the element.
auto-validate
A Boolean switch that, if present, turns on input validation. The behavior of the validator can be
modified by the error-message and allowed-pattern attributes.
char-counter
A Boolean switch that, if present, puts a small text field beneath the lower-right corner of the element,
displaying the number of characters inside the element.
disabled
A Boolean switch that, if present, displays the input area as disabled.
error-message
The text to be displayed below the input field, on the left side, if validation fails.
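For example, the following combines auto-validate, allowed-pattern, and error-message so that only digits are accepted as the worker types. This is a minimal sketch; the name, pattern, and message are illustrative:

<crowd-text-area name="zipCode" auto-validate allowed-pattern="[0-9]" error-message="Enter numbers only"></crowd-text-area>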
label
A string that is displayed inside a text field.
This text shrinks and rises up above a text field when the worker starts typing in the field or when the
value attribute is set.
max-length
An integer that specifies the maximum number of characters allowed by the element. Characters typed
or pasted beyond the maximum are ignored.
max-rows
An integer that specifies the maximum number of rows of text that are allowed within a crowd-text-
area. Normally the element expands to accommodate new rows. If this is set, after the number of rows
exceeds it, content scrolls upward out of view and a scrollbar control appears.
name
A string used to represent the element's data in the output.
placeholder
A string presented to the user as placeholder text. It disappears after the user puts something in the
input area.
rows
An integer that specifies the height of the element in rows of text.
value
A preset that becomes the default if the worker does not provide input. The preset appears in a text field.
Element Hierarchy
This element has the following parent and child elements.
Output
This element outputs the name as a property name and the element's text contents as the value.
Carriage returns in the text are represented as \n.
[
{
"textInput1": "This is the text; the text that\nmakes the crowd go wild."
}
]
See Also
For more information, see the following.
crowd-toast
A subtle notification that temporarily appears on the display. Only one crowd-toast is visible at a time.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following is an example of a Liquid template that uses the <crowd-toast> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<p>Find the official website for: <strong>{{ task.input.company }}</strong></p>
<p>Do not give Yelp pages, LinkedIn pages, etc.</p>
<p>Include the http:// prefix from the website</p>
<crowd-input name="website" placeholder="http://example.com"></crowd-input>
  <crowd-toast duration="10000" text="This message will disappear after 10 seconds."></crowd-toast>
</crowd-form>
Attributes
The following attributes are supported by this element.
duration
A number that specifies the duration, in milliseconds, that the notification appears on the screen.
text
The text to display in the notification.
Element Hierarchy
This element has the following parent and child elements.
See Also
For more information, see the following.
crowd-toggle-button
A button that acts as an ON/OFF switch, toggling a state.
See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.
The following example shows how to use the <crowd-toggle-button> HTML element. Copy the
following code and save it in a file with the extension .html. Open the file in any
browser to preview and interact with this template.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<!--Toggle button without value-->
<crowd-toggle-button name="toggleButtonWithoutValue"></crowd-toggle-button>
</crowd-form>
Attributes
The following attributes are supported by this element.
checked
A Boolean switch that, if present, displays the button switched to the ON position.
disabled
A Boolean switch that, if present, displays the button as disabled and prevents toggling.
invalid
When in an off position, a button using this attribute displays in an alert color. The standard is red,
but it can be changed in CSS. When toggled on, the button displays in the same color as other buttons
in the on position.
name
A string that is used to identify the answer submitted by the worker. This value matches a key in the
JSON object that specifies the answer.
required
value
A value used in the output as the property name for the element's Boolean state. Defaults to "on" if not
provided.
Element Hierarchy
This element has the following parent and child elements.
Output
This element outputs the name as the name of an object, containing the value as a property name
and the element's state as Boolean value for the property. If no value for the element is specified, the
property name defaults to "on."
[
{
"theToggler": {
"on": true
}
}
]
See Also
For more information, see the following.
Augmented AI Crowd HTML Elements
Topics
• crowd-textract-analyze-document (p. 961)
• crowd-rekognition-detect-moderation-labels (p. 964)
crowd-textract-analyze-document
A widget to enable human review of an Amazon Textract document analysis result.
Attributes
The following attributes are supported by this element.
header
This is the text that is displayed as the header.
src
This is a link to the image to be analyzed by the worker.
initialValue
This sets initial values for attributes found in the worker UI.
[
{
"blockType": "KEY_VALUE_SET",
"confidence": 38.43309020996094,
"geometry": {
"boundingBox": {
"width": 0.32613086700439453,
"weight": 0.0942094624042511,
"left": 0.4833833575248718,
"top": 0.5227988958358765
},
"polygon": [
{"x": 0.123, "y": 0.345}, ...
]
},
"id": "8c97b240-0969-4678-834a-646c95da9cf4",
"relationships": [
{
"type": "CHILD",
"ids": [
"7ee7b7da-ee1b-428d-a567-55a3e3affa56",
"4d6da730-ba43-467c-a9a5-c6137ba0c472"
]
},
{
"type": "VALUE",
"ids": [
"6ee7b7da-ee1b-428d-a567-55a3e3affa54"
]
}
],
"entityTypes": [
"KEY"
],
"text": "Foo bar"
},
]
blockTypes
This determines the kind of analysis the workers can do. Only KEY_VALUE_SET is currently supported.
keys
This specifies new keys and the associated text value the worker can add. The input values for keys can
include the following elements:
[
{
importantFormKey: 'Address',
importantFormKeyAliases: [
'address',
'Addr.',
'Add.',
]
},
{
importantFormKey: 'Last name',
importantFormKeyAliases: ['Surname']
}
]
no-key-edit
This prevents workers from editing the keys of annotations passed through initialValue, including
keys that have been detected on your documents. This is required.
no-geometry-edit
This prevents workers from editing the polygons of annotations passed through initialValue. For
example, this would prevent the worker from editing the bounding box around a given key. This is
required.
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are supported by this element. You can use custom HTML and CSS code within
these regions to format your instructions to workers. For example, use the short-instructions
section to provide good and bad examples of how to complete a task.
full-instructions
General instructions about how to work with the widget.
short-instructions
Important task-specific instructions that are displayed in a prominent place.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
{% capture s3_uri %}http://s3.amazonaws.com/
{{ task.input.aiServiceRequest.document.s3Object.bucket }}/
{{ task.input.aiServiceRequest.document.s3Object.name }}{% endcapture %}
<crowd-form>
<crowd-textract-analyze-document
src="{{ s3_uri | grant_read_access }}"
initial-value="{{ task.input.selectedAiServiceResponse.blocks }}"
header="Review the key-value pairs listed on the right and correct them if they don't
match the following document."
no-key-edit
no-geometry-edit
keys="{{ task.input.humanLoopContext.importantFormKeys }}"
block-types="['KEY_VALUE_SET']"
>
<short-instructions header="Instructions">
<style>
.instructions {
white-space: pre-wrap;
}
.instructionsImage {
display: inline-block;
max-width: 100%;
}
</style>
<p class='instructions'>Click on a key-value block to highlight the corresponding
key-value pair in the document.
If it is a valid key-value pair, review the content for the value. If the content is
incorrect, correct it.
If you can’t find the key in the document, choose Key not found.
<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/key-is-not-
found.png" />
<b>Examples</b>
Key and value are often displayed next or below to each other.
If the content of the value has multiple lines, enter all the text without line break.
Include all value text even if it extends beyond the highlight box.
<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/multiple-
lines.png" /></p>
</short-instructions>
<full-instructions header="Instructions"></full-instructions>
</crowd-textract-analyze-document>
</crowd-form>
Output
The following is a sample of the output from this element. You can find a detailed explanation of this
output in the Amazon Textract AnalyzeDocument API documentation.
{
"AWS/Textract/AnalyzeDocument/Forms/V1": {
blocks: [
{
"blockType": "KEY_VALUE_SET",
"id": "8c97b240-0969-4678-834a-646c95da9cf4",
"relationships": [
{
"type": "CHILD",
"ids": ["7ee7b7da-ee1b-428d-a567-55a3e3affa56", "4d6da730-ba43-467c-a9a5-
c6137ba0c472"]
},
{
"type": "VALUE",
"ids": ["6ee7b7da-ee1b-428d-a567-55a3e3affa54"]
}
],
"entityTypes": ["KEY"],
"text": "Foo bar baz"
}
]
}
}
crowd-rekognition-detect-moderation-labels
A widget to enable human review of an Amazon Rekognition image moderation result.
Attributes
The following attributes are supported by this element.
header
This is the text that is displayed as the header.
src
This is a link to the image to be analyzed by the worker.
categories
This supports categories as an array of strings or an array of objects where each object has a name
field.
• The returned answer is an array of all the strings that were selected.
exclusion-category
By setting this attribute you create a button underneath the categories in the UI.
• When a user chooses the button, all categories are deselected and disabled.
• Choosing the button again re-enables the categories so that users can choose them.
• If you submit after choosing the button, it returns an empty array.
Element Hierarchy
This element has the following parent and child elements.
Regions
The following regions are supported by this element. You can use custom HTML and CSS
code within these regions to format your instructions to workers. For example, use the short-
instructions section to provide good and bad examples of how to complete a task.
full-instructions
General instructions about how to work with the widget.
short-instructions
Important task-specific instructions that are displayed in a prominent place.
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
{% capture s3_uri %}http://s3.amazonaws.com/
{{ task.input.aiServiceRequest.image.s3Object.bucket }}/
{{ task.input.aiServiceRequest.image.s3Object.name }}{% endcapture %}
<crowd-form>
<crowd-rekognition-detect-moderation-labels
categories='[
{% for label in task.input.selectedAiServiceResponse.moderationLabels %}
{
name: "{{ label.name }}",
parentName: "{{ label.parentName }}",
},
{% endfor %}
]'
src="{{ s3_uri | grant_read_access }}"
header="Review the image and choose all applicable categories."
>
<short-instructions header="Instructions">
<style>
.instructions {
white-space: pre-wrap;
}
</style>
<p class='instructions'>Review the image and choose all applicable categories.
If no categories apply, choose None.
<b>Nudity</b>
Visuals depicting nude male or female person or persons
<b>Sexual Activity</b>
Visuals depicting various types of explicit sexual activities and pornography
<b>Adult Toys</b>
Visuals depicting adult toys, often in a marketing context
<b>Partial Nudity</b>
Visuals depicting covered up nudity, for example using hands or pose
<b>Revealing Clothes</b>
Visuals depicting revealing clothes and poses, such as deep cut dresses
<b>Physical Violence</b>
Visuals depicting violent physical assault, such as kicking or punching
<b>Weapon Violence</b>
Visuals depicting violence using weapons like firearms or blades, such as shooting
<b>Weapons</b>
Visuals depicting weapons like firearms and blades
<b>Self Injury</b>
Visuals depicting self-inflicted cutting on the body, typically in distinctive patterns
using sharp objects
<b>Emaciated Bodies</b>
Visuals depicting extremely malnourished human bodies
<b>Corpses</b>
Visuals depicting human dead bodies
<b>Hanging</b>
Visuals depicting death by hanging</p>
</short-instructions>
<full-instructions header="Instructions"></full-instructions>
</crowd-rekognition-detect-moderation-labels>
</crowd-form>
Output
The following is a sample of the output from this element. For details about this output, see Amazon
Rekognition DetectModerationLabels API documentation.
{
"AWS/Rekognition/DetectModerationLabels/Image/V3": {
"ModerationLabels": [
{ name: 'Gore', parentName: 'Violence' },
{ name: 'Corpses', parentName: 'Violence' },
]
}
}
Prepare and Analyze Datasets
Import data from Amazon S3, Amazon Redshift, and Amazon Athena, and use Data Wrangler to create
sophisticated machine learning data prep workflows with built-in and custom data transformations and
analysis, including feature target leakage and quick modeling.
After you have defined a data prep workflow, or data flow, you can integrate it with SageMaker
Processing, SageMaker Pipelines, and SageMaker Feature Store to simplify the task of processing, sharing,
and storing ML training data. You can also export your data flow to a Python script and create a custom
ML data prep pipeline.
For more information, see Prepare ML Data with Amazon SageMaker Data Wrangler (p. 981).
For fast data preparation at scale, Amazon SageMaker Studio provides a built-in integration with
Amazon EMR. You can use SageMaker Studio to connect to, provision, or manage EMR clusters from
your notebook interface for petabyte-scale data processing, interactive analytics, and machine learning.
Amazon EMR uses open-source frameworks such as Apache Spark, Apache Hive, or Presto. For more
information about using Amazon EMR within SageMaker Studio, see Prepare data using Amazon
EMR (p. 1164).
Alternatively, you can use the Apache Spark-based serverless engine from AWS Glue interactive
sessions to aggregate and transform data from multiple sources. You can aggregate and transform
data from your analytics and ETL (extract, transform, and load) pipelines without needing to manage
infrastructure. For more information about using AWS Glue interactive sessions within SageMaker Studio,
see Prepare data using AWS Glue Interactive Sessions (p. 1192).
The data that you’re using to train your machine learning model might contain bias. Bias can result in
machine learning models that discriminate against certain individuals or groups. You can use Amazon
SageMaker Clarify to determine whether the data that you’re using to train models or your resulting
model encodes any bias. SageMaker Clarify can also help you explain models created with tabular, image
or NLP data with partial dependence plots, feature importance and more. For more information about
SageMaker Clarify, see Detect Pre-training Data Bias (p. 968).
Topics
• Detect Pre-training Data Bias (p. 968)
• Prepare ML Data with Amazon SageMaker Data Wrangler (p. 981)
• Prepare data at scale from Studio notebooks with Amazon EMR or AWS Glue (p. 1163)
Detect Pre-training Data Bias
Bias can be measured before training and after training, and monitored against baselines after deploying
models to endpoints for inference. Pre-training bias metrics are designed to detect and measure bias
in the raw data before it is used to train a model. The metrics used are model-agnostic because they do
not depend on any model outputs. However, there are different concepts of fairness that require distinct
measures of bias. Amazon SageMaker Clarify provides bias metrics to quantify various fairness criteria.
For additional information about bias metrics, see Learn How Amazon SageMaker Clarify Helps Detect
Bias and Fairness Measures for Machine Learning in Finance.
Amazon SageMaker Clarify Terms for Bias and Fairness
Feature
An individual measurable property or characteristic of a phenomenon being observed, contained in a
column for tabular data.
Label
Feature that is the target for training a machine learning model. Referred to as the observed label or
observed outcome.
Predicted label
The label as predicted by the model. Also referred to as the predicted outcome.
Sample
An observed entity described by feature values and label value, contained in a row for tabular data.
Dataset
A collection of samples.
Bias
An imbalance in the training data or the prediction behavior of the model across different groups,
such as age or income bracket. Biases can result from the data or algorithm used to train your
model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may
be less accurate when making predictions involving younger and older people.
Bias metric
A function that returns numerical values indicating the level of a potential bias.
Bias report
A collection of bias metrics for a given dataset, or a combination of a dataset and a model.
Positive label values
Label values that are favorable to a demographic group observed in a sample. In other words,
designates a sample as having a positive result.
Negative label values
Label values that are unfavorable to a demographic group observed in a sample. In other words,
designates a sample as having a negative result.
Group variable
Categorical column of the dataset that is used to form subgroups for the measurement of
Conditional Demographic Disparity (CDD). Required only for this metric with regards to Simpson’s
paradox.
Facet
A column or feature that contains the attributes with respect to which bias is measured.
Facet value
The feature values of attributes that bias might favor or disfavor.
Predicted probability
The probability, as predicted by the model, of a sample having a positive or negative outcome.
Sample Notebooks
Amazon SageMaker Clarify provides the following sample notebook for bias detection:
• Explainability and bias detection with Amazon SageMaker Clarify – Use SageMaker Clarify to create a
processing job for detecting bias and explaining model predictions with feature attributions.
This notebook has been verified to run in Amazon SageMaker Studio only. If you need instructions on
how to open a notebook in Amazon SageMaker Studio, see Create or Open an Amazon SageMaker Studio
Notebook (p. 148). If you're prompted to choose a kernel, choose Python 3 (Data Science).
Topics
• Measure Pre-training Bias (p. 970)
• Generate Reports for Bias in Pre-training Data in SageMaker Studio (p. 980)
We use the following notation to discuss the bias metrics. The conceptual model described here is for
binary classification, where events are labeled as having only two possible outcomes in their sample
space, referred to as positive (with value 1) and negative (with value 0). This framework is usually
extensible to multicategory classification in a straightforward way or to cases involving continuous
valued outcomes when needed. In the binary classification case, positive and negative labels are assigned
to outcomes recorded in a raw dataset for a favored facet a and for a disfavored facet d. These labels y
are referred to as observed labels to distinguish them from the predicted labels y' that are assigned by
a machine learning model during the training or inference stages of the ML lifecycle. These labels are
used to define probability distributions Pa(y) and Pd(y) for their respective facet outcomes.
• labels: y represents observed labels and y' represents labels predicted by a machine learning model.
Measure Pre-training Bias
Models trained on data biased by demographic disparities might learn and even exacerbate them. To
identify bias in the data before expending resources to train models on it, SageMaker Clarify provides
data bias metrics that you can compute on raw datasets before training. All of the pretraining metrics are
model-agnostic because they do not depend on model outputs and so are valid for any model. The first
bias metric examines facet imbalance, but not outcomes. It determines the extent to which the amount
of training data is representative across different facets, as desired for the application. The remaining
bias metrics compare the distribution of outcome labels in various ways for facets a and d in the data.
The metrics that range over negative values can detect negative bias. The following table contains a
cheat sheet for quick guidance and links to the pretraining bias metrics.
Class Imbalance (CI) (p. 975)
Measures the imbalance in the number of members between different facet values.
Example question: Could there be age-based biases due to not having enough data for the demographic
outside a middle-aged facet?
Normalized range: [-1, +1]
Interpretation:
• Positive values indicate the facet a has more training samples in the dataset.
• Values near zero indicate a more equal distribution of members between facets.
• Negative values indicate the facet d has more training samples in the dataset.
Difference in Proportions of Labels (DPL) (p. 975)
Measures the imbalance of positive outcomes between different facet values.
Example question: Could there be age-based biases in ML predictions due to biased labeling of facet
values in the data?
Range for normalized binary and multicategory labels: [-1, +1]. Range for continuous labels: (-∞, +∞)
Interpretation:
• Positive values indicate facet a has a higher proportion of positive outcomes.
• Values near zero indicate a more equal proportion of positive outcomes between facets.
• Negative values indicate facet d has a higher proportion of positive outcomes.
Kullback-Leibler Divergence (KL) (p. 976)
Measures how much the outcome distributions of different facets diverge from each other entropically.
Example question: How different are the distributions for loan application outcomes for different
demographic groups?
Range for binary, multicategory, and continuous outcomes: [0, +∞)
Interpretation:
• Values near zero indicate the outcomes are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.
Jensen-Shannon Divergence (JS) (p. 976)
Measures how much the outcome distributions of different facets diverge from each other entropically.
Example question: How different are the distributions for loan application outcomes for different
demographic groups?
Range for binary, multicategory, and continuous outcomes: [0, ln(2))
Interpretation:
• Values near zero indicate the outcomes are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.
Lp-norm (LP) (p. 977)
Measures a p-norm difference between distinct demographic distributions of the outcomes associated
with different facets in a dataset.
Example question: How different are the distributions for loan application outcomes for different
demographics?
Range for binary, multicategory, and continuous outcomes: [0, +∞)
Interpretation:
• Values near zero indicate the labels are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.
Total Variation Distance (TVD) (p. 977)
Measures half of the L1-norm difference between distinct demographic distributions of the outcomes
associated with different facets in a dataset.
Example question: How different are the distributions for loan application outcomes for different
demographics?
Range for binary, multicategory, and continuous outcomes: [0, +∞)
Interpretation:
• Values near zero indicate the labels are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.
Kolmogorov-Smirnov (KS) (p. 978)
Measures the maximum divergence between observed labels in the distributions for facets a and d of a
dataset.
Example question: How different are the distributions for loan application outcomes for different
demographics?
Range for binary, multicategory, and continuous outcomes: [0, +1]
Interpretation:
• Values near zero indicate the labels were evenly distributed between facets in all outcome categories.
• Values near one indicate the labels for one outcome were all in one facet.
• Intermittent values indicate relative degrees of maximum label imbalance.
Conditional Demographic Disparity (CDD) (p. 978)
Measures the disparity of outcomes between different facets as a whole, but also by subgroups.
Example question: Do some groups have a larger proportion of rejections for college admission outcomes
than their proportion of acceptances?
Range of CDD: [-1, +1]
Interpretation:
• Positive values indicate outcomes where facet d is rejected more than accepted.
• Values near zero indicate no demographic disparity on average.
• Negative values indicate outcomes where facet a is rejected more than accepted.
For additional information about bias metrics, see Fairness Measures for Machine Learning in Finance.
Topics
• Class Imbalance (CI) (p. 975)
• Difference in Proportions of Labels (DPL) (p. 975)
• Kullback-Leibler Divergence (KL) (p. 976)
• Jensen-Shannon Divergence (JS) (p. 976)
• Lp-norm (LP) (p. 977)
• Total Variation Distance (TVD) (p. 977)
• Kolmogorov-Smirnov (KS) (p. 978)
• Conditional Demographic Disparity (CDD) (p. 978)
Class Imbalance (CI)
Class imbalance (CI) bias occurs when a facet value d has fewer training samples when compared
with another facet a in the dataset. The formula for the (normalized) facet imbalance measure is as
follows:

CI = (na - nd)/(na + nd)

Where na is the number of members of facet a and nd the number for facet d. Its values range over the
interval [-1, 1].
interval [-1, 1].
• Positive CI values indicate the facet a has more training samples in the dataset and a value of 1
indicates the data only contains members of the facet a.
• Values of CI near zero indicate a more equal distribution of members between facets and a value of
zero indicates a perfectly equal partition between facets and represents a balanced distribution of
samples in the training data.
• Negative CI values indicate the facet d has more training samples in the dataset and a value of -1
indicates the data only contains members of the facet d.
• CI values near either of the extreme values of -1 or 1 are very imbalanced and are at a substantial risk
of making biased predictions.
If a significant facet imbalance is found to exist among the facets, you might want to rebalance the
sample before proceeding to train models on it.
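As a worked example with illustrative counts, suppose facet a has 900 training samples and facet d has 100:

CI = (900 - 100)/(900 + 100) = 0.8

This value is close to +1, signaling that the dataset is heavily skewed toward facet a.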
Difference in Proportions of Labels (DPL)
The difference in proportions of labels (DPL) compares the proportion of observed outcomes with
positive labels for facet a with the proportion of observed outcomes with positive labels for facet d in
a training dataset. The formula for DPL is as follows:

DPL = qa - qd

Where:
• qa = na^(1)/na is the proportion of facet a who have an observed label value of 1. For example, the
proportion of a middle-aged demographic who get approved for loans. Here na^(1) represents the
number of members of facet a who get a positive outcome and na is the number of members of facet
a.
• qd = nd^(1)/nd is the proportion of facet d who have an observed label value of 1. For example, the
proportion of people outside the middle-aged demographic who get approved for loans. Here nd^(1)
represents the number of members of the facet d who get a positive outcome and nd is the number of
members of the facet d.
If DPL is close enough to 0, then we say that demographic parity has been achieved.
For binary and multicategory facet labels, the DPL values range over the interval (-1, 1). For continuous
labels, we set a threshold to collapse the labels to binary.
• Positive DPL values indicate that facet a has a higher proportion of positive outcomes when
compared with facet d.
• Values of DPL near zero indicate a more equal proportion of positive outcomes between facets and a
value of zero indicates perfect demographic parity.
• Negative DPL values indicate that facet d has a higher proportion of positive outcomes when
compared with facet a.
Whether or not a high magnitude of DPL is problematic varies from one situation to another. In a
problematic case, a high-magnitude DPL might be a signal of underlying issues in the data. For example,
a dataset with high DPL might reflect historical biases or prejudices against age-based demographic
groups that would be undesirable for a model to learn.
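As a worked example, using the illustrative loan approval rates from the Kullback-Leibler example that follows (80% for facet a, 30% for facet d):

DPL = qa - qd = 0.8 - 0.3 = 0.5

This large positive value indicates that facet a receives a substantially higher proportion of positive outcomes than facet d.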
Kullback-Leibler Divergence (KL)
The Kullback-Leibler divergence (KL) measures how much the observed label distribution of facet a,
Pa(y), diverges from the distribution of facet d, Pd(y). The formula for the KL divergence is as follows:

KL(Pa || Pd) = Σy Pa(y) log[Pa(y)/Pd(y)]

It is the expectation of the logarithmic difference between the probabilities Pa(y) and Pd(y), where the
expectation is weighted by the probabilities Pa(y). This is not a true distance between the distributions
as it is asymmetric and does not satisfy the triangle inequality. The implementation uses natural
logarithms, giving KL in units of nats. Using different logarithmic bases gives proportional results but in
different units. For example, using base 2 gives KL in units of bits.
For example, assume that a group of applicants for loans have a 30% approval rate (facet d) and that
the approval rate for other applicants (facet a) is 80%. The Kullback-Leibler formula gives you the label
distribution divergence of facet a from facet d as follows:

KL = 0.8*ln(0.8/0.3) + 0.2*ln(0.2/0.7) = 0.53
There are two terms in the formula here because labels are binary in this example. This measure can
be applied to multiple labels in addition to binary ones. For example, in a college admissions scenario,
assume an applicant may be assigned one of three category labels: yi = {y0, y1, y2} = {rejected, waitlisted,
accepted}.
Range of values for the KL metric for binary, multicategory, and continuous outcomes is [0, +∞).
• Values near zero mean the outcomes are similarly distributed for the different facets.
• Positive values mean the label distributions diverge, the more positive the larger the divergence.
Jensen-Shannon Divergence (JS)
The Jensen-Shannon divergence (JS) measures how much the outcome distributions of different facets
diverge from each other entropically. It is based on the Kullback-Leibler divergence, but it is
symmetric. The formula for the JS divergence is as follows:

JS = ½*[KL(Pa, P) + KL(Pd, P)]

Where P = ½*(Pa + Pd) is the average label distribution across facets a and d.

The range of JS values for binary, multicategory, continuous outcomes is [0, ln(2)).
This metric indicates whether there is a big divergence in one of the labels across facets.
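As a worked example with the same illustrative 80% and 30% approval rates, the average distribution is P = ½*(Pa + Pd) = (0.55, 0.45), and:

JS = ½*[KL(Pa, P) + KL(Pd, P)] = ½*(0.138 + 0.127) ≈ 0.13

This value is well below the ln(2) upper bound, indicating a moderate divergence between the two outcome distributions.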
Lp-norm (LP)
The Lp-norm (LP) measures the p-norm distance between the facet distributions of the observed labels in
a training dataset. This metric is non-negative and so cannot detect reverse bias.
Where the p-norm distance between the points x and y is defined as follows:
Lp(x, y) = (|x1-y1|^p + |x2-y2|^p + … + |xn-yn|^p)^(1/p)
The 2-norm is the Euclidean norm. Assume you have an outcome distribution with three categories, for
example, yi = {y0, y1, y2} = {accepted, waitlisted, rejected} in a college admissions multicategory scenario.
You take the sum of the squares of the differences between the outcome counts for facets a and d. The
resulting Euclidean distance is calculated as follows:
L2(Pa, Pd) = [(na^(0) - nd^(0))^2 + (na^(1) - nd^(1))^2 + (na^(2) - nd^(2))^2]^(1/2)
Where:
• na^(i) is the number of the ith category outcomes in facet a: for example, na^(0) is the number of
facet a acceptances.
• nd^(i) is the number of the ith category outcomes in facet d: for example, nd^(2) is the number of
facet d rejections.
The range of LP values for binary, multicategory, and continuous outcomes is [0, √2), where:
• Values near zero mean the labels are similarly distributed.
• Positive values mean the label distributions diverge, the more positive the larger the divergence.
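As a sketch (a hypothetical helper), the Euclidean case computed on label proportions, so that the result
falls in the stated [0, √2) range:

def lp_norm(x, y, p=2):
    # p-norm distance between two label distributions
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Hypothetical admission proportions (accepted, waitlisted, rejected) per facet.
print(lp_norm([0.6, 0.3, 0.1], [0.2, 0.3, 0.5]))  # ~0.57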
Total Variation Distance (TVD)
The total variation distance (TVD) is half the L1-norm distance between the label distributions of facets
a and d in a training dataset: TVD = ½*L1(Pa, Pd).
For example, assume you have an outcome distribution with three categories, yi = {y0, y1, y2} = {accepted,
waitlisted, rejected}, in a college admissions multicategory scenario. You take the differences between
the counts of facets a and d for each outcome to calculate TVD. The result is as follows:
L1(Pa, Pd) = |na^(0) - nd^(0)| + |na^(1) - nd^(1)| + |na^(2) - nd^(2)|
Where:
• na^(i) is the number of the ith category outcomes in facet a: for example, na^(0) is the number of
facet a acceptances.
• nd^(i) is the number of the ith category outcomes in facet d: for example, nd^(2) is the number of
facet d rejections.
The range of TVD values for binary, multicategory, and continuous outcomes is [0, 1), where:
• Values near zero mean the labels are similarly distributed.
• Positive values mean the label distributions diverge, the more positive the larger the divergence.
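A minimal sketch (a hypothetical helper), computing TVD on label proportions:

def tvd(p_a, p_d):
    # Half the L1 distance between the two label distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(p_a, p_d))

print(tvd([0.6, 0.3, 0.1], [0.2, 0.3, 0.5]))  # 0.4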
Kolmogorov-Smirnov (KS)
The Kolmogorov-Smirnov bias metric (KS) is equal to the maximum divergence between labels in the
distributions for facets a and d of a dataset. The two-sample KS test implemented by SageMaker Clarify
complements the other measures of label imbalance by finding the most imbalanced label.
KS = max(|Pa(y) - Pd(y)|)
For example, assume a group of applicants (facet a) to college are rejected, waitlisted, or accepted at
40%, 40%, 20% respectively and that these rates for other applicants (facet d) are 20%, 10%, 70%. Then
the Kolmogorov-Smirnov bias metric value is as follows:
KS = max(|0.4-0.2|, |0.4-0.1|, |0.2-0.7|) = 0.5
This tells us the maximum divergence between facet distributions is 0.5 and occurs in the acceptance
rates. There are three terms in the equation because labels are multiclass of cardinality three.
The range of KS values for binary, multicategory, and continuous outcomes is [0, +1], where:
• Values near zero indicate the labels were evenly distributed between facets in all outcome categories.
For example, both facets applying for a loan got 50% of the acceptances and 50% of the rejections.
• Values near one indicate the labels for one outcome were all in one facet. For example, facet a got
100% of the acceptances and facet d got none.
• Intermediate values indicate relative degrees of maximum label imbalance.
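A minimal sketch (a hypothetical helper) reproducing the example above:

def ks(p_a, p_d):
    # Maximum divergence between the two label distributions
    return max(abs(a - b) for a, b in zip(p_a, p_d))

# Rejected/waitlisted/accepted rates for facets a and d from the example.
print(ks([0.4, 0.4, 0.2], [0.2, 0.1, 0.7]))  # 0.5, driven by the acceptance rates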
Demographic Disparity (DD)
The demographic disparity (DD) metric determines whether a facet has a larger proportion of the
rejected outcomes in the dataset than of the accepted outcomes. The formula for the demographic
disparity for the less favored facet d is as follows:
DDd = nd^(0)/n^(0) - nd^(1)/n^(1) = PdR(y0) - PdA(y1)
Where:
• n^(0) is the total number of rejected outcomes in the dataset and nd^(0) is the number of rejected
outcomes for facet d.
• n^(1) is the total number of accepted outcomes in the dataset and nd^(1) is the number of accepted
outcomes for facet d.
• PdR(y0) is the proportion of rejected outcomes (value 0) that belong to facet d.
• PdA(y1) is the proportion of accepted outcomes (value 1) that belong to facet d.
For the college admission example, the demographic disparity for women is DDd = 0.46 - 0.32 = 0.14.
For men, DDa = 0.54 - 0.68 = -0.14.
A conditional demographic disparity (CDD) metric that conditions DD on attributes that define strata of
subgroups in the dataset is needed to rule out Simpson's paradox. The regrouping can provide insights
into the cause of apparent demographic disparities for less favored facets. The classic case arose in the
case of Berkeley admissions where men were accepted at a higher rate overall than women. The statistics
for this case were used in the example calculations of DD. However, when departmental subgroups
were examined, women were shown to have higher admission rates than men when conditioned by
department. The explanation was that women had applied to departments with lower acceptance rates
than men had. Examining the subgrouped acceptance rates revealed that women were actually accepted
at a higher rate than men for the departments with lower acceptance rates.
The CDD metric gives a single measure for all of the disparities found in the subgroups defined by
an attribute of a dataset by averaging them. It is defined as the weighted average of demographic
disparities (DDi) for each of the subgroups, with each subgroup disparity weighted in proportion to the
number of observations it contains. The formula for the conditional demographic disparity is as follows:
CDD = (1/n)*∑i ni*DDi
Where:
• ∑i ni = n is the total number of observations and ni is the number of observations for each subgroup.
• DDi = ni^(0)/n^(0) - ni^(1)/n^(1) = PiR(y0) - PiA(y1) is the demographic disparity for the ith subgroup.
The demographic disparity for a subgroup (DDi) is the difference between the proportion of rejected
outcomes and the proportion of accepted outcomes for that subgroup.
The range of DD values for binary outcomes for the full dataset DDd or for its conditionalized subgroups
DDi is [-1, +1].
• +1: when there are no rejections in facet a or subgroup and no acceptances in facet d or subgroup
• Positive values indicate there is a demographic disparity as facet d or subgroup has a greater
proportion of the rejected outcomes in the dataset than of the accepted outcomes. The higher the
value the less favored the facet and the greater the disparity.
• Negative values indicate there is not a demographic disparity as facet d or subgroup has a larger
proportion of the accepted outcomes in the dataset than of the rejected outcomes. The lower the
value the more favored the facet.
• -1: when there are no rejections in facet d or subgroup and no acceptances in facet a or subgroup
If you don't condition on anything then CDD is zero if and only if DPL is zero.
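A minimal sketch (a hypothetical helper) of the weighted average described above, using two
hypothetical department subgroups:

def cdd(subgroups):
    # Weighted average of subgroup demographic disparities (DDi);
    # subgroups is a list of (n_i, dd_i) pairs, where n_i is the subgroup size.
    total = sum(n for n, _ in subgroups)
    return sum(n * dd for n, dd in subgroups) / total

print(cdd([(300, 0.20), (700, -0.05)]))  # 0.025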
This metric is useful for exploring the concepts of direct and indirect discrimination and of objective
justification in EU and UK non-discrimination law and jurisprudence. For additional information, see Why
Fairness Cannot Be Automated. This paper also contains the relevant data and analysis of the Berkeley
admissions case that shows how conditionalizing on departmental admission rate subgroups illustrates
Simpson's paradox.
Generate Reports for Bias in Pre-training Data in SageMaker Studio
You specify attributes of interest, such as gender or age, and SageMaker Clarify runs a set of algorithms
to detect the presence of bias in those attributes. After the algorithm runs, SageMaker Clarify provides
a visual report with a description of the sources and severity of possible bias so that you can plan steps
to mitigate. For example, in a financial dataset that contains few examples of business loans to one
age group as compared to others, SageMaker flags the imbalance so that you can avoid a model that
disfavors that age group.
To get started with Data Wrangler, see Get Started with Data Wrangler (p. 983).
1. In Amazon SageMaker Studio, from the Home menu in the left panel, navigate to the Data node,
then choose Data Wrangler. This opens the Data Wrangler landing page in Studio.
2. Choose the + Import data button to create a new flow.
3. In your flow page, from the Import tab, choose Amazon S3, navigate to your Amazon S3 bucket,
find your dataset, then choose Import.
4. After you have imported your data, on the flow graph in the Data flow tab, choose the + sign to the
right of the Data types node.
5. Choose Add analysis.
6. On the Create Analysis page, choose Bias Report for the Analysis type.
7. Configure the bias report by providing a report Name, the column to predict and whether it is a
value or threshold, the column to analyze for bias (the facet) and whether it is a value or threshold.
8. Continue configuring the bias report by choosing the bias metrics.
9. Choose Check for bias to generate and view the bias report. Scroll down to view all of the reports.
10. Choose the caret to the right of each bias metric description to see documentation that can help you
interpret the significance of the metric values.
11. To view a table summary of the bias metric values, choose the Table toggle. To save the report,
choose Save in the lower-right corner of the page. You can see the report on the flow graph in the
Data flow tab. Double-click on the report to open it.
Prepare Data with Data Wrangler
Data Wrangler provides the following core functionalities to help you analyze and prepare data for
machine learning applications.
• Import – Connect to and import data from Amazon Simple Storage Service (Amazon S3), Amazon
Athena (Athena), Amazon Redshift, Snowflake, and Databricks.
• Data Flow – Create a data flow to define a series of ML data prep steps. You can use a flow to combine
datasets from different data sources, identify the number and types of transformations you want to
apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline.
• Transform – Clean and transform your dataset using standard transforms like string, vector, and
numeric data formatting tools. Featurize your data using transforms like text and date/time
embedding and categorical encoding.
• Generate Data Insights – Automatically verify data quality and detect abnormalities in your data with
Data Wrangler Data Insights and Quality Report.
• Analyze – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-
in data visualization tools like scatter plots and histograms, as well as data analysis tools like target
leakage analysis and quick modeling to understand feature correlation.
• Export – Export your data preparation workflow to a different location. The following are example
locations:
• Amazon Simple Storage Service (Amazon S3) bucket
• Amazon SageMaker Model Building Pipelines – Use SageMaker Pipelines to automate model
deployment. You can export the data that you've transformed directly to the pipelines.
• Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
• Python script – Store the data and their transformations in a Python script for your custom
workflows.
To start using Data Wrangler, see Get Started with Data Wrangler (p. 983).
Important
Data Wrangler no longer supports Jupyter Lab Version 1 (JL1). To access the latest features and
updates, update to Jupyter Lab Version 3. For more information about upgrading, see View and
update the JupyterLab version of an application from the console (p. 140).
Important
The information and procedures in this guide use the latest version of Amazon SageMaker
Studio. For information about updating Studio to the latest version, see Amazon SageMaker
Studio UI Overview (p. 129).
Topics
• Get Started with Data Wrangler (p. 983)
• Import (p. 991)
• Create and Use a Data Wrangler Flow (p. 1034)
• Get Insights On Data and Data Quality (p. 1045)
• Automatically Train Models on Your Data Flow (p. 1057)
• Transform Data (p. 1058)
• Analyze and Visualize (p. 1101)
• Reusing Data Flows for Different Datasets (p. 1109)
• Export (p. 1116)
• Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Notebook to Get Data
Insights (p. 1138)
• Security and Permissions (p. 1141)
• Release Notes (p. 1152)
• Troubleshoot (p. 1156)
Get Started with Data Wrangler
Prerequisites
To use Data Wrangler, you must complete the following prerequisites.
1. To use Data Wrangler, you need access to an Amazon Elastic Compute Cloud (Amazon EC2) instance.
For more information about the Amazon EC2 instances that you can use, see Instances (p. 1034). To
learn how to view your quotas and, if necessary, request a quota increase, see AWS service quotas.
2. Configure the required permissions described in Security and Permissions (p. 1141).
To use Data Wrangler, you need an active Studio instance. To learn how to launch a new instance, see
Onboard to Amazon SageMaker Domain (p. 37). When your Studio instance is Ready, use the instructions
in Access Data Wrangler (p. 983).
1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. You can also create a Data Wrangler flow by doing the following.
This messaging persists as long as the KernelGateway app on your User Details page is Pending.
To see the status of this app, in the SageMaker console on the Amazon SageMaker Studio page,
select the name of the user you are using to access Studio. On the User Details page, you see a
KernelGateway app under Apps. Wait until this app status is Ready to start using Data Wrangler.
This can take around 5 minutes the first time you launch Data Wrangler.
11. To get started, choose a data source and use it to import a dataset. See Import (p. 991) to learn
more.
When you import a dataset, it appears in your data flow. To learn more, see Create and Use a Data
Wrangler Flow (p. 1034).
12. After you import a dataset, Data Wrangler automatically infers the type of data in each column.
Choose + next to the Data types step and select Edit data types.
Important
After you add transforms to the Data types step, you cannot bulk-update column types
using Update types.
13. Use the data flow to add transforms and analyses. To learn more see Transform Data (p. 1058) and
Analyze and Visualize (p. 1101).
14. To export a complete data flow, choose Export and choose an export option. To learn more, see
Export (p. 1116).
15. Finally, choose the Components and registries icon, and select Data Wrangler from the dropdown
list to see all the .flow files that you've created. You can use this menu to find and move between
data flows.
After you have launched Data Wrangler, you can use the following section to walk through how you
might use Data Wrangler to create an ML data prep flow.
This walkthrough uses the Titanic dataset. It's a modified version of the Titanic dataset that you can
import into your Data Wrangler flow more easily. This data set contains the survival status, age, gender,
and class (which serves as a proxy for economic status) of passengers aboard the maiden voyage of the
RMS Titanic in 1912.
To import the dataset directly into Data Wrangler, open the flow and choose Use Sample Dataset.
Uploading the dataset to Amazon S3 and importing it into Data Wrangler is closer to the experience
you have importing your own data. The following information tells you how to upload your dataset and
import it.
Before you start importing the data into Data Wrangler, download the Titanic dataset and upload it to
an Amazon Simple Storage Service (Amazon S3) bucket in the AWS Region in which you want to complete
this demo.
If you are a new user of Amazon S3, you can do this using drag and drop in the Amazon S3 console.
To learn how, see Uploading Files and Folders by Using Drag and Drop in the Amazon Simple Storage
Service User Guide.
Important
Upload your dataset to an S3 bucket in the same AWS Region you want to use to complete this
demo.
When your dataset has been successfully uploaded to Amazon S3, you can import it into Data Wrangler.
1. Choose the Import data button in your Data flow tab or choose the Import tab.
2. Select Amazon S3.
3. Use the Import a dataset from S3 table to find the bucket to which you added the Titanic dataset.
Choose the Titanic dataset CSV file to open the Details pane.
4. Under Details, the File type should be CSV. Check First row is header to specify that the first row of
the dataset is a header. You can also name the dataset something more friendly, such as Titanic-
train.
5. Choose the Import button.
When your dataset is imported into Data Wrangler, it appears in your Data Flow tab. You can double-click
a node to enter the node detail view, which allows you to add transformations or analysis. You can use
the plus icon for quick access to the navigation. In the next section, you use this data flow to add analysis
and transform steps.
Data Flow
In the data flow section, the only steps in the data flow are your recently imported dataset and a Data
type step. After applying transformations, you can come back to this tab and see what the data flow
looks like. Now, add some basic transformations under the Prepare and Analyze tabs.
Data Wrangler has built-in transformations and visualizations that you can use to analyze, clean, and
transform your data.
The Data tab of the node detail view lists all built-in transformations in the right panel, which also
contains an area in which you can add custom transformations. The following use case showcases how to
use these transformations.
To get information that might help you with data exploration and feature engineering, create a data
quality and insights report. The information from the report can help you clean and process your data.
It gives you information such as the number of missing values and the number of outliers. If you have
issues with your data, such as target leakage or imbalance, the insights report can bring those issues
to your attention. For more information about creating a report, see Get Insights On Data and Data
Quality (p. 1045).
Data Exploration
First, create a table summary of the data using an analysis. Do the following:
1. Choose the + next to the Data type step in your data flow and select Add analysis.
2. In the Analysis area, select Table summary from the dropdown list.
3. Give the table summary a Name.
Using the statistics you see, you can make observations similar to the following about this dataset:
• Fare average (mean) is around $33, while the max is over $500. This column likely has outliers.
• This dataset uses ? to indicate missing values. A number of columns have missing values: cabin,
embarked, and home.dest
• The age category is missing over 250 values.
Next, clean your data using the insights gained from these stats.
Using the analysis from the previous section, clean up the dataset to prepare it for training. To add a
new transform to your data flow, choose + next to the Data type step in your data flow and choose Add
transform.
First, drop the columns that you don't want to use for training. You can use the pandas data analysis
library to do this, or you can use one of the built-in transforms. Drop the following columns:
• cabin
• ticket
• name
• sibsp
• parch
• home.dest
• boat
• body
7. Choose Preview.
8. Verify that the columns have been dropped, then choose Add.
5. Choose Preview to preview the change, and then choose Add to add the transformation.
Now, clean up missing values. You can do this with the Handling missing values transform group.
A number of columns have missing values. Of the remaining columns, age and fare contain missing
values. Inspect this using a Custom Transform.
Using the Python (Pandas) option, use the following to quickly review the number of entries in each
column:
df.info()
To drop rows with missing values in the age category, do the following:
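The original steps for this transform are omitted here; as a sketch, a Python (Pandas) custom transform
can achieve the same result. This assumes the incoming dataframe is named df, as in Data Wrangler
custom transforms, and that missing values are encoded as the string ?, per the table summary above:

# Drop rows whose age is missing (encoded as "?"), then cast age to numeric.
df = df[df["age"] != "?"]
df["age"] = df["age"].astype(float)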
You can use df.info() in the Custom transform section to confirm that all rows now have 1,045
values.
Try flat encoding using Pandas. Encoding categorical data is the process of creating a numerical
representation for categories. For example, if your categories are Dog and Cat, you may encode this
information into two vectors: [1,0] to represent Dog, and [0,1] to represent Cat.
1. In the Custom Transform section, choose Python (Pandas) from the dropdown list.
2. Enter the following in the code box.
import pandas as pd

dummies = []
cols = ['pclass','sex','embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))

# Concatenate the encoded columns and append them to the dataframe.
encoded = pd.concat(dummies, axis=1)
df = pd.concat((df, encoded), axis=1)
3. Choose Preview to preview the change. The encoded version of each column is added to the dataset.
4. Choose Add to add the transformation.
Now, select the columns you want to keep using SQL. For this demo, select the columns listed in the
following SELECT statement. Because survived is your target column for training, put that column first.
1. In the Custom Transform section, select SQL (PySpark SQL) from the dropdown list.
2. Enter the following in the code box.
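The original statement isn't reproduced here; the following is a plausible sketch with survived first. The
remaining column names are assumptions that depend on the dummy columns that pd.get_dummies
produced in the previous step, so adjust them to match your preview:

SELECT survived, age, fare, `1`, `2`, `3`, female, male, C, Q, S FROM df;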
3. Choose Preview to preview the change. The columns listed in your SELECT statement are the only
remaining columns.
4. Choose Add to add the transformation.
When you export your data flow using a Data Wrangler job, the process automatically creates a Jupyter
Notebook. This notebook automatically opens in your Studio instance and is configured to run a
SageMaker processing job to run your Data Wrangler data flow, which is referred to as a Data Wrangler
job.
1. Save your data flow. Select File and then select Save Data Wrangler Flow.
2. Back in the Data Flow tab, select the last step in your data flow (SQL), then choose the + to open
the navigation.
3. Choose Export, and Amazon S3 (via Jupyter Notebook). This opens a Jupyter Notebook.
Alternatively, you can add the code blocks found in Training XGBoost Classifier (p. 990) to the
notebook and run them to use the XGBoost open source library to train an XGBoost classifier.
7. Uncomment the cell under Cleanup and run it to revert the SageMaker Python SDK to its original
version.
You can monitor your Data Wrangler job status in the SageMaker console in the Processing tab.
Additionally, you can monitor your Data Wrangler job using Amazon CloudWatch. For additional
information, see Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics.
If you kicked off a training job, you can monitor its status using the SageMaker console under Training
jobs in the Training section.
You can train an XGBoost Binary Classifier using either a Jupyter notebook or Amazon SageMaker
Autopilot. You can use Autopilot to automatically train and tune models on the data that you've
transformed directly from your Data Wrangler flow. For information about Autopilot, see Automatically
Train Models on Your Data Flow (p. 1057).
In the same notebook that kicked off the Data Wrangler job, you can pull the data and train an XGBoost
Binary Classifier using the prepared data with minimal data preparation.
1. First, upgrade necessary modules using pip and remove the _SUCCESS file (this last file is
problematic when using awswrangler).
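The cell itself isn't shown here; the following is a rough sketch, assuming that output_path (defined
earlier in the export notebook) holds the S3 prefix that the Data Wrangler job wrote to:

!pip install --upgrade awswrangler xgboost

import awswrangler as wr
# Delete the Spark _SUCCESS marker, which awswrangler can't read as CSV.
# output_path is assumed to end with a trailing slash.
wr.s3.delete_objects(path=f"{output_path}_SUCCESS")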
2. Read the data from Amazon S3. You can use awswrangler to recursively read all the CSV files in
the S3 prefix. The data is then split into features and labels. The label is the first column of the
dataframe.

import awswrangler as wr

df = wr.s3.read_csv(path=output_path, dataset=True)
# The label (survived) is the first column; the remaining columns are features.
X, y = df.iloc[:, 1:], df.iloc[:, 0]
3. Finally, create DMatrices (the XGBoost primitive structure for data) and do cross-validation using
the XGBoost binary classification.

import xgboost as xgb

# Create the DMatrix from the features and labels read in the previous step.
dmatrix = xgb.DMatrix(data=X, label=y)

# Minimal assumed parameter set for a binary classifier.
params = {"objective": "binary:logistic"}

xgb.cv(
    dtrain=dmatrix,
    params=params,
    nfold=3,
    num_boost_round=50,
    early_stopping_rounds=10,
    metrics="rmse",
    as_pandas=True,
    seed=123)
When you are finished using Data Wrangler, we recommend that you shut down the instance it runs on
to avoid incurring additional charges. To learn how to shut down the Data Wrangler app and associated
instance, see Shut Down Data Wrangler (p. 1162).
Import
You can use Amazon SageMaker Data Wrangler to import data from the following data sources: Amazon
Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, Databricks, and
Snowflake. The dataset that you import can include up to 1000 columns.
Topics
• Import data from Amazon S3 (p. 993)
• Import data from Athena (p. 997)
• Import data from Amazon Redshift (p. 1000)
• Import data from Amazon EMR (p. 1004)
• Import data from Databricks (JDBC) (p. 1011)
• You can connect to multiple Amazon Redshift clusters. Each cluster becomes a data source.
• You can query any Athena database in your account to import data from that database.
When you import a dataset from a data source, it appears in your data flow. Data Wrangler automatically
infers the data type of each column in your dataset. To modify these types, select the Data types step
and select Edit data types.
When you import data from Athena or Amazon Redshift, the imported data is automatically stored
in the default SageMaker S3 bucket for the AWS Region in which you are using Studio. Additionally,
Athena stores data you preview in Data Wrangler in this bucket. To learn more, see Imported Data
Storage (p. 1033).
Important
The default Amazon S3 bucket may not have the least permissive security settings, such as
bucket policy and server-side encryption (SSE). We strongly recommend that you Add a Bucket
Policy To Restrict Access to Datasets Imported to Data Wrangler.
Important
In addition, if you use the managed policy for SageMaker, we strongly recommend that you
scope it down to the most restrictive policy that allows you to perform your use case. For more
information, see Grant an IAM Role Permission to Use Data Wrangler (p. 1143).
All data sources except for Amazon Simple Storage Service (Amazon S3) require you to specify a SQL
query to import your data. For each query, you must specify the following:
• Data catalog
• Database
• Table
You can specify the name of the database or the data catalog in either the drop-down menus or within
the query itself. For example, the following query names all three (the catalog, database, and table
names are placeholders): SELECT * FROM example-data-catalog.example-database.example-table
The link between Data Wrangler and the data source is a connection. You use the connection to import
data from your data source.
Data Wrangler supports two types of connections:
• Direct
• Cataloged
Data Wrangler always has access to the most recent data in a direct connection. If the data in the data
source has been updated, you can use the connection to import the data. For example, if someone adds a
file to one of your Amazon S3 buckets, you can import the file.
A cataloged connection is the result of a data transfer. The data in the cataloged connection doesn't
necessarily have the most recent data. For example, you might set up a data transfer between Salesforce
and Amazon S3. If there's an update to the Salesforce data, you must transfer the data again. You can
automate the process of transferring data. For more information about data transfers, see Import Data
From Software as a Service (SaaS) Platforms (p. 1030).
Data Wrangler uses S3 Select to allow you to preview your Amazon S3 files in Data Wrangler. You incur
standard charges for each file preview. To learn more about pricing, see the Requests & data retrievals
tab on Amazon S3 pricing.
Important
If you plan to export a data flow and launch a Data Wrangler job, ingest data into a SageMaker
feature store, or create a SageMaker pipeline, be aware that these integrations require Amazon
S3 input data to be located in the same AWS region.
Important
If you're importing a CSV file, make sure it meets the following requirements:
Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For
Amazon S3, it provides the following sampling options:
After you've imported your data, you can also use the sampling transformer to take one or more samples
from your entire dataset. For more information about the sampling transformer, see Sampling (p. 1092).
You can import either a single file or multiple files as a dataset. You can use the multifile import
operation when you have a dataset that is partitioned into separate files. It takes all of the files from an
Amazon S3 directory and imports them as a single dataset. For information on the types of files that you
can import and how to import them, see the following sections.
For files formatted in JSON, Data Wrangler supports both JSON lines (.jsonl) and JSON documents
(.json). When you preview your data, it automatically shows the JSON in tabular format. For nested
JSON documents that are larger than 5 MB, Data Wrangler shows the schema for the structure
and the arrays as values in the dataset. Use the Flatten structured and Explode array operators to
display the nested values in tabular format. For more information, see Unnest JSON Data (p. 1097)
and Explode Array (p. 1098).
When you choose a dataset, you can rename it, specify the file type, and identify the first row as a
header.
You can import a dataset that you've partitioned into multiple files in an Amazon S3 bucket in a
single import step.
To import a dataset into Data Wrangler from a single file that you've stored in Amazon
S3:
Multifile Import
• CSV
• Parquet
• Optimized Row Columnar (ORC)
• Image – Data Wrangler uses OpenCV to import images. For more information about supported
image formats, see Image file reading and writing.
To import a dataset into Data Wrangler from multiple files that you've stored in an
Amazon S3 directory
3. From the table of available S3 buckets, select the bucket containing the folder that you want to
import.
4. Select the folder containing the files that you want to import. Each file must be in one of the
supported formats. Your files must be the same data type.
5. If your folder contains CSV files with headers, select the checkbox next to First row is header.
6. If your files are nested within other folders, select the checkbox next to Include nested
directories.
7. (Optional) Choose Add filename column to add a column to the dataset that shows the filename
for each observation.
8. (Optional) By default, Data Wrangler doesn't show you a preview of a folder. You can activate
previewing by choosing the blue Preview off button. A preview shows the first 10 rows of
the first 10 files in the folder. The following images show you how to activate a preview for a
dataset created from nested directories.
9. In the Details pane, verify or change the Name and File Type for your dataset. If you add a
Name that contains spaces, these spaces are replaced with underscores when your dataset is
imported.
10. Specify the sampling configuration that you'd like to use.
11. Choose Import dataset.
You can also use parameters to import a subset of files that match a pattern. Parameters help you more
selectively pick the files that you're importing. To start using parameters, edit the data source and apply
them to the path that you're using to import the data. For more information, see Reusing Data Flows for
Different Datasets (p. 1109).
You can use the AWS Management Console to set up Amazon Athena. You must create at least one
database in Athena before you start running queries. For more information about getting started with
Athena, see Getting started.
Athena is directly integrated with Data Wrangler. You can write Athena queries without having to leave
the Data Wrangler UI.
In addition to writing simple Athena queries in Data Wrangler, you can also use:
• Athena workgroups for query result management. For more information about workgroups, see
Managing query results (p. 999).
• Lifecycle configurations for setting data retention periods. For more information about data retention,
see Setting data retention periods (p. 999).
If you use AWS Lake Formation with Athena, make sure your Lake Formation IAM permissions do not
override IAM permissions for the database sagemaker_data_wrangler.
Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For
Athena, it provides the following sampling options:
The following procedure shows how to import a dataset from Athena into Data Wrangler.
a. Choose a Workgroup.
b. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a
workgroup, specify a value for Amazon S3 location of query results.
c. (Optional) For Data retention period, select the checkbox to set a data retention period and
specify the number of days to store the data before it's deleted.
d. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the
checkbox and not save the connection.
13. For Sampling, choose a sampling method. Choose None to turn off sampling.
14. Enter your query in the query editor and use the Run button to run the query. After a successful
query, you can preview your result under the editor.
Note
Salesforce data uses the timestamptz type. If you're querying the timestamp column that
you've imported to Athena from Salesforce, cast the data in the column to the timestamp
type. The following query casts the timestamp column to the correct type.
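The query isn't reproduced here; the following is a sketch of what it might look like, with hypothetical
column and table names:

-- Cast the timestamptz column to the timestamp type; the names are placeholders.
SELECT CAST(timestamptz_col AS timestamp) AS timestamp_col
FROM example_database.example_table;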
After you complete the preceding procedure, the dataset that you've queried and imported appears in
the Data Wrangler flow.
By default, Data Wrangler saves the connection settings as a new connection. When you import your
data, the query that you've already specified appears as a new connection. The saved connections store
information about the Athena workgroups and Amazon S3 buckets that you're using. When you're
connecting to the data source again, you can choose the saved connection.
Your workgroup might be configured to enforce the Amazon S3 query output location. You can't change
the output location of the query results for those workgroups.
If you don't use a workgroup or specify an output location for your queries, Data Wrangler uses the
default Amazon S3 bucket in the same AWS Region in which your Studio instance is located to store
Athena query results. It creates temporary tables in this database to move the query output to this
Amazon S3 bucket. It deletes these tables after data has been imported; however, the database,
sagemaker_data_wrangler, persists. To learn more, see Imported Data Storage (p. 1033).
To use Athena workgroups, set up the IAM policy that gives access to workgroups. If you're using a
SageMaker-Execution-Role, we recommend adding the policy to the role. For more information
about IAM policies for workgroups, see IAM policies for accessing workgroups. For example workgroup
policies, see Workgroup example policies.
If you don't set a retention period, the Amazon S3 lifecycle configuration determines the duration that
the objects are stored. The data retention policy that you've specified for the lifecycle configuration
removes any query results that are older than the Lifecycle configuration that you've specified. For more
information, see Setting lifecycle configuration on a bucket.
Data Wrangler uses S3 lifecycle configurations to manage data retention and expiration. You must
give your Amazon SageMaker Studio IAM execution role permissions to manage bucket lifecycle
configurations. Use the following procedure to give permissions.
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. Choose Roles.
3. In the search bar, specify the Amazon SageMaker execution role that Amazon SageMaker Studio is
using.
4. Choose the role.
You can connect to and query one or more Amazon Redshift clusters in Data Wrangler. To use this import
option, you must create at least one cluster in Amazon Redshift. To learn how, see Getting started with
Amazon Redshift.
You can output the results of your Amazon Redshift query in one of the following locations:
You can either import the entire dataset or sample a portion of it. For Amazon Redshift, it provides the
following sampling options:
Data Wrangler stores Amazon Redshift query results in the default Amazon S3 bucket, which is in the
same AWS Region in which your Studio instance is located. For more information, see Imported Data
Storage (p. 1033).
For either the default Amazon S3 bucket or the bucket that you specify, you have the following
encryption options:
• The default AWS service-side encryption with an Amazon S3 managed key (SSE-S3)
• An AWS Key Management Service (AWS KMS) key that you specify
An AWS KMS key is an encryption key that you create and manage. For more information on KMS keys,
see AWS Key Management Service.
You can specify an AWS KMS key using either the key ARN or the ARN of your AWS account.
If you use the IAM managed policy, AmazonSageMakerFullAccess, to grant a role permission to use
Data Wrangler in Studio, your Database User name must have the prefix sagemaker_access.
After your connection is successfully established, it appears as a data source under Data Import. Select
this data source to query your database and import data.
1. Select the connection that you want to query from Data Sources.
2. Select a Schema. To learn more about Amazon Redshift Schemas, see Schemas in the Amazon
Redshift Database Developer Guide.
3. (Optional) Under Advanced configuration, specify the Sampling method that you'd like to use.
4. Enter your query in the query editor and choose Run to run the query. After a successful query, you
can preview your result under the editor.
5. Select Import dataset to import the dataset that has been queried.
6. Enter a Dataset name. If you add a Dataset name that contains spaces, these spaces are replaced
with underscores when your dataset is imported.
7. Choose Add.
Import data from Amazon EMR
Prerequisites
• Network configurations
• You have an Amazon VPC in the Region that you're using to launch Amazon SageMaker
Studio and Amazon EMR.
• Both Amazon EMR and Amazon SageMaker Studio must be launched in private subnets.
They can be in the same subnet or in different ones.
• Amazon SageMaker Studio must be in VPC-only mode.
For more information about creating a VPC, see Connect SageMaker Studio Notebooks in a
VPC to External Resources.
• The Amazon EMR clusters that you're running must be in the same Amazon VPC.
• The Amazon EMR clusters and the Amazon VPC must be in the same AWS account.
• Your Amazon EMR clusters are running Hive or Presto.
• Hive clusters must allow inbound traffic from Studio security groups on port 10000.
• Presto clusters must allow inbound traffic from Studio security groups on port 8889.
• SageMaker Studio
• Amazon SageMaker Studio must run Jupyter Lab Version 3. For information about updating
the Jupyter Lab Version, see View and update the JupyterLab version of an application from
the console (p. 140).
• Amazon SageMaker Studio has an IAM role that controls user access. The default IAM role
that you're using to run Amazon SageMaker Studio doesn't have policies that can give you
access to Amazon EMR clusters. You must attach the policy granting permissions to the
IAM role. For more information, see Configure the discoverability of Amazon EMR clusters
(for administrators) (p. 1178).
• The IAM role must also have a policy attached that grants the
secretsmanager:PutResourcePolicy permission.
• If you're using a Studio domain that you've already created, make sure that its
AppNetworkAccessType is in VPC-only mode. For information about updating a domain
to use VPC-only mode, see Shut down and Update SageMaker Studio (p. 199).
• Amazon EMR clusters
Note
Amazon EMR supports auto termination. Auto termination stops idle clusters from
running and prevents you from incurring costs. The following are the releases that
support auto termination:
• For 6.x releases, version 6.1.0 or later.
• For 5.x releases, version 5.30.0 or later.
An Amazon VPC is a virtual network that is logically isolated from other networks on the AWS cloud.
Amazon SageMaker Studio and your Amazon EMR cluster only exist within the Amazon VPC.
Use the following procedure to launch Amazon SageMaker Studio in an Amazon VPC.
If you don't have an Amazon EMR cluster ready, you can use the following procedure to create one. For
more information about Amazon EMR, see What is Amazon EMR?
Auto termination stops idle clusters from running and prevents you from incurring costs.
6. (Optional) For Applications, choose Presto.
7. Choose the application that you're running on the cluster.
8. Under Networking, for Hardware configuration, specify the hardware configuration settings.
Important
For Networking, choose the VPC that is running Amazon SageMaker Studio and choose a
private subnet.
9. Under Security and access, specify the security settings.
10. Choose Create.
For a tutorial about creating an Amazon EMR cluster, see Getting started with Amazon EMR. For
information about best practices for configuring a cluster, see Considerations and best practices.
Note
For security best practices, Data Wrangler can only connect to VPCs on private subnets. You
won't be able to connect to the master node unless you use AWS Systems Manager for your
EMR instances. For more information, see Securing access to EMR clusters using AWS Systems
Manager.
You can currently use the following methods to access an Amazon EMR cluster:
• No authentication
• Lightweight Directory Access Protocol (LDAP)
Use the following sections to create a Presto or Hive Amazon EMR cluster with LDAP activated.
Presto
Important
To use AWS Glue as a metastore for Presto tables, select Use for Presto table metadata
to store the results of your Amazon EMR queries in an AWS Glue data catalog when you're
launching an EMR cluster. Storing the query results in an AWS Glue data catalog can save you
from incurring charges.
To be able to query large datasets on Amazon EMR clusters, add the following properties to
the Presto configuration file on your EMR clusters:
[{"classification":"presto-config","properties":{
"http-server.max-request-header-size":"5MB",
"http-server.max-response-header-size":"5MB"}}]
You can also modify the configuration settings when you launch the Amazon EMR cluster.
The configuration file for your Amazon EMR cluster is located under the following path: /
etc/presto/conf/config.properties.
Use the following procedure to create a Presto cluster with LDAP activated.
Auto termination stops idle clusters from running and prevents you from incurring costs.
6. Choose the application that you're running on the cluster.
7. Under Networking, for Hardware configuration, specify the hardware configuration settings.
Important
For Networking, choose the VPC that is running Amazon SageMaker Studio and choose
a private subnet.
8. Under Security and access, specify the security settings.
9. Choose Create.
Hive
Important
To use AWS Glue as a metastore for Hive tables, select Use for Hive table metadata to store
the results of your Amazon EMR queries in an AWS Glue data catalog when you're launching
an EMR cluster. Storing the query results in an AWS Glue data catalog can save you from
incurring charges.
To be able to query large datasets on Amazon EMR clusters, add the following properties to
the Hive configuration file on your EMR clusters:
[{"classification":"hive-site", "properties"
:{"hive.resultset.use.unique.column.names":"false"}}]
You can also modify the configuration settings when you launch the Amazon EMR cluster.
The configuration file for your Amazon EMR cluster is located under the following path: /
etc/hive/conf/hive-site.xml. You can specify the following property and restart the
cluster:
<property>
<name>hive.resultset.use.unique.column.names</name>
<value>false</value>
</property>
Use the following procedure to create a Hive cluster with LDAP activated.
8. (Optional) Select Use for Hive table metadata to store the results of your Amazon EMR queries
in an AWS Glue data catalog. Storing the query results in an AWS Glue data catalog can save you
from incurring charges. For more information, see Using the AWS Glue Data Catalog as the metastore
for Hive.
Note
Storing the query results in a data catalog requires Amazon EMR version 5.8.0 or later.
9. Under Enter configuration, specify the following JSON:
[
{
"classification": "hive-site",
"properties": {
"hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
"hive.server2.authentication": "LDAP",
"hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
}
}
]
Note
As a security best practice, we recommend enabling SSL for HiveServer by adding a few
properties in the preceding hive-site JSON. For more information, see Enable SSL on
HiveServer2.
10. Specify the remaining cluster settings and create a cluster.
Use the following sections to use LDAP authentication for Amazon EMR clusters that you've already
created.
Using LDAP on a cluster running Presto requires access to the Presto coordinator through HTTPS. Do
the following to provide access:
- Classification: presto-config
ConfigurationProperties:
http-server.authentication.type: 'PASSWORD'
http-server.https.enabled: 'true'
http-server.https.port: '8889'
http-server.http.port: '8899'
node-scheduler.include-coordinator: 'true'
http-server.https.keystore.path: '/path/to/keystore/path/for/presto'
http-server.https.keystore.key: 'keystore-key-password'
discovery.uri: 'http://master-node-dns-name:8899'
- Classification: presto-password-authenticator
ConfigurationProperties:
password-authenticator.name: 'ldap'
ldap.url: !Sub 'ldaps://ldap-server-dns-name:636'
ldap.user-bind-pattern: "uid=${USER},dc=example,dc=org"
internal-communication.authentication.ldap.user: "ldap-user-name"
internal-communication.authentication.ldap.password: "ldap-password"
For information about setting up LDAP in Presto, see the following resources:
• LDAP Authentication
• Using LDAP Authentication for Presto on Amazon EMR
Note
As a security best practice, we recommend enabling SSL for Presto. For more information,
see Secure Internal Communication.
LDAP for Hive
To use LDAP for Hive on a cluster that you've already created, use the Reconfigure an instance group in
the console procedure.
[
{
"classification": "hive-site",
"properties": {
"hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
"hive.server2.authentication": "LDAP",
"hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
}
}
]
• No authentication
• LDAP
9. For Login into example-cluster-name cluster, specify the Username and Password for the
cluster.
10. Choose Connect.
11. In the query editor specify a SQL query.
The format of a valid JDBC URL depends on whether you use authentication and whether you use
Hive or Presto as the query engine. The following list shows the valid JDBC URL formats for the
different possible configurations.
The following are the valid JDBC URL formats for Hive with SSL enabled:
• Without a Java Keystore File – jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=user-name;PWD=password;SSL=1;AllowSelfSignedCerts=1;
• With a Java Keystore File – jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=user-name;PWD=password;SSL=1;SSLKeyStore=/home/sagemaker-user/data/Java-keystore-file-name;SSLKeyStorePwd=Java-keystore-file-password;
• Presto, no authentication – jdbc:presto://emr-cluster-master-public-dns:8889/;
• For Presto with LDAP authentication and SSL enabled, the JDBC URL format depends on whether
you use a Java Keystore File for the TLS configuration. The Java Keystore File helps verify the
identity of the master node of the Amazon EMR cluster. To use a Java Keystore File, generate it
on an EMR cluster and upload it to Data Wrangler. To upload a file, choose the upward arrow on
the left-hand navigation of the Data Wrangler UI. For information about creating a Java Keystore
File for Presto, see Java Keystore File for TLS. For information about running commands on an
Amazon EMR cluster, see Securing access to EMR clusters using AWS Systems Manager.
• Without a Java Keystore File – jdbc:presto://emr-cluster-master-public-dns:8889/;SSL=1;AuthenticationType=LDAP Authentication;UID=user-name;PWD=password;AllowSelfSignedServerCert=1;AllowHostNameCNMismatch=1;
Throughout the process of importing data from an Amazon EMR cluster, you might run into issues. For
information about troubleshooting them, see Troubleshooting issues with Amazon EMR (p. 1159).
We assume that you have a running Databricks cluster and that you've configured the JDBC driver for it.
For more information, see the following Databricks documentation pages:
• JDBC driver
• JDBC configuration and connection parameters
• Authentication parameters
Data Wrangler stores your JDBC URL in AWS Secrets Manager. You must give your Amazon SageMaker
Studio IAM execution role permissions to use Secrets Manager. Use the following procedure to give
permissions.
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. Choose Roles.
3. In the search bar, specify the Amazon SageMaker execution role that Amazon SageMaker Studio is
using.
4. Choose the role.
5. Choose Add permissions.
6. Choose Create inline policy.
7. For Service, specify Secrets Manager and choose it.
8. For Actions, select the arrow icon next to Permissions management.
9. Choose PutResourcePolicy.
10. For Resources, choose Specific.
11. Choose the checkbox next to Any in this account.
12. Choose Review policy.
13. For Name, specify a name.
14. Choose Create policy.
You can use partitions to import your data more quickly. Partitions give Data Wrangler the ability to
process the data in parallel. By default, Data Wrangler uses 2 partitions. For most use cases, 2 partitions
give you near-optimal data processing speeds.
If you choose to specify more than 2 partitions, you can also specify a column to partition the data. The
type of the values in the column must be numeric or date.
We recommend using partitions only if you understand the structure of the data and how it's processed.
You can either import the entire dataset or sample a portion of it. For a Databricks database, it provides
the following sampling options:
Use the following procedure to import your data from a Databricks database.
• Dataset name – A name that you want to use for the dataset in your Data Wrangler flow.
• Driver – com.simba.spark.jdbc.Driver.
• JDBC URL – The URL of the Databricks database. The URL formatting can vary between Databricks
instances. For information about finding the URL and specifying the parameters within it, see
JDBC configuration and connection parameters. The following is an example of how a URL can be
formatted: jdbc:spark://aws-sagemaker-datawrangler.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3122619508517275/0909-200301-cut318;AuthMech=3;UID=token;PWD=personal-access-token.
Note
You can specify a secret ARN that contains the JDBC URL instead of specifying the
JDBC URL itself. The secret must contain a key-value pair with the following format:
jdbcURL:JDBC-URL. For more information, see What is Secrets Manager?.
7. Specify a SQL SELECT statement.
Note
Data Wrangler doesn't support Common Table Expressions (CTE) or temporary tables within
a query.
8. For Sampling, choose a sampling method.
9. Choose Run.
10. (Optional) For the PREVIEW, choose the gear to open the Partition settings.
The gear for the additional settings is located to the far right of the PREVIEW title.
• Specify the number of partitions. You can partition by column if you specify the number of
partitions:
• Upper bound – From the values in the column that you've specified, the upper bound is the
value that you're using in the partition. The value that you specify doesn't change the data
that you're importing. It only affects the speed of the import. For the best performance,
specify an upper bound that's close to the column's maximum.
• Lower bound – From the values in the column that you've specified, the lower bound is the
value that you're using in the partition. The value that you specify doesn't change the data
that you're importing. It only affects the speed of the import. For the best performance,
specify a lower bound that's close to the column's minimum.
11. Choose Import.
With Snowflake as a data source in Data Wrangler, you can quickly connect to Snowflake without writing
a single line of code. You can join your data in Snowflake with data from any other data source in Data
Wrangler.
Once connected, you can interactively query data stored in Snowflake, transform data with more than
300 preconfigured data transformations, understand data and identify potential errors and extreme
values with a set of robust preconfigured visualization templates, quickly identify inconsistencies in your
data preparation workflow, and diagnose issues before models are deployed into production. Finally, you
can export your data preparation workflow to Amazon S3 for use with other SageMaker features such as
Amazon SageMaker Autopilot, Amazon SageMaker Feature Store and Amazon SageMaker Model Building
Pipelines.
You can encrypt the output of your queries using an AWS Key Management Service key that you've
created. For more information about AWS KMS, see AWS Key Management Service.
Topics
• Administrator Guide (p. 1013)
• Data Scientist Guide (p. 1026)
Administrator Guide
Important
To learn more about granular access control and best practices, see Security Access Control.
This section is for Snowflake administrators who are setting up access to Snowflake from within
SageMaker Data Wrangler.
Important
You are responsible for managing and monitoring the access control within Snowflake. This
includes what data a user can access, what storage integration a user can use, and what queries
a user can run. Data Wrangler does not add a layer of access control with respect to Snowflake.
Access control includes the following:
Data Wrangler does not add a layer of access control to Snowflake. For more information, see
Configure Snowflake Data Import Permissions (p. 1014).
Important
Note that granting monitor privileges can permit users to see details within an object, such as
queries or usage within a warehouse.
Snowflake requires the following permissions on an S3 bucket and directory to be able to access files in
the directory:
• s3:GetObject
• s3:GetObjectVersion
• s3:ListBucket
• s3:ListObjects
• s3:GetBucketLocation
You must create an IAM policy to configure access permissions for Snowflake to load and unload data
from an Amazon S3 bucket.
The following is the JSON policy document that you use to create the policy:
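The exact policy document isn't reproduced here; the following is a sketch of its likely shape, based on
the permissions listed above (the bucket name and prefix are placeholders):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::bucket/prefix/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::bucket"
        }
    ]
}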
For information and procedures about creating policies with policy documents, see Creating IAM policies.
For documentation that provides an overview of using IAM permissions with Snowflake, see the
following resources:
• What is IAM?
• Create the IAM Role in AWS
• Create a Cloud Storage Integration in Snowflake
• Retrieve the AWS IAM User for your Snowflake Account
• Grant the IAM User Permissions to Access Bucket.
To grant the data scientist's Snowflake role usage permission to the storage integration, you must run
GRANT USAGE ON INTEGRATION integration_name TO ROLE snowflake_role;.
Instead of having your users directly enter their credentials into Data Wrangler, you can have them use
an identity provider to access Snowflake. The following are links to the Snowflake documentation for the
identity providers that Data Wrangler supports.
• Azure AD
• Okta
• Ping Federate
Use the documentation from the preceding links to set up access to your identity provider. The
information and procedures in this section help you understand how to properly use the documentation
to access Snowflake within Data Wrangler.
Your identity provider needs to recognize Data Wrangler as an application. Use the following procedure
to register Data Wrangler as an application within the identity provider:
1. Select the configuration that starts the process of registering Data Wrangler as an application.
2. Provide the users within the identity provider access to Data Wrangler.
3. Turn on OAuth client authentication by storing the client credentials as an AWS Secrets Manager
secret.
4. Specify a redirect URL using the following format: https://Domain-ID.studio.AWS-Region.sagemaker.aws/jupyter/default/lab
Important
You're specifying the Amazon SageMaker Domain ID and AWS Region that you're using to
run Data Wrangler.
Important
You must register a URL for each Amazon SageMaker Domain and AWS Region where you're
running Data Wrangler. Users from a Domain and AWS Region that don't have redirect
URLs set up for them won't be able to authenticate with the identity provider to access the
Snowflake connection.
5. Make sure that the authorization code and refresh token grant types are allowed for the Data
Wrangler application.
Within your identity provider, you must set up a server that sends OAuth tokens to Data Wrangler at the
user level. The server sends the tokens with Snowflake as the audience.
Snowflake uses the concept of roles that are distinct from the IAM roles used in AWS. You must configure
the identity provider to use ANY Role so that Data Wrangler uses the default role associated with the Snowflake account. For
example, if a user has systems administrator as the default role in their Snowflake profile, the
connection from Data Wrangler to Snowflake uses systems administrator as the role.
To set up the server, do the following. You're working within Snowflake for all steps except the last one.
Important
Data Wrangler doesn't support rotating refresh tokens. Using rotating refresh tokens might
result in access failures or users needing to log in frequently.
Important
If the refresh token expires, your users must reauthenticate by accessing the connection that
they've made to Snowflake through Data Wrangler.
After you've set up the OAuth provider, you provide Data Wrangler with the information it needs to
connect to the provider. You can use the documentation from your identity provider to get values for the
following fields:
• Token URL – The URL of the token that the identity provider sends to Data Wrangler.
• Authorization URL – The URL of the authorization server of the identity provider.
• Client ID – The ID of the identity provider.
• Client secret – The secret that only the authorization server or API recognizes.
• (Azure AD only) The OAuth scope credentials that you've copied.
You store the fields and values in an AWS Secrets Manager secret and add it to the Amazon SageMaker
Studio Lifecycle Configuration that you're using for Data Wrangler. A Lifecycle Configuration is a shell
script. Use it to make the Amazon Resource Name (ARN) of the secret accessible to Data Wrangler. For
information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information
about using lifecycle configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker
Studio (p. 182).
Important
Before you create a Secrets Manager secret, make sure that the SageMaker execution role that
you're using for Amazon SageMaker Studio has permissions to create and update secrets in
Secrets Manager. For more information about adding permissions, see Example: Permission to
create secrets.
For Okta and Ping Federate, the following is the format of the secret:
{
    "token_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id": "example-client-id",
    "client_secret": "example-client-secret",
    "identity_provider": "OKTA"|"PING_FEDERATE",
    "authorization_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"
}
For Azure AD, the following is the format of the secret:
{
    "token_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id": "example-client-id",
    "client_secret": "example-client-secret",
    "identity_provider": "AZURE_AD",
    "authorization_url": "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize",
    "datasource_oauth_scope": "api://appuri/session:role-any"
}
You must have a Lifecycle Configuration that uses the Secrets Manager secret that you've created.
You can either create the Lifecycle Configuration or modify one that has already been created. The
configuration must use the following script.
#!/bin/bash
set -eux
## Script Body
For information about setting up Lifecycle Configurations, see Creating and Associating a Lifecycle
Configuration (p. 183).
Private Connectivity between Data Wrangler and Snowflake via AWS PrivateLink
This section explains how to use AWS PrivateLink to establish a private connection between Data
Wrangler and Snowflake. The steps are explained in the following sections.
Create a VPC
If you do not have a VPC set up, then follow the Create a new VPC instructions to create one.
Once you have chosen the VPC that you would like to use for establishing a private connection, provide the
following information to your Snowflake Administrator to enable AWS PrivateLink:
• VPC ID
• AWS Account ID
• The account URL that you use to access Snowflake
Important
As described in Snowflake's documentation, enabling your Snowflake account can take up to
two business days.
After AWS PrivateLink is activated, retrieve the AWS PrivateLink configuration for your Region. Log in to
your Snowflake console and run the following command in a worksheet:

select SYSTEM$GET_PRIVATELINK_CONFIG();

The command returns values similar to the following:

privatelink-account-name: xxxxxxxx.region.privatelink
privatelink-vpce-id: com.amazonaws.vpce.region.vpce-svc-xxxxxxxxxxxxxxxxx
privatelink-account-url: xxxxxxxx.region.privatelink.snowflakecomputing.com
privatelink_ocsp-url: ocsp.xxxxxxxx.region.privatelink.snowflakecomputing.com
5. In the Service Name field, paste in the value for privatelink-vpce-id that you retrieved in the
preceding step and choose Verify.
If the connection is successful, a green alert saying Service name found appears, and the VPC and
Subnet options automatically expand. Depending on your targeted Region, the resulting screen may
show another AWS Region name.
6. Select the same VPC ID that you sent to Snowflake from the VPC dropdown list.
7. If you have not yet created a subnet, create one as described in the next step.
8. Select Subnets from the VPC dropdown list. Then select Create subnet and follow the prompts to
create a subnet in your VPC. Ensure that you select the VPC ID that you sent to Snowflake.
9. Under Security Group Configuration, select Create New Security Group to open the default Security
Group screen in a new tab. In this new tab, select Create Security Group.
10.Provide a name for the new security group (such as datawrangler-doc-snowflake-
privatelink-connection) and a description. Be sure to select the VPC ID you have used in
previous steps.
11.Add two rules to allow traffic from within your VPC to this VPC endpoint: one for HTTPS and one for
HTTP.
Navigate to your VPC under Your VPCs in a separate tab, and retrieve the CIDR block for your VPC.
Then choose Add Rule in the Inbound Rules section. Select HTTPS for the type, leave the Source as
Custom in the form, and paste in the CIDR block that you retrieved (such as 10.0.0.0/16). Repeat
these steps for HTTP.
12.Choose Create Security Group. Retrieve the Security Group ID from the newly created security group
(such as sg-xxxxxxxxxxxxxxxxx).
13.In the VPC Endpoint configuration screen, remove the default security group. Paste in the security
group ID in the search field and select the checkbox.
Retrieve the topmost record in the DNS names list. This can be differentiated from other DNS names
because it only includes the Region name (such as us-west-2), and no Availability Zone letter
notation (such as us-west-2a). Store this information for later use.
This section explains how to configure DNS for Snowflake endpoints in your VPC. This allows your VPC to
resolve requests to the Snowflake AWS PrivateLink endpoint.
c. In the VPCs to associate with the hosted zone section, select the Region in which your VPC is
located and the VPC ID used in previous steps.
This section explains how to configure Route 53 resolver inbound endpoints for your VPC.
• Choose Create Security Group. Note the Security Group ID, because a later step adds a rule that allows traffic to
the VPC endpoint security group.
3. Navigate to the Route 53 menu within your AWS console.
• In the Resolver section, select the Inbound Endpoint option.
4. Choose Create Inbound Endpoint.
• Provide an endpoint name.
• From the VPC in the Region dropdown list, select the VPC ID you have used in all previous steps.
• In the Security group for this endpoint dropdown list, select the security group ID from Step 2 in
this section.
• In the IP Address section, select an Availability Zone, select a subnet, and keep Use an IP address
that is selected automatically selected for each IP address.
• Choose Submit.
This section explains how to create VPC endpoints for the following: Amazon SageMaker Studio,
SageMaker Notebooks, the SageMaker API, SageMaker Runtime, and Amazon SageMaker Feature Store
Runtime.
• The HTTP connection to the security group you provisioned for the Snowflake PrivateLink
connection in the Set up the Snowflake PrivateLink Integration step.
• The UDP and TCP for DNS (port 53) connection to the Route 53 Resolver inbound endpoint security
group you created in step 2 of Configure Route 53 Resolver Inbound Endpoint for your VPC.
f. Choose the Create Security Group button in the lower-right corner.
2. Configure Studio.
• Navigate to the SageMaker menu in the AWS console.
• From the left navigation pane, select the SageMaker Studio option.
• If you do not have any domains configured, the Get Started menu is present.
• Select the Standard Setup option from the Get Started menu.
• Under Authentication method, select AWS Identity and Access Management (IAM).
• From the Permissions menu, you can create a new role or use a pre-existing role, depending on your
use case.
• If you choose Create a new role, you are presented the option to provide an S3 bucket name, and
a policy is generated for you.
• If you already have a role created with permissions for the S3 buckets to which you require access,
select the role from the dropdown list. This role should have the AmazonSageMakerFullAccess
policy attached to it.
• Select the Network and Storage dropdown list to configure the VPC, security, and subnets
SageMaker uses.
• Under VPC, select the VPC in which your Snowflake PrivateLink connection exists.
• Under Subnet(s), select the subnets which have access to the Snowflake PrivateLink connection.
• Under Network Access for Studio, select VPC Only.
• Under Security Group(s), select the security group you created in step 1.
• Choose Submit.
3. Edit the SageMaker security group.
• Create the following inbound rules:
• Port 2049 to the inbound and outbound NFS Security Groups created automatically by
SageMaker in step 2 (the security group names contain the Studio domain ID).
• Access to all TCP ports to itself (required for SageMaker for VPC Only).
4. Edit the VPC Endpoint Security Groups:
• Navigate to the Amazon EC2 menu in the AWS console.
• Locate the security group you created in a preceding step.
• Add an inbound rule allowing for HTTPS traffic from the security group created in step 1.
5. Create a user profile.
• From the SageMaker Studio Control Panel, choose Add User.
• Provide a user name.
• For the Execution Role, choose to create a new role or to use a pre-existing role.
• If you choose Create a new role, you are presented the option to provide an Amazon S3 bucket
name, and a policy is generated for you.
• If you already have a role created with permissions to the Amazon S3 buckets to which
you require access, select the role from the dropdown list. This role should have the
AmazonSageMakerFullAccess policy attached to it.
• Choose Submit.
6. Create a data flow (follow the data scientist guide outlined in a preceding section).
• When adding a Snowflake connection, enter the value of privatelink-account-name (from the
Set up Snowflake PrivateLink Integration step) into the Snowflake account name (alphanumeric)
field, instead of the plain Snowflake account name. Everything else is left unchanged.
1. To allow your data scientist to access Snowflake from SageMaker Data Wrangler, provide them with
one of the following:
• For Basic Authentication, a Snowflake account name, user name, and password.
• For OAuth, a user name and password in the identity provider.
• For ARN, a secret created with AWS Secrets Manager and the Amazon Resource Name (ARN) of the
secret. If you choose this option, use the information that follows to create the secret for Snowflake.
Important
If your data scientists use the Snowflake Credentials (User name and Password) option
to connect to Snowflake, you can use Secrets Manager to store the credentials in a
secret. Secrets Manager rotates secrets as part of a best practice security plan. The
secret created in Secrets Manager is only accessible with the Studio role configured
when you set up a Studio user profile. This requires you to add this permission,
secretsmanager:PutResourcePolicy, to the policy that is attached to your Studio
role.
We strongly recommend that you scope the role policy to use different roles for different
groups of Studio users. You can add additional resource-based permissions for the Secrets
Manager secrets. See Manage Secret Policy for condition keys you can use.
For information about creating a secret, see Create a secret. You're charged for the secrets
that you create.
2. Provide the data scientist with the name of the storage integration you created in Step 3: Create
a Cloud Storage Integration in Snowflake. This is the name of the new integration and is called
integration_name in the CREATE INTEGRATION SQL command you ran, which is shown in the
following snippet:
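The command has the following general shape in Snowflake (a sketch; the integration name, role ARN, and bucket are placeholders):

CREATE STORAGE INTEGRATION integration_name
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://bucket/path/');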
Data Scientist Guide
You must use Studio version 1.3.0 or later. Use the following procedure to open Amazon SageMaker
Studio and see which version you're running.
1. Use the steps in Prerequisites (p. 983) to access Data Wrangler through Amazon SageMaker Studio.
2. Next to the user you want to use to launch Studio, select Launch app.
3. Choose Studio.
4. After Studio loads, select File, then New, and then Terminal.
5. Enter the following command to print the version of your Studio instance. You must have Studio
version 1.3.0 or later to use Snowflake.

cat /opt/conda/share/jupyter/lab/staging/yarn.lock | grep -A 1 "@amzn/sagemaker-ui-data-prep-plugin@"
You can update Amazon SageMaker Studio from within the AWS Management Console. For more
information about updating Studio, see Amazon SageMaker Studio UI Overview (p. 129).
You can connect to Snowflake in one of the following ways:
• Specifying your Snowflake credentials (account name, user name, and password) in Data Wrangler.
• Providing an Amazon Resource Name (ARN) of a secret containing the credentials.
• Using an open standard for access delegation (OAuth) provider that connects to Snowflake. Your
administrator can give you access to one of the following OAuth providers:
• Azure AD
• Okta
• Ping Federate
Talk to your administrator about the method that you need to use to connect to Snowflake.
The following sections have information about how you can connect to Snowflake using the preceding
methods.
To import a dataset into Data Wrangler from Snowflake using your credentials
• Snowflake account name (alphanumeric) – The full name of the Snowflake account.
• Username – The username that you use to access the account.
• Password – The password associated with the username.
• Storage integration – Your administrator provides you with the storage integration
information. It's the configuration that specifies the IAM role that Snowflake uses to save the
query results to an Amazon S3 bucket.
• Connection name – The name that you're specifying to uniquely identify the connection.
• (Optional) KMS key ID – A KMS key that you've created. You can specify its ARN to encrypt the
output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.
12. Choose Connect.
• Secrets Manager ARN – The ARN of the AWS Secrets Manager secret used to store the
credentials used to connect to Snowflake.
• Storage integration – Your administrator provides you with the storage integration
information. It's the configuration that specifies the IAM role that Snowflake uses to save the
query results to an Amazon S3 bucket.
• Connection name – The name that you're specifying to uniquely identify the connection.
• (Optional) KMS key ID – A KMS key that you've created. It's used to encrypt the output of the
Snowflake query.
12. Choose Connect.
To import a dataset into Data Wrangler from Snowflake using OAuth
• Connection name – The name that you're specifying to uniquely identify the connection.
• Snowflake account name (alphanumeric) – The full name of the Snowflake account.
• (Optional) KMS key ID – A KMS key that you've created. You can specify its ARN to encrypt the
output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.
12. Choose Connect.
You can begin the process of importing your data from Snowflake after you've connected to it.
Within Data Wrangler, you can view your data warehouses, databases, and schemas, along with the eye
icon with which you can preview your table. After you select the Preview Table icon, the schema preview
of that table is generated. You must select a warehouse before you can preview a table.
Important
If you're importing a dataset with columns of type TIMESTAMP_TZ or TIMESTAMP_LTZ, add
::string to the column names of your query. For more information, see How To: Unload
TIMESTAMP_TZ and TIMESTAMP_LTZ data to a Parquet file.
After you select a data warehouse, database, and schema, you can write queries and run them. The
output of your query shows under Query results.
After you have settled on the output of your query, you can import it into a Data Wrangler flow to
perform data transformations.
After you've queried your data, navigate to the Data flow screen to start transforming your data.
Data Wrangler supports transferring data from the following SaaS platforms:
• Amplitude
• CircleCI
• DocuSign Monitor
• Domo
• Datadog
• Dynatrace
• Facebook Ads
• Facebook Page Insights
• Google Ads
• Google Analytics 4
• Google Search Console
• GitHub
• GitLab
• Infor Nexus
• Instagram Ads
• Jira Cloud
• LinkedIn Ads
• Mailchimp
• Marketo
• Microsoft Teams
• Mixpanel
• Okta
• Salesforce
• Salesforce Marketing Cloud
• Salesforce Pardot
• SAP OData
• SendGrid
• ServiceNow
• Singular
• Slack
• Stripe
• Trend Micro
• Typeform
• Veeva
• Zendesk
• Zendesk Chat
• Zendesk Sell
• Zendesk Sunshine
• Zoom Meetings
The preceding list has links to more information about setting up your data source. You or your
administrator can refer to the preceding links after you've read the following information.
When you navigate to the Import tab of your Data Wrangler flow, you see data sources under the
following sections:
• Available
• Set up data sources
You can connect to data sources under Available without needing additional configuration. You can
choose the data source and import your data.
Data sources under Set up data sources require you or your administrator to use Amazon AppFlow to
transfer the data from the SaaS platform to Amazon S3 or Amazon Redshift. For information about
performing a transfer, see Using Amazon AppFlow to transfer your data (p. 1031).
After you perform the data transfer, the SaaS platform appears as a data source under Available. You
can choose it and import the data that you've transferred into Data Wrangler. The data that you've
transferred appears as tables that you can query.
After you've added permissions, you can transfer the data. Within Amazon AppFlow, you create a flow
to transfer the data. A flow is a series of configurations. You can use it to specify whether you're running
the data transfer on a schedule or whether you're partitioning the data into separate files. After you've
configured the flow, you run it to transfer the data.
For information about creating a flow, see Creating flows in Amazon AppFlow. For information about
running a flow, see Activate an Amazon AppFlow flow.
After the data has been transferred, use the following procedure to access the data in Data Wrangler.
Important
Before you try to access your data, make sure your IAM role has the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
}
By default, the IAM role that you use to access Data Wrangler is the
SageMakerExecutionRole. For more information about adding policies, see Adding IAM
identity permissions (console).
a. Choose a Workgroup.
b. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a
workgroup, specify a value for Amazon S3 location of query results.
c. (Optional) For Data retention period, select the checkbox to set a data retention period and
specify the number of days to store the data before it's deleted.
d. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the
checkbox and not save the connection.
12. Choose Connect.
13. Specify a query.
Note
To help you specify a query, you can choose a table on the left-hand navigation panel. Data
Wrangler shows the table name and a preview of the table. Choose the icon next to the
table name to copy the name. You can use the table name in the query.
14. Choose Run.
15. Choose Import query.
16. For Dataset name, specify the name of the dataset.
17. Choose Add.
When you navigate to the Import data screen, you can see the connection that you've created. You can
use the connection to import more data.
When you query data from Amazon Athena or Amazon Redshift, the queried dataset is automatically
stored in Amazon S3. Data is stored in the default SageMaker S3 bucket for the AWS Region in which you
are using Studio.
Data Wrangler flows depend on this Amazon S3 dataset location, so you should not modify this dataset
in Amazon S3 while you are using a dependent flow. If you do modify this S3 location, and you want to
continue using your data flow, you must remove all objects in trained_parameters in your .flow file.
To do this, download the .flow file from Studio and for each instance of trained_parameters, delete
all entries. When you are done, trained_parameters should be an empty JSON object:
"trained_parameters": {}
When you export and use your data flow to process your data, the .flow file you export refers to this
dataset in Amazon S3. Use the following sections to learn more.
The dataset that results from your Amazon Redshift query is stored under the following prefix
(directory): redshift/uuid/data/, where uuid is a unique identifier that gets created for each query.
The dataset you import by selecting Import dataset is stored in Parquet format in Amazon S3.
Preview files are written in CSV format when you select Run on the Athena import screen, and contain
up to 100 rows from your queried dataset.
The dataset you query is located under the prefix (directory): athena/uuid/data/, where uuid is a
unique identifier that gets created for each query.
The subset of the dataset that is stored to preview dataframes in Data Wrangler is stored under the
prefix: athena/.
Create and Use a Data Wrangler Flow
Instances
When you create a Data Wrangler flow in Amazon SageMaker Studio, Data Wrangler uses an Amazon
EC2 instance to run the analyses and transformations in your flow. By default, Data Wrangler uses
the m5.4xlarge instance. m5 instances are general purpose instances that provide a balance between
compute and memory. You can use m5 instances for a variety of compute workloads.
Data Wrangler also gives you the option of using r5 instances. r5 instances are designed to deliver fast
performance that processes large datasets in memory.
We recommend that you choose an instance that is best optimized around your workloads. For example,
the r5.8xlarge might have a higher price than the m5.4xlarge, but the r5.8xlarge might be better
optimized for your workloads. With better optimized instances, you can run your data flows in less time
at lower cost.
The following table shows the instances that you can use to run your Data Wrangler flow.
Instance type      vCPU    Memory
ml.m5.4xlarge      16      64 GiB
For more information about r5 instances, see Amazon EC2 R5 Instances. For more information about m5
instances, see Amazon EC2 M5 Instances.
Each Data Wrangler flow has an Amazon EC2 instance associated with it. You might have multiple flows
that are associated with a single instance.
For each flow file, you can seamlessly switch the instance type. If you switch the instance type, the
instance that you used to run the flow continues to run.
Use the following procedure to switch the instance type.
1. Choose the home icon.
2. Navigate to the instance that you're using and choose it.
3. Choose the instance type that you want to use.
4. Choose Save.
You are charged for all running instances. To avoid incurring additional charges, shut down the instances
that you aren't using manually. To shut down an instance that is running, use the following procedure.
1. Choose the RUNNING INSTANCES icon.
2. Choose Shut down next to the instance that you want to shut down.
If you shut down an instance used to run a flow, you temporarily can't access the flow. If you get an error
while attempting to open a flow whose instance you previously shut down, wait approximately 5 minutes
and try opening it again.
When you export your data flow to a location such as Amazon Simple Storage Service or Amazon
SageMaker Feature Store, Data Wrangler runs an Amazon SageMaker processing job. You can use one
of the following instances for the processing job. For more information on exporting your data, see
Export (p. 1116).
Instance type      vCPU    Memory
ml.m5.4xlarge      16      64 GiB
For more information about the cost per hour for using the available instance types, see SageMaker
Pricing.
Each time you add a transform step, you create a new dataframe. When multiple transform steps (other
than Join or Concatenate) are added to the same dataset, they are stacked.
Join and Concatenate create standalone steps that contain the new joined or concatenated dataset.
The following diagram shows a data flow with a join between two datasets, as well as two stacks of
steps. The first stack (Steps (2)) adds two transforms to the type inferred in the Data types dataset. The
downstream stack, or the stack to the right, adds transforms to the dataset resulting from a join named
demo-join.
The small, gray box in the bottom right corner of the data flow provides an overview of the number of
stacks and steps in the flow and the layout of the flow. The lighter box inside the gray box indicates the
steps that are within the UI view. You can use this box to see sections of your data flow that fall outside
of the UI view. Use the fit screen icon to fit all steps and datasets into your UI view.
The bottom left navigation bar includes icons that you can use to zoom in and out of your data flow and
to resize the data flow to fit the screen. Use the lock icon to lock and unlock the location of each step on
the screen.
• Edit data types (For a Data types step only): If you have not added any transforms to a Data types
step, you can select Edit data types to update the data types Data Wrangler inferred when importing
your dataset.
• Add transform: Adds a new transform step. See Transform Data (p. 1058) to learn more about the
data transformations you can add.
• Add analysis: Adds an analysis. You can use this option to analyze your data at any point in the data
flow. When you add one or more analyses to a step, an analysis icon appears on that step. See
Analyze and Visualize (p. 1101) to learn more about the analyses you can add.
• Join: Joins two datasets and adds the resulting dataset to the data flow. To learn more, see Join
Datasets (p. 1064).
• Concatenate: Concatenates two datasets and adds the resulting dataset to the data flow. To learn
more, see Concatenate Datasets (p. 1064).
To delete a step from a stack of steps, select the stack and then select the step you want to delete.
You can use one of the following procedures to delete a step without deleting the downstream steps.
You can delete an individual step for nodes in your data flow that have a single input. You can't
delete individual steps for source, join, and concatenate nodes.
Use the following procedure to delete a step in the Data Wrangler flow.
1. Choose the group of steps that has the step that you're deleting.
2. Choose the icon next to the step.
3. Choose Delete step.
You can delete an individual step for nodes in your data flow that have a single input. You can't
delete individual steps for source, join, and concatenate nodes.
1. Choose the step and open the table view for the step.
2. Move your cursor over the step so the ellipsis icon appears.
3. Choose the icon next to the step.
4. Choose Delete.
There are many ways that you can edit a step. Some examples include changing the imputation method
or changing the threshold for considering a value to be an outlier.
1. Choose a step in the Data Wrangler flow to open the table view.
Note
You can use the shared spaces within your Amazon SageMaker Domain to work collaboratively
on your Data Wrangler flows. Within a shared space, you and your collaborators can edit a flow
file in real-time. However, neither you nor your collaborators can see the changes in real-time.
When anyone makes a change to the Data Wrangler flow, they must save it immediately. When
someone saves a file, a collaborator won’t be able to see it unless they close the file and reopen
it. Any changes that aren’t saved by one person are overwritten by the person who saved their
changes.
Get Insights On Data and Data Quality
Topics
• Summary (p. 1045)
• Target column (p. 1047)
• Quick model (p. 1050)
• Feature summary (p. 1052)
• Samples (p. 1054)
• Definitions (p. 1055)
You can either download the report or view it online. To download the report, choose the download
button at the top right corner of the screen.
Summary
The insights report has a brief summary of the data that includes general information such as missing
values, invalid values, feature types, outlier counts, and more. It can also include high severity warnings
that point to probable issues with the data. We recommend that you investigate the warnings.
Target column
When you create the data quality and insights report, Data Wrangler gives you the option to select a
target column. A target column is a column that you're trying to predict. When you choose a target
column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the
order of their predictive power. When you select a target column, you must specify whether you’re trying
to solve a regression or a classification problem.
For classification, Data Wrangler shows a table and a histogram of the most common classes. A class is a
category. It also presents observations, or rows, with a missing or invalid target value.
The following image shows an example target column analysis for a classification problem.
For regression, Data Wrangler shows a histogram of all the values in the target column. It also presents
observations, or rows, with a missing, invalid, or outlier target value.
The following image shows an example target column analysis for a regression problem.
Quick model
The Quick model provides an estimate of the expected prediction quality of a model that you train on
your data.
Data Wrangler splits your data into training and validation folds. It uses 80% of the samples for training
and 20% for validation. For classification, Data Wrangler uses a stratified split, so each data partition
has the same ratio of labels. For classification problems, it's important to have the same ratio of labels
between the training and validation folds. Data Wrangler trains an XGBoost model with the default
hyperparameters. It applies early stopping on the validation data and performs minimal feature
preprocessing.
For classification models, Data Wrangler returns both a model summary and a confusion matrix.
The following is an example of a classification model summary. To learn more about the information that
it returns, see Definitions (p. 1055).
The following is an example of a confusion matrix that the quick model returns.
The confusion matrix gives you the following information:
• The number of times the predicted label matches the true label.
• The number of times the predicted label doesn't match the true label.
The true label represents an actual observation in your data. For example, if you're using a model to
detect fraudulent transactions, the true label represents a transaction that is actually fraudulent or non-
fraudulent. The predicted label represents the label that your model assigns to the data.
You can use the confusion matrix to see how well the model predicts the presence or the absence of a
condition. If you're predicting fraudulent transactions, you can use the confusion matrix to get a sense of
both the sensitivity and the specificity of the model. The sensitivity refers to the model's ability to detect
fraudulent transactions. The specificity refers to the model's ability to avoid detecting non-fraudulent
transactions as fraudulent.
The following is an example of the quick model outputs for a regression problem.
Feature summary
When you specify a target column, Data Wrangler orders the features by their prediction power.
Prediction power is measured on the data after it was split into 80% training and 20% validation folds.
Data Wrangler fits a model for each feature separately on the training fold. It applies minimal feature
preprocessing and measures prediction performance on the validation data.
It normalizes the scores to the range [0,1]. Higher prediction scores indicate columns that are more
useful for predicting the target on their own. Lower scores point to columns that aren’t predictive of the
target column.
It’s uncommon for a column that isn’t predictive on its own to be predictive when it’s used in tandem
with other columns. You can confidently use the prediction scores to determine whether a feature in your
dataset is predictive.
A low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities,
which often indicates target leakage. Target leakage usually happens when the dataset contains a
column that isn’t available at the prediction time. For example, it could be a duplicate of the target
column.
The following are examples of the table and the histogram that show the prediction value of each
feature.
Samples
Data Wrangler provides information about whether your samples are anomalous or if there are
duplicates in your dataset.
Data Wrangler detects anomalous samples using the isolation forest algorithm. The isolation forest
associates an anomaly score with each sample (row) of the dataset. Low anomaly scores indicate
anomalous samples. High scores are associated with non-anomalous samples. Samples with a negative
anomaly score are usually considered anomalous, and samples with a positive anomaly score are
considered non-anomalous.
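For intuition, the following is a minimal sketch of the same sign convention using scikit-learn's IsolationForest; it's an illustration, not Data Wrangler's implementation:

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: a tight cluster plus one far-away point
X = np.array([[0.0], [0.1], [0.2], [0.1], [10.0]])

forest = IsolationForest(random_state=0).fit(X)
scores = forest.decision_function(X)  # negative scores flag likely anomalies
print(scores)                         # the last sample gets the lowest score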
When you look at a sample that might be anomalous, we recommend that you pay attention to
unusual values. For example, you might have anomalous values that result from errors in gathering and
processing the data. The following is an example of the most anomalous samples according to the Data
Wrangler’s implementation of the isolation forest algorithm. We recommend using domain knowledge
and business logic when you examine the anomalous samples.
Data Wrangler detects duplicate rows and calculates the ratio of duplicate rows in your data. Some data
sources could include valid duplicates. Other data sources could have duplicates that point to problems
in data collection. Duplicate samples that result from faulty data collection could interfere with machine
learning processes that rely on splitting the data into independent training and validation folds.
The following are elements of the insights report that can be impacted by duplicated samples:
• Quick model
You can remove duplicate samples from the dataset using the Drop duplicates transform under Manage
rows. Data Wrangler shows you the most frequently duplicated rows.
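As a quick check outside Data Wrangler, you can estimate the duplicate ratio of a pandas dataframe with a couple of lines (a sketch, assuming your data is already loaded into a dataframe df):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
dup_ratio = df.duplicated().mean()  # fraction of rows that repeat an earlier row
deduped = df.drop_duplicates()      # analogous to the Drop duplicates transform
print(dup_ratio, len(deduped))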
Definitions
The following are definitions for the technical terms that are used in the data insights report.
Feature types
The following are the definitions for each of the feature types:
• Numeric – Numeric values can be either floats or integers, such as age or income. The machine
learning models assume that numeric values are ordered and a distance is defined over them. For
example, 3 is closer to 4 than to 10 and 3 < 4 < 10.
• Categorical – The column entries belong to a set of unique values, which is usually much smaller
than the number of entries in the column. For example, a column of length 100 could contain
the unique values Dog, Cat, and Mouse. The values could be numeric, text, or a combination of
both. Horse, House, 8, Love, and 3.1 would all be valid values and could be found in the same
categorical column. The machine learning model does not assume order or distance on the values
of categorical features, as opposed to numeric features, even when all the values are numbers.
• Binary – Binary features are a special categorical feature type in which the cardinality of the set of
unique values is 2.
• Text – A text column contains many non-numeric unique values. In extreme cases, all the elements
of the column are unique and no two entries are the same.
• Datetime – A datetime column contains information about the date or time. It can have
information about both the date and time.
Feature statistics
• Prediction power – Prediction power measures how useful the column is in predicting the target.
• Outliers (in numeric columns) – Data Wrangler detects outliers using two statistics that are robust
to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature
values to the range [5th percentile, 95th percentile] and calculating the standard deviation of the
clipped vector. All values larger than median + 5 * RSTD or smaller than median - 5 * RSTD are
considered to be outliers. (A short sketch of these statistics appears after this list.)
• Skew (in numeric columns) – Skew measures the symmetry of the distribution and is defined as
the third moment of the distribution divided by the third power of the standard deviation. The
skewness of the normal distribution or any other symmetric distribution is zero. Positive values
imply that the right tail of the distribution is longer than the left tail. Negative values imply that
the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is
considered skewed when the absolute value of the skew is larger than 3.
• Kurtosis (in numeric columns) – Pearson's kurtosis measures the heaviness of the tail of the
distribution. It's defined as the fourth moment of the distribution divided by the square of the
second moment. The kurtosis of the normal distribution is 3. Kurtosis values lower than 3 imply
that the distribution is concentrated around the mean and the tails are lighter than the tails of the
normal distribution. Kurtosis values higher than 3 imply heavier tails or outliers.
• Missing values – Null-like objects, empty strings and strings composed of only white spaces are
considered missing.
• Valid values for numeric features or regression target – All values that you can cast to finite
floats are valid. Missing values are not valid.
• Valid values for categorical, binary, or text features, or for classification target – All values that
are not missing are valid.
• Datetime features – All values that you can cast to a datetime object are valid. Missing values are
not valid.
• Invalid values – Values that are either missing or you can't properly cast. For example, in a
numeric column, you can't cast the string "six" or a null value.
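The following is a minimal NumPy/SciPy sketch of the outlier, skew, and kurtosis definitions above; it's an illustration, not Data Wrangler's implementation (note that SciPy's kurtosis needs fisher=False to match Pearson's definition used here):

import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(size=1000)

# Outliers: clip to the [5th, 95th] percentile range, then use median +/- 5 * RSTD
lo, hi = np.percentile(x, [5, 95])
rstd = np.clip(x, lo, hi).std()          # robust standard deviation (RSTD)
med = np.median(x)
outliers = (x < med - 5 * rstd) | (x > med + 5 * rstd)

skew = stats.skew(x)                     # third standardized moment; near 0 here
kurt = stats.kurtosis(x, fisher=False)   # Pearson's kurtosis; near 3 for normal data
print(outliers.sum(), skew, kurt)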
The following are the definitions for the quick model metrics:
• Accuracy – Accuracy is the ratio of samples that are predicted accurately. Accuracy is in the range
[0, 1]. 0 is the score of the model that predicts all samples incorrectly and 1 is the score of the
perfect model.
• Balanced accuracy – Balanced accuracy is the ratio of samples that are predicted accurately when
the class weights are adjusted to balance the data. All classes are given the same importance,
regardless of their frequency. Balanced accuracy is in the range [0, 1]. 0 is the score of the model
that predicts all samples wrong. 1 is the score of the perfect model.
• AUC (binary classification) – This is the area under the receiver operating characteristic curve.
AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model
returns a score of 1.
• AUC (OVR) – For multiclass classification, this is the area under the receiver operating
characteristic curve calculated separately for each label using one versus rest. Data Wrangler
reports the average of the areas. AUC is in the range [0, 1] where a random model returns a score
of 0.5 and the perfect model returns a score of 1.
• Precision – Precision is defined for a specific class. Precision is the fraction of true positives out
of all the instances that the model classified as that class. Precision is in the range [0, 1]. 1 is the
score of the model that has no false positives for the class. For binary classification, Data Wrangler
reports the precision of the positive class.
• Recall – Recall is defined for a specific class. Recall is the fraction of the relevant class instances
that are successfully retrieved. Recall is in the range [0, 1]. 1 is the score of the model that
classifies all the instances of the class correctly. For binary classification, Data Wrangler reports the
recall of the positive class.
• F1 – F1 is defined for a specific class. It's the harmonic mean of the precision and recall. F1 is in the
range [0, 1]. 1 is the score of the perfect model. For binary classification, Data Wrangler reports
the F1 of the positive class.
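The following sketch computes several of these metrics with scikit-learn for a toy binary problem; Data Wrangler computes them for you, so this is only for illustration:

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
    f1_score, precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
y_prob = [0.2, 0.6, 0.9, 0.7, 0.4]  # predicted probability of the positive class

print(accuracy_score(y_true, y_pred))           # 0.6
print(balanced_accuracy_score(y_true, y_pred))  # classes weighted equally
print(precision_score(y_true, y_pred))          # positive-class precision
print(recall_score(y_true, y_pred))             # positive-class recall
print(f1_score(y_true, y_pred))                 # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))            # AUC from predicted probabilities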
Textual patterns
Patterns describe the textual format of a string using an easy-to-read format.
Data Wrangler infers the patterns by looking at samples of non-empty strings from your data. It can
describe many of the commonly used patterns. The confidence expressed as a percentage indicates
how much of the data is estimated to match the pattern. Using the textual pattern, you can see
which rows in your data you need to correct or drop.
The following describes the patterns that Data Wrangler can recognize:
A word character is either an underscore or a character that might appear in a word in any language.
For example, the strings 'Hello_word' and 'écoute' both consist of word characters. 'H' and 'é' are
both examples of word characters.
Automatically Train Models on Your Data Flow
When you train and tune a model, Data Wrangler exports your data to an Amazon S3 location where
Amazon SageMaker Autopilot can access it.
You can prepare and deploy a model by choosing a node in your Data Wrangler flow and choosing
Export and Train in the data preview. You can use this method to view your dataset before you choose to
train a model on it.
You can also train and deploy a model directly from your data flow.
The following procedure prepares and deploys a model from the data flow. For Data Wrangler flows
with multi-row transforms, you can't use the transforms from the Data Wrangler flow when you're
deploying the model. You can use the following procedure to process the data before you use it to
perform inference.
To train and deploy a model directly from your data flow, do the following.
Note
You can't deploy a model from a Data Wrangler flow that uses the following transforms:
• Join
• Concatenate
• Group by
If your flow uses a join, you can export the dataset from the join to Amazon S3, create a new
flow using the dataset that you've exported, and use that dataset to train and deploy a model.
13. Choose Next: Review and create.
14. Choose Create experiment.
For more information about model training and deployment, see Create an Amazon SageMaker
Autopilot experiment (p. 470). Autopilot shows you analyses about the best model's performance. For
more information about model performance, see View an Autopilot Model Performance Report (p. 498).
Transform Data
Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning,
transforming, and featurizing your data. When you add a transform, it adds a step to the data flow. Each
transform you add modifies your dataset and produces a new dataframe. All subsequent transforms
apply to the resulting dataframe.
Data Wrangler includes built-in transforms, which you can use to transform columns without any code.
You can also add custom transformations using PySpark, Python (User-Defined Function), pandas,
and PySpark SQL. Some transforms operate in place, while others create a new output column in your
dataset.
You can apply transforms to multiple columns at once. For example, you can delete multiple columns in
a single step.
You can apply the Process numeric and Handle missing transforms only to a single column.
Use this page to learn more about these built-in and custom transforms.
Transform UI
Most of the built-in transforms are located in the Prepare tab of the Data Wrangler UI. You can access
the join and concatenate transforms through the data flow view.
Transform
You can add a transform to any step in your data flow. Use the following procedure to add a
transform to your data flow.
4. Choose a transform.
5. (Optional) You can search for the transform that you want to use. Data Wrangler highlights the
query in the results.
Join View
To join two datasets, select the first dataset in your data flow and choose Join. When you choose
Join, you see results similar to those shown in the following image. Your left and right datasets are
displayed in the left panel. The main panel displays your data flow, with the newly joined dataset
added.
When you choose Configure to configure your join, you see results similar to those shown in the
following image. Your join configuration is displayed in the left panel. You can use this panel to
choose the joined dataset name, join type, and columns to join. The main panel displays three tables.
The top two tables display the left and right datasets on the left and right respectively. Under this
table, you can preview the joined dataset.
To concatenate two datasets, you select the first dataset in your data flow and choose Concatenate.
When you select Concatenate, you see results similar to those shown in the following image. Your
left and right datasets are displayed in the left panel. The main panel displays your data flow, with
the newly concatenated dataset added.
When you choose Configure to configure your concatenation, you see results similar to those
shown in the following image. Your concatenate configuration displays in the left panel. You can
use this panel to choose the concatenated dataset's name, and choose to remove duplicates after
concatenation and add columns to indicate the source dataframe. The main panel displays three
tables. The top two tables display the left and right datasets on the left and right respectively. Under
this table, you can preview the concatenated dataset.
Join Datasets
You join dataframes directly in your data flow. When you join two datasets, the resulting joined dataset
appears in your flow. The following join types are supported by Data Wrangler.
• Left Outer – Include all rows from the left table. If the value for the column joined on a left table row
does not match any right table row values, that row contains null values for all right table columns in
the joined table.
• Left Anti – Include rows from the left table that do not contain values in the right table for the joined
column.
• Left semi – Include a single row from the left table for all identical rows that satisfy the criteria in the
join statement. This excludes duplicate rows from the left table that match the criteria of the join.
• Right Outer – Include all rows from the right table. If the value for the joined column in a right table
row does not match any left table row values, that row contains null values for all left table columns in
the joined table.
• Inner – Include rows from left and right tables that contain matching values in the joined column.
• Full Outer – Include all rows from the left and right tables. If the row value for the joined column in
either table does not match, separate rows are created in the joined table. If a row doesn’t contain a
value for a column in the joined table, null is inserted for that column.
• Cartesian Cross – Include rows which combine each row from the first table with each row from the
second table. This is a Cartesian product of rows from tables in the join. The result of this product is
the size of the left table times the size of the right table. Therefore, we recommend caution in using
this join between very large datasets.
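The following PySpark sketch shows several of these join types on two toy dataframes; the dataframe and column names are illustrative, and Data Wrangler performs the equivalent joins for you:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l_val"])
right_df = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "r_val"])

left_df.join(right_df, on="id", how="inner").show()       # matching rows only
left_df.join(right_df, on="id", how="left_outer").show()  # all left rows
left_df.join(right_df, on="id", how="left_anti").show()   # left rows with no match
left_df.join(right_df, on="id", how="left_semi").show()   # matched left rows only
left_df.join(right_df, on="id", how="full_outer").show()  # all rows from both sides
left_df.crossJoin(right_df).show()                        # Cartesian product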
1. Select + next to the left dataframe that you want to join. The first dataframe you select is always the
left table in your join.
2. Choose Join.
3. Select the right dataframe. The second dataframe you select is always the right table in your join.
4. Choose Configure to configure your join.
5. Give your joined dataset a name using the Name field.
6. Select a Join type.
7. Select a column from the left and right tables to join.
8. Choose Apply to preview the joined dataset on the right.
9. To add the joined table to your data flow, choose Add.
Concatenate Datasets
To concatenate two datasets:
1. Choose + next to the left dataframe that you want to concatenate. The first dataframe you select is
always the left table in your concatenate.
2. Choose Concatenate.
3. Select the right dataframe. The second dataframe you select is always the right table in your
concatenate.
4. Choose Configure to configure your concatenate.
5. Give your concatenated dataset a name using the Name field.
6. (Optional) Select the checkbox next to Remove duplicates after concatenation to remove duplicate
columns.
7. (Optional) Select the checkbox next to Add column to indicate source dataframe if, for each
column in the new dataset, you want to add an indicator of the column's source.
8. Choose Apply to preview the new dataset.
9. Choose Add to add the new dataset to your data flow.
Balance Data
You can balance the data for datasets with an underrepresented category. Balancing a dataset can help
you create better models for binary classification.
Note
You can't balance datasets containing column vectors.
You can use the Balance data operation to balance your data using one of the following operators:
• Random oversampling – Randomly duplicates samples in the minority category. For example, if
you're trying to detect fraud, you might only have cases of fraud in 10% of your data. For an equal
proportion of fraudulent and non-fraudulent cases, this operator randomly duplicates fraud cases
within the dataset 8 times.
• Random undersampling – The counterpart of random oversampling. Randomly removes samples
from the overrepresented category to get the proportion of samples that you desire.
• Synthetic Minority Oversampling Technique (SMOTE) – Uses samples from the underrepresented
category to interpolate new synthetic minority samples. For more information about SMOTE, see the
following description.
You can use all of these operators on datasets containing both numeric and non-numeric features. SMOTE
interpolates values by using neighboring samples. Data Wrangler uses the R-squared distance to
determine the neighborhood to interpolate the additional samples. Data Wrangler only uses numeric
features to calculate the distances between samples in the underrepresented group.
For two real samples in the underrepresented group, Data Wrangler interpolates the numeric features
by using a weighted average. It randomly assigns weights to those samples in the range of [0, 1].
For numeric features, Data Wrangler interpolates samples using a weighted average of the samples.
For samples A and B, Data Wrangler could randomly assign a weight of 0.7 to A and 0.3 to B. The
interpolated sample has a value of 0.7A + 0.3B.
Data Wrangler interpolates non-numeric features by copying from either of the interpolated real
samples. It copies the samples with a probability that it randomly assigns to each sample. For samples A
and B, it can assign probabilities 0.8 to A and 0.2 to B. For the probabilities it assigned, it copies A 80% of
the time.
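The following is a toy Python sketch of the interpolation described above, for two minority-class rows represented as dicts; it illustrates the idea only and is not Data Wrangler's implementation:

import random

def smote_interpolate(a, b, numeric_cols):
    w = random.random()  # random weight in [0, 1]
    row = {}
    for col in a:
        if col in numeric_cols:
            row[col] = w * a[col] + (1 - w) * b[col]  # weighted average
        else:
            # copy the value from a with probability w, otherwise from b
            row[col] = a[col] if random.random() < w else b[col]
    return row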
Custom Transforms
The Custom Transforms group allows you to use Python (User-Defined Function), PySpark, pandas,
or PySpark (SQL) to define custom transformations. For each of these options, you use the variable df to
access the dataframe to which you want to apply the transform. To apply your custom code to your
dataframe, assign the dataframe with the transformations that you've made to the df variable. If you're
not using Python (User-Defined Function), you don't need to include a return statement. Choose Preview
to preview the result of the custom transform. Choose Add to add the custom transform to your list of
Previous steps.
You can import the popular libraries with an import statement in the custom transform code block,
such as the following:
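import pandas as pd
import numpy as np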
Important
Custom transform doesn't support columns with spaces or special characters in the name.
We recommend that you specify column names that only have alphanumeric characters and
underscores. You can use the Rename column transform in the Manage columns transform
group to remove spaces from a column's name. You can also add a Python (Pandas) Custom
transform similar to the following to remove spaces from multiple columns in a single step.
This example changes columns named A column and B column to A_column and B_column
respectively.
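df = df.rename(columns={"A column": "A_column", "B column": "B_column"})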
If you include print statements in the code block, the result appears when you select Preview. You can
resize the custom code transformer panel. Resizing the panel provides more space to write code. The
following image shows the resizing of the panel.
The following sections provide additional context and examples for writing custom transform code.
The Python function gives you the ability to write custom transformations without needing to know
Apache Spark or pandas. Data Wrangler is optimized to run your custom code quickly. You get similar
performance using custom Python code and an Apache Spark plugin.
To use the Python (User-Defined Function) code block, you specify the following:
• Input column – The input column where you're applying the transform.
• Mode – The scripting mode, either pandas or Python.
• Return type – The data type of the value that you're returning.
Using the pandas mode gives better performance. The Python mode makes it easier for you to write
transformations by using pure Python functions.
The following video shows an example of how to use custom code to create a transformation. It uses the
Titanic dataset to create a column with the person's salutation.
PySpark
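The following is a minimal sketch of a PySpark custom transform; it assumes the dataframe contains a
fare column and casts that column to a double type.
from pyspark.sql.functions import col
df = df.withColumn('fare', col('fare').cast('double'))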
pandas
The following example provides an overview of the dataframe to which you are adding transforms.
df.info()
PySpark (SQL)
The following example creates a new dataframe with four columns: name, fare, pclass, survived.
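Assuming the dataframe is exposed to the SQL query as df, as described above, a query such as the
following would produce that result:
SELECT name, fare, pclass, survived FROM df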
If you don’t know how to use PySpark, you can use custom code snippets to help you get started.
Data Wrangler has a searchable collection of code snippets. You can use the code snippets to perform
tasks such as dropping columns, grouping by columns, or modeling.
To use a code snippet, choose Search example snippets and specify a query in the search bar. The text
you specify in the query doesn’t have to match the name of the code snippet exactly.
The following example shows a Drop duplicate rows code snippet that can delete rows with similar data
in your dataset. You can find the code snippet by searching for one of the following:
• Duplicates
• Identical
• Remove
The following snippet has comments to help you understand the changes that you need to make. For
most snippets, you must specify the column names of your dataset in the code.
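For example, a Drop duplicate rows snippet might look like the following sketch; the column names in
subset are placeholders that you replace with columns from your own dataset.
# Specify the subset of columns to check.
# Rows with identical values in these columns are dropped.
subset = ["col1", "col2"]
df = df.dropDuplicates(subset)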
To use a snippet, copy and paste its content into the Custom transform field. You can copy and paste
multiple code snippets into the custom transform field.
Custom Formula
Use Custom formula to define a new column using a Spark SQL expression to query data in the current
dataframe. The query must use the conventions of Spark SQL expressions.
Important
Custom formula doesn't support columns with spaces or special characters in the name.
We recommend that you specify column names that only have alphanumeric characters and
underscores. You can use the Rename column transform in the Manage columns transform
group to remove spaces from a column's name. You can also add a Python (Pandas) Custom transform,
like the example shown earlier in Custom Transforms (p. 1065), to remove spaces from multiple columns
in a single step.
You can use this transform to perform operations on columns, referencing the columns by name. For
example, assuming the current dataframe contains columns named col_a and col_b, you can use the
following operation to produce an Output column that is the product of these two columns with the
following code:
col_a * col_b
Other common operations include the following, assuming a dataframe contains col_a and col_b
columns:
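• concat(col_a, col_b) – concatenates the two columns as a string
• log(col_a) – returns the natural logarithm of col_a
• col_a + col_b – adds the values of the two numeric columns
These expressions are illustrative; any expression that uses the conventions of Spark SQL expressions is
valid.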
Reduce Dimensionality Within a Dataset
You can reduce the dimensionality in your data by using Principal Component Analysis (PCA). PCA
produces a new set of features, called components, from your original features. The first component
accounts for the largest amount of variation in the data. The second component accounts for the
second largest amount of variation in the data, and so on.
You can use dimensionality reduction to reduce the size of the datasets that you use to train models.
Instead of the original features in your dataset, you can use the principal components.
To perform PCA, Data Wrangler creates axes for your data. An axis is an affine combination of columns
in your dataset. The first principal component is the value on the axis that has the largest amount of
variance. The second principal component is the value on the axis that has the second largest amount
of variance. The nth principal component is the value on the axis that has the nth largest amount of
variance.
You can configure the number of principal components that Data Wrangler returns. You can either
specify the number of principal components directly or you can specify the variance threshold
percentage. Each principal component explains an amount of variance in the data. For example, you
might have a principal component with a value of 0.5. The component would explain 50% of the
variation in the data. When you specify a variance threshold percentage, Data Wrangler returns the
smallest number of components that meet the percentage that you specify.
The following are example principal components with the amount of variance that they explain in the
data.
• Component 1 – 0.5
• Component 2 – 0.45
• Component 3 – 0.05
If you specify a variance threshold percentage of 94 or 95, Data Wrangler returns Component 1 and
Component 2. If you specify a variance threshold percentage of 96, Data Wrangler returns all three
principal components.
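Outside of Data Wrangler, the variance threshold behavior can be illustrated with the following
scikit-learn sketch; the feature matrix X is a hypothetical stand-in for your dataset.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # hypothetical dataset: 100 rows, 10 features
pca = PCA(n_components=0.95)  # keep the fewest components that explain at least 95% of the variance
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance explained by each returned component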
You can use the following procedure to run PCA on your dataset.
Encode Categorical
Categorical data is usually composed of a finite number of categories, where each category is
represented with a string. For example, if you have a table of customer data, a column that indicates the
country a person lives in is categorical. The categories would be Afghanistan, Albania, Algeria, and so
on. Categorical data can be nominal or ordinal. Ordinal categories have an inherent order, and nominal
categories do not. The highest degree obtained (High school, Bachelors, Masters, and so on) is an example
of ordinal categories.
Encoding categorical data is the process of creating a numerical representation for categories. For
example, if your categories are Dog and Cat, you may encode this information into two vectors, [1,0] to
represent Dog, and [0,1] to represent Cat.
When you encode ordinal categories, you may need to translate the natural order of categories into your
encoding. For example, you can represent the highest degree obtained with the following map: {"High
school": 1, "Bachelors": 2, "Masters":3}.
Use categorical encoding to encode categorical data that is in string format into arrays of integers.
The Data Wrangler categorical encoders create encodings for all categories that exist in a column at the
time the step is defined. If new categories have been added to a column when you start a Data Wrangler
job to process your dataset at time t, and this column was the input for a Data Wrangler categorical
encoding transform at time t-1, these new categories are considered missing in the Data Wrangler job.
The option you select for Invalid handling strategy is applied to these missing values. Examples of when
this can occur are:
• When you use a .flow file to create a Data Wrangler job to process a dataset that was updated after the
creation of the data flow. For example, you may use a data flow to regularly process sales data each
month. If that sales data is updated weekly, new categories may be introduced into columns for which
an encode categorical step is defined.
• When you select Sampling when you import your dataset, some categories may be left out of the
sample.
In these situations, these new categories are considered missing values in the Data Wrangler job.
You can choose from and configure an ordinal and a one-hot encode. Use the following sections to learn
more about these options.
Both transforms create a new column named Output column name. You specify the output format of
this column with Output style:
• Choose Vector to produce a single column with a sparse vector.
• Choose Columns to create a column for every category with an indicator variable for whether the
original column contains that category.
Ordinal Encode
Select Ordinal encode to encode categories into an integer between 0 and the total number of
categories in the Input column you select. For Invalid handling strategy, select how you want to handle
missing or invalid values:
• Choose Skip if you want to omit the rows with missing values.
• Choose Keep to retain missing values as the last category.
• Choose Error if you want Data Wrangler to throw an error if missing values are encountered in the
Input column.
• Choose Replace with NaN to replace missing with NaN. This option is recommended if your ML
algorithm can handle missing values. Otherwise, the first three options in this list may produce better
results.
One-Hot Encode
Select One-hot encode for Transform to use one-hot encoding. Configure this transform using the
following:
• Drop last category: If True, the last category does not have a corresponding index in the one-hot
encoding. When missing values are possible, a missing category is always the last one, so setting this
option to True means that a missing value results in an all-zero vector.
Similarity encode
Use similarity encoding when you have the following:
• A large number of categories
• Noisy data, in which the same category can appear with small variations, such as misspellings
The similarity encoder creates embeddings for columns with categorical data. An embedding is a
mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to
vectors containing similar values. For example, it creates very similar encodings for "California" and
"Calfornia".
Data Wrangler converts each category in your dataset into a set of tokens using a 3-gram tokenizer. It
converts the tokens into an embedding using min-hash encoding.
The following example shows how the similarity encoder creates vectors from strings.
For the preceding reasons, similarity encoding is more versatile than one-hot encoding.
To add the similarity encoding transform to your dataset, use the following procedure.
Featurize Text
Use the Featurize Text transform group to inspect string-typed columns and use text embedding to
featurize these columns.
This feature group contains two features, Character statistics and Vectorize. Use the following sections to
learn more about these transforms. For both options, the Input column must contain text data (string
type).
Character Statistics
Use Character statistics to generate statistics for each row in a column containing text data.
This transform computes the following ratios and counts for each row, and creates a new column to
report the result. The new column is named using the input column name as a prefix and a suffix that is
specific to the ratio or count.
• Number of words: The total number of words in that row. The suffix for this output column is -
stats_word_count.
• Number of characters: The total number of characters in that row. The suffix for this output column is
-stats_char_count.
• Ratio of upper: The number of uppercase characters, from A to Z, divided by all characters in the
column. The suffix for this output column is -stats_capital_ratio.
• Ratio of lower: The number of lowercase characters, from a to z, divided by all characters in the
column. The suffix for this output column is -stats_lower_ratio.
• Ratio of digits: The ratio of digits in a single row over the sum of digits in the input column. The suffix
for this output column is -stats_digit_ratio.
• Special characters ratio: The ratio of non-alphanumeric characters (characters like #$&%:@)
to the total number of characters in the input column. The suffix for this output column is -
stats_special_ratio.
Vectorize
Text embedding involves mapping words or phrases from a vocabulary to vectors of real numbers. Use
the Data Wrangler text embedding transform to tokenize and vectorize text data into term frequency–
inverse document frequency (TF-IDF) vectors.
When TF-IDF is calculated for a column of text data, each word in each sentence is converted to a real
number that represents its semantic importance. Higher numbers are associated with less frequent
words, which tend to be more meaningful.
When you define a Vectorize transform step, Data Wrangler uses the data in your dataset to define the
count vectorizer and TF-IDF methods. Running a Data Wrangler job uses these same methods.
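The following PySpark sketch illustrates the general tokenize, count-vectorize, and IDF flow that this
transform performs; it is illustrative only, and the column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("good dog",), ("good cat",)], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(df)
tf = CountVectorizer(inputCol="tokens", outputCol="tf").fit(tokens).transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("tfidf").show(truncate=False)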
• Output column name: This transform creates a new column with the text embedding. Use this field to
specify a name for this output column.
• Tokenizer: A tokenizer converts the sentence into a list of words, or tokens.
Choose Standard to use a tokenizer that splits by white space and converts each word to lowercase.
For example, "Good dog" is tokenized to ["good","dog"].
Choose Custom to use a customized tokenizer. If you choose Custom, you can use the following fields
to configure the tokenizer:
• Minimum token length: The minimum length, in characters, for a token to be valid. Defaults to 1.
For example, if you specify 3 for minimum token length, words like a, at, in are dropped from
the tokenized sentence.
• Should regex split on gaps: If selected, regex splits on gaps. Otherwise, it matches tokens. Defaults
to True.
• Regex pattern: Regex pattern that defines the tokenization process. Defaults to '\s+'.
• To lowercase: If chosen, Data Wrangler converts all characters to lowercase before tokenization.
Defaults to True.
To learn more about this option, see the Spark documentation on CountVectorizer.
• Hashing is computationally faster. Hash vectorize parameters include the following:
• Number of features during hashing: A hash vectorizer maps tokens to a vector index according to
their hash value. This feature determines the number of possible hash values. Large values result
in fewer collisions between hash values but a higher dimension output vector.
To learn more about this option, see the Spark documentation on FeatureHasher.
• Apply IDF applies an IDF transformation, which multiplies the term frequency with the standard
inverse document frequency used for TF-IDF embedding. IDF parameters include the following:
• Minimum document frequency : Minimum number of documents (rows) in which a term (token)
must appear to be included. If count_vectorize is the chosen vectorizer, we recommend that you
keep the default value and only modify the min_doc_freq field in Count vectorize parameters.
Defaults to 5.
• Output format: The output format of each row.
• Select Vector to produce a single column with a sparse vector.
• Select Flattened to create a column for every category with an indicator variable for whether the
text in the original column contains a value that is equal to that category. You can only choose
flattened when Vectorizer is set as Count vectorizer.
Transform Time Series
For example, the following table contains time series data indexed by the hour:
Number of Customers    Time (hour)
4                      09:00
10                     10:00
14                     11:00
25                     12:00
20                     13:00
18                     14:00
For the preceding table, the Number of Customers column contains the time series data. The time series
data is indexed on the hourly data in the Time (hour) column.
You might need to perform a series of transformations on your data to get it in a format that you can
use for your analysis. Use the Time series transform group to transform your time series data. For more
information about the transformations that you can perform, see the following sections.
Topics
• Group by a Time Series (p. 1078)
• Resample Time Series Data (p. 1079)
• Handle Missing Time Series Data (p. 1081)
• Validate the Timestamp of Your Time Series Data (p. 1082)
• Standardizing the Length of the Time Series (p. 1083)
• Extract Features from Your Time Series Data (p. 1084)
• Use Lagged Features from Your Time Series Data (p. 1085)
• Create a Datetime Range In Your Time Series (p. 1085)
• Use a Rolling Window In Your Time Series (p. 1086)
For example, you have the following table that tracks the average daily electricity usage in a household.
household_0 1/1/2020 30 2
household_0 1/2/2020 40 2
household_0 1/4/2020 35 3
household_1 1/2/2020 45 3
household_1 1/3/2020 55 4
Each entry in the time series sequence is ordered by the corresponding timestamp. The first element of
the sequence corresponds to the first timestamp of the series. For household_0, 30 is the first value of
the Electricity Usage Series. The value of 30 corresponds to the first timestamp of 1/1/2020.
You can include the starting timestamp and ending timestamp. The following table shows how that
information appears.
You can use the following procedure to group by a time series column.
Resample Time Series Data
Many analyses, such as forecasting algorithms, require the observations to be taken at regular intervals.
Resampling gives you the ability to establish regular intervals for the observations in your dataset.
You can either upsample or downsample a time series. Downsampling increases the interval between
observations in the dataset. For example, if you downsample observations that are taken every hour
into observations that are taken every two hours, the hourly observations are aggregated into a single
value using an aggregation method such as the mean or median.
Upsampling reduces the interval between observations in the dataset. For example, if you upsample
observations that are taken every two hours into hourly observations, you can use an interpolation
method to infer hourly observations from the ones that have been taken every two hours. For
information on interpolation methods, see pandas.DataFrame.interpolate.
Use the Resample operation to resample your time series data. If you have multiple time series in your
dataset, Data Wrangler standardizes the time interval for each time series.
The following tables show an example of downsampling time series data by using the mean as the
aggregation method. The data is downsampled from every hour to every two hours.
Hourly time series data:
12:00    30
1:00     32
2:00     35
3:00     32
4:00     30
Time series data downsampled to every two hours:
12:00    31
2:00     33.5
4:00     30
You can use the following procedure to resample time series data.
Handle Missing Time Series Data
If your time series data has missing values, you can do one of the following:
• For datasets that have multiple time series, drop the time series that have missing values that are
greater than a threshold that you specify.
• Impute the missing values in a time series by using other values in the time series.
Imputing a missing value involves replacing the data by either specifying a value or by using an
inferential method. The following are the methods that you can use for imputation:
• Constant value – Replace all the missing data in your dataset with a value that you specify.
• Most common value – Replace all the missing data with the value that has the highest frequency in the
dataset.
• Forward fill – Use a forward fill to replace the missing values with the non-missing value that precedes
the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced
with 7. The sequence that results from using a forward fill is [2, 4, 7, 7, 7, 7, 8].
• Backward fill – Use a backward fill to replace the missing values with the non-missing value that
follows the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are
replaced with 8. The sequence that results from using a backward fill is [2, 4, 7, 8, 8, 8, 8].
• Interpolate – Uses an interpolation function to impute the missing values. For more information on the
functions that you can use for interpolation, see pandas.DataFrame.interpolate.
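As an illustration of the fill methods described above, the following pandas sketch reproduces the
forward fill and backward fill examples; Data Wrangler's internal implementation may differ.
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 7, np.nan, np.nan, np.nan, 8])
print(s.ffill())         # forward fill: the missing values become 7
print(s.bfill())         # backward fill: the missing values become 8
print(s.interpolate())   # linear interpolation: the missing values become 7.25, 7.5, and 7.75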
Some of the imputation methods might not be able to impute all of the missing values in your dataset.
For example, a forward fill can't impute a missing value that appears at the beginning of the time series.
You can impute the values by using either a forward fill or a backward fill.
You can either impute missing values within a cell or within a column.
The following example shows how values are imputed within a cell.
The following example shows how values are imputed within a column.
Average daily household electricity usage with missing values
household_0    30
household_0    40
household_0    NaN
household_1    NaN
household_1    NaN
Average daily household electricity usage with values imputed using a forward fill
household_0    30
household_0    40
household_0    40
household_1    40
household_1    40
Validate the Timestamp of Your Time Series Data
If you have invalid timestamps in your dataset, you can't perform your analysis successfully. You can use
Data Wrangler to identify invalid timestamps and understand where you need to clean your data.
You can configure Data Wrangler to do one of the following if it encounters missing values in your
dataset:
You can validate the timestamps on columns that either have the timestamp type or the string type.
If the column has the string type, Data Wrangler converts the type of the column to timestamp and
performs the validation.
You can use the following procedure to validate the timestamps in your dataset.
Standardizing the Length of the Time Series
You can standardize your time series for data transformations that require the length of your data to be
fixed.
Many ML algorithms require you to flatten your time series data before you use them. Flattening time
series data is separating each value of the time series into its own column in a dataset. The number of
columns in a dataset can't change, so the lengths of the time series need to be standardized before you
flatten each array into a set of features.
Each time series is set to the length that you specify as a quantile or percentile of the time series set. For
example, you can have three sequences that have the following lengths:
• 3
• 4
• 5
You can set the length of all of the sequences as the length of the sequence that has the 50th percentile
length.
Time series arrays that are shorter than the length you've specified have missing values added. The
following is an example format of standardizing the time series to a longer length: [2, 4, 5, NaN, NaN,
NaN].
You can use different approaches to handle the missing values. For information on those approaches, see
Handle Missing Time Series Data (p. 1081).
The time series arrays that are longer than the length that you specify are truncated.
You can use the following procedure to standardize the length of the time series.
Extract Features from Your Time Series Data
Use the following options to choose how you want to extract features from your data:
• Use Minimal subset to specify extracting 8 features that you know are useful in downstream analyses.
You can use a minimal subset when you need to perform computations quickly. You can also use it
when your ML algorithm has a high risk of overfitting and you want to provide it with fewer features.
• Use Efficient subset to specify extracting the most features possible without extracting features that
are computationally intensive in your analyses.
• Use All features to specify extracting all features from the time series.
• Use Manual subset to choose a list of features that you think explain the variation in your data well.
Use the following procedure to extract features from your time series data.
Use Lagged Features from Your Time Series Data
To forecast future values in a time series, the most useful features are often past values. Common ways
to use lagged features include the following:
• Collecting a handful of past values. For example, for time, t + 1, you collect t, t - 1, t - 2, and t - 3.
• Collecting values that correspond to seasonal behavior in the data. For example, to predict the
occupancy in a restaurant at 1:00 PM, you might want to use the features from 1:00 PM on the
previous day. Using the features from 12:00 PM or 11:00 AM on the same day might not be as
predictive as using the features from previous days.
For example, you might have the following time series data for the number of customers at a restaurant.
Number of customers
10
14
24
40
30
20
If you know that the restaurant opened at 1:00 PM and that the observations are taken hourly, you can
add a timestamp column that corresponds to the time series data. You can see the timestamp column in
the following table.
Number of customers    Timestamp
10                     1:00 PM
14                     2:00 PM
24                     3:00 PM
40                     4:00 PM
30                     5:00 PM
20                     6:00 PM
You can use the following procedure to extract features over a time period.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Rolling window features.
6. For Generate rolling window features for this column, choose a column.
7. For Timestamp Column, choose the column containing the timestamps.
8. (Optional) For Output Column, specify the name of the output column.
9. For Window size, specify the window size.
10. For Strategy, choose the extraction strategy.
11. Choose Preview to generate a preview of the transform.
12. Choose Add to add the transform to the Data Wrangler data flow.
Featurize Datetime
Use Featurize date/time to create a vector embedding representing a datetime field. To use this
transform, your datetime data must be in one of the following formats:
You can choose to Infer datetime format and provide a Datetime format. If you provide a datetime
format, you must use the codes described in the Python documentation. The options you select for these
two configurations have implications for the speed of the operation and the final results.
• The most manual and computationally fastest option is to specify a Datetime format and select No
for Infer datetime format.
• To reduce manual labor, you can choose Infer datetime format and not specify a datetime format. It
is also a computationally fast operation; however, the first datetime format encountered in the input
column is assumed to be the format for the entire column. If there are other formats in the column,
these values are NaN in the final output. Inferring the datetime format can give you unparsed strings.
• If you don't specify a format and select No for Infer datetime format, you get the most robust results.
All the valid datetime strings are parsed. However, this operation can be an order of magnitude slower
than the first two options in this list.
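As an example of the Python format codes mentioned above, the following sketch parses a hypothetical
timestamp with an explicitly specified format.
from datetime import datetime

# '%Y-%m-%d %H:%M:%S' is built from the Python format codes
parsed = datetime.strptime("2023-01-25 13:30:00", "%Y-%m-%d %H:%M:%S")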
When you use this transform, you specify an Input column which contains datetime data in one of the
formats listed above. The transform creates an output column named Output column name. The format
of the output column depends on your configuration using the following:
Additionally, you must choose an Embedding mode. For linear models and deep networks, we
recommend choosing cyclic. For tree-based algorithms, we recommend choosing ordinal.
Format String
The Format string transforms contain standard string formatting operations. For example, you can use
these operations to remove special characters, normalize string lengths, and update string casing.
This feature group contains the following transforms. All transforms return copies of the strings in the
Input column and add the result to a new, output column.
• Center (pad on either side) – Center-pads the string (adds padding on both sides of the string) with a
given Fill character to the given width. If the string is longer than width, the return value is shortened
to width characters.
• Strip left and right – Returns a copy of the string with the leading and trailing characters removed.
• Strip characters from left – Returns a copy of the string with leading characters removed.
• Strip characters from right – Returns a copy of the string with trailing characters removed.
• Add prefix or suffix – Adds a prefix and a suffix to the string column. You must specify at least one of
Prefix and Suffix.
Handle Outliers
Machine learning models are sensitive to the distribution and range of your feature values. Outliers, or
rare values, can negatively impact model accuracy and lead to longer training times. Use this feature
group to detect and update outliers in your dataset.
When you define a Handle outliers transform step, the statistics used to detect outliers are generated on
the data available in Data Wrangler when defining this step. These same statistics are used when running
a Data Wrangler job.
Use the following sections to learn more about the transforms this group contains. You specify an
Output name and each of these transforms produces an output column with the resulting data.
Robust standard deviation numeric outliers
You must define an Upper quantile and a Lower quantile for the statistics used to calculate outliers. You
must also specify the number of Standard deviations from which a value must vary from the mean to be
considered an outlier. For example, if you specify 3 for Standard deviations, a value must fall more than
3 standard deviations from the mean to be considered an outlier.
The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:
• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.
Standard deviation numeric outliers
You specify the number of Standard deviations a value must vary from the mean to be considered an
outlier. For example, if you specify 3 for Standard deviations, a value must fall more than 3 standard
deviations from the mean to be considered an outlier.
The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:
• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.
Quantile numeric outliers
The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:
• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.
Min-max numeric outliers
You specify an Upper threshold and a Lower threshold, and if values fall above or below those thresholds
respectively, they are considered outliers.
The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:
• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.
Replace Rare
When you use the Replace rare transform, you specify a threshold and Data Wrangler finds all values
that meet that threshold and replaces them with a string that you specify. For example, you may want to
use this transform to categorize all outliers in a column into an "Others" category.
Fill Missing
Use the Fill missing transform to replace missing values with a Fill value you define.
Impute Missing
Use the Impute missing transform to create a new column that contains imputed values where missing
values were found in input categorical and numerical data. The configuration depends on your data type.
For numeric data, choose an imputing strategy, the strategy used to determine the new value to impute.
You can choose to impute the mean or the median over the values that are present in your dataset. Data
Wrangler uses the value that it computes to impute the missing values.
For categorical data, Data Wrangler imputes missing values using the most frequent value in the column.
To impute a custom string, use the Fill missing transform instead.
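The following pandas sketch shows the idea behind both strategies; the column names and values are
hypothetical.
import pandas as pd

df = pd.DataFrame({"age": [34, None, 51], "city": ["Seattle", "Seattle", None]})
df["age"] = df["age"].fillna(df["age"].mean())        # numeric: impute the mean (or use .median())
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: impute the most frequent value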
Drop Missing
Use the Drop missing option to drop rows that contain missing values from the Input column.
Manage Columns
You can use the following transforms to quickly update and manage columns in your dataset:
• Drop column – Deletes a column from the dataset.
• Duplicate column – Creates a copy of a column.
• Rename column – Renames a column.
• Move column – Changes a column's position in the dataset.
Manage Rows
Use this transform group to quickly perform sort and shuffle operations on rows. This group contains the
following:
• Sort: Sort the entire dataframe by a given column. Select the check box next to Ascending order to
sort in ascending order; clear the check box to sort in descending order.
• Shuffle: Randomly shuffle all rows in the dataset.
Manage Vectors
Use this transform group to combine or flatten vector columns. This group contains the following
transforms.
• Assemble: Use this transform to combine Spark vectors and numeric data into a single column. For
example, you can combine three columns: two containing numeric data and one containing vectors.
Add all the columns you want to combine in Input columns and specify an Output column name for
the combined data.
• Flatten: Use this transform to flatten a single column containing vector data. The input column must
contain PySpark vectors or array-like objects. You can control the number of columns created by
specifying a Method to detect number of outputs. For example, if you select Length of first vector,
the number of elements in the first valid vector or array found in the column determines the number
of output columns that are created. All other input vectors with too many items are truncated. Inputs
with too few items are filled with NaNs.
You also specify an Output prefix, which is used as the prefix for each output column.
Process Numeric
Use the Process Numeric feature group to process numeric data. Each scaler in this group is defined
using the Spark library. The following scalers are supported:
• Standard Scaler: Standardize the input column by subtracting the mean from each value and scaling
to unit variance. To learn more, see the Spark documentation for StandardScaler.
• Robust Scaler: Scale the input column using statistics that are robust to outliers. To learn more, see
the Spark documentation for RobustScaler.
• Min Max Scaler: Transform the input column by scaling each feature to a given range. To learn more,
see the Spark documentation for MinMaxScaler.
• Max Absolute Scaler: Scale the input column by dividing each value by the maximum absolute value.
To learn more, see the Spark documentation for MaxAbsScaler.
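The following PySpark sketch shows the kind of scaling that the Standard Scaler performs; it is
illustrative only and assumes the numeric inputs have already been assembled into a vector column
named features.
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="features_scaled",
                        withMean=True, withStd=True)
df = scaler.fit(df).transform(df)  # subtract the mean and scale to unit variance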
Sampling
After you've imported your data, you can use the Sampling transformer to take one or more samples of
it. When you use the sampling transformer, Data Wrangler samples your original dataset using one of
the following methods:
• Limit: Samples the dataset starting from the first row up to the limit that you specify.
• Randomized: Takes a random sample of a size that you specify.
• Stratified: Takes a stratified random sample.
You can stratify a randomized sample to make sure that it represents the original distribution of the
dataset.
You might be performing data preparation for multiple use cases. For each use case, you can take a
different sample and apply a different set of transformations.
Use the following procedure to take a randomized sample.
1. Choose the + to the right of the dataset that you've imported. The name of your dataset is located
below the +.
2. Choose Add transform.
3. Choose Sampling.
4. For Sampling method, choose the sampling method.
5. For Approximate sample size, choose the approximate number of observations that you want in
your sample.
6. (Optional) Specify an integer for Random seed to create a reproducible sample.
Use the following procedure to take a stratified sample.
1. Choose the + to the right of the dataset that you've imported. The name of your dataset is located
below the +.
2. Choose Add transform.
3. Choose Sampling.
4. For Sampling method, choose the sampling method.
5. For Approximate sample size, choose the approximate number of observations that you want in
your sample.
6. For Stratify column, specify the name of the column that you want to stratify on.
7. (Optional) Specify an integer for Random seed to create a reproducible sample.
Search and Edit
The following transforms are supported under Search and edit. All transforms return copies of the
strings in the Input column and add the result to a new output column.
• Find substring (from right) – Returns the index of the last occurrence of the Substring for which you
searched. You can start and end the search at Start and End respectively.
• Extract between delimiters – Returns a string with all characters found between Left delimiter and
Right delimiter.
• Find and replace substring – Returns a string with all matches of a given Pattern (regular expression)
replaced by Replacement string.
• Replace between delimiters – Returns a string with the substring found between the first appearance
of a Left delimiter and the last appearance of a Right delimiter replaced by Replacement string. If no
match is found, nothing is replaced.
• Split string by delimiter – Returns an array of strings from the input string, split by Delimiter, with up
to Max number of splits (optional). The delimiter defaults to white space.
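For intuition, the following pandas sketch performs a comparable find-and-replace with a regular
expression; the column name and pattern are hypothetical.
import pandas as pd

df = pd.DataFrame({"text": ["order #123", "order #456"]})
# Replace every run of digits with the placeholder 'N'
df["text_edited"] = df["text"].str.replace(r"\d+", "N", regex=True)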
Split data
Use the Split data transform to split your dataset into two or three datasets. For example, you can split
your dataset into a dataset used to train your model and a dataset used to test it. You can determine the
proportion of the dataset that goes into each split. For example, if you’re splitting one dataset into two
datasets, the training dataset can have 80% of the data while the testing dataset has 20%.
Splitting your data into three datasets gives you the ability to create training, validation, and test
datasets. You can see how well the model performs on the test dataset by dropping the target column.
Your use case determines how much of the original dataset each of your datasets get and the method
you use to split the data. For example, you might want to use a stratified split to make sure that the
distribution of the observations in the target column are the same across datasets. You can use the
following split transforms:
• Randomized split – Each split is a random, non-overlapping sample of the original dataset. For larger
datasets, using a randomized split might be computationally expensive and take longer than an
ordered split.
• Ordered split – Splits the dataset based on the sequential order of the observations. For example, for
an 80/20 train-test split, the first observations that make up 80% of the dataset go to the training
dataset. The last 20% of the observations go to the testing dataset. Ordered splits are effective in
keeping the existing order of the data between splits.
• Stratified split – Splits the dataset to make sure that the number of observations in the input column
have proportional representation. For an input column that has the observations 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, an 80/20 split on the column would mean that approximately 80% of the
1s, 80% of the 2s, and 80% of the 3s go to the training set. About 20% of each type of observation go
to the testing set.
• Split by key – Avoids data with the same key occurring in more than one split. For example, if you have
a dataset with the column 'customer_id' and you're using it as a key, no customer id is in more than
one split.
After you split the data, you can apply additional transformations to each dataset. For most use cases,
they aren't necessary.
For performance, Data Wrangler calculates an approximation of the proportions of the splits. You can
choose an error threshold to set the accuracy of the splits. Lower error thresholds more accurately
reflect the proportions that you specify for the splits. If you set a higher error threshold, you get better
performance, but lower accuracy.
For perfectly split data, set the error threshold to 0. You can specify a threshold between 0 and 1 for
better performance. If you specify a value greater than 1, Data Wrangler interprets that value as 1.
If you have 10000 rows in your dataset and you specify an 80/20 split with an error of 0.001, you would
get observations approximating one of the following results:
• 8010 observations in the training set and 1990 in the testing set
• 7990 observations in the training set and 2010 in the testing set
The number of observations for the training set in the preceding example is in the interval between
7990 and 8010.
By default, Data Wrangler uses a random seed to make the splits reproducible. You can specify a
different value for the seed to create a different reproducible split.
Randomized split
1. Choose the + next to the node containing the dataset that you're splitting.
2. Choose Add transform.
3. Choose Split data.
4. For Transform, choose Randomized split.
5. (Optional) Choose the + to create an additional split.
• Specify the names and proportions of all the splits. The proportions must sum to 1.
6. (Optional) Specify a value for Error threshold other than the default value.
7. (Optional) Specify a value for Random seed.
8. Choose Preview.
9. Choose Add.
Ordered split
1. Choose the + next to the node containing the dataset that you're splitting.
2. Choose Add transform.
3. Choose Split data.
4. For Transform, choose Ordered split.
5. (Optional) For Splits, specify the names and proportions of each split. The proportions must
sum to 1.
6. (Optional) Choose the + to create an additional split.
• Specify the names and proportions of all the splits. The proportions must sum to 1.
7. (Optional) Specify a value for Error threshold other than the default value.
8. (Optional) For Input column, specify a column with numeric values. Data Wrangler uses the
values of the column to infer which records are in each split. The smaller values are in one split,
with the larger values in the other splits.
9. (Optional) Select Handle duplicates to add noise to duplicate values and create a dataset of
entirely unique values.
10. (Optional) Specify a value for Random seed.
11. Choose Preview.
12. Choose Add.
Stratified split
1. Choose the + next to the node containing the dataset that you're splitting.
2. Choose Add transform.
3. Choose Split data.
4. For Transform, choose Stratified split.
5. (Optional) For Splits, specify the names and proportions of each split. The proportions must
sum to 1.
6. (Optional) Choose the + to create an additional split.
• Specify the names and proportions of all the splits. The proportions must sum to 1.
7. For Input column, specify a column with up to 100 unique values. Data Wrangler can't stratify a
column with more than 100 unique values.
8. (Optional) Specify a value for Error threshold other than the default value.
9. (Optional) Specify a value for Random seed to specify a different seed.
10. Choose Preview.
11. Choose Add.
Use the following procedure to split by the column keys in your dataset.
1. Choose the + next to the node containing the dataset that you're splitting.
2. Choose Add transform.
3. Choose Split data.
4. For Transform, choose Split by key.
5. (Optional) For Splits, specify the names and proportions of each split. The proportions must
sum to 1.
6. (Optional) Choose the + to create an additional split.
• Specify the names and proportions of all the splits. The proportions must sum to 1.
7. For Key columns, specify the columns with values that you don't want to appear in both
datasets.
8. (Optional) Specify a value for Error threshold other than the default value.
9. Choose Preview.
10. Choose Add.
Parse Value as Type
Use this transform to cast a column to a new type. The supported Data Wrangler types are:
• Long
• Float
• Boolean
• Date, in the format dd-MM-yyyy, representing day, month, and year respectively.
• String
Validate String
Use the Validate string transforms to create a new column that indicates that a row of text data meets
a specified condition. For example, you can use a Validate string transform to verify that a string only
contains lowercase characters. If a transform in this group outputs a Boolean value, True is represented
with a 1 and False is represented with a 0.
Unnest JSON Data
Use the Flatten structured operator to separate the first level keys into separate columns. A first level
key is a key that isn't nested within a value.
For example, you might have a dataset that has a person column with demographic information on each
person stored as JSON strings. A JSON string might look like the following.
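{
  "seq": 1,
  "name": {"first": "Nathaniel", "last": "Phillips"},
  "age": 34,
  "city": "Seattle",
  "state": "WA"
}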
The Flatten structured operator converts the following first level keys into additional columns in your
dataset:
• seq
• name
• age
• city
• state
Data Wrangler puts the values of the keys as values under the columns. The following shows the column
names and values of the JSON.
For each value in your dataset containing JSON, the Flatten structured operator creates columns for the
first-level keys. To create columns for nested keys, call the operator again. For the preceding example,
calling the operator again creates the following columns:
• name_first
• name_last
The following example shows the dataset that results from calling the operation again.
Choose Keys to flatten on to specify the first-level keys that you want to extract as separate columns. If
you don't specify any keys, Data Wrangler extracts all the keys by default.
Explode Array
Use Explode array to expand the values of an array into separate output rows. For example, the
operation can take each value in the array [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and create a new column with the
following rows:
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
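In PySpark terms, this corresponds to the explode function, as in the following sketch with a
hypothetical column name.
from pyspark.sql.functions import explode

# Each element of the array column becomes its own row in the new column
df = df.withColumn("exploded_value", explode(df["array_col"]))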
You can call the Explode array operation multiple times to get the nested values of the array into
separate output columns. The following example shows the result of calling the operation multiple times
on a dataset with a nested array.
2 2 rose
2 2 petunia
2 2 lily
2 2 daisy
Transform Image Data
You can use the information provided here to familiarize yourself with importing and transforming
image data in Data Wrangler. Data Wrangler uses OpenCV to import images. For more information about
supported image formats, see Image file reading and writing.
After you've familiarized yourself with the concepts of transforming your image data, go through the
following tutorial, Prepare image data with Amazon SageMaker Data Wrangler.
The following industries and use cases are examples where applying machine learning to transformed
image data can be useful:
When you work with image data in Data Wrangler, you go through the following process:
1. Import – Select the images by choosing the directory containing them in your Amazon S3 bucket.
2. Transform – Use the built-in transformations to prepare the images for your machine learning
pipeline.
3. Export – Export the images that you’ve transformed to a location that can be accessed from the
pipeline.
Data Wrangler uses the open-source imgaug library for its built-in image transformations. You can use
the following built-in transformations:
• ResizeImage
• EnhanceImage
• CorruptImage
• SplitImage
• DropCorruptedImages
• DropImageDuplicates
• Brightness
• ColorChannels
• Grayscale
• Rotate
Use the following procedure to transform your images without writing code.
1. From your Data Wrangler flow, choose the + next to the node representing the images that you've
imported.
2. Choose Add transform.
3. Choose Add step.
4. Choose the transform and configure it.
5. Choose Preview.
6. Choose Add.
In addition to using the transformations that Data Wrangler provides, you can also use your own
custom code snippets. For more information about using custom code snippets, see Custom
Transforms (p. 1065). You can import the OpenCV and imgaug libraries within your code snippets and
use the transforms associated with them. The following is an example of a code snippet that detects
edges within the images.
import cv2
import numpy as np

@BasicImageOperationDecorator
def my_transform(image: np.ndarray) -> np.ndarray:
    # To use the code snippet on your image data, modify the following lines within the function
    HYST_THRLD_1, HYST_THRLD_2 = 100, 200
    edges = cv2.Canny(image, HYST_THRLD_1, HYST_THRLD_2)
    return edges

@PandasUDFOperationDecorator(IMAGE_COLUMN_TYPE)
def custom_image_udf(image_row):
    return my_transform(image_row)

df = df.withColumn(DEFAULT_IMAGE_COLUMN, custom_image_udf(column(DEFAULT_IMAGE_COLUMN)))
When you apply transformations in your Data Wrangler flow, Data Wrangler applies them only to a
sample of the images in your dataset. To optimize your experience with the application, Data Wrangler
doesn't apply the transforms to all of your images.
To apply the transformations to all of your images, export your Data Wrangler flow to an Amazon S3
location. You can use the images that you've exported in your training or inference pipelines. Use a
destination node or a Jupyter Notebook to export your data. You can access either method for exporting
your data from the Data Wrangler flow. For information about using these methods, see Export to
Amazon S3 (p. 1118).
Analyze and Visualize
You add an analysis to a dataframe by selecting a step in your data flow, and then choosing Add
analysis. To access an analysis you've created, select the step that contains the analysis, and select the
analysis.
Histogram
Use histograms to see the counts of feature values for a specific feature. You can inspect the
relationships between features using the Color by option. For example, the following histogram charts
the distribution of user ratings of the best-selling books on Amazon from 2009–2019, colored by genre.
You can use the Facet by feature to create histograms of one column, for each value in another column.
For example, the following diagram shows histograms of user reviews of best-selling books on Amazon,
faceted by year.
Scatter Plot
Use the Scatter Plot feature to inspect the relationship between features. To create a scatter plot, select
a feature to plot on the X axis and the Y axis. Both of these columns must be numeric typed columns.
You can color scatter plots by an additional column. For example, the following example shows a scatter
plot comparing the number of reviews against user ratings of top-selling books on Amazon between
2009 and 2019. The scatter plot is colored by book genre.
Additionally, you can facet scatter plots by features. For example, the following image shows an example
of the same review versus user rating scatter plot, faceted by year.
Table Summary
Use the Table Summary analysis to quickly summarize your data.
For columns with numerical data, including long and float data, a table summary reports the number of
entries (count), minimum (min), maximum (max), mean, and standard deviation (stddev) for each column.
For columns with non-numerical data, including columns with string, Boolean, or date/time data, a table
summary reports the number of entries (count), least frequent value (min), and most frequent value
(max).
Quick Model
Use the Quick Model visualization to quickly evaluate your data and produce importance scores for
each feature. A feature importance score indicates how useful a feature is at predicting a target
label. The feature importance score is between [0, 1] and a higher number indicates that the feature
is more important to the whole dataset. On the top of the quick model chart, there is a model score. A
classification problem shows an F1 score. A regression problem has a mean squared error (MSE) score.
When you create a quick model chart, you select a dataset you want evaluated, and a target label against
which you want feature importance to be compared. Data Wrangler does the following:
• Infers the data types for the target label and each feature in the dataset selected.
• Determines the problem type. Based on the number of distinct values in the label column, Data
Wrangler determines if this is a regression or classification problem type. Data Wrangler sets a
categorical threshold to 100. If there are more than 100 distinct values in the label column, Data
Wrangler classifies it as a regression problem; otherwise, it is classified as a classification problem.
• Pre-processes features and label data for training. The algorithm used requires encoding features to
vector type and encoding labels to double type.
• Trains a random forest algorithm with 70% of data. Spark’s RandomForestRegressor is used to train a
model for regression problems. The RandomForestClassifier is used to train a model for classification
problems.
• Evaluates a random forest model with the remaining 30% of data. Data Wrangler evaluates
classification models using an F1 score and evaluates regression models using an MSE score.
• Calculates feature importance for each feature using the Gini importance method.
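The following PySpark sketch outlines the classification flow described above; it is illustrative only and
assumes that the features are already encoded in a features vector column and the label in a label
column.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = df.randomSplit([0.7, 0.3], seed=42)  # 70/30 train-test split
model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)
predictions = model.transform(test)
f1 = MulticlassClassificationEvaluator(labelCol="label", metricName="f1").evaluate(predictions)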
The following image shows the user interface for the quick model feature.
Target Leakage
Target leakage occurs when there is data in a machine learning training dataset that is strongly
correlated with the target label, but is not available in real-world data. For example, you may have a
column in your dataset that serves as a proxy for the column you want to predict with your model.
When you use the Target Leakage analysis, you specify the following:
• Target: This is the feature about which you want your ML model to be able to make predictions.
• Problem type: This is the ML problem type on which you are working. Problem type can either be
classification or regression.
• (Optional) Max features: This is the maximum number of features to present in the visualization, which
shows features ranked by their risk of being target leakage.
For classification, the target leakage analysis uses the area under the receiver operating characteristic,
or AUC - ROC curve for each column, up to Max features. For regression, it uses a coefficient of
determination, or R2 metric.
The AUC - ROC curve provides a predictive metric, computed individually for each column using cross-
validation, on a sample of up to around 1000 rows. A score of 1 indicates perfect predictive abilities,
which often indicates target leakage. A score of 0.5 or lower indicates that the information on the
column could not provide, on its own, any useful information towards predicting the target. Although it
can happen that a column is uninformative on its own but is useful in predicting the target when used in
tandem with other features, a low score could indicate the feature is redundant.
For example, the following image shows a target leakage report for a diabetes classification problem,
that is, predicting if a person has diabetes or not. An AUC - ROC curve is used to calculate the predictive
ability of five features, and all are determined to be safe from target leakage.
Multicollinearity
Multicollinearity is a circumstance where two or more predictor variables are related to each other. The
predictor variables are the features in your dataset that you're using to predict a target variable. When
you have multicollinearity, the predictor variables are not only predictive of the target variable, but also
predictive of each other.
You can use the Variance Inflation Factor (VIF), Principal Component Analysis (PCA), or Lasso feature
selection as measures for the multicollinearity in your data. For more information, see the following.
The Variance Inflation Factor (VIF) is a measure of collinearity among variable pairs. Data Wrangler
returns a VIF score as a measure of how closely the variables are related to each other. A VIF score is
a positive number that is greater than or equal to 1.
A score of 1 means that the variable is uncorrelated with the other variables. Scores greater than 1
indicate higher correlation.
Theoretically, you can have a VIF score with a value of infinity. Data Wrangler clips high scores to 50.
If you have a VIF score greater than 50, Data Wrangler sets the score to 50.
You can use the following guidelines to interpret your VIF scores:
• A VIF score less than 5 indicates that the variable is moderately correlated with the other variables.
• A VIF score of 5 or greater indicates that the variable is highly correlated with the other variables.
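As a rough sketch of what a VIF computation looks like, you can regress each feature on the remaining features and apply VIF = 1 / (1 - R2), clipping at 50 as described above. This assumes numeric columns and scikit-learn; it is not Data Wrangler's implementation.

import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_scores(df: pd.DataFrame) -> pd.Series:
    scores = {}
    for col in df.columns:
        # Regress the column on all of the other columns
        X, y = df.drop(columns=[col]), df[col]
        r2 = LinearRegression().fit(X, y).score(X, y)
        # VIF = 1 / (1 - R^2); clip to 50 to mirror the behavior described above
        scores[col] = min(1.0 / max(1.0 - r2, 1e-12), 50.0)
    return pd.Series(scores)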
Principal Component Analysis (PCA) measures the variance of the data along different directions in
the feature space. The feature space consists of all the predictor variables that you use to predict the
target variable in your dataset.
For example, if you're trying to predict who survived on the RMS Titanic after it hit an iceberg, your
feature space can include the passengers' age, gender, and the fare that they paid.
From the feature space, PCA generates an ordered list of variances. These variances are also known
as singular values. The values in the list of variances are greater than or equal to 0. You can use them to determine how much multicollinearity there is in your data.
When the values are roughly uniform, the data has very few instances of multicollinearity. When there is a lot of variability among the values, the data has many instances of multicollinearity. Before it
performs PCA, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation
of 1.
Note
PCA in this circumstance can also be referred to as Singular Value Decomposition (SVD).
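The following minimal sketch shows the underlying computation: standardize each feature and inspect the singular values. It assumes a numeric matrix with no constant columns; a rapid drop-off among the values suggests multicollinearity.

import numpy as np

def singular_values(X: np.ndarray) -> np.ndarray:
    # Normalize each feature to a mean of 0 and a standard deviation of 1
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # The singular values form the ordered list of variances discussed above
    return np.linalg.svd(Z, compute_uv=False)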
Lasso feature selection
Lasso feature selection uses the L1 regularization technique to only include the most predictive
features in your dataset.
For both classification and regression, the regularization technique generates a coefficient for
each feature. The absolute value of the coefficient provides an importance score for the feature; a higher importance score indicates that the feature is more predictive of the target variable. A common feature selection method is to use all the features that have a non-zero lasso coefficient.
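A minimal scikit-learn sketch of this selection method, assuming numeric features and a regression target (not Data Wrangler's implementation):

import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_selected_features(df: pd.DataFrame, target: str, alpha: float = 0.01) -> pd.Series:
    # Standardize the features so the coefficients are comparable
    X = StandardScaler().fit_transform(df.drop(columns=[target]))
    coefs = Lasso(alpha=alpha).fit(X, df[target]).coef_
    # The absolute coefficient is the importance score; keep non-zero features
    importance = pd.Series(abs(coefs), index=df.columns.drop(target))
    return importance[importance > 0].sort_values(ascending=False)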
For the error term, you specify a threshold as the number of standard deviations the residual can be away from the mean for it to be considered an anomaly. For example, if you specify a threshold of 3 standard deviations, any residual greater than 3 standard deviations away from the mean is an anomaly.
You can use the following procedure to perform an Anomaly detection analysis.
5. For Anomaly threshold, specify the threshold at which a value is considered an anomaly.
6. Choose Preview to generate a preview of the analysis.
7. Choose Add to add the transform to the Data Wrangler data flow.
You can use the following procedure to perform a Seasonal-Trend decomposition analysis.
Bias Report
You can use the bias report in Data Wrangler to uncover potential biases in your data. To generate a bias
report, you must specify the target column, or Label, that you want to predict and a Facet, or the column
that you want to inspect for biases.
Label: The feature about which you want a model to make predictions. For example, if you are predicting
customer conversion, you may select a column containing data on whether or not a customer has placed
an order. You must also specify whether this feature is a label or a threshold. If you specify a label, you
must specify what a positive outcome looks like in your data. In the customer conversion example, a
positive outcome may be a 1 in the orders column, representing the positive outcome of a customer
placing an order within the last three months. If you specify a threshold, you must specify a lower bound
defining a positive outcome. For example, if your customer orders column contains the number of
orders placed in the last year, you may want to specify 1.
Facet: The column that you want to inspect for biases. For example, if you are trying to predict customer
conversion, your facet may be the age of the customer. You may choose this facet because you believe
that your data is biased toward a certain age group. You must identify whether the facet is measured as a
value or threshold. For example, if you wanted to inspect one or more specific ages, you select Value and
specify those ages. If you want to look at an age group, you select Threshold and specify the threshold
of ages you want to inspect.
After you select your feature and label, you select the types of bias metrics you want to calculate.
You must provide the output variable, chart, to store an Altair output chart. For example, you can use
the following code block to create a custom histogram using the Titanic dataset.
1. Next to the node containing the transformation that you'd like to visualize, choose the +.
2. Choose Add analysis.
3. For Analysis type, choose Custom Visualization.
4. For Analysis name, specify a name.
5. Enter your code in the code box.
6. Choose Preview to preview your visualization.
7. Choose Save to add your visualization.
If you don’t know how to use the Altair visualization package in Python, you can use custom code
snippets to help you get started.
Data Wrangler has a searchable collection of visualization snippets. To use a visualization snippet, choose
Search example snippets and specify a query in the search bar.
The following example uses the Binned scatterplot code snippet, which plots a two-dimensional histogram. The snippets have comments to help you understand the changes that you need to make to the code. You usually need to specify the column names of your dataset in the code.
# df is the dataframe available to the custom visualization.
import altair as alt

chart = (
    alt.Chart(df)
    .mark_circle()
    .encode(
        # Specify the column names for binning and number of bins for X and Y axis
        x=alt.X("col1:Q", bin=alt.Bin(maxbins=20)),
        y=alt.Y("col2:Q", bin=alt.Bin(maxbins=20)),
        size="count()",
    )
)
After you've created a Data Wrangler flow, you might have trained a model on the data that you've transformed. For datasets that have the same schema, you can use parameters to apply the same transformations to a different dataset and train a different model. You can use the new datasets to perform inference with your model, or to retrain your model.
Note
Datetime parameters have a time range attribute that they use as the default value.
Data Wrangler uses curly braces, {{}}, to indicate that a parameter is being used in
the Amazon S3 path. For example, you can have a URL such as s3://DOC-EXAMPLE-
BUCKET1/{{example_parameter_name}}/example-dataset.csv.
You create a parameter when you're editing the Amazon S3 data source that you've imported. You can
set any portion of the file path to a parameter value. You can set the parameter value to either a value or
a pattern. The following are the available parameter value types in the Data Wrangler flow:
• Number
• String
• Pattern
• Datetime
Note
You can't create a pattern parameter or a datetime parameter for the name of the bucket in the
Amazon S3 path.
You must set a number as the default value of a number parameter. You can change the value of
the parameter to a different number when you're editing a parameter or when you're launching a
processing job. For example, in the S3 path, s3://DOC-EXAMPLE-BUCKET/example-prefix/
example-file-1.csv, you can create a number parameter named number_parameter in the place
of 1. Your S3 path now appears as s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-
{{number_parameter}}.csv. The path continues to point to the example-file-1.csv dataset
until you change the value of the parameter. If you change the value of number_parameter to 2, the path is now s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-2.csv. You can import example-file-2.csv into Data Wrangler if you've uploaded the file to that Amazon S3 location.
A string parameter stores a string as its default value. For example, in the S3 path, s3://DOC-EXAMPLE-
BUCKET/example-prefix/example-file-1.csv, you can create a string parameter named
string_parameter in the place of the filename, example-file-1.csv. The path now appears as s3://DOC-EXAMPLE-BUCKET/example-prefix/{{string_parameter}}. It continues to match s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-1.csv until you change the value of the parameter.
Instead of specifying the filename as a string parameter, you can create a string parameter using the
entire Amazon S3 path. You can specify a dataset from any Amazon S3 location in the string parameter.
A pattern parameter stores a regular expression (Python REGEX) string as its default value. You can use
a pattern parameter to import multiple data files at the same time. To import more than one object at a
time, specify a parameter value that matches the Amazon S3 objects that you're importing.
For example, you can create a pattern parameter with the value example-file-\d+\.csv to match the following datasets:
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/example-file-1.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/example-file-2.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/example-file-10.csv
• s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-0123.csv
You can also use pattern parameters to match all CSV objects within your bucket. To match all objects in a bucket, create a pattern parameter with the default value of .* and set the path to s3://DOC-EXAMPLE-BUCKET/{{pattern_parameter}}.csv. The .* pattern matches any sequence of characters in the path. For example, it matches datasets with the following filenames:
• example-file-1.csv
• other-example-file.csv
• example-file-a.csv
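If you want to check how a pattern parameter value behaves before using it, you can test it as a Python regular expression. The following sketch assumes the .* pattern combined with the fixed .csv suffix from the preceding path:

import re

# The parameter value .* plus the fixed .csv suffix in the S3 path
pattern = re.compile(r".*\.csv")

keys = ["example-file-1.csv", "other-example-file.csv", "example-file-a.csv"]
matched = [key for key in keys if pattern.fullmatch(key)]
print(matched)  # all three filenames match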
A datetime parameter with a relative time range can match datasets stored under date-based Amazon S3 prefixes, such as the following:
• s3://DOC-EXAMPLE-BUCKET/2021/01/01/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET/2021/06/30/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET/2021/12/31/example-dataset.csv
The datetime values within a relative time range change as time passes. The S3 paths that fall within the
relative time range might also differ.
To view a table of all the parameters that you've created in a Data Wrangler flow, choose {{}} to the right of the text box containing the Amazon S3 path. If you no longer need a parameter that you've created, you can edit or delete it. To edit or delete a parameter, choose the icons to the right of the parameter.
Important
Before you delete a parameter, make sure that you haven't used it anywhere in your Data
Wrangler flow. Deleted parameters that are still within the flow cause errors.
You can create parameters for any step of your Data Wrangler flow. You can edit or delete any parameter
that you create. If you're applying transformations to data that is no longer relevant to your use case,
you can modify the values of parameters. Modifying the values of the parameters changes the data that
you're importing.
The following sections provide additional examples and general guidance on using parameters. You can
use the sections to understand the parameters that work best for you.
Note
The following sections contain procedures that use the Data Wrangler interface to override the
parameters and create a processing job.
You can also override the parameters by using the following procedures.
To export your Data Wrangler flow and override the value of a parameter, do the following.
4. Under parameter_overrides, specify different values for the parameters that you've
created.
5. Run the Jupyter Notebook.
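The parameter_overrides cell follows the general shape of a Python dictionary that maps parameter names to new values. The names below are hypothetical placeholders for the parameters that you've created:

# Hypothetical example values; replace the keys with your own parameter names
parameter_overrides = {
    "number_parameter": "2",
    "string_parameter": "example-file-2.csv",
}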
You can replace each number in the following paths with a single parameter that has a value of \d+.
• s3://DOC-EXAMPLE-BUCKET1/example-prefix-3/example-prefix-4/example-prefix-5/
example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix-8/example-prefix-12/example-
prefix-13/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix-4/example-prefix-9/example-
prefix-137/example-dataset.csv
The following procedure creates a pattern parameter for a dataset with the path s3://DOC-EXAMPLE-
BUCKET1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv.
You might have transformations in your Data Wrangler flow that you've applied to datasets under example-prefix-1. You might want to apply the same transformations to the example-dataset.csv that falls under example-prefix-10 or example-prefix-20.
You can create a parameter that stores the value 1. If you want to apply the transformations to different
datasets, you can create processing jobs that replace the value of the parameter with a different value.
The parameter acts as a placeholder for you to change when you want to apply the transformations
from your Data Wrangler flow to new data. You can override the value of the parameter when you create
a Data Wrangler processing job to apply the transformations in your Data Wrangler flow to different
datasets.
After you've created the parameters, apply the transforms to your dataset and create a destination node
for them. For more information about destination nodes, see Export (p. 1116).
Use the following procedure to apply the transformations from your Data Wrangler flow to a dataset under a different prefix. It assumes that you've created a destination node for the transformations in your flow.
To change the value of a numeric parameter in a Data Wrangler processing job, do the following.
4. Choose Parameters.
5. Choose the name of a parameter that you've created.
6. Change the value of the parameter.
7. Repeat the procedure for the other parameters.
8. Choose Run.
You might have transformations from your Data Wrangler flow that you've applied to datasets under
example-prefix. You might want to apply the same transformations to example-dataset.csv
under another-example-prefix or example-prefix-20.
You can create a parameter that stores the value example-prefix. If you want to apply the
transformations to different datasets, you can create processing jobs that replace the value of the
parameter with a different value. The parameter acts as a placeholder for you to change when you want
to apply the transformations from your Data Wrangler flow to new data. You can override the value of
the parameter when you create a Data Wrangler processing job to apply the transformations in your
Data Wrangler flow to different datasets.
After you've created the parameter, apply the transforms to your dataset and create a destination node
for them. For more information about destination nodes, see Export (p. 1116).
Use the following procedure to apply the transformations from your Data Wrangler flow to a dataset under a different prefix. It assumes that you've created a destination node for the transformations in your flow.
To change the value of a string parameter in a Data Wrangler processing job, do the following:
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2022/03/15/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2022/01/08/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2022/07/31/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2021/09/07/example-dataset.csv
The transformations in the Data Wrangler flow apply to all of the preceding prefixes. Changing the value
of the parameter in the processing job doesn't change the value of the parameter in the Data Wrangler
flow. To apply the transformations to datasets within a different time range, do the following:
1. Create a destination node containing all the transformations that you'd like to use.
2. Create a Data Wrangler job.
3. Configure the job to use a different time range for the parameter. Changing the value of the
parameter in the processing job doesn't change the value of the parameter in the Data Wrangler flow.
For more information about destination nodes and Data Wrangler jobs, see Export (p. 1116).
The following procedure creates a datetime parameter for the Amazon S3 path: s3://DOC-EXAMPLE-
BUCKET1/example-prefix/2022/05/15/example-dataset.csv.
To create a datetime parameter for the preceding S3 URI path, do the following.
10. (Optional) Enter a description to describe how you're using the parameter.
11. Choose Create.
After you've created the datetime parameters, apply the transforms to your dataset and create a
destination node for them. For more information about destination nodes, see Export (p. 1116).
Use the following procedure to apply the transformations from your Data Wrangler flow to a different
time range. It assumes that you've created a destination node for the transformations in your flow.
To change the value of a datetime parameter in a Data Wrangler processing job, do the following:
Export
In your Data Wrangler flow, you can export some or all of the transformations that you've made into your data processing pipelines.
A Data Wrangler flow is the series of data preparation steps that you've performed on your data. In your
data preparation, you perform one or more transformations to your data. Each transformation is done
using a transform step. The flow has a series of nodes that represent the import of your data and the
transformations that you've performed. For an example of nodes, see the following image.
The preceding image shows a Data Wrangler flow with two nodes. The Source - sampled node shows the
data source from which you've imported your data. The Data types node indicates that Data Wrangler
has performed a transformation to convert the dataset into a usable format.
Each transformation that you add to the Data Wrangler flow appears as an additional node. For
information on the transforms that you can add, see Transform Data (p. 1058). The following image
shows a Data Wrangler flow that has a Rename-column node to change the name of a column in a
dataset.
You can export your Data Wrangler flow to the following locations:
• Amazon S3
• SageMaker Pipelines
• Amazon SageMaker Feature Store
• Python Code
Important
We recommend that you use the AmazonSageMakerFullAccess IAM managed policy to grant permission to use Data Wrangler. If you don't use the managed policy, you can use an IAM policy that gives Data Wrangler access to an Amazon S3 bucket. For more information on the policy, see Security and Permissions (p. 1141).
When you export your data flow, you're charged for the AWS resources that you use. You can use cost allocation tags to organize and manage the costs of those resources. You create these tags for your user profile, and Data Wrangler automatically applies them to the resources used to export the data flow. For more information, see Using Cost Allocation Tags.
Export to Amazon S3
Data Wrangler gives you the ability to export your data to a location within an Amazon S3 bucket. You
can specify the location using one of the following methods:
• Destination node – Where Data Wrangler stores the data after it has processed it.
• Export to – Exports the data resulting from a transformation to Amazon S3.
• Export data – For small datasets, quickly exports the data that you've transformed.
Use the following sections to learn more about each of these methods.
Destination Node
If you want to output a series of data processing steps that you've performed to Amazon S3, you
create a destination node. A destination node tells Data Wrangler where to store the data after
you've processed it. After you create a destination node, you create a processing job to output the
data. A processing job is an Amazon SageMaker processing job. When you're using a destination node, it allocates the computational resources needed to output the data that you've transformed to Amazon S3.
You can use a destination node to export some of the transformations or all of the transformations
that you've made in your Data Wrangler flow.
You can use multiple destination nodes to export different transformations or sets of
transformations. The following example shows two destination nodes in a single Data Wrangler flow.
You can use the following procedure to create destination nodes and export them to an Amazon S3
bucket.
To export your data flow, you create destination nodes and a Data Wrangler job to export the data.
Creating a Data Wrangler job starts a SageMaker processing job to export your flow. You can choose
the destination nodes that you want to export after you've created them.
Note
You can choose Create job in the Data Wrangler flow to view the instructions to use a
processing job.
1. Choose the + next to the nodes that represent the transformations that you want to export.
2. Choose Add destination.
• Dataset name – The name that you specify for the dataset that you're exporting.
• File type – The format of the file that you're exporting.
• Delimiter (CSV and Parquet files only) – The value used to separate values in the file.
• Compression (CSV and Parquet files only) – The compression method used to reduce the file
size. You can use the following compression methods:
• bzip2
• deflate
• gzip
• (Optional) Amazon S3 location – The S3 location that you're using to output the files.
• (Optional) Number of partitions – The number of datasets that you're writing as the output
of the processing job.
• (Optional) Partition by column – Writes all data with the same unique value from the column.
• (Optional) Inference Parameters – Selecting Generate inference artifact applies all of the
transformations you've used in the Data Wrangler flow to data coming into your inference
pipeline. The model in your pipeline makes predictions on the transformed data.
5. Choose Add destination.
Create a job from the Data flow page and choose the destination nodes that you want to export.
Note
You can choose Create job in the Data Wrangler flow to view the instructions for creating a
processing job.
1. Choose Create job. The following image shows the pane that appears after you select Create
job.
For more information about refitting the transformations you've made to an entire dataset, see
Refit Transforms to The Entire Dataset and Export Them (p. 1132).
Note
For image data, Data Wrangler exports the transformations that you've made to all of the images. Refitting the transformations isn't applicable to image data.
6. Choose Configure job. The following image shows the Configure job page.
7. (Optional) Configure the Data Wrangler job. You can make the following configurations:
• Job configuration
• Spark memory configuration
• Network configuration
• Tags
• Parameters
• Associate Schedules
8. Choose Run.
Export to
As an alternative to using a destination node, you can use the Export to option to export your Data
Wrangler flow to Amazon S3 using a Jupyter notebook. You can choose any data node in your Data
Wrangler flow and export it. Exporting the data node exports the transformation that the node
represents and the transformations that precede it.
Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler
flow to Amazon S3.
When you run the notebook, it exports your data flow (.flow file) in the same AWS Region as the
Data Wrangler flow.
The notebook provides options that you can use to configure the processing job and the data that it
outputs.
Important
We provide you with job configurations to configure the output of your data. For the
partitioning and driver memory options, we strongly recommend that you don't specify a
configuration unless you already have knowledge about them.
• output_content_type – The content type of the output file. Uses CSV as the default format,
but you can specify Parquet.
• delimiter – The character used to separate values in the dataset when writing to a CSV file.
• compression – If set, compresses the output file. Uses gzip as the default compression format.
• num_partitions – The number of partitions or files that Data Wrangler writes as the output.
• partition_by – The names of the columns that you use to partition the output.
To change the output file format from CSV to Parquet, change the value from "CSV" to "Parquet".
For the rest of the preceding fields, uncomment the lines containing the fields that you want to
specify.
Under (Optional) Configure Spark Cluster Driver Memory you can configure Spark properties for
the job, such as the Spark driver memory, in the config dictionary.
import json

# driver_memory_in_mb is defined earlier in the notebook
config = json.dumps({
    "Classification": "spark-defaults",
    "Properties": {
        "spark.driver.memory": f"{driver_memory_in_mb}m",
    }
})
To apply the configuration to the processing job, uncomment the following lines:
# data_sources.append(ProcessingInput(
#     source=config_s3_uri,
#     destination="/opt/ml/processing/input/conf",
#     input_name="spark-config",
#     s3_data_type="S3Prefix",
#     s3_input_mode="File",
#     s3_data_distribution_type="FullyReplicated"
# ))
Export data
If you have a transformation on a small dataset that you want to export quickly, you can use the Export data method. When you choose Export data, Data Wrangler works synchronously to export the data that you've transformed to Amazon S3. You can't use Data Wrangler until it finishes exporting your data or you cancel the operation.
For information on using the Export data method in your Data Wrangler flow, see the following
procedure.
1. Choose a node in your Data Wrangler flow by opening (double-clicking on) it.
When you export your data flow to an Amazon S3 bucket, Data Wrangler stores a copy of the flow file in the S3 bucket under the data_wrangler_flows prefix. If you use the default Amazon S3 bucket to store your flow files, Data Wrangler uses the following naming convention: sagemaker-region-account-number. For example, if your account number is 111122223333 and you are using Studio in us-east-1, your .flow files created in us-east-1 are stored in s3://sagemaker-us-east-1-111122223333/data_wrangler_flows/.
When you export one or more steps from your data flow to SageMaker Pipelines, Data Wrangler creates
a Jupyter notebook that you can use to define, instantiate, run, and manage a pipeline.
Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler
flow to SageMaker Pipelines.
You can use the Jupyter notebook that Data Wrangler produces to define a pipeline. The pipeline
includes the data processing steps that are defined by your Data Wrangler flow.
You can add additional steps to your pipeline by adding steps to the steps list in the following code in
the notebook:
from sagemaker.workflow.pipeline import Pipeline

# pipeline_name, instance_type, instance_count, and step_process are defined
# earlier in the notebook that Data Wrangler generates.
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=[step_process],  # Add more steps to this list to run in your Pipeline
)
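After the pipeline is defined, the notebook typically registers and starts it with calls such as the following. This is a sketch; role_arn is assumed to be an IAM role ARN with SageMaker permissions that is defined earlier in the notebook.

# Create or update the pipeline definition in SageMaker, then start a run
pipeline.upsert(role_arn=role_arn)
execution = pipeline.start()
execution.wait()  # optionally block until the run finishes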
The pipeline provides the ability to perform either batch or real-time inference. You can also add the
Data Wrangler flow to SageMaker Model Registry. For more information about hosting models, see Host
multiple models in one container behind one endpoint (p. 2205).
When you export one or more steps from your data flow to an inference endpoint, Data Wrangler creates
a Jupyter notebook that you can use to define, instantiate, run, and manage the inference pipeline.
Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Python code.
You might need to configure the Python script to make it run in your pipeline. For example, if you're running the script in a Spark environment, make sure that you are running it from an environment that has permission to access AWS resources.
A core concept in Feature Store is a feature group. A feature group is a collection of features, their
records (observations), and associated metadata. It's similar to a table in a database.
You can use Data Wrangler to do either of the following:
• Update an existing feature group with new records. A record is an observation in the dataset.
• Create a new feature group from a node in your Data Wrangler flow. Data Wrangler adds the observations from your datasets as records in your feature group.
If you're updating an existing feature group, your dataset's schema must match the schema of the
feature group. All the records in the feature group are replaced with the observations in your dataset.
You can use either a Jupyter notebook or a destination node to update your feature group with the
observations in the dataset.
If your feature groups with the Iceberg table format have a custom offline store encryption key, make sure that you grant the IAM role that you're using for the Amazon SageMaker processing job permissions to use it. At a minimum, you must grant it permissions to encrypt the data that you're writing to Amazon S3. To grant the permissions, give the IAM role the ability to use the GenerateDataKey operation. For more information about granting IAM roles permissions to use AWS KMS keys, see https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html.
Destination Node
If you want to output a series of data processing steps that you've performed to a feature group, you
can create a destination node. When you create and run a destination node, Data Wrangler updates
a feature group with your data. You can also create a new feature group from the destination node
UI. After you create a destination node, you create a processing job to output the data. A processing
job is an Amazon SageMaker processing job. When you're using a destination node, it runs the
computational resources needed to output the data that you've transformed to the feature group.
You can use a destination node to export some of the transformations or all of the transformations
that you've made in your Data Wrangler flow.
Use the following procedure to create a destination node to update a feature group with the
observations from your dataset.
1. Choose the + symbol next to the node containing the dataset that you'd like to export.
2. Under Add destination, choose SageMaker Feature Store.
3. Choose (double-click) the feature group. Data Wrangler checks whether the schema of the
feature group matches the schema of the data that you're using to update the feature group.
4. (Optional) Select Export to offline store only for feature groups that have both an online store
and an offline store. This option only updates the offline store with observations from your
dataset.
5. After Data Wrangler validates the schema of your dataset, choose Add.
Use the following procedure to create a new feature group with data from your dataset.
You can store your feature group in one of the following ways:
• Online – Low-latency, high-availability cache for a feature group that provides real-time lookup of
records. The online store allows quick access to the latest value for a record in a feature group.
• Offline – Stores data for your feature group in an Amazon S3 bucket. You can store your data
offline when you don't need low-latency (sub-second) reads. You can use an offline store for
features used in data exploration, model training, and batch inference.
• Both online and offline – Stores your data in both an online store and an offline store.
1. Choose the + symbol next to the node containing the dataset that you'd like to export.
2. Under Add destination, choose SageMaker Feature Store.
3. Choose Create Feature Group.
4. In the following dialog box, if your dataset doesn't have an event time column, select Create
"EventTime" column.
5. Choose Next.
6. Choose Copy JSON Schema. When you create a feature group, you paste the schema into the
feature definitions.
7. Choose Create.
8. For Feature group name, specify a name for your feature group.
9. For Description (optional), specify a description to make your feature group more discoverable.
10. To create a feature group for an offline store, do the following.
a. Select Enable storage offline. Specify values for the following fields:
• S3 bucket name – The name of the Amazon S3 bucket that stores the feature group.
• (Optional) Dataset directory name – The Amazon S3 prefix that you're using to store the
feature group.
• IAM Role ARN – The IAM role that has access to Feature Store.
• Table Format – Table format of your offline store. You can specify Glue or Iceberg. Glue
is the default format.
• Offline store encryption key – By default, Feature Store uses an AWS Key Management
Service managed key, but you can use the field to specify a key of your own.
b. Specify values for the following fields:
• S3 bucket name – The name of the bucket storing the feature group.
• (Optional) Dataset directory name – The Amazon S3 prefix that you're using to store the
feature group.
• IAM Role ARN – The IAM role that has access to feature store.
• Offline store encryption key – By default, Feature Store uses an AWS managed key, but
you can use the field to specify a key of your own.
12. Choose Continue.
13. Choose JSON.
14. Remove the placeholder brackets in the window.
15. Paste the JSON text from Step 6.
16. Choose Continue.
17. For RECORD IDENTIFIER FEATURE NAME, choose the column in your dataset that has unique
identifiers for each record in your dataset.
18. For EVENT TIME FEATURE NAME, choose the column with the timestamp values.
19. Choose Continue.
20. (Optional) Add tags to make your feature group more discoverable.
21. Choose Continue.
22. Choose Create feature group.
23. Navigate back to your Data Wrangler flow and choose the refresh icon next to the Feature
Group search bar.
Note
If you've already created a destination node for a feature group within a flow, you can't
create another destination node for the same feature group. If you want to create another
destination node for the same feature group, you must create another flow file.
Create a job from the Data flow page and choose the destination nodes that you want to export.
1. Choose Create job. The following image shows the pane that appears after you select Create
job.
2. For Job name, specify the name of the export job.
3. Choose the destination nodes that you want to export.
4. (Optional) For Output KMS Key, specify an ARN, ID, or alias of an AWS KMS key. A KMS key is
a cryptographic key. You can use the key to encrypt the output data from the job. For more
information about AWS KMS keys, see AWS Key Management Service.
5. The following image shows the Configure job page with the Job configuration tab open.
(Optional) Under Trained parameters, choose Refit if you've done the following:
For more information about refitting the transformations you've made to an entire dataset, see
Refit Transforms to The Entire Dataset and Export Them (p. 1132).
6. Choose Configure job.
7. (Optional) Configure the Data Wrangler job. You can make the following configurations:
• Job configuration
• Spark memory configuration
• Network configuration
• Tags
• Parameters
• Associate Schedules
8. Choose Run.
Jupyter notebook
Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Amazon SageMaker Feature Store.
Running the Jupyter notebook runs a Data Wrangler job. Running a Data Wrangler job starts a SageMaker processing job. The processing job ingests the data from the flow into an online and offline feature store.
Important
The IAM role you use to run this notebook must have the following
AWS managed policies attached: AmazonSageMakerFullAccess and
AmazonSageMakerFeatureStoreAccess.
You need to enable at least one of the online or offline stores when you create a feature group; you can also enable both. To disable online store creation, set EnableOnlineStore to False:
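As an illustration, the following hedged sketch shows where EnableOnlineStore fits when creating a feature group with the low-level SageMaker client. All names, columns, and ARNs are placeholders, and the exported notebook's own code may differ.

import boto3

sagemaker_client = boto3.client("sagemaker")
sagemaker_client.create_feature_group(
    FeatureGroupName="example-feature-group",
    RecordIdentifierFeatureName="record_id",
    EventTimeFeatureName="EventTime",
    FeatureDefinitions=[
        {"FeatureName": "record_id", "FeatureType": "String"},
        {"FeatureName": "EventTime", "FeatureType": "String"},
    ],
    OnlineStoreConfig={"EnableOnlineStore": False},  # disable the online store
    OfflineStoreConfig={
        "S3StorageConfig": {"S3Uri": "s3://DOC-EXAMPLE-BUCKET1/feature-store/"}
    },
    RoleArn="arn:aws:iam::111122223333:role/example-feature-store-role",
)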
The notebook uses the column names and types of the dataframe you export to create a feature
group schema, which is used to create a feature group. A feature group is a group of features
defined in the feature store to describe a record. The feature group defines the schema and features
contained in the feature group. A feature group definition is composed of a list of features, a record
identifier feature name, an event time feature name, and configurations for its online store and
offline store.
Each feature in a feature group can have one of the following types: String, Fractional, or Integral. If
a column in your exported dataframe is not one of these types, it defaults to String.
column_schema = [
    {
        "name": "Height",
        "type": "long"
    },
    {
        "name": "Input",
        "type": "string"
    },
    {
        "name": "Output",
        "type": "string"
    },
    {
        "name": "Sum",
        "type": "string"
    },
    {
        "name": "Time",
        "type": "string"
    }
]
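To see how such a schema maps onto Feature Store feature types, the following sketch converts the column schema above into feature definitions, defaulting unrecognized types to String as described. The type mapping here is illustrative, not the notebook's exact code.

# Illustrative mapping from dataframe column types to Feature Store types
type_map = {"long": "Integral", "float": "Fractional", "string": "String"}

feature_definitions = [
    {
        "FeatureName": column["name"],
        "FeatureType": type_map.get(column["type"], "String"),  # default to String
    }
    for column in column_schema
]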
Additionally, you must specify a record identifier name and event time feature name:
• The record identifier name is the name of the feature whose value uniquely identifies a record defined in the feature store. Only the latest record per identifier value is stored in the online store. The record identifier feature name must be the name of one of the feature definitions.
• The event time feature name is the name of the feature that stores the EventTime of a record in a feature group. An EventTime is a point in time when a new event occurs that corresponds to the creation or update of a record in a feature group. All records in the feature group must have a corresponding EventTime.
The notebook uses these configurations to create a feature group, process your data at scale, and then ingest the processed data into your online and offline feature stores. To learn more, see Data Sources and Ingestion.
The following transformations use your data to create a column in the dataset:
If you used sampling to import your data, the preceding transforms only use the data from the sample
to create the column. The transform might not have used all of the relevant data. For example, if you use
the Encode Categorical transform, there might have been a category in the entire dataset that wasn't
present in the sample.
You can either use a destination node or a Jupyter notebook to refit the transformations to the entire
dataset. When Data Wrangler exports the transformations in the flow, it creates a SageMaker processing
job. When the processing job finishes, Data Wrangler saves the following files in either the default
Amazon S3 location or an S3 location that you specify:
• The Data Wrangler flow file that specifies the transformations that are refit to the dataset
• The dataset with the refit transformations applied to it
You can open a Data Wrangler flow file within Data Wrangler and apply the transformations to a
different dataset. For example, if you've applied the transformations to a training dataset, you can open
and use the Data Wrangler flow file to apply the transformations to a dataset used for inference.
For information about using destination nodes to refit transforms and export them, see the following pages:
Use the following procedure to run a Jupyter notebook that refits the transformations and exports your Data Wrangler flow.
When you create a job you must specify an IAM role that has permissions to create the job. By default,
the IAM role that you use to access Data Wrangler is the SageMakerExecutionRole.
The following permissions allow Data Wrangler to access EventBridge and allow EventBridge to run
processing jobs:
• Add the following AWS Managed policy to the Amazon SageMaker Studio execution role that provides
Data Wrangler with permissions to use EventBridge:
arn:aws:iam::aws:policy/AmazonEventBridgeFullAccess
For more information about the policy, see AWS managed policies for EventBridge.
• Add the following policy to the IAM role that you specify when you create a job in Data Wrangler:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:StartPipelineExecution",
            "Resource": "arn:aws:sagemaker:Region:AWS-account-id:pipeline/data-wrangler-*"
        }
    ]
}
If you're using the default IAM role, you add the preceding policy to the Amazon SageMaker Studio
execution role.
Add the following trust policy to the role to allow EventBridge to assume it.
{
    "Effect": "Allow",
    "Principal": {
        "Service": "events.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
}
Important
When you create a schedule, Data Wrangler creates an event rule in EventBridge. You incur charges for both the event rules that you create and the instances used to run the processing job.
For information about EventBridge pricing, see Amazon EventBridge pricing. For information
about processing job pricing, see Amazon SageMaker Pricing.
• CRON expressions
Note
Data Wrangler doesn't support the following expressions:
• The L, W, and # characters
• Abbreviations for days
• Abbreviations for months
• RATE expressions
• Recurring – Set an hourly or daily interval to run the job.
• Specific time – Set specific days and times to run the job.
CRON
• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on the schedules that you specify.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.
13. Choose Run.
RATE
• Minutes
• Hours
• Days
11. Choose Create.
12. (Optional) Choose Add another schedule to run the job on an additional schedule.
Note
You can associate a maximum of two schedules. The schedules are independent and
don't affect each other unless the times overlap.
13. Choose one of the following:
• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on the schedules that you specify.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.
14. Choose Run.
Recurring
Use the following procedure to create a schedule that runs a job on a recurring basis.
• Every Day
• Weekends
• Weekdays
• Select Days
• (Optional) If you've selected Select Days, choose the days of the week to run the job.
Note
The schedule resets every day. If you schedule a job to run every five hours, it runs at
the following times during the day:
• 00:00
• 05:00
• 10:00
• 15:00
• 20:00
11. Choose Create.
12. (Optional) Choose Add another schedule to run the job on an additional schedule.
Note
You can associate a maximum of two schedules. The schedules are independent and
don't affect each other unless the times overlap.
13. Choose one of the following:
• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on the schedules that you specify.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.
Specific time
Use the following procedure to create a schedule that runs a job at specific times.
• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on the schedules that you specify.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.
11. Choose Run.
You can use Amazon SageMaker Studio to view the jobs that are scheduled to run. Your processing jobs run
within SageMaker Pipelines. Each processing job has its own pipeline. It runs as a processing step within
the pipeline. You can view the schedules that you've created within a pipeline. For information about
viewing a pipeline, see View a Pipeline (p. 2792).
Use the following procedure to view the jobs that you've scheduled.
The pipeline running the job uses the job name as a prefix. For example, if you've created a job named housing-data-feature-engineering, the name of the pipeline is data-wrangler-housing-data-feature-engineering.
4. Choose the pipeline containing your job.
5. View the status of the pipelines. Pipelines with a Status of Succeeded have run the processing job
successfully.
To stop a processing job from running, delete the event rule that specifies the schedule. Deleting an
event rule stops all the jobs associated with the schedule from running. For information about deleting a
rule, see Disabling or deleting an Amazon EventBridge rule.
You can stop and delete the pipelines associated with the schedules as well. For information about
stopping a pipeline, see StopPipelineExecution. For information about deleting a pipeline, see
DeletePipeline.
You can access the data preparation widget from an Amazon SageMaker Studio notebook. For each
column, the widget creates a visualization that helps you better understand its distribution. If a column
has data quality issues, a warning appears in its header.
To see the data quality issues, select the column header showing the warning. You can use the
information that you get from the insights and the visualizations to apply the widget's built-in
transformations to help you fix the issues.
For example, the widget might detect that you have a column that only has one unique value and show
you a warning. The warning provides the option to drop the column from the dataset.
Open a notebook in Amazon SageMaker Studio. For information about opening a notebook, see Create
or Open an Amazon SageMaker Studio Notebook (p. 148).
Important
To run the widget, the notebook must use one of the following images:
For more information about images, see Available Amazon SageMaker Images (p. 164).
Use the following code to import the data preparation widget and pandas. The widget uses pandas
dataframes to analyze your data.
import pandas as pd
import sagemaker_datawrangler
The following example code loads a file into the dataframe called df.
df = pd.read_csv("example-dataset.csv")
You can use a dataset in any format that you can load as a pandas dataframe object. For more
information about pandas formats, see IO tools (text, CSV, HDF5, …).
# Rendering the dataframe displays the interactive data preparation widget.
df
• View the Pandas table – Switches between the interactive visualization and a pandas table.
• Use all of the rows in your dataset to compute the insights. Using the entire dataset might increase
the time it takes to generate the insights. – If you don't select the option, Data Wrangler computes
the insights for the first 10,000 rows of the dataset.
The dataframe shows the first 1000 rows of the dataset. Each column header has a stacked bar chart
that shows the column's characteristics. It shows the proportion of valid values, invalid values, and
missing values. You can hover over the different portions of the stacked bar chart to get the calculated
percentages.
Each column has a visualization in the header. The following shows the types of visualizations the
columns can have:
For each visualization, the data preparation widget highlights outliers in orange.
When you choose a column, it opens a side panel. The side panel shows you the Insights tab. The pane
provides a count for the following types of values:
• Invalid values – Values whose type doesn’t match the column type.
• Missing values – Values that are missing, such as NaN or None.
• Valid values – Values that are neither missing nor invalid.
For numeric columns, the Insights tab shows the following summary statistics:
For categorical columns, the Insights tab shows the following summary statistics:
The columns that have warning icons in their headers have data quality issues. Choosing a column opens
a Data quality tab that you can use to find transforms to help you fix the issue. A warning has one of the
following severity levels:
• Low – Issues that might not affect your analysis, but can be useful to fix.
• Medium – Issues that are likely to affect your analysis, but are likely not critical to fix.
• High – Severe issues that we strongly recommend fixing.
Note
The widget sorts the column to show the values that have data quality issues at the top of the
dataframe. It also highlights the values that are causing the issues. The color of the highlighting
corresponds to the severity level.
Under SUGGESTED TRANSFORMS, you can choose a transform to fix the data quality issue. The widget
can offer multiple transforms that can fix the issue. It can offer recommendations for the transforms that
are best suited to the problem. You can move your cursor over the transform to get more information
about it.
To apply a transform to the dataset, choose Apply and export code. The transform modifies the
dataset and updates the visualization with modified values. The code for the transform appears in the
following cell of the notebook. If you apply additional transforms to the dataset, the widget appends the
transforms to the cell. You can use the code that the widget generates to do the following:
You can reproduce all the transforms you've made by rerunning all of the cells in the notebook.
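For example, the code that the widget appends for a Drop column transform is ordinary pandas, along the lines of the following. The column name here is a placeholder, and the generated code may differ in detail.

# Pandas code generated by the widget (illustrative)
df = df.drop(columns=["example_constant_column"])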
The widget can provide insights and warnings for the target column. The target column is the column
that you're trying to predict. Use the following procedure to get target column insights.
• Missing values – The column has missing values such as None, NaN (not a number), or NaT (not a timestamp). Many machine learning algorithms don't support missing values in the input data. Filling them in or dropping the rows with missing data is therefore a crucial data preparation step. If you see the missing values warning, you can use one of the following transforms to correct the issue. A pandas sketch of these transforms appears after this list.
• Drop missing – Drops rows with missing values. We recommend dropping rows when the percentage
of rows with missing data is small and imputing the missing values isn't appropriate.
• Replace with new value – Replaces textual missing values with Other. You can change Other to a
different value in the output code. Replaces numeric missing values with 0.
• Replace with mean – Replaces missing values with the mean of the column.
• Replace with median – Replaces missing values with the median of the column.
• Drop column – Drops the column with missing values from the dataset. We recommend dropping
the entire column when there's a high percentage of rows with missing data.
• Disguised missing values – The column has disguised missing values. A disguised missing value is a
value that isn't explicitly encoded as a missing value. For example, instead of using a NaN to indicate
a missing value, the value could be Placeholder. You can use one of the following transforms to
handle the missing values:
• Drop missing – Drops rows with missing values
• Replace with new value – Replaces textual missing values with Other. You can change Other to a
different value in the output code. Replaces numeric missing values with 0.
• Constant column – The column only has one value. It therefore has no predictive power. We strongly
recommend using the Drop column transform to drop the column from the dataset.
• ID column – The column has no repeating values. All of the values in the column are unique. They
might be either IDs or database keys. Without additional information, the column has no predictive
power. We strongly recommend using the Drop column transform to drop the column from the
dataset.
• High cardinality – The column has a high percentage of unique values. High cardinality limits the
predictive power of categorical columns. Examine the importance of the column in your analysis and
consider using the Drop column transform to drop it.
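The pandas sketch referenced in the missing values warning above shows roughly what the suggested transforms do. The column name is a placeholder, the alternatives are shown as comments, and the exact code that the widget generates may differ.

import pandas as pd

df = pd.read_csv("example-dataset.csv")

# Drop missing – drop rows where the column is missing
# df = df.dropna(subset=["example_column"])

# Replace with new value – fill textual missing values with "Other"
# df["example_column"] = df["example_column"].fillna("Other")

# Replace with mean – fill numeric missing values with the column mean
df["example_column"] = df["example_column"].fillna(df["example_column"].mean())

# Drop column – remove the column entirely
# df = df.drop(columns=["example_column"])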
For the target column, you can get the following insights to warn you about issues with your dataset. You
can use the suggested transformation provided with the warning to correct the issue.
• Mixed data types in target (Regression) – There are some non-numeric values in the target column.
There might be data entry errors. We recommend removing the rows that have the values that can't be
converted.
• Frequent label – Certain values in the target column appear more frequently than what would
be normal in the context of regression. There might be an error in data collection or processing. A
frequently appearing category might indicate that either the value is used as a default value or that
it’s a placeholder for missing values. We recommend using the Replace with new value transform to
replace the missing values with Other.
• Too few instances per class – The target column has categories that appear rarely. Some of the
categories don't have enough rows for the target column to be useful. You can use one of the
following transforms:
• Drop rare target – Drops unique values with fewer than ten observations. For example, drops the
value cat if it appears nine times in the column.
• Replace rare target – Replaces categories that appear rarely in the dataset with the value Other.
• Classes too imbalanced (multi-class classification) – There are categories in the dataset that appear
much more frequently than the other categories. The class imbalance might affect prediction accuracy.
For the most accurate predictions possible, we recommend updating the dataset with rows that have
the categories that currently appear less frequently.
• Large amount of classes/too many classes – There's a large number of classes in the target column.
Having many classes might result in longer training times or poor predictive quality. We recommend
doing one of the following:
• Grouping some of the categories into their own category. For example, if six categories are closely
related, we recommend using a single category for them.
• Using an ML algorithm that's resilient to multiple categories.
For high-level security needs, you can configure a bucket policy that restricts the AWS roles that have access to this default SageMaker S3 bucket. Use the following section to add this type of policy to an S3 bucket. To follow the instructions on this page, use the AWS Command Line Interface (AWS CLI). To learn how, see Configuring the AWS CLI in the AWS Command Line Interface User Guide.
Additionally, you need to grant each IAM role that uses Data Wrangler permissions to access required
resources. If you do not require granular permissions for the IAM role you use to access Data Wrangler,
you can add the IAM managed policy, AmazonSageMakerFullAccess, to an IAM role that you use to
create your Studio user. This policy grants you full permission to use Data Wrangler. If you require more
granular permissions, refer to the section, Grant an IAM Role Permission to Use Data Wrangler (p. 1143).
• Queried Amazon Redshift results. These are stored under the redshift/ prefix.
• Queried Athena results. These are stored under the athena/ prefix.
• The .flow files uploaded to Amazon S3 when you run a Jupyter notebook that Data Wrangler exports. These are stored under the data_wrangler_flows/ prefix.
Use the following procedure to create an S3 bucket policy that you can add to restrict IAM role access to
that bucket. To learn how to add a policy to an S3 bucket, see How do I add an S3 Bucket policy?.
To set up a bucket policy on the S3 bucket that stores your Data Wrangler resources:
1. Configure one or more IAM roles that you want to be able to access Data Wrangler.
2. Open a command prompt or shell. For each role that you create, replace role-name with the name
of the role and run the following:
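The command itself is elided here; it retrieves the role so that you can read its RoleId. An equivalent lookup with boto3 (shown as an illustrative assumption, with role-name as a placeholder) looks like the following:

import boto3

iam = boto3.client("iam")
# The RoleId in the response begins with AROA
role_id = iam.get_role(RoleName="role-name")["Role"]["RoleId"]
print(role_id)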
In the response, you see a RoleId string which begins with AROA. Copy this string.
3. Add the following policy to the SageMaker default bucket in the AWS Region in which you are using
Data Wrangler. Replace region with the AWS Region in which the bucket is located, and account-
id with your AWS account ID. Replace userIds starting with AROAEXAMPLEID with the IDs of an
AWS roles to which you want to grant permission to use Data Wrangler.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/",
        "arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/*",
        "arn:aws:s3:::sagemaker-region-account-id/athena",
        "arn:aws:s3:::sagemaker-region-account-id/athena/*",
        "arn:aws:s3:::sagemaker-region-account-id/redshift",
        "arn:aws:s3:::sagemaker-region-account-id/redshift/*"
      ],
      "Condition": {
        "StringNotLike": {
          "aws:userId": [
            "AROAEXAMPLEID_1:*",
            "AROAEXAMPLEID_2:*"
          ]
        }
      }
    }
  ]
}
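One way to attach this policy from the AWS CLI is with put-bucket-policy. The following is a sketch; the
bucket name and the policy.json file path are placeholders:

aws s3api put-bucket-policy \
  --bucket sagemaker-region-account-id \
  --policy file://policy.json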
Your organization might not provide your users with permissions to make those API calls by default.
To provide permissions, you must create and attach a policy to the user's IAM roles using the following
policy template: Data Wrangler Allow List Example.
Note
The preceding policy example only gives your users access to the Data Wrangler application.
For information about creating a policy, see Creating policies on the JSON tab. When you're creating a
policy, copy and paste the JSON policy from Data Wrangler Allow List Example in the JSON tab.
Important
Delete any IAM policies that prevent users from running the following operations:
• CreateApp
• DescribeApp
If you don't delete the policies, your users could still be affected by them.
After you've created the policy using the template, attach it to the IAM roles of your users. For
information about attaching a policy, see Adding IAM identity permissions (console).
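The following AWS CLI sketch illustrates these two steps; the policy name, file path, and role name are
placeholders:

# Create the policy from the Data Wrangler Allow List Example JSON.
aws iam create-policy \
  --policy-name DataWranglerAllowList \
  --policy-document file://data-wrangler-allow-list.json

# Attach the policy to a user's IAM role.
aws iam attach-role-policy \
  --role-name MyStudioUserRole \
  --policy-arn arn:aws:iam::account-id:policy/DataWranglerAllowList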
• If you import data from Amazon Redshift, the Database User name must have the prefix
sagemaker_access.
• This managed policy only grants permission to access buckets with one of the following in the name:
SageMaker, Sagemaker, sagemaker, or aws-glue. If you want to use Data Wrangler to import from an
S3 bucket without these phrases in the name, refer to the last section on this page to learn how to
grant permission to an IAM entity to access your S3 buckets.
If you have high-security needs, you can attach the policies in this section to an IAM entity to grant
permissions required to use Data Wrangler.
If you have datasets in Amazon Redshift or Athena that an IAM role needs to import into Data Wrangler,
you must add a policy to that entity granting access to these resources. The following policies are the
most restrictive policies you can use to give an IAM role permission to import data from Amazon Redshift
and Athena.
To learn how to attach a custom policy to an IAM role, refer to Managing IAM policies in the IAM User
Guide.
The following policy assumes that the IAM role has permission to access the underlying S3 bucket where
data is stored through a separate IAM policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:ListDataCatalogs",
        "athena:ListDatabases",
        "athena:ListTableMetadata",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:StartQueryExecution",
        "athena:StopQueryExecution"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateTable"
      ],
      "Resource": [
        "arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
        "arn:aws:glue:*:*:table/sagemaker_featurestore/*",
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:DeleteTable"
      ],
      "Resource": [
        "arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": [
        "arn:aws:glue:*:*:table/*",
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:GetDatabase"
      ],
      "Resource": [
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/sagemaker_featurestore",
        "arn:aws:glue:*:*:database/sagemaker_processing",
        "arn:aws:glue:*:*:database/default",
        "arn:aws:glue:*:*:database/sagemaker_data_wrangler"
      ]
    }
  ]
}
The following policy grants permission to set up an Amazon Redshift connection to Data Wrangler using
database users that have the prefix sagemaker_access in the name. To grant permission to connect
using additional database users, add additional entries under "Resource" in the following policy. The
following policy assumes that the IAM role has permission to access the underlying S3 bucket where data
is stored through a separate IAM policy, if applicable.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "redshift-data:ExecuteStatement",
        "redshift-data:DescribeStatement",
        "redshift-data:CancelStatement",
        "redshift-data:GetStatementResult",
        "redshift-data:ListSchemas",
        "redshift-data:ListTables"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "redshift:GetClusterCredentials"
      ],
      "Resource": [
        "arn:aws:redshift:*:*:dbuser:*/sagemaker_access*",
        "arn:aws:redshift:*:*:dbname:*"
      ]
    }
  ]
}
If your dataset is stored in Amazon S3, you can grant an IAM role permission to access this bucket with a
policy similar to the following. This example grants programmatic read-write access to the bucket named
test.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::test"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::test/*"]
    }
  ]
}
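To verify that the role has the intended access, you can run a quick check with the AWS CLI while using
that role. The following is a sketch; the object key and file names are placeholders:

# Write, list, and read back a sample object in the bucket named test.
aws s3api put-object --bucket test --key sample/data.csv --body data.csv
aws s3api list-objects-v2 --bucket test --prefix sample/
aws s3api get-object --bucket test --key sample/data.csv data-copy.csv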
To import data from Athena and Amazon Redshift, you must grant an IAM role permission to access the
following prefixes under the default Amazon S3 bucket in the AWS Region in which Data Wrangler is
being used: athena/ and redshift/. If a default Amazon S3 bucket does not already exist in the AWS
Region, you must also give the IAM role permission to create a bucket in that Region.
Additionally, if you want the IAM role to be able to use the Amazon SageMaker Feature Store,
SageMaker Pipelines, and Data Wrangler job export options, you must grant access to the prefix
data_wrangler_flows/ in this bucket.
Data Wrangler uses the athena/ and redshift/ prefixes to store preview files and imported datasets.
To learn more, see Imported Data Storage (p. 1033).
Data Wrangler uses the data_wrangler_flows/ prefix to store .flow files when you run a Jupyter
Notebook exported from Data Wrangler. To learn more, see Export (p. 1116).
Use a policy similar to the following to grant the permissions described in the preceding paragraphs.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/",
        "arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/*",
        "arn:aws:s3:::sagemaker-region-account-id/athena",
        "arn:aws:s3:::sagemaker-region-account-id/athena/*",
        "arn:aws:s3:::sagemaker-region-account-id/redshift",
        "arn:aws:s3:::sagemaker-region-account-id/redshift/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::sagemaker-region-account-id"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets",
        "s3:GetBucketLocation"
      ],
      "Resource": "*"
    }
  ]
}
You can also access data in an Amazon S3 bucket in another AWS account by specifying the Amazon
S3 bucket URI. To do this, the IAM policy that grants access to the Amazon S3 bucket in the other
account should be similar to the following example, where BucketFolder is the specific directory in the
bucket owner's bucket UserBucket. The bucket owner adds this policy to grant access to their bucket to
another user.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::UserBucket/BucketFolder/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::UserBucket",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "BucketFolder/*"
          ]
        }
      }
    }
  ]
}
The user that is accessing the bucket (not the bucket owner) must add a policy similar to the following
example to their user. Note that AccountX and TestUser in the following example refer to the bucket
owner and their user, respectively.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AccountX:user/TestUser"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::UserBucket/BucketFolder/*"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AccountX:user/TestUser"
      },
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::UserBucket"
      ]
    }
  ]
}
Use a policy similar to the following to create an IAM execution role that can be used to set up a Studio
instance.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreatePresignedDomainUrl",
        "sagemaker:DescribeDomain",
        "sagemaker:ListDomains",
        "sagemaker:DescribeUserProfile",
        "sagemaker:ListUserProfiles",
        "sagemaker:*App",
        "sagemaker:ListApps"
      ],
      "Resource": "*"
    }
  ]
}
Note that the Snowflake COPY INTO Amazon S3 command moves data from Snowflake to Amazon S3
over the public internet by default, but data in transit is secured using SSL. Data at rest in Amazon S3 is
encrypted with SSE-KMS using the default AWS KMS key.
With respect to Snowflake credentials storage, Data Wrangler does not store customer credentials. Data
Wrangler uses Secrets Manager to store the credentials in a secret and rotates secrets as part of a best
practice security plan. The Snowflake or Studio administrator needs to ensure that the data scientist’s
Studio execution role is granted permission to perform GetSecretValue on the secret storing the
credentials. If already attached to the Studio execution role, the AmazonSageMakerFullAccess policy
has the necessary permissions to read secrets created by Data Wrangler and secrets created by following
the naming and tagging convention in the instructions above. Secrets that do not follow the conventions
must be separately granted access. We recommend using Secrets Manager to prevent sharing credentials
over unsecured channels; however, note that a logged-in user can retrieve the plain-text password by
launching a terminal or Python notebook in Studio and then invoking API calls from the Secrets Manager
API.
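For example, a user with GetSecretValue permission can read the stored credentials with a single AWS
CLI call (the secret name is a placeholder):

aws secretsmanager get-secret-value \
  --secret-id secret-name \
  --query SecretString \
  --output text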
Data Wrangler supports importing Amazon S3 files that use the following:
• Server-side encryption
• SSE-KMS as the encryption type
To decrypt the file and import it into a Data Wrangler flow, you must add the SageMaker Studio user that
you're using as a key user.
The following screenshot shows a Studio user role added as a key user. Use the Key users panel in the
AWS KMS console to make this change.
Amazon S3 customer managed key setup for Data Wrangler imported data
storage
By default, Data Wrangler uses Amazon S3 buckets that have the following naming convention:
sagemaker-region-account-number. For example, if your account number is 111122223333
and you are using Studio in us-east-1, your imported datasets are stored in the bucket
sagemaker-us-east-1-111122223333.
The following instructions explain how to set up a customer managed key for your default Amazon S3
bucket.
1. To enable server-side encryption and set up a customer managed key for your default S3 bucket, see
Using KMS Encryption.
2. Navigate to AWS KMS in your AWS Management Console. Find the customer managed key you
selected in step 1 and add the Studio role as a key user. To do this, follow the instructions in Allows key
users to use a customer managed key.
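As an alternative to the console, the following AWS CLI sketch enables default SSE-KMS encryption on
the bucket; the bucket name and key ARN are placeholders:

aws s3api put-bucket-encryption \
  --bucket sagemaker-us-east-1-111122223333 \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/key-id"
        }
      }
    ]
  }'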
When you export data from Data Wrangler, you can encrypt it by:
• Specifying that the objects in your Amazon S3 bucket use SSE-KMS encryption.
• Specifying an AWS KMS key to encrypt the data that you export from Data Wrangler.
On the Export data page, specify a value for the AWS KMS key ID or ARN.
For more information on using AWS KMS keys, see Protecting Data Using Server-Side Encryption with
AWS KMS keys Stored in AWS Key Management Service (SSE-KMS).
When you run a transfer, Amazon AppFlow stores metadata from the transfer in the AWS Glue Data
Catalog. Data Wrangler uses the metadata from the catalog to determine whether the data is available
for you to query and import.
To add permissions to Amazon AppFlow, add the AmazonAppFlowFullAccess AWS managed policy
to the IAM role. For more information about adding policies, see Adding or removing IAM identity
permissions.
If you're transferring data to Amazon S3, you must also attach the following policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketTagging",
        "s3:ListBucketVersions",
        "s3:CreateBucket",
        "s3:ListBucket",
        "s3:GetBucketPolicy",
        "s3:PutEncryptionConfiguration",
        "s3:GetEncryptionConfiguration",
        "s3:PutBucketTagging",
        "s3:GetObjectTagging",
        "s3:GetBucketOwnershipControls",
        "s3:PutObjectTagging",
        "s3:DeleteObject",
        "s3:DeleteBucket",
        "s3:DeleteObjectTagging",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetBucketPolicyStatus",
        "s3:PutBucketPublicAccessBlock",
        "s3:PutAccountPublicAccessBlock",
        "s3:ListAccessPoints",
        "s3:PutBucketOwnershipControls",
        "s3:PutObjectVersionTagging",
        "s3:DeleteObjectVersionTagging",
        "s3:GetBucketVersioning",
        "s3:GetBucketAcl",
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetAccountPublicAccessBlock",
        "s3:ListAllMyBuckets",
        "s3:GetAnalyticsConfiguration",
        "s3:GetBucketLocation"
      ],
      "Resource": "*"
    }
  ]
}
To add AWS Glue permissions, add the AWSGlueConsoleFullAccess managed policy to the IAM role.
For more information about AWS Glue permissions with Amazon AppFlow, see the Amazon AppFlow
documentation.
Amazon AppFlow needs to access AWS Glue and Data Wrangler for you to import the data that you've
transferred. To grant Amazon AppFlow access, add the following trust policy to the IAM role.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root",
        "Service": [
          "appflow.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
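One way to apply this trust policy from the AWS CLI is with update-assume-role-policy. The role name
and file path below are placeholders:

aws iam update-assume-role-policy \
  --role-name MyAppFlowRole \
  --policy-document file://appflow-trust-policy.json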
To display the Amazon AppFlow data in Data Wrangler, add the following policy to the IAM role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:SearchTables",
      "Resource": [
        "arn:aws:glue:*:*:table/*/*",
        "arn:aws:glue:*:*:database/*",
        "arn:aws:glue:*:*:catalog"
      ]
    }
  ]
}
For more information about lifecycle configurations, see Use Lifecycle Configurations with Amazon
SageMaker Studio (p. 182).
The default lifecycle configuration for your instance doesn't support using Data Wrangler. You can make
the following modifications to the default configuration to use Data Wrangler with your instance.
#!/bin/bash

set -eux

# Check whether the sagemaker_dataprep module is present. Running the check
# inside an `if` condition keeps `set -e` from terminating the script when
# the import fails.
if python3 -c "import sagemaker_dataprep" 2>/dev/null; then
  echo 'Instance is of Type Data Wrangler'
else
  echo 'Instance is not of Type Data Wrangler'
fi
You attach the lifecycle configuration to your Studio domain or user profile. For more information
about creating and attaching a lifecycle configuration, see Creating and Associating a Lifecycle
Configuration (p. 183).
The following instructions show you how to attach a lifecycle configuration to a Studio domain or user
profile.
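The following AWS CLI sketch is one way to do this; the configuration name, script path, domain ID, and
user profile name are placeholders, and the base64 invocation assumes a Linux shell:

# Register the script as a KernelGateway lifecycle configuration.
aws sagemaker create-studio-lifecycle-config \
  --studio-lifecycle-config-name my-data-wrangler-config \
  --studio-lifecycle-config-content "$(base64 -w0 my-script.sh)" \
  --studio-lifecycle-config-app-type KernelGateway

# Attach the configuration to a user profile.
aws sagemaker update-user-profile \
  --domain-id d-xxxxxxxxxxxx \
  --user-profile-name my-user \
  --user-settings '{"KernelGatewayAppSettings":{"LifecycleConfigArns":["arn:aws:sagemaker:region:account-id:studio-lifecycle-config/my-data-wrangler-config"]}}'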
You might run into errors when you're creating or attaching a lifecycle configuration. For information
about debugging lifecycle configuration errors, see KernelGateway App failure (p. 190).
Release Notes
Data Wrangler is regularly updated with new features and bug fixes. To upgrade the version of Data
Wrangler you are using in Studio, follow the instructions in Shut down and Update Studio Apps (p. 200).
Release Notes
4/18/2023
New functionality:
You can now get your data in a format that Amazon Personalize can interpret. For more information,
see Map Columns for Amazon Personalize (p. 1101).
3/1/2023
New functionality:
You can now use Hive to import your data from Amazon EMR. For more information, see Import data
from Amazon EMR (p. 1004).
12/10/2022
New functionality:
You can now export your Data Wrangler flow to an inference endpoint. For more information, see
Export to an Inference Endpoint (p. 1125).
New functionality:
You can now use an interactive notebook widget for data preparation. For more information, see
Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Notebook to Get Data
Insights (p. 1138).
New functionality:
You can now import data from SaaS platforms. For more information, see Import Data From Software
as a Service (SaaS) Platforms (p. 1030).
10/12/2022
New functionality:
You can now reuse data flows for different data sets. For more information, see Reusing Data Flows for
Different Datasets (p. 1109).
10/05/2022
New functionality:
You can now use Principal Component Analysis (PCA) as a transform. For more information, see Reduce
Dimensionality within a Dataset (p. 1071).
10/05/2022
New functionality:
You can now refit parameters in your Data Wrangler flow. For more information, see Export (p. 1116).
10/03/2022
New functionality:
You can now deploy models from your Data Wrangler flow. For more information, see Automatically
Train Models on Your Data Flow (p. 1057).
9/20/2022
New functionality:
You can now set data retention periods in Athena. For more information, see Import data from
Athena (p. 997).
6/9/2022
New functionality:
You can now use Amazon SageMaker Autopilot to train a model directly from your Data Wrangler flow.
For more information, see Automatically Train Models on Your Data Flow (p. 1057).
5/6/2022
New functionality:
You can now use additional m5 and r5 instances. For more information, see Instances (p. 1034).
4/27/2022
New functionalities:
• You can now get a data quality report. For more information, see Get Insights On Data and Data
Quality (p. 1045)
• You can now perform random sampling and stratified sampling. For more information, see
Sampling (p. 1092).
4/1/2022
New functionality:
You can now use Databricks as a data source. For more information, see Import data from Databricks
(JDBC) (p. 1011).
2/2/2022
New functionalities:
• You can now export using destination nodes. For more information, see Export (p. 1116)
• You can import ORC and JSON files. For more information about file types, see Import (p. 991).
• Data Wrangler now supports using the SMOTE transform. For more information, see Balance
Data (p. 1065).
• Data Wrangler now supports similarity encoding for categorical data. For more information, see
Similarity encode (p. 1074).
• Data Wrangler now supports unnesting JSON data. For more information, see Unnest JSON
Data (p. 1097).
• Data Wrangler now supports expanding the values of an array into separate columns. For more
information, see Explode Array (p. 1098).
• Data Wrangler now supports reaching out to the service team when you're having issues. For more
information, see Troubleshoot (p. 1156).
• Data Wrangler supports editing and deleting steps in your data flow. For more information, see
Delete a Step from Your Data Flow (p. 1038) and Edit a Step in Your Data Wrangler Flow (p. 1042).
• You can now perform transformations on multiple columns. For more information, see Transform
Data (p. 1058).
• Data Wrangler now supports cost allocation tags. For more information, see Using Cost Allocation
Tags.
10/16/2021
New functionality:
Data Wrangler now supports Athena workgroups. For more information, see Import data from
Athena (p. 997).
10/6/2021
New functionality:
Data Wrangler now supports transforming time series data. For more information, see Transform Time
Series (p. 1077).
7/15/2021
New functionalities:
• Snowflake and Data Wrangler (p. 1148) is now supported. You can use Snowflake as a data source in
Data Wrangler.
• Added support for custom field delimiter in CSV. Now comma, colon, semicolon, pipe (|) and Tab are
supported.
• Now you can export results directly to Amazon S3.
• Added a few new multicollinearity analyzers: Variance Inflation Factors, Principal Component
Analysis and Lasso feature selection.
Enhancements:
• The analysis charts are no longer packed with overlapping labels.
Bug Fixes:
4/26/2021
Enhancements:
• Added support for distributed processing jobs. You can use multiple instances when running a
processing job.
• Data Wrangler processing jobs now automatically coalesce small outputs when the estimated result
size is less than 1 GB.
• Feature Store notebook: improved Feature Store ingestion performance.
• Data Wrangler processing jobs now use 1.x as the authoritative container tag for future releases.
Bug Fixes:
2/8/2021
New Functionalities:
Enhancements:
• To improve performance, importing CSV files that contain multiple lines in a single field is no longer
supported.
Bug Fixes:
Troubleshoot
If an issue arises when using Amazon SageMaker Data Wrangler, we recommend you do the following:
• If an error message is provided, read the message and resolve the issue it reports if possible.
• Make sure the IAM role of your Studio user has the required permissions to perform the action. For
more information, see Security and Permissions (p. 1141).
• If the issue occurs when you are trying to import from another AWS service, such as Amazon Redshift
or Athena, make sure that you have configured the necessary permissions and resources to perform
the data import. For more information, see Import (p. 991).
• If you're still having issues, choose Get help at the top right of your screen to reach out to the Data
Wrangler team.
As a last resort, you can try restarting the kernel on which Data Wrangler is running.
1. Save and exit the .flow file for which you want to restart the kernel.
2. Select the Running Terminals and Kernels icon, as shown in the following image.
3. Select the Stop icon to the right of the .flow file for which you want to terminate the kernel, as
shown in the following image.
• Connection failure – If the connection fails with the error message The IP address of the
EMR cluster isn't private, your Amazon EMR cluster might not have been launched in a
private subnet. As a security best practice, Data Wrangler only supports connecting to private Amazon
EMR clusters. Choose a private EC2 subnet when you launch an EMR cluster.
• Connection hanging and timing out – The issue is most likely due to a network connectivity issue.
After you start connecting to the cluster, the screen doesn't refresh. After about 2 minutes, the
following error might display at the top of the screen: JdbcAddConnectionError: An error
occurred when trying to connect to presto: xxx: Connect to xxx failed:
Connection timed out (Connection timed out).
Check the authentication method. The authentication method that you've specified in Data Wrangler
should match the authentication method that you're using on the cluster.
• You don't have HDFS permissions for LDAP authentication – Use the following guidance to resolve the
issue: Set up HDFS Permissions using Linux Credentials. You can log in to the cluster to set the
permissions; example commands are sketched after this list.
• LDAP authentication missing connection key error – You might see the following error message:
Data Wrangler couldn't connect to EMR hive successfully. JDBC connection is
missing required connection key(s): PWD.
For LDAP authentication, you must specify both a username and a password. The JDBC URL stored in
Secrets Manager is missing property PWD.
• When you're troubleshooting the LDAP configuration: We recommend making sure that the LDAP
authenticator (LDAP server) is correctly configured to connect to the Amazon EMR cluster. Use the
ldapwhoami command to help you resolve the configuration issue. The following are example
commands that you can run:
• For LDAPS – ldapwhoami -x -H ldaps://ldap-server
• For LDAP – ldapwhoami -x -H ldap://ldap-server
Either command should return Anonymous if you've configured the authenticator successfully.
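The commands for setting up HDFS permissions after logging in to the cluster are sketched below; the
key pair, host name, and LDAP user name are placeholders:

# Connect to the Amazon EMR primary node.
ssh -i ~/.ssh/my-key-pair.pem hadoop@ec2-xx-xxx-xx-xx.compute-1.amazonaws.com

# On the cluster, create an HDFS home directory that the LDAP user owns.
hdfs dfs -mkdir -p /user/ldap-user
hdfs dfs -chown ldap-user:ldap-user /user/ldap-user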
Increase Amazon EC2 Instance Limit
The message can indicate that you need to select a different instance type, but it can also indicate that
you don't have enough Amazon EC2 instances to successfully run Data Wrangler on your workflow. You
can increase the number of instances by using the following procedure.
If your request is approved, AWS sends a notification to the email address associated with your account.
You can also check the status of your request by choosing Quota request history on the Service Quotas
page. Processed requests have a Status of Closed.
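You can also work with Service Quotas from the AWS CLI. The following sketch looks up SageMaker
quotas for an instance type and requests an increase; the quota code shown is a placeholder:

# Find the quota for the instance type you need.
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --query "Quotas[?contains(QuotaName, 'ml.m5.4xlarge')]"

# Request the increase using the quota code from the previous call.
aws service-quotas request-service-quota-increase \
  --service-code sagemaker \
  --quota-code L-XXXXXXXX \
  --desired-value 2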
Update Data Wrangler
Alternatively, if you are using a Data Wrangler application version that is not the latest version, and you
have an existing Data Wrangler flow open, you are prompted to update your Data Wrangler application
version in the Studio UI. The following screenshot shows this prompt.
Important
This updates the Data Wrangler kernel gateway app only. You still need to shut down the
JupyterServer app in your user account. To do this, follow the preceding steps.
You can also choose Remind me later, in which case an Update button appears in the top-right corner of
the screen.
Shut Down Data Wrangler
To avoid losing work, save your data flow before shutting Data Wrangler down. To save your data flow in
Studio, choose File and then choose Save Data Wrangler Flow. Data Wrangler automatically saves your
data flow every 60 seconds.
1. Select the Running Terminals and Kernels icon in the Studio UI.
2. Under RUNNING APPS, find the sagemaker-data-wrangler-1.0 app. Select the shutdown icon next to
this app.
Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING
INSTANCES when you shut down the Data Wrangler app.
After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler
flow file. This can take a few minutes.
Prepare data at scale with Studio notebooks
Amazon EMR is a managed big data platform with resources to help you run petabyte-scale distributed
data processing jobs using open-source analytics frameworks on AWS such as Apache Spark, Apache
Hive, Presto, HBase, Flink, and Hudi among others. Data engineers and data scientists use Amazon
EMR for a wide variety of use cases, including big data analytics, what-if analyses, real-time analytics,
and data preparation for machine learning. With Studio integration with Amazon EMR, you can create,
browse, discover, and connect to Amazon EMR clusters without leaving your Studio notebook. You can
also monitor and debug your Spark workloads with one-click access to the Spark UI from within the
notebook. You should consider Amazon EMR for your data preparation workloads if you want maximum
control over hardware and software versions, containers, and big data processing applications.
AWS Glue Interactive Sessions is a serverless service that you can enlist to collect, transform, cleanse,
and prepare data for storage in your data lakes and data pipelines. AWS Glue Interactive Sessions
provides an on-demand, serverless Apache Spark runtime environment that you can initialize in seconds
on a dedicated Data Processing Unit (DPU) without having to worry about provisioning and managing
complex compute cluster infrastructure. After initialization, you can quickly browse the AWS Glue data
catalog, run large queries, access data governed by AWS Lake Formation, and interactively analyze and
prepare data using Spark, right in your Studio notebook. You can then use the prepared data to train,
tune, and deploy models using the purpose-built ML tools within SageMaker Studio. You should consider
AWS Glue Interactive Sessions for your data preparation workloads when you want a serverless Spark
service with moderate control of configurability and flexibility.
Content
• Prepare data using Amazon EMR (p. 1164)
• Prepare data using AWS Glue Interactive Sessions (p. 1192)
Prepare data using Amazon EMR
Administrators can use the AWS Service Catalog to define AWS CloudFormation templates of Amazon
EMR clusters accessible to Studio users. Data scientists can then choose a predefined template to self-
provision an Amazon EMR cluster directly from Amazon SageMaker Studio notebooks. Administrators
can further parameterize the templates to let users choose aspects of the cluster to match their
workloads within predefined values. For example, a data scientist or data engineer may want to specify
the number of core nodes of the cluster up to a predetermined maximum value, or select the instance
type of a node from a dropdown menu.
• If you are an administrator, make sure that you have enabled communication between Amazon
SageMaker Studio notebooks and Amazon EMR clusters. For instructions, see the Configure
networking (for administrators) (p. 1165) section. Once this communication is enabled, you have the
option to:
• Define cluster templates in AWS Service Catalog and ensure the availability of these templates
through Studio's notebooks: Configure Amazon EMR templates in AWS Service Catalog (for
administrators) (p. 1168).
• Configure the discoverability of existing Amazon EMR clusters directly from Studio's notebooks:
Configure the discoverability of Amazon EMR clusters (for administrators) (p. 1178).
• If you are a data scientist or data engineer looking to self-provision an Amazon EMR cluster, see
Launch an Amazon EMR cluster from Studio (p. 1175).
• If you are a data scientist or data engineer looking to discover and connect to existing Amazon EMR
clusters from Studio, see Use Amazon EMR clusters from Studio notebooks (p. 1177).
List of topics
• Configure networking (for administrators) (p. 1165)
• Create an Amazon EMR cluster from Studio notebooks (p. 1168)
• Use Amazon EMR clusters from Studio notebooks (p. 1177)
• Access Spark UI from Studio (p. 1189)
• Walkthroughs and whitepapers (p. 1190)
• Additional Configuration for cross accounts use cases (for administrators) (p. 1191)
The networking instructions vary based on whether SageMaker Studio and Amazon EMR are deployed
within a private Amazon Virtual Private Cloud (VPC) or communicate over the internet.
By default, SageMaker Studio runs in an AWS managed VPC with internet access. When using an internet
connection, Studio accesses AWS resources, such as Amazon S3 buckets, over the internet. However,
if you have security requirements to control access to your data and job containers, we recommend
that you configure Studio and Amazon EMR so that your data and containers aren’t accessible over the
internet. To control access to your resources or run SageMaker Studio without public internet access, you
can specify the VPC only network access type when you onboard to Amazon SageMaker Domain (p. 37).
In this scenario, SageMaker Studio establishes connections with other AWS services via private VPC
endpoints. For information about configuring SageMaker Studio in VPC only mode, see Connect
SageMaker Studio notebooks in an Amazon VPC to external resources.
The first two sections describe how to ensure communication between SageMaker Studio and an
Amazon EMR cluster in VPCs without public internet access. The last section covers how to ensure
communication between SageMaker Studio and Amazon EMR using an internet connection. Prior
to connecting SageMaker Studio and Amazon EMR without internet access, make sure to establish
endpoints for Amazon Simple Storage Service (data storage), Amazon CloudWatch (logging and
monitoring), and Amazon SageMaker Runtime (fine-grained role-based access control (RBAC)).
• If your Amazon SageMaker Studio and Amazon EMR cluster are set up in different VPCs in the
same AWS account or in different accounts, see Studio and Amazon EMR are deployed in separate
VPCs (p. 1165).
• If your Amazon SageMaker Studio and Amazon EMR cluster are set up in the same VPC, see Amazon
SageMaker Studio and Amazon EMR are in the same VPC (p. 1167).
• If you chose to connect Amazon SageMaker Studio and Amazon EMR cluster over public internet, see
Amazon SageMaker Studio and Amazon EMR communicate over public internet (p. 1168).
The steps are similar, regardless of whether Amazon SageMaker Studio and the Amazon EMR cluster
are deployed within the same AWS account (Single account use case) or different AWS accounts (Cross
accounts use case).
1. VPC peering
Create a VPC peering connection to facilitate the networking between the two VPCs (SageMaker
Studio and Amazon EMR).
a. From your SageMaker Studio account, on the Amazon VPC dashboard, choose Peering
connections, then Create peering connection.
b. Create your request to peer the Studio VPC with the Amazon EMR VPC. When requesting
peering in another AWS account, choose Another account in Select another VPC to peer with.
For cross accounts peering, the administrator must accept the request from the Amazon EMR
account.
When peering private subnets, you should enable private IP DNS resolution at the VPC peering
connection level.
2. Routing tables
Send the network traffic between SageMaker Studio subnets and Amazon EMR subnets both ways.
After you establish the peering connection, the administrator (on each account for cross accounts
access) can add routes to the private subnet route tables to route the traffic between the notebooks
and the cluster subnets. You can define those routes by going to the Route Tables section of each
VPC in the Amazon VPC dashboard.
The following illustration of the route table of a Studio VPC subnet shows an example of an
outbound route from the Studio account to the Amazon EMR VPC IP range (here 2.0.1.0/24)
through the peering connection.
The following illustration of a route table of an Amazon EMR VPC subnet shows an example of
return routes from the Amazon EMR VPC to Studio VPC IP range (here 10.0.20.0/24) through the
peering connection.
3. Security groups
Lastly, the security group of your Studio domain must allow outbound traffic, and the security group
of the Amazon EMR primary node must allow inbound traffic on the Apache Livy, Hive, or Presto TCP
ports (respectively 8998, 10000, and 8889) from the Studio instance security group. Apache Livy is a
service that enables interaction with an Amazon EMR cluster over a REST interface.
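The following AWS CLI sketch illustrates steps 2 and 3; the route table, peering connection, and security
group IDs, as well as the CIDR block, are placeholders:

# Step 2: route traffic from a Studio subnet to the Amazon EMR VPC IP range.
aws ec2 create-route \
  --route-table-id rtb-0studio0000000000 \
  --destination-cidr-block 2.0.1.0/24 \
  --vpc-peering-connection-id pcx-00000000000000000

# Step 3: allow inbound Apache Livy traffic (port 8998) on the Amazon EMR
# primary node security group from the Studio security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0emrprimary0000000 \
  --protocol tcp \
  --port 8998 \
  --source-group sg-0studio00000000000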
The following illustration is an example of a VPC setup that allows SageMaker Studio notebooks to
provision Amazon EMR clusters from AWS CloudFormation templates in the Service Catalog, then
connect to an Amazon EMR cluster deployed in the same AWS account. The diagram also illustrates the
required endpoints when Studio and Amazon EMR communicate without access to the internet, as well
as the option to use a NAT gateway, which enables internet connectivity through an internet gateway.
Amazon SageMaker Studio and Amazon EMR are in the same VPC
If Amazon SageMaker Studio and the cluster are in different subnets, add routes to each private subnet
route table to route the traffic between the notebooks and the cluster subnets. You can define those
routes by going to the Route Tables section of each VPC in the Amazon VPC dashboard. If you deployed
Amazon SageMaker Studio and an Amazon EMR cluster in the same VPC and the same subnet, you do
not need to route the traffic between the notebooks and the cluster.
Whether or not you needed to update your routing tables, the security group of your Studio domain
must allow outbound traffic, and the security group of the Amazon EMR primary node must allow
inbound traffic on the Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from
the Studio instance security group. Apache Livy is a service that enables interaction with an Amazon EMR
cluster over a REST interface.
Amazon SageMaker Studio and Amazon EMR communicate over public internet
By default, SageMaker Studio provides a network interface that allows communication with the internet
through an internet gateway in the VPC associated with the SageMaker Domain. If you choose to connect
to Amazon EMR through the public internet, your Amazon EMR cluster needs to accept inbound traffic
on the Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from its internet
gateway. Apache Livy is a service that enables interaction with an Amazon EMR cluster over a REST
interface.
Keep in mind that any port on which you allow inbound traffic represents a potential security
vulnerability. Carefully review custom security groups to ensure that you minimize vulnerabilities. For
more information, see Control network traffic with security groups.
Alternatively, see Walkthroughs and whitepapers (p. 1190) for a detailed walkthrough of how to enable
Kerberos on Amazon EMR, set the cluster in a private subnet, and access the cluster using a Network
Load Balancer (NLB) to expose only specific ports, which are access-controlled via security groups.
Note
When connecting to your Apache Livy endpoint through the public internet, we recommend that
you secure communications between Amazon SageMaker Studio and your Amazon EMR cluster
using TLS.
For information on setting up HTTPS with Apache Livy, see Enabling HTTPS with Apache
Livy. For information on setting an Amazon EMR cluster with transit encryption enabled, see
Providing certificates for encrypting data in transit with Amazon EMR encryption. Additionally,
you need to configure Studio to access your certificate key as specified in Connect to an Amazon
EMR cluster over HTTPS (p. 1183).
• If you are an administrator looking to configure AWS CloudFormation templates as AWS Service
Catalog products so users can create Amazon EMR clusters from Studio, see Configure Amazon EMR
templates in AWS Service Catalog (for administrators) (p. 1168).
• If you are a data scientist or data engineer looking to self-provision an Amazon EMR cluster to process
data at scale using open-source frameworks such as Apache Spark, Apache Hive, or Presto, see Launch
an Amazon EMR cluster from Studio (p. 1175).
• If you are looking to discover and connect to existing Amazon EMR clusters from Studio, see Use
Amazon EMR clusters from Studio notebooks (p. 1177).
Topics
• Configure Amazon EMR templates in AWS Service Catalog (for administrators) (p. 1168)
• Launch an Amazon EMR cluster from Studio (p. 1175)
Administrators can parameterize the templates so that end users can customize various aspects of the
cluster to suit their specific requirements. For example, the administrator can define a list of permissible
instance types from which users can choose when creating a cluster.
This topic assumes that you are familiar with the creation of portfolios and products in AWS Service
Catalog as well as Amazon EMR, and AWS CloudFormation (CFN).
Note
You can refer to the CFN templates in aws-samples/sagemaker-studio-emr GitHub repository as
examples of AWS CloudFormation stacks to deploy IAM roles, VPCs, a sandbox Studio Domain,
a user profile, as well as an AWS CloudFormation template to launch an Amazon EMR cluster.
Several options are available depending on your authentication method between Studio and the
Amazon EMR cluster. In these examples, a parent CFN template passes the SageMaker VPC ID,
security group, and subnet ID parameters to the CFN template of an Amazon EMR cluster.
You can access various examples of CFN Amazon EMR templates in the nested repository
sagemaker-studio-emr/cloudformation/emr_servicecatalog_templates and choose between
single account and cross accounts deployments.
For more information about the authentication methods available when connecting to an
Amazon EMR cluster, see Use Amazon EMR clusters from Studio notebooks (p. 1177).
To simplify the creation of Amazon EMR clusters, administrators can register the CloudFormation
template of an Amazon EMR cluster as a product in the portfolio of the AWS Service Catalog. Then
they associate the Service Catalog portfolio with the Studio execution role to ensure the availability of
the template in Studio. Furthermore, to make sure that data scientists can discover those templates,
provision Amazon EMR clusters, and connect to Amazon EMR clusters from their Studio notebooks,
administrators need to provide the Studio execution role with additional permissions.
The following list provides the additional settings that administrators need to apply to a baseline CFN
stack to enable Studio to access the Service Catalog products and provision Amazon EMR clusters. Those
settings must be applied at multiple levels.
Finally, administrators need to assign a set of necessary permissions to the Studio execution role and the
account where Amazon EMR is deployed, depending on whether Studio and Amazon EMR are deployed
within the same or different AWS accounts.
As a prerequisite, ensure that you have reviewed the networking and security requirements in
Configure networking (for administrators) (p. 1165) and that you have created a baseline CFN stack
supporting the authentication method of your choice. You can find examples of CFN templates in aws-
samples/sagemaker-studio-emr.
• In your Service Catalog portfolio:
Add the following section to your portfolio CFN template (see the example in YAML format) to
associate your portfolio with the Studio execution role used by the user profiles.
SageMakerStudioEMRProductPortfolioPrincipalAssociation:
  Type: AWS::ServiceCatalog::PortfolioPrincipalAssociation
  Properties:
    PrincipalARN: !GetAtt SageMakerExecutionRole.Arn
    PortfolioId: !Ref SageMakerStudioEMRProductPortfolio
    PrincipalType: IAM
Add the tag key "sagemaker:studio-visibility:emr" set to the value "true" (here in YAML)
to the Service Catalog product referencing the Amazon EMR template resource. This ensures the
visibility of the template in Studio.
SMStudioEMRNoAuthProduct:
  Type: AWS::ServiceCatalog::CloudFormationProduct
  Properties:
    Owner: AWS
    Name: SageMaker Studio Domain No Auth EMR
    ProvisioningArtifactParameters:
      - Name: SageMaker Studio Domain No Auth EMR
        Description: Provisions a SageMaker domain and No Auth EMR Cluster
        Info:
          LoadTemplateFromURL: Link to your CFN template. For example, https://aws-ml-blog.s3.amazonaws.com/artifacts/astra-m4-sagemaker/end-to-end/CFN-EMR-NoStudioNoAuthTemplate-v3.yaml
    Tags:
      - Key: "sagemaker:studio-visibility:emr"
        Value: "true"
• In the CFN template of the Amazon EMR cluster within your Service Catalog product:
Add the following mandatory stack parameters as placeholders. This section is populated with the
Studio project name and identifier when a user provisions a cluster from Studio.
SageMakerProjectName:
  Type: String
  Description: Name of the project

SageMakerProjectId:
  Type: String
  Description: Service generated Id of the project.
Administrators have the option to incorporate choices in the parameters section of a template
so users can input or select custom values when creating a cluster by specifying Default and
AllowedValues. The following example illustrates additional input parameters that administrators
can set when creating an Amazon EMR template.
"Parameters": {
"EmrClusterName": {
"Type": "String",
"Description": "EMR cluster Name."
},
"MasterInstanceType": {
"Type": "String",
"Description": "Instance type of the EMR master node.",
"Default": "m5.xlarge",
"AllowedValues": [
"m5.xlarge",
"m5.2xlarge",
"m5.4xlarge"
]
},
"CoreInstanceType": {
"Type": "String",
"Description": "Instance type of the EMR core nodes.",
"Default": "m5.xlarge",
"AllowedValues": [
"m5.xlarge",
"m5.2xlarge",
"m5.4xlarge",
1170
Amazon SageMaker Developer Guide
Prepare data using Amazon EMR
"m3.medium",
"m3.large",
"m3.xlarge",
"m3.2xlarge"
]
},
"CoreInstanceCount": {
"Type": "String",
"Description": "Number of core instances in the EMR cluster.",
"Default": "2",
"AllowedValues": [
"2",
"5",
"10"
]
},
"EmrReleaseVersion": {
"Type": "String",
"Description": "The release version of EMR to launch.",
"Default": "emr-5.33.1",
"AllowedValues": [
"emr-5.33.1",
"emr-6.4.0"
]
}
}
• Last, attach the required IAM policies to enable the visibility of CFN Amazon EMR templates and the
self-provisioning of Amazon EMR clusters from the Studio notebooks. The role to which you must add
those policies depends on whether Studio and Amazon EMR are deployed within the same account
(single account) or in different accounts (cross accounts).
• If your Amazon EMR cluster is deployed in the same AWS account as the Studio account, see the
Single Account tab.
• If your Amazon EMR cluster is deployed in a different AWS account than the Studio account, see the
Cross Accounts tab.
Single account
Attach the following permissions to the Studio execution role accessing your cluster.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPresignedUrl",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:GetOnClusterAppUIPresignedURL"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:studio-region:studio-account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDetailsDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:DescribeSecurityConfiguration"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:studio-region:studio-account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowEMRTemplateDiscovery",
      "Effect": "Allow",
      "Action": [
        "servicecatalog:SearchProducts"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowSagemakerProjectManagement",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateProject",
        "sagemaker:DeleteProject"
      ],
      "Resource": "arn:aws:sagemaker:studio-region:studio-account-id:project/*"
    }
  ]
}
Cross accounts
If your Amazon EMR clusters and Studio are deployed in separate AWS accounts, you configure the
permissions in multiple steps.
• On the trusting account (the account in which Amazon EMR is deployed), create a custom
IAM role (referred to as ASSUME-ROLE in this page) with the following trust relationship and
permissions.
For information about creating a role on an AWS account, see Creating an IAM role (console).
• To grant the trusted account (the account in which Studio is deployed) the permission to assume a
role in the trusting account, add the following trust relationship.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::studio-account:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPresignedUrl",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:GetOnClusterAppUIPresignedURL"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDetailsDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:DescribeSecurityConfiguration"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters"
      ],
      "Resource": "*"
    }
  ]
}
• On the trusted account (the account in which Studio is deployed), add the following permissions
to the Studio execution role.
• To grant SageMaker Studio's execution role the permission to assume the ASSUME-ROLE in
the trusting account, add the following policy.
1173
Amazon SageMaker Developer Guide
Prepare data using Amazon EMR
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRoleAssumptionForCrossAccountDiscovery",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": ["arn:aws:iam::emr-account:role/ASSUME-ROLE"]
    }
  ]
}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSagemakerProjectManagement",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateProject",
        "sagemaker:DeleteProject"
      ],
      "Resource": "arn:aws:sagemaker:::project/*"
    },
    {
      "Sid": "AllowEMRTemplateDiscovery",
      "Effect": "Allow",
      "Action": [
        "servicecatalog:SearchProducts"
      ],
      "Resource": "*"
    }
  ]
}
• Last, see Additional Configuration for cross accounts use cases (for administrators) (p. 1191) to
provide the ARN of the previously created IAM role ASSUME-ROLE to the Studio execution role.
The ARN of this assumable cross-accounts role is loaded by the Studio Jupyter server at launch.
The Studio execution role assumes that remote role to discover and connect to Amazon EMR
clusters in the trusting account.
Once the CFN templates are available in Amazon SageMaker Studio, data scientists can self-provision
Amazon EMR clusters from those templates. Each item in the list of "Parameters" provided
in the template becomes an input box of the cluster creation form in Studio where the values in
"AllowedValues" appear in a dropdown menu.
The following illustration shows the dynamic form assembled from a CFN Amazon EMR template to
create an Amazon EMR cluster in SageMaker Studio.
Visit Launch an Amazon EMR cluster from Studio (p. 1175) to learn about how to launch a cluster from
Studio using those Amazon EMR templates.
1. Select the Home icon in the Studio UI's left-side panel, then select the Data node in the
navigation menu. Navigate down to the Clusters node. This opens a page listing the Amazon EMR
clusters that you can access from SageMaker Studio.
2. Choose Create cluster. This opens up a page, in the main working area, listing the cluster templates
available to you.
3. Select a cluster configuration template by choosing a template name. The selection of a template
activates the Select template button. Choose Select template. This opens up a cluster creation
form.
4. Enter the cluster's details, such as a cluster name and any specific configurable parameter set by
your administrator, then choose Create cluster. The creation of the cluster might take a couple of
minutes.
Once the cluster is provisioned, the Studio UI displays a The cluster has been successfully created
message.
To connect to your cluster, see Use Amazon EMR clusters from Studio notebooks (p. 1177).
• If you are an administrator, see Configure the discoverability of Amazon EMR clusters (for
administrators) (p. 1178) to configure the discoverability of Amazon EMR clusters from SageMaker
Studio notebooks.
• If you are a data scientist or data engineer looking to discover Amazon EMR clusters from your Studio
notebooks, see Discover Amazon EMR clusters from SageMaker Studio (p. 1180).
• If you are a data scientist or data engineer looking to connect to existing Amazon EMR clusters from
your Studio notebooks, see Connect to an Amazon EMR cluster from SageMaker Studio (p. 1181).
When connecting to your Amazon EMR cluster from SageMaker Studio, you can authenticate to
your cluster with Kerberos, Lightweight Directory Access Protocol (LDAP), or use runtime IAM role
authentication. Your authentication method depends on your cluster configuration. You can refer
to this example Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon
EMR cluster to set up an Amazon EMR cluster that uses Kerberos. Alternatively, you can look at the
CloudFormation example templates using Kerberos or LDAP in the aws-samples/sagemaker-studio-emr
GitHub repository.
Find the list of available connection commands to an Amazon EMR cluster per authentication method
in Enter the connection command to an Amazon EMR cluster manually (p. 1182) to connect to your
Amazon EMR cluster.
Those images and kernels come with sagemaker-studio-analytics-extension, a notebook extension that
enables connection to a remote Spark (Amazon EMR) cluster via the SparkMagic library using Apache
Livy.
To connect to Amazon EMR clusters using another built-in image or your own image, follow the
instructions in Bring your own image (p. 1177).
Additionally, to connect to Amazon EMR with Kerberos authentication, you must install the kinit client.
Depending on your OS, the command to install the kinit client can vary. To bring an Ubuntu (Debian
based) image, use the apt-get install -y -qq krb5-user command.
For more information on bringing your own image in SageMaker Studio, see Bring your own SageMaker
image.
Single Account
Attach the following permissions to SageMaker Studio's execution role accessing your cluster.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPresignedUrl",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:GetOnClusterAppUIPresignedURL"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:region:account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDetailsDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:DescribeSecurityConfiguration"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:region:account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowSagemakerProjectManagement",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateProject",
        "sagemaker:DeleteProject"
      ],
      "Resource": "arn:aws:sagemaker:region:account-id:project/*"
    }
  ]
}
Cross Accounts
If your Amazon EMR clusters and SageMaker Studio are deployed in separate AWS accounts, you
configure the permissions in multiple steps.
• On the trusting account (the account in which Amazon EMR is deployed), create a custom IAM role
(referred to as ASSUME-ROLE in this page) with the following trust relationship and permissions.
For information about creating a role on an AWS account, see Creating an IAM role (console).
• To grant the trusted account (the account in which SageMaker Studio is deployed) the
permission to assume a role in the trusting account, add the following trust relationship.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::studio-account:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPresignedUrl",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:GetOnClusterAppUIPresignedURL"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDetailsDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:DescribeSecurityConfiguration"
      ],
      "Resource": [
        "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
      ]
    },
    {
      "Sid": "AllowClusterDiscovery",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters"
      ],
      "Resource": "*"
    }
  ]
}
• On the trusted account (the account in which SageMaker Studio is deployed), add the following
policy to SageMaker Studio's execution role.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRoleAssumptionForCrossAccountDiscovery",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": ["arn:aws:iam::emr-account:role/ASSUME-ROLE"]
    }
  ]
}
• Last, see Additional Configuration for cross accounts use cases (for administrators) (p. 1191)
to provide the ARN of the previously created IAM role ASSUME-ROLE to SageMaker Studio's
execution role. The ARN of this assumable cross-account role is loaded by the Studio Jupyter
server at launch. SageMaker Studio's execution role assumes that remote role to discover and
connect to Amazon EMR clusters in the trusting account.
Visit Discover Amazon EMR clusters from SageMaker Studio (p. 1180) to learn how to discover and
connect to Amazon EMR clusters from Studio notebooks.
If your administrator configured the cross-account discovery of Amazon EMR clusters, you can see a
consolidated list of clusters in the AWS account used by SageMaker Studio as well as in the remote
accounts.
If you are an administrator looking to set up the discoverability of Amazon EMR clusters from SageMaker
Studio, see Configure the discoverability of Amazon EMR clusters (for administrators) (p. 1178).
To view the list of available Amazon EMR clusters from SageMaker Studio:
1. Select the Home icon in the Studio UI's left-side panel, then select the Data node in the
navigation menu.
2. Navigate down to the Clusters node. This opens up a page listing the Amazon EMR clusters that you
can access from SageMaker Studio.
The list displays the status of each cluster. A cluster status can be Starting, Bootstrapping,
Running/Waiting, Terminating, Terminated, and Terminated with error. You can filter clusters by
status by selecting the filter icon. The following image shows an example of a list of clusters.
3. To connect to a particular Running/Waiting cluster, see Connect to an Amazon EMR cluster from
SageMaker Studio (p. 1181).
1. Choose the name of the cluster in your list. This activates the Attach to new notebook button.
2. Choose Attach to new notebook. This opens up the images and kernels selection box.
3. Select your image and kernel, then choose Select. For a list of supported images, see Supported
images and kernels to connect to an Amazon EMR cluster from SageMaker Studio (p. 1177) or refer
to Bring your own image (p. 1177).
4. If the cluster you select does not use Kerberos, LDAP, or runtime role authentication, Studio prompts
you to select the credential type. Choose from HTTP basic authentication or No credentials,
then enter your credentials, if applicable. A connection command populates the first cell of your
notebook and initiates the connection with the Amazon EMR cluster.
Once the connection succeeds, a message confirms the connection and the start of the Spark
application.
The Cluster option at the top of your notebook is only visible when you use a kernel from Supported
images and kernels to connect to an Amazon EMR cluster from SageMaker Studio (p. 1177) or from
Bring your own image (p. 1177). If you cannot see Cluster at the top of your notebook, ensure that
your administrator has configured the discoverability of your clusters and switch to a supported kernel.
Otherwise, if the cluster you choose does not use Kerberos, LDAP, or runtime role authentication,
Studio prompts you to select the credential type. You can choose HTTP basic authentication or No
credential.
4. An active cell populates and runs. This cell contains the connection command to connect to your
Amazon EMR cluster.
Once the connection succeeds, a message confirms the connection and the start of the Spark
application.
You can manually connect to your Amazon EMR cluster from a Studio notebook whether or not your
Studio application and cluster reside in the same AWS account.
For each of the following authentication types, use the specified command to manually connect to your
cluster from your Studio notebook.
• Kerberos
Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Kerberos --language python
[--assumable-role-arn EMR_access_role_ARN ]
[--verify-certificate /home/user/certificateKey.pem]
• LDAP
Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Basic_Access --language python
[--assumable-role-arn EMR_access_role_ARN]
[--verify-certificate /home/user/certificateKey.pem]
• NoAuth
Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type None --language python
[--assumable-role-arn EMR_access_role_ARN ]
[--verify-certificate /home/user/certificateKey.pem]
• Runtime IAM role
Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.
For more information on connecting to an Amazon EMR cluster using runtime IAM roles, see Connect
to an Amazon EMR cluster from Studio using runtime IAM roles (p. 1184).
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Basic_Access \
--emr-execution-role-arn arn:aws:iam::studio_account_id:role/emr-execution-role-name
[--assumable-role-arn EMR_access_role_ARN]
[--verify-certificate /home/user/certificateKey.pem]
If you have configured your Amazon EMR cluster with transit encryption enabled and Apache Livy server
for HTTPS and would like Studio to communicate with Amazon EMR using HTTPS, you need to configure
Studio to access your certificate key.
For self-signed or local Certificate Authority (CA) signed certificates, you can do this in two steps:
1. Download the PEM file of your certificate to your local file system using one of the following options:
• Jupyter's built-in file upload function.
• A notebook cell.
• A lifecycle configuration (LCC) script.
For information on how to use an LCC script, see Customize a Notebook Instance Using a Lifecycle
Configuration Script.
2. Enable the validation of the certificate by providing the path to your certificate in the --verify-
certificate argument of your connection command.
For public CA issued certificates, enable certificate validation by setting the --verify-certificate
parameter to true.
Alternatively, you can disable certificate validation by setting the --verify-certificate
parameter to false.
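For example, a connection command for a cluster with a public CA issued certificate might look like the following sketch; the cluster ID and auth type here are placeholders.

%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type None --language python --verify-certificate true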
You can find the list of available connection commands to an Amazon EMR cluster in Enter the
connection command to an Amazon EMR cluster manually (p. 1182).
Connect to an Amazon EMR cluster from Studio using runtime IAM roles
When you connect to an Amazon EMR cluster from your Amazon SageMaker Studio notebook, you can
visually browse a list of IAM roles, known as runtime roles, and select one on the fly. Subsequently, all
your Apache Spark, Apache Hive, or Presto jobs created from your Studio notebook access only the data
and resources permitted by policies attached to the runtime role. Also, when data is accessed from data
lakes managed with AWS Lake Formation, you can enforce table-level and column-level access using
policies attached to the runtime role.
With this capability, you and your teammates can connect to the same cluster, each using a runtime
role scoped with permissions matching your individual level of access to data. Your sessions are also
isolated from one another on the shared cluster. With this ability to control fine-grained access to data
on the same shared cluster, you can simplify provisioning of Amazon EMR clusters, reducing operational
overhead and saving costs.
To try out this new feature, see Apply fine-grained data access controls with AWS Lake Formation and
Amazon EMR from Amazon SageMaker Studio . This blog post helps you set up a demo environment
where you can try using preconfigured runtime roles to connect to Amazon EMR clusters.
Prerequisites
Before you get started, make sure you meet the prerequisites described in this section.
Runtime role authentication supports a variety of cross-account connection scenarios when your data
resides outside of your Studio account. The following image shows three different ways you can assign
your Amazon EMR cluster, data, and even Amazon EMR execution role between your Studio and data
accounts:
In option 1, your Amazon EMR cluster and Amazon EMR execution role are in a separate data account
from your Studio account. You define a separate Amazon EMR access role permission policy which grants
permission to your Studio execution role to assume the Amazon EMR access role. The Amazon EMR
access role then calls the Amazon EMR API GetClusterSessionCredentials on behalf of your Studio
execution role, giving you access to the cluster.
In option 2, your Amazon EMR cluster and Amazon EMR execution role are in your Studio account. Your
Studio execution role has permission to use the Amazon EMR API GetClusterSessionCredentials
to gain access to your cluster. To access the Amazon S3 bucket, give the Amazon EMR execution role
cross-account Amazon S3 bucket access permissions — you grant these permissions within your Amazon
S3 bucket policy.
In option 3, your Amazon EMR clusters are in your Studio account, and the Amazon EMR execution
role is in the data account. Your Studio execution role has permission to use the Amazon EMR API
GetClusterSessionCredentials to gain access to your cluster. Add the Amazon EMR execution role
into the execution role configuration JSON. Then you can select the role in the UI when you choose your
cluster. For details about how to set up your execution role configuration JSON file, see Preload your
execution roles into Studio (p. 1188).
To establish runtime role authentication for your Amazon EMR clusters, configure the required IAM
policies, network, and usability enhancements. Your setup depends on whether your Amazon EMR
clusters, your Amazon EMR execution role, or both reside outside of your Amazon SageMaker Studio
account. The following discussion guides you through the policies to install, how to configure the
network to allow traffic between accounts, and the local configuration file to set up to automate your
Amazon EMR connection.
Configure runtime role authentication when your Amazon EMR cluster and Studio are in the
same account
If your Amazon EMR cluster resides in your Studio account, add the basic policy to
connect to your Amazon EMR cluster and set permissions to call the Amazon EMR API
GetClusterSessionCredentials, which gives you access to the cluster. Complete the following steps
to add necessary permissions to your Studio execution policy:
1. Add the required IAM policy to connect to Amazon EMR clusters. For details, see Discover Amazon EMR
clusters from SageMaker Studio (p. 1180).
2. Grant permission to call the Amazon EMR API GetClusterSessionCredentials when you pass
one or more permitted Amazon EMR execution roles specified in the policy.
3. (Optional) Grant permission to pass IAM roles that follow any user-defined naming conventions.
4. (Optional) Grant permission to access Amazon EMR clusters that are tagged with specific user-defined
strings.
5. If you don't want to manually call the Amazon EMR connection command, install a SageMaker
configuration file in your local Amazon EFS and select the role to use when you select your Amazon
EMR cluster. For details about how to preload your IAM roles, see Preload your execution roles into
Studio (p. 1188).
The following example policy permits Amazon EMR execution roles belonging to the modeling and
training groups to call GetClusterSessionCredentials. In addition, the policyholder can access
Amazon EMR clusters tagged with the strings modeling or training.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": "elasticmapreduce:GetClusterSessionCredentials",
"Resource": "*",
"Condition": {
"StringLike": {
"elasticmapreduce:ExecutionRoleArn": [
"arn:aws:iam::123456780910:role/emr-execution-role-ml-modeling*",
"arn:aws:iam::123456780910:role/emr-execution-role-ml-training*"
],
"elasticmapreduce:ResourceTag/group": [
"*modeling*",
"*training*"
]
}
}
}
]
}
Configure runtime role authentication when your cluster and Studio are in different accounts
If your Amazon EMR cluster is not in your Studio account, allow your Studio execution role to assume the
cross-account Amazon EMR access role so you can connect to the cluster. Complete the following steps
to set up your cross-account configuration:
1. Create your Studio execution role permission policy so that the execution role can assume the Amazon
EMR access role. The following policy is an example:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowAssumeCrossAccountEMRAccessRole",
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Resource": "arn:aws:iam::emr_account_id:role/emr-access-role-name"
}
]
}
2. Create the trust policy to specify which Studio account IDs are trusted to assume the Amazon EMR
access role. The following policy is an example:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowCrossAccountSageMakerExecutionRoleToAssumeThisRole",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::studio_account_id:role/studio_execution_role"
},
"Action": "sts:AssumeRole"
}
]
}
3. Create the Amazon EMR access role permission policy, which grants the Amazon EMR execution role
the needed permissions to carry out the intended tasks on the cluster. Configure the Amazon EMR
access role to call the API GetClusterSessionCredentials with the Amazon EMR execution roles
specified in the access role permission policy. The following policy is an example:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowCallingEmrGetClusterSessionCredentialsAPI",
"Effect": "Allow",
"Action": "elasticmapreduce:GetClusterSessionCredentials",
"Resource": "",
"Condition": {
"StringLike": {
"elasticmapreduce:ExecutionRoleArn": [
"arn:aws:iam::emr_account_id:role/emr-execution-role-name"
]
}
}
}
]
}
4. Set up the cross-account network so that traffic can move back and forth between your accounts. For
guided instruction, see Set up the network in the blog post Create and manage Amazon EMR Clusters
from SageMaker Studio to run interactive Spark and ML workloads – Part 2. The steps in the blog post
help you complete the following tasks:
a. VPC-peer your Studio account and your Amazon EMR account to establish a connection.
b. Manually add routes to the private subnet route tables in both accounts. This permits creation
and connection of Amazon EMR clusters from the Studio account to the remote account’s private
subnet.
c. Set up the security group attached to your Studio domain to allow outbound traffic and the
security group of the Amazon EMR primary node to allow inbound TCP traffic from the Studio
instance security group.
5. If you don't want to manually call the Amazon EMR connection command, install a SageMaker
configuration file in your local Amazon EFS so you can select the role to use when you choose your
Amazon EMR cluster. For details about how to preload your IAM roles, see Preload your execution roles
into Studio (p. 1188).
To write a configuration file for the Amazon EMR execution roles, associate a Use Lifecycle
Configurations with Amazon SageMaker Studio (p. 182) (LCC) script with the Jupyter server application.
Alternatively, you can write or update the configuration file and restart the Jupyter server with the
command restart-jupyter-server.
The following snippet is an example LCC bash script you can apply if your Studio application and cluster
are in the same account:
#!/bin/bash
set -eux
FILE_DIRECTORY="/home/sagemaker-user/.sagemaker-analytics-configuration-DO_NOT_DELETE"
FILE_NAME="emr-configurations-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"
mkdir -p $FILE_DIRECTORY
If your Studio application and clusters are in different accounts, specify the Amazon EMR access roles
that can use the cluster. In the following example script, 123456789012 is the account ID of the Amazon
EMR cluster account, and 212121212121 and 434343434343 are the account IDs of the permitted Amazon
EMR access roles.
#!/bin/bash
set -eux

FILE_DIRECTORY="/home/sagemaker-user/.sagemaker-analytics-configuration-DO_NOT_DELETE"
FILE_NAME="emr-configurations-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

# Map the Amazon EMR cluster account ID to the permitted access role ARNs.
# The JSON key name and the first role ARN below are assumptions.
cat > "$FILE" <<- "EOF"
{
  "emr-execution-role-arns": {
    "123456789012": [
      "arn:aws:iam::212121212121:role/emr-execution-role-1",
      "arn:aws:iam::434343434343:role/emr-execution-role-2"
    ]
  }
}
EOF
To terminate a cluster in a Running state, navigate to the list of available Amazon EMR
clusters.
1. In SageMaker Studio, select the Home icon in the Studio UI's left-side panel, then select the Data
node in the navigation menu.
2. Navigate down to the Clusters node. This opens up a page listing the Amazon EMR clusters that you
can access from SageMaker Studio.
3. Select the name of the cluster that you want to terminate, then choose Terminate.
4. This opens up a confirmation window informing you that any pending work or data on your cluster
will be lost permanently after termination. Confirm by choosing Terminate again.
• Option 1: Set up an SSH tunnel to the master node using local port forwarding
• Option 2, part 1: Set up an SSH tunnel to the master node using dynamic port forwarding
• Option 2, part 2: Configure proxy settings to view websites hosted on the master node
For information about viewing web interfaces hosted on Amazon EMR clusters, see View web interfaces
hosted on Amazon EMR Clusters. You can also visit your Amazon EMR console to get access to the Spark
UI.
Note
You can set up an SSH tunnel even if presigned URLs are not available to you.
Presigned URLs
To create one-click URLs that can access Spark UI on Amazon EMR from SageMaker Studio notebooks,
you must enable the following IAM permissions. Choose the option that applies to you:
• For Amazon EMR clusters that are in the same account as the SageMaker Studio notebook: Add the
following permissions to the SageMaker Studio IAM execution role.
• For Amazon EMR clusters that are in a different account than the SageMaker Studio notebook: Add the
following permissions to the cross-account role that you created for Discover Amazon EMR clusters
from SageMaker Studio (p. 1180).
Note
You can access presigned URLs from the console in the following regions:
The following policy gives access to presigned URLs for your execution role.
{
"Sid": "AllowPresignedUrl",
"Effect": "Allow",
"Action": [
"elasticmapreduce:DescribeCluster",
"elasticmapreduce:ListInstanceGroups",
"elasticmapreduce:CreatePersistentAppUI",
"elasticmapreduce:DescribePersistentAppUI",
"elasticmapreduce:GetPersistentAppUIPresignedURL",
"elasticmapreduce:GetOnClusterAppUIPresignedURL"
],
"Resource": [
"arn:aws:elasticmapreduce:region:account-id:cluster/*"
]
}
• Create and manage Amazon EMR clusters from SageMaker Studio to run interactive Spark and ML
workloads.
• To extend the use case to a cross-account configuration where SageMaker Studio and your Amazon
EMR cluster are deployed in separate AWS accounts, see Create and manage Amazon EMR clusters
from SageMaker Studio to run interactive Spark and ML workloads - Part 2.
See also:
• A walkthrough of the configuration of Access Apache Livy using a Network Load Balancer on a
Kerberos-enabled Amazon EMR cluster.
• AWS whitepapers for SageMaker Studio best practices.
The following is an example LCC script. To modify the script, replace ASSUMABLE-ROLE and emr-
account with your role name and remote account ID, respectively. The number of cross accounts is
limited to five.

#!/bin/bash
set -eux

# This script creates the file that informs SageMaker Studio that the role
# "arn:aws:iam::emr-account:role/ASSUMABLE-ROLE" in remote account "emr-account"
# must be assumed to list and describe Amazon EMR clusters in the remote account.

FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

# Write the assumable role ARN to the discovery file; the JSON key name below
# is an assumption.
cat > "$FILE" <<- "EOF"
{
  "emr-discovery-iam-role-arns": ["arn:aws:iam::emr-account:role/ASSUMABLE-ROLE"]
}
EOF
After the LCC runs and the files are written, the server reads the file /home/sagemaker-
user/.cross-account-configuration-DO_NOT_DELETE/emr-discovery-iam-role-arns-
DO_NOT_DELETE.json and stores that cross-account ARN.
Starting a Glue interactive session from a SageMaker Studio notebook is simple. When you create
your Studio notebook, choose the built-in Glue PySpark or Glue Spark kernel and start coding in
your interactive, serverless Spark session in just seconds. You don't have to worry about provisioning or
managing complex compute cluster infrastructure. After initialization, you can quickly browse the Glue
data catalog, run large queries, and interactively analyze and prepare data using Spark, all within your
Studio notebook. You can then use the prepared data to build, train, tune, and deploy models using the
purpose-built ML tools within SageMaker Studio.
Before you start your AWS Glue interactive session in SageMaker Studio, you need to set the appropriate
roles and policies. You may also need access to additional resources, such as Amazon S3, which may
require additional policies. For more information about required and additional IAM policies, see
Permissions for AWS Glue Interactive Sessions in SageMaker Studio (p. 1193).
SageMaker Studio provides a default configuration for your AWS Glue interactive session, but you
can use Glue’s full catalog of Jupyter magic commands to further customize your environment. For
information about the default and additional Jupyter magics that you can use in your Glue interactive
session, see Configure your Glue interactive session in SageMaker Studio (p. 1194).
The supported images and kernels for connecting to a Glue interactive session are the SparkAnalytics
1.0 and SparkAnalytics 2.0 images with the Glue Spark and Glue Python [PySpark and Ray] kernels.
Prerequisites:
The SparkAnalytics image that you select to launch your Glue session in Studio is a combination of two
frameworks - the SparkMagic framework (used with Amazon EMR), and AWS Glue. For this reason, the
prerequisites for both frameworks apply. However, you do not have to set up the EMR cluster if you
only plan to use Glue Interactive Sessions. Before you start your first Glue interactive session in Studio,
complete the following:
• Complete the prerequisites required to use the SparkMagic image. For a list of the prerequisites, see
the Prerequisites section in Prepare Data at Scale with Studio Notebooks.
• Create an execution role with permissions for both AWS Glue and SageMaker Studio. Add the managed
policy AwsGlueSessionUserRestrictedServiceRole, and create a custom policy that includes the
permissions sts:GetCallerIdentity, iam:GetRole, and iam:PassRole. For instructions
about how to create the necessary permissions, see Permissions for AWS Glue Interactive Sessions in
SageMaker Studio (p. 1193).
• Create a SageMaker domain with the execution role you created. For instructions about how to create
a domain, see Onboard to Amazon SageMaker Domain Using IAM (p. 43).
The following example shows a custom policy that grants these permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "unique_statement_id",
"Effect": "Allow",
"Action": [
"iam:GetRole",
"iam:PassRole",
"sts:GetCallerIdentity"
],
"Resource": "*"
}
]
}
The following example shows a trust policy that allows both AWS Glue and SageMaker to assume the
execution role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"glue.amazonaws.com",
"sagemaker.amazonaws.com"
]
},
"Action": "sts:AssumeRole"
}
]
}
You can add additional roles and policies if you need to access other AWS resources. For a description
of the additional roles and policies you can include, see Interactive sessions with IAM in the AWS Glue
documentation.
1. Create a SageMaker domain. For instructions on how to create a new domain, see Onboard to
Amazon SageMaker Domain (p. 37).
2. Sign in to the SageMaker console at https://console.aws.amazon.com/sagemaker/.
3. Select Control Panel in the left-side panel.
4. In the Launch App dropdown menu next to the user name, select Studio.
5. In the Jupyter view, choose File, then New, then Notebook.
6. In the Image dropdown menu, select SparkAnalytics 1.0 or SparkAnalytics 2.0. In the kernel
dropdown menu, select Glue Spark or Glue Python [PySpark and Ray]. Choose Select.
7. (optional) Use Jupyter magics to customize your environment. For more information about Jupyter
magics, see Configure your Glue interactive session in SageMaker Studio (p. 1194).
8. Start writing your Spark data processing scripts.
For example, the following magic sets the AWS Glue version for your session:
%glue_version 3.0
You can use magics to further customize your environment. For example, if you want to change
the number of workers allocated to your job from the default five to 10, you can specify
%number_of_workers 10. If you want to configure your session to stop after 10 minutes of idle time
instead of the default 2880 minutes, you can specify %idle_timeout 10.
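For example, a first notebook cell like the following applies several magics together before the session starts; the values shown are illustrative.

%glue_version 3.0
%number_of_workers 10
%idle_timeout 10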
All of the Jupyter magics currently available in AWS Glue are also available in SageMaker Studio. For the
complete list of AWS Glue magics available, see Configuring AWS Glue interactive sessions for Jupyter
and AWS Glue Studio notebooks.
AWS charges for Glue Interactive Sessions based on how long the session is active and the number of
Data Processing Units (DPU) used. You are charged an hourly rate for the number of DPUs used to run
your workloads, billed in increments of one second. Glue Interactive Sessions assigns a default of five
DPUs and requires a minimum of two DPUs. There is also a one-minute minimum billing duration for
each interactive session. To see the AWS Glue rates and pricing examples, or to estimate your costs using
the AWS Pricing Calculator, see AWS Glue pricing .
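As a rough illustration, assuming a hypothetical rate of $0.44 per DPU-hour (rates vary by Region; check AWS Glue pricing), a 30-minute session on the default five DPUs would cost about:

# Hypothetical example; the DPU rate below is an assumption, not a published price
dpu_rate_per_hour = 0.44  # assumed example rate
dpus = 5                  # Glue Interactive Sessions default
hours = 0.5               # a 30-minute session
print(f"${dpu_rate_per_hour * dpus * hours:.2f}")  # $1.10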
Your SageMaker Studio notebook runs on an Amazon EC2 instance and you are charged for the instance
type you choose, based on the duration of use. Studio assigns you a default EC2 instance type of
ml.t3.medium when you select the SparkAnalytics image and associated kernel. You can change the
instance type of your Studio notebook to suit your workload. For information about SageMaker
Studio pricing, see Amazon SageMaker Pricing.
Process Data
To analyze data and evaluate machine learning models on Amazon SageMaker, use Amazon SageMaker
Processing. With Processing, you can use a simplified, managed experience on SageMaker to run your
data processing workloads, such as feature engineering, data validation, model evaluation, and model
interpretation. You can also use the Amazon SageMaker Processing APIs during the experimentation
phase and after the code is deployed in production to evaluate performance.
The preceding diagram shows how Amazon SageMaker spins up a Processing job. Amazon SageMaker
takes your script, copies your data from Amazon Simple Storage Service (Amazon S3), and then pulls a
processing container. The processing container image can either be an Amazon SageMaker built-in image
or a custom image that you provide. The underlying infrastructure for a Processing job is fully managed
by Amazon SageMaker. Cluster resources are provisioned for the duration of your job, and cleaned up
when a job completes. The output of the Processing job is stored in the Amazon S3 bucket you specified.
Note
Your input data must be stored in an Amazon S3 bucket. Alternatively, you can use Amazon
Athena or Amazon Redshift as input sources.
Tip
To learn best practices for distributed computing of machine learning (ML) training and
processing jobs in general, see Distributed computing with SageMaker best practices (p. 1944).
For a sample notebook that shows how to run scikit-learn scripts to perform data preprocessing
and model training and evaluation with the SageMaker Python SDK for Processing, see scikit-learn
Processing. This notebook also shows how to use your own custom container to run processing
workloads with your Python libraries and other specific dependencies.
For a sample notebook that shows how to use Amazon SageMaker Processing to perform distributed
data preprocessing with Spark, see Distributed Processing (Spark). This notebook also shows how to train
a regression model using XGBoost on the preprocessed dataset.
For instructions on how to create and access Jupyter notebook instances that you can use to run these
samples in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, choose the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.
A code repository that contains the source code and Dockerfiles for the Spark images is available on
GitHub.
The following code example shows how to run a processing job that invokes your PySpark script
preprocess.py.
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
base_job_name="spark-preprocessor",
framework_version="2.4",
role=role,
instance_count=2,
instance_type="ml.m5.xlarge",
max_runtime_in_seconds=1200,
)
spark_processor.run(
submit_app="preprocess.py",
arguments=['s3_input_bucket', bucket,
's3_input_key_prefix', input_prefix,
's3_output_bucket', bucket,
's3_output_key_prefix', output_prefix]
)
For an in-depth look, see the Distributed Data Processing with Apache Spark and SageMaker Processing
example notebook.
If you are not using the Amazon SageMaker Python SDK and one of its Processor classes to retrieve the
pre-built images, you can retrieve these images yourself. The SageMaker prebuilt Docker images are
stored in Amazon Elastic Container Registry (Amazon ECR). For a complete list of the available pre-built
Docker images, see the available images document.
To learn more about using the SageMaker Python SDK with Processing containers, see Amazon
SageMaker Python SDK.
This notebook runs a processing job using the SKLearnProcessor class from the SageMaker Python
SDK to run a scikit-learn script that you provide. The script preprocesses data, trains a model using a
SageMaker training job, and then runs a processing job to evaluate the trained model. The processing job
estimates how the model is expected to perform in production.
To learn more about using the SageMaker Python SDK with Processing containers, see the SageMaker
Python SDK. For a complete list of pre-built Docker images available for processing jobs, see Docker
Registry Paths and Example Code.
The following code example shows how the notebook uses SKLearnProcessor to run your own scikit-
learn script using a Docker image provided and maintained by SageMaker, instead of your own Docker
image.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                          source='s3://path/to/my/input-data.csv',
                          destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                               ProcessingOutput(source='/opt/ml/processing/output/validation'),
                               ProcessingOutput(source='/opt/ml/processing/output/test')]
                     )
To process data in parallel using Scikit-Learn on Amazon SageMaker Processing, you can shard
input objects by S3 key by setting s3_data_distribution_type='ShardedByS3Key' inside a
ProcessingInput so that each instance receives about the same number of input objects.
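For example, a sharded input might look like the following sketch; the S3 prefix is a placeholder.

from sagemaker.processing import ProcessingInput

# Each processing instance receives roughly the same number of S3 objects
# under the prefix, instead of a full replica of the dataset.
sharded_input = ProcessingInput(
    source='s3://path/to/my/input-data/',
    destination='/opt/ml/processing/input',
    s3_data_distribution_type='ShardedByS3Key'
)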
The FrameworkProcessor class lets you run processing jobs with the machine learning
framework of your choice. FrameworkProcessor provides premade containers for the following machine
learning frameworks: Hugging Face, MXNet, PyTorch, TensorFlow, and XGBoost.
The FrameworkProcessor class also provides you with customization over the container configuration.
The FrameworkProcessor class supports specifying a source directory source_dir for your
processing scripts and dependencies. With this capability, you can give the processor access to multiple
scripts in a directory instead of only specifying one script. FrameworkProcessor also supports
including a requirements.txt file in the source_dir for customizing the Python libraries to install in
the container.
For more information on the FrameworkProcessor class and its methods and parameters, see
FrameworkProcessor in the Amazon SageMaker Python SDK.
To see examples of using a FrameworkProcessor for each of the supported machine learning
frameworks, see the following topics.
Topics
• Hugging Face Framework Processor (p. 1199)
• MXNet Framework Processor (p. 1200)
• PyTorch Framework Processor (p. 1201)
• TensorFlow Framework Processor (p. 1202)
• XGBoost Framework Processor (p. 1203)
The following code example shows how you can use the HuggingFaceProcessor to run your
Processing job using a Docker image provided and maintained by SageMaker. Note that when you
run the job, you can specify a directory containing your scripts and dependencies in the source_dir
argument, and you can have a requirements.txt file located inside your source_dir directory that
specifies the dependencies for your processing script(s). SageMaker Processing installs the dependencies
in requirements.txt in the container for you.
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

# The instantiation below reconstructs the truncated start of this example;
# the instance type, framework versions, and script names are assumptions.
hfp = HuggingFaceProcessor(
    role=get_execution_role(), instance_count=1, instance_type='ml.g4dn.xlarge',
    transformers_version='4.4.2', pytorch_version='1.6.0',
    base_job_name='frameworkprocessor-hf'
)

hfp.run(
    code='processing-script.py',    # placeholder script name
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train/',
                         destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test/',
                         destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='val', source='/opt/ml/processing/output/val/',
                         destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
    ]
)
If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the HuggingFaceProcessor class, see Hugging Face
Estimator in the Amazon SageMaker Python SDK.
The following code example shows how you can use the MXNetProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.
],
outputs=[
ProcessingOutput(
output_name='processed_data',
source='/opt/ml/processing/output/',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
)
]
)
If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the MXNetProcessor class, see MXNet Estimator in the
Amazon SageMaker Python SDK.
The following code example shows how you can use the PyTorchProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.
ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
ProcessingOutput(output_name='logs', source='/opt/ml/processing/logs',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
]
)
If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the PyTorchProcessor class, see PyTorch Estimator in
the Amazon SageMaker Python SDK.
The following code example shows how you can use the TensorFlowProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.
source='/opt/ml/processing/output',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
)
]
)
If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use
an Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory
you specify for source_dir. To learn more about the TensorFlowProcessor class, see TensorFlow
Estimator in the Amazon SageMaker Python SDK.
The following code example shows how you can use the XGBoostProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.
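A minimal sketch of such a job using the XGBoostProcessor class from the SageMaker Python SDK follows; the framework version, script names, and the BUCKET and prefix variables (assumed to be defined earlier, as in the other framework examples) are illustrative assumptions.

from sagemaker.xgboost import XGBoostProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

xgb = XGBoostProcessor(
    framework_version='1.2-1',          # assumed version
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.xlarge',
    base_job_name='frameworkprocessor-xgb'
)

xgb.run(
    code='processing-script.py',        # placeholder script name
    source_dir='scripts',               # directory holding scripts and requirements.txt
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='processed_data',
            source='/opt/ml/processing/output/',
            destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
        )
    ]
)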
If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the XGBoostProcessor class, see XGBoost Estimator in
the Amazon SageMaker Python SDK.
Topics
• Run Scripts with Your Own Processing Container (p. 1204)
• Build Your Own Processing Container (Advanced Scenario) (p. 1205)
The following example shows a general workflow for using a ScriptProcessor class with your own
processing container. The workflow shows how to create your own image, build your container, and use
a ScriptProcessor class to run a Python preprocessing script with the container. The processing job
processes your input data and saves the processed data in Amazon Simple Storage Service (Amazon S3).
Before using the following examples, you need to have your own input data and a Python script
prepared to process your data. For an end-to-end, guided example of this process, refer back to the
scikit-learn Processing sample notebook.
1. Create a Docker directory and add the Dockerfile used to create the processing container. Install
pandas and scikit-learn into it. (You could also install your own dependencies with a similar RUN
command.)
mkdir docker

%%writefile docker/Dockerfile
FROM python:3.7-slim-buster
# Install pandas and scikit-learn, as described in the step above
RUN pip3 install pandas scikit-learn
ENTRYPOINT ["python3"]
2. Build the container using the docker command, create an Amazon Elastic Container Registry (Amazon
ECR) repository, and push the image to Amazon ECR.
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.Session().region_name
ecr_repository = 'sagemaker-processing-container'
tag = ':latest'
processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)

# Build the image, create the Amazon ECR repository, and push the image.
# These notebook shell commands are a sketch of the step described above.
!docker build -t $ecr_repository docker
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
3. Set up the ScriptProcessor from the SageMaker Python SDK to run the script. Replace image_uri
with the URI for the image you created, and replace role_arn with the ARN for an AWS Identity and
Access Management role that has access to your target Amazon S3 bucket.
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri='image_uri',
                                   role='role_arn',
                                   instance_count=1,
                                   instance_type='ml.m5.xlarge')
4. Run the script. Replace preprocessing.py with the name of your own Python processing script, and
replace s3://path/to/my/input-data.csv with the Amazon S3 path to your input data.
script_processor.run(code='preprocessing.py',
                     inputs=[ProcessingInput(
                         source='s3://path/to/my/input-data.csv',
                         destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                              ProcessingOutput(source='/opt/ml/processing/output/validation'),
                              ProcessingOutput(source='/opt/ml/processing/output/test')])
You can use the same procedure with any other library or system dependencies. You can also use existing
Docker images. This includes images that you run on other platforms such as Kubernetes.
The following example of a Dockerfile builds a container with the Python libraries scikit-learn and
pandas, which you can run as a processing job.
FROM python:3.7-slim-buster

# Install scikit-learn and pandas
RUN pip3 install pandas scikit-learn

# Add your processing script and run it as the container entrypoint;
# the script name here is a placeholder.
ADD processing_script.py /
ENTRYPOINT ["python3", "/processing_script.py"]
For an example of a processing script, see Get started with SageMaker Processing.
Build and push this Docker image to an Amazon Elastic Container Registry (Amazon ECR) repository and
ensure that your SageMaker IAM role can pull the image from Amazon ECR. Then you can run this image
on Amazon SageMaker Processing.
Amazon SageMaker Processing runs your processing container image in a way similar to the following
command, where AppSpecification.ImageUri is the image URI that you specify in your
CreateProcessingJob request.

docker run [AppSpecification.ImageUri]

This command runs the ENTRYPOINT command configured in your Docker image.
You can also override the entrypoint command in the image or give command-line arguments
to your entrypoint command using the AppSpecification.ContainerEntrypoint and
AppSpecification.ContainerArgument parameters in your CreateProcessingJob request.
Specifying these parameters configures Amazon SageMaker Processing to run the container similar to
the way that the following command does.

docker run [AppSpecification.ImageUri] [AppSpecification.ContainerEntrypoint]
[AppSpecification.ContainerArgument]
• Amazon SageMaker Processing decides whether the job completes or fails depending on the exit code
of the command run. A processing job completes if all of the processing containers exit successfully
with an exit code of 0, and fails if any of the containers exits with a non-zero exit code.
• Amazon SageMaker Processing lets you override the processing container's entrypoint and set
command-line arguments just like you can with the Docker API. Docker images can also configure
the entrypoint and command-line arguments using the ENTRYPOINT and CMD instructions. The
way CreateProcessingJob's ContainerEntrypoint and ContainerArgument parameters
configure a Docker image's entrypoint and arguments mirrors how Docker overrides the entrypoint
and arguments through the Docker API:
• If neither ContainerEntrypoint nor ContainerArguments are provided, Processing uses the
default ENTRYPOINT or CMD in the image.
• If ContainerEntrypoint is provided, but not ContainerArguments, Processing runs the image
with the given entrypoint, and ignores the ENTRYPOINT and CMD in the image.
• If ContainerArguments is provided, but not ContainerEntrypoint, Processing runs the image
with the default ENTRYPOINT in the image and with the provided arguments.
• If both ContainerEntrypoint and ContainerArguments are provided, Processing runs the
image with the given entrypoint and arguments, and ignores the ENTRYPOINT and CMD in the
image. (See the sketch following this list.)
• You must use the exec form of the ENTRYPOINT instruction in your Dockerfile (ENTRYPOINT
["executable", "param1", "param2"]) instead of the shell form (ENTRYPOINT command
param1 param2). This lets your processing container receive SIGINT and SIGKILL signals, which
Processing uses to stop processing jobs with the StopProcessingJob API.
• /opt/ml and all its subdirectories are reserved by SageMaker. When building your Processing Docker
image, don't place any data required by your processing container in these directories.
• If you plan to use GPU devices, make sure that your containers are nvidia-docker compatible. Include
only the CUDA toolkit in containers. Don't bundle NVIDIA drivers with the image. For more information
about nvidia-docker, see NVIDIA/nvidia-docker.
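The following is a minimal sketch of the entrypoint and argument overrides described in the list above, using the AWS SDK for Python (Boto3); the job name, image URI, role ARN, and script path are placeholders.

import boto3

sm = boto3.client('sagemaker')
sm.create_processing_job(
    ProcessingJobName='entrypoint-override-example',
    RoleArn='arn:aws:iam::111122223333:role/MyProcessingRole',
    AppSpecification={
        'ImageUri': '111122223333.dkr.ecr.us-east-1.amazonaws.com/my-image:latest',
        # Overrides the ENTRYPOINT and CMD configured in the image
        'ContainerEntrypoint': ['python3', '/processing_script.py'],
        'ContainerArguments': ['--mode', 'evaluate']
    },
    ProcessingResources={
        'ClusterConfig': {
            'InstanceCount': 1,
            'InstanceType': 'ml.m5.xlarge',
            'VolumeSizeInGB': 30
        }
    }
)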
You use the ProcessingInput parameter to specify an Amazon Simple Storage Service (Amazon
S3) URI to download data from, and a path in your processing container to download the data to.
The ProcessingOutput parameter configures a path in your processing container from which to
upload data, and where in Amazon S3 to upload that data to. For both ProcessingInput and
ProcessingOutput, the path in the processing container must begin with /opt/ml/processing/.
For example, you might create a processing job with one ProcessingInput parameter that downloads
data from s3://your-data-bucket/path/to/input/csv/data into /opt/ml/processing/
csv in your processing container, and a ProcessingOutput parameter that uploads data from /opt/
ml/processing/processed_csv to s3://your-data-bucket/path/to/output/csv/data.
Your processing job would read the input data, and write output data to /opt/ml/processing/
processed_csv. Then it uploads the data written to this path to the specified Amazon S3 output
location.
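The following sketch mirrors that example; the bucket and prefixes are placeholders.

from sagemaker.processing import ProcessingInput, ProcessingOutput

csv_input = ProcessingInput(
    source='s3://your-data-bucket/path/to/input/csv/data',
    destination='/opt/ml/processing/csv'
)
csv_output = ProcessingOutput(
    source='/opt/ml/processing/processed_csv',
    destination='s3://your-data-bucket/path/to/output/csv/data'
)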
Important
Symbolic links (symlinks) cannot be used to upload output data to Amazon S3. Symlinks are not
followed when uploading output data.
Amazon SageMaker Processing also provides CloudWatch metrics for each instance running your
processing container. For information about metrics, see Monitor Amazon SageMaker with Amazon
CloudWatch (p. 3271).
When a processing job starts, it uses the environment variables that you specified with
the Environment map in the CreateProcessingJob request. The /opt/ml/config/
processingjobconfig.json file contains information about the processing job, similar to the
following example.
{
"ProcessingJobArn": "<processing_job_arn>",
"ProcessingJobName": "<processing_job_name>",
"AppSpecification": {
"ImageUri": "<image_uri>",
"ContainerEntrypoint": null,
"ContainerArguments": null
},
"Environment": {
"KEY": "VALUE"
},
"ProcessingInputs": [
{
"InputName": "input-1",
"S3Input": {
"LocalPath": "/opt/ml/processing/input/dataset",
"S3Uri": "<s3_uri>",
"S3DataDistributionType": "FullyReplicated",
"S3DataType": "S3Prefix",
"S3InputMode": "File",
"S3CompressionType": "None",
"S3DownloadMode": "StartOfJob"
}
}
],
"ProcessingOutputConfig": {
"Outputs": [
{
"OutputName": "output-1",
"S3Output": {
"LocalPath": "/opt/ml/processing/output/dataset",
"S3Uri": "<s3_uri>",
"S3UploadMode": "EndOfJob"
}
}
],
"KmsKeyId": null
},
"ProcessingResources": {
"ClusterConfig": {
"InstanceCount": 1,
"InstanceType": "ml.m5.xlarge",
"VolumeSizeInGB": 30,
"VolumeKmsKeyId": null
}
},
"RoleArn": "<IAM role>",
"StoppingCondition": {
"MaxRuntimeInSeconds": 86400
}
}
The /opt/ml/config/resourceconfig.json file contains information about the hostnames of your
processing containers, similar to the following example.

{
"current_host": "algo-1",
"hosts": ["algo-1","algo-2","algo-3"]
}
Don't use the information about hostnames contained in /etc/hostname or /etc/hosts because it
might be inaccurate.
Hostname information might not be immediately available to the processing container. We recommend
adding a retry policy on hostname resolution operations as nodes become available in the cluster.
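A minimal sketch of such a retry, assuming the hostnames are read from /opt/ml/config/resourceconfig.json as shown above:

import json
import socket
import time

def wait_for_hosts(path='/opt/ml/config/resourceconfig.json', retries=30, delay=5):
    # Poll until every host in the cluster resolves; hostnames may lag container start
    with open(path) as f:
        hosts = json.load(f)['hosts']
    for host in hosts:
        for _ in range(retries):
            try:
                socket.gethostbyname(host)
                break
            except socket.gaierror:
                time.sleep(delay)
        else:
            raise RuntimeError(f'Host {host} did not resolve in time')
    return hosts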
Your processing container can write a custom exit message to the file /opt/ml/output/message.
If the data in this file isn't UTF-8 encoded, the job fails and returns a ClientError. If multiple
containers exit with an ExitMessage, the content of the ExitMessage from each processing container
is concatenated, then truncated to 1 KB.
You can run your own image as a processing job with the generic Processor class, as shown in the
following example.

from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(image_uri='<your_ecr_image_uri>',
role=role,
instance_count=1,
instance_type="ml.m5.xlarge")
processor.run(inputs=[ProcessingInput(
source='<s3_uri or local path>',
destination='/opt/ml/processing/input_data')],
outputs=[ProcessingOutput(
source='/opt/ml/processing/processed_data',
destination='<s3_uri>')],
)
Instead of building your processing code into your processing image, you can provide a
ScriptProcessor with your image and the command that you want to run, along with the code
that you want to run inside that container. For an example, see Run Scripts with Your Own Processing
Container (p. 1204).
You can also use the scikit-learn image that Amazon SageMaker Processing provides through
SKLearnProcessor to run scikit-learn scripts. For an example, see Data Processing with scikit-
learn (p. 1198).
Further, the processing logic for your data is authored only once, and features generated are used for
both training and inference, reducing the training-serving skew. Feature Store is a centralized store for
features and associated metadata so features can be easily discovered and reused. You can create an
online or an offline store. The online store is used for low latency real-time inference use cases, and the
offline store is used for training and batch inference.
The following diagram shows how you can use Feature Store as part of your machine learning pipeline.
First, you read in your raw data and process it. You can ingest data via streaming to the online and offline
store, or in batches directly to the offline store. You first create a FeatureGroup and configure it to an
online or offline store, or both. Then, you can ingest data into your FeatureGroup and store it in your
store. A FeatureGroup is a group of features that is defined via a schema in Feature Store to describe a
record.
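A minimal sketch of this flow with the SageMaker Python SDK, assuming a pandas DataFrame df that contains a record identifier column and an event time column; the names and S3 URI are placeholders.

import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
fg = FeatureGroup(name='my-feature-group', sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer the schema from df (df is assumed to exist)
fg.create(
    s3_uri='s3://my-bucket/feature-store/',   # offline store location
    record_identifier_name='customer_id',
    event_time_feature_name='event_time',
    role_arn=sagemaker.get_execution_role(),
    enable_online_store=True                  # online and offline mode
)
fg.ingest(data_frame=df, max_workers=3, wait=True)  # batch ingestion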
The online store is primarily designed to support real-time predictions that need low millisecond latency
reads and high throughput writes. The offline store is primarily intended for batch predictions and model
training. It is an append-only store that you can use to store and access historical feature data, and it can
help you store and serve features for exploration and model training. The online store retains only the
latest feature data. Feature groups are mutable and can evolve their schema after creation.
A feature group is composed of features and values specific to each feature. A Record is a collection of
values for features that correspond to a unique RecordIdentifier. Altogether, a FeatureGroup is a
group of features defined in your FeatureStore to describe a Record.
• Online – In online mode, features are read with low latency (milliseconds) reads and used for high
throughput predictions. This mode requires a feature group to be stored in an online store.
• Offline – In offline mode, large streams of data are fed to an offline store, which can be used for
training and batch inference. This mode requires a feature group to be stored in an offline store. The
offline store uses your S3 bucket for storage and can also fetch data using Athena queries.
• Online and Offline – This includes both online and offline modes.
You can ingest data into feature groups in Feature Store in two ways: streaming or in batches. When
you ingest data through streaming, a collection of records are pushed to Feature Store by calling a
synchronous PutRecord API call. This API enables you to maintain the latest feature values in Feature
Store and to push new feature values as soon as an update is detected.
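A minimal sketch of a streaming ingest call with Boto3; the feature group and feature names are placeholders.

import boto3

runtime = boto3.client('sagemaker-featurestore-runtime')
runtime.put_record(
    FeatureGroupName='my-feature-group',
    Record=[
        {'FeatureName': 'customer_id', 'ValueAsString': '573291'},
        {'FeatureName': 'purchase_amount', 'ValueAsString': '24.99'},
        {'FeatureName': 'event_time', 'ValueAsString': '2021-06-14T12:00:00Z'}
    ]
)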
Alternatively, Feature Store can process and ingest data in batches. You can author features using
Amazon SageMaker Data Wrangler, create feature groups in Feature Store and ingest features in batches
using a SageMaker Processing job with a notebook exported from Data Wrangler. This mode allows for
batch ingestion into the offline store. It also supports ingestion into the online store if the feature group
is configured for both online and offline use.
You can also perform joins across different FeatureGroups for real-time inference by querying two
different FeatureGroups in the client application.
You can query, explore, and visualize features using Data Wrangler from Amazon SageMaker Studio.
Feature Store supports combining data to produce, train, validate, and test data sets, and allows you to
extract data at different points in time.
You can push records to Feature Store by calling the synchronous PutRecord API call. Since this is a
synchronous API call, it allows small batches of updates to be pushed in a single API call. This enables
you to maintain high freshness of the feature values and publish values as soon as an update is detected.
These are also called streaming features.
When feature data is ingested and updated, Feature Store stores historical data for all features in the
offline store. For batch ingest, you can pull feature values from your S3 bucket or use Athena to query.
You can also use Data Wrangler to process and engineer new features that can then be exported to a
chosen S3 bucket to be accessed by Feature Store. For batch ingestion, you can configure a processing
job to batch ingest your data into Feature Store, or you can pull feature values from your S3 bucket using
Athena.
To remove a Record from your online store, use the DeleteRecord API call. This will also add the
deleted record to the offline store.
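A minimal sketch of a delete call with Boto3; the feature group name and values are placeholders.

import boto3

runtime = boto3.client('sagemaker-featurestore-runtime')
runtime.delete_record(
    FeatureGroupName='my-feature-group',
    RecordIdentifierValueAsString='573291',
    EventTime='2021-06-14T12:30:00Z'  # event time to record for the deletion
)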
The following topics show how to create feature groups using an Amazon SageMaker Studio Jupyter or
JupyterLab notebook, how to use Feature Store in
the Studio User Interface, and how to delete feature groups using SDK for Python and Studio.
Topics
• Feature Store concepts (p. 1213)
• Adding required policies to your IAM role (p. 1215)
• Create feature groups (p. 1215)
• Use Amazon SageMaker Feature Store with Amazon SageMaker Studio (p. 1227)
• Delete a feature group (p. 1227)
• Feature Store: Storage and data management layer for machine learning (ML) features. Serves as the
single source of truth to store, retrieve, remove, track, share, discover, and control access to features.
In the following example diagram, the Feature Store is a store for your feature groups, which contains
your ML data, and provides additional services.
• Online store: Low latency, high availability store for a feature group that enables real-time lookup of
records. The online store allows quick access to the latest record via the GetRecord API.
• Offline store: Stores historical data in your Amazon S3 bucket. The offline store is used when low (sub-
second) latency reads are not needed. For example, the offline store can be used when you want to
store and serve features for exploration, model training, and batch inference.
• Feature group: The main resource of Feature Store that contains the data and metadata used for
training or predicting with a ML model. A feature group is a logical grouping of features used to
describe records. In the following example diagram, a feature group contains your ML data.
• Feature: A property that is used as one of the inputs to train or predict using your ML model. In the
Feature Store API a feature is an attribute of a record. In the following example diagram, a feature
describes a column in your ML data table.
• Feature definition: Consists of a name and one of the data types: integral, string or fractional. A
feature group contains a list of feature definitions. For more information on Feature Store data types,
see Data types (p. 1266).
• Record: Collection of values for features for a single record identifier. A combination of record
identifier and event time values uniquely identify a record within a feature group. In the following
example diagram, a record is a row in your ML data table.
• Record identifier name: The record identifier name is the name of the feature that identifies
the records. It must refer to one of the names of a feature defined in the feature group's feature
definitions. Each feature group is defined with a record identifier name.
• Event time: Timestamp that you provide corresponding to when the record event occurred. All records
in a feature group must have a corresponding event time. The online store only contains the record
corresponding to the latest event time, whereas the offline store contains all historic records. For more
information on event time formats, see Data types (p. 1266).
• Ingestion: Adding new records to a feature group. Ingestion is typically achieved via the PutRecord
API.
The Feature Store contains your feature groups, and a feature group contains your ML data. In the example diagram, the feature group's ML data (a table) contains three features (each describing a column) and two records (rows).
• A feature (a column) is described by a feature definition, which specifies the feature name and the data type of the feature values associated with records.
• A record (a row) must be uniquely identified by its record identifier (diamond markers) and include the event time (circle markers) of when the record event occurred.
Ingestion is the action of adding new data to a feature group. Records are added to a feature group differently, depending on whether you are ingesting into the online store or the offline store. If you ingest new data and the record identifier does not already exist within the feature group, the record is added to both stores. If the record identifier already exists within the feature group, the offline store appends the new record, while the online store keeps only the record with the latest event time. For details, see Create feature groups (p. 1215).
Adding required policies to your IAM role

For examples of how to find your execution role ARN for a notebook within SageMaker (from the SageMaker console or Amazon SageMaker Studio), see Get execution role (p. 3086). The role name is at the end of the execution role ARN.
4. After you enter the role in the search bar, choose the role.
Under Permissions policies you can view the policies attached to the role.
5. After you choose the role, choose Add permissions, then choose Attach policies.
6. In the search bar under Other permissions policies, enter AmazonSageMakerFeatureStoreAccess and press Enter. If the policy does not appear, you may already have it attached, listed under your Current permissions policies.
7. After you press enter, select the check box next to the policy and then choose Add permissions.
8. After you have attached the policy to your role, the policy will appear under Permissions policies for
your IAM role.
Create feature groups
Prior to using a feature store, you typically load your dataset, run transformations, and set up your features for ingestion. This process has a lot of variation and is highly dependent on your data. The example code in the following topics refers to the Introduction to Feature Store and Fraud Detection with Amazon SageMaker Feature Store example notebooks, respectively. We recommend that you run these notebooks in Amazon SageMaker Studio, because the code in this guide is conceptual and not fully functional if copied.
Feature Store supports the following data types: String, Fractional (IEEE 64-bit floating point value), and Integral (Int64, a 64-bit signed integral value). The default type is String. This means that, if a column in your dataset is not a float or long type, it defaults to String in your feature store.
You may use a schema to describe your data’s columns and data types. You pass this schema
into FeatureDefinitions, a required parameter for a FeatureGroup. You can use
the SageMaker Python SDK, which has automatic data type detection when you use the
load_feature_definitions function.
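As a minimal sketch, the following shows how you might declare feature definitions explicitly with the SageMaker Python SDK instead of relying on automatic detection; the feature names here are hypothetical.

from sagemaker.feature_store.feature_definition import (
    FeatureDefinition,
    FeatureTypeEnum,
)

# Explicit schema: one definition per column.
feature_definitions = [
    FeatureDefinition(feature_name="customer_id", feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="city", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="avg_spend_30d", feature_type=FeatureTypeEnum.FRACTIONAL),
]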
The default behavior when a new feature record is added with an already existing record ID is as follows. In the offline store, the new record is appended. In the online store, if the event time of the new record is less than the existing event time, nothing happens; however, if the event time of the new record is greater than or equal to the existing event time, the record is overwritten.
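The following minimal sketch illustrates this rule with the PutRecord API; the feature group name, record values, and timestamps are hypothetical.

import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Suppose the stored record for customer 573291 has EventTime "1625100000".
# This call uses an older event time, so the online store is unchanged;
# the offline store still appends the record to the history.
featurestore_runtime.put_record(
    FeatureGroupName="customers-feature-group",  # hypothetical
    Record=[
        {"FeatureName": "customer_id", "ValueAsString": "573291"},
        {"FeatureName": "city", "ValueAsString": "Seattle"},
        {"FeatureName": "EventTime", "ValueAsString": "1625000000"},  # older event time
    ],
)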
When you create a new feature group, you can choose one of the following table formats: AWS Glue (default) or Apache Iceberg.
Ingesting data, especially when streaming, can result in a large number of small files deposited into the offline store. This can negatively impact query performance due to the higher number of file operations required. To avoid potential performance issues, use the Apache Iceberg table format when creating new
feature groups. With Iceberg you can compact the small data files into fewer large files in the partition,
resulting in significantly faster queries. This compaction operation is concurrent and does not affect
ongoing read and write operations on the feature group. If you choose the Iceberg option when creating
new feature groups, Amazon SageMaker Feature Store will create the Iceberg tables using Parquet file
format, and register the tables with the AWS Glue Data Catalog.
Important
Note that for feature groups in Iceberg table format, you must specify String as the value for
the event time. If you specify any other type, you can't create the feature group successfully.
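As a minimal sketch, assuming an existing FeatureGroup object and variables like those in the setup below, you might request the Iceberg format through the SageMaker Python SDK's table_format parameter:

from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import TableFormatEnum

feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",    # offline store location
    record_identifier_name="customer_id",        # hypothetical record identifier
    event_time_feature_name="EventTime",         # must be a String feature for Iceberg
    role_arn=role,
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG,        # AWS Glue format is the default
)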
Topics
• Introduction to Feature Store notebook (p. 1216)
• Fraud detection with Feature Store Notebook (p. 1221)
Introduction to Feature Store notebook
Step 1: Set up
To start using Feature Store, create a SageMaker session and set up the Amazon S3 bucket you want
to use for your features. The Amazon S3 bucket is your offline store. The following code uses the
SageMaker default bucket and adds a custom prefix to it.
Note
The role that you use to run the notebook must have the following managed policies attached
to it: AmazonS3FullAccess and AmazonSageMakerFeatureStoreAccess. For information
on adding policies to your IAM role, see Adding required policies to your IAM role (p. 1215).
import boto3
import pandas as pd
import numpy as np
import io
import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role

prefix = 'sagemaker-featurestore-introduction'
role = get_execution_role()

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3_bucket_name = sagemaker_session.default_bucket()
customer_data = pd.read_csv("data/feature_store_introduction_customer.csv")
orders_data = pd.read_csv("data/feature_store_introduction_orders.csv")
print(customer_data.head())
print(orders_data.head())
The following diagram illustrates the steps the data goes through before it is ingested into Feature
Store. In this notebook, we illustrate the use-case where you have data from multiple sources and want
to store them independently in a Feature Store. Our example considers data from a data warehouse
(customer data), and data from a real-time streaming service (order data).
Step 3: Create feature groups

First, define names for the feature groups and create FeatureGroup objects for the customer and order data.
import time
from time import strftime, gmtime
from sagemaker.feature_store.feature_group import FeatureGroup

customers_feature_group_name = 'customers-feature-group-' + strftime('%d-%H-%M-%S', gmtime())
orders_feature_group_name = 'orders-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

customers_feature_group = FeatureGroup(
    name=customers_feature_group_name, sagemaker_session=sagemaker_session
)
orders_feature_group = FeatureGroup(
    name=orders_feature_group_name, sagemaker_session=sagemaker_session
)
import time
current_time_sec = int(round(time.time()))
record_identifier_feature_name = "customer_id"
Append an EventTime feature to your data frames. This feature is required, and timestamps each data point.

customer_data["EventTime"] = pd.Series([current_time_sec]*len(customer_data), dtype="float64")
orders_data["EventTime"] = pd.Series([current_time_sec]*len(orders_data), dtype="float64")
customers_feature_group.load_feature_definitions(data_frame=customer_data)
orders_feature_group.load_feature_definitions(data_frame=orders_data)
customers_feature_group.create(
s3_uri=f"s3://{s3_bucket_name}/{prefix}",
record_identifier_name=record_identifier_feature_name,
event_time_feature_name="EventTime",
role_arn=role,
enable_online_store=True
)
orders_feature_group.create(
s3_uri=f"s3://{s3_bucket_name}/{prefix}",
record_identifier_name=record_identifier_feature_name,
event_time_feature_name="EventTime",
role_arn=role,
enable_online_store=True
)
To confirm that your feature groups have been created, use the DescribeFeatureGroup and ListFeatureGroups APIs to display them.
customers_feature_group.describe()
orders_feature_group.describe()
sagemaker_session.boto_session.client(
    'sagemaker', region_name=region
).list_feature_groups()  # We use the boto client to list FeatureGroups
def check_feature_group_status(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group to be Created")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    print(f"FeatureGroup {feature_group.name} successfully created.")

check_feature_group_status(customers_feature_group)
check_feature_group_status(orders_feature_group)
customers_feature_group.ingest(
data_frame=customer_data, max_workers=3, wait=True
)
orders_feature_group.ingest(
data_frame=orders_data, max_workers=3, wait=True
)
Using an arbitrary customer record ID, 573291, we use get_record to check that the data has been ingested into the feature group.

customer_id = 573291
sample_record = sagemaker_session.boto_session.client(
    'sagemaker-featurestore-runtime', region_name=region
).get_record(
    FeatureGroupName=customers_feature_group_name,
    RecordIdentifierValueAsString=str(customer_id)
)
print(sample_record)
We use batch_get_record to check that the data has been ingested into both feature groups.

all_records = sagemaker_session.boto_session.client(
    "sagemaker-featurestore-runtime", region_name=region
).batch_get_record(
    Identifiers=[
        {
            "FeatureGroupName": customers_feature_group_name,
            "RecordIdentifiersValueAsString": [str(customer_id)],
        },
        {
            "FeatureGroupName": orders_feature_group_name,
            "RecordIdentifiersValueAsString": [str(customer_id)],
        },
    ]
)
print(all_records)
Step 5: Clean up
Here we remove the Feature Groups we created.
customers_feature_group.delete()
orders_feature_group.delete()
For an advanced example of how to use Feature Store for a fraud detection use case, see Fraud Detection with Feature Store.

For your reference, the following API calls used in this notebook exist within the Python SDK and within boto3.

Python SDK: describe(), ingest(), delete(), create(), load_feature_definitions()
Boto3: list_feature_groups(), get_record()
Fraud detection with Feature Store notebook
import boto3
import sagemaker
from sagemaker.session import Session

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
boto_session = boto3.Session(region_name=region)
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-featurestore'
offline_feature_store_bucket = 's3://{}/{}'.format(default_bucket, prefix)

sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io

s3_client = boto3.client('s3', region_name=region)

fraud_detection_bucket_name = 'sagemaker-featurestore-fraud-detection'
identity_file_key = 'sampled_identity.csv'
transaction_file_key = 'sampled_transactions.csv'

identity_data_object = s3_client.get_object(
    Bucket=fraud_detection_bucket_name, Key=identity_file_key
)
transaction_data_object = s3_client.get_object(
    Bucket=fraud_detection_bucket_name, Key=transaction_file_key
)

identity_data = pd.read_csv(io.BytesIO(identity_data_object['Body'].read()))
transaction_data = pd.read_csv(io.BytesIO(transaction_data_object['Body'].read()))

identity_data = identity_data.round(5)
transaction_data = transaction_data.round(5)

identity_data = identity_data.fillna(0)
transaction_data = transaction_data.fillna(0)
# Feature transformations for this dataset are applied before ingestion into FeatureStore.
# One hot encode card4, card6
encoded_card_bank = pd.get_dummies(transaction_data['card4'], prefix = 'card_bank')
encoded_card_type = pd.get_dummies(transaction_data['card6'], prefix = 'card_type')
For example, in the fraud detection example, the two feature groups are identity and transaction.
In the following code you can see how the names are customized with a timestamp, and then each group
is set up by passing in the name and the session.
import time
from time import gmtime, strftime, sleep
from sagemaker.feature_store.feature_group import FeatureGroup

identity_feature_group_name = 'identity-feature-group-' + strftime('%d-%H-%M-%S', gmtime())
transaction_feature_group_name = 'transaction-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

identity_feature_group = FeatureGroup(
    name=identity_feature_group_name, sagemaker_session=feature_store_session
)
transaction_feature_group = FeatureGroup(
    name=transaction_feature_group_name, sagemaker_session=feature_store_session
)
record_identifier_name = "TransactionID"
event_time_feature_name = "EventTime"
current_time_sec = int(round(time.time()))

identity_data[event_time_feature_name] = pd.Series(
    [current_time_sec]*len(identity_data), dtype="float64"
)
transformed_transaction_data[event_time_feature_name] = pd.Series(
    [current_time_sec]*len(transaction_data), dtype="float64"
)
The SageMaker Python SDK's load_feature_definitions function automatically detects the data type of each column of data. For developers using a schema rather than automatic detection, see the Export Feature Groups from Data Wrangler example for code that shows how to load the schema, map it, and add it as a FeatureDefinition that you can use to create the FeatureGroup. This example also covers a boto3 implementation, which you can use instead of the SageMaker Python SDK.
identity_feature_group.load_feature_definitions(data_frame=identity_data)  # output is suppressed
transaction_feature_group.load_feature_definitions(data_frame=transformed_transaction_data)  # output is suppressed
# create a FeatureGroup; the feature group's name and feature definitions
# are already set on the FeatureGroup object itself
feature_group.create(
    s3_uri = offline_feature_store_bucket,
    record_identifier_name = record_identifier_name,
    event_time_feature_name = event_time_feature_name,
    role_arn = role,
    enable_online_store = True,
    online_store_kms_key_id = None,
    offline_store_kms_key_id = None,
    disable_glue_table_creation = False,
    data_catalog_config = None,
    description = "Some info about the feature group",
    tags = [{"Key": "tag1", "Value": "value1"}])  # tags are key-value pairs
The following code from the fraud detection example shows a minimal create call for each of the two feature groups being created.
identity_feature_group.create(
s3_uri=offline_feature_store_bucket,
record_identifier_name=record_identifier_name,
event_time_feature_name=event_time_feature_name,
role_arn=role,
enable_online_store=True
)
transaction_feature_group.create(
s3_uri=offline_feature_store_bucket,
record_identifier_name=record_identifier_name,
event_time_feature_name=event_time_feature_name,
role_arn=role,
enable_online_store=True
)
When you create a feature group, it takes time to load the data, and you need to wait until the feature
group is created before you can use it. You can check status using the following method.
status = feature_group.describe().get("FeatureGroupStatus")
While the feature group is being created, you receive Creating as a response. When this step has
finished successfully, the response is Created. Other possible statuses are CreateFailed, Deleting,
or DeleteFailed.
Topics
• Describe a feature group (p. 1225)
• List feature groups (p. 1225)
• Put records in a feature group (p. 1225)
• Get records from a feature group (p. 1225)
• Generate hive DDL commands (p. 1226)
• Build a training dataset (p. 1226)
• Write and execute an Athena query (p. 1226)
• Delete a feature group (p. 1227)
You can retrieve information about your feature group with the describe function.
feature_group.describe()
You can list all of your feature groups with the list_feature_groups function.
sagemaker_client.list_feature_groups()
You can use the ingest function to load your feature data. You pass in a data frame of feature data, set
the number of workers, and choose to wait for it to return or not. The following example demonstrates
using the ingest function.
feature_group.ingest(
data_frame=feature_data, max_workers=3, wait=True
)
For each feature group you have, run the ingest function on the feature data you want to load.
You can use the get_record function to retrieve the data for a specific feature by its record identifier.
The following example uses an example identifier to retrieve the record.
record_identifier_value = str(2990130)

featurestore_runtime.get_record(
    FeatureGroupName=transaction_feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value
)
...
'Record': [{'FeatureName': 'TransactionID', 'ValueAsString': '2990130'},
{'FeatureName': 'isFraud', 'ValueAsString': '0'},
{'FeatureName': 'TransactionDT', 'ValueAsString': '152647'},
The SageMaker Python SDK’s FeatureStore class also provides the functionality to generate Hive DDL commands. The schema of the table is generated based on the feature definitions. Columns are named after the feature names, and data types are inferred from the feature types.
print(feature_group.as_hive_ddl())
Feature Store automatically builds an AWS Glue data catalog when you create feature groups; you can turn this off if you want. The following describes how to create a single training dataset with feature values from both the identity and transaction feature groups created earlier in this topic, by running an Amazon Athena query that joins data stored in the offline store.

To start, create an Athena query using athena_query() for both the identity and transaction feature groups. The table_name is the AWS Glue table that Feature Store autogenerates.
identity_query = identity_feature_group.athena_query()
transaction_query = transaction_feature_group.athena_query()
identity_table = identity_query.table_name
transaction_table = transaction_query.table_name
You write your query using SQL on these feature groups, and then run the query with the run() method, specifying the Amazon S3 location where the dataset is to be saved.

# Athena query
query_string = (
    'SELECT * FROM "' + transaction_table + '" LEFT JOIN "' + identity_table + '" '
    'ON "' + transaction_table + '".transactionid = "' + identity_table + '".transactionid'
)
From here you can train a model using this data set and then perform inference.
You can delete a feature group with the delete function.

feature_group.delete()

The following code from the fraud detection example deletes both feature groups.
identity_feature_group.delete()
transaction_feature_group.delete()
Use Amazon SageMaker Feature Store with Amazon SageMaker Studio

Topics
• Create a feature group in Amazon SageMaker Studio (p. 1227)
• View feature group details in Studio (p. 1229)
Create a feature group in Amazon SageMaker Studio

Consider which of the following options best fits your use case:
• Create an online store, an offline store, or both. For more information on the differences between
online and offline stores, see Feature Store concepts (p. 1213).
• Use a default AWS KMS key or your own AWS KMS key. The default key is the AWS managed
encryption key (SSE-S3), though you can reduce AWS KMS request costs by using Amazon S3 bucket
keys. For more information on reducing the cost by using Amazon S3 bucket keys, see Reducing the
cost of SSE-KMS with Amazon S3 Bucket Keys.
You can use the same key for both online and offline stores, or have a unique key for each. For more
information on AWS KMS, see AWS Key Management Service.
• If you create an offline store:
• You should decide if you want to create an Amazon S3 bucket or use an existing one. When using an
existing one, you need to know the Amazon S3 bucket URL or Amazon S3 bucket name and dataset
directory name, if applicable.
• You should choose which IAM role ARN to use. For more information on how to find your role and
attached policies, see Adding required policies to your IAM role (p. 1215).
• You should decide whether to use the AWS Glue (default) or Apache Iceberg table format. In most use cases you will want to use the Apache Iceberg table format. For more information on table formats, see Create feature groups (p. 1215).
1. Open Studio. For more information, see Launch Amazon SageMaker Studio (p. 133).
2. Choose the Home icon on the left panel.
3. Choose Data.
4. From the dropdown list, choose Feature Store.
5. Choose Create feature group.
6. Under Feature group details, enter a feature group name.
7. (Optional) Enter a description of the feature group.
8. Under Feature group storage configuration, choose a storage type from the Storage type
dropdown list.
a. From the S3 bucket name dropdown list, you may choose an existing Amazon S3 bucket name,
enter a new bucket name, or choose Enter bucket URL manually and enter the URL under S3
bucket address.
b. (Optional) If you have a specified directory name for your dataset, choose from the Dataset
directory name dropdown list.
c. From the Table format dropdown list, choose the table format. In most use cases, you should
use the Apache Iceberg Table format. For more information on table formats, see Create
feature groups (p. 1215).
d. Under IAM role ARN, choose the IAM role ARN you want to attach to this feature group. For
more information on how to find your role and attached policies, see Adding required policies to
your IAM role (p. 1215).
9. Under the Offline store encryption key dropdown list, choose Use AWS managed AWS KMS key (default) or Enter an AWS KMS key ARN and enter your AWS KMS key ARN under Offline store encryption key ARN. For more information about AWS KMS, see AWS Key Management Service.
10. If you have chosen the AWS Glue (default) table format for your offline store, under Data catalog you have the option to choose Use default values for your AWS Glue data catalog, or to provide your existing data catalog name, table name, and database name to extend your existing AWS Glue catalog.
11. Once all of the required information has been specified, the Continue button is available. Choose
Continue.
12. Under Specify feature definitions, you have two options for providing a schema for your features: a
JSON editor, or a table editor. In the JSON tab, type in or copy and paste your feature definitions in
the JSON format. For the table editor, type in the name and choose the corresponding data type for
each feature in your feature group. Choose Add feature definitions to include more features.
There must be at least two features in a feature group, representing the record identifier and the event time.
18. Under Review feature group, review the feature group information. You can edit any step by choosing the Edit button that corresponds to that step, which brings you to the corresponding step for editing. To return to the review page, choose Continue until you reach it again.
19. Once you have finalized the setup for your feature group, choose Create feature group.
If there are any issues with the setup, there is a red alert pop-up message that appears at the
bottom of the page with tips on solving the issue. You can return to previous steps to fix them.
If the feature group has been successfully created, a green pop-up message appears at the bottom
of the page. When the feature group is successfully created, it appears in your feature groups
catalog.
View feature group details in Studio

1. Open Studio. For more information, see Launch Amazon SageMaker Studio (p. 133).
2. Choose the Home icon on the left panel.
3. Choose Data.
4. From the dropdown list, choose Feature Store.
5. Under the Feature group catalog tab, choose your feature group name from the list. This opens the
feature group page.
6. Under the Details tab, you can review your feature group Information and Tags. Choose Add new
tag to add a new tag or remove to remove a tag.
7. On the Features tab, you can find a list of all of the features. Use the filter to refine your list. Choose
a feature to view its details.
Delete a feature group

The following sections provide an overview of using Studio and the SDK for Python to delete a feature group.
Topics
• Delete a feature group using Studio (p. 1229)
• Delete feature group example Python code (p. 1230)
Delete feature group example Python code
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sagemaker_session = sagemaker.Session()
fg_name = 'your-feature-group-name'

my_feature_group = FeatureGroup(name=fg_name, sagemaker_session=sagemaker_session)
my_feature_group.delete()
Data sources and ingestion

Topics
• Stream ingestion (p. 1230)
• Data Wrangler with Feature Store (p. 1230)
• Batch ingestion with Amazon SageMaker Feature Store Spark (p. 1232)
Stream ingestion

You can use streaming sources such as Kafka or Kinesis as a data source, extract features from the stream, and feed them directly to the online feature store for training, inference, or feature creation. Records can be pushed into the feature store by calling the synchronous PutRecord API call. Because the call is synchronous, small batches of updates can be pushed in a single API call. This enables you to maintain high freshness of the feature values and publish values as soon as an update is detected. These are also called streaming features.
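The following is a minimal sketch, with a hypothetical feature group name and record values, of pushing one streaming update with Boto3:

import time

import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Push a single streaming update; the online store serves the new value immediately.
featurestore_runtime.put_record(
    FeatureGroupName="orders-feature-group",  # hypothetical feature group
    Record=[
        {"FeatureName": "customer_id", "ValueAsString": "573291"},
        {"FeatureName": "order_total", "ValueAsString": "42.50"},
        {"FeatureName": "EventTime", "ValueAsString": str(int(round(time.time())))},
    ],
)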
Data Wrangler with Feature Store

In Studio, after interacting with Data Wrangler, choose the Export tab, choose Export Step, and then choose Feature Store. This exports a Jupyter notebook that has all the source code in it to create a Feature Store feature group that adds your features from Data Wrangler to an offline or online feature store.
After the feature group has been created, you can also select and join data across multiple feature
groups to create new engineered features in Data Wrangler and then export your data set to an S3
bucket.
For more information on how to export to Feature Store, see Export to SageMaker Feature Store.
Batch ingestion with Amazon SageMaker Feature Store Spark

Methods for installing and implementing batch data ingestion are provided for Python and Scala
developers. Python developers can use the open-source sagemaker-feature-store-pyspark Python
library for local development, installation on Amazon EMR, and for Jupyter Notebooks by following the
instructions in the Amazon SageMaker Feature Store Spark GitHub repository. Scala developers can use
the Feature Store Spark connector available in the Amazon SageMaker Feature Store Spark SDK Maven
central repository.
You can use the Spark connector to ingest data in the following ways, depending on whether the online store, the offline store, or both are enabled.
1. Ingest by default – If the online store is enabled, the Spark connector first ingests your dataframe into the online store using the PutRecord API. Only the record with the largest event time remains in the online store. If the offline store is enabled, within 15 minutes Feature Store ingests your dataframe into the offline store. For more information about how the online and offline stores work, see Feature Store concepts (p. 1213). You can accomplish this by not specifying target_stores in the .ingest_data(...) method.
2. Offline store direct ingestion – If the offline store is enabled, the Spark connector batch ingests your dataframe directly into the offline store. Ingesting the dataframe directly into the offline store doesn't update the online store.
For information about using the different ingestion methods, see Example implementations (p. 1236).
Topics
• Feature Store Spark installation (p. 1232)
• Retrieving the JAR for Feature Store Spark (p. 1235)
• Example implementations (p. 1236)
Scala users

The Feature Store Spark SDK is available in the Amazon SageMaker Feature Store Spark SDK Maven central repository for Scala users.
Requirements
The Feature Store Spark connector has a dependency on the iceberg-spark-runtime library. You must therefore add the corresponding version of the iceberg-spark-runtime library to your dependencies if you're ingesting data into a feature group that you've auto-created with the Iceberg table format. For example, if you're using Spark 3.1, you must declare the following in your project's POM.xml:
<dependency>
<groupId>software.amazon.sagemaker.featurestore</groupId>
<artifactId>sagemaker-feature-store-spark-sdk_2.12</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-spark-runtime-3.1_2.12</artifactId>
<version>0.14.0</version>
</dependency>
Python users
The Feature Store Spark SDK is available in the open-source Amazon SageMaker Feature Store Spark
GitHub repository.
Requirements
We recommend setting $SPARK_HOME to the directory where you have Spark installed. During installation, Feature Store uploads the required JAR to SPARK_HOME so that the dependencies load automatically. This PySpark library requires Spark to start a JVM.
Local installation

To find more information about the installation, enable verbose mode by appending --verbose to the following installation command.

pip3 install sagemaker-feature-store-pyspark-3.1 --no-binary :all:
Installation on Amazon EMR

Create an Amazon EMR cluster with release version 6.1.0 or later. Enable SSH to help you troubleshoot any issues.
Note
The following information uses Spark version 3.1, but you can specify any version that meets
the requirements.
export SPARK_HOME=/usr/lib/spark
sudo -E pip3 install sagemaker-feature-store-pyspark-3.1 --no-binary :all: --verbose
Note
If you want to install the dependent JARs automatically to SPARK_HOME, do not use the
bootstrap step.
Install a version of PySpark that's compatible with the Spark connector, and then install the connector itself, using the following commands (the Spark version here follows the preceding note):

pip3 install pyspark==3.1.1
pip3 install sagemaker-feature-store-pyspark-3.1 --no-binary :all:

If you're performing batch ingestion to the offline store, the dependencies aren't within the notebook instance environment, so you must provide them to the Spark session:
import feature_store_pyspark
from pyspark.sql import SparkSession

extra_jars = ",".join(feature_store_pyspark.classpath_jars())

spark = SparkSession.builder \
    .config("spark.jars", extra_jars) \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.1,org.apache.hadoop:hadoop-common:3.2.1") \
    .getOrCreate()
Use the following information to help you install the PySpark connector in an AWS Glue Interactive
Session (GIS).
Amazon SageMaker Feature Store Spark requires a specific Spark connector JAR during the initialization
of the session to be uploaded to your Amazon S3 bucket. For more information on uploading the
required JAR to your S3 bucket, see Retrieving the JAR for Feature Store Spark (p. 1235).
After you’ve uploaded the JAR, you must provide the GIS sessions with the JAR using the following
command.
%extra_jars s3://<YOUR_BUCKET>/spark-connector-jars/sagemaker-feature-store-spark-sdk.jar
To install Feature Store Spark in the AWS Glue runtime, use the %additional_python_modules magic command within the GIS notebook. AWS Glue runs pip install for the modules that you've specified under %additional_python_modules.
%additional_python_modules sagemaker-feature-store-pyspark-3.1
Before you start the AWS Glue session, you must use both of the preceding magic commands.
To install the Spark connector on an AWS Glue job, use the --extra-jars argument to provide the necessary JARs and --additional-python-modules to install the Spark connector as job parameters when you create the AWS Glue job, as shown in the following example. For more information on uploading the required JAR to your S3 bucket, see Retrieving the JAR for Feature Store Spark (p. 1235).
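The following is a minimal sketch, with hypothetical job, role, script, and bucket names, of creating such a job with Boto3:

import boto3

glue = boto3.client("glue")

# Both job parameters point at artifacts you uploaded to your own bucket.
glue.create_job(
    Name="feature-store-batch-ingestion",                     # hypothetical job name
    Role="arn:aws:iam::<ACCOUNT_ID>:role/<YOUR_GLUE_ROLE>",   # hypothetical role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<YOUR_BUCKET>/scripts/FeatureStoreBatchIngestion.py",
    },
    GlueVersion="3.0",
    DefaultArguments={
        "--extra-jars": "s3://<YOUR_BUCKET>/spark-connector-jars/sagemaker-feature-store-spark-sdk.jar",
        "--additional-python-modules": "sagemaker-feature-store-pyspark-3.1",
    },
)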
To use Feature Store Spark with Amazon SageMaker Processing jobs, bring your own image. For
more information about bringing your image, see Bring your own SageMaker image (p. 169). Add the
installation step to a Dockerfile. After you've pushed the Docker image to an Amazon ECR repository,
you can use the PySparkProcessor to create the processing job. For more information about creating a
processing job with the PySpark processor, see Data Processing with Apache Spark (p. 1197).
FROM <ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/sagemaker-spark-processing:3.1-cpu-py38-v1.0

RUN pip3 install sagemaker-feature-store-pyspark-3.1 --no-binary :all: --verbose
Retrieving the JAR for Feature Store Spark
After you've installed Feature Store Spark, you can retrieve the JAR location and upload the JAR to
Amazon S3.
import boto3

jar_location = !feature-store-pyspark-dependency-jars
jar_location = jar_location[0]

s3_client = boto3.client("s3")
s3_client.upload_file(
    jar_location,
    "<YOUR_BUCKET>",
    "spark-connector-jars/sagemaker-feature-store-spark-sdk.jar",
)
Example implementations
Example Python script
FeatureStoreBatchIngestion.py
from pyspark.sql import SparkSession
from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager

spark = SparkSession.builder \
    .getOrCreate()

feature_store_manager = FeatureStoreManager()

# data and columns are assumed to hold your rows and column names.
df = spark.createDataFrame(data).toDF(*columns)

# Load the feature definitions from the input schema. The feature definitions
# can be used to create a feature group.
feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

feature_group_arn = "arn:aws:sagemaker:<AWS_REGION>:<ACCOUNT_ID>:feature-group/<YOUR_FEATURE_GROUP_NAME>"

# Ingest by default. The connector uses the PutRecord API to ingest your data as a stream.
# https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_PutRecord.html
feature_store_manager.ingest_data(input_data_frame=df, feature_group_arn=feature_group_arn)

# To select the target stores for ingestion, specify the target_stores parameter.
# If OnlineStore is selected, the connector uses the PutRecord API to ingest your data as a stream.
feature_store_manager.ingest_data(input_data_frame=df, feature_group_arn=feature_group_arn, target_stores=["OfflineStore", "OnlineStore"])

# If only OfflineStore is selected, the connector batch writes the data directly to the offline store.
feature_store_manager.ingest_data(input_data_frame=df, feature_group_arn=feature_group_arn, target_stores=["OfflineStore"])
The PySpark version requires an extra dependent JAR to be imported, so extra steps are needed to run the Spark application. If you did not specify SPARK_HOME during installation, then you have to load the required JARs in the JVM when running spark-submit. feature-store-pyspark-dependency-jars is a Python script installed by the Spark library that automatically fetches the path to all JARs for you. If you are running this application on Amazon EMR, we recommend that you run the application in client mode, so that you do not need to distribute the dependent JARs to the other task nodes. Add one more step to the Amazon EMR cluster with a Spark argument similar to the following:
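A minimal sketch of such a step's spark-submit invocation, assuming the script name used above:

spark-submit --master yarn \
    --deploy-mode client \
    --jars $(feature-store-pyspark-dependency-jars) \
    FeatureStoreBatchIngestion.py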
FeatureStoreBatchIngestion.scala

import software.amazon.sagemaker.featurestore.sparksdk.FeatureStoreManager
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object TestSparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val featureStoreManager = new FeatureStoreManager()

    // df and featureGroupArn are assumed to be defined as in the Python example.

    // Load the feature definitions from the input schema. The feature definitions
    // can be used to create a feature group.
    val featureDefinitions = featureStoreManager.loadFeatureDefinitionsFromSchema(df)

    // Ingest by default. The connector uses the PutRecord API to ingest your data as a stream.
    // https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_PutRecord.html
    featureStoreManager.ingestData(df, featureGroupArn)

    // To select the target stores for ingestion, specify the target stores as a parameter.
    // If OnlineStore is selected, the connector uses the PutRecord API to ingest your data as a stream.
    featureStoreManager.ingestData(df, featureGroupArn, List("OfflineStore", "OnlineStore"))

    // If only OfflineStore is selected, the connector batch writes the data directly to the offline store.
    featureStoreManager.ingestData(df, featureGroupArn, List("OfflineStore"))
  }
}
Scala

You should be able to use Feature Store Spark as a normal dependency. No extra instructions are needed to run the application on any platform.
Add features to a feature group

The features that you've added don't have any data. You can add new records to the feature group or overwrite them. You can think of a record as a row in the data table.
The following sections provide an overview of using the API and Studio to add features to a feature
group. With the API, you can also add or overwrite records after you've updated the feature group.
Studio

1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Feature Store.
8. Choose Feature group catalog.
9. Under Feature group name, choose a feature group.
10. Choose Add feature definitions.
11. Choose Add feature definition.
12. Specify a name for the Feature name field.
13. For Type, select the feature's data type.
14. Choose Add new feature definition.
15. (Optional) Choose Add new feature definition to add feature definitions.
16. Specify information for the additional features.
17. Choose Save changes.
18. Choose Confirm.
API
Use the UpdateFeatureGroup operation to add features to a feature group.
You can use the DescribeFeatureGroup operation to see if you've added the features successfully.
To see the updates that you've made to a record, use the GetRecord operation. To see the updates that
you've made to multiple records, use the BatchGetRecord operation. It can take up to five minutes for
the updates that you've made to appear.
You can use the example code in the following section to walk through adding features and records
using the AWS SDK for Python (Boto3).
Example code
The example code walks you through the following process: adding features to a feature group, verifying that they've been added, and adding a record that includes values for the new features.
import boto3
sagemaker_client = boto3.client("sagemaker")
feature_group_name = "your-feature-group-name"  # an existing feature group (assumed)
sagemaker_client.update_feature_group(
FeatureGroupName=feature_group_name,
FeatureAdditions=[
{"FeatureName": "new-feature-1", "FeatureType": "Integral"},
{"FeatureName": "new-feature-2", "FeatureType": "Fractional"},
{"FeatureName": "new-feature-3", "FeatureType": "String"}
]
)
The following code uses the DescribeFeatureGroup operation to check the status of the update. If
the LastUpdateStatus field is Successful, you've added the features successfully.
sagemaker_client.describe_feature_group(
FeatureGroupName=feature_group_name
)
record_identifier_value = 'new_record'

sagemaker_featurestore_runtime_client = boto3.client("sagemaker-featurestore-runtime")
sagemaker_featurestore_runtime_client.put_record(
FeatureGroupName=feature_group_name,
Record=[
{
'FeatureName': "record-identifier-feature-name",
'ValueAsString': record_identifier_value
},
{
'FeatureName': "event-time-feature",
'ValueAsString': "timestamp-that-feature-store-returns"
},
{
'FeatureName': "new-feature-1",
'ValueAsString': "value-as-string"
},
{
'FeatureName': "new-feature-2",
'ValueAsString': "value-as-string"
},
{
'FeatureName': "new-feature-3",
'ValueAsString': "value-as-string"
},
]
)
Use the GetRecord operation to see which records in your feature group don't have data for the
features that you've added. You can use the PutRecord operation to overwrite the records that don't
have data for the features that you've added.
Find features in your feature groups
To search for features in your feature groups, the feature groups must be within the same AWS account
and Region.
Important
Use the latest version of Amazon SageMaker Studio to make sure that you're using the most
recent version of the search functionality. For information on updating Studio, see Shut down
and Update SageMaker Studio (p. 199).
If you're on a team, your teammates looking for features to use in their models can search through all of the features in all of your feature groups.
You can add searchable parameters and descriptions to make your features more discoverable. For more
information, see Adding searchable metadata to your features (p. 1248).
You can search for features using either Amazon SageMaker Studio or the Search operation in the SageMaker API. Several types of feature metadata are searchable, though not all of them can be searched from Studio.
The following sections show you how to search for your features.
Studio
Use the following procedure to search through all the features that you've created.
1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
The following example uses the Search operation in the AWS SDK for Python (Boto3) to run the search query. For information about submitting queries in other languages, see the See Also section for Search in the Amazon SageMaker API Reference.
The following code shows different example search queries using the API.
# Search for all features that belong to a feature group that contains the "ver" substring
sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
]
}
)
# Search for all features that belong to a feature group that has the EXACT name "airport"
sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Equals',
'Value': 'airport'
},
]
}
)
# Search for all features that belong to a feature group that contains the name "ver"
# AND have a name that contains "wha"
# AND have a parameter (key or value) that contains "hea"
sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'FeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllParameters',
'Operator': 'Contains',
'Value': 'hea'
},
]
}
)
# Search for all features that belong to a feature group with substring "ver" in its name
# OR features that have a name that contains "wha"
# OR features that have a parameter (key or value) that contains "hea"
sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'FeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllParameters',
'Operator': 'Contains',
'Value': 'hea'
},
],
'Operator': 'Or' # note that this is explicitly set to "Or"- the default is
"And"
}
)
# Search for all features that belong to a feature group with substring "ver" in its name
# OR features that have a name that contains "wha"
# OR parameters with the value 'Sage' for the 'org' key
sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
Find feature groups in your Feature Store
'Name': 'FeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'Parameters.org',
'Operator': 'Contains',
'Value': 'Sage'
},
],
'Operator': 'Or' # note that this is explicitly set to "Or"- the default is
"And"
}
)
You can search for feature groups using either Amazon SageMaker Studio or the Search operation in the SageMaker API. Several fields are searchable, though not all of them can be searched from Studio.
The following sections show you how to search for your feature groups.
Studio
Use the following procedure to search through all the feature groups that you've created.
1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Feature Store.
8. Under Feature Group Catalog, specify a text query with at least three characters to search for
your feature groups.
9. (Optional) Use advanced filters after you've specified a query. You can use filters to specify
parameters or date ranges in your search results. If you're searching for a parameter, specify
both its key and value. To find your features more easily, you can do the following:
The following example uses the Search operation in the AWS SDK for Python (Boto3) to run the search query. For information about submitting queries in other languages, see the See Also section for Search in the Amazon SageMaker API Reference.
The following code shows different example search queries using the API.
# Search for all feature groups with a name that contains the "ver" substring
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
]
}
)
# Search for all feature groups that have the EXACT name "airport"
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Equals',
'Value': 'airport'
},
]
}
)
# Search for all feature groups that contains the name "ver"
# AND have a record identifier feature name that contains "wha"
# AND have a tag (key or value) that contains "hea"
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'RecordIdentifierFeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllTags',
'Operator': 'Contains',
'Value': 'hea'
},
]
}
)
# Search for all feature groups with substring "ver" in its name
# OR feature groups that have a record identifier feature name that contains "wha"
# OR feature groups that have a tag (key or value) that contains "hea"
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'RecordIdentifierFeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllTags',
'Operator': 'Contains',
'Value': 'hea'
},
],
'Operator': 'Or' # note that this is explicitly set to "Or"- the default is
"And"
}
)
# Search for all feature groups with substring "ver" in its name
# OR feature groups that have a record identifier feature name that contains "wha"
# OR tags with the value 'Sage' for the 'org' key
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'RecordIdentifierFeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'Tags.org',
'Operator': 'Contains',
'Value': 'Sage'
},
],
'Operator': 'Or' # note that this is explicitly set to "Or"- the default is
"And"
}
)
# Search for all feature groups that are online only
sagemaker_client.search(
    Resource="FeatureGroups",
    SearchExpression={
        'Filters': [
            {
                'Name': 'OnlineStoreConfig.EnableOnlineStore',
                'Operator': 'Equals',
                'Value': 'true'
            },
            {
                'Name': 'OfflineStoreConfig.S3StorageConfig.S3Uri',
                'Operator': 'NotExists'
            }
        ]
    }
)
# Search for all feature groups that are BOTH online and offline
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'OnlineStoreConfig.EnableOnlineStore',
'Operator': 'Equals',
'Value': 'true'
},
{
'Name': 'OfflineStoreConfig.S3StorageConfig.S3Uri',
'Operator': 'Exists'
}
]
}
)
For parameters, you must specify a key-value pair in your search. You can add up to 25 parameters.

Adding searchable metadata to your features

To update the metadata of a feature, you can use either Amazon SageMaker Studio or the UpdateFeatureMetadata operation.
Use the following procedure to update the metadata using Amazon SageMaker Studio.
1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
The following describes how you can use the UpdateFeatureMetadata operation for different
scenarios.
To add a list of parameters to a feature, specify values for the following fields:
• FeatureGroupName
• Feature
• Parameters
The following example code uses the AWS SDK for Python (Boto3) to add two parameters.
sagemaker_client.update_feature_metadata(
FeatureGroupName="feature_group_name",
FeatureName="feature-name",
ParameterAdditions=[
{"Key": "example-key-0", "Value": "example-value-0"},
{"Key": "example-key-1", "Value": "example-value-1"},
]
)
To add a description to a feature, specify values for the following fields:
• FeatureGroupName
• Feature
• Description
sagemaker_client.update_feature_metadata(
FeatureGroupName="feature-group-name",
FeatureName="feature-name",
Description="description"
)
To remove parameters from a feature, specify values for the following fields:
• FeatureGroupName
• Feature
Specify the keys for the parameters that you're removing under ParameterRemovals.
sagemaker_client.update_feature_metadata(
FeatureGroupName="feature_group_name",
FeatureName="feature-name",
ParameterRemovals=[
{"Key": "example-key-0"},
{"Key": "example-key-1"},
]
)
To remove the description from a feature, specify values for the following fields:
• FeatureGroupName
• Feature
sagemaker_client.update_feature_metadata(
FeatureGroupName="feature-group-name",
FeatureName="feature-name",
Description=""
)
After you've updated the metadata for a feature, you can use the DescribeFeatureMetadata
operation to see the updates that you've made.
The following code goes through an example workflow using the AWS SDK for Python (Boto3).
Example code

The example code does the following: it creates a feature group, describes it, updates a feature's description and parameters, and then verifies the updates with DescribeFeatureMetadata.
Step 1: Setup
To start using Feature Store, create SageMaker, boto3 and Feature Store sessions. Then set up the
S3 bucket you want to use for your features. This is your offline store. The following code uses the
SageMaker default bucket and adds a custom prefix to it.
Note
The role that you use must have the following managed policies attached to it:
AmazonS3FullAccess and AmazonSageMakerFeatureStoreAccess.
import boto3
import pandas as pd
import numpy as np
import io
import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role
from botocore.exceptions import ClientError

prefix = 'sagemaker-featurestore-introduction'
role = get_execution_role()

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3_bucket_name = sagemaker_session.default_bucket()
sagemaker_client = boto3.client("sagemaker", region_name=region)
feature_group_name = "test-for-feature-metadata"
feature_definitions = [
{"FeatureName": "feature-1", "FeatureType": "String"},
{"FeatureName": "feature-2", "FeatureType": "String"},
{"FeatureName": "feature-3", "FeatureType": "String"},
{"FeatureName": "feature-4", "FeatureType": "String"},
{"FeatureName": "feature-5", "FeatureType": "String"}
]
try:
    sagemaker_client.create_feature_group(
        FeatureGroupName=feature_group_name,
        RecordIdentifierFeatureName="feature-1",
        EventTimeFeatureName="feature-2",
        FeatureDefinitions=feature_definitions,
        OnlineStoreConfig={"EnableOnlineStore": True}
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ResourceInUse":
        pass
    else:
        raise e
sagemaker_client.describe_feature_group(
    FeatureGroupName=feature_group_name
)
sagemaker_client.update_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1",
Description="new description"
)
You can use the DescribeFeatureMetadata operation to see if you have successfully updated the description for the feature.
sagemaker_client.describe_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1"
)
sagemaker_client.update_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1",
ParameterAdditions=[
{"Key": "team", "Value": "featurestore"},
{"Key": "org", "Value": "sagemaker"},
]
)
You can use the DescribeFeatureMetadata operation again to see if you have successfully added the
parameters.
sagemaker_client.describe_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1"
)
Create a dataset from your feature groups

Important
Feature Store requires data to be registered in an AWS Glue data catalog. By default, Feature Store automatically builds an AWS Glue data catalog when you create a feature group.
After you've created feature groups for your offline store and populated them with data, you can create a
dataset by running queries or using the SDK to join data stored in the offline store from different feature
groups. You can also join the feature groups to a single pandas dataframe. You can use Amazon Athena
to write and execute SQL queries.
Note
To make sure that your data is up to date, you can set up an AWS Glue crawler to run on a schedule.
To set up an AWS Glue crawler, specify an IAM role that the crawler uses to access the offline store's Amazon S3 buckets. For more information, see Create an IAM role.
For more information on how to use AWS Glue and Athena to build a training dataset for model
training and inference, see Create feature groups (p. 1215).
By default, Feature Store doesn't include records that you've deleted from the dataset. It also doesn't include duplicated records. A duplicate record has the same record ID and timestamp value in the event time column.
Using the Amazon SageMaker Python SDK to get your data from your feature groups

Before you use the SDK to create a dataset, you must start a SageMaker session. Use the following code to start the session.
import boto3
from sagemaker.session import Session
from sagemaker.feature_store.feature_store import FeatureStore
region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(
service_name="sagemaker", region_name=region
)
featurestore_runtime = boto_session.client(
service_name="sagemaker-featurestore-runtime",region_name=region
)
feature_store_session = Session(
boto_session=boto_session,
sagemaker_client=sagemaker_client,
sagemaker_featurestore_runtime_client=featurestore_runtime,
)
feature_store = FeatureStore(feature_store_session)
The following code shows an example of creating a dataset from multiple feature groups. The
following code snippet uses the example feature groups "base_fg_name", "first_fg_name", and
"second_fg_name", which may not exist or have the same schema within your Feature Store. It is
recommended to replace these feature groups with feature groups that exist within your Feature Store.
For information on how to create a feature group, see Step 3: Create feature groups (p. 1219).
from sagemaker.feature_store.feature_group import FeatureGroup

s3_bucket_name = "offline-store-sdk-test"

base_fg_name = "base_fg_name"
base_fg = FeatureGroup(name=base_fg_name, sagemaker_session=feature_store_session)

first_fg_name = "first_fg_name"
first_fg = FeatureGroup(name=first_fg_name, sagemaker_session=feature_store_session)

second_fg_name = "second_fg_name"
second_fg = FeatureGroup(name=second_fg_name, sagemaker_session=feature_store_session)

feature_store = FeatureStore(feature_store_session)
builder = feature_store.create_dataset(
    base=base_fg,
    output_path=f"s3://{s3_bucket_name}",
).with_feature_group(first_fg
).with_feature_group(second_fg, "base_id", ["base_feature_1"])
The following code shows an example of creating a dataset from multiple feature groups and a pandas dataframe; base_data_df is assumed to be a pandas dataframe with base_id and base_time columns.

builder = feature_store.create_dataset(
    base=base_data_df,
    event_time_identifier_feature_name='base_time',
    record_identifier_feature_name='base_id',
    output_path=f"s3://{s3_bucket_name}"
).with_feature_group(first_fg
).with_feature_group(second_fg, "base_id", ["base_feature_1"])
The Feature Store APIs provide helper methods for the create_dataset function. The base feature group is an important concept for joins: it is the feature group that has the other feature groups or the pandas dataframe joined to it.
You can add the following optional methods to the create_dataset function to configure how you create the dataset (a short sketch follows this list):
• with_feature_group – Performs an inner join between the base feature group and another feature
group using the record identifier and the target feature name in the base feature group. The following
provides information about the parameters that you specify:
• feature_group – The feature group that you're joining.
• target_feature_name_in_base – The name of the feature in the base feature group that you're
using as a key in the join. The record identifier in the other feature groups are the other keys that
Feature Store uses in the join.
• included_feature_names – A list of strings representing the feature names of the base feature
group. You can use the field to specify the features that you want to include in the dataset.
• feature_name_in_target – Optional string representing the feature in the target feature group
that will be compared to the target feature in the base feature group.
• join_comparator – Optional JoinComparatorEnum representing the comparator used when
joining the target feature in the base feature group and the feature in the target feature group.
These JoinComparatorEnum values can be GREATER_THAN, GREATER_THAN_OR_EQUAL_TO,
LESS_THAN, LESS_THAN_OR_EQUAL_TO, NOT_EQUAL_TO or EQUALS by default.
• join_type – Optional JoinTypeEnum representing the type of join between the base and target
feature groups. These JoinTypeEnum values can be LEFT_JOIN, RIGHT_JOIN, FULL_JOIN,
CROSS_JOIN or INNER_JOIN by default.
• with_event_time_range – Creates a dataset using the event time range that you specify.
• as_of – Creates a dataset up to a timestamp that you specify. For example, if you specify
datetime(2021, 11, 28, 23, 55, 59, 342380) as the value, the dataset includes records up to
November 28, 2021.
• point_in_time_accurate_join – Creates a dataset where the event time values of the feature
group or pandas dataframe that you're joining are no later than the event time values of the base
feature group.
• include_duplicated_records – Keeps duplicated values in the feature groups.
• include_deleted_records – Keeps deleted values in the feature groups.
• with_number_of_recent_records_by_record_identifier – An integer that you specify to
determine how many of the most recent records appear in the dataset.
• with_number_of_records_by_record_identifier – An integer that represents how many
records for each record identifier appear in the dataset.
After you've configured the dataset, you can specify the output using one of the following methods:
• to_csv_file – Writes the dataset to a CSV file in Amazon S3.
• to_dataframe – Returns the dataset as a pandas dataframe.
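The following is a minimal sketch of materializing a configured dataset, assuming builder was created
as in the preceding examples. The return values shown follow the SageMaker Python SDK's
DatasetBuilder; verify them against your SDK version.
# Write the dataset to Amazon S3 as a CSV file.
csv_path, query_string = builder.to_csv_file()
print(csv_path)      # Amazon S3 location of the CSV file
print(query_string)  # the Athena query used to build the dataset

# Or load the dataset into memory as a pandas dataframe.
df, query_string = builder.to_dataframe()
print(df.head())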
You can limit how many records appear in the dataset. The following code creates a dataset that
contains at most five records from the query results.
fg1 = FeatureGroup("example-feature-group-1")
feature_store.create_dataset(
base=fg1,
output_path="s3://example-S3-path"
).with_number_of_records_from_query_results(5).to_csv_file()
You can also retrieve data from a specific time period. You can use the following code to get data for a
specific time range:
from datetime import datetime

fg1 = FeatureGroup("fg1")
feature_store.create_dataset(
    base=fg1,
    output_path="s3://example-S3-path"
).with_event_time_range(
    datetime(2020, 11, 28, 23, 55, 59, 342380),
    datetime(2021, 11, 28, 23, 55, 59, 342380)
).to_csv_file()  # example time range specified with datetime values
You might want to join multiple feature groups to a pandas dataframe where the event time values of
the feature group happen no later than the event time of the data frame. Use the following code as a
template to help you perform the join.
import pandas as pd

fg1 = FeatureGroup("fg1")
fg2 = FeatureGroup("fg2")
events = [['2020-02-01T08:30:00Z', 6, 1],
          ['2020-02-02T10:15:30Z', 5, 2],
          ['2020-02-03T13:20:59Z', 1, 3],
          ['2021-01-01T00:00:00Z', 1, 4]]
df = pd.DataFrame(events, columns=['event_time', 'customer-id', 'title-id'])
feature_store.create_dataset(
    base=df,
    event_time_identifier_feature_name='event_time',
    record_identifier_feature_name='customer-id',
    output_path="s3://example-S3-path"
).with_feature_group(fg1, "customer-id"
).with_feature_group(fg2, "title-id"
).point_in_time_accurate_join(
).to_csv_file()
You can also retrieve data as it existed up to a specific point in time. The following code retrieves data
up to the time specified by the timestamp in the as_of method.
fg1 = FeatureGroup("fg1")
feature_store.create_dataset(
    base=fg1,
    output_path="s3://example-s3-file-path"
).as_of(datetime(2021, 11, 28, 23, 55, 59, 342380)
).to_csv_file()  # example datetime values
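If you need the raw history instead of the cleaned view, a sketch like the following keeps duplicated
and deleted records in the dataset, assuming fg1 and feature_store are set up as in the previous
examples.
fg1 = FeatureGroup("fg1")
feature_store.create_dataset(
    base=fg1,
    output_path="s3://example-S3-path"
).include_duplicated_records(
).include_deleted_records(
).to_csv_file()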
Sample Amazon Athena queries
The following sample queries run against a feature group's offline store table in Amazon Athena.
Interactive Exploration
This query selects up to 1,000 records from the offline store.
SELECT *
FROM <FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>
LIMIT 1000
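You can also run these queries from the SageMaker Python SDK instead of the Athena console. The
following is a minimal sketch, assuming a placeholder feature group name and S3 output location.
from sagemaker.feature_store.feature_group import FeatureGroup

fg = FeatureGroup(name="your-feature-group-name", sagemaker_session=feature_store_session)
query = fg.athena_query()

query.run(
    query_string=f'SELECT * FROM "{query.table_name}" LIMIT 1000',
    output_location="s3://DOC-EXAMPLE-BUCKET1/query-results/",
)
query.wait()
df = query.as_dataframe()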
Latest snapshot without duplicates
This query selects the most recent record for each record ID, filtering out duplicates.
SELECT *
FROM
(SELECT *,
row_number()
OVER (PARTITION BY <RecordIdentifierFeatureName>
ORDER BY <EventTimeFeatureName> DESC, Api_Invocation_Time DESC, write_time DESC) AS
row_num
FROM
<FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>)
WHERE row_num = 1;
Latest snapshot without duplicates and deleted records in the offline store
This query filters out any deleted records and selects non-duplicate records from the offline store.
SELECT *
FROM
(SELECT *,
row_number()
OVER (PARTITION BY <RecordIdentifierFeatureName>
ORDER BY <EventTimeFeatureName> DESC, Api_Invocation_Time DESC, write_time DESC) AS
row_num
FROM
<FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>)
WHERE row_num = 1 and
NOT is_deleted;
Time Travel without duplicates and deleted records in the offline store
This query filters out any deleted records and selects non-duplicate records from a particular point in
time.
SELECT *
FROM
(SELECT *,
row_number()
OVER (PARTITION BY <RecordIdentifierFeatureName>
ORDER BY <EventTimeFeatureName> DESC, Api_Invocation_Time DESC, write_time DESC) AS
row_num
FROM
<FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>
WHERE <EventTimeFeatureName> <= timestamp '<timestamp>')
-- replace timestamp '<timestamp>' with just <timestamp> if EventTimeFeature is of type fractional
WHERE row_num = 1 and
NOT is_deleted;
Cross-account offline store access
Topics
• Step 1: Set up the offline store access role in Account A (p. 1258)
• Step 2: Set up an offline store Amazon S3 bucket in Account B (p. 1259)
• Step 3: Set up an offline store AWS KMS encryption key in Account A (p. 1259)
Step 1: Set up the offline store access role in Account A
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetBucketAcl",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::*SageMaker*",
"arn:aws:s3:::*Sagemaker*",
"arn:aws:s3:::*sagemaker*"
]
}
]
}
The preceding code snippet shows the AmazonSageMakerFeatureStoreAccess policy. The Resource
section of the policy is scoped down by default to S3 buckets with names that contain SageMaker,
Sagemaker, or sagemaker. This means that the offline store Amazon S3 bucket must follow this
naming convention. If this is not the case, or if you want to further scope down the resource, you can
copy and paste the policy into your Amazon S3 bucket policy in the console, customize the Resource
section to be arn:aws:s3:::your-offline-store-bucket-name, and then attach it to the role.
Additionally, this role must have AWS KMS permissions attached. At a minimum, it requires the
kms:GenerateDataKey permission to be able to write to the offline store using your customer
managed key. See Step 3 to learn about why a customer managed key is needed for the cross-account
scenario and how to set it up. The following example shows an inline policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"kms:GenerateDataKey"
],
"Resource": "arn:aws:kms:*:Account-A-Account-Id:key/*"
}
]
}
The Resource section of this policy is scoped to any key in Account A. To further scope this down, after
setting up the offline store KMS key in Step 3, return to this policy and replace it with the key ARN.
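As a sketch of this step with Boto3, the following creates the access role in Account A and attaches the
managed AmazonSageMakerFeatureStoreAccess policy. The role name and trust policy are assumptions
for illustration; attach the inline AWS KMS policy from the preceding example in the same way.
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets SageMaker assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="example-offline-feature-store-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="example-offline-feature-store-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFeatureStoreAccess",
)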
Step 2: Set up an offline store Amazon S3 bucket in Account B
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3CrossAccountBucketAccess",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:PutObjectAcl",
"s3:GetBucketAcl"
],
"Principal": {
"AWS": [
"*Account-A-Offline-Feature-Store-Role-ARN*"
],
},
"Resource": [
"arn:aws:s3:::offline-store-bucket-name/*",
"arn:aws:s3:::offline-store-bucket-name"
]
}
]
}
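A minimal sketch of attaching this bucket policy from Account B with Boto3, assuming the policy
document above is stored as a Python dict named bucket_policy and the bucket name is a placeholder:
import json
import boto3

s3 = boto3.client("s3")

# bucket_policy holds the bucket policy document shown above.
s3.put_bucket_policy(
    Bucket="offline-store-bucket-name",
    Policy=json.dumps(bucket_policy),
)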
Step 3: Set up an offline store AWS KMS encryption key in Account A
{
"Version": "2012-10-17",
"Id": "key-consolepolicy-3",
"Statement": [
{
"Sid": "Enable IAM User Permissions",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::Account-A-Account-Id:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow access for Key Administrators",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::Account-A-Account-Id:role/Administrator",
]
},
"Action": [
"kms:Create*",
"kms:Describe*",
"kms:Enable*",
"kms:List*",
"kms:Put*",
"kms:Update*",
"kms:Revoke*",
"kms:Disable*",
"kms:Get*",
"kms:Delete*",
"kms:TagResource",
"kms:UntagResource",
"kms:ScheduleKeyDeletion",
"kms:CancelKeyDeletion"
],
"Resource": "*"
},
{
"Sid": "Allow Feature Store to get information about the customer managed key",
"Effect": "Allow",
"Principal": {
"Service": "sagemaker.amazonaws.com"
},
"Action": [
"kms:Describe*",
"kms:Get*",
"kms:List*"
],
"Resource": "*"
},
{
"Sid": "Allow use of the key",
"Effect": "Allow",
"Principal": {
"AWS": [
"*Account-A-Offline-Feature-Store-Role-ARN*",
"*arn:aws:iam::Account-B-Account-Id:root*"
]
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:DescribeKey",
"kms:CreateGrant",
"kms:RetireGrant",
"kms:ReEncryptFrom",
"kms:ReEncryptTo",
"kms:GenerateDataKey",
"kms:ListAliases",
"kms:ListGrants"
],
"Resource": "*",
}
]
}
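As a sketch, you can create the key in Account A with this policy using Boto3, assuming the policy
document above is stored as a Python dict named key_policy:
import json
import boto3

kms = boto3.client("kms")

# key_policy holds the key policy document shown above.
response = kms.create_key(
    Description="Offline store cross-account key",
    Policy=json.dumps(key_policy),
)
print(response["KeyMetadata"]["Arn"])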
To learn more about CloudTrail, see the AWS CloudTrail User Guide.
Management events
Management events capture operations performed on Feature Store resources in your AWS account. For
example, the log generated from the management events provides visibility into whether a user creates
or deletes a feature group. The following APIs log management events with Amazon SageMaker Feature
Store.
• CreateFeatureGroup
• DeleteFeatureGroup
• DescribeFeatureGroup
• UpdateFeatureGroup
Amazon SageMaker API calls and management events are logged by default when you create the
account, as described in Log Amazon SageMaker API Calls with AWS CloudTrail (p. 3285). For more
information, see Logging management events for trails.
Data events
Data events capture data plane operations performed using the Feature Store resources in your AWS
account. For example, the log generated from the data events provides visibility into whether a user
adds or deletes a record within a feature group. The following APIs log data events with Amazon
SageMaker Feature Store.
• BatchGetRecord
• DeleteRecord
• GetRecord
• PutRecord
Data events are not logged by CloudTrail trails by default. To activate logging of data events, turn on
logging of data plane API activity in CloudTrail. For more information, see CloudTrail's Logging data
events for trails.
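The following is a sketch of turning on Feature Store data event logging for an existing trail with
Boto3. The trail name is a placeholder; confirm the selector fields against the CloudTrail
documentation for your use case.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.put_event_selectors(
    TrailName="example-trail",
    AdvancedEventSelectors=[{
        "Name": "Log Feature Store data events",
        "FieldSelectors": [
            {"Field": "eventCategory", "Equals": ["Data"]},
            {"Field": "resources.type", "Equals": ["AWS::SageMaker::FeatureGroup"]},
        ],
    }],
)
The following shows an example data event log entry for a PutRecord call.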
{
"eventVersion": "1.08",
"userIdentity": {
"type": "IAMUser",
"principalId": "USERPRINCIPALID",
"arn": "arn:aws:iam::123456789012:user/user",
"accountId": "123456789012",
"accessKeyId": "USERACCESSKEYID",
"userName": "your-user-name"
},
"eventTime": "2023-01-01T01:00:00Z",
"eventSource": "sagemaker.amazonaws.com",
"eventName": "PutRecord",
"awsRegion": "us-east-1",
"sourceIPAddress": "192.0.2.0",
"userAgent": "your-user-agent",
"requestParameters": {
"featureGroupName": "your-feature-group-name"
},
"responseElements": null,
"requestID": "request-id",
"eventID": "event-id",
"readOnly": false,
"resources": [
{
"accountId": "123456789012",
"type": "AWS::SageMaker::FeatureGroup",
"ARN": "arn:aws:sagemaker:us-east-1:123456789012:feature-group/your-feature-
group-name"
}
],
"eventType": "AwsApiCall",
"managementEvent": false,
"recipientAccountId": "123456789012",
"eventCategory": "Data",
"tlsDetails": {
...
}
}
Using AWS KMS permissions for Amazon SageMaker Feature Store
When you create a feature group, you can select the storage type and optionally provide an AWS KMS
key for encrypting data. You can then call various data management APIs such as PutRecord,
GetRecord, and DeleteRecord.
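As a brief sketch of those data management calls, the following writes and reads a record using the
featurestore_runtime client created earlier. The feature group and feature names are placeholder
assumptions.
record = [
    {"FeatureName": "customer_id", "ValueAsString": "1"},
    {"FeatureName": "city", "ValueAsString": "Seattle"},
    {"FeatureName": "event_time", "ValueAsString": "1672531200.0"},
]
featurestore_runtime.put_record(
    FeatureGroupName="your-feature-group-name",
    Record=record,
)

response = featurestore_runtime.get_record(
    FeatureGroupName="your-feature-group-name",
    RecordIdentifierValueAsString="1",
)
print(response["Record"])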
Feature Store allows you to grant or deny access to individuals at the feature group level and enables
cross-account access to Feature Store. For example, you can set up developer accounts to access the
offline store for model training and exploration that do not have write access to production accounts.
You can set up production accounts to access both online and offline stores. Feature Store uses unique
customer AWS KMS keys for offline and online store data at-rest encryption. Access control is enabled
through both API and AWS KMS key access. You can also create feature group-level access control.
For more information about customer managed keys, see Customer managed keys. For more
information about AWS KMS, see AWS KMS.
Feature Store supports only symmetric customer managed keys. You cannot use an asymmetric customer
managed key to encrypt your data in your online or offline store. For help determining whether a
customer managed key is symmetric or asymmetric, see Identifying symmetric and asymmetric customer
managed keys.
When you use a customer managed key, you can take advantage of the following features:
• You create and manage the customer managed key, including setting the key policies, IAM policies
and grants to control access to the customer managed key. You can enable and disable the customer
managed key, enable and disable automatic key rotation, and delete the customer managed key when
it is no longer in use.
• You can use a customer managed key with imported key material or a customer managed key in a
custom key store that you own and manage.
• You can audit the encryption and decryption of your online or offline store by examining the API calls
to AWS KMS in AWS CloudTrail logs.
You do not pay a monthly fee for AWS owned keys. Customer managed keys incur a charge for each API
call, and AWS Key Management Service quotas apply to each customer managed key.
Feature Store does not need additional authorization to use the default AWS owned KMS key to protect
your online or offline stores in your AWS account.
Authorizing use of a customer managed key for your online store
To use a customer managed key for your online store, the calling principal, a user or role, must have
the permissions on the customer managed key that Feature Store requires. You can provide these
permissions in a key policy, an IAM policy, or a grant. At a minimum, Feature Store requires the
permissions shown in the following example key policy.
The example key policy provides only the required permissions. The policy has the following effects:
• Allows Feature Store to use the customer managed key in cryptographic operations and create grants,
but only when it is acting on behalf of principals in the account who have permission to use your
Feature Store. If the principals specified in the policy statement don't have permission to use your
Feature Store, the call fails, even when it comes from the Feature Store service.
• The kms:ViaService condition key allows the permissions only when the request comes from
Feature Store on behalf of the principals listed in the policy statement. These principals can't call
these operations directly. The value for kms:ViaService should be sagemaker.*.amazonaws.com.
Note
The kms:ViaService condition key can only be used for the online store customer managed
AWS KMS key, and cannot be used for the offline store. If you add this special condition to
your customer managed key, and use the same AWS KMS key for both the online and offline
store, then the CreateFeatureGroup API operation fails.
• Gives the customer managed key administrators read-only access to the customer managed key and
permission to revoke grants, including the grants that Feature Store uses to protect your data.
Before using an example key policy, replace the example principals with actual principals from your AWS
account.
{"Id": "key-policy-feature-store",
"Version":"2012-10-17",
"Statement": [
{"Sid" : "Allow access through Amazon SageMaker Feature Store for all principals in
the account that are authorized to use Amazon SageMaker Feature Store",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::111122223333:user/featurestore-user"},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:DescribeKey",
"kms:CreateGrant",
"kms:RetireGrant",
"kms:ReEncryptFrom",
"kms:ReEncryptTo",
"kms:GenerateDataKey",
"kms:ListAliases",
"kms:ListGrants"
],
"Resource": "*",
"Condition": {"StringLike": {"kms:ViaService" : "sagemaker.*.amazonaws.com"
}
}
},
{"Sid": "Allow administrators to view the customer managed key and revoke grants",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::111122223333:role/featurestore-admin"
},
"Action": [
"kms:Describe*",
"kms:Get*",
"kms:List*",
"kms:RevokeGrant"
],
"Resource": "*"
},
{"Sid": "Enable IAM User Permissions",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::123456789:root"
},
"Action": "kms:*",
"Resource": "*"
}
]
}
Using grants to authorize Feature Store
Feature Store uses the grant permissions when it performs background system maintenance and
continuous data protection tasks.
Each grant is specific to an online store. If the account includes multiple stores encrypted under the
same customer managed key, there will be unique grants per FeatureGroup using the same customer
managed key.
The key policy can also allow the account to revoke the grant on the customer managed key. However,
if you revoke the grant on an active encrypted online store, Feature Store won't be able to protect and
maintain the store.
"kms:Decrypt"
"kms:GenerateDataKey"
Note
The key policy for the online store also works for the offline store, but only when the
kms:ViaService condition is not specified.
Important
You can specify an AWS KMS encryption key to encrypt the Amazon S3 location used for your
offline feature store when you create a feature group. If no AWS KMS encryption key is
specified, all data at rest is encrypted by default with an AWS KMS key. By defining your bucket-
level key for SSE, you can reduce AWS KMS request costs by up to 99 percent.
Quotas, naming rules and data types
Quotas
• Maximum number of feature groups per AWS account: Soft limit of 100.
• Maximum number of feature definitions per feature group: 2500.
• Maximum number of read request units (RRU) per record identifier: 2400 RRU per second.
• Maximum number of write request units (WRU) per record identifier: 500 WRU per second.
• Maximum transactions per second (TPS) per API per AWS account: Soft limit of 10,000 TPS per API,
excluding the BatchGetRecord API call, which has a soft limit of 500 TPS.
• Maximum size of a record: 350 KB.
• Maximum size of a record identifier: 2 KB.
• Maximum size of a feature value: 350 KB.
• Maximum number of concurrent feature group creation workflows: 4.
• BatchGetRecord API: Can contain as many as 100 records and can query up to 10 feature groups.
For information about service quotas, see AWS service quotas. For information about requesting an
increase to a quota, see Requesting a quota increase.
Naming rules
• Reserved Words: The following are reserved words and cannot be used as feature names in feature
definitions: is_deleted, write_time, and api_invocation_time.
Data types
• String Feature Type: Strings are Unicode with UTF-8 binary encoding. The minimum length of a string
can be zero; the maximum length is constrained by the maximum size of a record.
1266
Amazon SageMaker Developer Guide
Amazon SageMaker Feature Store offline store data format
• Fractional Feature Type: Fractional feature values must conform to a double precision floating point
number as defined by the IEEE 754 standard.
• Integral Feature Type: Feature Store supports integral values in the range of a 64-bit signed integer.
The minimum value is -2^63 and the maximum value is 2^63 - 1.
• Event Time Features: All feature groups have an event time feature with nanosecond precision. Any
event time with lower than nanosecond precision will lead to backwards incompatibility. The feature
can have a feature type of either String or Fractional.
• A string event time is accepted in ISO-8601 format, in UTC time, conforming to the pattern(s):
[yyyy-MM-dd'T'HH:mm:ssZ, yyyy-MM-dd'T'HH:mm:ss.SSSSSSSSSZ].
• A fractional event time value is accepted as seconds from the Unix epoch. Event times must be in the
range of [0000-01-01T00:00:00.000000000Z, 9999-12-31T23:59:59.999999999Z]. For feature
groups in the Iceberg table format, you can only use String type for the event time. For a sketch of
producing event times in both formats, see the example after this list.
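The following is a small sketch of producing event time values in both accepted formats with Python's
standard library.
import time
from datetime import datetime, timezone

# String event time in ISO-8601 UTC format.
string_event_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Fractional event time as seconds from the Unix epoch.
fractional_event_time = round(time.time(), 3)

print(string_event_time)      # for example: 2023-01-01T01:00:00Z
print(fractional_event_time)  # for example: 1672534800.0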
Amazon SageMaker Feature Store offline store data format
Amazon SageMaker Feature Store offline store data is stored in an Amazon S3 bucket within your
account. When you call PutRecord, your data is buffered, batched, and written into Amazon S3 within
15 minutes. Feature Store only supports the Parquet file format. Specifically, when your data is written
to your offline store, the data can only be retrieved from your Amazon S3 bucket in Parquet format. Each
file can contain multiple Records.
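Because the files are standard Parquet, you can also read them directly, for example with pandas. This
sketch assumes the pyarrow and s3fs packages are installed and uses a placeholder file path.
import pandas as pd

# Placeholder path; use a Parquet file path from your offline store.
df = pd.read_parquet(
    "s3://DOC-EXAMPLE-BUCKET/example-prefix/example-offline-store-file.parquet"
)
print(df.head())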
For the Iceberg format, Feature Store saves the table’s metadata in the same Amazon S3 bucket that
you’re using to store the offline store data. You can find it under the metadata prefix.
The following additional fields are added by Feature Store to each Record when they persist in the offline
store:
• api_invocation_time – The timestamp when the service receives the PutRecord or DeleteRecord
call. If using managed ingestion (e.g. Data Wrangler), this is the timestamp when data was written into
the offline store.
• write_time – The timestamp when data was written into the offline store. Can be used for constructing
time-travel related queries.
• is_deleted – False by default. If DeleteRecord is called, a new Record with the same
RecordIdentifierValue is inserted into the offline store with is_deleted set to True.
The following information shows the organization of a Parquet file using the AWS Glue format:
s3://DOC-EXAMPLE-BUCKET/example-prefix-name/111122223333/sagemaker/AWS Region/
offline-store/example-feature-group-account-id/data/year=year/month=month/day=day/
hour=hour/timestamp_of_latest_event_time_in_file_16-random-alphanumeric-digits.parquet
Records in the offline store are partitioned by event time into hourly partitions. You can’t configure the
partitioning scheme. The following shows an example of the output location of a Parquet file:
s3://DOC-EXAMPLE-BUCKET/example-prefix/111122223333/sagemaker/AWS Region/offline-
store/customer-purchase-history-patterns-1593511200/data/year=2020/month=06/day=31/
hour=00/20200631T064401Z_108934320012Az11.parquet
The following shows the organization of the data files saved in the Iceberg table format.
s3://DOC-EXAMPLE-BUCKET/example-prefix/account-id/sagemaker/AWS Region/offline-
store/feature-group-name-feature-group-creation-time/data/8-random-alphanumeric-
digits/event-time-feature-name_trunc=event-time-year-event-time-month-event-time-day/
timestamp-of-latest-event-time-in-file_16-random-alphanumeric-digits.parquet
Records in the offline store are partitioned by event time into daily partitions. You can’t configure the
partitioning scheme. The following shows an example of the output location of a Parquet file where the
event time feature name is EventTime:
s3://DOC-EXAMPLE-BUCKET/example-prefix/sagemaker/AWS Region/offline-
store/customer-purchase-history-patterns-1593511200/data/0aec19ca/
EventTime_trunc=2022-11-09/20221109T215231Z_yolTtpyuWbkaeGIl.parquet
The following shows the example location of a metadata file for data files saved in the Iceberg table
format.
s3://DOC-EXAMPLE-BUCKET/example-prefix/account-id/sagemaker/AWS Region/offline-
store/feature-group-name-feature-group-creation-time/metadata/
See IAM Roles to access your role and attach this policy. For a walkthrough on how to view the policies
attached to a role and how to add a policy to your role, see Adding policies to your IAM role.
Amazon SageMaker Feature Store notebook examples
• Fraud Detection with Feature Store – An advanced example on how to train a fraud detection model
by ingesting data into a Feature Store, querying it to form a training dataset, and how to train a
simple model for inference.
• Encrypt Data in your online or offline store using AWS KMS key – An advanced example on how to
encrypt and decrypt data in an online or offline store using an AWS KMS key and how to verify that
your data is encrypted. Note that this notebook tackles encryption at rest.
• Client-side Encryption with Feature Store using AWS Encryption SDK – An advanced example on how
to do client-side encryption with Feature Store using the AWS Encryption SDK library, which encrypts
your data prior to ingesting it into your online or offline store.
• How to securely store an image dataset in Feature Store with AWS KMS key? – An advanced example
that demonstrates how to securely store a dataset of images into your Feature Store using an AWS
KMS key for server-side encryption.
• Create a machine learning workflow from an Amazon SageMaker Ground Truth classification labeling
job to Feature Store – A machine learning (ML) workflow that demonstrates how to feed the output
of an image or text classification labeling job from Amazon SageMaker Ground Truth to Feature
Store.
For a comprehensive set of notebooks with examples for common workflows, see SageMaker Feature
Store Workshop.
Train Models
The training stage of the full machine learning (ML) lifecycle spans from accessing your training dataset
to generating a final model and selecting the best performing model for deployment. The following
sections provide an overview of available SageMaker training features and resources with in-depth
technical information for each.
If you have intermediate coding experience, consider using a SageMaker Studio notebook or SageMaker
Notebook Instances. To get started, follow the instructions at the section called “Step 4: Train a
Model” (p. 94) of the SageMaker Getting Started guide. We recommend this for use cases in which you
create your own model and training script using an ML framework.
The following architecture diagram shows how SageMaker manages ML training jobs and provisions
Amazon EC2 instances on behalf of SageMaker users. As a SageMaker user, you can bring your own
training dataset, saving it to Amazon S3. You can choose to train a model using one of the available
SageMaker built-in algorithms, or bring your own training script with a model built with a popular
machine learning framework.
SageMaker provides features before, during, and after training to make sure your model is trained well
enough to meet the target accuracy for your objectives.
The following flow chart shows a high-level overview of your actions (in blue boxes) and available
SageMaker Training features (in light blue boxes) throughout the training phase of the ML lifecycle.
The following sections walk you through each phase of training depicted in the previous flow chart and
useful features offered by SageMaker throughout the three sub-stages of the ML training.
Topics
• Before training (p. 1271)
• During training (p. 1273)
• After training (p. 1275)
Before training
There are a number of scenarios for setting up data resources and access that you need to consider
before training. Refer to the following diagram and the details of each before-training stage to get a
sense of what decisions you need to make.
• Prepare data: Before training, you must have finished data cleaning and feature engineering during
the data preparation stage. SageMaker has several labeling and feature engineering tools to help you.
See Label Data, Prepare and Analyze Datasets, Process Data, and Create, Store, and Share Features for
more information.
• Choose an algorithm or framework: Depending on how much customization you need, there are
different options for algorithms and frameworks.
• If you prefer a low-code implementation of a pre-built algorithm, use one of the built-in algorithms
offered by SageMaker. For more information, see Choose an Algorithm.
• If you need more flexibility to customize your model, run your training script using your preferred
frameworks and toolkits within SageMaker. For more information, see ML Frameworks and Toolkits.
• To extend pre-built SageMaker Docker images as the base image of your own container, see Use Pre-
built SageMaker Docker images.
• To bring your custom Docker container to SageMaker, see Adapting your own Docker container to
work with SageMaker. You need to install the sagemaker-training-toolkit in your container.
• Manage data storage: Understand the mapping between the data storage (such as Amazon S3,
Amazon EFS, or Amazon FSx) and the training container that runs in the Amazon EC2 compute
instance.
SageMaker helps map the storage paths and local paths in the training container. You can also
manually specify them. After mapping is done, consider using one of the data transmission modes:
File, Pipe, and FastFile mode. To learn how SageMaker maps storage paths, see Training Storage
Folders.
• Set up access to training data: Use Amazon SageMaker Domain, a Domain user profile, IAM, Amazon
VPC, and AWS KMS to meet the requirements of the most security-sensitive organizations.
• For account administration, see Amazon SageMaker Domain.
• For a complete reference about IAM policies and security, see Security in Amazon SageMaker.
• Stream your input data: SageMaker provides three data input modes: File, Pipe, and FastFile. The
default input mode is File mode, which loads the entire dataset while initializing the training job. To
learn about general best practices for streaming data from your data storage to the training container,
see Access Training Data. For a brief sketch of setting the input mode, see the example after this list.
If you use Pipe mode, you can also consider using an augmented manifest file to stream your data
directly from Amazon Simple Storage Service (Amazon S3) and train your model. Using Pipe mode
reduces disk space because Amazon Elastic Block Store only needs to store your final model artifacts,
rather than storing your full training dataset. For more information, see Provide Dataset Metadata to
Training Jobs with an Augmented Manifest File.
• Analyze your data for bias: Before training, you can use SageMaker Clarify to analyze your dataset
for bias against a disfavored group so that you can check that your model doesn't learn a biased
dataset.
• Choose which SageMaker SDK to use: There are two ways to launch a training job in SageMaker:
using the high-level SageMaker Python SDK, or using the low-level SageMaker APIs through the SDK
for Python (Boto3) or the AWS CLI. The SageMaker Python SDK abstracts the low-level SageMaker API
to provide convenient tools. As mentioned in the section called “The simplest training workflow in
SageMaker” (p. 1270), you can also pursue no-code or minimal-code options using SageMaker Canvas,
SageMaker JumpStart within SageMaker Studio, or SageMaker Autopilot.
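The following is a minimal sketch of launching a training job with the SageMaker Python SDK,
including setting the input mode on the training channel. The image URI, role ARN, and S3 paths are
placeholder assumptions.
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Stream training data from Amazon S3 instead of downloading it first.
train_input = TrainingInput(
    s3_data="s3://DOC-EXAMPLE-BUCKET1/train/",
    input_mode="Pipe",  # or "File" (the default) or "FastFile"
)

estimator = Estimator(
    image_uri="your-training-image-uri",  # placeholder training image
    role="arn:aws:iam::111122223333:role/example-sagemaker-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://DOC-EXAMPLE-BUCKET1/output/",
)
estimator.fit({"train": train_input})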
During training
During training, you need to continuously improve training stability, speed, and efficiency while
scaling compute resources, optimizing cost, and, most importantly, improving model performance.
Read on for more information about the during-training stages and relevant SageMaker Training
features.
• Set up infrastructure: Choose the right instance type and infrastructure management tools for your
use case. You can start from a small instance and scale up depending on your workload. For training
a model on a tabular dataset, start with the smallest CPU instance of the C4 or C5 instance families.
For training a large model for computer vision or natural language processing, start with the smallest
GPU instance of the P2, P3, G4dn or G5 instance families. You can also mix different instance types in
a cluster, or keep instances in warm pools using the following instance management tools offered by
SageMaker. You can also use persistent cache to reduce latency and billable time on iterative training
jobs over the latency reduction from warm pools alone. To learn more, see the following topics.
• Train Using a Heterogeneous Cluster (p. 2105)
• Train Using SageMaker Managed Warm Pools (p. 2119)
• Using persistent cache (p. 2121)
To check the currently available quotas in your account, use the Service Quotas console. To learn how
to request a quota increase, see Supported Regions and Quotas. Also, to find pricing information and
available instance types depending on the AWS Region, look up the tables in the Amazon SageMaker
Pricing page.
• Run a training job from local code: You can annotate your local code with a remote decorator to run
your code as a SageMaker training job from inside Amazon SageMaker Studio, an Amazon SageMaker
notebook, or your local integrated development environment (see the first sketch after this list). For
more information, see Run your local code as a SageMaker training job (p. 1565).
• Track training jobs: Monitor and track your training jobs using SageMaker Experiments, SageMaker
Debugger, or Amazon CloudWatch. You can watch the model performance in terms of accuracy
and convergence, and run comparative analysis of metrics between multiple training jobs by using
SageMaker Experiments. You can watch the compute resource utilization rate by using SageMaker
Debugger’s profiling tools or Amazon CloudWatch. To learn more, see the following topics.
• Manage Machine Learning with Amazon SageMaker Experiments
• Profile Training Jobs Using Amazon SageMaker Debugger
• Monitor and Analyze Using CloudWatch Metrics
Additionally, for deep learning tasks, use the Amazon SageMaker Debugger model debugging tools
and built-in rules to identify more complex issues in model convergence and weight update processes.
• Distributed training: If your training job is going into a stable stage without breaking due to
misconfiguration of the training infrastructure or out-of-memory issues, you might want to find
more options to scale your job and run over an extended period of time for days and even months.
When you’re ready to scale up, consider distributed training. SageMaker provides various options for
distributed computation from light ML workloads to heavy deep learning workloads.
For deep learning tasks that involve training very large models on very large datasets, consider using
one of the SageMaker distributed training strategies to scale up and achieve data parallelism, model
parallelism, or a combination of the two. You can also use SageMaker Training Compiler for compiling
and optimizing model graphs on GPU instances. These SageMaker features support deep learning
frameworks such as PyTorch, TensorFlow, and Hugging Face Transformers.
• Model hyperparameter tuning: Tune your model hyperparameters using Automatic Model Tuning
with SageMaker. SageMaker provides hyperparameter tuning methods such as grid search and
Bayesian search, launching parallel hyperparameter tuning jobs with early-stopping functionality for
non-improving hyperparameter tuning jobs.
• Checkpointing and cost saving with Spot instances: If training time is not a big concern, you might
consider optimizing model training costs with managed Spot instances. Note that you must activate
checkpointing for Spot training to keep restoring from intermittent job pauses due to Spot instance
replacements. You can also use the checkpointing functionality to back up your models in case of
unexpected training job termination. A sketch of enabling both follows this list. To learn more, see
the following topics.
• Managed Spot Training
• Use Checkpoints
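As a sketch of the remote decorator mentioned in this list, the following runs a local function as a
SageMaker training job; the instance type is an assumption.
from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge")
def divide(x, y):
    return x / y

# Runs as a SageMaker training job and returns the result locally.
print(divide(10, 2))
The following sketches enabling managed Spot instances with checkpointing on an estimator; the
image URI, role ARN, and S3 paths are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-training-image-uri",  # placeholder training image
    role="arn:aws:iam::111122223333:role/example-sagemaker-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://DOC-EXAMPLE-BUCKET1/output/",
    use_spot_instances=True,
    max_run=3600,   # maximum training time, in seconds
    max_wait=7200,  # must be greater than or equal to max_run for Spot training
    checkpoint_s3_uri="s3://DOC-EXAMPLE-BUCKET1/checkpoints/",
)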
After training
After training, you obtain a final model artifact to use for model deployment and inference. There are
additional actions involved in the after-training phase as shown in the following diagram.
• Obtain baseline model: After you have the model artifact, you can set it as a baseline model. Consider
the following post-training actions and using SageMaker features before moving on to model
deployment to production.
• Examine model performance and check for bias: Use Amazon CloudWatch Metrics and SageMaker
Clarify for post-training bias to detect any bias in incoming data and model over time against the
baseline. You need to evaluate your new data and model predictions against the new data regularly or
in real time. Using these features, you can receive alerts about any acute changes or anomalies, as well
as gradual changes or drifts in data and model.
• You can also use the Incremental Training functionality of SageMaker to load and update (or fine-
tune) your model with an expanded dataset.
• You can register model training as a step in your SageMaker Pipeline or as part of other Workflow
features offered by SageMaker in order to orchestrate the full ML lifecycle.
Choose an Algorithm
Machine learning can help you accomplish empirical tasks that require some sort of inductive inference.
This task involves induction as it uses data to train algorithms to make generalizable inferences. This
means that the algorithms can make statistically reliable predictions or decisions, or complete other
tasks when applied to new data that was not used to train them.
To help you select the best algorithm for your task, we classify these tasks on various levels of
abstraction. At the highest level of abstraction, machine learning attempts to find patterns or
relationships between features or less structured items, such as text in a data set. Pattern recognition
techniques can be classified into distinct machine learning paradigms, each of which addresses specific
problem types. There are currently three basic paradigms for machine learning used to address various
problem types:
The types of problems that each learning paradigm can address are identified by considering the
inferences (or predictions, decisions, or other tasks) you want to make from the type of data that you
have or could collect. Machine learning paradigms use algorithmic methods to address their various
problem types. The algorithms provide recipes for solving these problems.
However, many algorithms, such as neural networks, can be deployed with different learning paradigms
and on different types of problems. Multiple algorithms can also address a specific problem type. Some
algorithms are more generally applicable and others are quite specific for certain kinds of objectives and
data. So the mapping between machine learning algorithms and problem types is many-to-many. Also,
there are various implementation options available for algorithms.
The following sections provide guidance concerning implementation options, machine learning
paradigms, and algorithms appropriate for different problem types.
Topics
• Choose an algorithm implementation (p. 1277)
• Problem types for the basic machine learning paradigms (p. 1279)
• Use Amazon SageMaker Built-in Algorithms or Pre-trained Models (p. 1281)
• Use Reinforcement Learning with Amazon SageMaker (p. 1559)
• Pre-trained models require the least effort and are models ready to deploy or to fine-tune and deploy
using SageMaker JumpStart.
• Built-in algorithms require more effort and scale if the data set is large and significant resources are
needed to train and deploy the model.
• If there is no built-in solution that works, try to develop one that uses pre-made images for machine
and deep learning frameworks for supported frameworks such as Scikit-Learn, TensorFlow, PyTorch,
MXNet, or Chainer.
• If you need to run custom packages or use any code which isn’t a part of a supported framework or
available via PyPi, then you need to build your own custom Docker image that is configured to install
the necessary packages or software. The custom image must also be pushed to an online repository
like the Amazon Elastic Container Registry.
Topics
• Use a built-in algorithm (p. 1278)
• Use script mode in a supported framework (p. 1278)
• Use a custom Docker image (p. 1279)
[Table: comparison of implementation options by whether code is required, pre-coded algorithms,
support for third-party packages, support for custom code, and level of effort.]
• The built-in algorithms require no coding to start running experiments. The only inputs you need to
provide are the data, hyperparameters, and compute resources. This allows you to run experiments
more quickly, with less overhead for tracking results and code changes.
• The built-in algorithms come with parallelization across multiple compute instances and GPU support
right out of the box for all applicable algorithms (some algorithms may not be included due to
inherent limitations). If you have a lot of data with which to train your model, most built-in algorithms
can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be
easier to use its corollary in SageMaker and input the hyperparameters you already know than to port
it over, using script mode on a supported framework.
For more information on the built-in algorithms provided by SageMaker, see Use Amazon SageMaker
Built-in Algorithms or Pre-trained Models (p. 1281).
For important information about docker registry paths, data formats, recommended EC2 instance types,
and CloudWatch logs common to all of the built-in algorithms provided by SageMaker, see Common
Information About Built-in Algorithms (p. 1287).
A supported framework image allows you to include a
requirements.txt file with your training code or to include your own code directories. R is also supported
natively in SageMaker notebook kernels. Some frameworks, like scikit-learn and Spark ML, have pre-
coded algorithms you can use easily, while other frameworks like TensorFlow and PyTorch may require
you to implement the algorithm yourself. The only limitation when using a supported framework image
is that you cannot import any software packages that are not hosted on PyPI or that are not already
included with the framework's image.
For more information on the frameworks supported by SageMaker, see Use Machine Learning
Frameworks, Python, and R with Amazon SageMaker (p. 15).
For more information on custom Docker images in SageMaker, see Using Docker containers with
SageMaker (p. 2668).
Topics
• Supervised learning (p. 1279)
• Unsupervised learning (p. 1280)
• Reinforcement learning (p. 1280)
Supervised learning
If your data set consists of features or attributes (inputs) that contain target values (outputs), then you
have a supervised learning problem. If your target values are categorical (mathematically discrete),
then you have a classification problem. It is a standard practice to distinguish binary from multiclass
classification.
• Binary classification is a type of supervised learning that assigns an individual to one of two
predefined and mutually exclusive classes based on the individual's attributes. It is supervised because
the models are trained using examples in which the attributes are provided with correctly labeled
objects. A medical diagnosis for whether an individual has a disease or not based on the results of
diagnostic tests is an example of binary classification.
• Multiclass classification is a type of supervised learning that assigns an individual to one of several
classes based on the individual's attributes. It is supervised because the models are trained using
examples in which the attributes are provided with correctly labeled objects. An example is the
prediction of the topic most relevant to a text document. A document may be classified as being about
religion, politics, or finance, or as about one of several other predefined topic classes.
If the target values you are trying to predict are mathematically continuous, then you have a regression
problem. Regression estimates the values of a dependent target variable based on one or more other
variables or attributes that are correlated with it. An example is the prediction of house prices using
features like the number of bathrooms and bedrooms and the square footage of the house and garden.
Regression analysis can create a model that takes one or more of these features as an input and predicts
the price of a house.
For more information on the built-in supervised learning algorithms provided by SageMaker, see
Supervised Learning (p. 1285).
Unsupervised learning
If your data set consists of features or attributes (inputs) that do not contain labels or target values
(outputs), then you have an unsupervised learning problem. In this type of problem, the output must be
predicted based on the pattern discovered in the input data. The goal in unsupervised learning problems
is to discover patterns such as groupings within the data. There are a large variety of tasks or problem
types to which unsupervised learning can be applied. Principal component and cluster analyses are two
of the main methods commonly deployed for preprocessing data. Here is a short list of problem types
that can be addressed by unsupervised learning:
• Dimension reduction is typically part of a data exploration step used to determine the most relevant
features to use for model construction. The idea is to transform data from a high-dimensional,
sparsely populated space into a low-dimensional space that retains the most significant properties of
the original data. This provides relief for the curse of dimensionality that can arise with sparsely
populated, high-dimensional data on which statistical analysis becomes problematic. It can also be
used to help understand data, reducing high-dimensional data to a lower dimensionality that can be
visualized.
• Cluster analysis is a class of techniques that are used to classify objects or cases into groups called
clusters. It attempts to find discrete groupings within data, where members of a group are as similar
as possible to one another and as different as possible from members of other groups. You define the
features or attributes that you want the algorithm to use to determine similarity, select a distance
function to measure similarity, and specify the number of clusters to use in the analysis.
• Anomaly detection is the identification of rare items, events, or observations in a data set which raise
suspicions because they differ significantly from the rest of the data. The identification of anomalous
items can be used, for example, to detect bank fraud or medical errors. Anomalies are also referred to
as outliers, novelties, noise, deviations, and exceptions.
• Density estimation is the construction of estimates of unobservable underlying probability density
functions based on observed data. A natural use of density estimates is for data exploration. Density
estimates can discover features such as skewness and multimodality in the data. The most basic form
of density estimation is a rescaled histogram.
SageMaker provides several built-in machine learning algorithms that you can use for these
unsupervised learning tasks. For more information on the built-in unsupervised algorithms provided by
SageMaker, see Unsupervised Learning (p. 1285).
Reinforcement learning
Reinforcement learning is a type of learning that is based on interaction with the environment. This
type of learning is used by an agent that must learn behavior through trial-and-error interactions with
a dynamic environment in which the goal is to maximize the long-term rewards that the agent receives
as a result of its actions. Rewards are maximized by trading off exploring actions that have uncertain
rewards with exploiting actions that have known rewards.
For more information on SageMaker's frameworks, toolkits, and environments for reinforcement
learning, see Use Reinforcement Learning with Amazon SageMaker (p. 1559).
[Table: built-in algorithms mapped to problem types, including text generation, text summarization,
and semantic segmentation.]
For important information about Docker registry paths, data formats, recommended Amazon EC2
instance types, and CloudWatch logs common to all of the built-in algorithms provided by SageMaker,
see Common Information About Built-in Algorithms (p. 1287).
The following sections provide additional guidance for the Amazon SageMaker built-in algorithms
grouped by the supervised and unsupervised learning paradigms to which they belong. For descriptions
of these learning paradigms and their associated problem types, see Choose an Algorithm (p. 1276).
Sections are also provided for the SageMaker built-in algorithms available to address two important
machine learning domains: textual analysis and image processing.
Supervised Learning
Amazon SageMaker provides several built-in general purpose algorithms that can be used for either
classification or regression problems.
Amazon SageMaker also provides several built-in supervised learning algorithms that are used for more
specialized tasks during feature engineering and forecasting from time series data.
• Object2Vec Algorithm (p. 1421)—a new highly customizable multi-purpose algorithm used for
feature engineering. It can learn low-dimensional dense embeddings of high-dimensional objects to
produce features that improve training efficiencies for downstream models. While this is a supervised
algorithm, as it requires labeled data for training, there are many scenarios in which the relationship
labels can be obtained purely from natural clusterings in data, without any explicit human annotation.
• DeepAR Forecasting Algorithm (p. 1460)—a supervised learning algorithm for forecasting scalar (one-
dimensional) time series using recurrent neural networks (RNN).
Unsupervised Learning
Amazon SageMaker provides several built-in algorithms that can be used for a variety of unsupervised
learning tasks such as clustering, dimension reduction, pattern recognition, and anomaly detection.
• Principal Component Analysis (PCA) Algorithm (p. 1493)—reduces the dimensionality (number of
features) within a dataset by projecting data points onto the first few principal components. The
objective is to retain as much information or variation as possible. For mathematicians, principal
components are eigenvectors of the data's covariance matrix.
• K-Means Algorithm (p. 1485)—finds discrete groupings within data, where members of a group are as
similar as possible to one another and as different as possible from members of other groups.
• IP Insights (p. 1476)—learns the usage patterns for IPv4 addresses. It is designed to capture
associations between IPv4 addresses and various entities, such as user IDs or account numbers.
• Random Cut Forest (RCF) Algorithm (p. 1497)—detects anomalous data points within a data set that
diverge from otherwise well-structured or patterned data.
Textual Analysis
SageMaker provides algorithms that are tailored to the analysis of textual documents used in natural
language processing, document classification or summarization, topic modeling or classification, and
language transcription or translation.
• BlazingText algorithm (p. 1399)—a highly optimized implementation of the Word2vec and text
classification algorithms that scale to large datasets easily. It is useful for many downstream natural
language processing (NLP) tasks.
• Sequence-to-Sequence Algorithm (p. 1437)—a supervised algorithm commonly used for neural
machine translation.
• Latent Dirichlet Allocation (LDA) Algorithm (p. 1409)—an algorithm suitable for determining topics in
a set of documents. It is an unsupervised algorithm, which means that it doesn't use example data with
answers during training.
• Neural Topic Model (NTM) Algorithm (p. 1415)—another unsupervised technique for determining
topics in a set of documents, using a neural network approach.
• Text Classification - TensorFlow (p. 1450)—a supervised algorithm that supports transfer learning with
available pretrained models for text classification.
Image Processing
SageMaker also provides image processing algorithms that are used for image classification, object
detection, and computer vision.
• Image Classification - MXNet (p. 1506)—uses example data with answers (referred to as a supervised
algorithm). Use this algorithm to classify images.
• Image Classification - TensorFlow (p. 1517)—uses pretrained TensorFlow Hub models to fine-tune for
specific tasks (referred to as a supervised algorithm). Use this algorithm to classify images.
• Semantic Segmentation Algorithm (p. 1549)—provides a fine-grained, pixel-level approach to
developing computer vision applications.
• Object Detection - MXNet (p. 1530)—detects and classifies objects in images using a single deep
neural network. It is a supervised learning algorithm that takes images as input and identifies all
instances of objects within the image scene.
• Object Detection - TensorFlow (p. 1541)—detects bounding boxes and object labels in an image. It is
a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow
models.
Topics
• Common Information About Built-in Algorithms (p. 1287)
Image Classification - TensorFlow
  Channel name: training and validation
  Training input mode: File
  File type: image files (.jpg, .jpeg, or .png)
  Instance class: CPU or GPU
  Parallelizable: Yes (only across multiple GPUs on a single instance)

Neural Topic Model
  Channel name: train and (optionally) validation, test, or both
  Training input mode: File or Pipe
  File type: recordIO-protobuf or CSV
  Instance class: CPU or GPU
  Parallelizable: Yes

XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2)
  Channel name: train and (optionally) validation
  Training input mode: File or Pipe
  File type: CSV, LibSVM, or Parquet
  Instance class: CPU (or GPU for 1.2-1)
  Parallelizable: Yes
Algorithms that are parallelizable can be deployed on multiple compute instances for distributed
training.
The following topics provide information about data formats, recommended Amazon EC2 instance types,
and CloudWatch logs common to all of the built-in algorithms provided by Amazon SageMaker.
Note
To look up the Docker image URIs of the built-in algorithms managed by SageMaker, see Docker
Registry Paths and Example Code.
Topics
• Common Data Formats for Built-in Algorithms (p. 1289)
• Instance Types for Built-in Algorithms (p. 1298)
• Logs for Built-in Algorithms (p. 1299)
Topics
• Common Data Formats for Training (p. 1290)
• Common Data Formats for Inference (p. 1293)
To prepare for training, you can preprocess your data using a variety of AWS services, including AWS
Glue, Amazon EMR, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After
preprocessing, publish the data to an Amazon S3 bucket. For training, the data needs to go through a
series of conversions and transformations.
When using Amazon SageMaker in the training portion of the algorithm, make sure to upload all data
at once. If more data is added to that location, a new training call would need to be made to construct a
brand new model.
Topics
• Content Types Supported by Built-In Algorithms (p. 1290)
• Using Pipe Mode (p. 1291)
• Using CSV Format (p. 1291)
• Using RecordIO Format (p. 1291)
• Trained Model Deserialization (p. 1293)
The following table lists some of the commonly supported ContentType values and the algorithms that
use them:
ContentType Algorithm
text/libsvm XGBoost
For a summary of the parameters used by each algorithm, see the documentation for the individual
algorithms or this table.
In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3).
Streaming can provide faster start times for training jobs and better throughput. This is in contrast to
File mode, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses
disk space to store both your final model artifacts and your full training dataset. By streaming in your
data directly from Amazon S3 in Pipe mode, you reduce the size of Amazon Elastic Block Store volumes
of your training instances. Pipe mode needs only enough disk space to store your final model artifacts.
See the AlgorithmSpecification for additional details on the training input mode.
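As a minimal sketch of how Pipe mode is requested through the SageMaker Python SDK (the image URI, role, and instance settings shown are placeholder assumptions, not fixed values):

import sagemaker
from sagemaker.estimator import Estimator

# input_mode="Pipe" streams each channel from S3 instead of downloading it
# to the training instance's EBS volume first.
estimator = Estimator(
    image_uri=train_image_uri,     # placeholder: a built-in algorithm image URI
    role=aws_role,                 # placeholder: your IAM execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",
)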
Many Amazon SageMaker algorithms support training with data in CSV format. To use data in CSV
format for training, in the input data channel specification, specify text/csv as the ContentType.
Amazon SageMaker requires that a CSV file does not have a header record and that the target variable
is in the first column. To run unsupervised learning algorithms that don't have a target, specify the
number of label columns in the content type. For example, in this case 'content_type=text/
csv;label_size=0'. For a notebook example that uses CSV format, see Breast Cancer Prediction. For
more information, see Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker
built-in algorithms.
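For illustration, one way to attach that content type to a channel is with the SDK's TrainingInput class (a sketch; the S3 prefix is a placeholder):

from sagemaker.inputs import TrainingInput

# Unsupervised algorithm: declare zero label columns in the content type.
train_input = TrainingInput(
    "s3://my-bucket/train/",               # placeholder S3 prefix
    content_type="text/csv;label_size=0",
)
# The channel is then passed to fit, for example: estimator.fit({"train": train_input})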
In the protobuf recordIO format, SageMaker converts each observation in the dataset into a binary
representation as a set of 4-byte floats, then loads it in the protobuf values field. If you are using Python
for your data preparation, we strongly recommend that you use these existing transformations. However,
if you are using another language, the protobuf definition file below provides the schema that you use to
convert your data into SageMaker protobuf format.
Note
For an example that shows how to convert the commonly used NumPy array into the protobuf
recordIO format, see An Introduction to Factorization Machines with MNIST.
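If you are using Python for your data preparation, a minimal sketch of those existing transformations uses the SageMaker Python SDK helper sagemaker.amazon.common.write_numpy_to_dense_tensor (the toy arrays below are assumptions for illustration):

import io
import numpy as np
import sagemaker.amazon.common as smac

# Toy data: 4 observations with 3 features each, plus float32 labels.
X = np.array([[1.2, 1.3, 9.6], [2.0, 0.1, 5.5],
              [0.4, 4.2, 7.1], [3.3, 1.1, 0.2]], dtype="float32")
y = np.array([0, 1, 0, 1], dtype="float32")

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)   # writes protobuf recordIO records
buf.seek(0)
# buf now holds recordIO-protobuf bytes ready to upload to S3 as a training channel.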
syntax = "proto2";
package aialgs.data;

// A sparse or dense rank-R tensor that stores data as 32-bit floats (float32).
message Float32Tensor {
    // Each value in the vector. If keys is empty, this is treated as a
    // dense vector.
    repeated float values = 1 [packed = true];
    // If keys is present, the vector is treated as sparse; each key gives
    // the location of the corresponding value in the sparse vector.
    repeated uint64 keys = 2 [packed = true];
    // An optional shape that allows the vector to represent a matrix
    // (for example, a 2 x 2 matrix has the shape [2, 2]).
    repeated uint64 shape = 3 [packed = true];
}

// A sparse or dense rank-R tensor that stores data as doubles (float64).
message Float64Tensor {
    repeated double values = 1 [packed = true];
    repeated uint64 keys = 2 [packed = true];
    repeated uint64 shape = 3 [packed = true];
}

// A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
message Int32Tensor {
    repeated int32 values = 1 [packed = true];
    repeated uint64 keys = 2 [packed = true];
    repeated uint64 shape = 3 [packed = true];
}

// Support for storing binary data for parsing in other ways (such as JPEG/etc).
// This is an example of another type of value and may not immediately be supported.
message Bytes {
    repeated bytes value = 1;
    // If the content type of the data is known, stores it.
    optional string content_type = 2;
}

message Value {
    oneof value {
        // The numbering assumes the possible use of:
        // - float16, float128
        // - int8, int16, int32
        Float32Tensor float32_tensor = 2;
        Float64Tensor float64_tensor = 3;
        Int32Tensor int32_tensor = 7;
        Bytes bytes = 9;
    }
}

message Record {
    // Map from the name of the feature to the value.
    //
    // For vectors and libsvm-like datasets,
    // a single feature with the name `values`
    // should be specified.
    map<string, Value> features = 1;
    // An optional set of labels for this record.
    map<string, Value> label = 2;
    // A unique identifier for this record in the dataset.
    optional string uid = 3;
    // Textual metadata describing the record.
    optional string metadata = 4;
    // An optional serialized JSON object for per-record configuration.
    optional string configuration = 5;
}
After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker can access
and pass it as part of InputDataConfig in create_training_job.
Note
For all Amazon SageMaker algorithms, the ChannelName in InputDataConfig must be set to
train. Some algorithms also support validation or test input channels. These are typically
used to evaluate the model's performance by using a hold-out dataset. Hold-out datasets are
not used in the initial training but can be used to further tune the model.
# Inspect a trained model artifact that is serialized in Apache MXNet ndarray format.
import mxnet as mx
print(mx.ndarray.load('model_algo-1'))
Topics
• Convert Data for Inference Request Serialization (p. 1294)
• Convert Data for Inference Response Deserialization (p. 1295)
• Common Request Formats for All Algorithms (p. 1296)
• Use Batch Transform with Built-in Algorithms (p. 1297)
Content type options for Amazon SageMaker algorithm inference requests include text/csv,
application/json, and application/x-recordio-protobuf. Algorithms that don't support all of
these types might support others. XGBoost, for example, supports only text/csv from this list, but
also supports text/libsvm.
For text/csv, the value for the Body argument to invoke_endpoint should be a string with commas
separating the values for each feature. For example, a record for a model with four features might look
like 1.5,16.0,14,23.0. Any transformations performed on the training data should also be performed
on the data before obtaining inference. The order of the features matters and must remain unchanged.
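A hedged sketch of such a request with boto3 (the endpoint name is a placeholder):

import boto3

runtime = boto3.client("sagemaker-runtime")

# One record with four comma-separated features; no header, no target column.
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",    # placeholder endpoint name
    ContentType="text/csv",
    Body="1.5,16.0,14.0,23.0",
)
print(response["Body"].read().decode("utf-8"))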
application/json is significantly more flexible and provides multiple possible formats for developers
to use in their applications. At a high level, in JavaScript, the payload might look like the following:
let request = {
// Instances might contain multiple rows that predictions are sought for.
"instances": [
{
// Request and algorithm specific inference parameters.
"configuration": {},
// Data in the specific format required by the algorithm.
"data": {
"<field name>": dataElement
}
}
]
}
// Has the same format as the protocol buffers implementation described for training.
let dataElement = {
"keys": [],
"values": [],
"shape": []
}
"features": {
"values": dataElement
}
}
let request = {
"instances": [
// First instance.
{
"features": [ 1.5, 16.0, 14.0, 23.0 ]
},
// Second instance.
{
"features": [ -2.0, 100.2, 15.2, 9.2 ]
}
]
}
Amazon SageMaker algorithms return JSON in several layouts. At a high level, the structure is:
let response = {
"predictions": [{
// Fields in the response object are defined on a per-algorithm basis.
}]
}
The fields that are included in predictions differ across algorithms. The following are examples of output
for the k-means algorithm.
Single-record inference
let response = {
"predictions": [{
"closest_cluster": 5,
"distance_to_cluster": 36.5
}]
}
Multi-record inference
let response = {
"predictions": [
// First instance prediction.
{
"closest_cluster": 5,
"distance_to_cluster": 36.5
},
// Second instance prediction.
{
"closest_cluster": 2,
"distance_to_cluster": 90.3
}
]
}
{
    "features": [],
    "label": {
        "closest_cluster": {
            "values": [ 5.0 ]    // e.g., the closest cluster is cluster 5
        },
        "distance_to_cluster": {
            "values": [ 36.5 ]
        }
    },
    "uid": "abc123",
    "metadata": "{ \"created_at\": \"2017-06-03\" }"
}
SageMaker algorithms also support the JSONLINES format, where the per-record response content
is the same as that in JSON format. The multi-record structure is a concatenation of per-record
response objects separated by newline characters. The response content for the built-in KMeans
algorithm for 2 input data points is:

{"closest_cluster": 5, "distance_to_cluster": 36.5}
{"closest_cluster": 2, "distance_to_cluster": 90.3}

While running batch transform, we recommend using the jsonlines response type by setting the
Accept field in the CreateTransformJobRequest to application/jsonlines.
Dense format
let request = {
"instances": [
{
"features": [1.5, 16.0, 14.0, 23.0]
}
]
}
let request = {
"instances": [
{
"data": {
"features": {
"values": [ 1.5, 16.0, 14.0, 23.0]
}
}
}
]
}
Sparse format

{
    "instances": [
        {"data": {"features": {
            "keys": [26, 182, 232, 243, 431],
            "shape": [2000],
            "values": [1, 1, 1, 4, 1]
        }}}
    ]
}
Dense format

{"features": [1.5, 16.0, 14.0, 23.0]}

or:

{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}

Sparse format
{"data": {"features": { "keys": [26, 182, 232, 243, 431], "shape": [2000], "values": [1, 1,
1, 4, 1] } } }
When you create a transform job, the SplitType must be set according to the ContentType of
the input data. Similarly, the AssembleWith field must be set according to the Accept field in the
CreateTransformJobRequest. Use the following tables to set these fields appropriately:
ContentType | Recommended SplitType
application/x-recordio-protobuf | RecordIO
text/csv | Line
application/jsonlines | Line
application/json | None
application/x-image | None
image/* | None

Accept | Recommended AssembleWith
application/x-recordio-protobuf | None
application/json | None
application/jsonlines | Line
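For example, a sketch of setting these fields when creating a transform job with boto3 (the job, model, and S3 names are placeholders):

import boto3

sm = boto3.client("sagemaker")
sm.create_transform_job(
    TransformJobName="builtin-batch-example",   # placeholder job name
    ModelName="my-model",                       # placeholder model name
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/input"}},
        "ContentType": "text/csv",
        "SplitType": "Line",           # matches text/csv per the table above
    },
    TransformOutput={
        "S3OutputPath": "s3://my-bucket/output",
        "Accept": "application/jsonlines",
        "AssembleWith": "Line",        # matches application/jsonlines
    },
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)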
For more information on response formats for specific algorithms, see the documentation for the individual algorithms.
Most Amazon SageMaker algorithms have been engineered to take advantage of GPU computing for
training. For most algorithm training, we support P2, P3, G4dn, and G5 GPU instances. Despite higher
per-instance costs, GPUs train more quickly, making them more cost effective. Exceptions are noted in
this guide.
The size and type of data can have a great effect on which hardware configuration is most effective.
When the same model is trained on a recurring basis, initial testing across a spectrum of instance types
can discover configurations that are more cost-effective in the long run. Additionally, algorithms that
train most efficiently on GPUs might not require GPUs for efficient inference. Experiment to determine
the most cost-effective solution. To get an automatic instance recommendation or conduct custom
load tests, use Amazon SageMaker Inference Recommender.
For more information on SageMaker hardware specifications, see Amazon SageMaker ML Instance Types.
The contents of logs vary by algorithm. However, you can typically find the following information:

• Confirmation of arguments provided at the beginning of the log
• Failures that occurred during training
• Measurement of an algorithm's accuracy or numerical performance
• Timing of the algorithm, and any major stages within the algorithm
Common Errors
If a training job fails, some details about the failure are provided by the FailureReason return value in
the training job description, as follows:
import boto3

sage = boto3.client('sagemaker')
sage.describe_training_job(TrainingJobName=job_name)['FailureReason']
Other errors are reported only in the CloudWatch logs.
AutoGluon-Tabular
AutoGluon-Tabular is a popular open-source AutoML framework that trains highly accurate machine
learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily
focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple
models and stacking them in multiple layers.
You can use AutoGluon-Tabular as an Amazon SageMaker built-in algorithm. The following section
describes how to use AutoGluon-Tabular with the SageMaker Python SDK. For information on how to use
AutoGluon-Tabular from the Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).
After specifying the AutoGluon-Tabular image URI, you can use the AutoGluon-Tabular container to
construct an estimator using the SageMaker Estimator API and initiate a training job. The AutoGluon-
Tabular built-in algorithm runs in script mode, but the training script is provided for you and there
is no need to replace it. If you have extensive experience using script mode to create a SageMaker
training job, then you can incorporate your own AutoGluon-Tabular training scripts.
train_model_uri = model_uris.retrieve(
model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
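The fit call that follows assumes a tabular_estimator constructed beforehand. A sketch of that construction, where the entry point, instance type, and role are assumptions carried over from the image and script URI retrieval steps:

from sagemaker import hyperparameters
from sagemaker.estimator import Estimator

# Default hyperparameters for the chosen JumpStart model.
hp = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

tabular_estimator = Estimator(
    role=aws_role,                        # assumed IAM execution role
    image_uri=train_image_uri,            # assumed: from image_uris.retrieve
    source_dir=train_source_uri,          # assumed: from script_uris.retrieve
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",   # assumed script name
    instance_count=1,
    instance_type="ml.p3.2xlarge",        # assumed instance type
    max_run=360000,
    hyperparameters=hp,
    output_path=s3_output_location,
)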
# Launch a SageMaker Training job by passing the S3 path of the training data
tabular_estimator.fit(
{
"training": training_dataset_s3_path,
"validation": validation_dataset_s3_path,
}, logs=True, job_name=training_job_name
)
For more information about how to set up the AutoGluon-Tabular as a built-in algorithm, see the
following notebook examples. Any S3 bucket used in these examples must be in the same AWS Region
as the notebook instance used to run them.
• Tabular classification with Amazon SageMaker AutoGluon-Tabular algorithm
• Tabular regression with Amazon SageMaker AutoGluon-Tabular algorithm
Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.
The SageMaker implementation of AutoGluon-Tabular supports CSV for training and inference:
Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record.
For CSV inference, the algorithm assumes that CSV input does not have the label column.
Input format for training data, validation data, and categorical features
Be mindful of how to format your training data for input to the AutoGluon-Tabular model. You must
provide the path to an Amazon S3 bucket that contains your training and validation data. You can also
include a list of categorical features. Use both the training and validation channels to provide your
input data. Alternatively, you can use only the training channel.
You can provide your input data by way of two S3 paths, one for the training channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are
provided for the training or validation channels, the AutoGluon-Tabular algorithm concatenates
the files. The validation data is used to compute a validation score at the end of each boosting iteration.
Early stopping is applied when the validation score stops improving.
If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a JSON
file for categorical features, your training channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.
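For example, if the categorical features of a hypothetical dataset sit in columns 1, 3, and 7 of the training CSV, the categorical_index.json file would contain:

{"cat_index_list": [1, 3, 7]}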
You can alternatively provide your input data by way of a single S3 path for the training channel. This
S3 path should point to a directory with a subdirectory named training/ that contains one or more
CSV files. You can optionally include another subdirectory in the same location called validation/ that
also has one or more CSV files. If the validation data is not provided, then 20% of your training data is
randomly sampled to serve as the validation data. If your predictors include categorical features, you can
provide a JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.
To use a model trained with SageMaker AutoGluon-Tabular with the AutoGluon framework
import tarfile
from autogluon.tabular import TabularPredictor

# Extract the model artifact downloaded from S3, then load the predictor
# from the extracted directory (model_file_path).
t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()
model = TabularPredictor.load(model_file_path)
• Tabular classification with Amazon SageMaker AutoGluon-Tabular algorithm: This notebook demonstrates the use of the Amazon SageMaker AutoGluon-Tabular algorithm to train and host a tabular classification model.
• Tabular regression with Amazon SageMaker AutoGluon-Tabular algorithm: This notebook demonstrates the use of the Amazon SageMaker AutoGluon-Tabular algorithm to train and host a tabular regression model.
For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.
AutoGluon-Tabular performs advanced data processing, deep learning, and multi-layer model ensemble
methods. It automatically recognizes the data type in each column for robust data preprocessing,
including special handling of text fields.
AutoGluon fits various models ranging from off-the-shelf boosted trees to customized neural networks.
These models are ensembled in a novel way: models are stacked in multiple layers and trained in a layer-
wise manner that guarantees raw data can be translated into high-quality predictions within a given time
constraint. This process mitigates overfitting by splitting the data in various ways with careful tracking of
out-of-fold examples.
The AutoGluon-Tabular algorithm performs well in machine learning competitions because of its robust
handling of a variety of data types, relationships, and distributions. You can use AutoGluon-Tabular for
regression, classification (binary and multiclass), and ranking problems.
For an illustration of how the multi-layer stacking strategy works, and for more information, see
AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.
AutoGluon-Tabular hyperparameters
The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker AutoGluon-Tabular algorithm. Users set these parameters to facilitate
the estimation of model parameters from data. The SageMaker AutoGluon-Tabular algorithm is an
implementation of the open-source AutoGluon-Tabular package.
Note
The default hyperparameters are based on example datasets in the AutoGluon-Tabular sample
notebooks (p. 1304).
refit_full: Whether or not to retrain all models on all of the data (training and validation) after the
normal training procedure. For more details, see AutoGluon Predictors.
set_best_to_refit_full: Whether or not to change the default model that the predictor uses for
prediction. If set_best_to_refit_full is set to "True", the default model changes to the model
that exhibited the highest validation score as a result of refitting (activated by refit_full). Only
valid if refit_full is set.

save_space: Whether or not to reduce the memory and disk size of the predictor by deleting auxiliary
model files that aren't needed for prediction on new data. This has no impact on inference accuracy.
We recommend setting save_space to "True" if you only intend to use the trained model for
prediction. Certain advanced functionality may no longer be available if save_space is set to
"True". Refer to the predictor.save_space() documentation for more details.
Although AutoGluon-Tabular can be used with model tuning, its design can deliver good performance
using stacking and ensemble methods, meaning hyperparameter optimization is not necessary. Rather
than focusing on model tuning, AutoGluon-Tabular succeeds by stacking models in multiple layers and
training in a layer-wise manner.
CatBoost
CatBoost is a popular and high-performance open-source implementation of the Gradient Boosting
Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately
predict a target variable by combining an ensemble of estimates from a set of simpler and weaker
models.
CatBoost introduces two critical algorithmic advances: the implementation of ordered boosting, a
permutation-driven alternative to the classic boosting algorithm, and an innovative algorithm for
processing categorical features. Both techniques were created to fight a prediction shift caused by a
special kind of target leakage present in all currently existing implementations of gradient boosting
algorithms.
You can use CatBoost as an Amazon SageMaker built-in algorithm. The following section describes how
to use CatBoost with the SageMaker Python SDK. For information on how to use CatBoost from the
Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).
Use the CatBoost built-in algorithm to build a CatBoost training container as shown in the following
code example. You can look up the CatBoost built-in algorithm image URI using the SageMaker
image_uris.retrieve API (or the get_image_uri API if using Amazon SageMaker Python SDK
version 1).
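A sketch of that lookup, where the model ID, version, and instance type are assumptions (choose the classification or regression model ID for your task):

from sagemaker import image_uris

train_model_id, train_model_version, train_scope = (
    "catboost-classification-model",   # assumed JumpStart model ID
    "*",
    "training",
)
training_instance_type = "ml.m5.xlarge"   # assumed instance type

train_image_uri = image_uris.retrieve(
    region=None,                       # defaults to the session Region
    framework=None,                    # looked up by model ID instead
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type,
)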
After specifying the CatBoost image URI, you can use the CatBoost container to construct an estimator
using the SageMaker Estimator API and initiate a training job. The CatBoost built-in algorithm runs in
script mode, but the training script is provided for you and there is no need to replace it. If you have
extensive experience using script mode to create a SageMaker training job, then you can incorporate
your own CatBoost training scripts.
train_model_uri = model_uris.retrieve(
model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
# Launch a SageMaker Training job by passing the S3 path of the training data
tabular_estimator.fit(
{
"training": training_dataset_s3_path,
"validation": validation_dataset_s3_path,
}, logs=True, job_name=training_job_name
)
For more information about how to set up CatBoost as a built-in algorithm, see the following
notebook examples.
• Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm
• Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm
Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.
The SageMaker implementation of CatBoost supports CSV for training and inference:
Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record.
For CSV inference, the algorithm assumes that CSV input does not have the label column.
Input format for training data, validation data, and categorical features
Be mindful of how to format your training data for input to the CatBoost model. You must provide the
path to an Amazon S3 bucket that contains your training and validation data. You can also include a list
of categorical features. Use both the training and validation channels to provide your input data.
Alternatively, you can use only the training channel.
You can provide your input data by way of two S3 paths, one for the training channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are
provided for the training or validation channels, the CatBoost algorithm concatenates the files.
The validation data is used to compute a validation score at the end of each boosting iteration. Early
stopping is applied when the validation score stops improving.
If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a JSON
file for categorical features, your training channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.
You can alternatively provide your input data by way of a single S3 path for the training channel. This
S3 path should point to a directory with a subdirectory named training/ that contains one or more
CSV files. You can optionally include another subdirectory in the same location called validation/ that
also has one or more CSV files. If the validation data is not provided, then 20% of your training data is
randomly sampled to serve as the validation data. If your predictors include categorical features, you can
provide a JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.
To use a model trained with SageMaker CatBoost with the CatBoost framework

import tarfile
from catboost import CatBoostClassifier

# Extract the model artifact downloaded from S3, then load the serialized model
# (model_file_path points to the model file inside the extracted archive).
t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()
model = CatBoostClassifier()
model.load_model(model_file_path)
SageMaker CatBoost currently only trains using CPUs. CatBoost is a memory-bound (as opposed to
compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice
than a compute-optimized instance (for example, C5). Further, we recommend that you have enough
total memory in selected instances to hold the training data.
The following table outlines a variety of sample notebooks that address different use cases of the
Amazon SageMaker CatBoost algorithm.

• Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm: This notebook demonstrates the use of the Amazon SageMaker CatBoost algorithm to train and host a tabular classification model.
• Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm: This notebook demonstrates the use of the Amazon SageMaker CatBoost algorithm to train and host a tabular regression model.
For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.
CatBoost implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition
of two critical algorithmic advances:

• The implementation of ordered boosting, a permutation-driven alternative to the classic boosting algorithm
• An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage
present in all currently existing implementations of gradient boosting algorithms.
The CatBoost algorithm performs well in machine learning competitions because of its robust handling
of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you
can fine-tune. You can use CatBoost for regression, classification (binary and multiclass), and ranking
problems.
For more information on gradient boosting, see How XGBoost Works (p. 1376). For in-depth details
about the ordered boosting and categorical feature processing techniques used in the CatBoost method,
see CatBoost: unbiased boosting with categorical features.
CatBoost hyperparameters
The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker CatBoost algorithm. Users set these parameters to facilitate the estimation
of model parameters from data. The SageMaker CatBoost algorithm is an implementation of the open-
source CatBoost package.
Note
The default hyperparameters are based on example datasets in the CatBoost sample
notebooks (p. 1312).
By default, the SageMaker CatBoost algorithm automatically chooses an evaluation metric and loss
function based on the type of classification problem. The CatBoost algorithm detects the type of
classification problem based on the number of labels in your data. For regression problems, the
evaluation metric and loss functions are both root mean squared error. For binary classification
problems, the evaluation metric is Area Under the Curve (AUC) and the loss function is log loss. For
multiclass classification problems, the evaluation metric and loss functions are multiclass cross entropy.
You can use the eval_metric hyperparameter to change the default evaluation metric. Refer to the
following table for more information on CatBoost hyperparameters, including descriptions, valid values,
and default values.
early_stopping_rounds: Training stops if one metric of one validation data point does not improve
in the last early_stopping_rounds rounds. If early_stopping_rounds is less than or equal to
zero, this hyperparameter is ignored. Default value: 5.

learning_rate: The rate at which the model weights are updated after working through each batch of
training examples.
random_strength: The amount of randomness to use for scoring splits when the tree structure is
selected. Use this parameter to avoid overfitting the model.

max_leaves: The maximum number of leaves in the resulting tree. Can only be used with the
"Lossguide" growing policy.
scale_pos_weight: The weight for the positive class in binary classification. The value is used as a
multiplier for the weights of objects from the positive class. Default value: 1.

max_bin: The number of splits for numerical features. "Auto" means that max_bin is selected based
on the processing unit type and other parameters. For details, see the CatBoost documentation.

grow_policy: The tree growing policy. Defines how to perform greedy tree construction.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. Model
tuning focuses on the following hyperparameters:

• A learning loss function to optimize during model training
• An evaluation metric that is used to evaluate model performance during validation
• A range of hyperparameters and associated values to use when tuning the model automatically
Note
The learning loss function is automatically assigned based on the type of classification
task, which is determined by the number of unique integers in the label column. For more
information, see CatBoost hyperparameters (p. 1312).
Automatic model tuning searches your chosen hyperparameters to find the combination of values that
results in a model that optimizes the chosen evaluation metric.
Note
Automatic model tuning for CatBoost is only available from the Amazon SageMaker SDKs, not
from the SageMaker console.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The SageMaker CatBoost algorithm computes the following metrics to use for model validation. The
evaluation metric is automatically assigned based on the type of classification task, which is determined
by the number of unique integers in the label column.
Tune the CatBoost model with the following hyperparameters. The hyperparameters that have
the greatest effect on optimizing the CatBoost evaluation metrics are: learning_rate, depth,
l2_leaf_reg, and random_strength. For a list of all the CatBoost hyperparameters, see CatBoost
hyperparameters (p. 1312).
Factorization Machines

A factorization machine is a general-purpose supervised learning algorithm that you can use for both
classification and regression tasks. It is an extension of a linear model that is designed to capture
interactions between features within high dimensional sparse datasets economically.

Topics
• Input/Output Interface for the Factorization Machines Algorithm (p. 1317)
• EC2 Instance Recommendation for the Factorization Machines Algorithm (p. 1318)
• Factorization Machines Sample Notebooks (p. 1318)
• How Factorization Machines Work (p. 1318)
• Factorization Machines Hyperparameters (p. 1319)
• Tune a Factorization Machines Model (p. 1324)
• Factorization Machines Response Formats (p. 1326)
The Factorization Machines algorithm can be run in either binary classification mode or regression
mode. In each mode, a dataset can be provided to the test channel along with the train channel dataset.
The scoring depends on the mode used. In regression mode, the testing dataset is scored using Root
Mean Square Error (RMSE). In binary classification mode, the test dataset is scored using Binary Cross
Entropy (Log Loss), Accuracy (at threshold=0.5) and F1 Score (at threshold =0.5).
For training, the Factorization Machines algorithm currently supports only the recordIO-protobuf
format with Float32 tensors. Because their use case is predominantly on sparse data, CSV is not a good
candidate. Both File and Pipe mode training are supported for recordIO-wrapped protobuf.
For inference, the Factorization Machines algorithm supports the application/json and
application/x-recordio-protobuf formats.
• For the binary classification problem, the algorithm predicts a score and a label. The label is a number
and can be either 0 or 1. The score is a number that indicates how strongly the algorithm believes that
the label should be 1. The algorithm computes score first and then derives the label from the score
value. If the score is greater than or equal to 0.5, the label is 1.
• For the regression problem, just a score is returned and it is the predicted value. For example, if
Factorization Machines is used to predict a movie rating, score is the predicted rating value.
See Factorization Machines Sample Notebooks (p. 1318) for more details on training and
inference file formats.
The Amazon SageMaker Factorization Machines algorithm is highly scalable and can train across
distributed instances. We recommend training and inference with CPU instances for both sparse and
dense datasets. In some circumstances, training with one or more GPUs on dense data might provide
some benefit. Training with GPUs is available only on dense data. Use CPU instances for sparse data. The
Factorization Machines algorithm supports P2, P3, G4dn, and G5 instances for training and inference.
For a sample notebook that uses the SageMaker Factorization Machines algorithm to analyze the images
of handwritten digits from zero to nine in the MNIST dataset, see An Introduction to Factorization
Machines with MNIST. For instructions how to create and access Jupyter notebook instances that you can
use to run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you
have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all
the SageMaker samples. Example notebooks that use Factorization Machines algorithm are located in the
Introduction to Amazon algorithms section. To open a notebook, click on its Use tab and select Create
copy.
The prediction task for a Factorization Machines model is to estimate a function ŷ from a feature vector
x to a target domain. This domain is real-valued for regression and binary for classification. The
Factorization Machines model is supervised and so has a training dataset (x_i, y_i) available. The
advantages this model presents lie in the way it uses a factorized parametrization to capture the
pairwise feature interactions. It can be represented mathematically as follows:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
The three terms in this equation correspond respectively to the three components of the model: the
global bias term w_0, the linear terms w_i x_i, and the pairwise factorized interaction terms
\langle v_i, v_j \rangle x_i x_j.
The global bias and linear terms are the same as in a linear model. The pairwise feature interactions
are modeled in the third term as the inner product of the corresponding factors learned for each
feature. Learned factors can also be considered as embedding vectors for each feature. For example, in
a classification task, if a pair of features tends to co-occur more often in positive labeled samples, then
the inner product of their factors would be large. In other words, their embedding vectors would be close
to each other in cosine similarity. For more information about the Factorization Machines model, see
Factorization Machines.
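As a toy illustration of this parametrization (a sketch only, not the SageMaker implementation), the score for one feature vector can be computed with the usual O(nk) rewriting of the pairwise term:

import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machines score: global bias + linear + pairwise terms.
    w0: scalar bias, w: linear weights (n,), V: factor matrix (n, k)."""
    linear = w0 + np.dot(w, x)
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = V.T @ x
    x2v2 = (V ** 2).T @ (x ** 2)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise

# Toy usage: 4 features, 2 factors, random parameters.
rng = np.random.default_rng(0)
x = np.array([1.5, 16.0, 14.0, 23.0])
print(fm_predict(x, w0=0.1, w=rng.normal(size=4), V=rng.normal(size=(4, 2))))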
For regression tasks, the model is trained by minimizing the squared error between the model prediction
ŷ_n and the target value y_n. This is known as the square loss:

\ell(\hat{y}_n, y_n) = \tfrac{1}{2}\,(\hat{y}_n - y_n)^2

For a classification task, the model is trained by minimizing the cross entropy loss, also known as the log
loss:

\ell(\hat{y}_n, y_n) = -\,y_n \log \sigma(\hat{y}_n) - (1 - y_n)\,\log\bigl(1 - \sigma(\hat{y}_n)\bigr)

where \sigma(t) = 1 / (1 + e^{-t}) is the logistic function and y_n \in \{0, 1\}.
For more information about loss functions for classification, see Loss functions for classification.
The following table contains the hyperparameters for the Factorization Machines algorithm. These
are parameters that are set by users to facilitate the estimation of model parameters from data.
The required hyperparameters that must be set are listed first, in alphabetical order. The optional
hyperparameters that can be set are listed next, also in alphabetical order.
feature_dim: The dimension of the input feature space. This could be very high with sparse input.
Required. Valid values: positive integer.

num_factors: The dimensionality of factorization. Required. Valid values: positive integer.

predictor_type: The type of predictor. Required. Valid values: binary_classifier for binary
classification or regressor for regression.
bias_init_sigma: The standard deviation for initialization of the bias term. Takes effect if
bias_init_method is set to normal. Optional.
linear_init_sigma: The standard deviation for initialization of linear terms. Takes effect if
linear_init_method is set to normal. Optional.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The Factorization Machines algorithm has both binary classification and regression predictor types. The
predictor type determines which metric you can use for automatic model tuning. The algorithm reports a
test:rmse regressor metric, which is computed during training. When tuning the model for regression
tasks, choose this metric as the objective.
The Factorization Machines algorithm reports three binary classification metrics, which are computed
during training. When tuning the model for binary classification tasks, choose one of these as the
objective.
Metric name | Description | Optimization direction
test:binary_classification_accuracy | Accuracy | Maximize
test:binary_classification_cross_entropy | Cross Entropy | Minimize
test:binary_f_beta | F1 Score | Maximize
You can tune the following hyperparameters for the Factorization Machines algorithm. The initialization
parameters that contain the terms bias, linear, and factorization depend on their initialization method.
There are three initialization methods: uniform, normal, and constant. These initialization methods
are not themselves tunable. The parameters that are tunable are dependent on this choice of the
initialization method. For example, if the initialization method is uniform, then only the scale
parameters are tunable. Specifically, if bias_init_method==uniform, then bias_init_scale,
linear_init_scale, and factors_init_scale are tunable. Similarly, if the initialization method is
normal, then only sigma parameters are tunable. If the initialization method is constant, then only
value parameters are tunable. These dependencies are listed in the following table.
Parameter name | Parameter type | Recommended ranges | Dependency
factors_init_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias_init_method == uniform
factors_init_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias_init_method == normal
factors_init_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias_init_method == constant
Binary classification
let response = {
"predictions": [
{
"score": 0.4,
"predicted_label": 0
}
]
}
Regression
let response = {
"predictions": [
{
"score": 0.4
}
]
}
Binary classification

{"score": 0.4, "predicted_label": 0}

Regression

{"score": 0.4}
Binary classification
[
Record = {
features = {},
label = {
'score': {
keys: [],
values: [0.4] # float32
},
'predicted_label': {
keys: [],
values: [0.0] # float32
}
}
}
]
Regression
[
Record = {
features = {},
label = {
'score': {
keys: [],
values: [0.4] # float32
}
}
}
]
K-Nearest Neighbors (k-NN) Algorithm

Amazon SageMaker k-nearest neighbors (k-NN) algorithm is an index-based, supervised algorithm that
uses a non-parametric method for classification or regression.

Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building.
Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction,
the algorithm decreases the feature dimension of the data to reduce the footprint of the k-NN model
in memory and inference latency. We provide two dimension reduction methods: random
projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for
high-dimensional (d >1000) datasets to avoid the “curse of dimensionality” that troubles the statistical
analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is
to construct the index. The index enables efficient lookups of distances between points whose values or
class labels have not yet been determined and the k nearest points to use for inference.
Topics
• Input/Output Interface for the k-NN Algorithm (p. 1327)
• k-NN Sample Notebooks (p. 1328)
• How the k-NN Algorithm Works (p. 1328)
• EC2 Instance Recommendation for the k-NN Algorithm (p. 1329)
• k-NN Hyperparameters (p. 1329)
• Tune a k-NN Model (p. 1331)
• Data Formats for k-NN Training Input (p. 1332)
• k-NN Request and Response Formats (p. 1333)
• Use a train channel for data that you want to sample and construct into the k-NN index.
• Use a test channel to emit scores in log files. Scores are listed as one line per mini-batch: accuracy for
classifiers or mean-squared error (MSE) for regressors.
For batch transform, k-NN supports the application/jsonlines data format for both input and
output. An example input and the corresponding output follow:

content-type: application/jsonlines

{"features": [1.5, 16.0, 14.0, 23.0]}
{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}

accept: application/jsonlines

{"predicted_label": 0.0}
{"predicted_label": 2.0}
For more information on input and output file formats, see Data Formats for k-NN Training
Input (p. 1332) for training, k-NN Request and Response Formats (p. 1333) for inference, and the k-NN
Sample Notebooks (p. 1328).
Use a Jupyter notebook instance to run the example in SageMaker. To learn how to create and open a
Jupyter notebook instance in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you
have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all
the SageMaker example notebooks. Find K-Nearest Neighbor notebooks in the Introduction to Amazon
algorithms section. To open a notebook, click on its Use tab and select Create copy.
The two methods are selected with the dimension_reduction_type hyperparameter: sign for
random projection and fjlt for the fast Johnson-Lindenstrauss transform. Both methods preserve the
L2 and inner product distances. The fjlt method should be used when the target dimension is large
and offers better performance with CPU inference. The methods differ in their computational
complexity. The sign method requires O(ndk) time to reduce the dimension of a batch of n points of
dimension d into a target dimension k. The fjlt method requires O(nd log(d)) time, but the constants
involved are larger. Using dimension reduction introduces noise into the data and this noise can reduce
prediction accuracy.
• model_algo-1: Contains the serialized index for computing the nearest neighbors.
• model_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing the predicted
label based on the query result from the index.
• model_algo-1.json: Contains the JSON-formatted model metadata which stores the k and
predictor_type hyper-parameters from training for inference along with other relevant state.
With the current implementation of k-NN, you can modify the metadata file to change the way
predictions are computed. For example, you can change k to 10 or change predictor_type to
regressor.
{
"k": 5,
"predictor_type": "classifier",
"dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
"normalize": False,
"version": "1.0"
}
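A small sketch of such an edit with Python's json module (the file name is taken from the artifact list above; the new values are examples):

import json

# Load the model metadata, change inference-time settings, and write it back.
with open("model_algo-1.json") as f:
    meta = json.load(f)

meta["k"] = 10                          # use 10 nearest neighbors at inference
meta["predictor_type"] = "regressor"    # switch from classification to regression

with open("model_algo-1.json", "w") as f:
    json.dump(meta, f)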
Inference requests from CPUs generally have a lower average latency than requests from GPUs because
there is a tax on CPU-to-GPU communication when you use GPU hardware. However, GPUs generally
have higher throughput for larger batches.
k-NN Hyperparameters

feature_dim: The dimension of the input data. Required. Valid values: positive integer.

k: The number of nearest neighbors. Required. Valid values: positive integer.

predictor_type: The type of inference to use on the data labels. Required. Valid values: classifier
for classification or regressor for regression.

sample_size: The number of data points to be sampled from the training data set. Required. Valid
values: positive integer.

dimension_reduction_target: The target dimension to reduce to. Required when you specify the
dimension_reduction_type hyperparameter. Valid values: positive integer greater than 0 and less
than feature_dim.

dimension_reduction_type: The type of dimension reduction method. Optional. Valid values: sign
for random projection or fjlt for the fast Johnson-Lindenstrauss transform. Default value: no
dimension reduction.
index_metric: The metric to measure the distance between points when finding nearest neighbors.
When training with index_type set to faiss.IVFPQ, the INNER_PRODUCT distance and COSINE
similarity are not supported. Optional. Valid values: L2, INNER_PRODUCT, COSINE. Default value: L2.

mini_batch_size: The number of observations per mini-batch for the data iterator. Optional. Valid
values: positive integer. Default value: 5000.
The Amazon SageMaker k-nearest neighbors algorithm is a supervised algorithm. The algorithm
consumes a test data set and emits a metric about the accuracy for a classification task or about the
mean squared error for a regression task. These accuracy metrics compare the model predictions for
their respective task to the ground truth provided by the empirical test data. To find the best model that
reports the highest accuracy or lowest error on the test dataset, run a hyperparameter tuning job for k-
NN.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective
metric appropriate for the prediction task of the algorithm. Automatic model tuning searches the
hyperparameters chosen to find the combination of values that result in the model that optimizes the
objective metric. The hyperparameters are used only to help estimate model parameters and are not
used by the trained model to make predictions.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The k-nearest neighbors algorithm computes one of two metrics in the following table during training,
depending on the type of task specified by the predictor_type hyperparameter.

Metric name | Description | Optimization direction
test:accuracy | Computed when predictor_type is set to classifier | Maximize
test:mse | Computed when predictor_type is set to regressor | Minimize

Choose the predictor_type value appropriate for the type of task undertaken to calculate the
relevant objective metric when tuning a model.
Tune the Amazon SageMaker k-nearest neighbor model with the following hyperparameters.
All Amazon SageMaker built-in algorithms adhere to the common input training formats described
in Common Data Formats - Training. This topic contains a list of the available input formats for the
SageMaker k-nearest-neighbor algorithm.
content-type: text/csv; label_size=1

4,1.2,1.3,9.6,20.3
The first label_size columns are interpreted as the label vector for that row.
content-type: application/x-recordio-protobuf
[
Record = {
features = {
'values': {
values: [1.2, 1.3, 9.6, 20.3] # float32
}
},
label = {
'values': {
values: [4] # float32
}
}
}
]
All Amazon SageMaker built-in algorithms adhere to the common input inference format described
in Common Data Formats - Inference. This topic contains a list of the available output formats for the
SageMaker k-nearest-neighbor algorithm.
content-type: text/csv
1.2,1.3,9.6,20.3
This accepts a label_size or encoding parameter. It assumes a label_size of 0 and a utf-8 encoding.
content-type: application/json
{
"instances": [
{"data": {"features": {"values": [-3, -1, -4, 2]}}},
{"features": [3.0, 0.1, 0.04, 0.002]}]
}
content-type: application/jsonlines

{"data": {"features": {"values": [-3, -1, -4, 2]}}}
{"features": [3.0, 0.1, 0.04, 0.002]}
content-type: application/x-recordio-protobuf
[
Record = {
features = {
'values': {
values: [-3, -1, -4, 2] # float32
}
},
label = {}
},
Record = {
features = {
'values': {
values: [3.0, 0.1, 0.04, 0.002] # float32
}
},
label = {}
},
]
accept: application/json
{
"predictions": [
{"predicted_label": 0.0},
{"predicted_label": 2.0}
]
}
accept: application/jsonlines
{"predicted_label": 0.0}
{"predicted_label": 2.0}
In verbose mode, the API provides the search results with the distances vector sorted from smallest to
largest, with corresponding elements in the labels vector. In this example, k is set to 3.
{
"predictions": [
{
"predicted_label": 0.0,
"distances": [3.11792408, 3.89746071, 6.32548437],
"labels": [0.0, 1.0, 0.0]
},
{
"predicted_label": 2.0,
"distances": [1.08470316, 3.04917915, 5.25393973],
"labels": [2.0, 2.0, 0.0]
}
]
}
content-type: application/x-recordio-protobuf
[
Record = {
features = {},
label = {
'predicted_label': {
values: [0.0] # float32
}
}
},
Record = {
features = {},
label = {
'predicted_label': {
values: [2.0] # float32
}
}
}
]
In verbose mode, the API provides the search results with the distances vector sorted from smallest to
largest, with corresponding elements in the labels vector. In this example, k is set to 3.
[
Record = {
features = {},
label = {
'predicted_label': {
values: [0.0] # float32
},
'distances': {
values: [3.11792408, 3.89746071, 6.32548437] # float32
},
'labels': {
values: [0.0, 1.0, 0.0] # float32
}
}
},
Record = {
features = {},
label = {
'predicted_label': {
values: [2.0] # float32
},
'distances': {
values: [1.08470316, 3.04917915, 5.25393973] # float32
},
'labels': {
values: [2.0, 2.0, 0.0] # float32
}
}
}
]
LightGBM
LightGBM is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree
(GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target
variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM
uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT.
You can use LightGBM as an Amazon SageMaker built-in algorithm. The following section describes how
to use LightGBM with the SageMaker Python SDK. For information on how to use LightGBM from the
Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).
Use the LightGBM built-in algorithm to build a LightGBM training container as shown in the following
code example. You can look up the LightGBM built-in algorithm image URI using the SageMaker
image_uris.retrieve API (or the get_image_uri API if using Amazon SageMaker Python SDK
version 1).
After specifying the LightGBM image URI, you can use the LightGBM container to construct an
estimator using the SageMaker Estimator API and initiate a training job. The LightGBM built-in
algorithm runs in script mode, but the training script is provided for you and there is no need to
replace it. If you have extensive experience using script mode to create a SageMaker training job, then
you can incorporate your own LightGBM training scripts.
train_model_uri = model_uris.retrieve(
model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
# Launch a SageMaker Training job by passing the S3 path of the training data
tabular_estimator.fit(
{
"train": training_dataset_s3_path,
"validation": validation_dataset_s3_path,
}, logs=True, job_name=training_job_name
)
For more information about how to set up LightGBM as a built-in algorithm, see the following
notebook examples.
• Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm
• Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm
The SageMaker implementation of LightGBM supports CSV for training and inference:
Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record.
For CSV inference, the algorithm assumes that CSV input does not have the label column.
Input format for training data, validation data, and categorical features
Be mindful of how to format your training data for input to the LightGBM model. You must provide the
path to an Amazon S3 bucket that contains your training and validation data. You can also include a
list of categorical features. Use both the train and validation channels to provide your input data.
Alternatively, you can use only the train channel.
Note
Both train and training are valid channel names for LightGBM training.
You can provide your input data by way of two S3 paths, one for the train channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files
are provided for the train or validation channels, the LightGBM algorithm concatenates the files.
The validation data is used to compute a validation score at the end of each boosting iteration. Early
stopping is applied when the validation score stops improving.
If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a
JSON file for categorical features, your train channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.
You can alternatively provide your input data by way of a single S3 path for the train channel. This S3
path should point to a directory with a subdirectory named train/ that contains one or more CSV files.
You can optionally include another subdirectory in the same location called validation/ that also has
one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly
sampled to serve as the validation data. If your predictors include categorical features, you can provide a
JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.
SageMaker LightGBM uses the Python joblib module to serialize and deserialize the model, which can be
used for saving and loading the model.
To use a model trained with SageMaker LightGBM with the joblib module:
import joblib
import tarfile

t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()

# model_file_path is the path to the extracted model file
model = joblib.load(model_file_path)
SageMaker LightGBM currently supports single-instance and multi-instance CPU training. For multi-
instance CPU training (distributed training), specify an instance_count greater than 1 when you
define your Estimator. For more information on distributed training with LightGBM, see Amazon
SageMaker LightGBM Distributed training using Dask.
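A minimal sketch of such an Estimator follows; it assumes that image_uri, role, hyperparameters, and s3_output_location are already defined, consistent with the earlier training example:

from sagemaker.estimator import Estimator

tabular_estimator = Estimator(
    image_uri=image_uri,              # assumed: the retrieved LightGBM container URI
    role=role,
    hyperparameters=hyperparameters,
    instance_count=2,                 # a value greater than 1 enables distributed CPU training
    instance_type="ml.m5.4xlarge",
    output_path=s3_output_location,
)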
The following table outlines a variety of sample notebooks that address different use cases of the
Amazon SageMaker LightGBM algorithm.

Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm
This notebook demonstrates the use of the Amazon SageMaker LightGBM algorithm to train and host a tabular classification model.

Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm
This notebook demonstrates the use of the Amazon SageMaker LightGBM algorithm to train and host a tabular regression model.

Amazon SageMaker LightGBM Distributed training using Dask
This notebook demonstrates distributed training with the Amazon SageMaker LightGBM algorithm using the Dask framework.
For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.
LightGBM implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the
addition of two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature
Bundling (EFB). These techniques are designed to significantly improve the efficiency and scalability of
GBDT.
The LightGBM algorithm performs well in machine learning competitions because of its robust handling
of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you
can fine-tune. You can use LightGBM for regression, classification (binary and multiclass), and ranking
problems.
For more information on gradient boosting, see How XGBoost Works (p. 1376). For in-depth details
about the additional GOSS and EFB techniques used in the LightGBM method, see LightGBM: A Highly
Efficient Gradient Boosting Decision Tree.
LightGBM hyperparameters
The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker LightGBM algorithm. Users set these parameters to facilitate the estimation
of model parameters from data. The SageMaker LightGBM algorithm is an implementation of the open-
source LightGBM package.
Note
The default hyperparameters are based on example datasets in the LightGBM sample
notebooks (p. 1339).
By default, the SageMaker LightGBM algorithm automatically chooses an evaluation metric and
objective function based on the type of classification problem. The LightGBM algorithm detects the
type of classification problem based on the number of labels in your data. For regression problems,
the evaluation metric is root mean squared error and the objective function is L2 loss. For binary
classification problems, the evaluation metric and objective function are both binary cross entropy. For
multiclass classification problems, the evaluation metric is multiclass cross entropy and the objective
function is softmax. You can use the metric hyperparameter to change the default evaluation metric.
Refer to the following table for more information on LightGBM hyperparameters, including descriptions,
valid values, and default values.
early_stopping_rounds The training stops if one metric on the validation data
does not improve over the last early_stopping_rounds rounds.
If early_stopping_rounds is less than or equal to zero, this
hyperparameter is ignored.
metric The evaluation metric for validation data. If metric is set to the
default "auto" value, then the algorithm automatically chooses an
evaluation metric based on the type of problem: root mean squared
error for regression, binary cross entropy for binary classification,
and multiclass cross entropy for multiclass classification.
learning_rate The rate at which the model weights are updated after working
through each batch of training examples.
Default value: 1.
max_depth The maximum depth for a tree model. This is used to deal with
overfitting when the amount of data is small. If max_depth is less
than or equal to zero, this means there is no limit for maximum
depth.
Default value: 6.
min_data_in_leaf The minimum amount of data in one leaf. Can be used to deal with
overfitting.
Default value: 3.
lambda_l1 L1 regularization.
lambda_l2 L2 regularization.
scale_pos_weight The weight of the labels with positive class. Used only for binary
classification tasks. scale_pos_weight cannot be used if
is_unbalance is set to "True".
tweedie_variance_power Controls the variance of the Tweedie distribution. Set this closer to
2.0 to shift toward a gamma distribution. Set this closer to 1.0 to
shift toward a Poisson distribution. Used only for regression tasks.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. Model
tuning focuses on the following hyperparameters:
Note
The learning objective function is automatically assigned based on the type of classification
task, which is determined by the number of unique integers in the label column. For more
information, see LightGBM hyperparameters (p. 1340).
Automatic model tuning searches your specified hyperparameters to find the combination of values that
results in a model that optimizes the chosen evaluation metric.
Note
Automatic model tuning for LightGBM is only available from the Amazon SageMaker SDKs, not
from the SageMaker console.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The SageMaker LightGBM algorithm computes the following metrics to use for model validation. The
evaluation metric is automatically assigned based on the type of classification task, which is determined
by the number of unique integers in the label column.
Tune the LightGBM model with the following hyperparameters. The hyperparameters that have the
greatest effect on optimizing the LightGBM evaluation metrics are: learning_rate, num_leaves,
feature_fraction, bagging_fraction, bagging_freq, max_depth and min_data_in_leaf. For
a list of all the LightGBM hyperparameters, see LightGBM hyperparameters (p. 1340).
The Amazon SageMaker linear learner algorithm provides a solution for both classification and
regression problems. With the SageMaker algorithm, you can simultaneously explore different training
objectives and choose the best solution from a validation set. You can also explore a large number of
models and choose the best. The best model optimizes either of the following:
• Continuous objectives, such as mean square error, cross entropy loss, absolute error.
• Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy.
Compared with methods that provide a solution for only continuous objectives, the SageMaker linear
learner algorithm provides a significant increase in speed over naive hyperparameter optimization
techniques. It is also more convenient.
The linear learner algorithm requires a data matrix, with rows representing the observations, and
columns representing the dimensions of the features. It also requires an additional column that contains
the labels that match the data points. At a minimum, Amazon SageMaker linear learner requires you to
specify input and output data locations, and objective type (classification or regression) as arguments.
The feature dimension is also required. For more information, see CreateTrainingJob. You can specify
additional parameters in the HyperParameters string map of the request body. These parameters
control the optimization procedure, or specifics of the objective function that you train on. For example,
the number of epochs, regularization, and loss type.
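As a sketch of these required settings with the SageMaker Python SDK (the bucket name, output path, and hyperparameter values below are placeholders):

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

region = sagemaker.Session().boto_region_name
container = image_uris.retrieve("linear-learner", region)

ll_estimator = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",  # placeholder output location
)
ll_estimator.set_hyperparameters(
    feature_dim=10,              # number of feature columns in the data matrix
    predictor_type="regressor",  # or "binary_classifier" / "multiclass_classifier"
    epochs=10,                   # example of an optional optimization parameter
)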
If you're using Managed Spot Training, the linear learner algorithm supports using checkpoints to take a
snapshot of the state of the model.
Topics
The Amazon SageMaker linear learner algorithm supports three data channels: train, validation
(optional), and test (optional). If you provide validation data, the S3DataDistributionType should
be FullyReplicated. The algorithm logs validation loss at every epoch, and uses a sample of the
validation data to calibrate and select the best model. If you don't provide validation data, the algorithm
uses a sample of the training data to calibrate and select the model. If you provide test data, the
algorithm logs include the test score for the final model.
For training, the linear learner algorithm supports both recordIO-wrapped protobuf and CSV
formats. For the application/x-recordio-protobuf input type, only Float32 tensors are
supported. For the text/csv input type, the first column is assumed to be the label, which is the target
variable for prediction. You can use either File mode or Pipe mode to train linear learner models on data
that is formatted as recordIO-wrapped-protobuf or as CSV.
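For instance, a CSV training channel might be declared as follows (the S3 prefix is a placeholder):

from sagemaker.inputs import TrainingInput

# The first column of each CSV row is the label; File mode is the default.
train_input = TrainingInput(
    "s3://my-bucket/linear-learner/train/",  # placeholder S3 prefix
    content_type="text/csv",
)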
• For binary classification, predicted_label is 0 or 1, and score is a single floating point number
that indicates how strongly the algorithm believes that the label should be 1.
• For multiclass classification, the predicted_class will be an integer from 0 to num_classes-1,
and score will be a list of one floating point number per class.
To interpret the score in classification problems, you have to consider the loss function used. If the
loss hyperparameter value is logistic for binary classification or softmax_loss for multiclass
classification, then the score can be interpreted as the probability of the corresponding class. These
are the loss values used by the linear learner when the loss hyperparameter is set to its auto default value. But if the loss
is set to hinge_loss, then the score cannot be interpreted as a probability. This is because hinge loss
corresponds to a Support Vector Classifier, which does not produce probability estimates.
For more information on input and output file formats, see Linear learner response formats (p. 1360)
and the Linear learner sample notebooks (p. 1347).
The linear learner algorithm supports both CPU and GPU instances for training and inference. For GPU,
the linear learner algorithm supports P2, P3, G4dn, and G5 GPU families.
During testing, we have not found substantial evidence that multi-GPU instances are faster than single-
GPU instances. Results can vary, depending on your specific use case.
An Introduction with the MNIST dataset
Using the MNIST dataset, we train a binary classifier to predict a single digit.

How to Build a Machine Learning (ML) Pipeline for Inference?
Using a Scikit-learn container, we demonstrate how to build an end-to-end ML pipeline.
For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. The example notebooks that use the linear learner algorithm are
located in the Introduction to Amazon algorithms section. To open a notebook, choose its Use tab and
choose Create copy.
Step 1: Preprocess
Normalization, or feature scaling, is an important preprocessing step for certain loss functions that
ensures the model being trained on a dataset does not become dominated by the weight of a single
feature. The Amazon SageMaker Linear Learner algorithm has a normalization option to assist with this
preprocessing step. If normalization is turned on, the algorithm first goes over a small sample of the data
to learn the mean value and standard deviation for each feature and for the label. Each of the features in
the full dataset is then shifted to have mean of zero and scaled to have a unit standard deviation.
Note
For best results, ensure your data is shuffled before training. Training with unshuffled data may
cause training to fail.
You can configure whether the linear learner algorithm normalizes the feature data and the labels using
the normalize_data and normalize_label hyperparameters, respectively. Normalization is enabled
by default for both features and labels for regression. Only the features can be normalized for binary
classification and this is the default behavior.
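As a sketch, reusing the hypothetical ll_estimator from the earlier linear learner example:

# Explicitly enable normalization of both features and labels for regression.
ll_estimator.set_hyperparameters(
    predictor_type="regressor",
    normalize_data="true",
    normalize_label="true",
)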
Step 2: Train
With the linear learner algorithm, you train with a distributed implementation of stochastic gradient
descent (SGD). You can control the optimization process by choosing the optimization algorithm. For
example, you can choose to use Adam, AdaGrad, stochastic gradient descent, or other optimization
algorithms. You also specify their hyperparameters, such as momentum, learning rate, and the learning
rate schedule. If you aren't sure which algorithm or hyperparameter value to use, choose a default that
works for the majority of datasets.
During training, you simultaneously optimize multiple models, each with slightly different objectives. For
example, you vary L1 or L2 regularization and try out different optimizer settings.
When training multiple models in parallel, the models are evaluated against a validation set to select
the best model once training is complete. For regression, the best model is the one
that achieves the best loss on the validation set. For classification, a sample of the validation set is used
to calibrate the classification threshold. The best model selected is the one that achieves the
best binary classification selection criteria on the validation set. Examples of such criteria include the F1
measure, accuracy, and cross-entropy loss.
Note
If the algorithm is not provided a validation set, then evaluating and selecting the best
model is not possible. To take advantage of parallel training and model selection, ensure that
you provide a validation set to the algorithm.
The following table contains the hyperparameters for the linear learner algorithm. These are parameters
that are set by users to facilitate the estimation of model parameters from data. The required
hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters
that can be set are listed next, also in alphabetical order. When a hyperparameter is set to auto, Amazon
SageMaker will automatically calculate and set the value of that hyperparameter.
num_classes The number of classes for the response variable. The algorithm assumes
that classes are labeled 0, ..., num_classes - 1.
Required
accuracy_top_k When computing the top-k accuracy metric for multiclass classification,
the value of k. If the model assigns one of the top-k scores to the true
label, an example is scored as correct.
Optional
Default value: 3
balance_multiclass_weights Specifies whether to use class weights, which give each class equal
importance in the loss function. Used only when the predictor_type is
multiclass_classifier.
Optional
beta_1 The exponential decay rate for first-moment estimates. Applies only when
the optimizer value is adam.
Optional
beta_2 The exponential decay rate for second-moment estimates. Applies only
when the optimizer value is adam.
Optional
bias_lr_mult Allows a different learning rate for the bias term. The actual learning rate
for the bias is learning_rate * bias_lr_mult.
Optional
bias_wd_mult Allows different regularization for the bias term. The actual L2
regularization weight for the bias is wd * bias_wd_mult. By default, there
is no regularization on the bias term.
Optional
binary_classifier_model_selection_criteria When predictor_type is set to binary_classifier, the model
evaluation criteria for the validation dataset (or for the training dataset if
you don't provide a validation dataset). Criteria include accuracy, f1, f_beta,
precision_at_target_recall, recall_at_target_precision, cross_entropy_loss, and loss_function.
Optional
f_beta The value of beta to use when calculating F score metrics for binary
or multiclass classification. Also used if the value specified for
binary_classifier_model_selection_criteria is f_beta.
Optional
huber_delta The parameter for Huber loss. During training and metric evaluation,
compute L2 loss for errors smaller than delta and L1 loss for errors larger
than delta.
Optional
init_method Sets the initial distribution function used for model weights. Functions
include:
Optional
init_scale Scales an initial uniform distribution for model weights. Applies only
when the init_method hyperparameter is set to uniform.
Optional
init_sigma The initial standard deviation for the normal distribution. Applies only
when the init_method hyperparameter is set to normal.
Optional
learning_rate The step size used by the optimizer for parameter updates.
Optional
loss The available loss functions and their default values depend on the value
of predictor_type:
Optional
loss_insensitivity The parameter for the epsilon-insensitive loss type. During training and
metric evaluation, any error smaller than this value is considered to be
zero.
Optional
lr_scheduler_minimum_lr The learning rate never decreases to a value lower than the value
set for lr_scheduler_minimum_lr. Applies only when the
use_lr_scheduler hyperparameter is set to true.
Optional
lr_scheduler_step The number of steps between decreases of the learning rate. Applies only
when the use_lr_scheduler hyperparameter is set to true.
Optional
mini_batch_size The number of observations per mini-batch for the data iterator.
Optional
normalize_data Normalizes the feature data before training. Data normalization shifts
the data for each feature to have a mean of zero and scales it to have unit
standard deviation.
Optional
normalize_label Normalizes the label. Label normalization shifts the label to have a mean
of zero and scales it to have unit standard deviation.
The auto default value normalizes the label for regression problems but
does not for classification problems. If you set the normalize_label
hyperparameter to true for classification problems, the algorithm ignores
it.
Optional
num_calibration_samples The number of observations from the validation dataset to use for model
calibration (when finding the best threshold).
Optional
num_models The number of models to train in parallel. For the default, auto, the
algorithm decides the number of parallel models to train. One model
is trained according to the given training parameter (regularization,
optimizer, loss), and the rest by close parameters.
Optional
quantile The quantile for quantile loss. For quantile q, the model attempts to
produce predictions so that the value of true_label is greater than the
prediction with probability q.
Optional
unbias_data Unbiases the features before training so that the mean is 0. By default
data is unbiased as the use_bias hyperparameter is set to true.
Optional
use_bias Specifies whether the model should include a bias term, which is the
intercept term in the linear equation.
Optional
use_lr_scheduler Whether to use a scheduler for the learning rate. If you want to use a
scheduler, specify true.
Optional
The linear learner algorithm also has an internal mechanism for tuning hyperparameters separate
from the automatic model tuning feature described here. By default, the linear learner algorithm tunes
hyperparameters by training multiple models in parallel. When you use automatic model tuning, the
linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel
models, num_models, to 1. The algorithm ignores any value that you set for num_models.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
test:absolute_loss
The absolute loss of the final model on the test dataset. This objective metric is only valid for regression. Optimization direction: Minimize.

test:binary_f_beta
The F-beta score of the final model on the test dataset. By default, it is the F1 score, which is the harmonic mean of precision and recall. This objective metric is only valid for binary classification. Optimization direction: Maximize.

test:macro_f_beta
The F-beta score of the final model on the test dataset. This objective metric is only valid for multiclass classification. Optimization direction: Maximize.

test:macro_recall
The recall score of the final model on the test dataset. This objective metric is only valid for multiclass classification. Optimization direction: Maximize.

test:mse
The mean square error of the final model on the test dataset. This objective metric is only valid for regression. Optimization direction: Minimize.
test:recall
The recall of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision and setting the value for the target_precision hyperparameter. This objective metric is only valid for binary classification. Optimization direction: Maximize.

validation:mse
The mean square error of the final model on the validation dataset. This objective metric is only valid for regression. Optimization direction: Minimize.
validation:rmse
The root mean square error of the final model on the validation dataset. This objective metric is only valid for regression. Optimization direction: Minimize.
positive_example_weight_mult: ContinuousParameterRanges, MinValue: 1e-5, MaxValue: 1e5
Binary Classification

let response = {
    "predictions": [
        {
            "score": 0.4,
            "predicted_label": 0
        }
    ]
}

Multiclass Classification

let response = {
    "predictions": [
        {
            "score": [0.1, 0.2, 0.4, 0.3],
            "predicted_label": 2
        }
    ]
}

Regression

let response = {
    "predictions": [
        {
            "score": 0.4
        }
    ]
}
Multiclass Classification
{"score": [0.1, 0.2, 0.4, 0.3], "predicted_label": 2}
Regression
{"score": 0.4}
Binary Classification

[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]
Multiclass Classification

[
    Record = {
        "features": [],
        "label": {
            "score": {
                "values": [0.1, 0.2, 0.3, 0.4]
            },
            "predicted_label": {
                "values": [3]
            }
        },
        "uid": "abc123",
        "metadata": "{created_at: '2017-06-03'}"
    }
]
Regression

[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }
        }
    }
]
TabTransformer
TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The
TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers
transform the embeddings of categorical features into robust contextual embeddings to achieve higher
prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly
robust against both missing and noisy data features, and provide better interpretability.
You can use TabTransformer as an Amazon SageMaker built-in algorithm. The following section
describes how to use TabTransformer with the SageMaker Python SDK. For information on how to use
TabTransformer from the Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).
Use the TabTransformer built-in algorithm to build a TabTransformer training container as shown in
the following code example. You can automatically locate the TabTransformer built-in algorithm image
URI using the SageMaker image_uris.retrieve API (or the get_image_uri API if using Amazon
SageMaker Python SDK version 1).
After specifying the TabTransformer image URI, you can use the TabTransformer container to construct
an estimator using the SageMaker Estimator API and initiate a training job. The TabTransformer built-
in algorithm runs in script mode, but the training script is provided for you and there is no need to
replace it. If you have extensive experience using script mode to create a SageMaker training job, then
you can incorporate your own TabTransformer training scripts.
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")

# Launch a SageMaker Training job by passing the S3 path of the training data
tabular_estimator.fit(
    {
        "training": training_dataset_s3_path,
        "validation": validation_dataset_s3_path,
    },
    logs=True,
    job_name=training_job_name,
)
For more information about how to set up TabTransformer as a built-in algorithm, see the
following notebook examples.
• Tabular classification with Amazon SageMaker TabTransformer algorithm
• Tabular regression with Amazon SageMaker TabTransformer algorithm
TabTransformer operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.
The SageMaker implementation of TabTransformer supports CSV for training and inference:
Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record.
For CSV inference, the algorithm assumes that CSV input does not have the label column.
Input format for training data, validation data, and categorical features
Be mindful of how to format your training data for input to the TabTransformer model. You must
provide the path to an Amazon S3 bucket that contains your training and validation data. You can also
include a list of categorical features. Use both the training and validation channels to provide your
input data. Alternatively, you can use only the training channel.
You can provide your input data by way of two S3 paths, one for the training channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are
provided for the training or validation channels, the TabTransformer algorithm concatenates the
files. The validation data is used to compute a validation score at the end of each training epoch.
Early stopping is applied when the validation score stops improving.
If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a JSON
file for categorical features, your training channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.
You can alternatively provide your input data by way of a single S3 path for the training channel. This
S3 path should point to a directory with a subdirectory named training/ that contains one or more
CSV files. You can optionally include another subdirectory in the same location called validation/ that
also has one or more CSV files. If the validation data is not provided, then 20% of your training data is
randomly sampled to serve as the validation data. If your predictors include categorical features, you can
provide a JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.
SageMaker TabTransformer supports single-instance CPU and single-instance GPU training. Despite
higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage
of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker
TabTransformer currently does not support multi-GPU training.
The following table outlines a variety of sample notebooks that address different use cases of the
Amazon SageMaker TabTransformer algorithm.

Tabular classification with Amazon SageMaker TabTransformer algorithm
This notebook demonstrates the use of the Amazon SageMaker TabTransformer algorithm to train and host a tabular classification model.

Tabular regression with Amazon SageMaker TabTransformer algorithm
This notebook demonstrates the use of the Amazon SageMaker TabTransformer algorithm to train and host a tabular regression model.
For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.
TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The
TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the
embeddings of categorical features into robust contextual embeddings to achieve higher prediction
accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust
against both missing and noisy data features, and provide better interpretability.
TabTransformer performs well in machine learning competitions because of its robust handling of a
variety of data types, relationships, distributions, and the diversity of hyperparameters that you can
fine-tune. You can use TabTransformer for regression, classification (binary and multiclass), and ranking
problems.
For more information, see TabTransformer: Tabular Data Modeling Using Contextual Embeddings.
TabTransformer hyperparameters
The following table contains the subset of hyperparameters that are required or most commonly
used for the Amazon SageMaker TabTransformer algorithm. Users set these parameters to facilitate
the estimation of model parameters from data. The SageMaker TabTransformer algorithm is an
implementation of the open-source TabTransformer package.
Note
The default hyperparameters are based on example datasets in the TabTransformer sample
notebooks (p. 1364).
The SageMaker TabTransformer algorithm automatically chooses an evaluation metric and objective
function based on the type of classification problem. The TabTransformer algorithm detects the type
of classification problem based on the number of labels in your data. For regression problems, the
evaluation metric is R-squared and the objective function is mean squared error. For binary classification
problems, the evaluation metric and objective function are both binary cross entropy. For multiclass
classification problems, the evaluation metric and objective function are both multiclass cross entropy.
Note
The TabTransformer evaluation metric and objective functions are not currently available as
hyperparameters. Instead, the SageMaker TabTransformer built-in algorithm automatically
detects the type of classification task (regression, binary, or multiclass) based on the number of
unique integers in the label column and assigns an evaluation metric and objective function.
patience The training stops if one metric on the validation data
does not improve over the last patience rounds.
learning_rate The rate at which the model weights are updated after working
through each batch of training examples.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. Model
tuning focuses on the following hyperparameters:
Note
The learning objective function and evaluation metric are both automatically assigned based on
the type of classification task, which is determined by the number of unique integers in the label
column. For more information, see TabTransformer hyperparameters (p. 1367).
Automatic model tuning searches your chosen hyperparameters to find the combination of values that
results in a model that optimizes the chosen evaluation metric.
Note
Automatic model tuning for TabTransformer is only available from the Amazon SageMaker
SDKs, not from the SageMaker console.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The SageMaker TabTransformer algorithm computes the following metrics to use for model validation.
The evaluation metric is automatically assigned based on the type of classification task, which is
determined by the number of unique integers in the label column.
Tune the TabTransformer model with the following hyperparameters. The hyperparameters that
have the greatest effect on optimizing the TabTransformer evaluation metrics are: learning_rate,
input_dim, n_blocks, attn_dropout, mlp_dropout, and frac_shared_embed. For a list of all the
TabTransformer hyperparameters, see TabTransformer hyperparameters (p. 1367).
XGBoost Algorithm
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation
of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that
attempts to accurately predict a target variable by combining an ensemble of estimates from a set of
simpler and weaker models. The XGBoost algorithm performs well in machine learning competitions
because of its robust handling of a variety of data types, relationships, distributions, and the variety of
hyperparameters that you can fine-tune. You can use XGBoost for regression, classification (binary and
multiclass), and ranking problems.
You can use the new release of the XGBoost algorithm either as an Amazon SageMaker built-in algorithm
or as a framework to run training scripts in your local environments. This implementation has a smaller
memory footprint, better logging, improved hyperparameter validation, and a larger set of metrics
than the original versions. It provides an XGBoost estimator that executes a training script in a
managed XGBoost environment. The current release of SageMaker XGBoost is based on the original
XGBoost versions 1.0, 1.2, 1.3, 1.5, and 1.7.
Supported versions
• Framework (open source) mode: 1.0-1, 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1
• Algorithm mode: 1.0-1, 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1
Warning
Due to required compute capacity, version 1.7-1 of SageMaker XGBoost is not compatible with
GPU instances from the P2 instance family for training or inference.
Important
When you retrieve the SageMaker XGBoost image URI, do not use :latest or :1 for the image
URI tag. You must specify one of the Supported versions (p. 1370) to choose the SageMaker-
managed XGBoost container with the native XGBoost package version that you want to use. To
find the package version migrated into the SageMaker XGBoost containers, see Docker Registry
Paths and Example Code, choose your AWS Region, and navigate to the XGBoost (algorithm)
section.
Warning
The XGBoost 0.90 versions are deprecated. Support for security updates and bug fixes for
XGBoost 0.90 has been discontinued. We highly recommend that you upgrade to one
of the newer XGBoost versions.
Note
XGBoost v1.1 is not supported on SageMaker because XGBoost 1.1 has a broken capability to
run prediction when the test input has fewer features than the training data in LIBSVM inputs.
This capability has been restored in XGBoost v1.2. Consider using SageMaker XGBoost 1.2-2 or
later.
With SageMaker, you can use XGBoost as a built-in algorithm or framework. By using XGBoost as a
framework, you have more flexibility and access to more advanced scenarios, such as k-fold cross-
validation, because you can customize your own training scripts. The following sections describe how to
use XGBoost with the SageMaker Python SDK. For information on how to use XGBoost from the Amazon
SageMaker Studio UI, see SageMaker JumpStart (p. 47).
Use XGBoost as a framework to run your customized training scripts that can incorporate additional
data processing into your training jobs. The following code example shows how the SageMaker
Python SDK provides the XGBoost API as a framework, in the same way it provides other framework
APIs, such as TensorFlow, MXNet, and PyTorch.
import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
# initialize hyperparameters
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "verbosity": "1",
    "objective": "reg:squarederror",
    "num_round": "50",
}

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
                            content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'),
                                 content_type=content_type)
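The snippet above stops short of constructing the estimator. A sketch of the remaining framework-mode steps follows; the entry-point script name is a placeholder for your own training script:

xgb_script_mode_estimator = XGBoost(
    entry_point="your_xgboost_script.py",  # placeholder training script
    framework_version="1.7-1",
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.2xlarge",
)
xgb_script_mode_estimator.fit({"train": train_input, "validation": validation_input})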
For an end-to-end example of using SageMaker XGBoost as a framework, see Regression with Amazon
SageMaker XGBoost.
• Use XGBoost as a built-in algorithm
Use the XGBoost built-in algorithm to build an XGBoost training container as shown in the following
code example. You can automatically locate the XGBoost built-in algorithm image URI using the
SageMaker image_uris.retrieve API (or the get_image_uri API if using Amazon SageMaker
Python SDK version 1). To ensure that the image_uris.retrieve API finds the correct URI,
see Common parameters for built-in algorithms and look up xgboost from the full list of built-in
algorithm image URIs and available regions.
After specifying the XGBoost image URI, you can use the XGBoost container to construct an estimator
using the SageMaker Estimator API and initiate a training job. This XGBoost built-in algorithm mode
does not incorporate your own XGBoost training script and runs directly on the input datasets.
Important
When you retrieve the SageMaker XGBoost image URI, do not use :latest or :1 for the
image URI tag. You must specify one of the Supported versions (p. 1370) to choose the
SageMaker-managed XGBoost container with the native XGBoost package version that you
want to use. To find the package version migrated into the SageMaker XGBoost containers,
see Docker Registry Paths and Example Code, choose your AWS Region, and navigate to the
XGBoost (algorithm) section.
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
# initialize hyperparameters
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
}

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
                            content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'),
                                 content_type=content_type)
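To complete the built-in-algorithm flow, a sketch of constructing the estimator and starting training follows; output_path is assumed to be an S3 location you have defined:

estimator = sagemaker.estimator.Estimator(
    image_uri=xgboost_container,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    volume_size=5,  # 5 GB of EBS storage
    output_path=output_path,  # assumed S3 output location
)
estimator.fit({"train": train_input, "validation": validation_input})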
For more information about how to set up XGBoost as a built-in algorithm, see the following
notebook examples.
• Managed Spot Training for XGBoost
• Regression with Amazon SageMaker XGBoost (Parquet input)
Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.
The SageMaker implementation of XGBoost supports the following data formats for training and
inference:
• text/libsvm (default)
• text/csv
• application/x-parquet
• application/x-recordio-protobuf
Note
There are a few considerations to be aware of regarding training and inference input:
• For training with columnar input, the algorithm assumes that the target variable (label) is the
first column. For inference, the algorithm assumes that the input has no label column.
• For CSV data, the input should not have a header record.
• For LIBSVM training, the algorithm assumes that subsequent columns after the label column
contain the zero-based index-value pairs for features, so each row has the format: <label>
<index0>:<value0> <index1>:<value1> ...
• For information on instance types and distributed training, see EC2 Instance Recommendation
for the XGBoost Algorithm (p. 1373).
For CSV training input mode, the total memory available to the algorithm (Instance Count * the memory
available in the InstanceType) must be able to hold the training dataset. For libsvm training input
mode, it's not required, but we recommend it.
For v1.3-1 and later, SageMaker XGBoost saves the model in the XGBoost internal binary format, using
Booster.save_model. Previous versions use the Python pickle module to serialize/deserialize the
model.
Note
Be mindful of versions when using a SageMaker XGBoost model in open source XGBoost.
Versions 1.3-1 and later use the XGBoost internal binary format while previous versions use the
Python pickle module.
To use a model trained with SageMaker XGBoost v1.3-1 or later in open source XGBoost
import xgboost as xgb

xgb_model = xgb.Booster()
xgb_model.load_model(model_file_path)  # path to the extracted model file
xgb_model.predict(dtest)  # dtest is an xgboost.DMatrix of test data
To use a model trained with previous versions of SageMaker XGBoost in open source XGBoost
import pickle as pkl
import tarfile

t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()
model = pkl.load(open(model_file_path, 'rb'))  # previous versions pickle the model
Use instance weight supports to differentiate the importance of labeled data points
• SageMaker XGBoost allows customers to differentiate the importance of labeled data points
by assigning each instance a weight value. For text/libsvm input, customers can assign weight
values to data instances by attaching them after the labels. For example, label:weight
idx_0:val_0 idx_1:val_1.... For text/csv input, customers need to turn on the csv_weights
flag in the parameters and attach weight values in the column after labels. For example:
label,weight,val_0,val_1,...
SageMaker XGBoost supports CPU and GPU training and inference. Instance recommendations depend
on training and inference needs, as well as the version of the XGBoost algorithm. Choose one of the
following options for more information:
Training
CPU training
SageMaker XGBoost 1.0-1 or earlier only trains using CPUs. It is a memory-bound (as opposed to
compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice
than a compute-optimized instance (for example, C4). Further, we recommend that you have enough
total memory in selected instances to hold the training data. Although it supports the use of disk space
to handle data that does not fit into main memory (the out-of-core feature available with the libsvm
input mode), writing cache files onto disk slows the algorithm processing time.
GPU training
SageMaker XGBoost version 1.2-2 or later supports GPU training. Despite higher per-instance costs,
GPUs train more quickly, making them more cost effective.
SageMaker XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.
SageMaker XGBoost version 1.7-1 or later supports P3, G4dn, and G5 GPU instance families. Note that
due to compute capacity requirements, version 1.7-1 or later does not support the P2 instance family.
To take advantage of GPU training, specify the instance type as one of the GPU instances (for example,
P3) and set the tree_method hyperparameter to gpu_hist in your existing XGBoost script.
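For example, a minimal sketch, reusing the hyperparameters dictionary from the earlier built-in-algorithm example (region and output_path are assumed to be defined):

# Enable the GPU-accelerated histogram tree method.
hyperparameters["tree_method"] = "gpu_hist"

estimator = sagemaker.estimator.Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region, "1.7-1"),
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # a GPU instance
    output_path=output_path,        # assumed S3 output location
)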
Distributed training
SageMaker XGBoost supports CPU and GPU instances for distributed training.
To run CPU training on multiple instances, set the instance_count parameter for the estimator to a
value greater than one. The input data must be divided between the total number of instances.
1. Break the input data down into smaller files. The number of files should be at least equal to the
number of instances used for distributed training. Using multiple smaller files as opposed to one
large file also decreases the data download time for the training job.
2. When creating your TrainingInput, set the distribution parameter to ShardedByS3Key. This
parameter ensures that each instance gets approximately 1/n of the number of files in S3 if there
are n instances specified in the training job, as shown in the sketch after this list.
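A sketch of such a TrainingInput (the S3 prefix is a placeholder):

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    "s3://my-bucket/xgboost/train/",  # placeholder prefix holding the split files
    content_type="text/csv",
    distribution="ShardedByS3Key",    # each instance receives ~1/n of the files
)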
You can use distributed training with either single-GPU or multi-GPU instances.
SageMaker XGBoost versions 1.2-2 through 1.3-1 only support single-GPU instance training. This means
that even if you select a multi-GPU instance, only one GPU is used per instance.
If you use XGBoost versions 1.2-2 through 1.3-1, or if you do not need to use multi-GPU instances, then
you must divide your input data between the total number of instances. For more information, see Divide
input data across instances (p. 1374).
Note
Versions 1.2-2 through 1.3-1 of SageMaker XGBoost only use one GPU per instance even if you
choose a multi-GPU instance.
Starting with version 1.5-1, SageMaker XGBoost offers distributed GPU training with Dask. With Dask you
can utilize all GPUs when using one or more multi-GPU instances. Dask also works when using single-
GPU instances.
Important
Distributed training with Dask only supports CSV and Parquet input formats. If you use other
data formats such as LIBSVM or PROTOBUF, the training job fails.
For Parquet data, ensure that the column names are saved as strings. Columns that have names
of other data types will fail to load.
Important
Distributed training with Dask does not support pipe mode. If pipe mode is specified, the
training job fails.
There are a few considerations to be aware of when training SageMaker XGBoost with Dask. Be sure to
split your data into smaller files. Dask reads each Parquet file as a partition. There is a Dask worker for
every GPU, so the number of files should be greater than the total number of GPUs (instance count *
number of GPUs per instance). Having a very large number of files can also degrade performance. For
more information, see Dask Best Practices.
Variations in output
The specified tree_method hyperparameter determines the algorithm that is used for XGBoost training.
The tree methods approx, hist and gpu_hist are all approximate methods and use sketching for
quantile calculation. For more information, see Tree Methods in the XGBoost documentation. Sketching
is an approximate algorithm. Therefore, you can expect variations in the model depending on factors
such as the number of workers chosen for distributed training. The significance of the variation is data-
dependent.
Inference
SageMaker XGBoost supports CPU and GPU instances for inference. For information about the instance
types for inference, see Amazon SageMaker ML Instance Types.
The following table outlines a variety of sample notebooks that address different use cases of the
Amazon SageMaker XGBoost algorithm.

How to Create a Custom XGBoost container?
This notebook shows you how to build a custom XGBoost container with Amazon SageMaker Batch Transform.

Regression with XGBoost using Parquet
This notebook shows you how to use the Abalone dataset in Parquet format to train an XGBoost model.

How to Train and Host a Multiclass Classification Model?
This notebook shows how to use the MNIST dataset to train and host a multiclass classification model.

How to train a Model for Customer Churn Prediction?
This notebook shows you how to train a model to Predict Mobile Customer Departure in an effort to identify unhappy customers.

An Introduction to Amazon SageMaker Managed Spot infrastructure for XGBoost Training
This notebook shows you how to use Spot Instances for training with an XGBoost container.

How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs?
This notebook shows you how to use Amazon SageMaker Debugger to monitor training jobs to detect inconsistencies using built-in debugging rules.

How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs in Real-Time?
This notebook shows you how to use the MNIST dataset and Amazon SageMaker Debugger to perform real-time analysis of XGBoost training jobs while training jobs are running.
For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. The example notebooks that use the XGBoost algorithm are
located in the Introduction to Amazon algorithms section. To open a notebook, choose its Use tab and
choose Create copy.
XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target
variable by combining the estimates of a set of simpler, weaker models.
When using gradient boosting for regression, the weak learners are regression trees, and each regression
tree maps an input data point to one of its leaves, which contains a continuous score. XGBoost minimizes a
regularized (L1 and L2) objective function that combines a convex loss function (based on the difference
between the predicted and target outputs) and a penalty term for model complexity (in other words, the
regression tree functions). The training proceeds iteratively, adding new trees that predict the residuals
or errors of prior trees that are then combined with previous trees to make the final prediction. It's called
gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new
models.
XGBoost Hyperparameters
The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker XGBoost algorithm. These are parameters that are set by users to facilitate
the estimation of model parameters from data. The required hyperparameters that must be set are
listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in
alphabetical order. The SageMaker XGBoost algorithm is an implementation of the open-source DMLC
XGBoost package. For details about the full set of hyperparameters that can be configured for this version of
XGBoost, see XGBoost Parameters.
booster Which booster to use. The gbtree and dart values use a tree-
based model, while gblinear uses a linear function.
Optional
grow_policy Controls the way that new nodes are added to the tree. Currently
supported only if tree_method is set to hist.
Optional
max_delta_step Maximum delta step allowed for each tree's weight estimation.
When a positive integer is used, it helps make the update more
conservative. The preferred option is to use it in logistic regression.
Set it to 1-10 to help control the update.
Optional
Default value: 0
max_depth Maximum depth of a tree. Increasing this value makes the model
more complex and likely to be overfit. 0 indicates no limit. A limit is
required when grow_policy=depth-wise.
Optional
Default value: 6
one_drop When this flag is enabled, at least one tree is always dropped
during the dropout.
Optional
Valid values: 0 or 1
Default value: 0
rate_drop The dropout rate that specifies the fraction of previous trees to
drop during the dropout.
Optional
scale_pos_weight Controls the balance of positive and negative weights. It's useful
for unbalanced classes. A typical value to consider: sum(negative
cases) / sum(positive cases).
Optional
Default value: 1
single_precision_histogram When this flag is enabled, XGBoost uses single precision to build
histograms instead of double precision. Used only if tree_method
is set to hist or gpu_hist.
Optional
sketch_eps Used only for the approximate greedy algorithm. This translates into
O(1 / sketch_eps) number of bins. Compared to directly selecting the
number of bins, this comes with a theoretical guarantee of sketch
accuracy.
Optional
You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic
model tuning searches the chosen hyperparameters to find the combination of values that results in the
model that optimizes the evaluation metric.
Note
Automatic model tuning for XGBoost 0.90 is only available from the Amazon SageMaker SDKs,
not from the SageMaker console.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the
greatest effect on optimizing the XGBoost evaluation metrics are: alpha, min_child_weight,
subsample, eta, and num_round.
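A sketch of launching such a tuning job with the SageMaker Python SDK follows; the objective metric (validation:rmse) and the ranges are illustrative assumptions, and estimator, train_input, and validation_input are presumed to be defined elsewhere:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                      # an XGBoost Estimator defined elsewhere
    objective_metric_name="validation:rmse",  # any evaluation metric the algorithm emits
    objective_type="Minimize",
    hyperparameter_ranges={                   # the high-impact hyperparameters named above
        "alpha": ContinuousParameter(0, 100),
        "min_child_weight": ContinuousParameter(1, 10),
        "subsample": ContinuousParameter(0.5, 1),
        "eta": ContinuousParameter(0.1, 0.5),
        "num_round": IntegerParameter(50, 400),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": train_input, "validation": validation_input})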
This topic contains documentation for previous versions of Amazon SageMaker XGBoost that are
still available but deprecated. It also provides instructions on how to upgrade deprecated versions of
XGBoost, when possible, to more current versions.
Topics
• Upgrade XGBoost Version 0.90 to Version 1.5 (p. 1387)
• XGBoost Version 0.72 (p. 1388)
If you are using the SageMaker Python SDK, to upgrade existing XGBoost 0.90 jobs to version 1.5, you
must have version 2.x of the SDK installed and change the XGBoost version and framework_version
parameters to 1.5-1. If you are using Boto3, you need to update the Docker image and a few
hyperparameters and learning objectives.
Topics
• Upgrade SageMaker Python SDK Version 1.x to Version 2.x (p. 1387)
• Change the image tag to 1.5-1 (p. 1387)
• Change Docker Image for Boto3 (p. 1388)
• Update Hyperparameters and Learning Objectives (p. 1388)
If you are still using version 1.x of the SageMaker Python SDK, you must upgrade to version 2.x of the
SageMaker Python SDK. For information on the latest version of the SageMaker Python SDK, see Use
Version 2.x of the SageMaker Python SDK. To install the latest version, run:
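pip install --upgrade sagemaker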
If you are using the SageMaker Python SDK and using the XGBoost built-in algorithm, change the
version parameter in image_uris.retrieve.
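For example (the region shown is an assumption; use your own):

import sagemaker

xgboost_container = sagemaker.image_uris.retrieve("xgboost", region="us-west-2", version="1.5-1")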
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m5.2xlarge',
                                          volume_size=5,  # 5 GB
                                          output_path=output_path)
If you are using the SageMaker Python SDK and using XGBoost as a framework to run your customized
training scripts, change the framework_version parameter in the XGBoost API.
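A minimal sketch of the framework usage; the training script name (abalone.py) is a hypothetical placeholder:

from sagemaker.xgboost.estimator import XGBoost

xgb_script_mode_estimator = XGBoost(
    entry_point="abalone.py",        # your custom training script
    framework_version="1.5-1",       # the upgraded version tag
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.2xlarge",
)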
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
                            content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'),
                                 content_type=content_type)
For the full list of SageMaker Python SDK version 2.x changes, see Use Version 2.x of the SageMaker
Python SDK.
If you are using Boto3 to train or deploy your model, change the Docker image tag (1, 0.72, 0.90-1, or
0.90-2) to 1.5-1.
{
    "AlgorithmSpecification": {
        "TrainingImage": "746614075791.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:1.5-1"
    }
    ...
}
If you are using the SageMaker Python SDK to retrieve the registry path, change the version parameter
in image_uris.retrieve.
The silent parameter has been deprecated and is no longer available in XGBoost 1.5 and later versions.
Use verbosity instead. The reg:linear learning objective has also been deprecated; use
reg:squarederror instead.
hyperparameters = {
    "verbosity": "2",
    "objective": "reg:squarederror",
    "num_round": "50",
    ...
}

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          ...)
# SageMaker Python SDK version 1.x (deprecated):
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

# SageMaker Python SDK version 2.x:
import boto3
from sagemaker import image_uris
If you want to use newer versions, you have to explicitly specify the image URI tags (see
Supported versions (p. 1370)).
This previous release of the Amazon SageMaker XGBoost algorithm is based on the 0.72 release.
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of
the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that
attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker
models. XGBoost has done remarkably well in machine learning competitions because it robustly
handles a variety of data types, relationships, and distributions, and because of the large number of
hyperparameters that can be tweaked and tuned for improved fits. This flexibility makes XGBoost a solid
choice for problems in regression, classification (binary and multiclass), and ranking.
Customers should consider using the new release of XGBoost Algorithm (p. 1369). They can use it as a
SageMaker built-in algorithm or as a framework to run scripts in their local environments, as they
typically would with, for example, the TensorFlow deep learning framework. The new implementation
has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded
set of metrics. The earlier implementation of XGBoost remains available to customers if they need to
postpone migrating to the new version, but this previous implementation remains tied to the 0.72
release of XGBoost.
Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.
The SageMaker implementation of XGBoost supports CSV and libsvm formats for training and inference:
Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input
does not have the label column.
For libsvm training, the algorithm assumes that the label is in the first column. Subsequent
columns contain the zero-based index value pairs for features. So each row has the format:
<label> <index0>:<value0> <index1>:<value1> ... Inference requests for libsvm may or may not
have labels in the libsvm format.
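For illustration, two hypothetical libsvm training rows (indices and values invented):

1 0:0.52 3:1.25 7:0.3
0 1:0.94 4:2.1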
This differs from other SageMaker algorithms, which use the protobuf training input format; XGBoost
uses CSV and libsvm to maintain greater consistency with standard XGBoost data formats.
For CSV training input mode, the total memory available to the algorithm (Instance Count * the memory
available in the InstanceType) must be able to hold the training dataset. For libsvm training input
mode, it's not required, but we recommend it.
SageMaker XGBoost uses the Python pickle module to serialize/deserialize the model, which can be used
for saving/loading the model.
import tarfile

t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()
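As a sketch of the pickle round trip, assuming the extracted archive contains a file named xgboost-model (check the actual contents of your artifact):

import pickle as pkl

with open('xgboost-model', 'rb') as f:   # file extracted from model.tar.gz
    booster = pkl.load(f)                # the deserialized XGBoost model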
Instance Weight Support
• SageMaker XGBoost allows customers to differentiate the importance of labelled data points
by assigning each instance a weight value. For text/libsvm input, customers can assign weight
values to data instances by attaching them after the labels. For example, label:weight
idx_0:val_0 idx_1:val_1.... For text/csv input, customers need to turn on the csv_weights
flag in the parameters and attach weight values in the column after labels. For example:
label,weight,val_0,val_1,...
SageMaker XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-
bound) algorithm. So, a general-purpose compute instance (for example, M4) is a better choice than
a compute-optimized instance (for example, C4). Further, we recommend that you have enough total
memory in selected instances to hold the training data. Although it supports the use of disk space to
handle data that does not fit into main memory (the out-of-core feature available with the libsvm input
mode), writing cache files onto disk slows the algorithm processing time.
For a sample notebook that shows how to use the latest version of SageMaker XGBoost as a built-
in algorithm to train and host a regression model, see Regression with Amazon SageMaker XGBoost
algorithm. To use the 0.72 version of XGBoost, you need to change the version in the sample code to
0.72. For instructions on how to create and access Jupyter notebook instances that you can use to run
the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you have created
a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. The example notebooks that use the XGBoost algorithm are located in the Introduction to
Amazon algorithms section. To open a notebook, click on its Use tab and select Create copy.
The following table contains the hyperparameters for the XGBoost algorithm. These are parameters
that are set by users to facilitate the estimation of model parameters from data. The required
hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters
that can be set are listed next, also in alphabetical order. The SageMaker XGBoost algorithm is an
implementation of the open-source XGBoost package. Currently SageMaker supports version 0.72. For
more detail about hyperparameter configuration for this version of XGBoost, see XGBoost Parameters.
booster
Which booster to use. The gbtree and dart values use a tree-based model, while gblinear uses a linear function.
Optional

grow_policy
Controls the way that new nodes are added to the tree. Currently supported only if tree_method is set to hist.
Optional

max_delta_step
Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.
Optional
Default value: 0

max_depth
Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfit. 0 indicates no limit. A limit is required when grow_policy=depth-wise.
Optional
Default value: 6

one_drop
When this flag is enabled, at least one tree is always dropped during the dropout.
Optional
Valid values: 0 or 1
Default value: 0

rate_drop
The dropout rate that specifies the fraction of previous trees to drop during the dropout.
Optional

scale_pos_weight
Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases).
Optional
Default value: 1

sketch_eps
Used only for the approximate greedy algorithm. It translates into O(1 / sketch_eps) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy.
Optional
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. You
choose the tunable hyperparameters, a range of values for each, and an objective metric.
You choose the evaluation metric from the set of evaluation metrics that the algorithm computes.
Automatic model tuning searches the chosen hyperparameters to find the combination of values that
results in the model that optimizes the evaluation metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The XGBoost algorithm based on version 0.72 computes the following nine metrics to use for model
validation. When tuning the model, choose one of these metrics to evaluate the model. For the full list
of valid eval_metric values, refer to XGBoost Learning Task Parameters.
Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the
greatest effect on optimizing the XGBoost evaluation metrics are: alpha, min_child_weight,
subsample, eta, and num_round.
• BlazingText algorithm (p. 1399)—a highly optimized implementation of the Word2vec and text
classification algorithms that scale to large datasets easily. It is useful for many downstream natural
language processing (NLP) tasks.
• Latent Dirichlet Allocation (LDA) Algorithm (p. 1409)—an algorithm suitable for determining topics in
a set of documents. It is an unsupervised algorithm, which means that it doesn't use example data with
answers during training.
• Neural Topic Model (NTM) Algorithm (p. 1415)—another unsupervised technique for determining
topics in a set of documents, using a neural network approach.
• Object2Vec Algorithm (p. 1421)—a general-purpose neural embedding algorithm that can be used for
recommendation systems, document classification, and sentence embeddings.
• Sequence-to-Sequence Algorithm (p. 1437)—a supervised algorithm commonly used for neural
machine translation.
• Text Classification - TensorFlow (p. 1450)—a supervised algorithm that supports transfer learning with
available pretrained models for text classification.
Neural Topic Model – Channels: train and (optionally) validation, test, or both; Training input mode: File
or Pipe; File type: recordIO-protobuf or CSV; Instance class: GPU or CPU; Parallelizable: Yes
BlazingText algorithm
The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the
Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream
natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine
translation, etc. Text classification is an important task for applications that perform web searches,
information retrieval, ranking, and document classification.
The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector
representation of a word is called a word embedding. Words that are semantically similar correspond to
vectors that are close together. That way, word embeddings capture the semantic relationships between
words.
Many natural language processing (NLP) applications learn word embeddings by training on large
collections of documents. These pretrained vector representations provide information about semantics
and word distributions that typically improves the generalizability of other models that are later trained
on a more limited amount of data. Most implementations of the Word2vec algorithm are not optimized
for multi-core CPU architectures. This makes it difficult to scale to large datasets.
With the BlazingText algorithm, you can scale to large datasets easily. Similar to Word2vec, it
provides the Skip-gram and continuous bag-of-words (CBOW) training architectures. BlazingText's
implementation of the supervised multi-class, multi-label text classification algorithm extends the
fastText text classifier to use GPU acceleration with custom CUDA kernels. You can train a model on
more than a billion words in a couple of minutes using a multi-core CPU or a GPU. And, you achieve
performance on par with the state-of-the-art deep learning text classification algorithms.
The BlazingText algorithm is not parallelizable. For more information on parameters related to training,
see Docker Registry Paths for SageMaker Built-in Algorithms.
• Accelerated training of the fastText text classifier on multi-core CPUs or a GPU and Word2Vec on GPUs
using highly optimized CUDA kernels. For more information, see BlazingText: Scaling and Accelerating
Word2Vec using Multiple GPUs.
• Enriched Word Vectors with Subword Information by learning vector representations for character n-
grams. This approach enables BlazingText to generate meaningful vectors for out-of-vocabulary (OOV)
words by representing their vectors as the sum of the character n-gram (subword) vectors.
• A batch_skipgram mode for the Word2Vec algorithm that allows faster training and distributed
computation across multiple CPU nodes. The batch_skipgram mode does mini-batching using the
Negative Sample Sharing strategy to convert level-1 BLAS operations into level-3 BLAS operations.
This efficiently leverages the multiply-add instructions of modern architectures. For more information,
see Parallelizing Word2Vec in Shared and Distributed Memory.
To summarize, the following modes are supported by BlazingText on different types of instances:
• Skip-gram
• Batch Skip-gram
For more information about the mathematics behind BlazingText, see BlazingText: Scaling and
Accelerating Word2Vec using Multiple GPUs.
Topics
• Input/Output Interface for the BlazingText Algorithm (p. 1401)
• EC2 Instance Recommendation for the BlazingText Algorithm (p. 1403)
• BlazingText Sample Notebooks (p. 1404)
• BlazingText Hyperparameters (p. 1404)
Training and Validation Data Format for the Text Classification Algorithm
For supervised mode, you can train with file mode or with the augmented manifest text format.
__label__4 linux ready for prime time , intel says , despite all the linux hype , the
open-source movement has yet to make a huge splash in the desktop market . that may be
about to change , thanks to chipmaking giant intel corp .
__label__2 bowled by the slower one again , kolkata , november 14 the past caught up with
sourav ganguly as the indian skippers return to international cricket was short lived .
Note
The order of labels within the sentence doesn't matter.
Upload the training file under the train channel, and optionally upload the validation file under the
validation channel.
{"source":"linux ready for prime time , intel says , despite all the linux hype",
"label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with
sourav ganguly", "label":2}
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":
[1, 3]}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with
sourav ganguly", "label": [2, 4, 5]}
For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).
If the evaluation parameter is set to True, an additional file, eval.json, is created. This file contains the
similarity evaluation results (using Spearman's rank correlation coefficients) on the WS-353 dataset. The
number of words from the WS-353 dataset that are not present in the training corpus is also reported.
For inference requests, the model accepts a JSON file containing a list of strings and returns a list of
vectors. If the word is not found in vocabulary, inference returns a vector of zeros. If subwords is set to
True during training, the model is able to generate vectors for out-of-vocabulary (OOV) words.
{
"instances": ["word1", "word2", "word3"]
}
{
"instances": ["the movie was excellent", "i did not like the plot ."]
}
By default, the server returns only one prediction, the one with the highest probability. For retrieving the
top k predictions, you can set k in the configuration, as follows:
{
"instances": ["the movie was excellent", "i did not like the plot ."],
"configuration": {"k": 2}
}
For BlazingText, the content-type and accept parameters must be equal. For batch transform, they
both need to be application/jsonlines. If they differ, the Accept field is ignored. The format for
input follows:
content-type: application/jsonlines
{"source": "source_0"}
{"source": "source_1"}
If you need to pass the value of k for top-k, then you can do it in the following way:
accept: application/jsonlines
If you have passed a value of k greater than 1, then the response will be in this format:
For both supervised (text classification) and unsupervised (Word2Vec) modes, the binaries (*.bin)
produced by BlazingText can be cross-consumed by fastText, and vice versa. You can use binaries
produced by BlazingText with fastText. Likewise, you can host the model binaries created with fastText
using BlazingText.
Here is an example of how to use a model generated with BlazingText with fastText:
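A sketch, assuming the training job wrote its artifact to Amazon S3 (bucket, prefix, and validation file are hypothetical placeholders):

# Download the model artifact to your local machine and unpack it
aws s3 cp s3://<bucket>/<prefix>/model.tar.gz model.tar.gz
tar -xzvf model.tar.gz

# Evaluate the extracted binary with the fastText CLI
fasttext test ./model.bin validation_dataset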
However, the binaries are only supported when training on CPU and single GPU; training on multi-GPU
will not produce binaries.
For more details on dataset formats and model hosting, see the example notebooks Text Classification
with the BlazingText Algorithm, FastText Models, and Generating Subword Embeddings with the
Word2Vec Algorithm.
For batch_skipgram mode, BlazingText supports single or multiple CPU instances. When training on
multiple instances, set the value of the S3DataDistributionType field of the S3DataSource object
that you pass to CreateTrainingJob to FullyReplicated. BlazingText takes care of distributing
data across machines.
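A sketch of the relevant portion of the CreateTrainingJob request (bucket and prefix are hypothetical):

"InputDataConfig": [
    {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/<prefix>/train",
                "S3DataDistributionType": "FullyReplicated"
            }
        }
    }
]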
For the supervised text classification mode, a C5 instance is recommended if the training dataset is less
than 2 GB. For larger datasets, use an instance with a single GPU. BlazingText supports P2, P3, G4dn, and
G5 instances for training and inference.
BlazingText Hyperparameters
When you start a training job with a CreateTrainingJob request, you specify a training algorithm.
You can also specify algorithm-specific hyperparameters as string-to-string maps. The hyperparameters
for the BlazingText algorithm depend on which mode you use: Word2Vec (unsupervised) and Text
Classification (supervised).
Word2Vec Hyperparameters
The following table lists the hyperparameters for the BlazingText Word2Vec training algorithm provided
by Amazon SageMaker.
batch_size
The size of each batch when mode is set to batch_skipgram. Set to a number between 10 and 20.
Optional
Default value: 11
min_count
Words that appear less than min_count times are discarded.
Optional
Default value: 5

negative_samples
The number of negative samples for the negative sample sharing strategy.
Optional
Default value: 5

sampling_threshold
The threshold for the occurrence of words. Words that appear with higher frequency in the training data are randomly down-sampled.
Optional

vector_dim
The dimension of the word vectors that the algorithm learns.
Optional

window_size
The size of the context window. The context window is the number of words surrounding the target word used for training.
Optional
Default value: 5
The following table lists the hyperparameters for the Text Classification training algorithm provided by
Amazon SageMaker.
Note
Although some of the parameters are common between the Text Classification and Word2Vec
modes, they might have different meanings depending on the context.
min_count
Words that appear less than min_count times are discarded.
Optional
Default value: 5

min_epochs
The minimum number of epochs to train before early stopping logic is invoked.
Optional
Default value: 5
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The BlazingText Text Classification algorithm (supervised mode) also reports a single metric during
training: validation:accuracy. When tuning the hyperparameter values for the text classification
algorithm, use this metric as the objective.
The exact content of two documents with similar topic mixtures will not be the same. But overall, you
would expect these documents to use a shared subset of words more frequently than documents from
a different topic mixture. This allows LDA to discover these word groups and use them to form topics. As
an extremely simple example, given a set of documents where the only words that occur within them
are: eat, sleep, play, meow, and bark, LDA might produce topics like the following:
You can infer that documents that are more likely to fall into Topic 1 are about cats (who are more likely
to meow and sleep), and documents that fall into Topic 2 are about dogs (who prefer to play and bark).
These topics can be found even though the words dog and cat never appear in any of the texts.
Topics
• Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM) (p. 1410)
• Input/Output Interface for the LDA Algorithm (p. 1410)
• EC2 Instance Recommendation for the LDA Algorithm (p. 1411)
• LDA Sample Notebooks (p. 1411)
• How LDA Works (p. 1411)
• LDA Hyperparameters (p. 1413)
• Tune an LDA Model (p. 1414)
Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)
Topic models are commonly used to produce topics from corpuses that (1) coherently encapsulate
semantic meaning and (2) describe documents well. As such, topic models aim to minimize perplexity
and maximize topic coherence.
Perplexity is an intrinsic language modeling evaluation metric that measures the inverse of the
geometric mean per-word likelihood in your test data. A lower perplexity score indicates better
generalization performance. Research has shown that the likelihood computed per word often does
not align with human judgement and can be entirely uncorrelated, so topic coherence has been
introduced. Each inferred topic from your model consists of words, and topic coherence is computed
on the top N words for that particular topic from your model. It is often defined as the average or
median of the pairwise word-similarity scores of the words in that topic, for example, Pointwise Mutual
Information (PMI). A promising model generates coherent topics, that is, topics with high topic
coherence scores.
While the objective is to train a topic model that minimizes perplexity and maximizes topic coherence,
there is often a tradeoff between the two with both LDA and NTM. Recent research by Amazon (Ding et
al., 2018) has shown that NTM is promising for achieving high topic coherence, but LDA trained with
collapsed Gibbs sampling achieves better perplexity. From a practical standpoint regarding hardware
and compute power, SageMaker NTM is more flexible than LDA and can scale better, because NTM can
run on CPU and GPU and can be parallelized across multiple GPU instances, whereas LDA only supports
single-instance CPU training.
LDA expects data to be provided on the train channel, and optionally supports a test channel, which
is scored by the final model. LDA supports both recordIO-wrapped-protobuf (dense and sparse)
and CSV file formats. For CSV, the data must be dense and have dimension equal to number of records *
vocabulary size. LDA can be trained in File or Pipe mode when using recordIO-wrapped protobuf, but only
in File mode for the CSV format.
Please see the LDA Sample Notebooks (p. 1411) for more detail on training and inference formats.
LDA currently only supports single-instance CPU training. CPU instances are recommended for hosting/
inference.
For a sample notebook that shows how to train the SageMaker Latent Dirichlet Allocation algorithm
on a dataset and then how to deploy the trained model to perform inferences about the topic mixtures
in input documents, see An Introduction to SageMaker LDA. For instructions on how to create and
access Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon
SageMaker Notebook Instances (p. 204). Once you have created a notebook instance and opened it,
select the SageMaker Examples tab to see a list of all the SageMaker samples. The topic modeling
example notebooks using the LDA algorithm are located in the Introduction to Amazon algorithms
section. To open a notebook, click on its Use tab and select Create copy.
Amazon SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of
observations as a mixture of different categories. These categories are themselves a probability
distribution over the features. LDA is a generative probability model, which means it attempts to
provide a model for the distribution of outputs and inputs based on latent variables. This is opposed to
discriminative models, which attempt to learn how inputs map to outputs.
You can use LDA for a variety of tasks, from clustering customers based on product purchases to
automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in
text corpuses. Observations are referred to as documents. The feature set is referred to as vocabulary. A
feature is referred to as a word. And the resulting categories are referred to as topics.
Note
Lemmatization significantly increases algorithm performance and accuracy. Consider pre-
processing any input text data.
• α—A prior estimate on topic probability (in other words, the average frequency that each topic within
a given document occurs).
• β—A collection of k topics where each topic is given a probability distribution over the vocabulary used
in a document corpus, also called a "topic-word distribution."
LDA is a "bag-of-words" model, which means that the order of words does not matter. LDA is a
generative model where each document is generated word-by-word by choosing a topic mixture θ ∼
Dirichlet(α).
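Spelled out, this implies the standard LDA generative steps for each word in a document:

θ ∼ Dirichlet(α)        (the document's topic mixture)
z ∼ Multinomial(θ)      (a topic drawn for the current word)
w ∼ Multinomial(β_z)    (the word, drawn from that topic's word distribution)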
When training the model, the goal is to find parameters α and β, which maximize the probability that the
text corpus is generated by the model.
The most popular methods for estimating the LDA model use Gibbs sampling or Expectation
Maximization (EM) techniques. Amazon SageMaker LDA uses tensor spectral decomposition. This
provides several advantages:
• Theoretical guarantees on results. The standard EM-method is guaranteed to converge only to local
optima, which are often of poor quality.
• Embarrassingly parallelizable. The work can be trivially divided over input documents in both training
and inference. The EM-method and Gibbs Sampling approaches can be parallelized, but not as easily.
• Fast. Although the EM-method has low iteration cost, it is prone to slow convergence rates. Gibbs
Sampling is also subject to slow convergence rates and requires a large number of samples.
1. The goal is to calculate the spectral decomposition of a V x V x V tensor, which summarizes the
moments of the documents in our corpus. V is vocabulary size (in other words, the number of distinct
words in all of the documents). The spectral components of this tensor are the LDA parameters α and
β, which maximize the overall likelihood of the document corpus. However, because vocabulary size
tends to be large, this V x V x V tensor is prohibitively large to store in memory.
2. Instead, it uses a V x V moment matrix, which is the two-dimensional analog of the tensor from step
1, to find a whitening matrix of dimension V x k. This matrix can be used to convert the V x V moment
matrix into a k x k identity matrix. k is the number of topics in the model.
3. This same whitening matrix can then be used to find a smaller k x k x k tensor. When spectrally
decomposed, this tensor has components that have a simple relationship with the components of the
V x V x V tensor.
4. Alternating Least Squares is used to decompose the smaller k x k x k tensor. This provides a substantial
improvement in memory consumption and speed. The parameters α and β can be found by
“unwhitening” these outputs in the spectral decomposition.
After the LDA model’s parameters have been found, you can find the topic mixtures for each document.
You use stochastic gradient descent to maximize the likelihood function of observing a given topic
mixture corresponding to these data.
Topic quality can be improved by increasing the number of topics to look for in training and then
filtering out poor quality ones. This is in fact done automatically in SageMaker LDA: 25% more topics
are computed and only the ones with largest associated Dirichlet priors are returned. To perform further
topic filtering and analysis, you can increase the topic count and modify the resulting LDA model as
follows:
For more information about algorithms for LDA and the SageMaker implementation, see the following:
• Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor
Decompositions for Learning Latent Variable Models, Journal of Machine Learning Research, 15:2773–
2832, 2014.
• David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine
Learning Research, 3(Jan):993–1022, 2003.
• Thomas L Griffiths and Mark Steyvers. Finding Scientific Topics. Proceedings of the National Academy
of Sciences, 101(suppl 1):5228–5235, 2004.
• Tamara G Kolda and Brett W Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–
500, 2009.
LDA Hyperparameters
In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific hyperparameters as string-to-string maps. The following table lists the hyperparameters
for the LDA training algorithm provided by Amazon SageMaker. For more information, see How LDA
Works (p. 1411).
num_topics
The number of topics for LDA to find within the data.
Required

alpha0
Initial guess for the concentration parameter: the sum of the elements of the Dirichlet prior. Small values are more likely to generate sparse topic mixtures and large values (greater than 1.0) produce more uniform mixtures.
Optional

tol
Target error tolerance for the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.
Optional
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
LDA is an unsupervised topic modeling algorithm that attempts to describe a set of observations
(documents) as a mixture of different categories (topics). The “per-word log-likelihood” (PWLL) metric
measures the likelihood that a learned set of topics (an LDA model) accurately describes a test document
dataset. Larger values of PWLL indicate that the test data is more likely to be described by the LDA
model.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The LDA algorithm reports on a single metric during training: test:pwll. When tuning a model, choose
this metric as the objective metric.
You can tune the following hyperparameters for the LDA algorithm. Both hyperparameters, alpha0 and
num_topics, can affect the LDA objective metric (test:pwll). If you don't already know the optimal
values for these hyperparameters, which maximize per-word log-likelihood and produce an accurate LDA
model, automatic model tuning can help find them.
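A sketch of such a tuning job with the SageMaker Python SDK; the lda_estimator variable, the channel inputs, and the ranges shown are assumptions, not recommendations:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=lda_estimator,            # a previously configured LDA Estimator
    objective_metric_name="test:pwll",
    objective_type="Maximize",          # larger per-word log-likelihood is better
    hyperparameter_ranges={
        "alpha0": ContinuousParameter(0.1, 10),
        "num_topics": IntegerParameter(5, 150),
    },
    max_jobs=12,
    max_parallel_jobs=3,
)
tuner.fit({"train": s3_train_data, "test": s3_test_data})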
Topic modeling provides a way to visualize the contents of a large document corpus in terms of the
learned topics. Documents relevant to each topic might be indexed or searched for based on their soft
topic labels. The latent representations of documents might also be used to find similar documents in
the topic space. You can also use the latent representations of documents that the topic model learns for
input to another supervised algorithm such as a document classifier. Because the latent representations
of documents are expected to capture the semantics of the underlying documents, algorithms based in
part on these representations are expected to perform better than those based on lexical features alone.
Although you can use both the Amazon SageMaker NTM and LDA algorithms for topic modeling, they
are distinct algorithms and can be expected to produce different results on the same input data.
For more information on the mathematics behind NTM, see Neural Variational Inference for Text
Processing.
Topics
• Input/Output Interface for the NTM Algorithm (p. 1415)
• EC2 Instance Recommendation for the NTM Algorithm (p. 1416)
• NTM Sample Notebooks (p. 1416)
• NTM Hyperparameters (p. 1416)
• Tune an NTM Model (p. 1419)
• NTM Response Formats (p. 1420)
The train, validation, and test data channels for NTM support both recordIO-wrapped-protobuf
(dense and sparse) and CSV file formats. For CSV format, each row must be represented densely with
zero counts for words not present in the corresponding document, and have dimension equal to:
(number of records) * (vocabulary size). You can use either File mode or Pipe mode to train models on
data that is formatted as recordIO-wrapped-protobuf or as CSV. The auxiliary channel is used to
supply a text file that contains vocabulary. By supplying the vocabulary file, users are able to see the top
words for each of the topics printed in the log instead of their integer IDs. Having the vocabulary file also
allows NTM to compute the Word Embedding Topic Coherence (WETC) scores, a new metric displayed in
the log that effectively captures similarity among the top words in each topic. The ContentType for the
auxiliary channel is text/plain, with each line containing a single word, in the order corresponding to
the integer IDs provided in the data. The vocabulary file must be named vocab.txt, and currently only
UTF-8 encoding is supported.
See the blog post and the companion notebook for more details on using the auxiliary channel and the
WETC scores. For more information on how to compute the WETC score, see Coherence-Aware Neural
Topic Modeling. We used the pairwise WETC described in this paper for the Amazon SageMaker Neural
Topic Model.
For more information on input and output file formats, see NTM Response Formats (p. 1420) for
inference and the NTM Sample Notebooks (p. 1416).
NTM training supports both GPU and CPU instance types. We recommend GPU instances, but for certain
workloads, CPU instances may result in lower training costs. CPU instances should be sufficient for
inference. NTM training supports P2, P3, G4dn, and G5 GPU instance families for training and inference.
For a sample notebook that uses the SageMaker NTM algorithm to uncover topics in documents from a
synthetic data source where the topic distributions are known, see the Introduction to Basic Functionality
of NTM. For instructions on how to create and access Jupyter notebook instances that you can use to
run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you have
created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the
SageMaker samples. The topic modeling example notebooks using the NTM algorithms are located in the
Introduction to Amazon algorithms section. To open a notebook, click on its Use tab and select Create
copy.
NTM Hyperparameters
encoder_layers
The number of layers in the encoder and the output size of each layer. When set to auto, the algorithm uses two layers of sizes 3 x num_topics and 2 x num_topics respectively.
Optional
sub_sample
The fraction of the training data to sample for training per epoch.
Optional
tolerance
The maximum relative change in the loss function. Early stopping is triggered when the change in the loss function drops below this value within the last num_patience_epochs epochs.
Optional
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
Amazon SageMaker NTM is an unsupervised learning algorithm that learns latent representations of
large collections of discrete data, such as a corpus of documents. Latent representations use inferred
variables that are not directly measured to model the observations in a dataset. Automatic model tuning
on NTM helps you find the model that minimizes loss over the training or validation data. Training loss
measures how well the model fits the training data. Validation loss measures how well the model can
generalize to data that it is not trained on. Low training loss indicates that a model is a good fit to the
training data. Low validation loss indicates that a model has not overfit the training data and so should
be able to successfully model documents on which it has not been trained. Usually, it's preferable to have
both losses be small. However, minimizing training loss too much might result in overfitting and increase
validation loss, which would reduce the generality of the model.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The NTM algorithm reports a single metric that is computed during training: validation:total_loss.
The total loss is the sum of the reconstruction loss and Kullback-Leibler divergence. When tuning
hyperparameter values, choose this metric as the objective.
You can tune the following hyperparameters for the NTM algorithm. Usually setting low
mini_batch_size and small learning_rate values results in lower validation losses, although it
might take longer to train. Low validation losses don't necessarily produce more coherent topics as
interpreted by humans. The effect of other hyperparameters on training and validation loss can vary
from dataset to dataset. To see which values are compatible, see NTM Hyperparameters (p. 1416).
encoder_layers_activation
CategoricalParameterRanges: ['sigmoid', 'tanh', 'relu']
All Amazon SageMaker built-in algorithms adhere to the common input inference format described
in Common Data Formats - Inference. This topic contains a list of the available output formats for the
SageMaker NTM algorithm.
{
"predictions": [
{"topic_weights": [0.02, 0.1, 0,...]},
{"topic_weights": [0.25, 0.067, 0,...]}
]
}
[
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    }
]
Object2Vec Algorithm
The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm
that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional
objects. The embeddings are learned in a way that preserves the semantics of the relationship between
pairs of objects in the original space in the embedding space. You can use the learned embeddings to
efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in
low-dimensional space, for example. You can also use the embeddings as features of the corresponding
objects in downstream supervised tasks, such as classification or regression.
Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in
the SageMaker BlazingText algorithm (p. 1399). For a blog post that discusses how to apply Object2Vec
to some practical use cases, see Introduction to Amazon SageMaker Object2Vec.
Topics
• I/O Interface for the Object2Vec Algorithm (p. 1421)
• EC2 Instance Recommendation for the Object2Vec Algorithm (p. 1422)
• Object2Vec Sample Notebooks (p. 1422)
• How Object2Vec Works (p. 1422)
• Object2Vec Hyperparameters (p. 1424)
• Tune an Object2Vec Model (p. 1433)
• Data Formats for Object2Vec Training (p. 1435)
• Data Formats for Object2Vec Inference (p. 1435)
• Encoder Embeddings for Object2Vec (p. 1436)
You can use Object2Vec on many input data types, including the following examples.
• Sentence-sentence pairs – "A soccer game with multiple males playing." and "Some men are playing a
sport."
• Labels-sequence pairs – The genre tags of the movie "Titanic", such as "Romance" and "Drama", and its
short description: "James Cameron's Titanic is an epic, action-packed romance set against the ill-fated
maiden voyage of the R.M.S. Titanic. She was the most luxurious liner of her era, a ship of dreams, which
ultimately carried over 1,500 people to their death in the ice cold waters of the North Atlantic in the
early hours of April 15, 1912."
• Item review user-item pairs – A user's ID and the items she has bought, such as apple, pear, and
orange.
To transform the input data into the supported formats, you must preprocess it. Currently, Object2Vec
natively supports two types of input:
• A discrete token, which is represented as a list of a single integer-id. For example, [10].
• A sequence of discrete tokens, which is represented as a list of integer-ids. For example,
[0,12,10,13].
The object in each pair can be asymmetric. For example, the pairs can be (token, sequence) or (token,
token) or (sequence, sequence). For token inputs, the algorithm supports simple embeddings as
compatible encoders. For sequences of token vectors, the algorithm supports the following as encoders:
• Average-pooled embeddings
• Hierarchical convolutional neural networks (CNNs)
• Multi-layered bidirectional long short-term memory networks (BiLSTMs)
The input label for each pair can be one of the following:
• A categorical label that expresses the relationship between the objects in the pair
• A score that expresses the strength of the similarity between the two objects
For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For
ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss
function. Specify these loss functions with the output_layer hyperparameter when you create the
model training job.
The type of Amazon Elastic Compute Cloud (Amazon EC2) instance that you use depends on whether you
are training or running inference.
When training a model using the Object2Vec algorithm on a CPU, start with an ml.m5.2xlarge instance.
For training on a GPU, start with an ml.p2.xlarge instance. If the training takes too long on this instance,
you can use a larger instance. Currently, the Object2Vec algorithm can train only on a single machine.
However, it does offer support for multiple GPUs. Object2Vec supports P2, P3, G4dn, and G5 GPU
instance families for training and inference.
For inference with a trained Object2Vec model that has a deep neural network, we recommend
using an ml.p3.2xlarge GPU instance. Due to GPU memory scarcity, the INFERENCE_PREFERRED_MODE
environment variable can be specified to optimize whether the section called “GPU
optimization: Classification or Regression” (p. 1435) or the section called “GPU optimization: Encoder
Embeddings” (p. 1436) inference network is loaded into the GPU.
Note
To run the notebooks on a notebook instance, see Example Notebooks (p. 220). To run the
notebooks on Studio, see Create or Open an Amazon SageMaker Studio Notebook (p. 148).
When using the Amazon SageMaker Object2Vec algorithm, you follow the standard workflow: process
the data, train the model, and produce inferences.
Topics
• Step 1: Process Data (p. 1423)
• Step 2: Train a Model (p. 1423)
• Step 3: Produce Inferences (p. 1424)
During preprocessing, convert the data to the JSON Lines text file format specified in Data Formats
for Object2Vec Training (p. 1435). To get the highest accuracy during training, also randomly shuffle
the data before feeding it into the model. How you generate random permutations depends on the
language. For Python, you could use np.random.shuffle; for Unix, shuf.
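As a minimal sketch (the file names are hypothetical):

import numpy as np

# Read the JSON Lines training file, shuffle the lines in place, and write them back out.
with open("train.jsonl") as f:
    lines = f.readlines()
np.random.shuffle(lines)
with open("train_shuffled.jsonl", "w") as f:
    f.writelines(lines)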
• Two input channels – The input channels take a pair of objects of the same or different types as
inputs, and pass them to independent and customizable encoders.
• Two encoders – The two encoders, enc0 and enc1, convert each object into a fixed-length embedding
vector. The encoded embeddings of the objects in the pair are then passed into a comparator.
• A comparator – The comparator compares the embeddings in different ways and outputs scores that
indicate the strength of the relationship between the paired objects. For example, in the output score
for a sentence pair, 1 indicates a strong relationship and 0 represents a weak relationship.
During training, the algorithm accepts pairs of objects and their relationship labels or scores as inputs.
The objects in each pair can be of different types, as described earlier. If the inputs to both encoders are
composed of the same token-level units, you can use a shared token embedding layer by setting the
tied_token_embedding_weight hyperparameter to True when you create the training job. This is
possible, for example, when comparing sentences that both have word token-level units. To generate
negative samples at a specified rate, set the negative_sampling_rate hyperparameter to the desired
ratio of negative to positive samples. This hyperparameter expedites learning how to discriminate
between the positive samples observed in the training data and the negative samples that are not likely
to be observed.
Pairs of objects are passed through independent, customizable encoders that are compatible with the
input types of corresponding objects. The encoders convert each object in a pair into a fixed-length
embedding vector of equal length. The pair of vectors are passed to a comparator operator, which
assembles the vectors into a single vector using the value specified in the comparator_list
hyperparameter. The assembled vector then passes through a multilayer perceptron (MLP) layer, which
produces an output that the loss function compares with the labels that you provided. This comparison
evaluates the strength of the relationship between the objects in the pair as predicted by the model. The
following figure shows this workflow.
After the model is trained, you can use the trained encoder to preprocess input objects or to perform two
types of inference:
• To convert singleton input objects into fixed-length embeddings using the corresponding encoder
• To predict the relationship label or score between a pair of input objects
The inference server automatically figures out which of the types is requested based on the input data.
To get the embeddings as output, provide only one input. To predict the relationship label or score,
provide both inputs in the pair.
Object2Vec Hyperparameters
In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the
Object2Vec training algorithm.
comparator_list
A list used to customize the way in which two embeddings are compared. The Object2Vec comparator
operator layer takes the encodings from both encoders as inputs and outputs a single vector. This vector
is a concatenation of subvectors. The string values passed to comparator_list and the order in which
they are passed determine how these subvectors are assembled. For example, if
comparator_list="hadamard, concat", then the comparator operator constructs the vector by
concatenating the Hadamard product of the two encodings and the concatenation of the two encodings.
If, on the other hand, comparator_list="hadamard", then the comparator operator constructs the
vector as the Hadamard product of only the two encodings.
Optional
enc0_cnn_filter_width
The filter width of the convolutional neural network (CNN) enc0 encoder.
Conditional
Default value: 3
enc0_vocab_file
The vocabulary file for mapping pretrained enc0 token embedding vectors to numerical vocabulary IDs.
Conditional
enc1_network
The network model for the enc1 encoder. If you want the enc1 encoder to use the same network model as enc0, including the hyperparameter values, set the value to enc0.
Note
Even when the enc0 and enc1 encoder networks have symmetric architectures, you can't share parameter values for these networks.
Optional
enc1_vocab_file
The vocabulary file for mapping pretrained enc1 token embeddings to vocabulary IDs.
Conditional
mini_batch_size
The batch size that the dataset is split into for the optimizer during training.
Optional
Default value: 32

mlp_activation
The type of activation function for the multilayer perceptron (MLP) layer.
Optional
output_layer
The type of output layer, where you specify whether the task is regression or classification.
Optional

token_embedding_storage_type
The mode of gradient update used during training. When the dense mode is used, the optimizer
calculates the full gradient matrix for the token embedding layer even if most rows of the gradient are
zero-valued. When the sparse mode is used, the optimizer only stores rows of the gradient that are
actually being used in the mini-batch. If you want the algorithm to perform lazy gradient updates, which
calculate the gradients only in the non-zero rows and which speed up training, specify row_sparse.
Setting the value to row_sparse constrains the values available for other hyperparameters.
Optional
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
validation:cross_entropy – Cross-entropy – Minimize
• early_stopping_patience – IntegerParameterRange: MinValue: 1, MaxValue: 5
• early_stopping_tolerance – ContinuousParameterRange: MinValue: 0.001, MaxValue: 0.1
• enc0_cnn_filter_width – IntegerParameterRange: MinValue: 1, MaxValue: 5
• enc0_token_embedding_dim – IntegerParameterRange: MinValue: 5, MaxValue: 300
• enc1_cnn_filter_width – IntegerParameterRange: MinValue: 1, MaxValue: 5
• enc1_token_embedding_dim – IntegerParameterRange: MinValue: 5, MaxValue: 300
{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15,
69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9,
107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
The "in0" and "in1" fields are the inputs for encoder0 and encoder1, respectively. The same format is valid for
both classification and regression problems. For regression, the "label" field can accept real-valued
inputs.
transformer = o2v.transformer(
    instance_count=4,
    instance_type="ml.p2.xlarge",
    max_concurrent_transforms=2,
    max_payload=1,  # 1 MB
    strategy="MultiRecord",
    env={"INFERENCE_PREFERRED_MODE": "classification"},  # only useful with GPU
    output_path=output_s3_path,
)
{
"instances" : [
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821,
4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]},
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4],
"in1": [22, 32, 13, 25, 1016, 573, 3252, 4]},
{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
]
}
Content-type: application/jsonlines
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4],
"in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1":
[22, 32, 13, 25, 1016, 573, 3252, 4]}
{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
For classification problems, the length of the scores vector corresponds to num_classes. For regression
problems, the length is 1.
{
"predictions": [
{
"scores": [
0.6533935070037842,
0.07582679390907288,
0.2707797586917877
]
},
{
"scores": [
0.026291321963071823,
0.6577019095420837,
0.31600672006607056
]
}
]
}
Accept: application/jsonlines
{"scores":[0.195667684078216,0.395351558923721,0.408980727195739]}
{"scores":[0.251988261938095,0.258233487606048,0.489778339862823]}
{"scores":[0.280087798833847,0.368331134319305,0.351581096649169]}
In both the classification and regression formats, the scores apply to individual labels.
Because GPU memory is scarce, you can specify the INFERENCE_PREFERRED_MODE environment variable
to optimize whether the scoring inference network (see the section called "Inference Formats: Scoring" (p. 1435))
or the encoder embedding inference network is loaded into the GPU. If the majority of your inference is for encoder
embeddings, specify INFERENCE_PREFERRED_MODE=embedding. The following Batch Transform
example uses four ml.p2.xlarge instances and optimizes for encoder embedding inference:
transformer = o2v.transformer(
    instance_count=4,
    instance_type="ml.p2.xlarge",
    max_concurrent_transforms=2,
    max_payload=1,  # 1 MB
    strategy="MultiRecord",
    env={"INFERENCE_PREFERRED_MODE": "embedding"},  # only useful with GPU
    output_path=output_s3_path,
)
Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] that define the maximum
sequence lengths for the forward and backward encoders.
{
"instances" : [
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821,
4]},
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]},
{"in0": [774, 14, 21, 206]}
]
}
Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] that define the maximum
sequence lengths for the forward and backward encoders.
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]}
{"in0": [774, 14, 21, 206]}
In both of these formats, you specify only one input type: “in0” or “in1.” The inference service then
invokes the corresponding encoder and outputs the embeddings for each of the instances.
{
    "predictions": [
        {"embeddings": [0.057368703186511, 0.030703511089086, 0.099890425801277, 0.063688032329082, 0.026327300816774, 0.0036375711, ...]},
        {"embeddings": [0.150190666317939, 0.05145975202322, 0.098204270005226, 0.064249359071254, 0.056249320507049, 0.01513972133, ...]}
    ]
}
Content-type: application/jsonlines
{"embeddings": [0.057368703186511, 0.030703511089086, 0.099890425801277, 0.063688032329082, 0.026327300816774, 0.0036375711, ...]}
{"embeddings": [0.150190666317939, 0.05145975202322, 0.098204270005226, 0.064249359071254, 0.056249320507049, 0.01513972133, ...]}
The vector length of the embeddings output by the inference service is equal to the value of one of
the following hyperparameters that you specify at training time: enc0_token_embedding_dim,
enc1_token_embedding_dim, or enc_dim.
Sequence-to-Sequence Algorithm
Amazon SageMaker Sequence to Sequence is a supervised learning algorithm where the input is a
sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.
Example applications include: machine translation (input a sentence from one language and predict what
that sentence would be in another language), text summarization (input a longer string of words and
predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output
sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural
networks that show a significant performance boost over previous methodologies. Amazon SageMaker
seq2seq uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with
attention as encoder-decoder architectures.
Topics
• Input/Output Interface for the Sequence-to-Sequence Algorithm (p. 1438)
• EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm (p. 1439)
• Sequence-to-Sequence Sample Notebooks (p. 1439)
SageMaker seq2seq expects data in RecordIO-Protobuf format. However, the tokens are expected as
integers, not as floating-point values, which is the usual case.
A script to convert data from tokenized text files to the protobuf format is included in the seq2seq
example notebook. In general, it packs the data into 32-bit integer tensors and generates the necessary
vocabulary files, which are needed for metric calculation and inference.
After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three
channels:
• train: It should contain the training data (for example, the train.rec file generated by the
preprocessing script).
• validation: It should contain the validation data (for example, the val.rec file generated by the
preprocessing script).
• vocab: It should contain two vocabulary files (vocab.src.json and vocab.trg.json).
If the algorithm doesn't find data in any of these three channels, training results in an error.
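For illustration, a minimal sketch of wiring these three channels into a training job with the SageMaker Python SDK might look like the following; the bucket paths and instance type are hypothetical.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/seq2seq/output",  # hypothetical
    sagemaker_session=session,
)

# The three channels described above.
seq2seq.fit({
    "train": "s3://my-bucket/seq2seq/train",            # contains train.rec
    "validation": "s3://my-bucket/seq2seq/validation",  # contains val.rec
    "vocab": "s3://my-bucket/seq2seq/vocab",            # vocab.src.json and vocab.trg.json
})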
Inference
For hosted endpoints, inference supports two data formats. To perform inference using space separated
text tokens, use the application/json format. Otherwise, use the recordio-protobuf format to
work with the integer encoded data. Both modes support batching of input data. application/json
format also allows you to visualize the attention matrix.
• application/json: Expects the input in JSON format and returns the output in JSON format. Both
content and accept types should be application/json. Each sequence is expected to be a string
with whitespace separated tokens. This format is recommended when the number of source sequences
in the batch is small. It also supports the following additional configuration options:
configuration: {attention_matrix: true}: Returns the attention matrix for the particular input
sequence.
• application/x-recordio-protobuf: Expects the input in recordio-protobuf format and
returns the output in recordio-protobuf format. Both content and accept types should be
application/x-recordio-protobuf. For this format, the source sequences must be converted
into a list of integers for subsequent protobuf encoding. This format is recommended for bulk
inference.
For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON
Lines format and returns the output in JSON Lines format. Both content and accept types should be
application/jsonlines. The format for input is as follows:
content-type: application/jsonlines
{"source": "source_sequence_0"}
{"source": "source_sequence_1"}
accept: application/jsonlines
{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
For additional details on how to serialize and deserialize the inputs and outputs to specific formats for
inference, see the Sequence-to-Sequence Sample Notebooks (p. 1439).
• An embedding layer. In this layer, the input matrix of input tokens encoded in a sparse way
(for example, one-hot encoded) is mapped to a dense feature layer. This is required because a high-
dimensional feature vector is more capable of encoding information regarding a particular token (word
for text corpora) than a simple one-hot-encoded vector. It is also standard practice to initialize this
embedding layer with a pretrained word vector such as FastText or GloVe, or to initialize it randomly and
learn the parameters during training.
• An encoder layer. After the input tokens are mapped into a high-dimensional feature space,
the sequence is passed through an encoder layer to compress all the information from the input
embedding layer (of the entire sequence) into a fixed-length feature vector. Typically, an encoder is
made of RNN-type networks like long short-term memory (LSTM) or gated recurrent units (GRU).
(Colah's blog explains LSTM in great detail.)
• A decoder layer. The decoder layer takes this encoded feature vector and produces the output
sequence of tokens. This layer is also usually built with RNN architectures (LSTM and GRU).
The whole model is trained jointly to maximize the probability of the target sequence given the source
sequence. This model was first introduced by Sutskever et al. in 2014.
For more details, see the whitepaper Effective Approaches to Attention-based Neural Machine
Translation by Luong, et al., which explains and simplifies calculations for various attention mechanisms.
Additionally, the whitepaper Google's Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation by Wu, et al. describes Google's architecture for machine translation,
which uses skip connections between encoder and decoder layers.
Sequence-to-Sequence Hyperparameters
beam_size Length of the beam for beam search. Used during training
for computing BLEU and used during inference.
Optional
Default value: 5
rnn_attention_in_upper_layers Pass the attention to the upper layers of the RNN, as in the
Google NMT paper. Only applicable if more than one layer is used.
Optional
rnn_num_hidden The number of RNN hidden units for the encoder and decoder.
This must be a multiple of 2 because the algorithm uses
bidirectional long short-term memory (LSTM) by default.
Optional
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that results in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The sequence to sequence algorithm reports three metrics that are computed during training. Choose
one of them as an objective to optimize when tuning the hyperparameter values.
You can tune the following hyperparameters for the SageMaker Sequence to Sequence algorithm.
The hyperparameters that have the greatest impact on sequence to sequence objective metrics are listed below.
rnn_decoder_hidden_dropout: ContinuousParameterRange, MinValue: 0.0, MaxValue: 0.5
plateau_reduce_lr_factor: ContinuousParameterRange, MinValue: 0.1, MaxValue: 0.5
plateau_reduce_lr_threshold: IntegerParameterRange, [1-5]
fixed_rate_lr_half_life: IntegerParameterRange, [10-30]
Text Classification - TensorFlow
Topics
• How to use the SageMaker Text Classification - TensorFlow algorithm (p. 1450)
• Input and output interface for the Text Classification - TensorFlow algorithm (p. 1451)
• Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm (p. 1452)
• Text Classification - TensorFlow sample notebooks (p. 1453)
• How Text Classification - TensorFlow Works (p. 1453)
• TensorFlow Hub Models (p. 1453)
• Text Classification - TensorFlow Hyperparameters (p. 1456)
• Tune a Text Classification - TensorFlow model (p. 1459)
You can use Text Classification - TensorFlow as an Amazon SageMaker built-in algorithm. The following
section describes how to use Text Classification - TensorFlow with the SageMaker Python SDK. For
information on how to use Text Classification - TensorFlow from the Amazon SageMaker Studio UI, see
SageMaker JumpStart (p. 47).
The Text Classification - TensorFlow algorithm supports transfer learning using any of the compatible
pretrained TensorFlow models. For a list of all available pretrained models, see TensorFlow Hub
Models (p. 1453). Every pretrained model has a unique model_id. The following example uses BERT
Base Uncased (model_id: tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2) to fine-tune
on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and
stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated
model training artifacts to construct a SageMaker Estimator.
First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the
hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and
their default values with hyperparameters.retrieve_default. For more information, see Text
Classification - TensorFlow Hyperparameters (p. 1456). Use these values to construct a SageMaker
Estimator.
Note
Default hyperparameter values are different for different models. For example, for larger
models, the default batch size is smaller.
This example uses the SST2 dataset, which contains positive and negative movie reviews. We pre-
downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call .fit using
the Amazon S3 location of your training dataset. Any S3 bucket used in a notebook must be in the same
AWS Region as the notebook instance that accesses it.
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
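A minimal sketch of the steps described above, using the SageMaker Python SDK, might look like the following; the entry point script name and the edited hyperparameter are assumptions, and the training channel name follows the JumpStart convention.

import sagemaker
from sagemaker import hyperparameters, image_uris, model_uris, script_uris
from sagemaker.estimator import Estimator

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image URI, training script URI, and pretrained model URI.
train_image_uri = image_uris.retrieve(
    region=None, framework=None, image_scope="training",
    model_id=model_id, model_version=model_version,
    instance_type=training_instance_type,
)
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Retrieve the default hyperparameters and change them as you see fit.
hp = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
hp["epochs"] = "5"  # assumed hyperparameter name and value

tc_estimator = Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    entry_point="transfer_learning.py",  # assumed script name in the training source
    model_uri=train_model_uri,
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hp,
    output_path=s3_output_location,
)
tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)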
For more information about how to use the SageMaker Text Classification - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to JumpStart - Text Classification notebook.
Input and output interface for the Text Classification - TensorFlow algorithm
Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset made
up of text sentences with any number of classes. The pretrained model attaches a classification layer to
the Text Embedding model and initializes the layer parameters to random values. The output dimension
of the classification layer is determined based on the number of classes detected in the input data.
Be mindful of how to format your training data for input to the Text Classification - TensorFlow model.
• Training data input format: A directory containing a data.csv file. Each row of the first column
should have an integer class label from 0 to the number of classes minus 1. Each row of the second
column should have the corresponding text data.
The following is an example of an input CSV file. Note that the file should not have any
header. The file should be hosted in an Amazon S3 bucket with a path similar to the following:
s3://bucket_name/input_directory/. Note that the trailing / is required.
| | |
|---|---|
|0 |hide new secretions from the parental units|
|0 |contains no wit , only labored gags|
|1 |that loves its characters and communicates something rather beautiful about human
nature|
|...|...|
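A minimal sketch of producing such a file with pandas; the rows reuse the example above, and the local file name is a placeholder.

import pandas as pd

rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather beautiful about human nature"),
]

# Write the CSV without a header or index, as required.
pd.DataFrame(rows).to_csv("data.csv", header=False, index=False)
# Then upload it to, for example, s3://bucket_name/input_directory/data.csv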
Incremental training
You can seed the training of a new model with artifacts from a model that you trained previously with
SageMaker. Incremental training saves training time when you want to train a new model with the same
or similar data.
Note
You can only seed a SageMaker Text Classification - TensorFlow model with another Text
Classification - TensorFlow model trained in SageMaker.
You can use any dataset for incremental training, as long as the set of classes remains the same. The
incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained
model, you start with an existing fine-tuned model.
For more information on using incremental training with the SageMaker Text Classification - TensorFlow
algorithm, see the Introduction to JumpStart - Text Classification sample notebook.
Running inference results in probability values, class labels for all classes, and the predicted label
corresponding to the class index with the highest probability, all encoded in JSON format. The Text
Classification - TensorFlow model processes a single string per request and outputs only one line. The
following is an example of a JSON format response:
accept: application/json;verbose
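Such a response might look like the following sketch for a three-class problem; the exact field names are assumptions based on the output description above.

{"probabilities": [0.71, 0.21, 0.08], "labels": [0, 1, 2], "predicted_label": 0}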
Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm
The Text Classification - TensorFlow algorithm supports all CPU and GPU instances for training,
including:
• ml.p2.xlarge
• ml.p2.16xlarge
• ml.p3.2xlarge
• ml.p3.16xlarge
• ml.g4dn.xlarge
• ml.g4dn.16xlarge
• ml.g5.xlarge
• ml.g5.48xlarge
We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such
as M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference. For a comprehensive list of
SageMaker training and inference instances across AWS Regions, see Amazon SageMaker Pricing.
For more information about how to use the SageMaker Text Classification - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to JumpStart - Text Classification notebook.
For instructions how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.
The Text Classification - TensorFlow algorithm takes text and classifies it into one of the output class
labels. Deep learning networks such as BERT are highly accurate for text classification. There are also
deep learning networks that are trained on large text datasets, such as TextNet, which has more than 11
million texts with about 11,000 categories. After a network is trained with TextNet data, you can then
fine-tune the network on a dataset with a particular focus to perform more specific text classification
tasks. The Amazon SageMaker Text Classification - TensorFlow algorithm supports transfer learning on
many pretrained models that are available in the TensorFlow Hub.
According to the number of class labels in your training data, a text classification layer is attached to
the pretrained TensorFlow model of your choice. The classification layer consists of a dropout layer
and a dense, fully connected layer with 2-norm regularization, and is initialized with random
weights. You can change the hyperparameter values for the dropout rate of the dropout layer and the L2
regularization factor for the dense layer.
You can fine-tune either the entire network (including the pretrained model) or only the top
classification layer on new training data. With this method of transfer learning, training with smaller
datasets is possible.
The pretrained models available for transfer learning with the Text Classification - TensorFlow
algorithm vary significantly in size, number of model parameters, training time, and
inference latency for any given dataset. The best model for your use case depends on the complexity
of your fine-tuning dataset and any requirements that you have on training time, inference latency, or
model accuracy.
Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Text Classification -
TensorFlow algorithm. See Tune a Text Classification - TensorFlow model (p. 1459) for information on
hyperparameter tuning.
batch_size The batch size for training. For training on instances with multiple
GPUs, this batch size is used across the GPUs.
beta_1 The beta1 for the "adam" and "adamw" optimizers. Represents the
exponential decay rate for the first moment estimates. Ignored for
other optimizers.
beta_2 The beta2 for the "adam" and "adamw" optimizers. Represents the
exponential decay rate for the second moment estimates. Ignored
for other optimizers.
dropout_rate The dropout rate for the dropout layer in the top classification
layer. Used only when reinitialize_top_layer is set to
"True".
optimizer The optimizer type. For more information, see Optimizers in the
TensorFlow documentation.
regularizers_l2 The L2 regularization factor for the dense layer in the classification
layer. Used only when reinitialize_top_layer is set to
"True".
rho The discounting factor for the gradient of the "adadelta" and
"rmsprop" optimizers. Ignored for other optimizers.
train_only_on_top_layer If "True", only the top classification layer parameters are fine-
tuned. If "False", all model parameters are fine-tuned.
warmup_steps_fraction The fraction of the total number of gradient update steps, where
the learning rate increases from 0 to the initial learning rate as a
warm up. Only used with the adamw optimizer.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that results in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The Text Classification - TensorFlow algorithm computes the validation:accuracy metric during
training. Choose it as the objective metric to maximize when tuning.
Tune a text classification model with the following hyperparameters. The hyperparameters that have
the greatest impact on text classification objective metrics are: batch_size, learning_rate, and
optimizer. Tune the optimizer-related hyperparameters, such as momentum, regularizers_l2,
beta_1, beta_2, and eps based on the selected optimizer. For example, use beta_1 and beta_2
only when adamw or adam is the optimizer.
For more information about which hyperparameters are used for each optimizer, see Text
Classification - TensorFlow Hyperparameters (p. 1456).
train_only_on_top_layer: CategoricalParameterRanges, ['True', 'False']
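For illustration, a minimal sketch of such a tuning job with the SageMaker Python SDK might look like the following; the objective metric name and regex are assumptions, the ranges are examples, and tc_estimator is the Estimator from the earlier sketch.

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

tuner = HyperparameterTuner(
    estimator=tc_estimator,
    objective_metric_name="validation:accuracy",  # assumed metric name
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "validation accuracy: ([0-9\\.]+)"}  # assumed regex
    ],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "train_only_on_top_layer": CategoricalParameter(["True", "False"]),
    },
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"training": training_dataset_s3_path})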
• DeepAR Forecasting Algorithm (p. 1460)—a supervised learning algorithm for forecasting scalar (one-
dimensional) time series using recurrent neural networks (RNN).
DeepAR Forecasting Algorithm
Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or
exponential smoothing (ETS), fit a single model to each individual time series. In many applications,
however, you have many similar time series across a set of cross-sectional units.
For example, you might have time series groupings for demand for different products, server loads, and
requests for webpages. For this type of application, you can benefit from training a single model jointly
over all of the time series. DeepAR takes this approach. When your dataset contains hundreds of related
time series, DeepAR outperforms the standard ARIMA and ETS methods. You can also use the trained
model to generate forecasts for new time series that are similar to the ones it has been trained on.
The training input for the DeepAR algorithm is one or, preferably, more target time series that have
been generated by the same process or similar processes. Based on this input dataset, the algorithm
trains a model that learns an approximation of this process/processes and uses it to predict how the
target time series evolves. Each target time series can be optionally associated with a vector of static
(time-independent) categorical features provided by the cat field and a vector of dynamic (time-
dependent) time series provided by the dynamic_feat field. SageMaker trains the DeepAR model by
randomly sampling training examples from each target time series in the training dataset. Each training
example consists of a pair of adjacent context and prediction windows with fixed predefined lengths. To
control how far in the past the network can see, use the context_length hyperparameter. To control
how far in the future predictions can be made, use the prediction_length hyperparameter. For more
information, see How the DeepAR Algorithm Works (p. 1465).
Topics
• Input/Output Interface for the DeepAR Algorithm (p. 1461)
• Best Practices for Using the DeepAR Algorithm (p. 1463)
• EC2 Instance Recommendations for the DeepAR Algorithm (p. 1464)
• DeepAR Sample Notebooks (p. 1464)
• How the DeepAR Algorithm Works (p. 1465)
• DeepAR Hyperparameters (p. 1467)
• Tune a DeepAR Model (p. 1471)
• DeepAR Inference Formats (p. 1472)
DeepAR supports two data channels. The required train channel describes the training dataset. The
optional test channel describes a dataset that the algorithm uses to evaluate model accuracy after
training. You can provide training and test datasets in JSON Lines format, optionally compressed with
gzip, or in Parquet format.
When specifying the paths for the training and test data, you can specify a single file or a directory that
contains multiple files, which can be stored in subdirectories. If you specify a directory, DeepAR uses all
files in the directory as inputs for the corresponding channel, except those that start with a period (.) and
those named _SUCCESS. This ensures that you can directly use output folders produced by Spark jobs as
input channels for your DeepAR training jobs.
By default, the DeepAR model determines the input format from the file extension (.json, .json.gz,
or .parquet) in the specified input path. If the path does not end in one of these extensions, you must
explicitly specify the format in the SDK for Python. Use the content_type parameter of the s3_input
class.
The records in your input files should contain the following fields:
• start—A string with the format YYYY-MM-DD HH:MM:SS. The start timestamp can't contain time
zone information.
• target—An array of floating-point values or integers that represent the time series. You can encode
missing values as null literals, or as "NaN" strings in JSON, or as nan floating-point values in Parquet.
• dynamic_feat (optional)—An array of arrays of floating-point values or integers that represents the
vector of custom feature time series (dynamic features). If you set this field, all records must have the
same number of inner arrays (the same number of feature time series). In addition, each inner array
must be the same length as the associated target value plus prediction_length. Missing values
are not supported in the features. For example, if target time series represents the demand of different
products, an associated dynamic_feat might be a boolean time-series which indicates whether a
promotion was applied (1) to the particular product or not (0).
• cat (optional)—An array of categorical features that can be used to encode the groups that
the record belongs to. Categorical features must be encoded as a 0-based sequence of nonnegative
integers. For example, the categorical domain {R, G, B} can be encoded as {0, 1, 2}. All values
from each categorical domain must be represented in the training dataset. That's because the
DeepAR algorithm can forecast only for categories that have been observed during training.
And, each categorical feature is embedded in a low-dimensional space whose dimensionality is
controlled by the embedding_dimension hyperparameter. For more information, see DeepAR
Hyperparameters (p. 1467).
If you use a JSON file, it must be in JSON Lines format. For example:
{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1],
"dynamic_feat": [[1.1, 1.2, 0.5, ...]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat":
[[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat":
[[1.3, 0.4]]}
In this example, each time series has two associated categorical features and one dynamic feature time series.
For Parquet, you use the same three fields as columns. In addition, "start" can be the datetime type.
You can compress Parquet files using gzip (gzip) or the Snappy compression library (snappy).
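As a minimal sketch, you could write a training file in this format with a few lines of Python; the series values reuse the example above, and the file name is a placeholder.

import json

series = [
    {"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1], "cat": [0, 1],
     "dynamic_feat": [[1.1, 1.2, 0.5]]},
    {"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4],
     "dynamic_feat": [[1.3, 0.4]]},
]

# Write one JSON object per line (JSON Lines format).
with open("train.json", "w") as f:
    for ts in series:
        f.write(json.dumps(ts) + "\n")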
If the algorithm is trained without cat and dynamic_feat fields, it learns a "global" model, that
is, a model that is agnostic to the specific identity of the target time series at inference time and is
conditioned only on its shape.
If the model is conditioned on the cat and dynamic_feat feature data provided for each time series,
the prediction will probably be influenced by the character of time series with the corresponding cat
features. For example, if the target time series represents the demand of clothing items, you can
associate a two-dimensional cat vector that encodes the type of item (e.g. 0 = shoes, 1 = dress) in the
first component and the color of an item (e.g. 0 = red, 1 = blue) in the second component. A sample input
would look as follows:
{ "start": ..., "target": ..., "cat": [0, 0], ... } # red shoes
{ "start": ..., "target": ..., "cat": [1, 1], ... } # blue dress
At inference time, you can request predictions for targets with cat values that are combinations of the
cat values observed in the training data, for example:
{ "start": ..., "target": ..., "cat": [0, 1], ... } # blue shoes
{ "start": ..., "target": ..., "cat": [1, 0], ... } # red dress
• The start time and length of the time series can differ. For example, in marketing, products often enter
a retail catalog at different dates, so their start dates naturally differ. But all series must have the same
frequency, number of categorical features, and number of dynamic features.
• Shuffle the training file with respect to the position of the time series in the file. In other words, the
time series should occur in random order in the file.
• Make sure to set the start field correctly. The algorithm uses the start timestamp to derive the
internal features.
• If you use categorical features (cat), all time series must have the same number of categorical
features. If the dataset contains the cat field, the algorithm uses it and extracts the cardinality of the
groups from the dataset. By default, cardinality is "auto". If the dataset contains the cat field,
but you don't want to use it, you can disable it by setting cardinality to "". If a model was trained
using a cat feature, you must include it for inference.
• If your dataset contains the dynamic_feat field, the algorithm uses it automatically. All time series
have to have the same number of feature time series. The time points in each of the feature time series
correspond one-to-one to the time points in the target. In addition, the entry in the dynamic_feat
field should have the same length as the target. If the dataset contains the dynamic_feat field, but
you don't want to use it, disable it by setting num_dynamic_feat to "". If the model was trained
with the dynamic_feat field, you must provide this field for inference. In addition, each of the
features has to have the length of the provided target plus the prediction_length. In other words,
you must provide the feature values for future time points.
If you specify optional test channel data, the DeepAR algorithm evaluates the trained model with
different accuracy metrics. The algorithm calculates the root mean square error (RMSE) over the test data
as follows:
RMSE = \sqrt{\frac{1}{nT}\sum_{i,t}\left(y_{i,t}-\hat{y}_{i,t}\right)^{2}}
where y_{i,t} is the true value of time series i at the time t and \hat{y}_{i,t} is the mean prediction. The sum is over all n time
series in the test set and over the last T time points for each time series, where T corresponds to the
forecast horizon. You specify the length of the forecast horizon by setting the prediction_length
hyperparameter. For more information, see DeepAR Hyperparameters (p. 1467).
In addition, the algorithm evaluates the accuracy of the forecast distribution using weighted quantile
loss. For a quantile \tau in the range [0, 1], the weighted quantile loss is defined as follows:
\text{wQuantileLoss}[\tau]=2\,\frac{\sum_{i,t}\left[\tau\max\left(y_{i,t}-q^{(\tau)}_{i,t},0\right)+(1-\tau)\max\left(q^{(\tau)}_{i,t}-y_{i,t},0\right)\right]}{\sum_{i,t}\left|y_{i,t}\right|}
where q^{(\tau)}_{i,t} is the \tau-quantile of the distribution that the model predicts. To specify which quantiles to
calculate loss for, set the test_quantiles hyperparameter. In addition to these, the average of
the prescribed quantile losses is reported as part of the training logs. For information, see DeepAR
Hyperparameters (p. 1467).
For inference, DeepAR accepts JSON format with the following fields:
• "instances", which includes one or more time series in JSON Lines format
• "configuration", which includes parameters for generating the forecast
When preparing your time series data, follow these best practices to achieve the best results:
• Except for when splitting your dataset for training and testing, always provide the entire time
series for training, testing, and when calling the model for inference. Regardless of how you set
context_length, don't break up the time series or provide only a part of it. The model uses data
points further back than the value set in context_length for the lagged values feature.
• When tuning a DeepAR model, you can split the dataset to create a training dataset and a test dataset.
In a typical evaluation, you would test the model on the same time series used for training, but
on the future prediction_length time points that follow immediately after the last time point
visible during training. You can create training and test datasets that satisfy this criterion by using the
entire dataset (the full length of all time series that are available) as a test set and removing the last
prediction_length points from each time series for training (a minimal sketch of this split follows
this list). During training, the model doesn't see
the target values for time points on which it is evaluated during testing. During testing, the algorithm
withholds the last prediction_length points of each time series in the test set and generates a
prediction. Then it compares the forecast with the withheld values. You can create more complex
evaluations by repeating time series multiple times in the test set, but cutting them at different
endpoints. With this approach, accuracy metrics are averaged over multiple forecasts from different
time points. For more information, see Tune a DeepAR Model (p. 1471).
• Avoid using very large values (>400) for the prediction_length because it makes the model slow
and less accurate. If you want to forecast further into the future, consider aggregating your data at a
lower frequency. For example, use 5min instead of 1min.
• Because lags are used, a model can look further back in the time series than the value specified for
context_length. Therefore, you don't need to set this parameter to a large value. We recommend
starting with the value that you used for prediction_length.
• We recommend training a DeepAR model on as many time series as are available. Although a DeepAR
model trained on a single time series might work well, standard forecasting algorithms, such as ARIMA
or ETS, might provide more accurate results. The DeepAR algorithm starts to outperform the standard
methods when your dataset contains hundreds of related time series. Currently, DeepAR requires that
the total number of observations available across all training time series is at least 300.
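The following is a minimal sketch of the train/test split described above; the dataset and prediction_length value are made up.

prediction_length = 2  # hypothetical forecast horizon

full_series = [  # toy dataset in DeepAR JSON form
    {"start": "2009-11-01 00:00:00", "target": [4.0, 10.0, 12.0, 100.0, 113.0]},
    {"start": "2012-01-30 00:00:00", "target": [1.0, 2.0, 3.0, 4.0]},
]

# The test set keeps the full series; the training set removes the last
# prediction_length points from each series.
test_set = full_series
train_set = [dict(ts, target=ts["target"][:-prediction_length]) for ts in full_series]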
You can train DeepAR on both GPU and CPU instances and in both single and multi-machine settings.
We recommend starting with a single CPU instance (for example, ml.c4.2xlarge or ml.c4.4xlarge), and
switching to GPU instances and multiple machines only when necessary. Using GPUs and multiple
machines improves throughput only for larger models (with many cells per layer and many layers) and
for large mini-batch sizes (for example, greater than 512).
For a sample notebook that shows how to prepare a time series dataset for training the SageMaker
DeepAR algorithm and how to deploy the trained model for performing inferences, see Time series
forecasting with DeepAR - Synthetic data as well as DeepAR demo on electricity dataset, which illustrates
the advanced features of DeepAR on a real world dataset. For instructions on creating and accessing
Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon SageMaker
Notebook Instances (p. 204). After creating and opening a notebook instance, choose the SageMaker
Examples tab to see a list of all of the SageMaker examples. To open a notebook, choose its Use tab, and
choose Create copy.
During training, DeepAR accepts a training dataset and an optional test dataset. It uses the test dataset
to evaluate the trained model. In general, the datasets don't have to contain the same set of time series.
You can use a model trained on a given training set to generate forecasts for the future of the time series
in the training set, and for other time series. Both the training and the test datasets consist of one or,
preferably, more target time series. Each target time series can optionally be associated with a vector
of feature time series and a vector of categorical features. For more information, see Input/Output
Interface for the DeepAR Algorithm (p. 1461).
For example, an element of a training set indexed by i consists of a target time
series, z_{i,t}, and two associated feature time series, x_{i,1,t} and x_{i,2,t}.
The target time series might contain missing values (represented by breaks in the
series). DeepAR supports only feature time series that are known in the future. This allows you to run
"what if?" scenarios. What happens, for example, if I change the price of a product in some way?
Each target time series can also be associated with a number of categorical features. You can use these
features to encode which groupings a time series belongs to. Categorical features allow the model to
learn typical behavior for groups, which it can use to increase model accuracy. DeepAR implements this
by learning an embedding vector for each group that captures the common properties of all time series
in the group.
To facilitate learning time-dependent patterns, such as spikes during weekends, DeepAR automatically
creates feature time series based on the frequency of the target time series. It uses these derived feature
time series with the custom feature time series that you provide during training and inference. Two
examples of such derived features are u_{i,1,t}, which represents the hour of the day, and u_{i,2,t}, which
represents the day of the week.
The DeepAR algorithm automatically generates these feature time series. The following table lists the
derived features for the supported basic time frequencies.
Frequency of the Time Series    Derived Features
Minute                          minute-of-hour, hour-of-day, day-of-week, day-of-month, day-of-year
Hour                            hour-of-day, day-of-week, day-of-month, day-of-year
Day                             day-of-week, day-of-month, day-of-year
Week                            day-of-month, week-of-year
Month                           month-of-year
DeepAR trains a model by randomly sampling several training examples from each of the time series
in the training dataset. Each training example consists of a pair of adjacent context and prediction
windows with fixed predefined lengths. The context_length hyperparameter controls how far in the
past the network can see, and the prediction_length hyperparameter controls how far in the future
predictions can be made. During training, the algorithm ignores training set elements containing time
series that are shorter than a specified prediction length. For example, five samples with context
lengths of 12 hours and prediction lengths of 6 hours might be drawn from element i. For brevity,
the feature time series x_{i,1,t} and u_{i,2,t} are omitted.
To capture seasonality patterns, DeepAR also automatically feeds lagged values from the target time
series. In the example with hourly frequency, for each time index t = T, the model exposes the z_{i,t} values,
which occurred approximately one, two, and three days in the past.
For inference, the trained model takes as input target time series, which might or might not have been
used during training, and forecasts a probability distribution for the next prediction_length values.
Because DeepAR is trained on the entire dataset, the forecast takes into account patterns learned from
similar time series.
For information on the mathematics behind DeepAR, see DeepAR: Probabilistic Forecasting with
Autoregressive Recurrent Networks.
DeepAR Hyperparameters
context_length The number of time-points that the model gets to see before
making the prediction. The value for this parameter should be
about the same as the prediction_length. The model also
receives lagged inputs from the target, so context_length can be
much smaller than typical seasonalities. For example, a daily time
series can have yearly seasonality. The model automatically includes
a lag of one year, so the context length can be shorter than a year.
The lag values that the model picks depend on the frequency of the
time series. For example, lag values for daily frequency are previous
week, 2 weeks, 3 weeks, 4 weeks, and year.
Required
epochs The maximum number of passes over the training data. The
optimal value depends on your data size and learning rate. See also
early_stopping_patience. Typical values range from 10 to
1000.
Required
prediction_length The number of time-steps that the model is trained to predict, also
called the forecast horizon. The trained model always generates
forecasts with this length. It can't generate longer forecasts. The
prediction_length is fixed when a model is trained and can't be changed later.
Required
time_freq The granularity of the time series in the dataset. Use time_freq to
select appropriate date features and lags. The model supports the
following basic frequencies. It also supports multiples of these basic
frequencies. For example, 5min specifies a frequency of 5 minutes.
• M: monthly
• W: weekly
• D: daily
• H: hourly
• min: every minute
Required
dropout_rate The dropout rate to use during training. The model uses zoneout
regularization: for each iteration, a random subset of hidden
neurons is not updated. Typical values are less than 0.2.
Optional
embedding_dimension The DeepAR model can learn group-level time series patterns when
a categorical grouping feature is provided. To do this, the model
learns an embedding vector of size embedding_dimension for
each group, capturing the common properties of all time series in
the group. A larger embedding_dimension allows the model to
capture more complex patterns. However, because increasing the
embedding_dimension increases the number of parameters in
the model, more training data is required to accurately learn these
parameters. Typical values for this parameter are between 10-100.
Optional
Default value: 10
learning_rate The learning rate used in training. Typical values range from 1e-4 to
1e-1.
Optional
mini_batch_size The size of mini-batches used during training. Typical values range
from 32 to 512.
Optional
num_cells The number of cells to use in each hidden layer of the RNN. Typical
values range from 30 to 100.
Optional
Default value: 40
num_eval_samples The number of samples that are used per time-series when
calculating test accuracy metrics. This parameter does not have any
influence on the training or the final model. In particular, the model
can be queried with a different number of samples. This parameter
only affects the reported accuracy scores on the test channel after
training. Smaller values result in faster evaluation, but then the
evaluation scores are typically worse and more uncertain. When
evaluating with higher quantiles, for example 0.95, it may be
important to increase the number of evaluation samples.
Optional
num_layers The number of hidden layers in the RNN. Typical values range from
1 to 4.
Optional
Default value: 2
test_quantiles Quantiles for which to calculate quantile loss on the test channel.
Optional
Default value: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that results in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The DeepAR algorithm reports three metrics, which are computed during training. When tuning a model,
choose one of these as the objective. For the objective, use either the forecast accuracy on a provided
test channel (recommended) or the training loss. For recommendations for the training/test split for the
DeepAR algorithm, see Best Practices for Using the DeepAR Algorithm (p. 1463).
test:RMSE: The root mean square error between the forecast and the actual target,
computed on the test set. Optimization direction: Minimize.
Tune a DeepAR model with the following hyperparameters. The hyperparameters that have the greatest
impact, listed in order from the most to least impactful, on DeepAR objective metrics are: epochs,
context_length, mini_batch_size, learning_rate, and num_cells.
Query a trained model by using the model's endpoint. The endpoint takes the following JSON request
format.
In the request, the instances field corresponds to the time series that should be forecast by the model.
If the model was trained with categories, you must provide a cat for each instance. If the model was
trained without the cat field, it should be omitted.
If the model was trained with a custom feature time series (dynamic_feat), you have to provide the
same number of dynamic_feat values for each instance. Each of them should have a length given by
length(target) + prediction_length, where the last prediction_length values correspond to
1472
Amazon SageMaker Developer Guide
Use Built-in Algorithms
the time points in the future that will be predicted. If the model was trained without custom feature time
series, the field should not be included in the request.
{
"instances": [
{
"start": "2009-11-01 00:00:00",
"target": [4.0, 10.0, "NaN", 100.0, 113.0],
"cat": [0, 1],
"dynamic_feat": [[1.0, 1.1, 2.1, 0.5, 3.1, 4.1, 1.2, 5.0, ...]]
},
{
"start": "2012-01-30",
"target": [1.0],
"cat": [2, 1],
"dynamic_feat": [[2.0, 3.1, 4.5, 1.5, 1.8, 3.2, 0.1, 3.0, ...]]
},
{
"start": "1999-01-30",
"target": [2.0, 1.0],
"cat": [1, 3],
"dynamic_feat": [[1.0, 0.1, -2.5, 0.3, 2.0, -1.2, -0.1, -3.0, ...]]
}
],
"configuration": {
"num_samples": 50,
"output_types": ["mean", "quantiles", "samples"],
"quantiles": ["0.5", "0.9"]
}
}
The following is the format of a response, where [...] are arrays of numbers:
{
"predictions": [
{
"quantiles": {
"0.9": [...],
"0.5": [...]
},
"samples": [...],
"mean": [...]
},
{
"quantiles": {
"0.9": [...],
"0.5": [...]
},
"samples": [...],
"mean": [...]
},
{
"quantiles": {
"0.9": [...],
"0.5": [...]
},
"samples": [...],
"mean": [...]
}
]
}
DeepAR has a response timeout of 60 seconds. When passing multiple time series in a single request,
the forecasts are generated sequentially. Because the forecast for each time series typically takes about
300 to 1000 milliseconds or longer, depending on the model size, passing too many time series in a
single request can cause timeouts. It's better to send fewer time series per request and send more
requests. Because the DeepAR algorithm uses multiple workers per instance, you can achieve much
higher throughput by sending multiple requests in parallel, as in the sketch that follows.
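A minimal sketch of parallel requests with boto3; the endpoint name and time series are placeholders.

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")

time_series_list = [
    {"start": "2009-11-01 00:00:00", "target": [4.0, 10.0, 12.0]},
    {"start": "2012-01-30 00:00:00", "target": [1.0, 2.0]},
]

def forecast(instance):
    body = {
        "instances": [instance],
        "configuration": {"num_samples": 50, "output_types": ["mean"]},
    }
    response = runtime.invoke_endpoint(
        EndpointName="deepar-endpoint",  # placeholder
        ContentType="application/json",
        Body=json.dumps(body),
    )
    return json.loads(response["Body"].read())

# Send one small request per time series, in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(forecast, time_series_list))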
By default, DeepAR uses one worker per CPU for inference, if there is sufficient memory per CPU. If the
model is large and there isn't enough memory to run a model on each CPU, the number of workers is
reduced. The number of workers used for inference can be overridden using the MODEL_SERVER_WORKERS
environment variable (for example, by setting MODEL_SERVER_WORKERS=1) when calling the
SageMaker CreateModel API.
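A minimal sketch of setting this variable through the low-level API; the model name, image URI, role ARN, and model data URL are placeholders.

import boto3

sm = boto3.client("sagemaker")
sm.create_model(
    ModelName="deepar-model",  # placeholder
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    PrimaryContainer={
        "Image": "<deepar-image-uri>",  # placeholder
        "ModelDataUrl": "s3://my-bucket/deepar/model.tar.gz",  # placeholder
        "Environment": {"MODEL_SERVER_WORKERS": "1"},
    },
)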
{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1],
"dynamic_feat": [[1.1, 1.2, 0.5, ..]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat":
[[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat":
[[1.3, 0.4]]}
Note
When creating the transformation job with CreateTransformJob, set the BatchStrategy
value to SingleRecord and set the SplitType value in the TransformInput configuration
to Line, as the default values currently cause runtime failures.
Similar to the hosted endpoint inference request format, the cat and the dynamic_feat fields for each
instance are required if both of the following are true:
• The model is trained on a dataset that contained both the cat and the dynamic_feat fields.
• The corresponding cardinality and num_dynamic_feat values used in the training job are not set
to "".
Unlike hosted endpoint inference, the configuration field is set once for the entire batch
inference job using an environment variable named DEEPAR_INFERENCE_CONFIG. The
value of DEEPAR_INFERENCE_CONFIG can be passed when the transform job is created by calling the
CreateTransformJob API. If DEEPAR_INFERENCE_CONFIG is missing in the container environment,
the inference container uses the following default:
{
"num_samples": 100,
"output_types": ["mean", "quantiles"],
"quantiles": ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
}
The output is also in JSON Lines format, with one line per prediction, in an order identical to the instance
order in the corresponding input file. Predictions are encoded as objects identical to the ones returned by
responses in online inference mode.
For example, here is a SageMaker CreateTransformJob request for a DeepAR job with a custom
DEEPAR_INFERENCE_CONFIG:
{
"BatchStrategy": "SingleRecord",
"Environment": {
"DEEPAR_INFERENCE_CONFIG" : "{ \"num_samples\": 200, \"output_types\": [\"mean\"] }",
...
},
"TransformInput": {
"SplitType": "Line",
...
},
"TransformOutput": {
"AssembleWith": "Line",
...
},
...
}
• IP Insights (p. 1476)—learns the usage patterns for IPv4 addresses. It is designed to capture
associations between IPv4 addresses and various entities, such as user IDs or account numbers.
• K-Means Algorithm (p. 1485)—finds discrete groupings within data, where members of a group are as
similar as possible to one another and as different as possible from members of other groups.
• Principal Component Analysis (PCA) Algorithm (p. 1493)—reduces the dimensionality (number of
features) within a dataset by projecting data points onto the first few principal components. The
objective is to retain as much information or variation as possible. For mathematicians, principal
components are eigenvectors of the data's covariance matrix.
• Random Cut Forest (RCF) Algorithm (p. 1497)—detects anomalous data points within a data set that
diverge from otherwise well-structured or patterned data.
IP Insights
Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage patterns for
IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such
as user IDs or account numbers. You can use it to identify a user attempting to log into a web service
from an anomalous IP address, for example. Or you can use it to identify an account that is attempting
to create computing resources from an unusual IP address. Trained IP Insights models can be hosted at an
endpoint for making real-time predictions or used for processing batch transforms.
SageMaker IP Insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage
patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker IP Insights
model returns a score that infers how anomalous the pattern of the event is. For example, when a user
attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might
decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the
IP Insights score into another machine learning model. For example, you can combine the IP Insights
score with other features to rank the findings of another security system, such as those from Amazon
GuardDuty.
The SageMaker IP Insights algorithm can also learn vector representations of IP addresses, known as
embeddings. You can use vector-encoded embeddings as features in downstream machine learning tasks
that use the information observed in the IP addresses. For example, you can use them in tasks such as
measuring similarities between IP addresses in clustering and visualization tasks.
Topics
• Input/Output Interface for the IP Insights Algorithm (p. 1476)
• EC2 Instance Recommendation for the IP Insights Algorithm (p. 1477)
• IP Insights Sample Notebooks (p. 1478)
• How IP Insights Works (p. 1478)
• IP Insights Hyperparameters (p. 1479)
• Tune an IP Insights Model (p. 1481)
• IP Insights Data Formats (p. 1483)
Input/Output Interface for the IP Insights Algorithm
The SageMaker IP Insights algorithm supports training and validation data channels. It uses the optional
validation channel to compute an area-under-curve (AUC) score on a predefined negative sampling
strategy. The AUC metric validates how well the model discriminates between positive and negative
samples. Training and validation data content types must be in text/csv format. The first column
of the CSV data is an opaque string that provides a unique identifier for the entity. The second column
is an IPv4 address in decimal-dot notation. IP Insights currently supports only File mode. For more
information and some examples, see IP Insights Training Data Formats (p. 1483).
EC2 Instance Recommendation for the IP Insights Algorithm
The following table lists the recommended instance types for inference, based on the values of the
num_entity_vectors and vector_dim hyperparameters. Cells are left blank where no recommendation
was given.

vector_dim \ num_entity_vectors | 10,000 | 50,000 | 100,000 | 500,000 | 1,000,000 | 5,000,000 | 10,000,000 | 50,000,000
32   | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large  | ml.m5.xlarge  | ml.m5.2xlarge | ml.m5.4xlarge
64   | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large  | ml.m5.2xlarge | ml.m5.2xlarge |
128  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large  | ml.m5.2xlarge | ml.m5.4xlarge |
256  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |               |
512  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge |              |               |
1024 | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |             |               |
2048 | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.xlarge |              |             |               |
How IP Insights Works
The IP Insights algorithm uses a neural network to learn the latent vector representations for entities
and IP addresses. Entities are first hashed to a large but fixed hash space and then encoded by a
simple embedding layer. Character strings such as user names or account IDs can be fed directly into
IP Insights as they appear in log files. You don't need to preprocess the data for entity identifiers. You
can provide entities as an arbitrary string value during both training and inference. The hash size should
be configured with a value that is high enough to ensure that the number of collisions, which occur
when distinct entities are mapped to the same latent vector, remains insignificant. For more information
about how to select appropriate hash sizes, see Feature Hashing for Large Scale Multitask Learning. For
representing IP addresses, on the other hand, IP Insights uses a specially designed encoder network to
uniquely represent each possible IPv4 address by exploiting the prefix structure of IP addresses.
During training, IP Insights automatically generates negative samples by randomly pairing entities and
IP addresses. These negative samples represent data that is less likely to occur in reality. The model
is trained to discriminate between positive samples that are observed in the training data and these
generated negative samples. More specifically, the model is trained to minimize the cross entropy, also
known as the log loss, defined as follows:

L = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log p_n + (1 - y_n) \log (1 - p_n) \right]

Here, y_n is the label that indicates whether the sample is from the real distribution governing observed
data (y_n = 1) or from the distribution generating negative samples (y_n = 0), and p_n is the probability
that the sample is from the real distribution, as predicted by the model.
Generating negative samples is an important process that is used to achieve an accurate model of
the observed data. If negative samples are extremely unlikely, for example, if all of the IP addresses
in negative samples are 10.0.0.0, then the model trivially learns to distinguish negative samples and
fails to accurately characterize the actual observed dataset. To keep negative samples more realistic,
IP Insights generates negative samples both by randomly generating IP addresses and randomly
picking IP addresses from training data. You can configure the type of negative sampling and the
rates at which negative samples are generated with the random_negative_sampling_rate and
shuffled_negative_sampling_rate hyperparameters.
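As a rough sketch, these hyperparameters might be set with the SageMaker Python SDK as follows; the role ARN, S3 paths, and values below are placeholders:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("ipinsights", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/SageMakerRole",        # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://amzn-s3-demo-bucket/ipinsights/output/",  # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    num_entity_vectors=20000,           # hash size; keep collisions insignificant
    vector_dim=128,
    random_negative_sampling_rate=1,    # negatives with randomly generated IP addresses
    shuffled_negative_sampling_rate=1,  # negatives drawn from the training data itself
)
estimator.fit({"train": TrainingInput(
    "s3://amzn-s3-demo-bucket/ipinsights/train/",  # placeholder
    content_type="text/csv")})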
Given the nth (entity, IP address) pair, the IP Insights model outputs a score, S_n, that indicates how
compatible the entity is with the IP address. This score corresponds to the log odds ratio of the pair
coming from the real distribution as compared to coming from the negative distribution. It is defined as
follows:

S_n = \log \frac{p((e_n, a_n) \mid \text{real})}{p((e_n, a_n) \mid \text{negative})}
The score is essentially a measure of the similarity between the vector representations of the nth entity
and IP address. It can be interpreted as how much more likely it would be to observe this event in reality
than in a randomly generated dataset. During training, the algorithm uses this score to calculate an
estimate of the probability of a sample coming from the real distribution, p_n, to use in the cross entropy
minimization, where:

p_n = \sigma(S_n) = \frac{1}{1 + e^{-S_n}}
IP Insights Hyperparameters
In the CreateTrainingJob request, you specify the training algorithm. You can also specify
algorithm-specific hyperparameters as string-to-string maps. The following table lists the
hyperparameters for the Amazon SageMaker IP Insights algorithm.
num_entity_vectors The number of entity vector representations (entity embedding vectors) to train.
Each entity in the training set is randomly assigned to one of these vectors using a hash function.
Required
vector_dim The size of embedding vectors to represent entities and IP addresses.
Required
batch_metrics_publish_interval The interval (in batches) at which training speed is printed.
Optional
epochs The number of passes over the training data. The optimal
value depends on your data size and learning rate. Typical
values range from 5 to 100.
Optional
Default value: 10
learning_rate The learning rate for the optimizer.
Optional
mini_batch_size The number of examples in each mini-batch.
Optional
num_ip_encoder_layers The number of fully connected layers used to encode the IP address embedding.
Optional
Default value: 1
random_negative_sampling_rate The number of random negative samples to generate per input sample.
Optional
Default value: 1
shuffled_negative_sampling_rate The number of shuffled negative samples to generate per input sample.
Optional
Default value: 1
weight_decay The weight decay coefficient used for regularization.
Optional
Tune an IP Insights Model
Automatic model tuning, also called hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The Amazon SageMaker IP Insights algorithm is an unsupervised learning algorithm that learns
associations between IP addresses and entities. The algorithm trains a discriminator model, which
learns to separate observed data points (positive samples) from randomly generated data points
(negative samples). Automatic model tuning on IP Insights helps you find the model that can most
accurately distinguish between unlabeled validation data and automatically generated negative samples.
The model accuracy on the validation dataset is measured by the area under the receiver operating
characteristic curve. This validation:discriminator_auc metric can take values between 0.0 and
1.0, where 1.0 indicates perfect accuracy.
You can tune the following hyperparameters for the SageMaker IP Insights algorithm.
Parameter Name | Parameter Type | Recommended Ranges
num_ip_encoder_layers | IntegerParameterRanges | MinValue: 1, MaxValue: 10
random_negative_sampling_rate | IntegerParameterRanges | MinValue: 0, MaxValue: 10
shuffled_negative_sampling_rate | IntegerParameterRanges | MinValue: 0, MaxValue: 10
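A tuning job over these ranges could be sketched as follows with the SageMaker Python SDK, reusing an IP Insights estimator like the one in the earlier sketch; the channel paths are placeholders:

from sagemaker.tuner import HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,  # an IP Insights estimator, as in the earlier sketch
    objective_metric_name="validation:discriminator_auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "num_ip_encoder_layers": IntegerParameter(1, 10),
        "random_negative_sampling_rate": IntegerParameter(0, 10),
        "shuffled_negative_sampling_rate": IntegerParameter(0, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({
    "train": "s3://amzn-s3-demo-bucket/ipinsights/train/",            # placeholder
    "validation": "s3://amzn-s3-demo-bucket/ipinsights/validation/",  # placeholder
})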
IP Insights Data Formats
This section provides examples of the available input and output data formats used by the IP Insights
algorithm during training and inference.
Topics
• IP Insights Training Data Formats (p. 1483)
• IP Insights Inference Data Formats (p. 1483)
IP Insights Training Data Formats
The following are the available data input formats for the IP Insights algorithm. Amazon SageMaker
built-in algorithms adhere to the common input training format described in Common Data Formats for
Training (p. 1290). However, the SageMaker IP Insights algorithm currently supports only the CSV data
input format.
INPUT: CSV
The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's
unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot
notation.
content-type: text/csv
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
IP Insights Inference Data Formats
The following are the available input and output formats for the IP Insights algorithm. Amazon
SageMaker built-in algorithms adhere to the common input inference format described in Common
Data Formats for Inference (p. 1293). However, the SageMaker IP Insights algorithm does not currently
support RecordIO format.
INPUT: CSV
The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's
unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot
notation.
content-type: text/csv
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
INPUT: JSON
JSON data can be provided in different formats. IP Insights follows the common SageMaker formats. For
more information about inference formats, see Common Data Formats for Inference (p. 1293).
content-type: application/json
{
"instances": [
{"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},
{"features": ["entity_id_2", "10.10.1.2"]}
]
}
INPUT: JSONLINES
The JSON Lines content type is useful for running batch transform jobs. For more information on
SageMaker inference formats, see Common Data Formats for Inference (p. 1293). For more information
on running batch transform jobs, see Use Batch Transform (p. 2421).
content-type: application/jsonlines

{"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}}
{"features": ["entity_id_2", "10.10.1.2"]}
OUTPUT: JSON
The default output of the SageMaker IP Insights algorithm is the dot_product between the input
entity and IP address. The dot_product signifies how compatible the model considers the entity and IP
address. The dot_product is unbounded. To make predictions about whether an event is anomalous,
set a threshold based on your defined distribution. For information about how to use the
dot_product for anomaly detection, see An Introduction to the SageMaker IP Insights Algorithm.
accept: application/json
{
"predictions": [
{"dot_product": 0.0},
{"dot_product": 2.0}
]
}
Advanced users can access the model's learned entity and IP embeddings by providing the additional
parameter verbose=True in the Accept header. You can use the entity_embedding
and ip_embedding for debugging, visualizing, and understanding the model. Additionally, you can use
these embeddings in other machine learning techniques, such as classification or clustering.
accept: application/json;verbose=True
{
"predictions": [
{
"dot_product": 0.0,
"entity_embedding": [1.0, 0.0, 0.0],
"ip_embedding": [0.0, 1.0, 0.0]
},
{
"dot_product": 2.0,
"entity_embedding": [1.0, 0.0, 1.0],
"ip_embedding": [1.0, 0.0, 1.0]
}
]
}
accept: application/jsonlines
{"dot_product": 0.0}
{"dot_product": 2.0}
{"dot_product": 0.0, "entity_embedding": [1.0, 0.0, 0.0], "ip_embedding": [0.0, 1.0, 0.0]}
{"dot_product": 2.0, "entity_embedding": [1.0, 0.0, 1.0], "ip_embedding": [1.0, 0.0, 1.0]}
K-Means Algorithm
K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where
members of a group are as similar as possible to one another and as different as possible from members
of other groups. You define the attributes that you want the algorithm to use to determine similarity.
Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared
with the original version of the algorithm, the version used by Amazon SageMaker is more accurate.
Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To
do this, the version used by Amazon SageMaker streams mini-batches (small, random subsets) of the
training data. For more information about mini-batch k-means, see Web-scale k-means Clustering.
The k-means algorithm expects tabular data, where rows represent the observations that you want to
cluster, and the columns represent attributes of the observations. The n attributes in each row represent
a point in n-dimensional space. The Euclidean distance between these points represents the similarity
of the corresponding observations. The algorithm groups observations with similar attribute values (the
points corresponding to these observations are closer together). For more information about how k-
means works in Amazon SageMaker, see How K-Means Clustering Works (p. 1486).
Topics
• Input/Output Interface for the K-Means Algorithm (p. 1485)
• EC2 Instance Recommendation for the K-Means Algorithm (p. 1486)
• K-Means Sample Notebooks (p. 1486)
• How K-Means Clustering Works (p. 1486)
• K-Means Hyperparameters (p. 1489)
• Tune a K-Means Model (p. 1491)
• K-Means Response Formats (p. 1492)
Input/Output Interface for the K-Means Algorithm
For training, the k-means algorithm expects data to be provided in the train channel (recommended
S3DataDistributionType=ShardedByS3Key), with an optional test channel (recommended
S3DataDistributionType=FullyReplicated) to score the data on. Both recordIO-wrapped-
protobuf and CSV formats are supported for training. You can use either File mode or Pipe mode to
train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.
For more information on input and output file formats, see K-Means Response Formats (p. 1492) for
inference and the K-Means Sample Notebooks (p. 1486). The k-means algorithm does not support
multiple instance learning, in which the training set consists of labeled “bags”, each of which is a
collection of unlabeled instances.
EC2 Instance Recommendation for the K-Means Algorithm
We recommend training k-means on CPU instances. You can train on GPU instances, but should limit
GPU training to single-GPU instances (such as ml.g4dn.xlarge) because only one GPU is used per
instance. The k-means algorithm supports P2, P3, G4dn, and G5 instances for training and inference.
K-Means Sample Notebooks
For a sample notebook that uses the SageMaker k-means algorithm to segment the population of
counties in the United States by attributes identified using principal component analysis, see Analyze
US census data for population segmentation using Amazon SageMaker. For instructions on how to create
and access Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon
SageMaker Notebook Instances (p. 204). Once you have created a notebook instance and opened it,
select the SageMaker Examples tab to see a list of all the SageMaker samples. To open a notebook, click
on its Use tab and select Create copy.
How K-Means Clustering Works
K-means is an algorithm that trains a model that groups similar objects together. The k-means algorithm
accomplishes this by mapping each observation in the input dataset to a point in the n-dimensional
space (where n is the number of attributes of the observation). For example, your dataset might contain
observations of temperature and humidity in a particular location, which are mapped to points (t, h) in 2-
dimensional space.
Note
Clustering algorithms are unsupervised. In unsupervised learning, labels that might be
associated with the objects in the training dataset aren't used.
In k-means clustering, each cluster has a center. During model training, the k-means algorithm uses the
distance of the point that corresponds to each observation in the dataset to the cluster centers as the
basis for clustering. You choose the number of clusters (k) to create.
For example, suppose that you want to create a model to recognize handwritten digits and you choose
the MNIST dataset for training. The dataset provides thousands of images of handwritten digits (0
through 9). In this example, you might choose to create 10 clusters, one for each digit (0, 1, …, 9). As
part of model training, the k-means algorithm groups the input images into 10 clusters.
Each image in the MNIST dataset is a 28x28-pixel image, with a total of 784 pixels. Each image
corresponds to a point in a 784-dimensional space, similar to a point in a 2-dimensional space (x,y). To
find a cluster to which a point belongs, the k-means algorithm finds the distance of that point from all of
the cluster centers. It then chooses the cluster with the closest center as the cluster to which the image
belongs.
Note
Amazon SageMaker uses a customized version of the algorithm where, instead of specifying that
the algorithm create k clusters, you might choose to improve model accuracy by specifying extra
cluster centers (K = k*x). However, the algorithm ultimately reduces these to k clusters.
In SageMaker, you specify the number of clusters when creating a training job. For more information, see
CreateTrainingJob. In the request body, you add the HyperParameters string map to specify the k
and extra_center_factor strings.
The following is a summary of how k-means works for model training in SageMaker:
Note
In the following topics, K clusters refer to k * x, where you specify k and x when creating a
model training job.
1. It determines the initial K cluster centers.
2. It iterates over input training data and recalculates cluster centers.
3. It reduces the resulting clusters to k (if the data scientist specified the creation of k*x clusters in the
request).
The following sections also explain some of the parameters that a data scientist might specify to
configure a model training job as part of the HyperParameters string map.
Topics
• Step 1: Determine the Initial Cluster Centers (p. 1487)
• Step 2: Iterate over the Training Dataset and Calculate Cluster Centers (p. 1488)
• Step 3: Reduce the Clusters from K to k (p. 1488)
Step 1: Determine the Initial Cluster Centers
When using k-means in SageMaker, the initial cluster centers are chosen from the observations in a
small, randomly sampled batch. Choose one of the following strategies to determine how these initial
cluster centers are selected:
• The random approach—Randomly choose K observations in your input dataset as cluster centers. For
example, you might choose any 10 images from the MNIST training dataset; the points in the
784-dimensional space that correspond to these images become the initial cluster centers.
• The k-means++ approach, which works as follows:
1. Start with one cluster and determine its center. You randomly select an observation from your
training dataset and use the point corresponding to the observation as the cluster center. For
example, in the MNIST dataset, randomly choose a handwritten digit image. Then choose the point
in the 784-dimensional space that corresponds to the image as your cluster center. This is cluster
center 1.
2. Determine the center for cluster 2. From the remaining observations in the training dataset, pick
an observation at random. Choose one that is different than the one you previously selected. This
observation corresponds to a point that is far away from cluster center 1. Using the MNIST dataset
as an example, you do the following:
• For each of the remaining images, find the distance of the corresponding point from cluster
center 1. Square the distance and assign a probability that is proportional to the square of the
distance. That way, an image that is different from the one that you previously selected has a
higher probability of getting selected as cluster center 2.
• Choose one of the images randomly, based on probabilities assigned in the previous step. The
point that corresponds to the image is cluster center 2.
3. Repeat Step 2 to find cluster center 3. This time, find the distances of the remaining images from
cluster center 2.
4. Repeat the process until you have the K cluster centers.
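The following is an illustrative sketch of this D²-weighted selection in Python with NumPy. It is a simplified version of k-means++ initialization, not SageMaker's internal implementation:

import numpy as np

def kmeans_plus_plus_init(data, K, rng=None):
    """Pick K initial centers: the first uniformly at random, each
    subsequent one with probability proportional to its squared
    distance from the closest center chosen so far."""
    rng = rng or np.random.default_rng()
    centers = [data[rng.integers(len(data))]]
    for _ in range(K - 1):
        # Squared distance from every point to its nearest existing center.
        d2 = np.min([((data - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d^2.
        centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.stack(centers)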
To train a model in SageMaker, you create a training job. In the request, you provide configuration
information by specifying HyperParameters string maps such as k, init_method, and
extra_center_factor.
For more information about the SageMaker k-means estimator, see K-means in the Amazon SageMaker
Python SDK documentation.
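For example, a minimal sketch with the SDK's KMeans estimator might look like the following. The role ARN and data are placeholders, and in the SDK the extra_center_factor hyperparameter is exposed as center_factor:

import numpy as np
from sagemaker import KMeans

kmeans = KMeans(
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.xlarge",
    k=10,                    # final number of clusters
    init_method="kmeans++",
    center_factor=2,         # train K = k * 2 centers, reduced to k at the end
)

train_data = np.random.rand(1000, 784).astype("float32")  # stand-in for real features
kmeans.fit(kmeans.record_set(train_data))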
Step 2: Iterate over the Training Dataset and Calculate Cluster Centers
The cluster centers that you created in the preceding step are mostly random, with some consideration
for the training dataset. In this step, you use the training dataset to move these centers toward the true
cluster centers. The algorithm iterates over the training dataset, and recalculates the K cluster centers.
1. Read a mini-batch of observations (a small, randomly chosen subset of all records) from the training
dataset and do the following.
Note
When creating a model training job, you specify the batch size in the mini_batch_size
string in the HyperParameters string map.
a. Assign each of the observations in the mini-batch to the cluster with the closest cluster center.
b. Calculate the number of observations assigned to each cluster. Then, calculate the proportion of
new points assigned per cluster. For example, consider the following clusters:
Cluster c1 = 100 previously assigned points. You added 25 points from the mini-batch in this step.
Cluster c2 = 150 previously assigned points. You added 40 points from the mini-batch in this step.
Cluster c3 = 450 previously assigned points. You added 5 points from the mini-batch in this step.
c. Compute the proportion of new points assigned to each cluster:
Proportion of new points in c1 = 25/(100 + 25) = 0.2
Proportion of new points in c2 = 40/(150 + 40) ≈ 0.21
Proportion of new points in c3 = 5/(450 + 5) ≈ 0.01
d. Compute the weighted average to find the updated cluster centers as follows:
Updated center of c1 = 0.8 × (previous center of c1) + 0.2 × (mean of the mini-batch points assigned to c1)
The centers of c2 and c3 are updated in the same way, using the proportions computed in step c as the
weights.
2. Read the next mini-batch, and repeat Step 1 to recalculate the cluster centers.
For more information about mini-batch k-means, see Web-scale k-means Clustering.
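An illustrative sketch of this per-mini-batch update, in the style of web-scale (mini-batch) k-means, is shown below; SageMaker's internal implementation may differ in detail:

import numpy as np

def minibatch_kmeans_step(centers, counts, batch):
    """Assign one mini-batch, then move each center by a weighted
    average of its previous position and the newly assigned points."""
    # Distance from every batch point to every center; pick the closest.
    d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    closest = d2.argmin(axis=1)
    for c in np.unique(closest):
        points = batch[closest == c]
        # Proportion of new points for this center (step c above).
        p = len(points) / (counts[c] + len(points))
        # Weighted average of old center and the new points' mean (step d).
        centers[c] = (1.0 - p) * centers[c] + p * points.mean(axis=0)
        counts[c] += len(points)
    return centers, counts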
Step 3: Reduce the Clusters from K to k
If the algorithm created K clusters—(K = k*x) where x is greater than 1—then it reduces the K clusters to
k clusters. (For more information, see extra_center_factor in the preceding discussion.) It does this
by applying Lloyd's method with k-means++ initialization to the K cluster centers. For more information
about Lloyd's method, see k-means clustering.
K-Means Hyperparameters
In the CreateTrainingJob request, you specify the training algorithm that you want to use. You
can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists
the hyperparameters for the k-means training algorithm provided by Amazon SageMaker. For more
information about how k-means clustering works, see How K-Means Clustering Works (p. 1486).
feature_dim The number of features in the input data.
Required
k The number of required clusters.
Required
epochs The number of passes done over the training data.
Optional
Default value: 1
eval_metrics A JSON list of metric types used to report a score for the model.
Allowed values are msd for Means Square Deviation and ssd
for Sum of Square Distance. If test data is provided, the score is
reported for each of the metrics requested.
Optional
extra_center_factor The algorithm creates K = k * extra_center_factor centers as it runs,
and reduces them to k centers at the end.
Optional
half_life_time_size Used to determine the weight given to an observation as more data points are
processed. With the default value of 0, all points count the same.
Optional
Default value: 0
init_method Method by which the algorithm chooses the initial cluster centers.
The standard k-means approach chooses them at random. An
alternative k-means++ method chooses the first cluster center at
random. Then it spreads out the position of the remaining initial
clusters by weighting the selection of centers with a probability
distribution that is proportional to the square of the distance of the
remaining data points from existing centers.
Optional
local_lloyd_init_method The initialization method for Lloyd's expectation-maximization (EM)
procedure used to build the final model containing k centers.
Optional
local_lloyd_max_iter The maximum number of iterations for Lloyd's expectation-maximization (EM) procedure.
Optional
local_lloyd_num_trials The number of times Lloyd's EM procedure is run.
Optional
local_lloyd_tol The tolerance for change in loss for early stopping of Lloyd's
expectation-maximization (EM) procedure used to build the final
model containing k centers.
Optional
mini_batch_size The number of observations per mini-batch for the data iterator.
Optional
Tune a K-Means Model
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
The Amazon SageMaker k-means algorithm is an unsupervised algorithm that groups data into clusters
whose members are as similar as possible. Because it is unsupervised, it doesn't use a validation dataset
against which hyperparameters can be optimized. It does, however, take a test dataset and emit metrics
that depend on the squared distance between the data points and the final cluster centroids at the end
of each training run. To find the model that reports the tightest clusters on the test dataset, you can use
a hyperparameter tuning job.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The k-means algorithm computes the following metrics during training. When tuning a model, choose
one of these metrics as the objective metric.
test:msd Mean squared distances between each record in the test set and the closest center of the model. (Minimize)
test:ssd Sum of the squared distances between each record in the test set and the closest center of the model. (Minimize)
Tune the Amazon SageMaker k-means model with the following hyperparameters. The
hyperparameters that have the greatest impact on k-means objective metrics are mini_batch_size,
extra_center_factor, and init_method.
K-Means Response Formats
All SageMaker built-in algorithms adhere to the common input inference format described in Common
Data Formats - Inference. This topic contains a list of the available output formats for the SageMaker k-
means algorithm.
ACCEPT: application/json
{
    "predictions": [
        {
            "closest_cluster": 1.0,
            "distance_to_cluster": 3.0
        },
        {
            "closest_cluster": 2.0,
            "distance_to_cluster": 5.0
        },
        ....
    ]
}
ACCEPT: application/x-recordio-protobuf

[
Record = {
features = {},
label = {
'closest_cluster': {
keys: [],
values: [1.0, 2.0] # float32
},
'distance_to_cluster': {
keys: [],
values: [3.0, 5.0] # float32
},
}
}
]
ACCEPT: text/csv

1.0,3.0
2.0,5.0
Principal Component Analysis (PCA) Algorithm
PCA is an unsupervised machine learning algorithm that reduces the dimensionality (number of
features) within a dataset while still retaining as much information as possible. Amazon SageMaker PCA
operates in either of two modes, depending on the scenario:
• regular: For datasets with sparse data and a moderate number of observations and features.
• randomized: For datasets with both a large number of observations and features. This mode uses an
approximation algorithm.
The rows represent observations you want to embed in a lower dimensional space. The columns
represent features that you want to find a reduced approximation for. The algorithm calculates the
covariance matrix (or an approximation thereof in a distributed manner), and then performs the singular
value decomposition on this summary to produce the principal components.
Topics
• Input/Output Interface for the PCA Algorithm (p. 1493)
• EC2 Instance Recommendation for the PCA Algorithm (p. 1494)
• PCA Sample Notebooks (p. 1494)
• How PCA Works (p. 1494)
• PCA Hyperparameters (p. 1495)
• PCA Response Formats (p. 1496)
Input/Output Interface for the PCA Algorithm
For training, PCA expects data provided in the train channel, and optionally supports a test dataset,
which is scored by the final algorithm. Both recordIO-wrapped-protobuf and CSV
formats are supported for training. You can use either File mode or Pipe mode to train models on data
that is formatted as recordIO-wrapped-protobuf or as CSV.
For more information on input and output file formats, see PCA Response Formats (p. 1496) for
inference and the PCA Sample Notebooks (p. 1494).
EC2 Instance Recommendation for the PCA Algorithm
PCA supports CPU and GPU instances for training and inference. Which instance type is most performant
depends heavily on the specifics of the input data. For GPU instances, PCA supports P2, P3, G4dn, and
G5.
PCA Sample Notebooks
For a sample notebook that shows how to use the SageMaker Principal Component Analysis algorithm to
analyze the images of handwritten digits from zero to nine in the MNIST dataset, see An Introduction to
PCA with MNIST. For instructions on how to create and access Jupyter notebook instances that you can use
to run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you have
created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the
SageMaker samples. The example PCA notebooks are located in the
Introduction to Amazon algorithms section. To open a notebook, click on its Use tab and select Create
copy.
How PCA Works
Principal Component Analysis (PCA) is a learning algorithm that reduces the dimensionality (number of
features) within a dataset while still retaining as much information as possible.
PCA reduces dimensionality by finding a new set of features called components, which are composites of
the original features, but are uncorrelated with one another. The first component accounts for the largest
possible variability in the data, the second component the second most variability, and so on.
Given an input matrix with rows x_1, ..., x_n, each of dimension 1 × d, the data is partitioned into
mini-batches of rows and distributed among the training nodes (workers). Each worker then computes a
summary of its data. The summaries of the different workers are then unified into a single solution at the
end of the computation.
Modes
The Amazon SageMaker PCA algorithm uses either of two modes to calculate these summaries,
depending on the situation:
• regular: for datasets with sparse data and a moderate number of observations and features.
• randomized: for datasets with both a large number of observations and features. This mode uses an
approximation algorithm.
As the algorithm's last step, it performs the singular value decomposition on the unified solution, from
which the principal components are then derived.
Mode 1: Regular
Use this mode when the dimension d of the vectors is small enough so that the d × d covariance matrix can fit in memory.
Mode 2: Randomized
When the number of features in the input dataset is large, we use a method to approximate
the covariance matrix. For every mini-batch of dimension b * d, we randomly initialize a
(num_components + extra_components) * b matrix that we multiply by each mini-batch
to create a (num_components + extra_components) * d matrix. The sum of these matrices
is computed by the workers, and the servers perform SVD on the final (num_components +
extra_components) * d matrix. Its top num_components right singular vectors are the
approximation of the top singular vectors of the input matrix.
Denote the different inputs to the server as B^(i), h^(i), s^(i), and n^(i). The server computes B, h, s, and n,
the sums of the respective inputs. It then computes

C = B - \frac{1}{n} h s^{T}

and finds its singular value decomposition. The top right singular vectors and singular values of C are
used as the approximate solution to the problem.
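As a rough sketch, the mode and the related hyperparameters can be set through the SDK's PCA estimator; the role ARN and data below are placeholders:

import numpy as np
from sagemaker import PCA

pca = PCA(
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.xlarge",
    num_components=10,
    algorithm_mode="randomized",  # approximate mode for many observations/features
    extra_components=10,          # widens the sketch for better accuracy
    subtract_mean=True,
)

train_data = np.random.rand(5000, 1000).astype("float32")  # stand-in for real data
pca.fit(pca.record_set(train_data))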
PCA Hyperparameters
In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific HyperParameters as string-to-string maps. The following table lists the hyperparameters for the
PCA training algorithm provided by Amazon SageMaker. For more information about how PCA works, see
How PCA Works (p. 1494).
feature_dim Input dimension.
Required
mini_batch_size Number of rows in a mini-batch.
Required
num_components The number of principal components to compute.
Required
algorithm_mode Mode for computing the principal components.
Optional
Valid values: regular or randomized
Default value: regular
extra_components As the value increases, the solution becomes more accurate but the
runtime and memory consumption increase linearly. The default,
-1, means the maximum of 10 and num_components. Valid for
randomized mode only.
Optional
Default value: -1
subtract_mean Indicates whether the data should be unbiased both during training
and at inference.
Optional
PCA Response Formats
All Amazon SageMaker built-in algorithms adhere to the common input inference format described
in Common Data Formats - Inference. This topic contains a list of the available output formats for the
SageMaker PCA algorithm.
Accept—application/json
{
"projections": [
{
"projection": [1.0, 2.0, 3.0, 4.0, 5.0]
},
{
"projection": [6.0, 7.0, 8.0, 9.0, 0.0]
},
....
]
}
Accept—application/jsonlines

{"projection": [1.0, 2.0, 3.0, 4.0, 5.0]}
{"projection": [6.0, 7.0, 8.0, 9.0, 0.0]}

Accept—application/x-recordio-protobuf
[
Record = {
features = {},
label = {
'projection': {
keys: [],
values: [1.0, 2.0, 3.0, 4.0, 5.0]
}
}
},
Record = {
features = {},
label = {
'projection': {
keys: [],
values: [1.0, 2.0, 3.0, 4.0, 5.0]
}
}
}
]
Random Cut Forest (RCF) Algorithm
Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous
data points within a data set—observations which diverge from otherwise well-structured or patterned
data. With each data point, RCF associates an anomaly score. Low score values indicate that the data point
is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of
"low" and "high" depend on the application, but common practice suggests that scores beyond three
standard deviations from the mean score are considered anomalous.
While there are many applications of anomaly detection algorithms to one-dimensional time series data
such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-
dimensional input. Amazon SageMaker RCF scales well with respect to number of features, data set size,
and number of instances.
Topics
• Input/Output Interface for the RCF Algorithm (p. 1497)
• Instance Recommendations for the RCF Algorithm (p. 1498)
• RCF Sample Notebooks (p. 1498)
• How RCF Works (p. 1499)
• RCF Hyperparameters (p. 1501)
• Tune an RCF Model (p. 1502)
• RCF Response Formats (p. 1503)
Input/Output Interface for the RCF Algorithm
The train channel only supports S3DataDistributionType=ShardedByS3Key and the test channel
only supports S3DataDistributionType=FullyReplicated. The following example specifies the S3
distribution type for the train channel using the Amazon SageMaker Python SDK.
Note
The sagemaker.inputs.s3_input method was renamed to
sagemaker.inputs.TrainingInput in SageMaker Python SDK v2.
import sagemaker
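
# A minimal sketch of the train channel input; the S3 path below is a placeholder.
train_input = sagemaker.inputs.TrainingInput(
    "s3://amzn-s3-demo-bucket/rcf/train/",  # placeholder training data location
    distribution="ShardedByS3Key",          # distribution type required by the train channel
    content_type="text/csv;label_size=0",   # unlabeled CSV training data
)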
To avoid common errors around execution roles, ensure that you have the execution roles required,
AmazonSageMakerFullAccess and AmazonEC2ContainerRegistryFullAccess. To avoid common
errors around your image not existing or its permissions being incorrect, ensure that your ECR image is
not larger than the allocated disk space on the training instance. To avoid this, run your training job on
an instance that has sufficient disk space. In addition, if your ECR image is from a different AWS account's
Amazon Elastic Container Registry (Amazon ECR) repository, and you do not set repository permissions to
grant access, this will result in an error. See the ECR repository permissions for more information on
setting a repository policy statement.
See the S3DataSource for more information on customizing the S3 data source attributes. Finally, in
order to take advantage of multi-instance training, the training data must be partitioned into at least as
many files as instances.
For more information on input and output file formats, see RCF Response Formats (p. 1503) for
inference and the RCF Sample Notebooks (p. 1498).
RCF Sample Notebooks
For a sample notebook that uses the SageMaker Random Cut Forest algorithm for anomaly detection,
see An Introduction to SageMaker Random Cut Forests. For instructions on how to create and access
Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon SageMaker
Notebook Instances (p. 204). Once you have created a notebook instance and opened it, select the
SageMaker Examples tab to see a list of all the SageMaker samples. To open a notebook, click on its Use
tab and select Create copy.
How RCF Works
The main idea behind the RCF algorithm is to create a forest of trees where each tree is obtained using
a partition of a sample of the training data. For example, a random sample of the input data is first
determined. The random sample is then partitioned according to the number of trees in the forest. Each
tree is given such a partition and organizes that subset of points into a k-d tree. The anomaly score
assigned to a data point by the tree is defined as the expected change in complexity of the tree as a
result of adding that point to the tree, which, in approximation, is inversely proportional to the resulting
depth of the point in the tree. The random cut forest assigns an anomaly score by computing the
average score from each constituent tree and scaling the result with respect to the sample size. The RCF
algorithm is based on the one described in reference [1].
Reservoir sampling is an algorithm for efficiently drawing random samples from a dataset
S = {s_1, ..., s_n}, where the elements in the dataset can only be observed one at a time or in batches. In
fact, reservoir sampling works even when n is not known a priori. If only one sample is requested, such as
when k = 1, the algorithm works like this: observe the elements of the stream one at a time and, with
probability 1/i, let the ith element replace the current sample. This algorithm selects a random sample
such that P(s_i is chosen) = 1/n for all i = 1, ..., n. When k > 1, the
algorithm is more complicated. Additionally, a distinction must be made between random sampling that
is with and without replacement. RCF performs an augmented reservoir sampling without replacement
on the training data based on the algorithms described in [2].
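For intuition, the single-sample case generalizes to the classic "Algorithm R" shown below. This is an illustrative sketch of standard reservoir sampling, not the augmented variant used by RCF:

import random

def reservoir_sample(stream, k, rng=random):
    """Uniformly sample k items, without replacement, from a stream
    of unknown length while keeping only k items in memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir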
Then, each partition is sent to an individual tree. The tree recursively organizes its partition into a binary
tree by partitioning the data domain into bounding boxes.
This procedure is best illustrated with an example. Suppose a tree is given the following two-dimensional
dataset. The corresponding tree is initialized to the root node:
A two-dimensional dataset where the majority of data lies in a cluster (blue) except for one anomalous
data point (orange). The tree is initialized with a root node.
The RCF algorithm organizes these data in a tree by first computing a bounding box of the data,
selecting a random dimension (giving more weight to dimensions with higher "variance"), and then
randomly determining the position of a hyperplane "cut" through that dimension. The two resulting
subspaces define their own subtree. In this example, the cut happens to separate a lone point from the
remainder of the sample. The first level of the resulting binary tree consists of two nodes: one consisting
of the subtree of points to the left of the initial cut, and the other representing the single point on
the right.
A random cut partitioning the two-dimensional dataset. An anomalous data point is more likely to lie
isolated in a bounding box at a smaller tree depth than other points.
Bounding boxes are then computed for the left and right halves of the data and the process is repeated
until every leaf of the tree represents a single data point from the sample. Note that if the lone point
is sufficiently far away then it is more likely that a random cut would result in point isolation. This
observation provides the intuition that tree depth is, loosely speaking, inversely proportional to the
anomaly score.
When performing inference using a trained RCF model, the final anomaly score is reported as the average
across scores reported by each tree. Note that it is often the case that the new data point does not
already reside in the tree. To determine the score associated with the new point, the data point is inserted
into the given tree and the tree is efficiently (and temporarily) reassembled in a manner equivalent
to the training process described above. That is, the resulting tree is as if the input data point were
a member of the sample used to construct the tree in the first place. The reported score is inversely
proportional to the depth of the input point within the tree.
Choose Hyperparameters
The primary hyperparameters used to tune the RCF model are num_trees and
num_samples_per_tree. Increasing num_trees has the effect of reducing the noise observed in
anomaly scores since the final score is the average of the scores reported by each tree. While the optimal
value is application-dependent, we recommend starting with 100 trees as a balance between score
noise and model complexity. Note that inference time is proportional to the number of trees. Although
training time is also affected, it is dominated by the reservoir sampling algorithm described above.
The parameter num_samples_per_tree is related to the expected density of anomalies in the dataset.
In particular, num_samples_per_tree should be chosen such that 1/num_samples_per_tree
approximates the ratio of anomalous data to normal data. For example, if 256 samples are used in each
tree then we expect our data to contain anomalies 1/256 or approximately 0.4% of the time. Again, an
optimal value for this hyperparameter is dependent on the application.
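These two hyperparameters can be set directly on the SDK's RandomCutForest estimator, sketched below with a placeholder role and stand-in data:

import numpy as np
from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_trees=100,             # starting point suggested above
    num_samples_per_tree=256,  # implies an expected anomaly rate of about 1/256
)

train_data = np.random.rand(10000, 4).astype("float32")  # stand-in for real data
rcf.fit(rcf.record_set(train_data))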
References
1. Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. "Robust random cut forest based
anomaly detection on streams." In International Conference on Machine Learning, pp. 2712-2721.
2016.
2. Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, and Al Geist. "Reservoir-based random
sampling with replacement from data stream." In Proceedings of the 2004 SIAM International
Conference on Data Mining, pp. 492-496. Society for Industrial and Applied Mathematics, 2004.
RCF Hyperparameters
In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the
Amazon SageMaker RCF algorithm. For more information, including recommendations on how to choose
hyperparameters, see How RCF Works (p. 1499).
feature_dim The number of features in the data set. (If you use the Random Cut Forest
estimator, this value is calculated for you and need not be specified.)
Required
eval_metrics A list of metrics used to score a labeled test data set. The following
metrics can be selected for output: accuracy and precision_recall_fscore.
Optional
num_samples_per_tree Number of random samples given to each tree from the training data set.
Optional
num_trees Number of trees in the forest.
Optional
Tune an RCF Model
Automatic model tuning, also known as hyperparameter tuning or hyperparameter optimization, finds
the best version of a model by running many jobs that test a range of hyperparameters on your dataset.
You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose
the objective metric from the metrics that the algorithm computes. Automatic model tuning searches
the hyperparameters chosen to find the combination of values that result in the model that optimizes
the objective metric.
The Amazon SageMaker RCF algorithm is an unsupervised anomaly-detection algorithm that requires
a labeled test dataset for hyperparameter optimization. RCF calculates anomaly scores for test data
points and then labels the data points as anomalous if their scores are beyond three standard deviations
from the mean score. This is known as the three-sigma limit heuristic. The F1-score is based on the
difference between calculated labels and actual labels. The hyperparameter tuning job finds the model
that maximizes that score. The success of hyperparameter optimization depends on the applicability of
the three-sigma limit heuristic to the test dataset.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The RCF algorithm computes the following metric during training. When tuning the model, choose this
metric as the objective metric.
test:f1 The F1-score on the test dataset, based on the difference between calculated labels and actual labels. (Maximize)
Parameter Name | Parameter Type | Recommended Ranges
num_samples_per_tree | IntegerParameterRanges | MinValue: 1, MaxValue: 2048
RCF Response Formats
All Amazon SageMaker built-in algorithms adhere to the common input inference format described in
Common Data Formats - Inference. Note that SageMaker Random Cut Forest supports both dense and
sparse JSON and RecordIO formats. This topic contains a list of the available output formats for the
SageMaker RCF algorithm.
ACCEPT: application/json

{
    "scores": [
        {"score": 0.02},
        {"score": 0.25}
    ]
}
ACCEPT: application/jsonlines

{"score": 0.02}
{"score": 0.25}
ACCEPT: application/x-recordio-protobuf

[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.02]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.25]  # float32
            }
        }
    }
]
• Image Classification - MXNet (p. 1506)—uses example data with answers (referred to as a supervised
algorithm). Use this algorithm to classify images.
• Image Classification - TensorFlow (p. 1517)—uses pretrained TensorFlow Hub models to fine-tune for
specific tasks (referred to as a supervised algorithm). Use this algorithm to classify images.
• Object Detection - MXNet (p. 1530)—detects and classifies objects in images using a single deep
neural network. It is a supervised learning algorithm that takes images as input and identifies all
instances of objects within the image scene.
• Object Detection - TensorFlow (p. 1541)—detects bounding boxes and object labels in an image. It is
a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow
models.
• Semantic Segmentation Algorithm (p. 1549)—provides a fine-grained, pixel-level approach to
developing computer vision applications.
Image Classification - TensorFlow: uses training and validation channels with image files (.jpg, .jpeg,
or .png) in File mode; runs on CPU or GPU instances; parallelizable, but only across multiple GPUs on a
single instance.
Image Classification - MXNet
The recommended input format for the Amazon SageMaker image classification algorithms is Apache
MXNet RecordIO. However, you can also use raw images in .jpg or .png format. Refer to this discussion
for a broad overview of efficient data preparation and loading for machine learning systems.
Note
To maintain better interoperability with existing deep learning frameworks, this differs from the
protobuf data formats commonly used by other Amazon SageMaker algorithms.
• Deep residual learning for image recognition Kaiming He, et al., 2016 IEEE Conference on Computer
Vision and Pattern Recognition
• ImageNet image database
• Image classification with Gluon-CV and MXNet
Topics
• Input/Output Interface for the Image Classification Algorithm (p. 1507)
• EC2 Instance Recommendation for the Image Classification Algorithm (p. 1509)
• Image Classification Sample Notebooks (p. 1509)
• How Image Classification Works (p. 1509)
• Image Classification Hyperparameters (p. 1510)
• Tune an Image Classification Model (p. 1516)
Input/Output Interface for the Image Classification Algorithm
Distributed training is supported for file mode and pipe mode. When using the RecordIO content type in
pipe mode, you must set the S3DataDistributionType of the S3DataSource to FullyReplicated.
The algorithm supports a fully replicated model where your data is copied onto each machine.
If you use the RecordIO format for training, specify both train and validation channels as values for
the InputDataConfig parameter of the CreateTrainingJob request. Specify one RecordIO (.rec)
file in the train channel and one RecordIO file in the validation channel. Set the content type for
both channels to application/x-recordio.
If you use the Image format for training, specify train, validation, train_lst,
and validation_lst channels as values for the InputDataConfig parameter of the
CreateTrainingJob request. Specify the individual image data (.jpg or .png files) for the train
and validation channels. Specify one .lst file in each of the train_lst and validation_lst
channels. Set the content type for all four channels to application/x-image.
Note
SageMaker reads the training and validation data separately from different channels, so you
must store the training and validation data in different folders.
A .lst file is a tab-separated file with three columns that contains a list of image files. The first column
specifies the image index, the second column specifies the class label index for the image, and the third
column specifies the relative path of the image file. The image index in the first column must be unique
across all of the images. The set of class label indices are numbered successively and the numbering
should start with 0. For example, 0 for the cat class, 1 for the dog class, and so on for additional classes.
5 1 your_image_directory/train_img_dog1.jpg
1000 0 your_image_directory/train_img_cat1.jpg
22 1 your_image_directory/train_img_dog2.jpg
The augmented manifest format enables you to do training in Pipe mode using image files without
needing to create RecordIO files. You need to specify both train and validation channels as values for
the InputDataConfig parameter of the CreateTrainingJob request. While using the format,
an S3 manifest file needs to be generated that contains the list of images and their corresponding
annotations. The manifest file format should be in JSON Lines format in which each line represents one
sample. The images are specified using the 'source-ref' tag that points to the S3 location of the
image. The annotations are provided under the "AttributeNames" parameter value as specified in the
CreateTrainingJob request. It can also contain additional metadata under the metadata tag, but
these are ignored by the algorithm. In the following example, the "AttributeNames" are contained
in the list of image and annotation references ["source-ref", "class"]. The corresponding label
value is "0" for the first image and “1” for the second image:
{"source-ref":"s3://image/filename1.jpg", "class":"0"}
{"source-ref":"s3://image/filename2.jpg", "class":"1", "class-metadata": {"class-name":
"cat", "type" : "groundtruth/image-classification"}}
The order of "AttributeNames" in the input files matters when training the ImageClassification
algorithm. It accepts piped data in a specific order, with image first, followed by label. So the
"AttributeNames" in this example are provided with "source-ref" first, followed by "class".
When using the ImageClassification algorithm with Augmented Manifest, the value of the
RecordWrapperType parameter must be "RecordIO".
Multi-label training is also supported by specifying a JSON array of values. The num_classes
hyperparameter must be set to match the total number of classes. There are two valid label formats:
multi-hot and class-id.
In the multi-hot format, each label is a multi-hot encoded vector of all classes, where each class takes
the value of 0 or 1. In the following example, there are three classes. The first image is labeled with
classes 0 and 2, while the second image is labeled with class 2 only:

{"source-ref":"s3://image/filename1.jpg", "class":"[1, 0, 1]"}
{"source-ref":"s3://image/filename2.jpg", "class":"[0, 0, 1]"}
In the class-id format, each label is a list of the class ids, from [0, num_classes), which apply to the data
point. The previous example would instead look like this:

{"source-ref":"s3://image/filename1.jpg", "class":"[0, 2]"}
{"source-ref":"s3://image/filename2.jpg", "class":"[2]"}
The multi-hot format is the default, but can be explicitly set in the content type with the label-format
parameter: "application/x-recordio; label-format=multi-hot". The class-id format, which
is the format outputted by GroundTruth, must be set explicitly: "application/x-recordio; label-
format=class-id".
For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).
Incremental Training
You can also seed the training of a new model with the artifacts from a model that you trained
previously with SageMaker. Incremental training saves training time when you want to train a new model
with the same or similar data. SageMaker image classification models can be seeded only with another
built-in image classification model trained in SageMaker.
To use a pretrained model, in the CreateTrainingJob request, specify the ChannelName as "model" in
the InputDataConfig parameter. Set the ContentType for the model channel to application/x-
sagemaker-model. The input hyperparameters of both the new model and the pretrained model that
you upload to the model channel must have the same settings for the num_layers, image_shape and
num_classes input parameters. These parameters define the network architecture. For the pretrained
model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker. You can use
either RecordIO or image formats for input data.
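A rough sketch of seeding a training job this way with the SageMaker Python SDK follows; the role ARN, region, S3 locations, and hyperparameter values are placeholders:

from sagemaker import image_uris
from sagemaker.estimator import Estimator

container = image_uris.retrieve("image-classification", "us-east-1")  # placeholder region

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # Artifacts of the previously trained model, delivered on the "model"
    # channel with content type application/x-sagemaker-model.
    model_uri="s3://amzn-s3-demo-bucket/previous-job/output/model.tar.gz",  # placeholder
    model_channel_name="model",
)
# These settings must match the seed model's network architecture.
estimator.set_hyperparameters(
    num_layers=18, image_shape="3,224,224", num_classes=10,
    num_training_samples=50000,
)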
For a sample notebook that shows how to use incremental training with the SageMaker image
classification algorithm, see the End-to-End Incremental Training Image Classification Example. For more
information on incremental training and for instructions on how to use it, see Incremental Training in
Amazon SageMaker (p. 2113).
The generated models can be hosted for inference and support encoded .jpg and .png image formats
as image/png, image/jpeg, and application/x-image content-type. The input image is resized
automatically. The output is the probability values for all classes encoded in JSON format, or in JSON
Lines text format for batch transform. The image classification model processes a single image per
request and so outputs only one line in the JSON or JSON Lines format. The following is an example of a
response in JSON Lines format:
accept: application/jsonlines

{"prediction": [prob_0, prob_1, prob_2, prob_3, ...]}
For more details on training and inference, see the image classification sample notebook instances
referenced in the introduction.
EC2 Instance Recommendation for the Image Classification Algorithm
For image classification, we support P2, P3, G4dn, and G5 instances. We recommend using GPU instances
with more memory for training with large batch sizes. You can also run the algorithm on multi-GPU and
multi-machine settings for distributed training. Both CPU (such as C4) and GPU (P2, P3, G4dn, or G5)
instances can be used for inference.
Image Classification Sample Notebooks
For a sample notebook that uses the SageMaker image classification algorithm to train a model on the
caltech-256 dataset and then to deploy it to perform inferences, see the End-to-End Multiclass Image
Classification Example. For instructions on how to create and access Jupyter notebook instances that you
can use to run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you
have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all
the SageMaker samples. The example image classification notebooks are located in the Introduction to
Amazon algorithms section. To open a notebook, click on its Use tab and select Create copy.
How Image Classification Works
The image classification algorithm takes an image as input and classifies it into one of the output
categories. Deep learning has revolutionized the image classification domain and has achieved great
performance. Various deep learning networks such as ResNet, DenseNet, Inception, and so on, have
been developed to be highly accurate for image classification. At the same time, there have been efforts
to collect labeled image data that are essential for training these networks. ImageNet is one such
large dataset that has more than 11 million images with about 11,000 categories. Once a network is
trained with ImageNet data, it can then be used to generalize with other datasets as well, by simple re-
adjustment or fine-tuning. In this transfer learning approach, a network is initialized with weights (in
this example, trained on ImageNet), which can be later fine-tuned for an image classification task in a
different dataset.
Image classification in Amazon SageMaker can be run in two modes: full training and transfer learning.
In full training mode, the network is initialized with random weights and trained on user data from
scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top
fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new
data. In this mode, training can be achieved even with a smaller dataset. This is because the network is
already trained and therefore can be used in cases without sufficient training data.
Image Classification Hyperparameters
Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Image Classification
algorithm. See Tune an Image Classification Model (p. 1516) for information on image classification
hyperparameter tuning.
num_classes Number of output classes. This parameter defines the dimensions of the network output.
Required
num_training_samples Number of training examples in the input dataset.
Required
augmentation_type Data augmentation type. The input images can be augmented in
multiple ways, as specified below.
• crop: Randomly crop the image and flip the image horizontally
• crop_color: In addition to ‘crop’, three random values in
the range [-36, 36], [-50, 50], and [-50, 50] are added to the
corresponding Hue-Saturation-Lightness channels respectively
• crop_color_transform: In addition to crop_color, random
transformations, including rotation, shear, and aspect ratio
variations are applied to the image. The maximum angle of
rotation is 10 degrees, the maximum shear ratio is 0.1, and the
maximum aspect changing ratio is 0.25.
Optional
beta_1 The beta1 for adam, that is the exponential decay rate for the first
moment estimates.
Optional
beta_2 The beta2 for adam, that is the exponential decay rate for the
second moment estimates.
Optional
checkpoint_frequency Period to store model parameters, in number of epochs.
Note that all checkpoint files are saved as part of the final model
file "model.tar.gz" and uploaded to S3 to the specified model
location. This increases the size of the model file proportionally to
the number of checkpoints saved during training.
Optional
early_stopping True to use early stopping logic during training. False not to use
it.
Optional
early_stopping_min_epochs The minimum number of epochs that must be run before the early
stopping logic can be invoked.
Optional
Default value: 10
early_stopping_patience The number of epochs to wait before ending training if no
improvement is made in the relevant metric.
Optional
Default value: 5
early_stopping_tolerance Relative tolerance used to measure whether an accuracy improvement
is meaningful for early stopping.
Optional
epochs Number of training epochs.
Optional
Default value: 30
eps The epsilon for adam and rmsprop. It is usually set to a small value
to avoid division by 0.
Optional
gamma The gamma for rmsprop, the decay factor for the moving average
of the squared gradient.
Optional
1512
Amazon SageMaker Developer Guide
Use Built-in Algorithms
image_shape The input image dimensions, which is the same size as the input
layer of the network. The format is defined as 'num_channels,
height, width'. The image dimension can take on any value as the
network can handle varied dimensions of the input. However,
there may be memory constraints if a larger image dimension is
used. Pretrained models can use only a fixed 224 x 224 image size.
Typical image dimensions for image classification are '3,224,224'.
This is similar to the ImageNet dataset.
Optional
Optional
Optional
1513
Amazon SageMaker Developer Guide
Use Built-in Algorithms
lr_scheduler_factor The ratio to reduce learning rate used in conjunction with the
lr_scheduler_step parameter, defined as lr_new = lr_old *
lr_scheduler_factor.
Optional
Optional
Optional
Default value: 32
momentum The momentum for sgd and nag, ignored for other optimizers.
Optional
multi_label Flag to use for multi-label classification where each sample can
be assigned multiple labels. Average accuracy across all classes is
logged.
Optional
Valid values: 0 or 1
Default value: 0
1514
Amazon SageMaker Developer Guide
Use Built-in Algorithms
num_layers Number of layers for the network. For data with large image size
(for example, 224x224 - like ImageNet), we suggest selecting the
number of layers from the set [18, 34, 50, 101, 152, 200]. For data
with small image size (for example, 28x28 - like CIFAR), we suggest
selecting the number of layers from the set [20, 32, 44, 56, 110].
The number of layers in each set is based on the ResNet paper. For
transfer learning, the number of layers defines the architecture of
base network and hence can only be selected from the set [18, 34,
50, 101, 152, 200].
Optional
Valid values: positive integer in [18, 34, 50, 101, 152, 200] or [20,
32, 44, 56, 110]
optimizer The optimizer type. For more details of the parameters for the
optimizers, please refer to MXNet's API.
Optional
precision_dtype The precision of the weights used for training. The algorithm can
use either single precision (float32) or half precision (float16)
for the weights. Using half-precision for weights results in reduced
memory consumption.
Optional
resize The number of pixels in the shortest side of an image after resizing
it for training. If the parameter is not set, then the training data is
used without resizing. The parameter should be larger than both
the width and height components of image_shape to prevent
training failure.
1515
Amazon SageMaker Developer Guide
Use Built-in Algorithms
top_k Reports the top-k accuracy during training. This parameter has to
be greater than 1, since the top-1 training accuracy is the same as
the regular training accuracy that has already been reported.
Optional
use_pretrained_model Flag to use pre-trained model for training. If set to 1, then the
pretrained model with the corresponding number of layers is
loaded and used for training. Only the top FC layer are reinitialized
with random weights. Otherwise, the network is trained from
scratch.
Optional
Valid values: 0 or 1
Default value: 0
Optional
Valid values: 0 or 1
Default value: 0
weight_decay The coefficient weight decay for sgd and nag, ignored for other
optimizers.
Optional
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is
computed during training. When tuning the model, choose this metric as the objective metric.
Tune an image classification model with the following hyperparameters. The hyperparameters that have
the greatest impact on image classification objective metrics are: mini_batch_size, learning_rate,
and optimizer. Tune the optimizer-related hyperparameters, such as momentum, weight_decay,
beta_1, beta_2, eps, and gamma, based on the selected optimizer. For example, use beta_1 and
beta_2 only when adam is the optimizer.
For more information about which hyperparameters are used in each optimizer, see Image Classification
Hyperparameters (p. 1510).
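For illustration, the following is a minimal sketch of launching such a tuning job with the SageMaker Python SDK. The ic_estimator, train_input, and validation_input objects are assumed to be configured beforehand for the built-in image classification algorithm, and the objective metric name is assumed to be the accuracy metric that the algorithm reports.
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

# Search ranges for the hyperparameters with the greatest impact.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.0001, 0.1, scaling_type="Logarithmic"),
    "mini_batch_size": CategoricalParameter([16, 32, 64]),
    "optimizer": CategoricalParameter(["sgd", "adam"]),
}

tuner = HyperparameterTuner(
    estimator=ic_estimator,                       # assumed, configured elsewhere
    objective_metric_name="validation:accuracy",  # assumed objective metric name
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": train_input, "validation": validation_input})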
Topics
• How to use the SageMaker Image Classification - TensorFlow algorithm (p. 1518)
• Input and output interface for the Image Classification - TensorFlow algorithm (p. 1519)
• Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm (p. 1520)
• Image Classification - TensorFlow sample notebooks (p. 1521)
• How Image Classification - TensorFlow Works (p. 1521)
• TensorFlow Hub Models (p. 1521)
• Image Classification - TensorFlow Hyperparameters (p. 1526)
• Tune an Image Classification - TensorFlow model (p. 1529)
You can use Image Classification - TensorFlow as an Amazon SageMaker built-in algorithm. The following
section describes how to use Image Classification - TensorFlow with the SageMaker Python SDK. For
information on how to use Image Classification - TensorFlow from the Amazon SageMaker Studio UI, see
SageMaker JumpStart (p. 47).
The Image Classification - TensorFlow algorithm supports transfer learning using any of the compatible
pretrained TensorFlow Hub models. For a list of all available pretrained models, see TensorFlow
Hub Models (p. 1521). Every pretrained model has a unique model_id. The following example
uses MobileNet V2 1.00 224 (model_id: tensorflow-ic-imagenet-mobilenet-v2-100-224-
classification-4) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded
from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network
isolation. Use these pre-generated model training artifacts to construct a SageMaker Estimator.
First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the
hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and
their default values with hyperparameters.retrieve_default. For more information, see Image
Classification - TensorFlow Hyperparameters (p. 1526). Use these values to construct a SageMaker
Estimator.
Note
Default hyperparameter values are different for different models. For larger models, the default
batch size is smaller and the train_only_top_layer hyperparameter is set to "True".
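As a sketch, these retrieval steps might look like the following with the SageMaker Python SDK. The aws_role value and the transfer_learning.py entry point are assumptions for illustration, and s3_output_location is defined in the data-location snippet below.
from sagemaker import hyperparameters, image_uris, model_uris, script_uris
from sagemaker.estimator import Estimator

model_id, model_version = (
    "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4",
    "*",
)
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image, training script, and pretrained model artifacts.
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Retrieve the default hyperparameters and override any of them as needed.
training_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
training_hyperparameters["epochs"] = "5"

ic_estimator = Estimator(
    role=aws_role,                       # assumed IAM role with SageMaker permissions
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",  # assumed script name within train_source_uri
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=training_hyperparameters,
    output_path=s3_output_location,      # defined in the next snippet
)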
This example uses the tf_flowers dataset, which contains five classes of flower images. We pre-
downloaded the dataset from TensorFlow under the Apache 2.0 license and made it available in Amazon S3. To fine-tune your model, call .fit using the Amazon S3 location of your training dataset.
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ic-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
Input and output interface for the Image Classification - TensorFlow algorithm
Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset with
any number of image classes. Be mindful of how to format your training data for input to the Image
Classification - TensorFlow model.
• Training data input format: Your training data should be a directory with as many subdirectories as
the number of classes. Each subdirectory should contain images belonging to that class in .jpg, .jpeg,
or .png format.
The following is an example of an input directory structure. This example dataset has two
classes: roses and dandelion. The image files in each class folder can have any name. The
input directory should be hosted in an Amazon S3 bucket with a path similar to the following:
s3://bucket_name/input_directory/. Note that the trailing / is required.
input_directory
|--roses
|--abc.jpg
|--def.jpg
|--dandelion
|--ghi.jpg
|--jkl.jpg
Trained models output label mapping files that map class folder names to the indices in the list of
output class probabilities. This mapping is in alphabetical order. For example, in the preceding example,
the dandelion class is index 0 and the roses class is index 1.
After training, you have a fine-tuned model that you can further train using incremental training
or deploy for inference. The Image Classification - TensorFlow algorithm automatically adds a pre-
processing and post-processing signature to the fine-tuned model so that it can take in images as input
and return class probabilities. The file mapping class indices to class labels is saved along with the
models.
Incremental training
You can seed the training of a new model with artifacts from a model that you trained previously with
SageMaker. Incremental training saves training time when you want to train a new model with the same
or similar data.
Note
You can only seed a SageMaker Image Classification - TensorFlow model with another Image
Classification - TensorFlow model trained in SageMaker.
You can use any dataset for incremental training, as long as the set of classes remains the same. The
incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained
model, you start with an existing fine-tuned model. For an example of incremental training with the
SageMaker Image Classification - TensorFlow algorithm, see the Introduction to SageMaker TensorFlow -
Image Classification sample notebook.
Running inference results in probability values, class labels for all classes, and the predicted label corresponding to the class index with the highest probability, encoded in JSON format. The Image Classification - TensorFlow model processes a single image per request and outputs only one line. To receive this verbose response, set the accept header to application/json;verbose. If accept is set to application/json, then the model only outputs probabilities. For more information on training and inference with the Image Classification - TensorFlow algorithm, see the Introduction to SageMaker TensorFlow - Image Classification sample notebook.
Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm
The Image Classification - TensorFlow algorithm supports all CPU and GPU instances for training,
including:
• ml.p2.xlarge
• ml.p2.16xlarge
• ml.p3.2xlarge
• ml.p3.16xlarge
• ml.g4dn.xlarge
• ml.g4dn.16xlarge
• ml.g5.xlarge
• ml.g5.48xlarge
We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as
M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference.
For more information about how to use the SageMaker Image Classification - TensorFlow algorithm
for transfer learning on a custom dataset, see the Introduction to SageMaker TensorFlow - Image
Classification notebook.
For instructions how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.
The Image Classification - TensorFlow algorithm takes an image as input and classifies it into one of
the output class labels. Various deep learning networks such as MobileNet, ResNet, Inception, and
EfficientNet are highly accurate for image classification. There are also deep learning networks that
are trained on large image datasets, such as ImageNet, which has over 11 million images and almost
11,000 classes. After a network is trained with ImageNet data, you can then fine-tune the network on
a dataset with a particular focus to perform more specific classification tasks. The Amazon SageMaker
Image Classification - TensorFlow algorithm supports transfer learning on many pretrained models that
are available in the TensorFlow Hub.
According to the number of class labels in your training data, a classification layer is attached to the
pretrained TensorFlow Hub model of your choice. The classification layer consists of a dropout layer, a
dense layer, and a fully-connected layer with 2-norm regularizer that is initialized with random weights.
The model has hyperparameters for the dropout rate of the dropout layer and the L2 regularization
factor for the dense layer. You can then fine-tune either the entire network (including the pretrained
model) or only the top classification layer on new training data. With this method of transfer learning,
training with smaller datasets is possible.
Many pretrained models are available to use for transfer learning with the Image Classification - TensorFlow algorithm. These models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.
Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Image Classification -
TensorFlow algorithm. See Tune an Image Classification - TensorFlow model (p. 1529) for information on
hyperparameter tuning.
augmentation_random_flip Indicates which flip mode to use for data augmentation when
augmentation is set to "True". For more information, see
RandomFlip in the TensorFlow documentation.
augmentation_random_zoom Indicates how much vertical zoom to use for data augmentation
when augmentation is set to "True". Positive values zoom
out while negative values zoom in. 0 means no zoom. For more
information, see RandomZoom in the TensorFlow documentation.
batch_size The batch size for training. For training on instances with multiple
GPUs, this batch size is used across the GPUs.
beta_1 The beta1 for the "adam" optimizer. Represents the exponential
decay rate for the first moment estimates. Ignored for other
optimizers.
beta_2 The beta2 for the "adam" optimizer. Represents the exponential
decay rate for the second moment estimates. Ignored for other
optimizers.
dropout_rate The dropout rate for the dropout layer in the top classification layer.
early_stopping_patience The number of epochs with no improvement after which training is stopped.
Default value: 5.
epochs The number of training epochs.
Default value: 3.
optimizer The optimizer type. For more information, see Optimizers in the
TensorFlow documentation.
regularizers_l2 The L2 regularization factor for the dense layer in the classification
layer.
rho The discounting factor for the gradient of the "adadelta" and
"rmsprop" optimizers. Ignored for other optimizers.
train_only_top_layer If "True", only the top classification layer parameters are fine-
tuned. If "False", all model parameters are fine-tuned.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is
computed during training. When tuning the model, choose this metric as the objective metric.
Tune an image classification model with the following hyperparameters. The hyperparameters that
have the greatest impact on image classification objective metrics are: batch_size, learning_rate,
and optimizer. Tune the optimizer-related hyperparameters, such as momentum, regularizers_l2,
beta_1, beta_2, and eps based on the selected optimizer. For example, use beta_1 and beta_2
only when adam is the optimizer.
For more information about which hyperparameters are used for each optimizer, see Image
Classification - TensorFlow Hyperparameters (p. 1526).
The SageMaker Object Detection algorithm supports both RecordIO (application/x-recordio) and
image (image/png, image/jpeg, and application/x-image) content types for training in file mode
and supports RecordIO (application/x-recordio) for training in pipe mode. However, you can also
train in pipe mode using the image files (image/png, image/jpeg, and application/x-image),
without creating RecordIO files, by using the augmented manifest format. The recommended input
format for the Amazon SageMaker object detection algorithms is Apache MXNet RecordIO. However, you
can also use raw images in .jpg or .png format. The algorithm supports only application/x-image for
inference.
Note
To maintain better interoperability with existing deep learning frameworks, this differs from the
protobuf data formats commonly used by other Amazon SageMaker algorithms.
See the Object Detection Sample Notebooks (p. 1533) for more details on data formats.
If you use the RecordIO format for training, specify both train and validation channels as values for the
InputDataConfig parameter of the CreateTrainingJob request. Specify one RecordIO (.rec) file in
the train channel and one RecordIO file in the validation channel. Set the content type for both channels
to application/x-recordio. An example of how to generate a RecordIO file can be found in the object detection sample notebook. You can also use tools from MXNet's GluonCV to generate RecordIO files
for popular datasets like the PASCAL Visual Object Classes and Common Objects in Context (COCO).
If you use the image format for training, specify train, validation, train_annotation,
and validation_annotation channels as values for the InputDataConfig parameter of the CreateTrainingJob request. Specify the individual image data (.jpg or .png) files for the train and
validation channels. For annotation data, you can use the JSON format. Specify the corresponding .json
files in the train_annotation and validation_annotation channels. Set the content type for all
four channels to image/png or image/jpeg based on the image type. You can also use the content
type application/x-image when your dataset contains both .jpg and .png images. The following is an
example of a .json file.
{
"file": "your_image_directory/sample_image1.jpg",
"image_size": [
{
"width": 500,
"height": 400,
"depth": 3
}
],
"annotations": [
{
"class_id": 0,
"left": 111,
"top": 134,
"width": 61,
"height": 128
},
{
"class_id": 0,
"left": 161,
"top": 250,
"width": 79,
"height": 143
},
{
"class_id": 1,
"left": 101,
"top": 185,
"width": 42,
"height": 130
}
],
"categories": [
{
"class_id": 0,
"name": "dog"
},
{
"class_id": 1,
"name": "cat"
}
]
}
Each image needs a .json file for annotation, and the .json file should have the same name as the
corresponding image. The name of the above .json file should be "sample_image1.json". There are four
properties in the annotation .json file. The property "file" specifies the relative path of the image file.
For example, if your training images and corresponding .json files are stored in s3://your_bucket/
train/sample_image and s3://your_bucket/train_annotation, specify the path for your train and
train_annotation channels as s3://your_bucket/train and s3://your_bucket/train_annotation,
respectively.
In the .json file, the relative path for an image named sample_image1.jpg should be sample_image/
sample_image1.jpg. The "image_size" property specifies the overall image dimensions. The
SageMaker object detection algorithm currently only supports 3-channel images. The "annotations"
property specifies the categories and bounding boxes for objects within the image. Each object is
annotated by a "class_id" index and by four bounding box coordinates ("left", "top", "width",
"height"). The "left" (x-coordinate) and "top" (y-coordinate) values represent the upper-left corner
of the bounding box. The "width" (x-coordinate) and "height" (y-coordinate) values represent the
dimensions of the bounding box. The origin (0, 0) is the upper-left corner of the entire image. If you
have multiple objects within one image, all the annotations should be included in a single .json file. The
"categories" property stores the mapping between the class index and class name. The class indices
should be numbered successively and the numbering should start with 0. The "categories" property is
optional for the annotation .json file.
The augmented manifest format enables you to do training in pipe mode using image files without
needing to create RecordIO files. You need to specify both train and validation channels as values for
the InputDataConfig parameter of the CreateTrainingJob request. While using the format,
an S3 manifest file needs to be generated that contains the list of images and their corresponding
annotations. The manifest file format should be in JSON Lines format in which each line represents one
sample. The images are specified using the 'source-ref' tag that points to the S3 location of the
image. The annotations are provided under the "AttributeNames" parameter value as specified in the
CreateTrainingJob request. It can also contain additional metadata under the metadata tag, but these are ignored by the algorithm. In the following example, the "AttributeNames" are contained in the list ["source-ref", "bounding-box"]:
The order of "AttributeNames" in the input files matters when training the Object Detection
algorithm. It accepts piped data in a specific order, with image first, followed by annotations. So the
"AttributeNames" in this example are provided with "source-ref" first, followed by "bounding-box".
When using Object Detection with Augmented Manifest, the value of parameter RecordWrapperType
must be set as "RecordIO".
For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).
Incremental Training
You can also seed the training of a new model with the artifacts from a model that you trained
previously with SageMaker. Incremental training saves training time when you want to train a new model
with the same or similar data. SageMaker object detection models can be seeded only with another built-
in object detection model trained in SageMaker.
To use a pretrained model, in the CreateTrainingJob request, specify the ChannelName as "model"
in the InputDataConfig parameter. Set the ContentType for the model channel to application/
x-sagemaker-model. The input hyperparameters of both the new model and the pretrained model
that you upload to the model channel must have the same settings for the base_network and
num_classes input parameters. These parameters define the network architecture. For the pretrained
model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker. You can use
either RecordIO or image formats for input data.
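As a sketch with the SageMaker Python SDK, the model channel could be wired up as follows; the artifact path and the od_estimator, train_input, and validation_input objects are assumptions for illustration.
from sagemaker.inputs import TrainingInput

# Previously trained model artifacts, used to seed the new training job.
model_channel = TrainingInput(
    "s3://your_bucket/previous-job/output/model.tar.gz",  # assumed artifact path
    content_type="application/x-sagemaker-model",
    input_mode="File",
)

od_estimator.fit(
    {"train": train_input, "validation": validation_input, "model": model_channel}
)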
For more information on incremental training and for instructions on how to use it, see Incremental
Training in Amazon SageMaker (p. 2113).
The object detection algorithm supports P2, P3, G4dn, and G5 GPU instance families. We recommend
using GPU instances with more memory for training with large batch sizes. You can run the object
detection algorithm on multi-GPU and multi-machine settings for distributed training.
You can use both CPU (such as C5 and M5) and GPU (such as P3 and G4dn) instances for inference.
For a sample notebook that shows how to use the SageMaker Object Detection algorithm to train and
host a model on the
Caltech Birds (CUB 200 2011) dataset using the Single Shot multibox Detector algorithm, see Amazon
SageMaker Object Detection for Bird Species. For instructions how to create and access Jupyter notebook
instances that you can use to run the example in SageMaker, see Amazon SageMaker Notebook
Instances (p. 204). Once you have created a notebook instance and opened it, select the SageMaker
Examples tab to see a list of all the SageMaker samples. The object detection example notebook using
the Object Detection algorithm is located in the Introduction to Amazon Algorithms section. To open a
notebook, choose its Use tab and choose Create copy.
The object detection algorithm identifies and locates all instances of objects in an image from a known
collection of object categories. The algorithm takes an image as input and outputs the category that
the object belongs to, along with a confidence score that it belongs to the category. The algorithm
also predicts the object's location and scale with a rectangular bounding box. Amazon SageMaker
Object Detection uses the Single Shot multibox Detector (SSD) algorithm that takes a convolutional
neural network (CNN) pretrained for a classification task as the base network. SSD uses the output of
intermediate layers as features for detection.
Various CNNs such as VGG and ResNet have achieved great performance on the image classification task.
Object detection in Amazon SageMaker supports both VGG-16 and ResNet-50 as a base network for SSD.
The algorithm can be trained in full training mode or in transfer learning mode. In full training mode, the
base network is initialized with random weights and then trained on user data. In transfer learning mode,
the base network weights are loaded from pretrained models.
The object detection algorithm uses standard data augmentation operations, such as flip, rescale, and
jitter, on the fly internally to help avoid overfitting.
In the CreateTrainingJob request, you specify the training algorithm that you want to use. You
can also specify algorithm-specific hyperparameters that are used to help estimate the parameters of
the model from a training dataset. The following table lists the hyperparameters provided by Amazon
SageMaker for training the object detection algorithm. For more information about how object training
works, see How Object Detection Works (p. 1534).
num_classes The number of output classes. This parameter defines the dimensions of the network output.
Required
num_training_samples The number of training examples in the input dataset.
Required
base_network The base network architecture to use.
Optional
early_stopping True to use early stopping logic during training. False not to use it.
Optional
early_stopping_min_epochs The minimum number of epochs that must be run before the early stopping logic can be invoked. Used only when early_stopping = True.
Optional
Default value: 10
early_stopping_patience The number of epochs to wait before ending training if no improvement is made in the relevant metric. Used only when early_stopping = True.
Optional
Default value: 5
image_shape The image size for input images. We rescale the input image to a square image with this size. We recommend using 300 and 512 for better performance.
Optional
Default: 300
epochs The number of training epochs.
Optional
Default: 30
freeze_layer_pattern The regular expression (regex) for freezing layers in the base network. For example, if we set freeze_layer_pattern = "^(conv1_|conv2_).*", then any layers with a name that contains "conv1_" or "conv2_" are frozen, which means that the weights for these layers are not updated during training. The layer names can be found in the network symbol files vgg16-symbol.json and resnet-50-symbol.json. Freezing a layer means that its weights cannot be modified further. This can reduce training time significantly in exchange for modest losses in accuracy. This technique is commonly used in transfer learning where the lower layers in the base network do not need to be retrained.
Optional
kv_store The weight update synchronization mode used for distributed training.
Optional
Default: -
label_width The force padding label width used to sync across training and validation data. For example, if one image in the data contains at most 10 objects, and each object's annotation is specified with 5 numbers, [class_id, left, top, width, height], then the label_width should be no smaller than (10*5 + header information length). The header information length is usually 2. We recommend using a slightly larger label_width for the training, such as 60 for this example.
Optional
Default: 350
learning_rate The initial learning rate.
Optional
Default: 0.001
lr_scheduler_factor The ratio to reduce the learning rate. Used in conjunction with the lr_scheduler_step parameter, defined as lr_new = lr_old * lr_scheduler_factor.
Optional
Default: 0.1
lr_scheduler_step The epochs at which to reduce the learning rate. The learning rate is reduced by lr_scheduler_factor at epochs listed in a comma-delimited string: "epoch1, epoch2, ...". For example, if the value is set to "10, 20" and the lr_scheduler_factor is set to 1/2, then the learning rate is halved after the 10th epoch and then halved again after the 20th epoch.
Optional
mini_batch_size The batch size for training.
Optional
Default: 32
momentum The momentum for sgd. Ignored for other optimizers.
Optional
Default: 0.9
nms_threshold The non-maximum suppression threshold.
Optional
Default: 0.45
optimizer The optimizer types. For details on optimizer values, see MXNet's API.
Optional
Default: 'sgd'
overlap_threshold The evaluation overlap threshold.
Optional
Default: 0.5
use_pretrained_model Flag to use a pretrained model for training. If set to 1, the pretrained model with the corresponding architecture is loaded and used for training. Otherwise, the network is trained from scratch.
Optional
Valid values: 0 or 1
Default: 1
weight_decay The weight decay coefficient for sgd and rmsprop. Ignored for other optimizers.
Optional
Default: 0.0005
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
Request Format
Query a trained model by using the model's endpoint. The endpoint takes .jpg and .png image formats
with image/jpeg and image/png content-types.
Response Formats
The response is the class index with a confidence score and bounding box coordinates for all objects
within the image encoded in JSON format. The following is an example of response .json file:
{"prediction":[
[4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507,
0.9345266819000244],
[0.0, 0.73376623392105103, 0.5714187026023865, 0.40427327156066895, 0.827075183391571,
0.9712159633636475],
[4.0, 0.32643985450267792, 0.3677481412887573, 0.034883320331573486, 0.6318609714508057,
0.5967587828636169],
[8.0, 0.22552496790885925, 0.6152569651603699, 0.5722782611846924, 0.882301390171051,
0.8985623121261597],
[3.0, 0.42260299175977707, 0.019305512309074402, 0.08386176824569702,
0.39093565940856934, 0.9574796557426453]
]}
Each row in this .json file contains an array that represents a detected object. Each of these object
arrays consists of a list of six numbers. The first number is the predicted class label. The second
number is the associated confidence score for the detection. The last four numbers represent the
bounding box coordinates [xmin, ymin, xmax, ymax]. These output bounding box corner indices
are normalized by the overall image size. Note that this encoding is different from that used by the
input .json format. For example, in the first entry of the detection result, 0.3088374733924866 is the
left coordinate (x-coordinate of upper-left corner) of the bounding box as a ratio of the overall image
width, 0.07030484080314636 is the top coordinate (y-coordinate of upper-left corner) of the bounding
box as a ratio of the overall image height, 0.7110607028007507 is the right coordinate (x-coordinate of
lower-right corner) of the bounding box as a ratio of the overall image width, and 0.9345266819000244
is the bottom coordinate (y-coordinate of lower-right corner) of the bounding box as a ratio of the
overall image height.
To avoid unreliable detection results, you might want to filter out the detection results with low
confidence scores. In the object detection sample notebook, we provide examples of scripts that use a
threshold to remove low confidence detections and to plot bounding boxes on the original images.
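As an illustration of that kind of post-processing, the following hypothetical helper parses the JSON response and drops detections below a confidence threshold:
import json

def filter_detections(response_body, threshold=0.5):
    """Keep detections whose confidence score is at least threshold.

    Each detection is [class_index, score, xmin, ymin, xmax, ymax], with the
    box corners normalized by the overall image width and height.
    """
    detections = json.loads(response_body)["prediction"]
    return [detection for detection in detections if detection[1] >= threshold]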
For batch transform, the response is in JSON format, where the format is identical to the JSON format
described above. The detection results of each image are represented as a JSON file. For example:
{
"image_size": [
{
"width": 500,
"height": 400,
"depth": 3
}
],
"annotations": [
{
"class_id": 0,
"score": 0.943,
"left": 111,
"top": 134,
"width": 61,
"height": 128
},
{
"class_id": 0,
"score": 0.0013,
"left": 161,
"top": 250,
"width": 79,
"height": 143
},
{
"class_id": 1,
"score": 0.0133,
"left": 101,
"top": 185,
"width": 42,
"height": 130
}
]
}
For more details on training and inference, see the Object Detection Sample Notebooks (p. 1533).
Topics
• How to use the SageMaker Object Detection - TensorFlow algorithm (p. 1542)
• Input and output interface for the Object Detection - TensorFlow algorithm (p. 1543)
• Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm (p. 1544)
• Object Detection - TensorFlow sample notebooks (p. 1544)
• How Object Detection - TensorFlow Works (p. 1544)
• TensorFlow Models (p. 1545)
• Object Detection - TensorFlow Hyperparameters (p. 1546)
• Tune an Object Detection - TensorFlow model (p. 1548)
The Object Detection - TensorFlow algorithm supports transfer learning using any of the compatible
pretrained TensorFlow models. For a list of all available pretrained models, see TensorFlow
Models (p. 1545). Every pretrained model has a unique model_id. The following example uses
ResNet50 (model_id: tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8) to fine-
tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and
stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated
model training artifacts to construct a SageMaker Estimator.
First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the
hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters
and their default values with hyperparameters.retrieve_default. For more information, see
Object Detection - TensorFlow Hyperparameters (p. 1546). Use these values to construct a SageMaker
Estimator.
Note
Default hyperparameter values are different for different models. For example, for larger
models, the default number of epochs is smaller.
This example uses the PennFudanPed dataset, which contains images of pedestrians in the street. We pre-downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call .fit
using the Amazon S3 location of your training dataset.
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-od-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
For more information about how to use the SageMaker Object Detection - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to SageMaker TensorFlow - Object Detection
notebook.
Input and output interface for the Object Detection - TensorFlow algorithm
Each of the pretrained models listed in TensorFlow Models can be fine-tuned to any dataset with
any number of image classes. Be mindful of how to format your training data for input to the Object
Detection - TensorFlow model.
• Training data input format: Your training data should be a directory with an images subdirectory and
an annotations.json file.
The following is an example of an input directory structure. The input directory should be hosted in an
Amazon S3 bucket with a path similar to the following: s3://bucket_name/input_directory/.
Note that the trailing / is required.
input_directory
|--images
|--abc.png
|--def.png
|--annotations.json
The annotations.json file should contain information for bounding boxes and their class labels in the form of a dictionary with "images" and "annotations" keys. The value for the "images" key should be a list of dictionaries. There should be one dictionary for each image with the following information: {"file_name": image_name, "height": height, "width": width, "id": image_id}. The value for the "annotations" key should also be a list of dictionaries. There should be one dictionary for each bounding box with the following information: {"image_id": image_id, "bbox": [xmin, ymin, xmax, ymax], "category_id": bbox_label}. A minimal example follows this list.
After training, a label mapping file and trained model are saved to your Amazon S3 bucket.
Incremental training
You can seed the training of a new model with artifacts from a model that you trained previously with
SageMaker. Incremental training saves training time when you want to train a new model with the same
or similar data.
Note
You can only seed a SageMaker Object Detection - TensorFlow model with another Object
Detection - TensorFlow model trained in SageMaker.
You can use any dataset for incremental training, as long as the set of classes remains the same. The
incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained
model, you start with an existing fine-tuned model. For more information about how to use incremental training with the SageMaker Object Detection - TensorFlow algorithm, see the Introduction to SageMaker TensorFlow - Object Detection notebook.
You can host the fine-tuned model that results from your TensorFlow Object Detection training for
inference. Any input image for inference must be in .jpg, .jpeg, or .png format and be content
type application/x-image. The Object Detection - TensorFlow algorithm resizes input images
automatically.
Running inference results in bounding boxes, predicted classes, and the scores of each prediction, encoded in JSON format. The Object Detection - TensorFlow model processes a single image per request and outputs only one line. To receive this verbose response, set the accept header to application/json;verbose. If accept is set to application/json, then the model only outputs normalized boxes, classes, and scores.
Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm
The Object Detection - TensorFlow algorithm supports all GPU instances for training, including:
• ml.p2.xlarge
• ml.p2.16xlarge
• ml.p3.2xlarge
• ml.p3.16xlarge
We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such
as M5) and GPU (P2 or P3) instances can be used for inference. For a comprehensive list of SageMaker
training and inference instances across AWS Regions, see Amazon SageMaker Pricing.
For more information about how to use the SageMaker Object Detection - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to SageMaker TensorFlow - Object Detection
notebook.
For instructions how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.
The Object Detection - TensorFlow algorithm takes an image as input and predicts bounding boxes and
object labels. Various deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet
are highly accurate for object detection. There are also deep learning networks that are trained on large
image datasets, such as Common Objects in Context (COCO), which has 328,000 images. After a network
is trained with COCO data, you can then fine-tune the network on a dataset with a particular focus to
perform more specific object detection tasks. The Amazon SageMaker Object Detection - TensorFlow
algorithm supports transfer learning on many pretrained models that are available in the TensorFlow
Model Garden.
According to the number of class labels in your training data, an object detection layer is attached to the
pretrained TensorFlow model of your choice. You can then fine-tune either the entire network (including
the pretrained model) or only the top classification layer on new training data. With this method of
transfer learning, training with smaller datasets is possible.
TensorFlow Models
Many pretrained models are available to use for transfer learning with the Object Detection - TensorFlow algorithm. These models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.
Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Object Detection -
TensorFlow algorithm. See Tune an Object Detection - TensorFlow model (p. 1548) for information on
hyperparameter tuning.
beta_1 The beta1 for the "adam" optimizer. Represents the exponential
decay rate for the first moment estimates. Ignored for other
optimizers.
beta_2 The beta2 for the "adam" optimizer. Represents the exponential
decay rate for the second moment estimates. Ignored for other
optimizers.
optimizer The optimizer type. For more information, see Optimizers in the
TensorFlow documentation.
rho The discounting factor for the gradient of the "adadelta" and
"rmsprop" optimizers. Ignored for other optimizers.
train_only_on_top_layer If "True", only the top classification layer parameters are fine-
tuned. If "False", all model parameters are fine-tuned.
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.
For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
The Object Detection - TensorFlow algorithm computes metrics during training. When tuning the model, choose one of these metrics as the objective metric.
For more information about which hyperparameters are used for each optimizer, see Object Detection
- TensorFlow Hyperparameters (p. 1546).
You can tune the following hyperparameters for the Object Detection - TensorFlow algorithm:
train_only_on_top_layer CategoricalParameterRanges: ['True', 'False']
initial_accumulator_value ContinuousParameterRanges: MinValue: 0.0, MaxValue: 0.999
The SageMaker semantic segmentation algorithm tags every pixel in an image with a class label from a predefined set of classes. For comparison, the SageMaker Image Classification - MXNet (p. 1506) is a supervised learning algorithm
that analyzes only whole images, classifying them into one of multiple output categories. The Object
Detection - MXNet (p. 1530) is a supervised learning algorithm that detects and classifies all instances of
an object in an image. It indicates the location and scale of each object in the image with a rectangular
bounding box.
Because the semantic segmentation algorithm classifies every pixel in an image, it also provides
information about the shapes of the objects contained in the image. The segmentation output is
represented as a grayscale image, called a segmentation mask. A segmentation mask is a grayscale image
with the same shape as the input image.
The SageMaker semantic segmentation algorithm is built using the MXNet Gluon framework and
the Gluon CV toolkit. It provides you with a choice of three built-in algorithms to train a deep neural
network. You can use the Fully-Convolutional Network (FCN) algorithm, the Pyramid Scene Parsing (PSP) algorithm, or DeepLabV3. Each of the three algorithms has two distinct components:
• The backbone (or encoder)—A network that produces reliable activation maps of features.
• The decoder—A network that constructs the segmentation mask from the encoded activation maps.
You also have a choice of backbones for the FCN, PSP, and DeepLabV3 algorithms: ResNet50 or
ResNet101. These backbones include pretrained artifacts that were originally trained on the ImageNet
classification task. You can fine-tune these backbones for segmentation using your own data. Or, you
can initialize and train these networks from scratch using only your own data. The decoders are never
pretrained.
To deploy the trained model for inference, use the SageMaker hosting service. During inference, you can
request the segmentation mask either as a PNG image or as a set of probabilities for each class for each
pixel. You can use these masks as part of a larger pipeline that includes additional downstream image
processing or other applications.
Topics
• Semantic Segmentation Sample Notebooks (p. 1550)
• Input/Output Interface for the Semantic Segmentation Algorithm (p. 1550)
• EC2 Instance Recommendation for the Semantic Segmentation Algorithm (p. 1553)
• Semantic Segmentation Hyperparameters (p. 1553)
• Tuning a Semantic Segmentation Model (p. 1558)
For a sample Jupyter notebook that uses the SageMaker semantic segmentation algorithm to train a
model and deploy it to perform inferences, see the Semantic Segmentation Example. For instructions on
how to create and access Jupyter notebook instances that you can use to run the example in SageMaker,
see Amazon SageMaker Notebook Instances (p. 204).
To see a list of all of the SageMaker samples, create and open a notebook instance, and choose
the SageMaker Examples tab. The example semantic segmentation notebooks are located under
Introduction to Amazon algorithms. To open a notebook, choose its Use tab, and choose Create copy.
SageMaker semantic segmentation expects the customer's training dataset to be on Amazon Simple
Storage Service (Amazon S3). Once trained, it produces the resulting model artifacts on Amazon S3. The
input interface format for the SageMaker semantic segmentation is similar to that of most standardized
semantic segmentation benchmarking datasets. The dataset in Amazon S3 is expected to be presented
in two channels, one for train and one for validation, using four directories, two for images and two
for annotations. Annotations are expected to be uncompressed PNG images. The dataset might also have
a label map that describes how the annotation mappings are established. If not, the algorithm uses a
default. It also supports the augmented manifest image format (application/x-image) for training
in Pipe input mode straight from Amazon S3. For inference, an endpoint accepts images with an image/
jpeg content type.
The dataset specifying these files should look similar to the following example:
s3://bucket_name
|
|- train
|
| - 0000.jpg
| - coffee.jpg
|- validation
|
| - 00a0.jpg
| - banana.jpg
|- train_annotation
|
| - 0000.png
| - coffee.png
|- validation_annotation
|
| - 00a0.png
| - banana.png
|- label_map
| - train_label_map.json
| - validation_label_map.json
Every JPG image in the train and validation directories has a corresponding PNG label image with
the same name in the train_annotation and validation_annotation directories. This naming
convention helps the algorithm to associate a label with its corresponding image during training. The
train, train_annotation, validation, and validation_annotation channels are mandatory.
The annotations are single-channel PNG images. The format works as long as the metadata (modes) in
the image helps the algorithm read the annotation images into a single-channel 8-bit unsigned integer.
For more information on our support for modes, see the Python Imaging Library documentation. We recommend using the P mode (8-bit pixels, mapped to other values through a palette).
When these modes are used, each encoded pixel value is a simple 8-bit integer. To map these values to labels, the algorithm uses one mapping file per channel, called the label map. The label map is used to map the values in the image to actual label indices. In the default label map, which is provided by default if you don't provide one, the pixel value in an annotation matrix (image) directly indexes the label. These images can be grayscale PNG files or 8-bit indexed PNG files. The label map file for the unscaled default case is the following:
{
"scale": "1"
}
To provide some contrast for viewing, some annotation software scales the label images by a constant
amount. To support this, the SageMaker semantic segmentation algorithm provides a rescaling option
to scale down the values to actual label values. When scaling down doesn't convert the value to an appropriate integer, the algorithm defaults to the greatest integer less than or equal to the scaled value.
The following code shows how to set the scale value to rescale the label values:
{
"scale": "3"
}
The following example shows how this "scale" value is used to rescale the encoded_label values of
the input annotation image when they are mapped to the mapped_label values to be used in training.
The label values in the input annotation image are 0, 3, 6, with scale 3, so they are mapped to 0, 1, 2 for
training:
encoded_label = [0, 3, 6]
mapped_label = [0, 1, 2]
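The rescaling can be sketched in Python with floor division, which takes the greatest integer less than or equal to the scaled value:
scale = 3
encoded_label = [0, 3, 6]

# Floor division maps values that are not exact multiples of the scale to the
# greatest integer less than or equal to value / scale.
mapped_label = [value // scale for value in encoded_label]  # [0, 1, 2]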
In some cases, you might need to specify a particular color mapping for each class. Use the map option
in the label mapping as shown in the following example of a label_map file:
{
"map": {
"0": 5,
"1": 0,
"2": 2
}
}
encoded_label = [0, 5, 2]
mapped_label = [1, 0, 2]
With label mappings, you can use different annotation systems and annotation software to obtain data
without a lot of preprocessing. You can provide one label map per channel. The files for a label map in
the label_map channel must follow the naming conventions of the four-directory structure. If you
don't provide a label map, the algorithm assumes a scale of 1 (the default).
Each JSON object in the manifest file must contain a source-ref key. The source-ref key
should contain the value of the Amazon S3 URI to the image. The labels are provided under the
AttributeNames parameter value as specified in the CreateTrainingJob request. It can also contain
additional metadata under the metadata tag, but these are ignored by the algorithm. In the example
below, the AttributeNames are contained in the list of image and annotation references ["source-
ref", "city-streets-ref"]. These names must have -ref appended to them. When using the
Semantic Segmentation algorithm with Augmented Manifest, the value of the RecordWrapperType
parameter must be "RecordIO" and value of the ContentType parameter must be application/x-
recordio.
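A single manifest line following these conventions might look like the following sketch; the S3 paths are hypothetical.
{"source-ref": "s3://your_bucket/train/city_street_001.jpg", "city-streets-ref": "s3://your_bucket/train_annotation/city_street_001.png"}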
For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).
Incremental Training
You can also seed the training of a new model with a model that you trained previously using SageMaker.
This incremental training saves training time when you want to train a new model with the same or
similar data. Currently, incremental training is supported only for models trained with the built-in SageMaker semantic segmentation algorithm.
To use your own pre-trained model, specify the ChannelName as "model" in the InputDataConfig for
the CreateTrainingJob request. Set the ContentType for the model channel to application/x-
sagemaker-model. The backbone, algorithm, crop_size, and num_classes input parameters that
define the network architecture must be consistently specified in the input hyperparameters of the new
model and the pre-trained model that you upload to the model channel. For the pretrained model file,
you can use the compressed (.tar.gz) artifacts from SageMaker outputs. You can only use image formats
for input data. For more information on incremental training and for instructions on how to use it, see
Incremental Training in Amazon SageMaker (p. 2113).
Produce Inferences
To query a trained model that is deployed to an endpoint, you need to provide an image and an
AcceptType that denotes the type of output required. The endpoint takes JPEG images with an
image/jpeg content type. If you request an AcceptType of image/png, the algorithm outputs a PNG
file with a segmentation mask in the same format as the labels themselves. If you request an accept
type of application/x-recordio-protobuf, the algorithm returns class probabilities encoded in
recordio-protobuf format. The latter format outputs a 3D tensor where the third dimension is the same
size as the number of classes. This component denotes the probability of each class label for each pixel.
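A sketch of such a request with boto3 follows; the endpoint name and image file are assumptions for illustration.
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("street_scene.jpg", "rb") as f:
    payload = f.read()

# Request the mask as a PNG image. Use Accept="application/x-recordio-protobuf"
# instead to receive per-pixel class probabilities.
response = runtime.invoke_endpoint(
    EndpointName="semantic-segmentation-endpoint",
    ContentType="image/jpeg",
    Accept="image/png",
    Body=payload,
)
mask_png = response["Body"].read()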
For inference, you can use either CPU instances (such as C5 and M5) or GPU instances (such as P3 and G4dn), or both. For information about the instance types that provide varying combinations of CPU, GPU,
memory, and networking capacity for inference, see Amazon SageMaker ML Instance Types.
Data Hyperparameters
num_classes The number of classes to segment.
Required
num_training_samples The number of samples in the training data. The algorithm uses this value to set up the learning rate scheduler.
Required
base_size Defines how images are rescaled before cropping. Images are rescaled such that the long side is set to base_size multiplied by a random number from 0.5 to 2.0, and the short side is computed to preserve the aspect ratio.
Optional
crop_size The image size for input during training. We randomly rescale the input image based on base_size, and then take a random square crop with side length equal to crop_size. The crop_size is automatically rounded up to multiples of 8.
Optional
Training Hyperparameters
early_stopping Whether to use early stopping logic during training.
Optional
early_stopping_min_epochs The minimum number of epochs that must be run before the early stopping logic can be invoked.
Optional
Default value: 5
early_stopping_patience The number of epochs that meet the tolerance for lower performance before the algorithm enforces an early stop.
Optional
Default value: 4
early_stopping_tolerance If the relative improvement of the score of the training job, the mIOU, is smaller than this value, early stopping considers the epoch as not improved. This is used only when early_stopping = True.
Optional
epochs The number of epochs with which to train.
Optional
Default value: 10
gamma1 The decay factor for the moving average of the squared gradient for rmsprop. Used only for rmsprop.
Optional
gamma2 The momentum factor for rmsprop.
Optional
learning_rate The initial learning rate.
Optional
lr_scheduler The shape of the learning rate schedule that controls its decrease over time.
Optional
lr_scheduler_factor The ratio by which to reduce the learning_rate, used in conjunction with lr_scheduler_step.
Optional
lr_scheduler_step A comma delimited list of the epochs after which the learning_rate is reduced (multiplied) by an lr_scheduler_factor. For example, if the value is set to "10, 20", then the learning rate is reduced by lr_scheduler_factor after the 10th epoch and again by this factor after the 20th epoch.
Optional
mini_batch_size The batch size for training. Using a large mini_batch_size usually
results in faster training, but it might cause you to run out of memory.
Memory usage is affected by the values of the mini_batch_size and
image_shape parameters, and the backbone architecture.
Optional
Default value: 16
momentum
    The momentum for the sgd optimizer. When you use other optimizers, the semantic segmentation algorithm ignores this parameter.
    Optional

optimizer
    The type of optimizer.
    Optional

syncbn
    If set to True, the batch normalization mean and variance are computed over all the samples processed across the GPUs.
    Optional

validation_mini_batch_size
    The batch size for validation. A large mini_batch_size usually results in faster training, but it might cause you to run out of memory. Memory usage is affected by the values of the mini_batch_size and image_shape parameters, and the backbone architecture.
    Optional
    Default value: 16
weight_decay
    The weight decay coefficient for the sgd optimizer. When you use other optimizers, the algorithm ignores this parameter.
    Optional
Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the chosen
hyperparameters to find the combination of values that results in the model that optimizes the objective metric.
The semantic segmentation algorithm reports two validation metrics. When tuning hyperparameter
values, choose one of these metrics as the objective.
You can tune the following hyperparameters for the semantic segmentation algorithm.
Use Reinforcement Learning with Amazon SageMaker
The problem of RL is formalized using Markov decision processes (MDPs) that originate from dynamical
systems theory. MDPs aim to capture high-level details of a real problem that a learning agent
encounters over some period of time in attempting to achieve some ultimate goal. The learning agent
should be able to determine the current state of its environment and identify possible actions that
affect the learning agent’s current state. Furthermore, the learning agent’s goals should correlate
strongly to the state of the environment. A solution to a problem formulated in this way is known as a
reinforcement learning method.
In supervised learning, an external supervisor provides a training set of labeled examples. Each example
describes a situation and carries a label identifying the category to which it belongs. The goal of
supervised learning is to generalize in order to predict correctly in situations that are not present in the
training data.
In contrast, RL deals with interactive problems, making it infeasible to gather all possible examples of
situations with correct labels that an agent might encounter. This type of learning is most promising
when an agent is able to accurately learn from its own experience and adjust accordingly.
In unsupervised learning, an agent learns by uncovering structure within unlabeled data. While an RL
agent might benefit from uncovering structure based on its experiences, the sole purpose of RL is to
maximize a reward signal.
Topics
• Why is Reinforcement Learning Important? (p. 1559)
• Markov Decision Process (MDP) (p. 1560)
• Key Features of Amazon SageMaker RL (p. 1560)
• Reinforcement Learning Sample Notebooks (p. 1562)
• Sample RL Workflow Using Amazon SageMaker RL (p. 1562)
• RL Environments in Amazon SageMaker (p. 1563)
• Distributed Training with Amazon SageMaker RL (p. 1565)
• Hyperparameter Tuning with Amazon SageMaker RL (p. 1565)
Environment
Defines the space in which the RL model operates. This can be either a real-world environment or
a simulator. For example, if you train a physical autonomous vehicle on a physical road, that would
be a real-world environment. If you train a computer program that models an autonomous vehicle
driving on a road, that would be a simulator.
State
Specifies all information about the environment and past steps that is relevant to the future. For
example, in an RL model in which a robot can move in any direction at any time step, the position
of the robot at the current time step is the state, because if we know where the robot is, it isn't
necessary to know the steps it took to get there.
Action
What the agent does. For example, the robot takes a step forward.
Reward
A number that represents the value of the state that resulted from the last action that the agent
took. For example, if the goal is for a robot to find treasure, the reward for finding treasure might be
5, and the reward for not finding treasure might be 0. The RL model attempts to find a strategy that
optimizes the cumulative reward over the long term. This strategy is called a policy.
Observation
Information about the state of the environment that is available to the agent at each step. This
might be the entire state, or it might be just a part of the state. For example, the agent in a chess-
playing model would be able to observe the entire state of the board at any step, but a robot in a
maze might only be able to observe a small portion of the maze that it currently occupies.
Typically, training in RL consists of many episodes. An episode consists of all of the time steps in an MDP
from the initial state until the environment reaches the terminal state.
Amazon SageMaker RL uses the following components:
• A deep learning (DL) framework. Currently, SageMaker supports RL in TensorFlow and Apache MXNet.
• An RL toolkit. An RL toolkit manages the interaction between the agent and the environment and
provides a wide selection of state-of-the-art RL algorithms. SageMaker supports the Intel Coach and
Ray RLlib toolkits. For information about Intel Coach, see https://nervanasystems.github.io/coach/.
For information about Ray RLlib, see https://ray.readthedocs.io/en/latest/rllib.html.
• An RL environment. You can use custom environments, open-source environments, or commercial
environments. For information, see RL Environments in Amazon SageMaker (p. 1563).
The following diagram shows the RL components that are supported in SageMaker RL.
How to Train Batch RL Policies?
    This notebook shows how to use batch RL to train a new policy from an offline dataset.

How to Solve the Cart-pole Balancing Problem?
    This notebook shows how to solve the cart-pole balancing problem with RL.

How to Solve the Knapsack Problem?
    This notebook shows how to use RL to solve the knapsack problem, and how SageMaker Managed Spot Training can be used to run training at a lower cost.

How to Solve the Mountain Car Problem?
    This notebook shows how to solve the mountain car control problem with RL.
1. Formulate the RL problem—First, formulate the business problem into an RL problem. For example,
auto scaling enables services to dynamically increase or decrease capacity depending on conditions
that you define. Currently, this requires setting up alarms, scaling policies, thresholds, and other
manual steps. To solve this with RL, we define the components of the Markov Decision Process:
When you use SageMaker for training, you can select GPU or CPU instances. Store the output from
the training job in a local directory if you train in local mode, or on Amazon S3 if you use SageMaker
training.
a. The source directory where the environment, presets, and training code are uploaded.
b. The path to the training script.
c. The RL toolkit and deep learning framework you want to use. This automatically resolves to the
Amazon ECR path for the RL container.
d. The training parameters, such as the instance count, job name, and S3 path for output.
e. Metric definitions that you want to capture in your logs. These can also be visualized in
CloudWatch and in SageMaker notebooks.
6. Visualize training metrics and output—After a training job that uses an RL model completes, you
can view the metrics you defined for the training jobs in CloudWatch. You can also plot the metrics in
a notebook by using the Amazon SageMaker Python SDK analytics library. Visualizing metrics helps
you understand how the performance of the model, as measured by the reward, improves over time.
Note
If you train in local mode, you can't visualize metrics in CloudWatch.
7. Evaluate the model—Checkpointed data from the previously trained models can be passed on
for evaluation and inference in the checkpoint channel. In local mode, use the local directory. In
SageMaker training mode, you need to upload the data to S3 first.
8. Deploy RL models—Finally, deploy the trained model on an endpoint hosted on SageMaker
containers or on an edge device by using AWS IoT Greengrass.
For more information on RL with SageMaker, see Using RL with the SageMaker Python SDK.
The following diagram shows an example of the interactions with a simulator for a car racing game.
The simulation environment consists of an agent and a simulator. Here, a convolutional neural network
(CNN) consumes images from the simulator and generates actions to control the game controller. With
multiple simulations, this environment generates training data of the form (state_t, action, state_t+1,
reward_t+1). Defining the reward is not trivial, and it impacts the RL model quality. We want to
provide a few examples of reward functions, but would like to make them user-configurable.
Topics
• Use OpenAI Gym Interface for Environments in SageMaker RL (p. 1564)
• Use Open-Source Environments (p. 1565)
• Use Commercial Environments (p. 1565)
An environment that follows the OpenAI Gym interface exposes the following elements; a minimal example follows this list.
• env.action_space—Defines the actions the agent can take, specifies whether each action is
continuous or discrete, and specifies the minimum and maximum if the action is continuous.
• env.observation_space—Defines the observations the agent receives from the environment, as
well as minimum and maximum for continuous observations.
• env.reset()—Initializes a training episode. The reset() function returns the initial state of the
environment, and the agent uses the initial state to take its first action. The action is then sent to
step() repeatedly until the episode reaches a terminal state. When step() returns done = True,
the episode ends. The RL toolkit re-initializes the environment by calling reset().
• step()—Takes the agent action as input and outputs the next state of the environment, the reward,
whether the episode has terminated, and an info dictionary to communicate debugging information.
It is the responsibility of the environment to validate the inputs.
• env.render()—Used for environments that have visualization. The RL toolkit calls this function to
capture visualizations of the environment after each call to the step() function.
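To make the interface concrete, the following is a minimal sketch of a custom environment that implements these elements; the corridor task and all names are illustrative and not part of SageMaker.

import numpy as np
import gym
from gym import spaces

class CorridorEnv(gym.Env):
    """Toy environment: the agent tries to walk right to the end of a corridor."""

    def __init__(self, length=10):
        self.length = length
        self.action_space = spaces.Discrete(2)  # 0 = step left, 1 = step right
        self.observation_space = spaces.Box(
            low=0, high=length, shape=(1,), dtype=np.float32)
        self.position = 0

    def reset(self):
        # Initializes an episode and returns the initial state.
        self.position = 0
        return np.array([self.position], dtype=np.float32)

    def step(self, action):
        # The environment is responsible for validating its inputs.
        assert self.action_space.contains(action)
        self.position = max(0, self.position + (1 if action == 1 else -1))
        done = self.position >= self.length
        reward = 1.0 if done else 0.0
        return np.array([self.position], dtype=np.float32), reward, done, {}

    def render(self, mode="human"):
        print(f"position: {self.position}")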
Amazon SageMaker RL supports the following options for distributed training:
• Single training instance and multiple rollout instances of the same instance type. For an example, see
the Neural Network Compression example in the SageMaker examples repository.
• Single trainer instance and multiple rollout instances, where different instance types are used for
training and rollouts. For an example, see the AWS DeepRacer / AWS RoboMaker example in the
SageMaker examples repository.
• Single trainer instance that uses multiple cores for rollout. For an example, see the Roboschool
example in the SageMaker examples repository. This is useful if the simulation environment is light-
weight and can run on a single thread.
• Multiple instances for training and rollouts. For an example, see the Roboschool example in the
SageMaker examples repository.
Run local code as a remote job

To run your local machine learning code as a SageMaker training job, wrap the code inside an @remote decorator, as shown in the following example.

from sagemaker.remote_function import remote

@remote(**settings)
def divide(x, y):
    return x / y
The SageMaker Python SDK will automatically translate your existing workspace environment and any
associated data processing code and datasets into a SageMaker training job that runs on the SageMaker
training platform. You can also activate a persistent cache feature, which will further reduce job start
latency by caching previously downloaded dependency packages. This reduction in job latency is greater
than the reduction in latency from using SageMaker managed warm pools alone. For more information,
see Using persistent cache (p. 2121).
Note
Distributed training jobs are not supported by remote functions.
The following sections show how to annotate your local ML code with an @remote decorator and tailor
your experience for your use case. This includes customizing your environment and integrating with
SageMaker Experiments.
Topics
• Set up your environment (p. 1566)
• Invoking a function (p. 1572)
• Configuration file (p. 1576)
• Customize your runtime environment (p. 1577)
• Container image compatibility (p. 1578)
• Logging parameters and metrics with Amazon SageMaker Experiments (p. 1581)
• Using modular code with the @remote decorator (p. 1584)
• Private repository for runtime dependencies (p. 1585)
• Example notebooks (p. 1586)
The @remote decorator automatically detects the image attached to the SageMaker Studio
notebook and uses it to run the SageMaker training job. If image_uri is specified either as an
argument in the decorator or in the configuration file, then the value specified in image_uri will
be used instead of the detected image.
For more information about how to create a notebook in SageMaker Studio, see the Create a
Notebook from the File Menu section in Create or Open an Amazon SageMaker Studio Notebook.
To annotate your code with the @remote function inside a SageMaker Studio Notebook, you must
have the SageMaker Python SDK installed. Install the SageMaker Python SDK, as shown in the
following code example.
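The install command itself is missing from this extraction; in a Studio notebook cell it is typically the standard pip install, as follows.

import sys
!{sys.executable} -m pip install sagemaker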
To run your local ML code, first create a dependencies file to instruct SageMaker where to locate your
local code. To do so, follow these steps:
a. From the SageMaker Studio Launcher main working area, in Utilities and files, choose Text file.
This opens a new tab with a text file called untitled.txt.
For more information about the SageMaker Studio user interface (UI), see Amazon SageMaker
Studio UI Overview.
b. Rename untitled.txt to requirements.txt.
c. Add all the dependencies required for the code along with the SageMaker library to
requirements.txt.
A minimal requirements.txt for the example divide function follows.
sagemaker
d. Run your code with the remote decorator by passing the dependencies file, as follows.

from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
def divide(x, y):
    return x / y

divide(2, 3.0)
If you’re already running a SageMaker Studio notebook and you install the Python SDK as instructed
in step 2, Install the SageMaker Python SDK, you must restart your kernel. For more information, see
Use the SageMaker Studio Notebook Toolbar in the Amazon SageMaker Developer Guide.
You can annotate your local ML code with an @remote decorator to use inside of a SageMaker training
job. First you must create and customize a SageMaker notebook instance to use a kernel with Python
version 3.7 or higher, up to 3.10.x. To do so, follow these steps:
a. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
b. In the left navigation panel, choose Notebook to expand its options.
c. Choose Notebook Instances from the expanded options.
d. Choose the Create Notebook Instance button. This opens a new page.
e. For Notebook instance name, enter a name with a maximum of 63 characters and no spaces. Valid
characters: A-Z, a-z, 0-9, and .:+=@ _%- (hyphen).
f. In the Notebook instance settings dialog box, expand the right arrow next to Additional
Configuration.
g. Under Lifecycle configuration - optional, expand the down arrow and select Create a new
lifecycle configuration. This opens a new dialog box.
h. Under Name, enter a name for your configuration setting.
i. In the Scripts dialog box, in the Start notebook tab, replace the existing contents of the text box
with the following script.
#!/bin/bash
set -e
j. In the Scripts dialog box, in the Create notebook tab, replace the existing contents of the text box
with the following script.
#!/bin/bash
set -e
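The bodies of both scripts are also missing here. As a rough sketch only (not the official script), an on-create script that provisions the Python 3.10 kernel selected in step 2 might look like the following; the environment name custom_python310 is an assumption chosen so that the kernel appears as conda_custom_python310.

#!/bin/bash
set -e
# Run as the ec2-user that owns the Jupyter process.
sudo -u ec2-user -i <<'EOF'
# Create a conda environment with a Python 3.10 kernel; notebook instances
# surface it in Jupyter as the conda_custom_python310 kernel.
conda create --yes --name custom_python310 python=3.10 ipykernel
EOF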
k. Choose the Create configuration button on the bottom right of the window.
l. Choose the Create notebook instance button on the bottom right of the window.
m.Wait for the notebook instance Status to change from Pending to InService.
2. Create a Jupyter notebook in the notebook instance.
The following instructions show how to create a Jupyter notebook using Python 3.10 in your newly
created SageMaker instance.
a. After the notebook instance Status from the previous step is InService, do the following:
i. Select Open Jupyter under Actions in the row containing your newly created notebook instance
Name. This opens a new Jupyter server.
b. In the Jupyter server, select New from the top right menu.
c. From the down arrow, select conda_custom_python310. This creates a new Jupyter notebook that
uses a Python 3.10 kernel. This new Jupyter notebook can now be used similarly to a local Jupyter
notebook.
3. Install the SageMaker Python SDK.
After your virtual environment is running, install the SageMaker Python SDK by using the following
code example.
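The referenced example is absent here; the standard installation command is:

pip install sagemaker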
When you annotate your local ML code with an @remote decorator inside the SageMaker notebook,
SageMaker training will automatically interpret the function of your code and run it as a SageMaker
training job. Set up your notebook by doing the following:
a. Select the kernel name in the notebook menu from the SageMaker notebook instance that you
created in step 1, Create a SageMaker Notebook instance with a custom kernel.
from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
def divide(x, y):
    return x / y

divide(2, 3.0)
1. Install prerequisites by setting up the AWS Command Line Interface (AWS CLI) and creating a role, as
follows:
• Onboard to a SageMaker domain following the instructions in the AWS CLI Prerequisites section of
Set Up Amazon SageMaker Prerequisites.
• Create an IAM role following the Create execution role section of SageMaker Roles.
2. Create a virtual environment by using either PyCharm or conda and using Python version 3.7 or
higher, up to 3.10.x.
• Set up a virtual environment using PyCharm as follows:
a. Select File from the main menu.
b. Choose New Project.
c. Choose Conda from the down arrow under New environment using.
d. In the field for Python version use the down arrow to select a version of Python that is 3.7 or
above. You can go up to 3.10.x from the list.
• If you have Anaconda installed, you can set up a virtual environment using conda, as follows:
• Open an Anaconda prompt terminal interface.
• Create and activate a new conda environment using a Python version of 3.7 or higher, up to
3.10.x. The following code example shows how to create a conda environment using Python
version 3.10.
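The example itself is missing from this extraction; standard conda commands for this step are the following (the environment name is a placeholder).

conda create --name sagemaker_env python=3.10
conda activate sagemaker_env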
To package your code from your preferred IDE, you must have a virtual environment set up using
Python 3.7 or higher, up to 3.10.x. You also need a compatible container image. Install the SageMaker
Python SDK using the following code example.
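The command is missing here; it is the standard installation:

pip install sagemaker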
4. Wrap your code inside the @remote decorator. The SageMaker Python SDK will automatically
interpret the function of your code and run it as a SageMaker training job. The following code
examples show how to import the necessary libraries, set up a SageMaker session, and annotate a
function with the @remote decorator.
You can run your code by either providing the dependencies needed directly, or by using dependencies
from the active conda environment.
• To provide the dependencies directly, do the following:
• Create a requirements.txt file in the working directory that the code resides in.
• Add all of the dependencies required for the code along with the SageMaker library. The
following section provides a minimal code example for requirements.txt for the example
divide function.
sagemaker
• Run your code with the @remote decorator by passing the dependencies file. In the following
code example, replace The IAM role name with an AWS Identity and Access Management (IAM)
role ARN that you would like SageMaker to use to run your job.
import boto3
import sagemaker
from sagemaker.remote_function import remote

sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))

settings = dict(
    sagemaker_session=sm_session,
    role=<The IAM role name>,
    instance_type="ml.m5.xlarge",
    dependencies='./requirements.txt'
)

@remote(**settings)
def divide(x, y):
    return x / y

if __name__ == "__main__":
    print(divide(2, 3.0))
• To use dependencies from the active conda environment, use the value auto_capture for the
dependencies parameter, as shown in the following.
import boto3
import sagemaker
from sagemaker.remote_function import remote

sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))

settings = dict(
    sagemaker_session=sm_session,
    role=<The IAM role name>,
    instance_type="ml.m5.xlarge",
    dependencies="auto_capture"
)

@remote(**settings)
def divide(x, y):
    return x / y

if __name__ == "__main__":
    print(divide(2, 3.0))
Note
You can also implement the previous code inside a Jupyter notebook. PyCharm
Professional Edition supports Jupyter natively. For more guidance, see Jupyter notebook
support in PyCharm's documentation.
Invoking a function
To invoke a function inside the @remote decorator, use either of the following methods:
If you use the @remote decorator method to invoke a function, the training job will wait for the function
to complete before starting a new task. However, if you use the RemoteExecutor API, you can run more
than one job in parallel. The following sections show both ways of invoking a function.
import numpy as np
from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.large")
def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])

assert (matrix_multiply(a, b) == np.array([1, 2])).all()
def remote(
    *,
    **kwarg):
    ...
When you invoke a decorated function, the SageMaker Python SDK loads any exception raised by the
function into local memory. In the following code example, the first call to the divide function completes
successfully and the result is loaded into local memory. In the second call to the divide function, the code
returns an error and this error is loaded into local memory.
@remote()
def divide(a, b):
    return a / b

# The first call completes successfully; the result is loaded into local memory.
divide(10, 5)

# The second call raises an error; the exception is loaded into local memory.
divide(10, 0)
Note
The decorated function is run as a remote job. If the thread is interrupted, the underlying job
will not be stopped.
In the following code example, a list and a dict are modified inside the decorated function. The local
values do not change when the decorated function is invoked.
a = []

@remote
def func():
    a.append(1)

# func() runs remotely; a in local memory is not modified.
func()
# a stays as []

a = {"key": "value"}

@remote
def func(a):
    # append new values to the input dictionary
    a["key-2"] = "value-2"

func(a)
# a stays as {"key": "value"}
To change the value of a local variable declared inside of a decorator function, return the variable from
the function. The following code example shows that the value of a local variable is changed when it is
returned from the function.
a = {"key-1": "value-1"}
@remote
def func(a):
a["key-2"] = "value-2"
return a
a = func(a)
@remote()
def train(dtrain, params):
    # dtrain is a DMatrix, which cannot be serialized and sent to the remote job
    return xgb.train(params, dtrain)

df = pandas.read_csv("./data.csv")
train_data, test_data = train_test_split(df, test_size=0.3)
dtrain = DMatrix(train_data)
As a recommended practice, DMatrix objects should be loaded during training time instead of using
them as an input data object to the remote function.
To rectify the previous code example, pass the pandas DataFrame or NumPy arrays directly to the train
function, as shown in the following code example.
@remote
def train(df, params):
    dtrain = DMatrix(df)
    return xgb.train(params, dtrain)
For data sets that are too large to fit into memory, use the specialized data loader provided by your
framework in the function. The following code shows an example of the tensorflow data loader.
@remote()
def train(data_path: str, params):
    import tensorflow as tf
    import tensorflow_io as tfio

    dataset = tf.data.TextLineDataset(tf.data.Dataset.list_files(f"{data_path}/*.txt"))
    ...

train("s3://my_bucket/data", {})
The following code example shows how to import the required libraries, define a function, start a
SageMaker instance, and use the API to submit a request to run 2 jobs in parallel.
a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])
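The rest of that example did not survive extraction; a plausible sketch, assuming the RemoteExecutor API from sagemaker.remote_function and the max_parallel_job spelling used in this guide, follows.

import numpy as np
from sagemaker.remote_function import RemoteExecutor

def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])

with RemoteExecutor(max_parallel_job=2, instance_type="ml.m5.large") as e:
    # Each submit call runs matrix_multiply as its own SageMaker training job.
    future_1 = e.submit(matrix_multiply, a, b)
    future_2 = e.submit(matrix_multiply, a, b)

# result() blocks until the corresponding job completes.
print(future_1.result(), future_2.result())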
The following code example shows how to define a function and call it using the RemoteExecutor API.
In this example, the RemoteExecutor submits 4 jobs in total, but only 2 in parallel. The last two jobs
reuse the clusters with minimal overhead.
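Under the same assumptions as the previous sketch, the four-job variant reduces to:

with RemoteExecutor(max_parallel_job=2, instance_type="ml.m5.large") as e:
    # Four jobs are submitted, but at most two run at a time; the last two
    # reuse the warm clusters with minimal overhead.
    futures = [e.submit(matrix_multiply, a, b) for _ in range(4)]
    results = [f.result() for f in futures]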
The max_parallel_job parameter only serves as a rate limiting mechanism without optimizing
compute resource allocation. In the previous code example, RemoteExecutor doesn’t reserve
compute resources for the two parallel jobs before any jobs are submitted. For more information about
max_parallel_job or other parameters for the @remote decorator, see Remote function classes and
methods specification.
Configuration file
The Amazon SageMaker Python SDK supports setting default values for AWS infrastructure primitive
types. After administrators configure these defaults, they are automatically passed when the SageMaker
Python SDK calls supported APIs. The arguments for the decorator function can be put inside
configuration files, so that you can separate infrastructure-related settings from the code base. For more
information about parameters and arguments for the remote function and methods, see Remote
function classes and methods specification.
You can set infrastructure settings for the network configuration, IAM roles, Amazon S3 folder for input,
output data, and tags inside the configuration file. The configuration file can be used when invoking a
function using either the @remote decorator or the RemoteExecutor API.
An example configuration file that defines the dependencies, resources, and other arguments follows.
This example configuration file is used to invoke a function that is initiated either using the @remote
decorator or the RemoteExecutor API.
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: 'path/to/requirements.txt'
        EnableInterContainerTrafficEncryption: true
        EnvironmentVariables: {'EnvVarKey': 'EnvVarValue'}
        ImageUri: '366666666666.dkr.ecr.us-west-2.amazonaws.com/my-image:latest'
        IncludeLocalWorkDir: true
        InstanceType: 'ml.m5.large'
        JobCondaEnvironment: 'your_conda_env'
        PreExecutionCommands:
          - 'command_1'
          - 'command_2'
        PreExecutionScript: 'path/to/script.sh'
        RoleArn: 'arn:aws:iam::366666666666:role/MyRole'
        S3KmsKeyId: 'yourkmskeyid'
        S3RootUri: 's3://my-bucket/my-project'
        VpcConfig:
          SecurityGroupIds:
            - 'sg123'
          Subnets:
            - 'subnet-1234'
        Tags: [{'Key': 'yourTagKey', 'Value': 'yourTagValue'}]
        VolumeKmsKeyId: 'yourkmskeyid'
The @remote decorator and RemoteExecutor look for Dependencies in the following configuration files:
• An admin-defined configuration file
• A user-defined configuration file
The default locations for these configuration files depend on, and are relative to, your environment. The
following code example returns the default location of your admin and user configuration files. These
commands must be run in the same environment where you're using the SageMaker Python SDK.
import os
from platformdirs import site_config_dir, user_config_dir
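The body of that example is missing here; a plausible completion, assuming SageMaker's config.yaml file naming, is:

# Prints the default location of the admin-defined configuration file.
print(os.path.join(site_config_dir("sagemaker"), "config.yaml"))

# Prints the default location of the user-defined configuration file.
print(os.path.join(user_config_dir("sagemaker"), "config.yaml"))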
You can override the default locations of these files by setting the
SAGEMAKER_ADMIN_CONFIG_OVERRIDE and SAGEMAKER_USER_CONFIG_OVERRIDE environment
variables for the admin-defined and user-defined configuration file paths, respectively.
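For example, in a shell (both paths are placeholders):

export SAGEMAKER_ADMIN_CONFIG_OVERRIDE=/path/to/admin/config.yaml
export SAGEMAKER_USER_CONFIG_OVERRIDE=/path/to/user/config.yaml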
If a key exists in both the admin-defined and user-defined configuration files, the value in the user-
defined file will be used.
Customize your runtime environment

Both the @remote decorator and the RemoteExecutor method of invoking a function allow you to
define and customize the runtime environment. You can use either a requirements.txt file or a
conda environment YAML file.
To customize a runtime environment using both a conda environment YAML file and a
requirements.txt file, refer to the following code example.
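That example is not present in this extraction; a minimal sketch (both file paths are assumptions) follows.

from sagemaker.remote_function import remote

# Dependencies from a conda environment YAML file
@remote(instance_type="ml.m5.xlarge", dependencies="./environment.yml")
def cube(x):
    return x ** 3

# Dependencies from a requirements.txt file
@remote(instance_type="ml.m5.xlarge", dependencies="./requirements.txt")
def square(x):
    return x * x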
Alternatively, you can set dependencies to auto_capture to let the SageMaker Python SDK
capture the installed dependencies in the active conda environment. The following are required for
auto_capture to work reliably:
• You must have an active conda environment. We recommend not using the base conda environment
for remote jobs so that you can reduce potential dependency conflicts. Not using the base conda
environment also allows for faster environment setup in the remote job.
• You must not have any dependencies installed using pip with a value for the --extra-index-url
parameter.
• You must not have any dependency conflicts between packages installed with conda and packages
installed with pip in the local development environment.
• Your local development environment must not contain operating system-specific dependencies that
are not compatible with Linux.
If auto_capture does not work, we recommend that you pass in your dependencies as a
requirements.txt or conda environment.yaml file, as described in the first code example in this section.
Container image compatibility

The @remote decorator works with the following prebuilt SageMaker images:

Data Science 2.0 (Python 3.8, py38)
    For SageMaker Studio Notebooks only. The Python SDK automatically selects the image URI when the image is used as a SageMaker Studio Notebook kernel image.

Data Science 3.0 (Python 3.10, py310)
    For SageMaker Studio Notebooks only. The Python SDK automatically selects the image URI when the image is used as a SageMaker Studio Notebook kernel image.

Base Python 2.0 (Python 3.8, py38)
    The Python SDK selects this image when it detects that the development environment is using the Python 3.8 runtime. Otherwise, the Python SDK automatically selects this image when it is used as a SageMaker Studio Notebook kernel image.

Base Python 3.0 (Python 3.10, py310)
    The Python SDK selects this image when it detects that the development environment is using the Python 3.10 runtime. Otherwise, the Python SDK automatically selects this image when it is used as a SageMaker Studio Notebook kernel image.
Note
To run jobs locally using AWS Deep Learning Containers (DLC) images, use the image URIs
found in the DLC documentation. The DLC images do not support the auto_capture value for
dependencies.
You can also run remote functions with your custom images. For compatibility with remote functions,
custom images should be built with Python version 3.7.x-3.10.x. The following is a minimal Dockerfile
example showing you how to use a Docker image with Python 3.10.
FROM python:3.10
To create conda environments in your image and use it to run jobs, set the environment
variable SAGEMAKER_JOB_CONDA_ENV to the conda environment name. If your image has the
SAGEMAKER_JOB_CONDA_ENV value set, the remote function cannot create a new conda environment
during the training job runtime. Refer to the following Dockerfile example that uses a conda
environment with Python version 3.10.
FROM continuumio/miniconda3:4.12.0
ENV SHELL=/bin/bash \
CONDA_DIR=/opt/conda \
SAGEMAKER_JOB_CONDA_ENV=sagemaker-job-env
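The remainder of that Dockerfile is missing from this extraction; a plausible completion that actually creates the named environment (this RUN line is an assumption) is:

# Create the conda environment that SAGEMAKER_JOB_CONDA_ENV points to.
RUN conda create -y -n $SAGEMAKER_JOB_CONDA_ENV python=3.10 && \
    conda clean --all -y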
For SageMaker to use mamba to manage your Python virtual environment in the container image, install
the mamba toolkit from miniforge. To use mamba, add the following code example to your Dockerfile.
Then, SageMaker will detect the mamba availability at runtime and use it instead of conda.
#Mamba Installation
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda" \
    && /opt/conda/bin/conda init bash
Using a custom conda channel on an Amazon S3 bucket is not compatible with mamba when using a
remote function. If you choose to use mamba, make sure you are not using a custom conda channel on
Amazon S3. For more information, see the Prerequisites section under Custom conda repository using
Amazon S3.
The following is a complete Dockerfile example showing how to create a compatible Docker image.
FROM python:3.10

#Install Mamba
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda" \
    && /opt/conda/bin/conda init bash

#cleanup
RUN apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf ${HOME}/.cache/pip \
    && rm Mambaforge-Linux-x86_64.sh

ENV SHELL=/bin/bash \
    PATH=$PATH:/opt/conda/bin
The resulting image from running the previous Dockerfile example can also be used as a SageMaker
Studio kernel image.
Logging parameters and metrics with Amazon SageMaker Experiments

You can log parameters and metrics from a remote function using either the @remote decorator or the
RemoteExecutor API.
To log parameters and metrics from a remote function, choose one of the following methods:
• Instantiate a SageMaker experiment run inside a remote function using Run from the SageMaker
Experiments library. For more information, see Create an Amazon SageMaker Experiment.
• Use the load_run function from the SageMaker Experiments library inside a remote function. This
loads a Run instance that is declared outside of the remote function.
The following sections show how to create and track lineage with SageMaker experiment runs by using
the previously listed methods. The sections also describe cases that are not supported by SageMaker
training.
The following code example imports the name of your experiment, the name of the run, and the
parameters to log during each run. The parameters param_1 and param_2 are logged over time inside
a training loop. Common parameters may include batch size or epochs. In this example, the metrics
metric_a and metric_b are logged for a run over time inside a training loop. Other common metrics
may include accuracy or loss.
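Most of that example did not survive extraction; only the final log_metric call below remains. A plausible reconstruction of the preceding lines, using the parameter and metric names from the description above, is:

from sagemaker.experiments import Run

with Run(experiment_name=experiment_name, run_name=run_name) as run:
    run.log_parameter("param_1", value_1)
    run.log_parameter("param_2", value_2)
    for epoch in range(epochs):
        # ... one pass over the training data ...
        run.log_metric("metric_a", value_a)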
        run.log_metric("metric_b", value_2)
from sagemaker.experiments import Run, load_run
from sagemaker.remote_function import RemoteExecutor

def square(x):
    with load_run() as run:
        result = x * x
        run.log_metric("result", result)
        return result

with RemoteExecutor(
    max_parallel_job=2,
    instance_type="ml.m5.large"
) as e:
    with Run(
        experiment_name="my-exp-name",
        run_name="my-run-name",
    ):
        future_1 = e.submit(square, 2)
The following code example attempts to pass a Run type object to an @remote decorator, and it
generates an error.

@remote
def func(run: Run):
    run.log_metric("metric_a", 1.0)
The following code example attempts to use a global run object instantiated outside of the remote
function. In the code example, the train() function is defined inside the with Run context,
referencing a global run object from within. When train() is called, it generates an error.
with Run(
    experiment_name="my-exp-name",
    run_name="my-run-name",
) as run:
    @remote
    def train(metric_1, value_1, metric_2, value_2):
        run.log_parameter(metric_1, value_1)
        run.log_parameter(metric_2, value_2)

    train("p1", 1.0, "p2", 2.0)  # generates an error
Using modular code with the @remote decorator

@remote(
    include_local_workdir=True,
)
Note
The @remote decorator and parameter must appear in the main file, rather than in any of the
dependent files.
When include_local_workdir is set to True, SageMaker will package all of the Python scripts while
maintaining the directory structure in the process's current directory. It will also make the dependencies
available in the job's working directory.
As an example, consider the following scenario where a Python script to process the MNIST dataset is
divided into a main.py script and a dependent pytorch_mnist.py script. The dependent script is then
called by main.py. In this scenario, the main.py script contains code to import the dependency, as
follows.
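The import statement itself is missing here. Given the directory structure shown later in this section, it would look something like the following; train_model is a hypothetical function name.

# Hypothetical function name; the module path matches the directory structure.
from mnist_impl.pytorch_mnist import train_model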
The main.py file must also contain the @remote decorator, and it must set the
include_local_workdir parameter to True.
To ensure consistent behavior, do the following:
• Put the @remote decorator in a file that resides at the root level directory of the workspace.
• Structure the local modules at the root level.
The following example image shows a directory structure that will result in inconsistent behavior when it
is used to annotate your code with an @remote decorator.
In this example structure, the main.py script that contains the @remote decorator is not located at the
root level directory. Although the code may run successfully as a local Python process when you run
python entrypoint/main.py, the remote job will not run successfully. Therefore, the following
structure is NOT recommended.
.
├── config.yaml
├── entrypoint
│   ├── data
│   └── main.py <----------------- @remote used here
├── mnist_impl
│   ├── __pycache__
│   │   └── pytorch_mnist.cpython-310.pyc
│   └── pytorch_mnist.py <-------- dependency of main.py
└── requirements.txt
Private repository for runtime dependencies

The following sections show you how to access a private Python Package Index (PyPI) repository
managed with AWS CodeArtifact. The sections also show how to access a custom conda channel hosted
on Amazon Simple Storage Service (Amazon S3).
• Your private PyPI repository should already have been created. You can utilize AWS CodeArtifact
to create and manage your private package repositories. To learn more about CodeArtifact, see the
CodeArtifact User Guide.
• Your VPC should have access to your CodeArtifact repository. To allow a connection from your VPC to
your CodeArtifact repository, you must do the following:
• Create VPC endpoints for CodeArtifact.
• Create an Amazon S3 gateway endpoint for your VPC, which allows CodeArtifact to store package
assets.
The following pre-execution command example shows how to configure pip in the SageMaker training
job to point to your CodeArtifact repository. For more information, see Configure and use pip with
CodeArtifact.
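The commands themselves are missing from this extraction; a typical pre-execution command uses the AWS CLI's codeartifact login (domain, owner, and repository values are placeholders):

aws codeartifact login --tool pip --domain my-domain --domain-owner 111122223333 --repository my-repo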
• Your private conda channel must already be set up in your Amazon S3 bucket, and all dependent
packages must be indexed and uploaded to your Amazon S3 bucket. For instructions on how to index
your conda packages, see Creating custom channels.
• Your VPC should have access to the Amazon S3 bucket. For more information, see Endpoints for
Amazon S3.
• The base conda environment in your job image should have boto3 installed. To check your
environment, enter the following in your Anaconda prompt and confirm that boto3 appears in the
resulting list.
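The command referenced here is missing; checking typically amounts to:

conda list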
• Your job image should be installed with conda, not mamba. To check your environment, ensure that
the previous command does not also list mamba.
The following pre-execution commands example shows how to configure conda in the SageMaker
training job to point to your private channel on Amazon S3. The pre-execution commands remove the
defaults channel and add your custom channels to a .condarc conda configuration file.
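The commands are missing from this extraction; a sketch matching the described behavior (the bucket and channel path are placeholders; conda reads s3:// channels when boto3 is installed) is:

conda config --remove channels defaults
conda config --add channels s3://my-bucket/my-custom-channel/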
Example notebooks
You can transform training code in an existing workspace environment, along with any associated data
processing code and datasets, into a SageMaker training job. The following notebooks show you how
to customize your environment and job settings for an image classification problem, the XGBoost
algorithm, and Hugging Face.
The following notebooks provide additional code examples for different ML problem types and
implementations.
• To see code examples to use the @remote decorator for an image classification problem, open the
pytorch_mnist.ipynb notebook. This classification problem recognizes handwritten digits using the
Modified National Institute of Standards and Technology (MNIST) sample dataset.
• To see code examples for using the @remote decorator for the previous image classification problem
with a script, see the Pytorch MNIST sample script, train.py.
• To see how the XGBoost algorithm is implemented with an @remote decorator, open the
xgboost_abalone.ipynb notebook.
• To see how Hugging Face is integrated with an @remote decorator, open the huggingface.ipynb
notebook.
Amazon SageMaker Experiments
Machine learning is an iterative process. You need to experiment with multiple combinations of data,
algorithms, and parameters, all while observing the impact of incremental changes on model accuracy.
Over time, this iterative experimentation can result in thousands of model training runs and model
versions. This makes it hard to track the best performing models and their input configurations. It’s
also difficult to compare active experiments with past experiments to identify opportunities for further
incremental improvements. Use SageMaker Experiments to organize, view, analyze, and compare
iterative ML experimentation to gain comparative insights and track your best performing models.
SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of
your iterations as runs. You can assign, group, and organize these runs into experiments. SageMaker
Experiments is integrated with Amazon SageMaker Studio, providing a visual interface to browse
your active and past experiments, compare runs on key performance metrics, and identify the best
performing models. SageMaker Experiments tracks all of the steps and artifacts that went into creating
a model, and you can quickly revisit the origins of a model when you are troubleshooting issues in
production, or auditing your models for compliance verifications.
Use SageMaker Experiments to view, manage, analyze, and compare both custom experiments that you
programmatically create and experiments automatically created from SageMaker jobs.
Topics
• Create an Amazon SageMaker Experiment (p. 1587)
• View, search, and compare experiment runs (p. 1592)
• SageMaker integrations (p. 1596)
• Example notebooks for Amazon SageMaker Experiments (p. 1598)
• Monitor experiment training metrics with AWS CloudTrail (p. 1599)
• Clean Up Amazon SageMaker Experiment Resources (p. 1600)
• Additional supported SDK (p. 1602)
• Experiments FAQs (p. 1605)
• Search Using the Amazon SageMaker Console and API (p. 1607)
Create an Amazon SageMaker Experiment

You can use the SageMaker Experiments UI in Studio to view your experiments and runs,
create visualizations for analysis, and find the best performing model. You can also integrate SageMaker
Experiments into your SageMaker training script using the SageMaker Python SDK.
Overview
The following components make up the building blocks of an experiment in Amazon SageMaker.
• Experiment: An experiment is a collection of runs. When you initialize a run in your training loop, you
include the name of the experiment that the run belongs to. Experiment names must be unique within
your AWS account.
• Run: A run consists of all the inputs, parameters, configurations, and results for one iteration of
model training. Initialize an experiment run for tracking a training job with Run.init().
Note
We recommend that you initialize a Run object in a Jupyter Notebook, and create the
SageMaker job for your experiment within the context of this Run object initialization. To refer
to this Run object in script mode, use the load_run() method. For examples, see Example
notebooks for Amazon SageMaker Experiments (p. 1598).
Note
The SageMaker Python SDK automatically turns experiment names and run names to
lowercase.
• load_run: To run your experiments in script mode, refer to an initialized Run object with
load_run(). If an experiment for a run exists, load_run returns the experiment context. Generally,
you use load_run with no arguments to track metrics, parameters, and artifacts within a SageMaker
training or processing job script.
# Load run from a local script passing experiment and run names
with load_run(experiment_name=experiment_name, run_name=run_name) as run:
    run.log_parameter("param1", "value1")
• log_parameter: Log parameters for a run, such as batch size or epochs, over time in a training loop
with run.log_parameter(). log_parameter records a single name-value pair in a run. You can
use run.log_parameters() to log multiple parameters. If called multiple times within a run for a
parameter of the same name, log_parameter overwrites any previous value. The name must be a
string and the value must be either a string, integer, or float.
• log_metric: Log metrics for a run, such as accuracy or loss, over time in a training loop with
run.log_metric(). log_metric records a name-value pair where the name is a string and the
value is an integer or float. To declare the frequency of logging over the course of the run, define a
step value. You can then visualize these metrics in the Studio Experiments UI. For more information,
see View, search, and compare experiment runs (p. 1592).
run.log_metric(name="Final_loss", value=finalloss)
• log_artifact: Log any input or output artifacts related to a run with run.log_artifact(). Log
artifacts such as S3 URIs, datasets, models, and more for your experiment to help you keep track of
artifacts across multiple runs. is_output is True by default. To record the artifact as an input artifact
instead of an output artifact, set is_output to False.
• log_file: Log any input or output files related to a run, such as training or test data, and store them
in Amazon S3 with run.log_file(). is_output is True by default. To record the file as an input
artifact instead of an output artifact, set is_output to False.
For more information on initializing a Run object, see Experiments in the SageMaker Python SDK
documentation. For information on visualizing logged experiment data and automatic logging, see View,
search, and compare experiment runs (p. 1592).
class ExperimentCallback(keras.callbacks.Callback):
""" """
Next, train the Keras model in a notebook environment and track it as an experiment.
Note
This example carries out jobs sequentially. To run SageMaker jobs asynchronously, you may need
to increase your resource limit.
# Train locally
model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1,
    callbacks=[ExperimentCallback(run, model, x_test, y_test)],
)
For more code samples and example notebooks, see Example notebooks for Amazon SageMaker
Experiments (p. 1598).
# Make sure that you have the latest version of the SageMaker Python SDK
import os
os.system("pip install -U sagemaker")
For more code samples and example notebooks on using Amazon SageMaker Experiments in SageMaker
script mode, see Track experiments for SageMaker training jobs using script mode (p. 1599).
For more information on script mode, see Use script mode in a supported framework. You can also
define custom metrics in script mode by specifying a name and regular expression for each metric that a
tuning job monitors. See Use a custom algorithm for training for more information.
Select the name of the experiment to view all associated runs. It might take a moment for the list to
refresh and display a new experiment or experiment run. You can click Refresh to update the page. Your
experiment list should look similar to the following:
To view the runs that make up your experiment, select the experiment name. For more information, see
View, search, and compare experiment runs (p. 1592).
If you run a SageMaker job without associating it with an experiment, the resulting runs are unassigned
and can be viewed in the Unassigned runs section of the Studio Experiments UI.
To clean up the resources you created, see Clean Up Amazon SageMaker Experiment Resources (p. 1600).
View, search, and compare experiment runs

You use the experiments browser to display a list of these entities. You can filter the list by entity
name, type, and tags. For an overview of the Studio user interface, see Amazon SageMaker Studio UI
Overview (p. 129).
Topics
• View experiments and runs (p. 1592)
• Compare and analyze runs (p. 1594)
Select the name of the experiment to view all associated runs. You can search experiments by typing
directly into the Search bar or filtering for experiment type. You can also choose which columns to
display in your experiment or run list.
It might take a moment for the list to refresh and display a new experiment or experiment run. You
can click Refresh to update the page. Your experiment list should look similar to the following:
2. In the experiments list, double-click an experiment to display a list of the runs in the experiment.
Note
Experiment runs that are automatically created by SageMaker jobs and containers are
visible in the Experiments Studio UI by default. To hide runs created by SageMaker jobs for
a given experiment, choose the settings icon and toggle Show jobs.
In the Overview pane, choose any of the following headings to see available information about each
run:
1. After navigating to the experiment of your choice, select all the runs that you want to compare. You
must choose more than 1 and less than 20 runs to analyze.
2. Choose Analyze in the upper right-hand corner.
3. Visualize the comparative metrics of multiple experiment runs in a histogram, line chart, scatter
plot, or bar chart. To add a chart, choose Add Chart, select values for your chart axes, and choose
Create.
Log charts
Logging charts and visualizations is available for classification models. You can log a confusion matrix,
receiver operating characteristics, or precision and recall graphs.
Log and visualize metrics with the following Python SDK methods:
• log_confusion_matrix: Records a confusion matrix artifact that you can view in the Charts section
of the Run Overview in Studio.
• log_roc_curve: Records a receiver operating characteristic artifact that you can view in the Charts
section of the Run Overview in Studio.
• log_precision_recall: Records a precision recall graph that you can view in the Charts section of
the Run Overview in Studio.
An automatically logged precision recall record creates a chart similar to the following:
SageMaker integrations
Amazon SageMaker Experiments is integrated with a number of SageMaker features. Certain SageMaker
jobs automatically create experiments. You can view and manage SageMaker Clarify bias reports or
SageMaker Debugger output tensors for specific experiment runs directly in the Studio Experiments UI.
Autopilot
Amazon SageMaker Experiments is integrated with Amazon SageMaker Autopilot. When you perform an
Autopilot job, SageMaker Experiments creates an experiment for that job as well as runs for each of the
different combinations of the available run components, parameters, and artifacts. You can find these
runs in the SageMaker Experiments UI by filtering for the run type Autopilot. For more information, see
Automate model development with Amazon SageMaker Autopilot.
HPO
Amazon SageMaker Experiments is integrated with HPO jobs. An HPO job automatically creates Amazon
SageMaker experiments, runs, and components for each training job that it completes. You can find
these runs in the SageMaker Experiments UI by filtering for the run type HPO. For more information, see
Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model.
Pipelines
Amazon SageMaker Model Building Pipelines is closely integrated with Amazon SageMaker Experiments.
By default, when SageMaker Pipelines creates and executes a pipeline, experiments, runs, and
components are created if they do not already exist. You can find these runs in the SageMaker
Experiments UI by filtering for the run type Pipelines. For more information, see Amazon SageMaker
Experiments Integration.
Clarify

Choose Explanations to see any Clarify explainability reports associated with the experiment run.
You can generate pre-training or post-training bias reports that analyze bias in datasets or model
predictions using labels and bias metrics with SageMaker Clarify. You can also use SageMaker Clarify to
generate explainability reports that document model behavior for global or local data samples. For more
information, see Amazon SageMaker Clarify Bias Detection and Model Explainability.
Debugging
You can debug model training progress with Amazon SageMaker Debugger and view debug output
tensors in the Studio Experiments UI. Choose the name of the run associated with the Debugger report
and choose Debugger.
Then, choose the training job name to view the associated Amazon SageMaker Debugger dashboard.
For more information, see Debug Training Jobs Using Amazon SageMaker Debugger.
Example notebooks for Amazon SageMaker Experiments

• Run a SageMaker Experiment with Pytorch Distributed Data Parallel - MNIST Handwritten Digits
Classification
• Track an experiment while training a Pytorch model with a SageMaker Training Job
• Train a TensorFlow model with a SageMaker training job and track it using SageMaker Experiments
Monitor experiment training metrics with AWS CloudTrail

When you create an experiment run, you can also configure the continuous delivery of CloudTrail
events to an Amazon S3 bucket. Use CloudTrail to monitor all ingested training metrics for an
experiment run, including information such as the metric name, the training step of the recorded
metric, the timestamp, and the metric value. CloudTrail events also include the experiment
run ARN, the ID of the account that created the run, and the resource type, which should be
AWS::SageMaker::ExperimentTrialComponent.
To monitor BatchPutMetrics API calls as CloudTrail events, you must first set up the logging of data
plane API activity in CloudTrail. See Logging data events for trails for more information. For granular
control over which API calls you want to selectively log and pay for, you can filter CloudTrail events
by resource type. Specify AWS::SageMaker::ExperimentTrialComponent as a resource type
to monitor calls to the BatchPutMetrics API. For more information, see DataResource in the AWS
CloudTrail API reference. To learn more about CloudTrail, see the AWS CloudTrail User Guide.
For an in-depth explanation of how Amazon SageMaker works with AWS CloudTrail, see Log Amazon
SageMaker API Calls with AWS CloudTrail (p. 3285).
The following is an example CloudTrail event for a training metric in an experiment run:
{
    ...
    "eventTime": "2022-12-14T21:53:41Z",
    "eventSource": "metrics-sagemaker.amazonaws.com",
    "eventName": "BatchPutMetrics",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "192.0.2.0",
    "userAgent": "aws-cli/2.7.25 Python/3.9.11 Linux/5.4.214-134.408.amzn2int.x86_64 exe/x86_64.amzn.2 prompt/off command/sm-metrics.batch-put-metrics",
    "requestParameters": {
        "trialComponentName": "trial-component-name",
        "metricData": [
            {
                "metricName": "foo",
                "timestamp": 1670366870000,
                "step": 101,
                "value": 0.9
            }
        ]
    },
    ...
    "resources": [
        {
            "accountId": "abcdef01234567890",
            "type": "AWS::SageMaker::ExperimentTrialComponent",
            "ARN": "arn:aws:sagemaker:us-east-1:1234567890abcdef0:experiment-trial-component/trial-component-name"
        }
    ],
    ...
}
Clean Up Amazon SageMaker Experiment Resources

Topics
• Clean Up Using the SageMaker Python SDK (Recommended) (p. 1600)
• Clean Up Using the Python SDK (Boto3) (p. 1601)
• Clean Up Using the Experiments SDK (p. 1601)
Clean Up Using the Python SDK (Boto3)

import boto3
sm = boto3.Session().client('sagemaker')
Define cleanup_boto3

import time

def cleanup_boto3(experiment_name):
    trials = sm.list_trials(ExperimentName=experiment_name)['TrialSummaries']
    print('TrialNames:')
    for trial in trials:
        trial_name = trial['TrialName']
        print(f"\n{trial_name}")

        components_in_trial = sm.list_trial_components(TrialName=trial_name)
        print('\tTrialComponentNames:')
        for component in components_in_trial['TrialComponentSummaries']:
            component_name = component['TrialComponentName']
            print(f"\t{component_name}")
            sm.disassociate_trial_component(TrialComponentName=component_name, TrialName=trial_name)
            try:
                # comment out to keep trial components
                sm.delete_trial_component(TrialComponentName=component_name)
            except:
                # component is associated with another trial
                continue
            # to prevent throttling
            time.sleep(.5)
        sm.delete_trial(TrialName=trial_name)
    sm.delete_experiment(ExperimentName=experiment_name)
    print(f"\nExperiment {experiment_name} deleted")
Call cleanup_boto3

# The experiment name must match an existing experiment in your account.
cleanup_boto3(experiment_name="experiment-name")
Clean Up Using the Experiments SDK

Install the sagemaker-experiments SDK and import the modules that the cleanup function uses:

import sys
!{sys.executable} -m pip install sagemaker-experiments

import time

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

Define cleanup_sme_sdk
def cleanup_sme_sdk(experiment):
    for trial_summary in experiment.list_trials():
        trial = Trial.load(trial_name=trial_summary.trial_name)
        for trial_component_summary in trial.list_trial_components():
            tc = TrialComponent.load(
                trial_component_name=trial_component_summary.trial_component_name)
            trial.remove_trial_component(tc)
            try:
                # comment out to keep trial components
                tc.delete()
            except:
                # tc is associated with another trial
                continue
            # to prevent throttling
            time.sleep(.5)
        trial.delete()
    experiment_name = experiment.experiment_name
    experiment.delete()
    print(f"\nExperiment {experiment_name} deleted")
Call cleanup_sme_sdk
experiment_to_cleanup = Experiment.load(
    # Use experiment name not display name
    experiment_name="experiment-name")

cleanup_sme_sdk(experiment_to_cleanup)
The following section describes how to create a SageMaker Experiment with the SageMaker Experiments SDK. The procedure shows you how to create a SageMaker experiment for a SageMaker training, processing, or transform job. Steps labeled (Studio) describe how to view the experiment in Amazon SageMaker Studio; you don't have to run the experiment in Studio to view it there.
2. (Optional) The Amazon SageMaker Python SDK comes preinstalled in SageMaker Studio. If you plan to run your code outside Studio, install the SageMaker Python SDK.
4. Import modules.
import time
from time import strftime
import sagemaker
role = sagemaker.get_execution_role()
sm_sess = sagemaker.session.Session()
6. Create a SageMaker experiment. The experiment name must be unique in your account.
Note
The tags parameter is optional. You can search for the tag using Studio, the SageMaker
console, and the SDK. Tags can also be applied to trials and trial components.
create_date = strftime("%Y-%m-%d-%H-%M-%S")
demo_experiment = Experiment.create(
    experiment_name = "DEMO-{}".format(create_date),
    description = "Demo experiment",
    tags = [{'Key': 'demo-experiments', 'Value': 'demo1'}])
7. (Studio) To view the experiment in SageMaker Studio, choose Experiments in the left sidebar. After the code runs, the experiment list contains the new experiment. It might take a moment for the list to refresh and display the experiment. The filter on the experiment tag is also displayed, so only experiments that have a matching tag are shown.
8. Create a trial for the experiment. The trial name must be unique in your account.
9. Create a trial component as part of the trial. The trial component is the SageMaker job.
Add the ExperimentConfig parameter to the appropriate method. SageMaker training, processing, and transform jobs support ExperimentConfig.
The following examples are for a training job. The Tags parameter adds a tag to the trial component. ExperimentName isn't specified because the trial was associated with the experiment when the trial was created in an earlier step.
estimator = sagemaker.estimator.Estimator(
    ...,
    sagemaker_session = sm_sess,
    tags = [{'Key': 'demo-jobs', 'Value': 'demo2'}])

estimator.fit(
    ...,
    experiment_config = {
        # "ExperimentName" is omitted; the trial is already associated with the experiment
        "TrialName" : demo_trial.trial_name,
        "TrialComponentDisplayName" : "TrainingJob",
    })
Using Boto3

smclient.create_training_job(
    ...,
    ExperimentConfig = {
        # "ExperimentName" is omitted for the same reason
        "TrialName" : demo_trial.trial_name,
        "TrialComponentDisplayName" : "TrainingJob",
    },
    Tags = [{'Key': 'demo-jobs', 'Value': 'demo2'}])
10. (Studio) In the experiment list, double-click the experiment to display a list of the trials in the experiment. In the Studio UI, trials are referred to as run groups and trial components are referred to as runs.
11. (Studio) To view information about the experiment, trial, and job (trial component), see View, search,
and compare experiment runs (p. 1592).
To clean up the resources you created, see Clean Up Amazon SageMaker Experiment Resources (p. 1600).
Experiments FAQs
Refer to the following FAQ items for answers to commonly asked questions about SageMaker
Experiments.
Q. Why do I see experiments and runs in the Experiments Studio UI that I did not
create using the SageMaker Python SDK?
Experiment runs that are automatically created by SageMaker jobs and containers are visible in the Experiments Studio UI by default, although you can choose to hide runs created by SageMaker jobs for a given experiment. If these jobs are launched without being explicitly associated with an experiment or run, they are created as unassigned runs.

The following training script excerpt from the PyTorch Distributed Data Parallel example logs metrics to an experiment run after each epoch. The test_accuracy computation shown here is an assumption about what the elided code derives from the values returned by test():

...
if rank == 0:
    test_loss, correct, target, pred = test(model, test_loader, device, tracker)
    # Assumed derivation of test_accuracy from `correct`:
    test_accuracy = 100.0 * correct / len(test_loader.dataset)
    logger.info(
        "Test Average loss: {:.4f}, Test Accuracy: {:.0f}%;\n".format(
            test_loss, test_accuracy)
    )
    run.log_metric(name = "train_loss", value = loss.item(), step = epoch)
    run.log_metric(name = "test_loss", value = test_loss, step = epoch)
    run.log_metric(name = "test_accuracy", value = test_accuracy, step = epoch)
...

For more information, see the Run a SageMaker Experiment with PyTorch Distributed Data Parallel - MNIST Handwritten Digits Classification example notebook.
Q. Do I need to pass the experiment run context to the training script when running a SageMaker training job?

A: Yes. You need to load the run context in the training script, along with the SageMaker session information:

import boto3
from sagemaker.session import Session

session = Session(boto3.session.Session(region_name=args.region))
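A minimal sketch of doing this inside a training script, assuming the sagemaker.experiments.load_run helper from the SageMaker Python SDK, might look like the following:

from sagemaker.experiments import load_run

# Attach to the run that launched this training job and log a metric.
with load_run(sagemaker_session=session) as run:
    run.log_metric(name="train_loss", value=0.42, step=1)  # illustrative values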
Search using the console and API

You can use Amazon SageMaker model tracking to do the following:

• Organize, find, and evaluate training jobs using properties, hyperparameters, performance metrics, or any metadata.
• Find the best performing model by reviewing training job and model metrics, such as training loss or validation accuracy.
• Trace a model's lineage to the training job and its related resources, such as the training datasets.

This topic covers searching from the SageMaker console and the SageMaker API.

Topics
• Organize, Find, and Evaluate Training Jobs (Console) (p. 1607)
• Find and Evaluate Training Jobs (API) (p. 1609)
• Verify the Datasets Used by Your Training Jobs (p. 1611)
• Trace Model Lineage (p. 1611)
To find a specific training job, model, or resource, use model tracking to search on keywords assigned to
any searchable items. Searchable items include training jobs, models, hyperparameters, metadata, tags,
and URLs. To refine your tracking results, you can search using multiple criteria.
To choose the best model for deployment, evaluate how all models performed against one or more
metrics. You can use model tracking results to list, sort, and evaluate the performance of the models in
your experiments.
Topics
• Use Tags to Track Training Jobs (Console) (p. 1608)
• Find Training Jobs (Console) (p. 1608)
• Evaluate Models (Console) (p. 1609)
4. To add another tag, choose Add tag, and add another key-value pair.

a. In the search box, enter a parameter and choose a parameter type, for example TrainingJobName.
b. Choose a conditional operation. For numeric values, use operators such as is equal to, is less than, or is greater than. For text-based values, use operators such as equals or contains.
c. Enter a value for the parameter.
4. (Optional) To refine your search, add additional search criteria. Choose Add row and enter the
parameter values.
5. Choose Search.
3. Open the preferences window by choosing the settings icon in the search results table.
4. To show or hide a hyperparameter or metric, turn it on or off under Hyperparameter or Metric.
5. Make necessary changes, then choose Update view.
6. After viewing metrics and important hyperparameters, you can compare and contrast the result.
Then, you can choose the best model to host or investigate the models that are performing poorly.
Topics
• Find Training Jobs (API) (p. 1610)
• Evaluate Models (API) (p. 1609)
• Get Suggestions for a Search (API) (p. 1610)
The following example shows how to use the Search API to find training jobs.
import boto3

search_params = {
    "MaxResults": 10,
    "Resource": "TrainingJob",
    "SearchExpression": {
        "Filters": [{
            "Name": "Tags.Project",
            "Operator": "Equals",
            "Value": "Project_Binary_Classifier"
        }]},
    "SortBy": "Metrics.train:binary_classification_accuracy",
    "SortOrder": "Descending"
}

smclient = boto3.client(service_name='sagemaker')
results = smclient.search(**search_params)
The following example shows how to evaluate models and to display the results in a table.
import pandas

headers = ["Training Job Name", "Training Job Status", "Batch Size", "Binary Classification Accuracy"]
rows = []
for result in results['Results']:
    trainingJob = result['TrainingJob']
    metrics = trainingJob['FinalMetricDataList']
    accuracy = metrics[[x['MetricName'] for x in metrics]
                       .index('train:binary_classification_accuracy')]['Value']
    rows.append([trainingJob['TrainingJobName'],
                 trainingJob['TrainingJobStatus'],
                 trainingJob['HyperParameters']['mini_batch_size'],
                 accuracy])

df = pandas.DataFrame(data=rows, columns=headers)
The following example for the AWS SDK for Python (Boto3) is a get_search_suggestions request for items containing linear:

search_suggestion_params = {
    "Resource": "TrainingJob",
    "SuggestionQuery": {
        "PropertyNameQuery": {
            "PropertyNameHint": "linear"
        }
    }
}

suggestions = smclient.get_search_suggestions(**search_suggestion_params)

The response contains property name suggestions similar to the following:

{
    'PropertyNameSuggestions': [{'PropertyName': 'hyperparameters.linear_init_method'},
                                {'PropertyName': 'hyperparameters.linear_init_value'},
                                {'PropertyName': 'hyperparameters.linear_init_sigma'},
                                {'PropertyName': 'hyperparameters.linear_lr'},
                                {'PropertyName': 'hyperparameters.linear_wd'}]
}
After getting search suggestions, you can use one of the property names in a search.
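For example, a follow-up search that uses one of the suggested property names might look like the following sketch; the Exists operator matches items that define the property:

search_params = {
    "MaxResults": 10,
    "Resource": "TrainingJob",
    "SearchExpression": {
        "Filters": [{
            "Name": "hyperparameters.linear_lr",  # property name from the suggestions above
            "Operator": "Exists"
        }]}
}
results = smclient.search(**search_params)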
To check whether a specific dataset was used in a training job, you search for the URL to its location in
Amazon Simple Storage Service (Amazon S3). Model tracking capability returns the training jobs that
used the dataset that you specify. If your search doesn't return the dataset (the result is empty), the
dataset wasn't used in a training job. An empty result confirms, for example, that a holdout dataset
wasn't used.
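A sketch of such a check, using the smclient from the earlier examples and an illustrative dataset URL, might look like the following:

# Search for training jobs whose input channels reference the dataset URL.
search_params = {
    "MaxResults": 10,
    "Resource": "TrainingJob",
    "SearchExpression": {
        "Filters": [{
            "Name": "InputDataConfig.DataSource.S3DataSource.S3Uri",
            "Operator": "Contains",
            "Value": "s3://your-bucket/train/"   # illustrative dataset location
        }]}
}
results = smclient.search(**search_params)
# An empty results['Results'] list means the dataset wasn't used by any training job.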
Topics
• Trace Model Lineage (Console) (p. 1611)
• Trace Model Lineage (API) (p. 1611)
The following example shows how to trace a model's lineage using the API.
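A minimal sketch of such a trace, assuming smclient from the earlier examples and an illustrative model artifact path, might look like the following:

# Find the training job that produced a given model artifact.
model_data_url = "s3://your-bucket/model-output/model.tar.gz"  # illustrative path

search_params = {
    "MaxResults": 1,
    "Resource": "TrainingJob",
    "SearchExpression": {
        "Filters": [{
            "Name": "ModelArtifacts.S3ModelArtifacts",
            "Operator": "Equals",
            "Value": model_data_url
        }]}
}
results = smclient.search(**search_params)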
After finding the training job, you can review the resources used to train the model.
Automatic Model Tuning

Amazon SageMaker automatic model tuning (AMT), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify, and then chooses the hyperparameter values that result in the model that performs best, as measured by a metric that you choose.

For example, suppose that you want to solve a binary classification problem on a marketing dataset. Your goal is to maximize the area under the curve (AUC) metric of the algorithm by training an XGBoost Algorithm (p. 1369) model. You want to find the values for the eta, alpha, min_child_weight, and max_depth hyperparameters that train the best model. Specify a range of values for these hyperparameters. Then, SageMaker hyperparameter tuning searches within these ranges to find a combination of values that creates a training job that produces a model with the highest AUC. To conserve resources or meet a specific model quality expectation, you can also set up completion criteria to stop tuning after the criteria have been met.
You can use SageMaker AMT with built-in algorithms, custom algorithms, or SageMaker pre-built
containers for machine learning frameworks.
SageMaker AMT can use an Amazon EC2 Spot instance to optimize costs when running training jobs. For
more information, see Managed Spot Training in Amazon SageMaker (p. 2117).
Before you start using hyperparameter tuning, you should have a well-defined machine learning
problem, including the following:
• A dataset
• An understanding of the type of algorithm that you need to train
• A clear understanding of how you measure success
Prepare your dataset and algorithm so that they work in SageMaker and successfully run a training job
at least once. For information about setting up and running a training job, see Get Started with Amazon
SageMaker (p. 35).
Topics
• How Hyperparameter Tuning Works (p. 1613)
• Define metrics and environment variables (p. 1615)
• Define Hyperparameter Ranges (p. 1617)
• Track and set completion criteria for your tuning job (p. 1620)
• Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model (p. 1623)
• Example: Hyperparameter Tuning Job (p. 1628)
• Stop Training Jobs Early (p. 1640)
• Run a Warm Start Hyperparameter Tuning Job (p. 1641)
• Resource Limits for Automatic Model Tuning (p. 1645)
• Best Practices for Hyperparameter Tuning (p. 1647)
How Hyperparameter Tuning Works

Use the API reference guide to understand how to interact with hyperparameter tuning. The examples on this page use the HyperParameterTuningJobConfig and HyperbandStrategyConfig APIs.
Note
Because the algorithm itself is stochastic, it’s possible that the hyperparameter tuning model
will fail to converge on the best answer. This can occur even if the best possible combination of
values is within the ranges that you choose.
Grid Search

When using grid search, hyperparameter tuning chooses combinations of values from the range of categorical values that you specify when you create the job. Only categorical parameters are supported when using the grid search strategy. You do not need to specify MaxNumberOfTrainingJobs: the number of training jobs created by the tuning job is calculated automatically as the total number of distinct categorical combinations possible. If specified, the value of MaxNumberOfTrainingJobs should equal the total number of distinct categorical combinations possible.
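As a sketch, a grid search tuning configuration might look like the following; the metric name and parameter values are illustrative:

tuning_job_config = {
    "Strategy": "Grid",
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:auc",
        "Type": "Maximize"
    },
    # MaxNumberOfTrainingJobs is omitted: the four categorical values below
    # imply exactly four training jobs.
    "ResourceLimits": {"MaxParallelTrainingJobs": 2},
    "ParameterRanges": {
        "CategoricalParameterRanges": [
            {"Name": "tree_method", "Values": ["auto", "exact", "approx", "hist"]}
        ]
    }
}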
Random Search
When using random search, hyperparameter tuning chooses a random combination of values from
within the ranges that you specify for hyperparameters for each training job it launches. Because the
choice of hyperparameter values doesn't depend on the results of previous training jobs, you can run the
maximum number of concurrent training jobs without affecting the performance of the tuning.
For an example notebook that uses random search, see the Random search and hyperparameter scaling
with SageMaker XGBoost and Automatic Model Tuning notebook.
Bayesian Optimization
Bayesian optimization treats hyperparameter tuning like a regression problem. Given a set of input
features (the hyperparameters), hyperparameter tuning optimizes a model for the metric that
you choose. To solve a regression problem, hyperparameter tuning makes guesses about which
hyperparameter combinations are likely to get the best results, and runs training jobs to test these
values. After testing a set of hyperparameter values, hyperparameter tuning uses regression to choose
the next set of hyperparameter values to test.
When choosing the best hyperparameters for the next training job, hyperparameter tuning
considers everything that it knows about this problem so far. Sometimes it chooses a combination
of hyperparameter values close to the combination that resulted in the best previous training job to
incrementally improve performance. This allows hyperparameter tuning to exploit the best known
results. Other times, it chooses a set of hyperparameter values far removed from those it has tried. This
allows it to explore the range of hyperparameter values to try to find new areas that are not yet well
understood. The explore/exploit trade-off is common in many machine learning problems.
For more information about Bayesian optimization, see the following papers:

• A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
• Practical Bayesian Optimization of Machine Learning Algorithms
• Taking the Human Out of the Loop: A Review of Bayesian Optimization
Hyperband
Hyperband is a multi-fidelity based tuning strategy that dynamically reallocates resources. Hyperband
uses both intermediate and final results of training jobs to re-allocate epochs to well-utilized
hyperparameter configurations and automatically stops those that underperform. It also seamlessly
scales to using many parallel training jobs. These features can significantly speed up hyperparameter
tuning over random search and Bayesian optimization strategies.
Hyperband should only be used to tune iterative algorithms that publish results at different resource
levels. For example, Hyperband can be used to tune a neural network for image classification which
publishes accuracy metrics after every epoch.
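As a sketch, a Hyperband tuning configuration might look like the following; the values are illustrative, and MinResource and MaxResource typically correspond to epochs:

tuning_job_config = {
    "Strategy": "Hyperband",
    "StrategyConfig": {
        "HyperbandStrategyConfig": {
            "MinResource": 1,
            "MaxResource": 10
        }
    },
    # Hyperband applies its own internal early stopping, so the
    # TrainingJobEarlyStoppingType flag is set to OFF.
    "TrainingJobEarlyStoppingType": "OFF",
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:accuracy",
        "Type": "Maximize"
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 50,
        "MaxParallelTrainingJobs": 5
    }
}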
Define metrics and environment variables
Define metrics
Amazon SageMaker hyperparameter tuning parses your machine learning algorithm's stdout and
stderr streams to find metrics, such as loss or validation-accuracy. The metrics show how well the
model is performing on the dataset.
The following sections describe how to use two types of algorithms for training: built-in and custom.
For the objective metric for the tuning job, choose one of the metrics that the built-in algorithm emits.
For a list of available metrics, see the model tuning section for the appropriate algorithm in Use Amazon
SageMaker Built-in Algorithms or Pre-trained Models.
You can choose up to 40 metrics to monitor in your tuning job. Select one of those metrics to be the
objective metric. The hyperparameter tuning job returns the training job that performed the best
against the objective metric.
Note
Hyperparameter tuning automatically sends an additional hyperparameter
_tuning_objective_metric to pass your objective metric to the tuning job for use during
training.
You can define custom metrics by specifying a name and regular expression for each metric that your tuning job monitors. Then, pass these metric definitions to the CreateHyperParameterTuningJob operation.
The following shows sample output similar to what a training algorithm writes to stderr or stdout:

GAN_loss=0.138318; disc-combined=0.059141; disc_train_loss=1.374587; Loss = 16.020744;

The following code example shows how to use regular expressions in Python (regex) to search the sample log output and capture the numeric values of four different metrics.
[
    {
        "Name": "ganloss",
        "Regex": "GAN_loss=(.*?);"
    },
    {
        "Name": "disc-combined",
        "Regex": "disc-combined=(.*?);"
    },
    {
        "Name": "discloss",
        "Regex": "disc_train_loss=(.*?);"
    },
    {
        "Name": "loss",
        "Regex": "Loss = (.*?);"
    }
]
In regular expressions, parentheses () are used to group parts of the regular expression together.

• For the loss metric that is defined in the code example, the expression (.*?); captures any character between the exact text "Loss = " and the first semicolon (;) character.
• The character . instructs the regular expression to match any character.
• The character * means to match zero or more characters.
• The characters *? together mean that the match captures only until the first instance of the ; character.
The loss metric defined in the code sample will capture the value 16.020744 from the sample output.
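As a quick check of that definition, the following snippet applies the regular expression to the sample log line shown earlier:

import re

sample_log = "GAN_loss=0.138318; disc-combined=0.059141; disc_train_loss=1.374587; Loss = 16.020744;"
match = re.search(r"Loss = (.*?);", sample_log)
print(match.group(1))  # prints: 16.020744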
Choose one of the metrics that you define as the objective metric for the tuning job. If you are using
the SageMaker API, specify the value of the name key in the HyperParameterTuningJobObjective
field of the HyperParameterTuningJobConfig parameter that you send to the
CreateHyperParameterTuningJob operation.
If you want to use an environment variable from your tuning job or specify a new environment variable, enter string key-value pairs in the Environment field of the SageMaker HyperParameterTrainingJobDefinition. Pass this training job definition to the CreateHyperParameterTuningJob API.
For example, the environment variable SM_LOG_LEVEL can be set to the following values to tailor the
output from a Python container.
NOTSET=0
DEBUG=10
INFO=20
WARN=30
ERROR=40
CRITICAL=50
As an example, to set the log level to 10 to debug your container logs, set the environment variable inside the HyperParameterTrainingJobDefinition, as follows. Note that Environment is a map of string keys to string values:

{
    "HyperParameterTuningJobConfig": {
        ...
    },
    "TrainingJobDefinition": {
        ...,
        "Environment": {
            "SM_LOG_LEVEL": "10"
        },
        ...
    },
    ...
}
Define Hyperparameter Ranges

Choosing hyperparameters and ranges significantly affects the performance of your tuning job. Hyperparameter tuning finds the best hyperparameter values for your model by searching over a range of values that you specify for each tunable hyperparameter. You can also specify up to 100 static hyperparameters that do not change over the course of the tuning job. You can use up to 100 hyperparameters in total (static + tunable). For guidance on choosing hyperparameters and ranges, see Best Practices for Hyperparameter Tuning (p. 1647). You can also use autotune to find optimal tuning job settings. For more information, see the following Autotune section.
Note
SageMaker Automatic Model Tuning (AMT) may add additional hyperparameters that contribute to the limit of 100 total hyperparameters. Currently, to pass your objective metric to the tuning job for use during training, SageMaker adds _tuning_objective_metric automatically.
Static hyperparameters

Use static hyperparameters for values that you don't want the tuning job to change. For example, you can use AMT to tune your model using param1 (a tunable parameter) and param2 (a static parameter). If you do, then use a search space for param1 that lies between two values, and pass param2 as a static hyperparameter, as follows.

param1: ["range_min","range_max"]
param2: "static_value"
"StaticHyperParameters": {
1617
Amazon SageMaker Developer Guide
Define Hyperparameter Ranges
"objective" : "reg:squarederror",
"dropout_rate": "0.3"
}
You can use the Amazon SageMaker API to specify key value pairs in the StaticHyperParameters
field of the HyperParameterTrainingJobDefinition parameter that you pass to the
CreateHyperParameterTuningJob operation.
Dynamic hyperparameters
You can use the SageMaker API to define hyperparameter ranges. Specify the names of hyperparameters
and ranges of values in the ParameterRanges field of the HyperParameterTuningJobConfig
parameter that you pass to the CreateHyperParameterTuningJob operation.
The ParameterRanges field has three subfields: categorical, integer, and continuous. You can define up
to 30 total (categorical + integer + continuous) tunable hyperparameters to search over.
Note
Each categorical hyperparameter can have at most 30 different values.
"ParameterRanges": {
"CategoricalParameterRanges": [
{
"Name": "tree_method",
"Values": ["auto", "exact", "approx", "hist"]
}
],
"ContinuousParameterRanges": [
{
"Name": "eta",
"MaxValue" : "0.5",
"MinValue": "0",
"ScalingType": "Auto"
}
],
"IntegerParameterRanges": [
{
"Name": "max_depth",
"MaxValue": "10",
"MinValue": "1",
"ScalingType": "Auto"
}
]
}
If you create a tuning job with a Grid strategy, you can only specify categorical values. You don't need to provide MaxNumberOfTrainingJobs; this value is inferred from the total number of configurations that can be produced from your categorical parameters. If specified, the value of MaxNumberOfTrainingJobs should be equal to the total number of distinct categorical combinations possible.
Autotune

To save the time and resources spent searching for hyperparameter ranges, resource limits, or objective metrics, autotune can automatically guess optimal values for some hyperparameter tuning fields. Use autotune to find optimal values for the following fields:
• ParameterRanges – The names and ranges of hyperparameters that a tuning job can optimize.
• ResourceLimits – The maximum resources to be used in a tuning job. These resources can include the
maximum number of training jobs, maximum runtime of a tuning job, and the maximum number of
training jobs that can be run at the same time.
• TrainingJobEarlyStoppingType – A flag that stops a training job if a job is not significantly improving
against an objective metric. Defaults to enabled. For more information, see Stop Training Jobs
Early (p. 1640).
• RetryStrategy – The number of times to retry a training job. Non-zero values for RetryStrategy can
increase the likelihood that your job will complete successfully.
• Strategy – Specifies how hyperparameter tuning chooses the combinations of hyperparameter values
to use for the training job that it launches.
• ConvergenceDetected – A flag to indicate that Automatic Model Tuning (AMT) has detected model
convergence.
To use autotune, do the following:

1. Specify the hyperparameter and an example value in the AutoParameters field of the ParameterRanges API.
2. Enable autotune.

AMT will determine if your hyperparameters and example values are eligible for autotune. Hyperparameters that can be used in autotune are automatically assigned to the appropriate parameter range type. Then, AMT uses ValueHint to select an optimal range for you. You can use the DescribeHyperParameterTuningJob API to view these ranges.
The following example shows you how to configure a tuning job that uses autotune. In the configuration
example, the hyperparameter max_depth has ValueHint containing an example value of 4.
config = {
    'Autotune': {'Mode': 'Enabled'},
    'HyperParameterTuningJobName': 'my-autotune-job',
    'HyperParameterTuningJobConfig': {
        'HyperParameterTuningJobObjective': {'Type': 'Minimize', 'MetricName': 'validation:rmse'},
        'ResourceLimits': {'MaxNumberOfTrainingJobs': 5, 'MaxParallelTrainingJobs': 1},
        'ParameterRanges': {
            'AutoParameters': [
                {'Name': 'max_depth', 'ValueHint': '4'}
            ]
        }
    },
    'TrainingJobDefinition': {
        ...
    }
}
Continuing the previous example, a tuning job is created after the previous configuration is included in a call to the CreateHyperParameterTuningJob API. Then, autotune converts the max_depth hyperparameter in AutoParameters to the hyperparameter IntegerParameterRanges. The following response from a DescribeHyperParameterTuningJob API call shows that the optimal IntegerParameterRanges for max_depth are between 2 and 8.
{
    'HyperParameterTuningJobName': 'my_job',
    'HyperParameterTuningJobConfig': {
        'ParameterRanges': {
            'IntegerParameterRanges': [
                {'Name': 'max_depth', 'MinValue': '2', 'MaxValue': '8'}
            ]
        }
    },
    'TrainingJobDefinition': {
        ...
    },
    'Autotune': {'Mode': 'Enabled'}
}
For integer and continuous hyperparameter ranges, you can choose the scale that hyperparameter tuning uses to search the range by setting ScalingType to one of the following values:

Auto
SageMaker hyperparameter tuning chooses the best scale for the hyperparameter.
Linear
Hyperparameter tuning searches the values in the hyperparameter range by using a linear scale.
Typically, you choose this if the range of all values from the lowest to the highest is relatively small
(within one order of magnitude). Uniformly searching values from the range provides a reasonable
exploration of the entire range.
Logarithmic
Hyperparameter tuning searches the values in the hyperparameter range by using a logarithmic
scale.
Logarithmic scaling works only for ranges that have values greater than 0.
Choose logarithmic scaling when you're searching a range that spans several orders of magnitude.
For example, suppose you're tuning a linear learner model (see Tune a linear learner model (p. 1345)) and you specify a range of values between .0001 and 1.0 for the learning_rate hyperparameter. In that case, consider the following:
Searching uniformly on a logarithmic scale gives you a better sample of the entire range than
searching on a linear scale would. This is because searching on a linear scale would, on average,
devote 90 percent of your training budget to only the values between .1 and 1.0. As a result, that
leaves only 10 percent of your training budget for the values between .0001 and .1.
ReverseLogarithmic
Hyperparameter tuning searches the values in the hyperparameter range by using a reverse
logarithmic scale. Reverse logarithmic scaling is supported only for continuous hyperparameter
ranges. It is not supported for integer hyperparameter ranges.
Choose reverse logarithmic scaling when you are searching a range that is highly sensitive to small
changes that are very close to 1.
Reverse logarithmic scaling works only for ranges that are entirely within the range 0<=x<1.0.
For an example notebook that uses hyperparameter scaling, see these Amazon SageMaker
hyperparameter examples on GitHub.
Track and set completion criteria for your tuning job
You can set completion criteria for your tuning job, such as a maximum number of training jobs that don't improve when evaluated against the objective metric. You can also track the progress of your tuning job and decide to let it continue or to stop it manually. This guide shows you how to set completion criteria, check the progress of your tuning job, and stop it manually.

While your tuning job runs, AMT does the following:

• Check your training jobs for completion and update statistics accordingly.
• Decide what combination of hyperparameters to evaluate next.
AMT will continuously check the training jobs that were launched from your tuning job to update
statistics. These statistics include tuning job runtime and best training job. Then, AMT determines
whether it should stop the job according to your completion criteria. You can also check these statistics
and stop your job manually. For more information about stopping a job manually, see the Stopping your
tuning job manually (p. 1623) section.
As an example, if your tuning job meets your objective, you can stop tuning early to conserve resources
or ensure model quality. AMT checks your job performance against your completion criteria and stops
the tuning job if any have been met.
• Use MaxNumberOfTrainingJobs in the ResourceLimits API to set an upper limit for the number of
training jobs that can be run before your tuning job is stopped. Start with a large number and adjust it
based on model performance against your tuning job objective. Most users input values of around 50
or more training jobs to find an optimal hyperparameter configuration. Users looking for higher levels
of model performance will use 200 or more training jobs.
• Use MaxNumberOfTrainingJobsNotImproving in the BestObjectiveNotImproving API field to stop training if model performance fails to improve after a specified number of jobs. Model performance is evaluated against an objective function. After the MaxNumberOfTrainingJobsNotImproving is met, AMT will stop the tuning job. Tuning jobs tend to make the most progress in the beginning of the job. Improving model performance against an objective function will require a larger number of training jobs towards the end of tuning. Select a value for MaxNumberOfTrainingJobsNotImproving by checking the performance of similar training jobs against your objective metric.
• Use MaxRuntimeInSeconds in the ResourceLimits API to set an upper limit for the amount of wall
clock time that the tuning job may take. Use this field to meet a deadline by which the tuning job must
complete or to limit compute resources.
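Putting these options together, a sketch of the relevant parts of a HyperParameterTuningJobConfig might look like the following; the values are illustrative:

"HyperParameterTuningJobConfig": {
    ...,
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 50,
        "MaxParallelTrainingJobs": 5,
        "MaxRuntimeInSeconds": 14400
    },
    "TuningJobCompletionCriteria": {
        "BestObjectiveNotImproving": {
            "MaxNumberOfTrainingJobsNotImproving": 10
        }
    },
    ...
}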
To get a rough estimate of total compute time in seconds for a tuning job, multiply the expected number of training jobs by the expected runtime of each training job. You can track the progress of a tuning job with the following fields from the DescribeHyperParameterTuningJob API:
• BestTrainingJob – An object that describes the best training job obtained so far, evaluated against your
objective metric. Use this field to check your current model performance and the value of the objective
metric of this best training job.
• ObjectiveStatusCounters – An object that specifies the total number of training jobs completed in a
tuning job. To estimate average duration of a tuning job, use ObjectiveStatusCounters and the
total runtime of a tuning job. You can use the average duration to estimate how much longer your
tuning job will run.
• ConsumedResources – The total resources, such as RunTimeInSeconds, consumed by your tuning job. Compare ConsumedResources, found in the DescribeHyperParameterTuningJob API, against BestTrainingJob in the same API, or against the resource limits that you set for the job.
Use the tuning job completion criteria to assess how likely your tuning job is to improve your model performance, as evaluated against the best objective metric, if it runs to completion. To stop the tuning job manually, use the StopHyperParameterTuningJob API and provide the name of the tuning job to be stopped.
Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model

When you create a hyperparameter tuning job, you specify the following:

• The job settings to configure, including warm starting, early stopping, and the tuning strategy. Warm starting and early stopping are available only when tuning a single algorithm.
• The training job definition to specify the name, algorithm source, objective metric, and the range of values, when required, to configure the set of hyperparameter values for each training job. It configures the channels for data inputs, data output locations, and any checkpoint storage locations for each training job. The definition also configures the resources to deploy for each training job, including instance types and counts, managed spot training, and stopping conditions.
• The tuning job resources to deploy, including the maximum number of training jobs that a hyperparameter tuning job can run concurrently and the maximum total number of training jobs that it can run.
Get Started

You can create a new hyperparameter tuning job, clone a job, or add or edit tags for a job from the console. You can also use the search feature to find jobs by their name, creation time, or status. Alternatively, you can create hyperparameter tuning jobs with the SageMaker API.

• In the console: To create a new job, open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/, choose Hyperparameter tuning jobs from the Training menu, and then choose Create hyperparameter tuning job. Then follow the configuration steps to create a training job for each algorithm that you want to use. These steps are documented in the Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console) (p. 1624) topic.
Note
When you start the configuration steps, note that the warm start and early stopping features
are not available to use with multi-algorithm HPO. If you want to use these features, you can
only tune a single algorithm at a time.
• With the API: For instructions on using the SageMaker API to create a hyperparameter tuning job, see Example: Hyperparameter Tuning Job. When you call CreateHyperParameterTuningJob to tune multiple algorithms, you must provide a list of training definitions using TrainingJobDefinitions instead of specifying a single TrainingJobDefinition. You must provide job settings that apply to all of the algorithms to be tested and a training definition for each of these algorithms. You must also specify the resources you want to use for the tuning job. Choose just one of these definition types depending on the number of algorithms being tuned, as shown in the sketch following this list.
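As a sketch, a multi-algorithm request might look like the following; the definition variables are hypothetical and would each be built like the training job definitions shown in Example: Hyperparameter Tuning Job:

smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName = "multi-algo-tuning-job",
    HyperParameterTuningJobConfig = tuning_job_config,
    TrainingJobDefinitions = [xgboost_definition, linear_learner_definition])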
Topics
• Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console) (p. 1624)
• Manage Hyperparameter Tuning and Training Jobs (p. 1627)
Topics
• Define job settings (p. 1624)
• Create Training Job Definitions (p. 1625)
• Configure Tuning Job Resources (p. 1627)
• Review and Create HPO Tuning Job (p. 1627)
Warm Start
If you cloned this job, you can choose to use the results from a previous tuning job to improve the
performance of this new tuning job. This is the warm start feature and it is only available when tuning
a single algorithm. When you choose this option, you can choose up to five previous hyperparameter
tuning jobs to use. Alternatively, you can use transfer learning to add additional data to the parent
tuning job. When you select this option, you choose one previous tuning job as the parent.
Note
Warm start is compatible only with tuning jobs created after October 1, 2018. For more
information, see Run a warm start job.
Early Stopping
To reduce compute time and avoid overfitting your model, training jobs can be stopped early when they are unlikely to improve the current best objective metric of the hyperparameter tuning job. Like warm start, this feature is only available when tuning a single algorithm. This is an automatic feature without configuration options, and it's disabled by default. For more information on how early stopping works, the algorithms that support it, and how to use it with your own algorithms, see Stop Training Jobs Early.
Tuning Strategy
The tuning strategy can be random, Bayesian, or Hyperband. These selections specify how automatic tuning algorithms search over the specified hyperparameter ranges (selected in a later step). Random search chooses random combinations of values from the specified ranges and can be run sequentially or in parallel. Bayesian optimization chooses values based on what is likely to get the best result given what is known about the history of previous selections. Hyperband uses a multi-fidelity strategy that dynamically allocates resources towards well-utilized jobs and automatically stops those that underperform. The new configuration that starts after stopping other configurations is chosen randomly. Hyperband can only be used with iterative algorithms. For more information about search strategies, see How Hyperparameter Tuning Works.
Note
Hyperband uses an advanced internal mechanism to apply early stopping. Thus, the parameter
TrainingJobEarlyStoppingType in the HyperParameterTuningJobConfig API must be
set to OFF when using Hyperband's internal early stopping feature.
Tags
You enter tags as key-value pairs to assign metadata to tuning jobs to help you manage them. Values
are not required. You can use just the key. To see the keys associated with a job, choose the Tags tab
on the details page for tuning job. For more information about using tags for tuning jobs, see Manage
Hyperparameter Tuning and Training Jobs (p. 1627)
Topics
• Configure algorithm and parameters (p. 1625)
• Define Data Input and Output (p. 1626)
• Configure Training Job Resources (p. 1627)
• Add or Clone a Training Job (p. 1627)
Each training job definition for a tuning job requires a name, permission to access services, and the
specification of algorithm options, an objective metric, and the range of values, when required, to
configure the set of hyperparameter values for each training job.
Name

Enter a unique name for the training job definition.

Permissions
Amazon SageMaker requires permissions to call other services on your behalf. Choose an IAM role or let
AWS create a role that has the AmazonSageMakerFullAccess IAM policy attached.
The network isolation setting prevents the container from making any outbound network calls. This is
required for AWS Marketplace machine learning offerings.
Algorithm Options
You can choose one of the built-in algorithms, your own algorithm, your own container with an
algorithm, or you can subscribe to an algorithm from AWS Marketplace.
• If you choose a built-in algorithm, it has the ECR image information pre-populated.
• If you choose your own container, you must specify the ECR image information. You can select the input mode for the algorithm as file or pipe. If you plan to supply your data using a .CSV file from Amazon S3, select the file input mode.
Metrics
When you choose a built-in algorithm, metrics are provided for you. If you choose your own algorithm,
you need to define your metrics. You can define up to 20 metrics for your tuning job to monitor, one
of which must be chosen as the objective metric. For more information on how to define a metric for a
tuning job, see Define metrics (p. 1615).
Objective Metric
To find the best training job, set an objective metric and whether to maximize or minimize it. After the
training job is complete, you can view the tuning job detail page for a summary of the best training job
found using this objective metric.
Hyperparameter Configuration
When you choose a built-in algorithm, the default values for its hyperparameters are set for you, using
ranges that are optimized for the algorithm being tuned. You can change these values as you see fit. For
example, instead of a range, you can set a fixed value for a hyperparameter by setting the parameter’s
type to static. Each algorithm has different required and optional parameters. For more information,
see Best Practices for Hyperparameter Tuning and Define Hyperparameter Ranges.
Each training job definition for a tuning job must configure the channels for data inputs, data output locations, and optionally any checkpoint storage locations for each training job.
Input data is defined by channels, each with their own source location (Amazon S3 or Amazon Elastic
File System), compression, and format options. You can define up to 20 channels of input sources. If the
algorithm you chose supports multiple input channels, you can specify those too. For example, when
using the XGBoost churn prediction notebook, you could add two channels: train and validation.
Checkpoint Configuration
Checkpoints are periodically generated during training. You must choose an Amazon S3 location for
the checkpoints to be saved. Checkpoints are used in metrics reporting, and are also used to resume
managed spot training jobs. For more information, see Use Checkpoints in Amazon SageMaker (p. 2142).
You must define an Amazon S3 location for the artifacts of the training job to be stored. You have the
option of adding encryption to the output using an AWS Key Management Service (AWS KMS) key.
Each training job definition for a tuning job must configure the resources to deploy, including instance types and counts, managed spot training, and stopping conditions.
Resource Configuration
Each training definition can have a different resource configuration. You choose the instance type and
number of nodes.
You can save compute costs for jobs if you have flexibility in start and end times by allowing SageMaker to use spare capacity to run jobs. For more information, see Managed Spot Training in Amazon SageMaker (p. 2117).
Stopping condition
The stopping condition specifies the maximum duration allowed per training job.
Once you have created a training job definition for a tuning job, you are returned to the Training Job Definition(s) panel, where you can create additional training job definitions to train additional algorithms. You can select Add training job definition and work through the steps to define a training job again, or choose Clone from the Action menu to replicate an existing training job definition and then edit it for the new algorithm. The clone option can save time because it copies all of the job's settings, including the data channels and S3 storage locations. For more information on cloning, see Manage Hyperparameter Tuning and Training Jobs (p. 1627).
Configure Tuning Job Resources

You can specify the maximum number of training jobs that a hyperparameter tuning job can run concurrently (10 at most) and the maximum number of training jobs that the hyperparameter tuning job can run in total (500 at most). The number of parallel jobs should not exceed the number of nodes you have requested across all of your training definitions. The total number of jobs can't exceed the number of jobs that your definitions are expected to run.
Manage Hyperparameter Tuning and Training Jobs
To see the training jobs run as part of a tuning job, select one of the hyperparameter tuning jobs from the list. The tabs on the tuning job page allow you to inspect the training jobs, their definitions, the tags and configuration used for the tuning job, and the best training job found during tuning. You can select the best training job or any of the other training jobs that belong to the tuning job to see all of their settings. From there, you can create a model that uses the hyperparameter values found by a training job by selecting Create model, or you can clone the training job by selecting Clone.
Cloning

You can save time by cloning a training job that belongs to a hyperparameter tuning job. Cloning copies all of the job's settings, including data channels and S3 storage locations for output artifacts. You can do this for training jobs you have already run from the tuning job page, as just described, or when you are creating additional training job definitions while creating a hyperparameter tuning job, as described in the Add or Clone a Training Job (p. 1627) step of that procedure.
Tagging

Automatic Model Tuning launches multiple training jobs within a single parent tuning job to discover the ideal combination of model hyperparameters. Tags can be added to the parent tuning job as described in the Define job settings (p. 1624) section, and these tags are then propagated to the individual training jobs underneath. Customers can use these tags for purposes such as cost allocation or access control. To add tags using the SageMaker SDK, use the AddTags API. For more information about using tagging for AWS resources, see Tagging AWS resources.
Example: Hyperparameter Tuning Job

This example uses the low-level SDK for Python (Boto3) to configure and launch the hyperparameter tuning job, and the AWS Management Console to monitor the status of hyperparameter tuning jobs. You can also use the high-level Amazon SageMaker Python SDK to configure, run, monitor, and analyze hyperparameter tuning jobs. For more information, see https://github.com/aws/sagemaker-python-sdk.
Prerequisites
To run the code in this example, you need to complete the steps in the following topics.
Topics
• Create a Notebook Instance (p. 1629)
• Get the Amazon SageMaker Boto 3 Client (p. 1629)
• Get the SageMaker Execution Role (p. 1629)
• Specify a S3 Bucket to Upload Training Datasets and Store Output Data (p. 1630)
• Download, Prepare, and Upload Training Data (p. 1630)
• Configure and Launch a Hyperparameter Tuning Job (p. 1631)
• Clean up (p. 1639)
Next Step
Get the Amazon SageMaker Boto 3 Client (p. 1629)
import sagemaker
import boto3
region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')
The preceding code cell defines region and smclient objects that you will use to call the built-in XGBoost algorithm and set up the SageMaker hyperparameter tuning job.
Next Step
Get the SageMaker Execution Role (p. 1629)
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
Next Step
Specify a S3 Bucket to Upload Training Datasets and Store Output Data (p. 1630)
Use the following code to specify the default S3 bucket allocated for your SageMaker session. prefix is
the path within the bucket where SageMaker stores the data for the current training job.
sess = sagemaker.Session()
bucket = sess.default_bucket() # Set a default S3 bucket
prefix = 'DEMO-automatic-model-tuning-xgboost-dm'
If you want to use a specific S3 bucket, use the following code and replace the strings to the exact name
of the S3 bucket. The name of the bucket must contain sagemaker, and be globally unique. The bucket
must be in the same AWS Region as the notebook instance that you use for this example.
bucket = "sagemaker-your-preferred-s3-bucket"
sess = sagemaker.Session(
default_bucket = bucket
)
Note
The name of the bucket doesn't need to contain sagemaker if the IAM role that you use to run
the hyperparameter tuning job has a policy that gives the S3FullAccess permission.
Next Step
Download, Prepare, and Upload Training Data (p. 1630)
For more information about the dataset and the data transformation that the example performs, see the
hpo_xgboost_direct_marketing_sagemaker_APIs notebook in the Hyperparameter Tuning section of the
SageMaker Examples tab in your notebook instance.
import os

import pandas as pd

!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)  # Make sure we can see all of the columns
pd.set_option('display.max_rows', 5)       # Keep the output on one page
data

# The data transformation steps that produce train.csv and validation.csv are
# elided here; see the example notebook referenced above for the full code.
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
Next Step
Configure and Launch a Hyperparameter Tuning Job (p. 1631)
Topics
• Settings for the hyperparameter tuning job (p. 1632)
• Configure the training jobs (p. 1633)
• Name and launch the hyperparameter tuning job (p. 1635)
• Monitor the Progress of a Hyperparameter Tuning Job (p. 1636)
• View the Status of the Training Jobs (p. 1638)
Settings for the hyperparameter tuning job
Note
If you use your own algorithm for hyperparameter tuning, rather than a SageMaker built-
in algorithm, you must define metrics for your algorithm. For more information, see Define
metrics (p. 1615).
The following code example shows how to configure a hyperparameter tuning job using the
built-in XGBoost algorithm. The code example shows how to define ranges for the eta, alpha,
min_child_weight, and max_depth hyperparameters. For more information about these and other
hyperparameters see XGBoost Parameters.
In this code example, the objective metric for the hyperparameter tuning job finds the hyperparameter
configuration that maximizes validation:auc. SageMaker built-in algorithms automatically write the
objective metric to CloudWatch Logs. The following code example also shows how to set a RandomSeed.
tuning_job_config = {
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [
            {
                "MaxValue": "1",
                "MinValue": "0",
                "Name": "eta"
            },
            {
                "MaxValue": "2",
                "MinValue": "0",
                "Name": "alpha"
            },
            {
                "MaxValue": "10",
                "MinValue": "1",
                "Name": "min_child_weight"
            }
        ],
        "IntegerParameterRanges": [
            {
                "MaxValue": "10",
                "MinValue": "1",
                "Name": "max_depth"
            }
        ]
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:auc",
        "Type": "Maximize"
    },
    "RandomSeed": 123
}
To configure the training jobs, define a JSON object and pass it as the value of the
TrainingJobDefinition parameter inside CreateHyperParameterTuningJob.
• AlgorithmSpecification – The registry path of the Docker image containing the training
algorithm and related metadata. To specify an algorithm, you can use your own custom built
algorithm inside a Docker container or a SageMaker built-in algorithm (required).
• InputDataConfig – The input configuration, including the ChannelName, ContentType, and data source for your training and test data (required).
• OutputDataConfig – The storage location for the algorithm's output. Specify the S3 bucket where you want to store the output of the training jobs.
• RoleArn – The Amazon Resource Name (ARN) of an AWS Identity and Access Management (IAM) role
that SageMaker uses to perform tasks. Tasks include reading input data, downloading a Docker image,
writing model artifacts to an S3 bucket, writing logs to Amazon CloudWatch Logs, and writing metrics
to Amazon CloudWatch (required).
• StoppingCondition – The maximum runtime in seconds that a training job can run before being
stopped. This value should be greater than the time needed to train your model (required).
• MetricDefinitions – The name and regular expression that defines any metrics that the training
jobs emit. Define metrics only when you use a custom training algorithm. The example in the following
code uses a built-in algorithm, which already has metrics defined. For information about defining
metrics (optional), see Define metrics (p. 1615).
• TrainingImage – The Docker container image that specifies the training algorithm (optional).
• StaticHyperParameters – The name and values of hyperparameters that are not tuned in the
tuning job (optional).
The following code example sets static values for the eval_metric, num_round, objective,
rate_drop, and tweedie_variance_power parameters of the XGBoost Algorithm (p. 1369) built-in
algorithm.
training_job_definition = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train
                }
            }
        },
        {
            "ChannelName": "validation",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation
                }
            }
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
        "eval_metric": "auc",
        "num_round": "100",
        "objective": "binary:logistic",
        "rate_drop": "0.3",
        "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 43200
    }
}
Name and launch the hyperparameter tuning job

Give the hyperparameter tuning job a name, then launch it by calling the CreateHyperParameterTuningJob API with the tuning job configuration and training job definition that you created.
tuning_job_name = "MyTuningJob"
smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName = tuning_job_name,
    HyperParameterTuningJobConfig = tuning_job_config,
    TrainingJobDefinition = training_job_definition)
Topics
• View the Status of the Hyperparameter Tuning Job (p. 1636)
View the Status of the Hyperparameter Tuning Job

To view the status of the hyperparameter tuning job
1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.
2. Choose Hyperparameter tuning jobs from the Training menu.
3. In the list of hyperparameter tuning jobs, check the status of the hyperparameter tuning job you launched. A tuning job can be InProgress, Completed, Failed, Stopping, or Stopped.
To view the status of the training jobs that the hyperparameter tuning job launched
1. In the list of hyperparameter tuning jobs, choose the job that you launched.
2. Choose Training jobs.
3. View the status of each training job. To see more details about a job, choose it in the list of training
jobs. To view a summary of the status of all of the training jobs that the hyperparameter tuning job
launched, see Training job status counter.
Note
Hyperparameter tuning jobs can be stopped and the underlying resources deleted, but the
jobs themselves cannot be deleted.
To deploy the best training job as a model that you can host at a SageMaker endpoint, choose Create
model.
Next Step
Clean up
To avoid incurring unnecessary charges, when you are done with the example, use the AWS Management
Console to delete the resources that you created for it.
Note
If you plan to explore other examples, you might want to keep some of these resources, such as
your notebook instance, S3 bucket, and IAM role.
1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/ and delete resources that you created for this example, such as the endpoint and the notebook instance.
2. Open the Amazon S3 console at https://console.aws.amazon.com/s3/ and delete the bucket that you created to store model artifacts and the training dataset.
3. Open the IAM console at https://console.aws.amazon.com/iam/ and delete the IAM role. If you created permission policies, you can delete them, too.
4. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/ and delete all of the log groups that have names starting with /aws/sagemaker/.

Stop Training Jobs Early

Stopping training jobs early can reduce compute time and help you avoid overfitting your model. To configure a hyperparameter tuning job to stop training jobs early, do one of the following (a short code sketch follows the list):
• If you are using the AWS SDK for Python (Boto3), set the TrainingJobEarlyStoppingType field of
the HyperParameterTuningJobConfig object that you use to configure the tuning job to AUTO.
• If you are using the Amazon SageMaker Python SDK, set the early_stopping_type parameter of
the HyperParameterTuner object to Auto.
• In the Amazon SageMaker console, in the Create hyperparameter tuning job workflow, under Early
stopping, choose Auto.
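For example, with the SageMaker Python SDK, early stopping is a single constructor argument. The following is a minimal sketch that assumes an existing estimator, objective metric name, and hyperparameter ranges:

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name='validation:accuracy',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=2,
    early_stopping_type='Auto'  # turn on early stopping
)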
For a sample notebook that demonstrates how to use early stopping, see hpo_image_classification_early_stopping.ipynb in the amazon-sagemaker-examples GitHub repository (https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/image_classification_early_stopping/hpo_image_classification_early_stopping.ipynb), or open the hpo_image_classification_early_stopping.ipynb notebook in the Hyperparameter Tuning section of the SageMaker Examples in a notebook instance. For information about using sample notebooks in a notebook instance, see Example Notebooks (p. 220).
Early stopping works as follows:

• After each epoch of training, get the value of the objective metric.
• Compute the running average of the objective metric for all previous training jobs up to the same epoch, and then compute the median of all of the running averages.
• If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, SageMaker stops the current training job.
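The following sketch illustrates that median-stopping logic in plain Python. It is an illustration of the rule, not SageMaker's internal implementation:

import statistics

def should_stop(current_value, prior_jobs_metrics, epoch, minimizing=True):
    # Running average of the objective metric for each previous
    # training job, up to the same epoch.
    running_averages = [
        statistics.mean(metrics[:epoch + 1]) for metrics in prior_jobs_metrics
    ]
    median = statistics.median(running_averages)
    # Stop when the current job is doing worse than the median.
    return current_value > median if minimizing else current_value < median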
Note
This list of built-in algorithms that support early stopping is current as of December 13, 2018.
Other built-in algorithms might support early stopping in the future. If an algorithm emits a
metric that can be used as an objective metric for a hyperparameter tuning job (preferably a
validation metric), then it supports early stopping.
To use early stopping with your own algorithm, you must write your algorithm so that it emits the value of the objective metric after each epoch. The following list shows how you can do that in different frameworks:
TensorFlow
Use TensorFlow estimators, which emit the values of objective metrics to logs.
Chainer
Extend Chainer by using the extensions.Evaluator class. For information, see the chainer.training.extensions.Evaluator API.
PyTorch and Spark
There is no high-level support. You must explicitly write your training code so that it computes objective metrics and writes them to logs after each epoch.
Run a Warm Start Hyperparameter Tuning Job

Warm start tuning uses one or more previous hyperparameter tuning jobs as a starting point for a new tuning job. Typical reasons to use warm start tuning include the following:

• To gradually increase the number of training jobs over several tuning jobs based on results after each iteration.
• To tune a model using new data that you received.
• To change hyperparameter ranges that you used in a previous tuning job, change static hyperparameters to tunable, or change tunable hyperparameters to static values.
• A previous hyperparameter tuning job was stopped early or stopped unexpectedly.
Topics
• Types of Warm Start Tuning Jobs (p. 1642)
• Warm Start Tuning Restrictions (p. 1642)
• Warm Start Tuning Sample Notebook (p. 1643)
• Create a Warm Start Tuning Job (p. 1643)
Types of Warm Start Tuning Jobs

A warm start tuning job can be one of two types:

IDENTICAL_DATA_AND_ALGORITHM
The new hyperparameter tuning job uses the same input data and training image as the parent
tuning jobs. You can change the hyperparameter ranges to search and the maximum number of
training jobs that the hyperparameter tuning job launches. You can also change hyperparameters
from tunable to static, and from static to tunable, but the total number of static plus tunable hyperparameters must remain the same as in all parent jobs. You cannot use a new version of the training algorithm unless the changes in the new version do not affect the algorithm itself. For example, changes that improve logging or that add support for a different data format are allowed.
Use identical data and algorithm when you use the same training data as you used in a previous
hyperparameter tuning job, but you want to increase the total number of training jobs or change
ranges or values of hyperparameters.
When you run a warm start tuning job of type IDENTICAL_DATA_AND_ALGORITHM, there
is an additional field in the response to DescribeHyperParameterTuningJob named
OverallBestTrainingJob. The value of this field is the TrainingJobSummary for the training job
with the best objective metric value of all training jobs launched by this tuning job and all parent
jobs specified for the warm start tuning job.
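For example, you can read that field with Boto3. The following is a minimal sketch that assumes an smclient as in the earlier examples and a warm start tuning job named MyWarmStartTuningJob:

response = smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='MyWarmStartTuningJob'
)
# Best training job across this tuning job and all of its parent jobs
best = response['OverallBestTrainingJob']
print(best['TrainingJobName'])
print(best['FinalHyperParameterTuningJobObjectiveMetric'])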
TRANSFER_LEARNING
The new hyperparameter tuning job can include input data, hyperparameter ranges, maximum number of concurrent training jobs, and maximum number of training jobs that are different from those of its parent hyperparameter tuning jobs. You can also change hyperparameters from tunable to static, and from static to tunable, but the total number of static plus tunable hyperparameters must remain the same as in all parent jobs. The training algorithm image can also be a different version from the version used in the parent hyperparameter tuning job. When you use transfer learning, changes in the dataset or the algorithm that significantly affect the value of the objective metric might reduce the usefulness of warm start tuning.
Warm Start Tuning Restrictions

The following restrictions apply to warm start tuning:

• A tuning job can have a maximum of 5 parent jobs, and all parent jobs must be in a terminal state
(Completed, Stopped, or Failed) before you start the new tuning job.
• The objective metric used in the new tuning job must be the same as the objective metric used in the
parent jobs.
• The total number of static plus tunable hyperparameters must remain the same between parent
jobs and the new tuning job. Because of this, if you think you might want to use a hyperparameter
as tunable in a future warm start tuning job, you should add it as a static hyperparameter when you
create a tuning job.
• The type of each hyperparameter (continuous, integer, categorical) must not change between parent
jobs and the new tuning job.
• The total number of changes from tunable hyperparameters in the parent jobs to static hyperparameters in the new tuning job, plus the number of changes in the values of static hyperparameters, cannot be more than 10. For example, if the parent job has a tunable categorical hyperparameter with the possible values red and blue, and you change that hyperparameter to static in the new tuning job, that counts as 2 changes against the allowed total of 10. If the same hyperparameter had a static value of red in the parent job, and you change the static value to blue in the new tuning job, it also counts as 2 changes.
• Warm start tuning is not recursive. For example, if you create MyTuningJob3 as a warm start tuning job with MyTuningJob2 as a parent job, and MyTuningJob2 is itself a warm start tuning job with a parent job MyTuningJob1, the information that was learned when running MyTuningJob1 is not used for MyTuningJob3. If you want to use the information from MyTuningJob1, you must explicitly add it as a parent for MyTuningJob3.
• The training jobs launched by every parent job in a warm start tuning job count against the 500
maximum training jobs for a tuning job.
• Hyperparameter tuning jobs created before October 1, 2018 cannot be used as parent jobs for warm
start tuning jobs.
Topics
• Create a Warm Start Tuning Job (Low-level SageMaker API for Python (Boto 3)) (p. 1643)
• Create a Warm Start Tuning Job (SageMaker Python SDK) (p. 1644)
Create a Warm Start Tuning Job (Low-level SageMaker API for Python (Boto 3))
To use warm start tuning, you specify the values of a HyperParameterTuningJobWarmStartConfig
object, and pass that as the WarmStartConfig field in a call to CreateHyperParameterTuningJob.
warm_start_config = {
    "ParentHyperParameterTuningJobs": [
        {"HyperParameterTuningJobName": 'MyParentTuningJob'}
    ],
    "WarmStartType": "IdenticalDataAndAlgorithm"
}

smclient = boto3.Session().client('sagemaker')

smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='MyWarmStartTuningJob',
    HyperParameterTuningJobConfig=tuning_job_config,      # See notebook for tuning configuration
    TrainingJobDefinition=training_job_definition,        # See notebook for job definition
    WarmStartConfig=warm_start_config
)
Create a Warm Start Tuning Job (SageMaker Python SDK)

To use warm start tuning with the SageMaker Python SDK, you:

• Specify the parent jobs and the warm start type by using a WarmStartConfig object.
• Pass the WarmStartConfig object as the value of the warm_start_config argument of a HyperparameterTuner object.
• Call the fit method of the HyperparameterTuner object.
For more information about using the Amazon SageMaker Python SDK for hyperparameter tuning, see
https://github.com/aws/sagemaker-python-sdk#sagemaker-automatic-model-tuning.
This example uses an estimator that uses the Image Classification - MXNet (p. 1506) algorithm for
training. The following code sets the hyperparameter ranges that the warm start tuning job searches
within to find the best combination of values. For information about setting hyperparameter ranges, see
Define Hyperparameter Ranges (p. 1617).
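The exact ranges are defined in the sample notebook; the following is only a representative sketch with illustrative parameter names and bounds:

from sagemaker.tuner import ContinuousParameter

hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.0001, 0.05),
    'momentum': ContinuousParameter(0.0, 0.99),
    'weight_decay': ContinuousParameter(0.0, 0.99)
}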
The following code configures the warm start tuning job by creating a WarmStartConfig object.
parent_tuning_job_name = "MyParentTuningJob"

warm_start_config = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={parent_tuning_job_name}
)
Now set the values for static hyperparameters, which are hyperparameters that keep the same
value for every training job that the warm start tuning job launches. In the following code,
imageclassification is an estimator that was created previously.
imageclassification.set_hyperparameters(
    num_layers=18,
    image_shape='3,224,224',
    num_classes=257,
    num_training_samples=15420,
    mini_batch_size=128,
    epochs=30,
    optimizer='sgd',
    top_k='2',
    precision_dtype='float32',
    augmentation_type='crop'
)
Now create the HyperparameterTuner object and pass the WarmStartConfig object that you
previously created as the warm_start_config argument.
tuner_warm_start = HyperparameterTuner(
    imageclassification,
    'validation:accuracy',
    hyperparameter_ranges,
    objective_type='Maximize',
    max_jobs=10,
    max_parallel_jobs=2,
    base_tuning_job_name='warmstart',
    warm_start_config=warm_start_config
)
Finally, call the fit method of the HyperparameterTuner object to launch the warm start tuning job.
tuner_warm_start.fit(
    {'train': s3_input_train, 'validation': s3_input_validation},
    include_cls_metadata=False
)
Resource Limits for Automatic Model Tuning

For example, suppose that you plan to run 10 concurrent hyperparameter tuning jobs, each of which will run up to 100 total training jobs and 20 concurrent training jobs, with each training job running on one ml.m4.xlarge instance. In that case, check the following resource limits:
• Number of concurrent hyperparameter tuning jobs: You don't need to increase the limit, because 10
tuning jobs is below the limit of 100.
• Number of training jobs per hyperparameter tuning job: You don't need to increase the limit, because
100 training jobs is below the limit of 750.
• Number of concurrent training jobs per hyperparameter tuning job: You need to request a limit
increase to 20, because the default limit is 10.
• SageMaker training ml.m4.xlarge instances: You need to request a limit increase to 200, because you
have 10 hyperparameter tuning jobs, each of which is running 20 concurrent training jobs. The default
limit is 20 instances.
• SageMaker training total instance count: You need to request a limit increase to 200, because you have
10 hyperparameter tuning jobs, each of which is running 20 concurrent training jobs. The default limit
is 20 instances.
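The instance math behind the last two requests is a direct multiplication:

tuning_jobs = 10                 # concurrent hyperparameter tuning jobs
concurrent_training_jobs = 20    # concurrent training jobs per tuning job
instances_per_training_job = 1   # each training job runs on one ml.m4.xlarge

required_instances = (tuning_jobs
                      * concurrent_training_jobs
                      * instances_per_training_job)
print(required_instances)  # 200, well above the default limit of 20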
To request a quota increase

1. Open the AWS Support Center page, sign in if necessary, and then choose Create case.
2. On the Create case page, choose Service limit increase.
3. On the Case details panel, select SageMaker Automatic Model Tuning [Hyperparameter Optimization] for the Limit type.
4. On the Requests panel for Request 1, select the Region, the resource Limit to increase, and the New Limit value you are requesting. Select Add another request if you have additional requests for quota increases.
Best Practices for Hyperparameter Tuning
Topics
• Choosing a tuning strategy (p. 1647)
• Choosing the number of hyperparameters (p. 1648)
• Choosing hyperparameter ranges (p. 1648)
• Using the correct scales for hyperparameters (p. 1648)
• Choosing the best number of parallel training jobs (p. 1648)
• Running training jobs on multiple instances (p. 1649)
• Using a random seed to reproduce hyperparameter configurations (p. 1649)
Choosing a tuning strategy

For large jobs, use the Hyperband tuning strategy. Hyperband has an early stopping mechanism that can stop under-performing jobs, reallocate resources towards well-utilized hyperparameter configurations, and run parallel jobs. For smaller training jobs using less runtime, use either random search or Bayesian optimization.
Use Bayesian optimization to make increasingly informed decisions about improving hyperparameter
configurations in the next run. Bayesian optimization uses information gathered from prior runs to
improve subsequent runs. Because of its sequential nature, Bayesian optimization cannot massively scale.
Use random search to run a large number of parallel jobs. In random search, subsequent jobs do not
depend on the results from prior jobs and can be run independently. Compared to other strategies,
random search is able to run the largest number of parallel jobs.
Use grid search to reproduce results of a tuning job, or if simplicity and transparency of the optimization
algorithm are important. You can also use grid search to explore the entire hyperparameter search space
evenly. Grid search methodically searches through every hyperparameter combination to find optimal
hyperparameter values. Unlike grid search, Bayesian optimization, random search and Hyperband all
draw hyperparameters randomly from the search space. Because grid search analyzes every combination
of hyperparameters, optimal hyperparameter values will be identical between tuning jobs that use the
same hyperparameters.
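In the SageMaker Python SDK, the strategy is a parameter of the HyperparameterTuner. The following is a minimal sketch that assumes an existing estimator, objective metric, and hyperparameter ranges:

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name='validation:accuracy',
    hyperparameter_ranges=hyperparameter_ranges,
    strategy='Bayesian',  # or 'Random', 'Hyperband', 'Grid'
    max_jobs=50,
    max_parallel_jobs=5
)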
Choosing the number of hyperparameters

Although you can simultaneously specify up to 30 hyperparameters, limiting your search to a smaller
number can reduce computation time. Reducing computation time allows SageMaker to converge more
quickly to an optimal hyperparameter configuration.
Debug and Profile Training Jobs Using Amazon SageMaker Debugger
SageMaker Debugger profiles and debugs training jobs to help resolve problems such as system bottlenecks, overfitting, and vanishing gradients, and to improve your ML model's compute resource utilization and performance. Debugger offers tools to send alerts when training anomalies are found, take action on problems, and identify their root cause by visualizing collected metrics and tensors.
SageMaker Debugger supports the Apache MXNet, PyTorch, TensorFlow, and XGBoost frameworks. For
more information about available frameworks and versions supported by SageMaker Debugger, see
Supported Frameworks and Algorithms (p. 1650).
Using Debugger typically involves the following workflow:
1. Modify your training script with the sagemaker-debugger Python SDK if needed.
2. Configure a SageMaker training job with SageMaker Debugger.
• Configure using the SageMaker Estimator API (for Python SDK).
• Configure using the SageMaker CreateTrainingJob request (for Boto3 or CLI).
• Configure custom training containers (p. 1795) with SageMaker Debugger.
3. Start a training job and monitor training issues in real time.
• SageMaker Studio Debugger dashboards in Studio Experiments and trials (p. 1721).
• List of Debugger Built-in Rules (p. 1748).
4. Get alerts and take prompt actions against the training issues.
• Receive texts and emails and stop training jobs when training issues are found using Debugger
Built-in Actions for Rules (p. 1698).
• Set up your own actions using Amazon CloudWatch Events and AWS Lambda (p. 1702).
5. Receive training reports, suggestions to fix the issues, and insights into your training jobs.
• Studio Debugger Insights dashboard for deep learning frameworks
• Deep learning framework profiling report
• SageMaker XGBoost training report
6. Explore deep analysis of the training issues and bottlenecks.
• For profiling training jobs, see Analyze Data Using the SMDebug Client Library (p. 1740).
• For debugging model output tensors, see Visualize Debugger Output Tensors in TensorBoard (p. 1707).
7. Fix the issues, considering the suggestions provided by Debugger, and repeat steps 1–5 until you
optimize your model and achieve target accuracy.
The SageMaker Debugger developer guide walks you through the following topics.
Topics
• Supported Frameworks and Algorithms (p. 1650)
• Amazon SageMaker Debugger Architecture (p. 1653)
• Get Started with Debugger Tutorials (p. 1654)
• Debug Training Jobs Using Amazon SageMaker Debugger (p. 1664)
• Profile Training Jobs Using Amazon SageMaker Debugger (p. 1709)
• List of Debugger Built-in Rules (p. 1748)
• Create Debugger Custom Rules for Training Job Analysis (p. 1793)
• Use Debugger with Custom Training Containers (p. 1795)
• Configure Debugger Using Amazon SageMaker API (p. 1799)
• Best Practices for Amazon SageMaker Debugger (p. 1809)
• Amazon SageMaker Debugger Advanced Topics and Reference Documentation (p. 1812)
• Amazon SageMaker Debugger Release Notes (p. 1820)
TensorFlow
• Monitoring: all AWS Deep Learning Containers
• Profiling: AWS TensorFlow deep learning containers >= v2.3.1, < v2.11
• Debugging output tensors: AWS TensorFlow deep learning containers 1.15.4 or later
• Monitoring system bottlenecks – Monitor the system utilization rate for resources such as CPU, GPU, memory, network, and data I/O metrics. This feature is framework and model agnostic and is available for any training job in SageMaker.
• Profiling deep learning framework operations – Profile the deep learning operations of the
TensorFlow and PyTorch frameworks, such as step durations, data loaders, forward and backward
operations, Python profiling metrics, and framework-specific metrics.
Warning
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow
2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks
and SDKs as follows.
• SageMaker Python SDK <= v2.130.0
• PyTorch >= v1.6.0, < v2.0
• TensorFlow >= v2.3.1, < v2.11
See also Amazon SageMaker Debugger Release Notes: March 16, 2023 (p. 1820).
• Debugging output tensors – Track and debug model parameters, such as weights, gradients, biases,
and scalar values of your training job. Available deep learning frameworks are Apache MXNet,
TensorFlow, PyTorch, and XGBoost.
Important
For the TensorFlow framework with Keras, SageMaker Debugger deprecates the zero code
change support for debugging models built using the tf.keras modules of TensorFlow
2.6 and later. This is due to breaking changes announced in the TensorFlow 2.6.0 release
note. For instructions on how to update your training script, see the section called
“TensorFlow” (p. 1667).
Important
For PyTorch v1.12.0 and later, SageMaker Debugger deprecates the zero code change support for debugging models.
This is due to breaking changes that cause SageMaker Debugger to interfere with the
torch.jit functionality. For instructions on how to update your training script, see the
section called “PyTorch” (p. 1665).
If the framework or algorithm that you want to train and debug is not listed in the table, go to the AWS
Discussion Forum and leave feedback on SageMaker Debugger.
AWS Regions
Amazon SageMaker Debugger is available in all regions where Amazon SageMaker is in service except the
following region.
To find if Amazon SageMaker is in service in your AWS Region, see AWS Regional Services.
For more information about how to build your training container with the sagemaker-debugger client
library, push it to the Amazon Elastic Container Registry (Amazon ECR), and monitor and debug, see Use
Debugger with Custom Training Containers (p. 1795).
For direct resources about the Debugger and sagemaker-debugger API operations, see the following
links:
If you use the SDK for Java to conduct SageMaker training jobs and want to configure Debugger APIs, see
the following references:
Amazon SageMaker Debugger Architecture
Debugger supports profiling functionality for performance optimization to identify computation issues,
such as system bottlenecks and underutilization, and to help optimize hardware resource utilization at
scale.
Debugger's debugging functionality for model optimization is about analyzing non-converging training
issues that can arise while minimizing the loss functions using optimization algorithms, such as gradient
descent and its variations.
The following diagram shows the architecture of SageMaker Debugger. The blocks with bold boundary
lines are what Debugger manages to analyze your training job.
Debugger stores the following data from your training jobs in your secured Amazon S3 bucket:
• System metrics – Hardware resource utilization data, such as CPU, GPU, CPU and GPU memory,
network, and data input and output (I/O) metrics.
• Framework metrics – Metrics to track each framework operation per call or sampling, such as
convolutional layer operations in the forward pass, batch normalization operations in the backward
pass, data loader processes between steps, and gradient descent algorithm operations to calculate and
update the loss function.
• Output tensors – Collections of scalars and model parameters that are continuously updated during
the forward and backward passes while training ML models. The output tensors include scalar values
(accuracy and loss) and matrices (weights, gradients, input layers, and output layers).
Note
By default, Debugger monitors and debugs SageMaker training jobs without any Debugger-specific parameters configured in SageMaker estimators. Debugger collects system metrics every 500 milliseconds and basic output tensors (scalar outputs such as loss and accuracy) every 500 steps. It also runs the ProfilerReport rule to analyze the system metrics and aggregate them into the Studio Debugger insights dashboard and a profiling report. Debugger saves the output data in your secured Amazon S3 bucket.
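For example, after a training job completes, you can open that default debug output with the SMDebug client library. The following is a minimal sketch that assumes a completed estimator:

from smdebug.trials import create_trial

# Open the saved output tensors of the most recent training job.
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
print(trial.tensor_names())  # names of all saved tensors
# Values of one tensor by step; exact names depend on the framework.
print(trial.tensor(trial.tensor_names()[0]).values())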
The Debugger built-in rules run on processing containers, which are designed to evaluate machine
learning models by processing the training data collected in your S3 bucket (see Process Data and
Evaluate Models). The built-in rules are fully managed by Debugger. You can also create your own rules
customized to your model to watch for any issues you want to monitor.
Get Started with Debugger Tutorials

Topics
• Debugger Tutorial Videos (p. 1654)
• Debugger Example Notebooks (p. 1655)
• Debugger Advanced Demos and Visualization (p. 1657)
Debugger Tutorial Videos

Topics
• Analyze, Detect, and Get Alerted on Problems with Training Runs Using Amazon SageMaker
Debugger (p. 1654)
• Debug Models with Amazon SageMaker Debugger in Studio (p. 1654)
• Deep Dive on Amazon SageMaker Debugger and SageMaker Model Monitor (p. 1655)
Analyze, Detect, and Get Alerted on Problems with Training Runs Using Amazon
SageMaker Debugger
Emily Webber, AWS Machine Learning Specialist | Length: 13 minutes 54 seconds
This tutorial video gives you a tour of Amazon SageMaker Debugger to capture, debug, and visualize model output data from a model trained with MXNet. Learn how Amazon SageMaker Debugger makes the training process transparent by automatically capturing metrics, analyzing training runs, and detecting problems.
Analyze, Detect, and Get Alerted on Problems with Training Runs Using Amazon SageMaker Debugger
You can find the example notebook in this video at Visualizing Debugging Tensors of MXNet training in
the Amazon SageMaker Examples GitHub repository.
Debug Models with Amazon SageMaker Debugger in Studio

This tutorial video demonstrates how to use Amazon SageMaker Debugger to capture and inspect
debugging information from a training model. The example training model used in this video is a simple
convolutional neural network (CNN) based on Keras with the TensorFlow backend. SageMaker in a
TensorFlow framework and Debugger enable you to build an estimator directly using the training script
and debug the training job.
You can find the example notebook in the video in this Studio Demo repository provided by the author.
You need to clone the debugger.ipynb notebook file and the mnist_keras_tf.py training script
to your SageMaker Studio or a SageMaker notebook instance. After you clone the two files, specify the
path keras_script_path to the mnist_keras_tf.py file inside the debugger.ipynb notebook.
For example, if you cloned the two files in the same directory, set it as keras_script_path =
"mnist_keras_tf.py".
Deep Dive on Amazon SageMaker Debugger and SageMaker Model Monitor

This video session explores advanced features of Debugger and SageMaker Model Monitor that help
boost productivity and the quality of your models. First, this video shows how to detect and fix training
issues, visualize tensors, and improve models with Debugger. Next, at 22:41, the video shows how to
monitor models in production and identify prediction issues such as missing features or data drift using
SageMaker Model Monitor. Finally, it offers cost optimization tips to help you make the most of your
machine learning budget.
You can find the example notebook in the video in this AWS Dev Days 2020 repository offered by the
author.
Debugger Example Notebooks

We recommend that you run the example notebooks on SageMaker Studio or a SageMaker Notebook
instance because most of the examples are designed for training jobs in the SageMaker ecosystem,
including Amazon EC2, Amazon S3, and Amazon SageMaker Python SDK.
To clone the example repository to SageMaker Studio, follow the instructions at Amazon SageMaker
Studio Tour.
To find the examples in a SageMaker Notebook instance, follow the instructions at SageMaker Notebook
Instance Example Notebooks.
Important
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the
SMDebug client library. In your iPython kernel, Jupyter Notebook, or JupyterLab environment,
run the following code to install the latest versions of the libraries and restart the kernel.
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
• Visualizing Debugging Tensors of MXNet training – Framework: MXNet; Model: Gluon convolutional neural network; Dataset: Fashion MNIST. Run a training job and configure SageMaker Debugger to store all tensors from this job, then visualize those tensors in a notebook.
• Enable Spot Training with Amazon SageMaker Debugger – Framework: MXNet; Model: Gluon convolutional neural network; Dataset: Fashion MNIST. Learn how Debugger collects tensor data from a training job on a spot instance, and how to use the Debugger built-in rules with managed spot training.
• Explain an XGBoost model that predicts an individual's income with Amazon SageMaker Debugger – Framework: XGBoost; Model: Regression; Dataset: Adult Census dataset. Learn how to use the Debugger hook and built-in rules for collecting and visualizing tensor data from an XGBoost regression model, such as loss values, features, and SHAP values.
To find advanced visualizations of model parameters and use cases, see the next topic at Debugger
Advanced Demos and Visualization (p. 1657).
Debugger Advanced Demos and Visualization

Topics
• Train and Tune Your Models with Amazon SageMaker Experiments and Debugger (p. 1658)
• Using SageMaker Debugger to Monitor a Convolutional Autoencoder Model Training (p. 1661)
• Using SageMaker Debugger to Monitor Attentions in BERT Model Training (p. 1661)
• Using SageMaker Debugger to Visualize Class Activation Maps in Convolutional Neural Networks
(CNNs) (p. 1664)
Train and Tune Your Models with Amazon SageMaker Experiments and
Debugger
Dr. Nathalie Rauschmayr, AWS Applied Scientist | Length: 49 minutes 26 seconds
Find out how Amazon SageMaker Experiments and Debugger can simplify the management of your training jobs. Amazon SageMaker Debugger provides transparent visibility into training jobs and saves training metrics into your Amazon S3 bucket. SageMaker Experiments enables you to call the training information as trials through SageMaker Studio and supports visualization of the training job. This helps you keep model quality high while pruning less important parameters based on importance rank.
This video demonstrates a model pruning technique that makes pre-trained ResNet50 and AlexNet models lighter and more affordable while keeping high standards for model accuracy. A SageMaker Estimator trains those algorithms, supplied from the PyTorch model zoo, in AWS Deep Learning Containers with the PyTorch framework, and Debugger extracts training metrics from the training process.
The video also demonstrates how to set up a Debugger custom rule to watch the accuracy of a pruned
model, to trigger an Amazon CloudWatch event and an AWS Lambda function when the accuracy hits a
threshold, and to automatically stop the pruning process to avoid redundant iterations.
• Learn how to use SageMaker to accelerate ML model training and improve model quality.
• Understand how to manage training iterations with SageMaker Experiments by automatically
capturing input parameters, configurations, and results.
• Discover how Debugger makes the training process transparent by automatically capturing real-time
tensor data from metrics such as weights, gradients, and activation outputs of convolutional neural
networks.
• Use CloudWatch to trigger Lambda when Debugger catches issues.
• Master the SageMaker training process using SageMaker Experiments and Debugger.
You can find the notebooks and training scripts used in this video from SageMaker Debugger PyTorch
Iterative Model Pruning.
The following image shows how the iterative model pruning process reduces the size of AlexNet by
cutting out the 100 least significant filters based on importance rank evaluated by activation outputs
and gradients.
The pruning process reduced the initial 50 million parameters to 18 million. It also reduced the estimated
model size from 201 MB to 73 MB.
You also need to track model accuracy, and the following image shows how you can plot the model
pruning process to visualize changes in model accuracy based on the number of parameters in
SageMaker Studio.
In SageMaker Studio, choose the Experiments tab, select a list of tensors saved by Debugger from the
pruning process, and then compose a Trial Component List panel. Select all ten iterations and then
choose Add chart to create a Trial Component Chart. After you decide on a model to deploy, choose the
trial component and choose a menu to perform an action or choose Deploy model.
Note
To deploy a model through SageMaker Studio using the following notebook example, add a line
at the end of the train function in the train.py script.
# In the train.py script, look for the train function in line 58.
def train(epochs, batch_size, learning_rate):
    ...
    print('acc:{:.4f}'.format(correct/total))
    hook.save_scalar("accuracy", correct/total, sm_metric=True)

# Add the following code to line 128 of the train.py script to save the pruned models
# under the current SageMaker Studio model directory
torch.save(model.state_dict(), os.environ['SM_MODEL_DIR'] + '/model.pt')
Using SageMaker Debugger to Monitor a Convolutional Autoencoder Model Training
The training model in this notebook is a convolutional autoencoder with the MXNet framework. The
convolutional autoencoder has a bottleneck-shaped convolutional neural network that consists of an
encoder part and a decoder part.
The encoder in this example has two convolution layers to produce compressed representation (latent
variables) of the input images. In this case, the encoder produces a latent variable of size (1, 20) from an
original input image of size (28, 28) and significantly reduces the size of data for training by 40 times.
The decoder has two deconvolutional layers and ensures that the latent variables preserve key
information by reconstructing output images.
The convolutional encoder provides clustering algorithms with a smaller input data size and improves the performance of clustering algorithms such as k-means, k-NN, and t-Distributed Stochastic Neighbor Embedding (t-SNE).
This notebook example demonstrates how to visualize the latent variables using Debugger, as shown in the following animation. It also demonstrates how the t-SNE algorithm classifies the latent variables into ten clusters and projects them into a two-dimensional space. The scatter plot color scheme on the right side of the image reflects the true values to show how well the model and the t-SNE algorithm organize the latent variables into the clusters.
Using SageMaker Debugger to Monitor Attentions in BERT Model Training

The BERT model is pre-trained on unsupervised tasks such as predicting missing words in a sentence or
predicting the next sentence that naturally follows a previous sentence. The training data contains 3.3
billion words (tokens) of English text, from sources such as Wikipedia and electronic books. For a simple
example, the BERT model can give a high attention to appropriate verb tokens or pronoun tokens from a
subject token.
The pre-trained BERT model can be fine-tuned with an additional output layer to achieve state-of-the-
art model training in NLP tasks, such as automated responses to questions, text classification, and many
others.
Debugger collects tensors from the fine-tuning process. In the context of NLP, the weight of neurons is
called attention.
This notebook demonstrates how to use the pre-trained BERT model from the GluonNLP model zoo on
the Stanford Question and Answering dataset and how to set up SageMaker Debugger to monitor the
training job.
Plotting attention scores and individual neurons in the query and key vectors can help to identify causes
of incorrect model predictions. With SageMaker Debugger, you can retrieve the tensors and plot the
attention-head view in real time as training progresses and understand what the model is learning.
The following animation shows the attention scores of the first 20 input tokens for ten iterations in the
training job provided in the notebook example.
Using SageMaker Debugger to Visualize Class Activation Maps in Convolutional Neural Networks (CNNs)
In this notebook, the PyTorch ResNet model is trained on the German Traffic Sign Dataset, which
contains more than 40 classes of traffic-related objects and more than 50,000 images in total.
During the training process, SageMaker Debugger collects tensors to plot the class activation maps in real time. As shown in the animated image, the class activation map (also called a saliency map) highlights regions with high activation in red.
Using tensors captured by Debugger, you can visualize how the activation map evolves during the model
training. The model starts by detecting the edge on the lower-left corner at the beginning of the training
job. As the training progresses, the focus shifts to the center and detects the speed limit sign, and the
model successfully predicts the input image as Class 3, which is a class of speed limit 60km/h signs, with
a 97% confidence level.
Debug Training Jobs Using Amazon SageMaker Debugger

Topics
• Step 1: Adapt Your Training Script to Register a Hook (p. 1665)
• Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK (p. 1669)
• SageMaker Debugger Interactive Report for XGBoost (p. 1684)
• Action on Amazon SageMaker Debugger Rules (p. 1698)
• Visualize Amazon SageMaker Debugger Output Tensors in TensorBoard (p. 1707)
Step 1: Adapt Your Training Script to Register a Hook

The sagemaker-debugger Python SDK provides wrapper functions that help register a hook to extract
model tensors, without altering your training script. To get started with collecting model output tensors
and debug them to find training issues, make the following modifications in your training script.
Tip
While you're following this page, use the sagemaker-debugger open source SDK
documentation for API references.
Topics
• Adapt Your PyTorch Training Script (p. 1665)
• Adapt Your TensorFlow Training Script (p. 1667)
Adapt Your PyTorch Training Script

If you bring a PyTorch training script, you can run the training job and extract model output tensors with
a few additional code lines in your training script. You need to use the hook APIs in the sagemaker-
debugger client library. Walk through the following instructions that break down the steps with code
examples.
1. Create a hook.
When you launch a training job in the section called “Step 2: Launch and Debug Training Jobs Using
SageMaker Python SDK” (p. 1669) with any of the DebuggerHookConfig, TensorBoardConfig, or Rules
in your estimator, SageMaker adds a JSON configuration file to your training instance that is picked
up by the get_hook function. Note that if you do not include any of the configuration APIs in your
estimator, there will be no configuration file for the hook to find, and the function returns None.
If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2 instances, or your own local devices, use the smd.Hook class to create a hook. However, this approach can only store the tensor collections and make them usable for TensorBoard visualization. SageMaker Debugger's built-in Rules don't work with local mode, because the Rules require SageMaker ML training instances and S3 to store outputs from the remote instances in real time. The smd.get_hook API returns None in this case.
If you want to create a manual hook to save tensors in local mode, use the following code snippet
with the logic to check if the smd.get_hook API returns None and create a manual hook using the
smd.Hook class. Note that you can specify any output directory in your local machine.
import smdebug.pytorch as smd

# Returns the hook created from the SageMaker Debugger JSON configuration,
# or None if no configuration was attached to the job (for example, in local mode).
hook = smd.get_hook()

if hook is None:
    hook = smd.Hook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )
2. Wrap your model with the hook's class methods.

The hook.register_module() method takes your model and iterates through each layer, looking for any tensors that match with regular expressions that you'll provide through the configuration in the section called "Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK" (p. 1669). The collectable tensors through this hook method are weights, biases, activations, gradients, inputs, and outputs.

hook.register_module(model)
Tip
If you collect the entire output tensors from a large deep learning model, the total size of those collections can grow exponentially and might cause bottlenecks. If you want to save specific tensors, you can also use the hook.save_tensor() method. This method helps you pick the variable for the specific tensor and save it to a custom collection with a name of your choice. For more information, see step 7 (p. 1667) of this instruction.
3. Wrap the loss function with the hook's class methods.

The hook.register_loss method wraps the loss function. It extracts loss values at every save_interval that you set during configuration in the section called "Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK" (p. 1669), and saves them to the "losses" collection.
hook.register_loss(loss_function)
4. Add hook.set_mode(ModeKeys.TRAIN) in the train block. This indicates the tensor collection is
extracted during the training phase.
def train():
...
hook.set_mode(ModeKeys.TRAIN)
5. Add hook.set_mode(ModeKeys.EVAL) in the validation block. This indicates the tensor collection is
extracted during the validation phase.
def validation():
...
hook.set_mode(ModeKeys.EVAL)
6. Use hook.save_scalar() to save custom scalars. You can save scalar values that aren’t in your
model. For example, if you want to record the accuracy values computed during evaluation, add the
following line of code below the line where you calculate accuracy.
hook.save_scalar("accuracy", accuracy)
Note that you need to provide a string as the first argument to name the custom scalar collection. This
is the name that'll be used for visualizing the scalar values in TensorBoard, and can be any string you
want.
7. Use hook.save_tensor() to save custom tensors. Similarly to hook.save_scalar(), you can save
additional tensors, defining your own tensor collection. For example, you can extract input image data
that are passed into the model and save as a custom tensor by adding the following code line, where
"images" is an example name of the custom tensor, image_inputs is an example variable for the
input image data.
hook.save_tensor("images", image_inputs)
Note that you must provide a string to the first argument to name the custom tensor. hook.save_tensor() has a third argument, collections_to_write, to specify the tensor collection in which to save the custom tensor. The default is collections_to_write="default". If you don't explicitly specify the third argument, the custom tensor is saved to the "default" tensor collection.
After you have completed adapting your training script, proceed to the section called “Step 2: Launch
and Debug Training Jobs Using SageMaker Python SDK” (p. 1669).
Adapt Your TensorFlow Training Script

1. Create a hook.

import smdebug.tensorflow as smd
hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)
This creates a hook when you start a SageMaker training job. When you launch a training job in the
section called “Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK” (p. 1669) with
any of the DebuggerHookConfig, TensorBoardConfig, or Rules in your estimator, SageMaker adds
a JSON configuration file to your training instance that is picked up by the smd.get_hook method. Note
that if you do not include any of the configuration APIs in your estimator, there will be no configuration
file for the hook to find, and the function returns None.
If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2 instances, or your own local devices, use the smd.Hook class to create a hook. However, this approach can only store the tensor collections and make them usable for TensorBoard visualization. SageMaker Debugger's built-in Rules don't work with local mode. The smd.get_hook method also returns None in this case.
If you want to create a manual hook, use the following code snippet with the logic to check if the hook
returns None and create a manual hook using the smd.Hook class.
hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)
if hook is None:
    hook=smd.KerasHook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )
After adding the hook creation code, proceed to the following topic for TensorFlow Keras.
Note
SageMaker Debugger currently supports TensorFlow Keras only.
The following procedure walks you through how to use the hook and its methods to collect output scalars and tensors from your model and optimizer.
1. Wrap your Keras model and optimizer with the hook’s class methods.
The hook.register_model() method takes your model and iterates through each layer, looking for
any tensors that match with regular expressions that you’ll provide through the configuration in the
section called “Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK” (p. 1669). The
collectable tensors through this hook method are weights, biases, and activations.
model=tf.keras.Model(...)
hook.register_model(model)
2. Wrap the optimizer with the hook's wrap_optimizer() method.

optimizer=tf.keras.optimizers.Adam(...)
optimizer=hook.wrap_optimizer(optimizer)
3. Compile the model to run in eager mode.

To collect tensors from the model, such as the input and output tensors of each layer, you must run the training in eager mode. Otherwise, SageMaker Debugger will not be able to collect the tensors. However, other tensors, such as model weights, biases, and the loss, can be collected without explicitly running in eager mode.
model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizer,
    metrics=["accuracy"],
    # Required for collecting tensors of each layer
    run_eagerly=True
)
4. Register the hook as a Keras callback.

To collect the tensors from the hooks that you registered, add callbacks=[hook] to the Keras model.fit() class method. This will pass the sagemaker-debugger hook as a Keras callback.
model.fit(
    X_train, Y_train,
    batch_size=batch_size,
    epochs=epoch,
    validation_data=(X_valid, Y_valid),
    shuffle=True,
    callbacks=[hook]
)
5. TensorFlow 2.x provides only symbolic gradient variables that do not provide access to their values. To
collect gradients, wrap tf.GradientTape by the hook.wrap_tape() method, which requires you
to write your own training step as follows.
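The following is a minimal sketch of such a custom training step, with illustrative variable and function names:

def training_step(model, optimizer, loss_fn, data, labels):
    # Wrapping the tape lets the hook save the trainable variables,
    # the loss, and the computed gradients.
    with hook.wrap_tape(tf.GradientTape()) as tape:
        logits = model(data, training=True)
        loss_value = loss_fn(labels, logits)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss_value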
By wrapping the tape, the sagemaker-debugger hook can identify output tensors such as gradients, parameters, and losses. Wrapping the tape applies the hook.wrap_tape() method to functions of the tape object, such as push_tape(), pop_tape(), and gradient(), which sets up the writers of SageMaker Debugger and saves the tensors that are provided as input to gradient() (trainable variables and loss) and output of gradient() (gradients).
Note
To collect tensors with a custom training loop, make sure that you use eager mode. Otherwise, SageMaker Debugger is not able to collect any tensors.
For a full list of actions that the sagemaker-debugger hook APIs offer to construct hooks and save
tensors, see Hook Methods in the sagemaker-debugger Python SDK documentation.
After you have completed adapting your training script, proceed to the section called “Step 2: Launch
and Debug Training Jobs Using SageMaker Python SDK” (p. 1669).
Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK

To configure a SageMaker estimator with Debugger-specific parameters, construct the estimator as shown in the following framework examples.

PyTorch
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, DebuggerHookConfig, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
TensorFlow

import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, ProfilerRule, DebuggerHookConfig, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
MXNet

import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import Rule, DebuggerHookConfig, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
XGBoost

import sagemaker
from sagemaker.xgboost import XGBoost
from sagemaker.debugger import Rule, DebuggerHookConfig, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
Generic estimator

import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, DebuggerHookConfig, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
Note
SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the format of the default S3 bucket URI in your account is s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/. There are two subfolders created by SageMaker Debugger: debug-output and rule-output. If you add the tensorboard_output_config parameter, you'll also find a tensorboard-output folder.
See the following topics to find more examples of how to configure the Debugger-specific parameters in
detail.
Topics
• Configure SageMaker Debugger to Save Tensors (p. 1672)
• Configure Debugger Built-in Rules (p. 1678)
• Turn Off Debugger (p. 1683)
• Useful SageMaker Estimator Classmethods for Debugger (p. 1684)
Configure SageMaker Debugger to Save Tensors

Configure Tensor Collections Using the CollectionConfig API

To select which tensors to save, configure tensor collections with the CollectionConfig API. For example, the following code sets up the "weights" and "gradients" built-in collections:
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(name="weights"),
    CollectionConfig(name="gradients")
]
The preceding collections set up the Debugger hook to save the tensors every 500 steps based on the
default "save_interval" value.
For a full list of available Debugger built-in collections, see Debugger Built-in Collections.
If you want to customize the built-in collections, such as changing the save intervals and tensor regex,
use the following CollectionConfig template to adjust parameters.
collection_configs=[
    CollectionConfig(
        name="tensor_collection",
        parameters={
            "key_1": "value_1",
            "key_2": "value_2",
            ...
            "key_n": "value_n"
        }
    )
]
For more information about available parameter keys, see CollectionConfig in the Amazon SageMaker
Python SDK. For example, the following code example shows how you can adjust the save intervals of
the "losses" tensor collection at different phases of training: save loss every 100 steps in training phase
and validation loss every 10 steps in validation phase.
collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "train.save_interval": "100",
            "eval.save_interval": "10"
        }
    )
]
Tip
This tensor collection configuration object can be used for both DebuggerHookConfig and Rule
API operations.
To save the configured tensor collections, pass them to the collection_configs parameter of the DebuggerHookConfig API:

debugger_hook_config=DebuggerHookConfig(
    collection_configs=collection_configs
)
Debugger saves the model training output tensors into the default S3 bucket. The format of the default
S3 bucket URI is s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/
debug-output/.
If you want to specify an exact S3 bucket URI, use the following code example:
debugger_hook_config=DebuggerHookConfig(
    s3_output_path="specify-your-s3-bucket-uri",
    collection_configs=collection_configs
)
For more information, see DebuggerHookConfig in the Amazon SageMaker Python SDK.
Topics
• Tensor Visualization Example Notebooks (p. 1674)
• Save Tensors Using Debugger Built-in Collections (p. 1676)
• Save Tensors Using Debugger Modified Built-in Collections (p. 1677)
• Save Tensors Using Debugger Custom Collections (p. 1677)
Tensor Visualization Example Notebooks

This notebook example shows how to visualize saved tensors using Amazon SageMaker Debugger.
By visualizing the tensors, you can see how the tensor values change while training deep learning
algorithms. This notebook includes a training job with a poorly configured neural network and uses
Amazon SageMaker Debugger to aggregate and analyze tensors, including gradients, activation
outputs, and weights. For example, the following plot shows the distribution of gradients of a
convolutional layer that is suffering from a vanishing gradient problem.
This notebook also illustrates how a good initial hyperparameter setting improves the training process
by generating the same tensor distribution plots.
• Visualizing and Debugging Tensors from MXNet Model Training
This notebook example shows how to save and visualize tensors from an MXNet Gluon model training
job using Amazon SageMaker Debugger. It illustrates that Debugger is set to save all tensors to an
Amazon S3 bucket and retrieves ReLu activation outputs for the visualization. The following figure
shows a three-dimensional visualization of the ReLu activation outputs. The color scheme is set to blue
to indicate values close to 0 and yellow to indicate values close to 1.
The tensor_plot.py script provided with the notebook retrieves tensors using Debugger and visualizes
the CNN. You can run this notebook on SageMaker Studio to reproduce the tensor visualization and
implement your own convolutional neural network model.
• Real-time Tensor Analysis in a SageMaker Notebook with MXNet
This example guides you through installing required components for emitting tensors in an Amazon
SageMaker training job and using the Debugger API operations to access those tensors while training
is running. A gluon CNN model is trained on the Fashion MNIST dataset. While the job is running, you
will see how Debugger retrieves activation outputs of the first convolutional layer from each of 100
batches and visualizes them. Also, this will show you how to visualize weights after the job is done.
Save Tensors Using Debugger Built-in Collections

You can use built-in collections of tensors using the CollectionConfig API and save them using the
DebuggerHookConfig API. The following example shows how to use the default settings of Debugger
hook configurations to construct a SageMaker TensorFlow estimator. You can also utilize this for MXNet,
PyTorch, and XGBoost estimators.
Note
In the following example code, the s3_output_path parameter for DebuggerHookConfig is optional. If you do not specify it, Debugger saves the tensors at s3://<output_path>/debug-output/, where the <output_path> is the default output path of SageMaker training jobs. For example:
"s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-training-YYYY-MM-DD-HH-MM-SS-123/debug-output"
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

collection_configs=[
    CollectionConfig(name="weights"),
    CollectionConfig(name="gradients")
]

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(
        BUCKET_NAME=BUCKET_NAME,
        LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

sagemaker_estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
Save Tensors Using Debugger Modified Built-in Collections

The following code example shows how to modify a built-in collection (here, the save interval of the "losses" built-in collection, as a representative example) and pass it through the same hook configuration:

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={"save_interval": "50"}
    )
]

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(
        BUCKET_NAME=BUCKET_NAME,
        LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# Construct sagemaker_estimator as in the previous example, passing hook_config
sagemaker_estimator.fit()
Save Tensors Using Debugger Custom Collections

You can also define your own custom collection with a regular expression, for example to save a reduced number of tensors and limit the amount of data stored in your Amazon S3 bucket:

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

collection_configs=[
    CollectionConfig(
        name="custom_activations_collection",
        parameters={
            "include_regex": ".*relu|.*tanh"
        }
    )
]

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(
        BUCKET_NAME=BUCKET_NAME,
        LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# Construct sagemaker_estimator as in the previous example, passing hook_config
sagemaker_estimator.fit()
Configure Debugger Built-in Rules

In the following topics, you'll learn how to use the SageMaker Debugger built-in rules.
Topics
• Use Debugger Built-in Rules with the Default Parameter Settings (p. 1679)
• Use Debugger Built-in Rules with Custom Parameter Values (p. 1679)
• Example Notebooks and Code Samples to Configure Debugger Rules (p. 1680)
Use Debugger Built-in Rules with the Default Parameter Settings
To specify Debugger built-in rules in an estimator, you need to configure a list object. The following
example code shows the basic structure of listing the Debugger built-in rules:
rules=[
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n()),
    # You can also append more profiler rules in the
    # ProfilerRule.sagemaker(rule_configs.*()) format.
]
For more information about default parameter values and descriptions of the built-in rule, see List of
Debugger Built-in Rules (p. 1748).
For example, to inspect the overall training performance and progress of your model, construct a
SageMaker estimator with the following built-in rule configuration.
rules=[
Rule.sagemaker(rule_configs.loss_not_decreasing()),
Rule.sagemaker(rule_configs.overfit()),
Rule.sagemaker(rule_configs.overtraining()),
Rule.sagemaker(rule_configs.stalled_training_rule())
]
When you start the training job, Debugger collects system resource utilization data every 500 milliseconds and the loss and accuracy values every 500 steps by default. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. The loss_not_decreasing, overfit, overtraining, and stalled_training_rule rules monitor whether your model is optimizing the loss function without running into those training issues. If the rules detect training anomalies, the rule evaluation status changes to IssueFound. You can set up automated actions, such as sending notifications about training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see Action on Amazon SageMaker Debugger Rules (p. 1698).
Use Debugger Built-in Rules with Custom Parameter Values
If you want to adjust the built-in rule parameter values and customize tensor collection regex, configure the base_config and rule_parameters parameters for the ProfilerRule.sagemaker and Rule.sagemaker classmethods. For the Rule.sagemaker classmethod, you can also customize tensor collections through the collections_to_save parameter. Instructions for using the CollectionConfig class are provided in Configure Tensor Collections Using the CollectionConfig API (p. 1673).
Use the following configuration template to customize the parameter values of built-in rules. By changing the rule parameters, you can adjust how sensitively each rule is triggered.
• The base_config argument is where you call the built-in rule methods.
• The rule_parameters argument is to adjust the default key values of the built-in rules listed in List
of Debugger Built-in Rules (p. 1748).
For more information about the Debugger rule class, methods, and parameters, see SageMaker
Debugger Rule class in the Amazon SageMaker Python SDK.
rules=[
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]
The parameter descriptions and value customization examples are provided for each rule at List of
Debugger Built-in Rules (p. 1748).
Example Notebooks and Code Samples to Configure Debugger Rules
The following sections provide notebooks and code samples that show how to use Debugger rules to monitor SageMaker training jobs.
Topics
• Debugger Built-in Rules Example Notebooks (p. 1680)
• Debugger Built-in Rules Example Code (p. 1681)
• Use Debugger Built-in Rules with Parameter Modifications (p. 1682)
Debugger Built-in Rules Example Notebooks
The following example notebooks, available in the SageMaker examples repository, show how to use Debugger built-in rules when running training jobs with Amazon SageMaker.
While running the example notebooks in SageMaker Studio, you can find the training job trial created on the Studio Experiment List tab. For example, you can find and open a Describe Trial Component window for your current training job. On the Debugger tab of that window, you can check the evaluation status of the Debugger rules attached to the training job.
Debugger Built-in Rules Example Code
There are two ways of using the Debugger built-in rules in the SageMaker environment: deploy the built-in rules as they are prepared, or adjust their parameters as you want. The following topics show you how to use the built-in rules with example code.
The following code sample shows how to set the Debugger built-in rules using the Rule.sagemaker method. To specify the built-in rules that you want to run, use the rule_configs module to call the built-in rules. To find a full list of Debugger built-in rules and default parameter values, see List of Debugger Built-in Rules (p. 1748).
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# Specify the built-in rules that you want to run; the rules shown here
# are examples from the built-in rules list.
built_in_rules=[
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing())
]

sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-built-in-rules-demo',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    # Debugger-specific parameter
    rules=built_in_rules
)

sagemaker_estimator.fit(wait=False)
Note
The Debugger built-in rules run in parallel with your training job. The maximum number of
built-in rule containers for a training job is 20.
For more information about the Debugger rule class, methods, and parameters, see the SageMaker
Debugger Rule class in the Amazon SageMaker Python SDK.
To find an example of how to adjust the Debugger rule parameters, see the following Use Debugger
Built-in Rules with Parameter Modifications (p. 1682) section.
Use Debugger Built-in Rules with Parameter Modifications
The following code example shows the structure of built-in rules to adjust parameters. In this example, the stalled_training_rule collects the losses tensor collection from a training job every 50 steps during training and every 10 steps during evaluation. If the training process starts stalling and does not collect tensor outputs for 120 seconds, the stalled_training_rule stops the training job.
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# A prefix for the training job name; the rule uses it to find the job to watch.
base_job_name_prefix='debugger-stalled-demo'

built_in_rules_modified=[
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
            'threshold': '120',
            'training_job_name_prefix': base_job_name_prefix,
            'stop_training_on_fire': 'True'
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "50",
                    "eval.save_interval": "10"
                }
            )
        ]
    )
]

sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name_prefix,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    # Debugger-specific parameter
    rules=built_in_rules_modified
)

sagemaker_estimator.fit(wait=False)
For an advanced configuration of the Debugger built-in rules using the CreateTrainingJob API, see
Configure Debugger Using Amazon SageMaker API (p. 1799).
Turn Off Debugger
If you want to completely turn off Debugger, do one of the following:
• To stop both monitoring and profiling, add the disable_profiler parameter to your estimator and set it to True.
Warning
If you disable it, you won't be able to view the comprehensive Studio Debugger insights
dashboard and the autogenerated profiling report.
estimator=Estimator(
    ...
    disable_profiler=True,
    debugger_hook_config=False
)
For more information about the Debugger-specific parameters, see SageMaker Estimator in the
Amazon SageMaker Python SDK.
• While a training job is running, do the following:
To disable both monitoring and profiling while your training job is running, use the following
estimator classmethod:
estimator.disable_profiling()
To disable framework profiling only and keep system monitoring, use the update_profiler method:
estimator.update_profiler(disable_framework_metrics=True)
For more information about the estimator extension methods, see the estimator.disable_profiling and
estimator.update_profiler classmethods in the Amazon SageMaker Python SDK documentation.
You can use the following estimator classmethods to retrieve information about your training job and its Debugger artifacts:
• To check the S3 bucket URI where the training job output is saved:
estimator.output_path
• To check the name of the latest training job:
estimator.latest_training_job.job_name
• To check the full description of the latest training job:
estimator.latest_training_job.describe()
• To check a full list of the Debugger rules while a SageMaker training job is running:
estimator.latest_training_job.rule_job_summary()
• To check the S3 bucket URI where the model parameter data (output tensors) are saved:
estimator.latest_job_debugger_artifacts_path()
• To check the S3 bucket URI where the model performance data (system and framework metrics) is saved:
estimator.latest_job_profiler_artifacts_path()
• To check the Debugger rule configuration for debugging output tensors:
estimator.debugger_rule_configs
• To check the list of the Debugger rules for debugging while a SageMaker training job is running:
estimator.debugger_rules
• To check the Debugger rule configuration for monitoring and profiling system and framework metrics:
estimator.profiler_rule_configs
• To check the list of the Debugger rules for monitoring and profiling while a SageMaker training job is
running:
estimator.profiler_rules
For more information about the SageMaker estimator class and its methods, see Estimator API in the
Amazon SageMaker Python SDK.
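For example, the following sketch (not from this guide) polls the rule job summary of a job launched with estimator.fit(wait=False) and prints any rule that has found an issue; the dictionary keys follow the DescribeTrainingJob response:

import time

while True:
    # Each summary includes keys such as RuleConfigurationName and
    # RuleEvaluationStatus, mirroring the DescribeTrainingJob response.
    for summary in estimator.latest_training_job.rule_job_summary():
        if summary.get("RuleEvaluationStatus") == "IssueFound":
            print(summary["RuleConfigurationName"], "found an issue")
    if estimator.latest_training_job.describe()["TrainingJobStatus"] != "InProgress":
        break
    time.sleep(60)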
Note
You can download a Debugger report while your training job is running or after the job has finished. During training, Debugger concurrently updates the report, reflecting the current evaluation status of the rules. You can download a complete Debugger report only after the training job has completed.
Important
In the report, plots and recommendations are provided for informational purposes and
are not definitive. You are responsible for making your own independent assessment of the
information.
Topics
• Construct a SageMaker XGBoost Estimator with the Debugger XGBoost Report Rule (p. 1685)
• Download the Debugger XGBoost Training Report (p. 1686)
• Debugger XGBoost Training Report Walkthrough (p. 1689)
Construct a SageMaker XGBoost Estimator with the Debugger XGBoost Report Rule
The CreateXgboostReport (p. 1761) rule collects output tensors from your training job.
The output tensors are saved at a default S3 bucket. For example, s3://
sagemaker-<region>-<12digit_account_id>/<base-job-name>/debug-output/.
When you construct a SageMaker estimator for an XGBoost training job, specify the rule as shown in the
following example code.
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

region = boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-xgboost-report-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    # Debugger-specific parameter
    rules=rules
)

estimator.fit(wait=False)
Download the Debugger XGBoost Training Report
Download the Debugger XGBoost training report while your training job is running or after the job has finished using the Amazon SageMaker Python SDK and AWS Command Line Interface (CLI).
1. Check the current training job's default S3 output base URI:

estimator.output_path

2. Check the name of the current training job:

estimator.latest_training_job.job_name

3. Combine the two to construct the rule output path, where Debugger saves the rule output:

rule_output_path = estimator.output_path + "/" + estimator.latest_training_job.job_name + "/rule-output"

4. To check if the report is generated, list directories and files recursively under the rule_output_path using aws s3 ls with the --recursive option.

This should return a complete list of files under autogenerated folders that are named CreateXgboostReport and ProfilerReport-1234567890. The XGBoost training report is stored in the CreateXgboostReport folder, and the profiling report is stored in the ProfilerReport-1234567890 folder. To learn more about the profiling report generated by default with the XGBoost training job, see SageMaker Debugger Profiling Report (p. 1729).
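For example, assuming the rule_output_path variable from the preceding steps, commands like the following (run in a Jupyter notebook cell) list the report files and download them to the current working directory:

!aws s3 ls {rule_output_path} --recursive
!aws s3 cp {rule_output_path} ./ --recursive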
Tip
If you are using a Jupyter notebook server, run !pwd to verify the current working
directory.
6. Under the /CreateXgboostReport directory, open xgboost_report.html. If you are using
JupyterLab, choose Trust HTML to see the autogenerated Debugger training report.
7. Open the xgboost_report.ipynb file to explore how the report is generated. You can
customize and extend the training report using the Jupyter notebook file.
To download the Debugger XGBoost training report using the Amazon S3 console, do the following:
1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base
S3 bucket name should be in the following format: sagemaker-<region>-111122223333.
Look up the base S3 bucket through the Find bucket by name field.
3. In the base S3 bucket, look up the training job name by entering your job name prefix in Find
objects by prefix and then choosing the training job name.
4. In the training job's S3 bucket, choose the rule-output/ subfolder. There should be three subfolders for training data collected by Debugger: debug-output/, profiler-output/, and rule-output/.
5. In the rule-output/ folder, choose the CreateXgboostReport/ folder. The folder contains xgboost_report.html (the autogenerated report in HTML) and xgboost_report.ipynb (a Jupyter notebook with the scripts that are used for generating the report).
6. Choose the xgboost_report.html file, choose Download actions, and then choose Download.
Debugger XGBoost Training Report Walkthrough
This section walks you through the Debugger XGBoost training report. The report is automatically aggregated depending on the output tensor regex, and recognizes whether your training job is a binary classification, multiclass classification, or regression job.
Important
In the report, plots and recommendations are provided for informational purposes and
are not definitive. You are responsible for making your own independent assessment of the
information.
Topics
• Distribution of True Labels of the Dataset (p. 1690)
• Loss versus Step Graph (p. 1690)
• Feature Importance (p. 1691)
• Confusion Matrix (p. 1692)
• Evaluation of the Confusion Matrix (p. 1693)
• Accuracy Rate of Each Diagonal Element Over Iteration (p. 1694)
• Receiver Operating Characteristic Curve (p. 1695)
• Distribution of Residuals at the Last Saved Step (p. 1696)
• Absolute Validation Error per Label Bin Over Iteration (p. 1697)
Distribution of True Labels of the Dataset
This histogram shows the distribution of labeled classes (for classification) or values (for regression) in
your original dataset. Skewness in your dataset could contribute to inaccuracies. This visualization is
available for the following model types: binary classification, multiclassification, and regression.
Loss versus Step Graph
This is a line chart that shows the progression of loss on training data and validation data throughout
training steps. The loss is what you defined in your objective function, such as mean squared error. You
can gauge whether the model is overfit or underfit from this plot. This section also provides insights that
you can use to determine how to resolve the overfit and underfit problems. This visualization is available
for the following model types: binary classification, multiclassification, and regression.
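If you want to inspect the underlying values yourself, the following sketch (an illustration, not part of the report tooling) uses the SMDebug client library to read loss values from the Debugger output path; it assumes the training job saved a losses tensor collection:

from smdebug.trials import create_trial

trial = create_trial(estimator.latest_job_debugger_artifacts_path())
for tname in trial.tensor_names(collection="losses"):
    tensor = trial.tensor(tname)
    # Print the loss value recorded at each saved step.
    for step in tensor.steps():
        print(tname, step, tensor.value(step))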
Feature Importance
There are three different types of feature importance visualizations provided: Weight, Gain, and Coverage. Detailed definitions for each of the three are provided in the report. Feature importance
visualizations help you learn what features in your training dataset contributed to the predictions.
Feature importance visualizations are available for the following model types: binary classification,
multiclassification, and regression.
Confusion Matrix
This visualization is only applicable to binary and multiclass classification models. Accuracy alone might
not be sufficient for evaluating the model performance. For some use cases, such as healthcare and fraud
detection, it’s also important to know the false positive rate and false negative rate. A confusion matrix
gives you the additional dimensions for evaluating your model performance.
Evaluation of the Confusion Matrix
This section provides you with more insights into the micro, macro, and weighted averages of precision, recall, and F1-score for your model.
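To make the three averaging schemes concrete, the following sketch (scikit-learn is an assumption here, not part of the report) computes the micro, macro, and weighted averages of precision, recall, and F1-score on hypothetical labels and predictions:

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2]  # hypothetical true labels
y_pred = [0, 2, 2, 2, 1, 0, 1]  # hypothetical predicted labels

for average in ("micro", "macro", "weighted"):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average
    )
    print(average, precision, recall, f1)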
Accuracy Rate of Each Diagonal Element Over Iteration
This visualization is only applicable to binary classification and multiclass classification models. This is a
line chart that plots the diagonal values in the confusion matrix throughout the training steps for each
class. This plot shows you how the accuracy of each class progresses throughout the training steps. You
can identify the under-performing classes from this plot.
Receiver Operating Characteristic Curve
This visualization is only applicable to binary classification models. The Receiver Operating Characteristic (ROC) curve is commonly used to evaluate binary classification model performance. The y-axis of the curve is the true positive rate (TPR) and the x-axis is the false positive rate (FPR). The plot also displays the value for the area under the curve (AUC). The higher the AUC value, the more predictive your classifier. You can also use the ROC curve to understand the trade-off between TPR and FPR and identify the optimum classification threshold for your use case. The classification threshold can be adjusted to tune the behavior of the model to reduce more of one or another type of error (FP/FN).
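As an illustration of threshold selection (scikit-learn is an assumption here, not part of the report), the following sketch computes the ROC curve and the AUC, and picks the threshold that maximizes TPR - FPR (Youden's J statistic) as one common operating point:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])                # hypothetical binary labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])  # hypothetical predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
print("Threshold maximizing TPR - FPR:", thresholds[np.argmax(tpr - fpr)])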
Distribution of Residuals at the Last Saved Step
This visualization is a column chart that shows the residual distributions at the last step Debugger captures. In this visualization, you can check whether the residual distribution is close to a normal distribution centered at zero. If the residuals are skewed, your features may not be sufficient for predicting the labels.
Absolute Validation Error per Label Bin Over Iteration
This visualization is only applicable to regression models. The actual target values are split into 10 intervals. This visualization shows, in line plots, how the validation error progresses for each interval throughout the training steps. Absolute validation error is the absolute value of the difference between the prediction and the actual value during validation. You can identify the underperforming intervals from this visualization.
Topics
• Debugger Built-in Actions for Rules (p. 1698)
• Create Actions on Rules Using Amazon CloudWatch and AWS Lambda (p. 1702)
Step 1: Set Up Amazon SNS, Create an SMDebugRules Topic, and Subscribe to the Topic
This section walks you through how to set up an Amazon SNS SMDebugRules topic, subscribe to it, and
confirm the subscription to receive notifications from the Debugger rules.
Note
For more information about billing for Amazon SNS, see Amazon SNS pricing and Amazon SNS
FAQs.
1. Sign in to the AWS Management Console and open the Amazon SNS console at https://
console.aws.amazon.com/sns/v3/home.
2. In the left navigation pane, choose Topics.
3. On the Topics page, choose Create topic, and create a topic named SMDebugRules.
4. In the left navigation pane, choose Subscriptions, and then choose Create subscription. On the Create subscription page, do the following:
a. For Topic ARN, choose the SMDebugRules topic ARN. The ARN should be in the format of arn:aws:sns:<region-id>:111122223333:SMDebugRules.
b. For Protocol, choose Email or SMS.
c. For Endpoint, enter the endpoint value, such as an email address or a phone number that you want to receive notifications at.
Note
Make sure you type the correct email address and phone number. Phone numbers must
include +, a country code, and phone number, with no special characters or spaces. For
example, the phone number +1 (222) 333-4444 is formatted as +12223334444.
5. Skip all other optional settings and choose Create subscription. If you want to learn more about the
optional settings, see Subscribing to an Amazon SNS topic.
After you subscribe to the SMDebugRules topic, you receive a confirmation message by email or phone.
For more information about Amazon SNS, see Mobile text messaging (SMS) and Email notifications in the Amazon SNS Developer Guide.
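If you prefer to script the setup, the console steps above map to AWS CLI calls like the following sketch; the region, account ID, and email address are placeholders that you would replace with your own values:

aws sns create-topic --name SMDebugRules
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:111122223333:SMDebugRules \
    --protocol email \
    --notification-endpoint your_email@example.com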
In this step, you add the required policies to your IAM role.
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Policies, and choose Create policy.
3. On the Create policy page, choose the JSON tab, and paste the following to create a new sns-access policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"sns:Publish",
"sns:CreateTopic",
"sns:Subscribe"
],
"Resource": "arn:aws:sns:*:111122223333:SMDebugRules"
}
]
}
For more examples of setting up IAM policies for Amazon SNS, see Example cases for Amazon SNS access
control.
After successfully finishing the required settings in the preceding steps, you can configure the Debugger
built-in actions for debugging rules as shown in the following example script. You can choose which
built-in actions to use while building the actions list object. The rule_configs is a helper module
that provides high-level tools to configure Debugger built-in rules and actions. The following built-in
actions are available for Debugger:
• rule_configs.StopTraining() – Stops a training job when the Debugger rule finds an issue.
• rule_configs.Email("your_email@example.com") – Sends a notification by email when the Debugger rule finds an issue. Use the email address that you used when you set up your SNS topic subscription.
• rule_configs.SMS("+1234567890") – Sends a notification by text message when the Debugger rule finds an issue. Use the phone number that you used when you set up your SNS topic subscription.
Note
Make sure you type the correct email address and phone number. Phone numbers must
include +, a country code, and a phone number, with no special characters or spaces. For
example, the phone number +1 (222) 333-4444 is formatted as +12223334444.
You can use all of the built-in actions or a subset of actions by wrapping them with the rule_configs.ActionList() method, which takes the built-in actions and configures a list of actions.
If you want to assign all of the three built-in actions to a single rule, configure a Debugger built-in
action list while constructing an estimator. Use the following template to construct the estimator, and
Debugger will stop training jobs and send notifications through email and text for any rules that you use
to monitor your training job progress.
actions=rule_configs.ActionList(
    rule_configs.StopTraining(),
    rule_configs.Email("your_email@example.com"),
    rule_configs.SMS("+1234567890")
)

rules=[
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
            "key": "value"
        },
        actions=actions
    )
]

estimator = Estimator(
    ...
    rules = rules
)

estimator.fit(wait=False)
To create multiple built-in action objects to assign different actions to a single rule
If you want to assign the built-in actions to be triggered at different threshold values of a single rule,
you can create multiple built-in action objects as shown in the following script. To avoid a conflict error
by running the same rule, you must submit different rule job names (specify different strings for the
rules' name attribute) as shown in the following example script template. This example shows how to set
up StalledTrainingRule (p. 1781) to take two different actions: send an email notification when a training job stalls for 60 seconds, and stop the training job if it stalls for 120 seconds.
# Configure the built-in action objects; use the email address from your
# SNS topic subscription.
action_email = rule_configs.ActionList(rule_configs.Email("your_email@example.com"))
action_stop_training = rule_configs.ActionList(rule_configs.StopTraining())

# Configure a rule with the Email built-in action to trigger if a training job stalls for 60 seconds
stalled_training_job_rule_email = Rule.sagemaker(
    base_config=rule_configs.stalled_training_rule(),
    rule_parameters={
        "threshold": "60",
        "training_job_name_prefix": base_job_name_prefix
    },
    actions=action_email
)
stalled_training_job_rule_email.name="StalledTrainingJobRuleEmail"

# Configure a rule with the StopTraining built-in action to trigger if a training job stalls for 120 seconds
stalled_training_job_rule = Rule.sagemaker(
    base_config=rule_configs.stalled_training_rule(),
    rule_parameters={
        "threshold": "120",
        "training_job_name_prefix": base_job_name_prefix
    },
    actions=action_stop_training
)
stalled_training_job_rule.name="StalledTrainingJobRuleStopTraining"
estimator = Estimator(
...
rules = [stalled_training_job_rule_email, stalled_training_job_rule]
)
estimator.fit(wait=False)
While the training job is running, the Debugger built-in action sends notification emails and text
messages whenever the rule finds issues with your training job. The following screenshot shows an
example of email notification for a training job that has a stalled training job issue.
The following screenshot shows an example text notification that Debugger sends when the rule finds a
StalledTraining issue.
Note the following considerations for using the Debugger built-in actions:
• To use the Debugger built-in actions, an internet connection is required. This feature is not supported in the network isolation mode provided by Amazon SageMaker or Amazon VPC.
• The built-in actions cannot be used for Debugger ProfilerRule (p. 1749).
• The built-in actions cannot be used on training jobs with spot training interruptions.
• In email or text notifications, None appears at the end of messages. This does not have any meaning,
so you can disregard the text None.
You can use the training and Debugger rule job status in the CloudWatch logs to take further actions
when there are training issues.
For more information about monitoring training jobs using CloudWatch, see Monitor Amazon
SageMaker.
Set Up Debugger for Automated Training Job Termination Using CloudWatch and Lambda
The Debugger rules monitor training job status, and a CloudWatch Events rule watches the Debugger
rule training job evaluation status.
To find the IAM role ARN attached to your current SageMaker session, you can run the following in a notebook:

import sagemaker
sagemaker.get_execution_role()
The following figure shows an example of the Create function page with the input fields and selections
completed.
1. In the Function code section of the configuration page, paste a Python script like the following in the Lambda code editor pane. The lambda_handler function monitors the Debugger rule evaluation status collected by CloudWatch and triggers the StopTrainingJob API operation. The AWS SDK for Python (Boto3) client for SageMaker provides a high-level method, stop_training_job, which triggers the StopTrainingJob API operation.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

client = boto3.client('sagemaker')

def lambda_handler(event, context):
    # The "SageMaker Training Job State Change" event payload carries the
    # job name and the Debugger rule evaluation statuses.
    training_job_name = event.get("detail", {}).get("TrainingJobName")
    eval_statuses = event.get("detail", {}).get("DebugRuleEvaluationStatuses", [])
    for status in eval_statuses:
        # Stop the training job as soon as any Debugger rule reports an issue.
        if status.get("RuleEvaluationStatus") == "IssueFound":
            logger.info("Stopping training job: %s", training_job_name)
            client.stop_training_job(TrainingJobName=training_job_name)
            break
For more information about the Lambda code editor interface, see Creating functions using the AWS
Lambda console editor.
2. Skip all other settings and choose Save at the top of the configuration page.
Step 3: Create a CloudWatch Events Rule and Link to the Lambda Function for Debugger
To create a CloudWatch Events rule and link to the Lambda function for Debugger
1. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
2. In the left navigation pane, choose Rules under Events.
3. Choose Create rule.
4. In the Event Source section, choose Event Pattern, and paste the following event pattern, which watches SageMaker training job status changes:

{
    "source": [
        "aws.sagemaker"
    ],
    "detail-type": [
        "SageMaker Training Job State Change"
    ]
}
5. In the Targets section, choose Add target, and choose the debugger-rule-stop-training-job Lambda function that you created. This step links the CloudWatch Events rule with the Lambda function.
6. Choose Configure details and go to the Step 2: Configure rule details page.
7. Specify the CloudWatch rule definition name. For example, debugger-cw-event-rule.
8. Choose Create rule to finish.
9. Go back to the Lambda function configuration page and refresh the page. Confirm that it's
configured correctly in the Designer panel. The CloudWatch Events rule should be registered as a
trigger for the Lambda function. The configuration design should look like the following example:
You can run the following example notebooks, which are prepared for experimenting with stopping a
training job using Debugger's built-in rules.
This example notebook runs a training job that has a vanishing gradient issue. The Debugger
VanishingGradient (p. 1769) built-in rule is used while constructing the SageMaker TensorFlow
estimator. When the Debugger rule detects the issue, the training job is terminated.
• Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule
This example notebook runs a training script with a code line that forces it to sleep for 10 minutes. The Debugger StalledTrainingRule (p. 1781) built-in rule detects the stalled training issue and stops the training job.
Disable the CloudWatch Events Rule to Stop Using the Automated Training Job Termination
If you want to disable the automated training job termination, you need to disable the CloudWatch
Events rule. In the Lambda Designer panel, choose the EventBridge (CloudWatch Events) block linked
to the Lambda function. This shows an EventBridge panel below the Designer panel (for example, see
the previous screenshot). Select the check box next to EventBridge (CloudWatch Events): debugger-
cw-event-rule, and then choose Disable. If you want to use the automated termination functionality
later, you can enable the CloudWatch Events rule again.
Use SageMaker Debugger to create output tensor files that are compatible with TensorBoard. Load the
files to visualize in TensorBoard and analyze your SageMaker training jobs. Debugger automatically
generates output tensor files that are compatible with TensorBoard. For any hook configuration
you customize for saving output tensors, Debugger has the flexibility to create scalar summaries,
distributions, and histograms that you can import to TensorBoard.
The following procedure explains how to save scalars, weights, and biases as full tensors, histograms, and
distributions that can be visualized with TensorBoard. Debugger saves them to the training container's
local path (the default path is /opt/ml/output/tensors) and syncs to the Amazon S3 locations
passed through the Debugger output configuration objects.
1. Construct a TensorBoardOutputConfig configuration object to specify the S3 bucket where the TensorBoard-compatible data is saved. For example:
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
bucket = sagemaker.Session().default_bucket()
tensorboard_output_config = TensorBoardOutputConfig(
s3_output_path='s3://{}'.format(bucket)
)
For additional information, see the Debugger TensorBoardOutputConfig API in the Amazon
SageMaker Python SDK.
2. Configure the Debugger hook and customize the hook parameter values. For example, the
following code configures a Debugger hook to save all scalar outputs every 100 steps in training
phases and 10 steps in validation phases, the weights parameters every 500 steps (the default
save_interval value for saving tensor collections is 500), and the bias parameters every 10
global steps until the global step reaches 500.
hook_config = DebuggerHookConfig(
hook_parameters={
"train.save_interval": "100",
"eval.save_interval": "10"
},
collection_configs=[
CollectionConfig("weights"),
CollectionConfig(
name="biases",
parameters={
"save_interval": "10",
"end_step": "500",
"save_histogram": "True"
}
),
]
)
For more information about the Debugger configuration APIs, see the Debugger
CollectionConfig and DebuggerHookConfig APIs in the Amazon SageMaker Python SDK.
3. Construct a SageMaker estimator with the Debugger parameters passing the configuration objects.
The following example template shows how to create a generic SageMaker estimator. You can
replace estimator and Estimator with other SageMaker frameworks' estimator parent classes
and estimator classes. Available SageMaker framework estimators for this functionality are
TensorFlow, PyTorch, and MXNet.
estimator = Estimator(
...
# Debugger parameters
debugger_hook_config=hook_config,
tensorboard_output_config=tensorboard_output_config
)
estimator.fit()
The estimator.fit() method starts a training job, and Debugger writes the output tensor files
in real time to the Debugger S3 output path and to the TensorBoard S3 output path. To retrieve the
output paths, use the following estimator methods:
tensorboard_output_path=estimator.latest_job_tensorboard_artifacts_path()
print(tensorboard_output_path)
!aws s3 ls {tensorboard_output_path}/
6. Download the TensorBoard output data to your notebook instance. For example, the following AWS CLI command downloads the TensorBoard files to /logs/fit under the current working directory of your notebook instance:

!aws s3 cp --recursive {tensorboard_output_path} ./logs/fit
7. Compress the file directory to a TAR file to download to your local machine.
8. Download and extract the TensorBoard TAR file to a directory on your device, launch a Jupyter notebook server, open a new notebook, and run the TensorBoard app.
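For example, commands like the following can be used for steps 7 and 8; the archive and directory names are placeholders:

# On the notebook instance, compress the downloaded TensorBoard logs (step 7)
tar -cvzf tensorboard_logs.tar.gz logs/fit

# On your local machine, extract the logs and launch the TensorBoard app (step 8)
tar -xvzf tensorboard_logs.tar.gz
pip install tensorboard
tensorboard --logdir logs/fit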
For any training job you run in SageMaker using the SageMaker Python SDK, Debugger starts profiling basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time. It collects these resource utilization metrics every 500 milliseconds. To see the graphs of the resource utilization metrics of your training job, use the SageMaker Debugger UI in SageMaker Studio Experiments.
Deep learning operations and steps might operate in intervals of milliseconds. Compared to Amazon CloudWatch metrics, which collect metrics at intervals of 1 second, Debugger provides finer granularity into the resource utilization metrics, down to 100-millisecond (0.1 second) intervals, so you can dive deep into the metrics at the level of an operation or a step.
If you want to change the metric collection time interval, you need to add parameters for profiling
to your training job launcher. If you're using SageMaker Python SDK, you need to pass the
profiler_config parameter when you create an estimator. To learn how to adjust the resource
utilization metric collection interval, see the section called “Construct a SageMaker Estimator with
SageMaker Debugger” (p. 1711) and then the section called “Configure Debugger for Monitoring
Resource Utilization” (p. 1714).
Additionally, you can add profiling analysis tools called built-in profiling rules provided by SageMaker
Debugger. The built-in profiling rules run analysis against the resource utilization metrics and detect
computational performance issues. For more information, see the section called “Configure Built-in
Profiler Rules” (p. 1719). You can receive rule analysis results through the SageMaker Debugger UI in
SageMaker Studio Experiments or the SageMaker Debugger Profiling Report. You can also create custom
profiling rules using the SageMaker Python SDK.
Use the following topics to learn more about profiling functionalities provided by SageMaker Debugger.
Topics
• Configure Debugger Using Amazon SageMaker Python SDK (p. 1710)
• Configure Built-in Profiling Rules Managed by Amazon SageMaker Debugger (p. 1719)
• Amazon SageMaker Debugger UI in Amazon SageMaker Studio Experiments (p. 1721)
• SageMaker Debugger Interactive Report (p. 1729)
• Analyze Data Using the SMDebug Client Library (p. 1740)
If you want to change settings for profiling, you can specify Debugger-specific parameters while creating a SageMaker training job launcher using the SageMaker Python SDK, AWS SDK for Python (Boto3), or AWS Command Line Interface (CLI). In this guide, we focus on how to change profiling options using the Amazon SageMaker Python SDK. There are two parameters in the SageMaker estimator classes: profiler_config for changing the profiler settings, and rules for activating additional analysis tools.
Important
To use the latest SageMaker Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
PyTorch

import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
TensorFlow

import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.8.0",
    py_version="py37",
    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
MXNet

import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
Note
For MXNet, when configuring the profiler_config parameter, you can only configure for
system monitoring. Profiling framework metrics is not supported for MXNet.
XGBoost

import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",
    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
Note
For XGBoost, when configuring the profiler_config parameter, you can only configure
for system monitoring. Profiling framework metrics is not supported for XGBoost.
Generic estimator

import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
• profiler_config – Configure Debugger to collect system metrics and framework metrics from your training job and save them to your secured S3 bucket URI or local machine. You can set how frequently or sparsely to collect the system metrics. To learn how to configure the profiler_config parameter, see Configure Debugger for Monitoring Resource Utilization (p. 1714) and Configure Debugger for Framework Profiling (p. 1714).
• rules – Configure this parameter to activate SageMaker Debugger built-in rules that you want to run in parallel. Make sure that your training job has access to this S3 bucket. The rules run on processing containers and automatically analyze your training job to find computational and operational performance issues. The ProfilerReport rule is the most integrated rule; it runs all built-in profiling rules and saves the profiling results as a report in your secured S3 bucket. To learn how to configure the rules parameter, see Configure Debugger Built-in Rules (p. 1678).
Note
Debugger securely saves output data in subfolders of your default S3 bucket. For example, the format of the default S3 bucket URI is s3://sagemaker-<region>-<12digit_account_id>/.
See the following topics to find out how to configure the Debugger-specific parameters in detail.
Topics
• Configure Debugger for Monitoring Resource Utilization (p. 1714)
• Configure Debugger for Framework Profiling (p. 1714)
• Updating Debugger System Monitoring and Framework Profiling Configuration while a Training Job
is Running (p. 1718)
• Turn Off Debugger (p. 1718)
The following code example shows how to set up the profiler_config parameter with a system
monitoring time interval of 1000 milliseconds.
profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000
)
To see the progress of system monitoring, see Open the Amazon SageMaker Debugger Insights
Dashboard (p. 1721).
See also Amazon SageMaker Debugger Release Notes: March 16, 2023 (p. 1820).
Note
Before getting started with Debugger framework profiling, verify that the framework used to
build your model is supported by Debugger for framework profiling. For more information, see
Supported Frameworks and Algorithms (p. 1650).
Debugger saves the framework metrics in a default S3 bucket. The format of the default S3
bucket URI is s3://sagemaker-<region>-<12digit_account_id>/<training-job-
name>/profiler-output/.
Start a Training Job with the Default System Monitoring and Framework Profiling
The following example code is the simplest profiler_config parameter setting to start the default
system monitoring and the default framework profiling. The FrameworkProfile class in the following
example code initiates the default framework profiling when a training job starts. Debugger framework
profiling includes the following options: detailed profiling, data loader profiling, and Python profiling.
profiler_config=ProfilerConfig(
framework_profile_params=FrameworkProfile()
)
With this profiler_config parameter configuration, Debugger uses the default settings for monitoring and profiling. Debugger monitors system metrics every 500 milliseconds; profiles the fifth step with the detailed profiling option; the seventh step with the data loader profiling option; and the ninth, tenth, and eleventh steps with the Python profiling option.
To find available profiling configuration options, the default parameter settings, and examples of how to
configure them, see Start a Training Job with the Default System Monitoring and Customized Framework
Profiling with Different Profiling Options (p. 1717) and SageMaker Debugger APIs – FrameworkProfile in
the Amazon SageMaker Python SDK.
If you want to change the system monitoring interval and enable the default framework profiling,
you can specify the system_monitor_interval_millis parameter explicitly with the
framework_profile_params parameter. For example, to monitor every 1000 milliseconds and enable
the default framework profiling, use the following example code.
profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000,
framework_profile_params=FrameworkProfile()
)
For more information about the FrameworkProfile class, see SageMaker Debugger APIs –
FrameworkProfile in the Amazon SageMaker Python SDK.
Start a Training Job with the Default System Monitoring and Customized Framework Profiling
for Target Steps or a Target Time Range
If you want to specify target steps or target time intervals to profile your training job, you need to
specify parameters for the FrameworkProfile class. The following code examples show how to specify
the target ranges for profiling along with system monitoring.
With the following example configuration, Debugger monitors the entire training job every 500
milliseconds (the default monitoring) and profiles a target step range from step 5 to step 15 (for 10
steps).
profiler_config=ProfilerConfig(
framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
)
With the following example configuration, Debugger monitors the entire training job every 1000
milliseconds and profiles a target step range from step 5 to step 15 (for 10 steps).
profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000,
framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
)
With the following example configuration, Debugger monitors the entire training job every 500
milliseconds (the default monitoring) and profiles a target time range from the current Unix time for
600 seconds.
import time
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
profiler_config=ProfilerConfig(
framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()),
duration=600)
)
With the following example configuration, Debugger monitors the entire training job every 1000
milliseconds and profiles a target time range from the current Unix time for 600 seconds.
import time
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000,
framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()),
duration=600)
)
The framework profiling is performed for all of the profiling options at the target step or time range.
To find more information about available profiling options, see SageMaker Debugger APIs –
FrameworkProfile in the Amazon SageMaker Python SDK.
The next section shows you how to script the available profiling options.
Start a Training Job with the Default System Monitoring and Customized Framework Profiling
with Different Profiling Options
You can use the following profiling configuration classes to manage the framework profiling options:
• DetailedProfilingConfig – Specify a target step or time range to profile framework operations using
the native framework profilers (TensorFlow profiler and PyTorch profiler). For example, if using
TensorFlow, the Debugger hooks enable the TensorFlow profiler to collect TensorFlow-specific
framework metrics. Detailed profiling enables you to profile all framework operators at a pre-step
(before the first step), within steps, and between steps of a training job.
Note
Detailed profiling might significantly increase GPU memory consumption. We do not
recommend enabling detailed profiling for more than a couple of steps.
• DataloaderProfilingConfig – Specify a target step or time range to profile deep learning framework
data loader processes. Debugger collects every data loader event of the frameworks.
Note
Data loader profiling might lower the training performance while collecting information from
data loaders. We don't recommend enabling data loader profiling for more than a couple of
steps.
Debugger is preconfigured to annotate data loader processes only for the AWS deep learning
containers. Debugger cannot profile data loader processes from any other custom or external
training containers.
• PythonProfilingConfig – Specify a target step or time range to profile Python functions. You can also
choose between two Python profilers: cProfile and Pyinstrument.
• cProfile – The standard Python profiler. cProfile collects information for every Python operator called during training. With cProfile, Debugger saves cumulative time and annotation for each function call, providing complete detail about Python functions. In deep learning, for example, the most frequently called functions might be the convolutional filters and backward pass operators, and cProfile profiles every single one of them. For the cProfile option, you can further select a timer option: total time, CPU time, and off-CPU time. While you can profile every function call executing on processors (both CPU and GPU) in CPU time, you can also identify I/O or network bottlenecks with the off-CPU time option. The default is total time, with which Debugger profiles both CPU and off-CPU time. With cProfile, you are able to drill down to every single function when analyzing the profile data.
• Pyinstrument – Pyinstrument is a low-overhead Python profiler that works based on sampling. With the Pyinstrument option, Debugger samples profiling events every millisecond. Because Pyinstrument measures elapsed wall-clock time instead of CPU time, the Pyinstrument option can be a better choice than the cProfile option for reducing profiling noise (filtering out irrelevant function calls that are cumulatively fast) and capturing operators that are actually compute intensive (cumulatively slow) for training your model. With Pyinstrument, you are able to see a tree of function calls and better understand the structure and root cause of the slowness.
Note
Enabling Python profiling might slow down the overall training time. cProfile profiles the
most frequently called Python operators at every call, so the processing time on profiling
increases with respect to the number of calls. For Pyinstrument, the cumulative profiling time
increases with respect to time because of its sampling mechanism.
The following example configuration shows the full structure when you use the different profiling
options with specified values.
import time
from sagemaker.debugger import (ProfilerConfig,
FrameworkProfile,
DetailedProfilingConfig,
DataloaderProfilingConfig,
PythonProfilingConfig,
PythonProfiler, cProfileTimer)
profiler_config=ProfilerConfig(
system_monitor_interval_millis=500,
framework_profile_params=FrameworkProfile(
detailed_profiling_config=DetailedProfilingConfig(
start_step=5,
num_steps=1
),
dataloader_profiling_config=DataloaderProfilingConfig(
start_step=7,
num_steps=1
),
python_profiling_config=PythonProfilingConfig(
start_step=9,
num_steps=1,
python_profiler=PythonProfiler.CPROFILE,
cprofile_timer=cProfileTimer.TOTAL_TIME
)
)
)
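As a point of comparison, a minimal variation of the python_profiling_config above (a sketch, not from the guide) selects the Pyinstrument sampling profiler instead of cProfile:

python_profiling_config=PythonProfilingConfig(
    start_step=9,
    num_steps=1,
    # PYINSTRUMENT selects the sampling profiler; the cprofile_timer option
    # does not apply in this mode.
    python_profiler=PythonProfiler.PYINSTRUMENT
)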
• To activate Debugger system monitoring for a running training job and receive a Debugger profiling
report, use the following:
estimator.enable_default_profiling()
When you use the enable_default_profiling method, Debugger initiates the default system monitoring and the ProfilerReport built-in rule, which generates a comprehensive profiling report at the end of the training job. This method can be called only if the current training job is running without both Debugger monitoring and profiling.
For more information, see estimator.enable_default_profiling in the Amazon SageMaker Python SDK.
• To update system monitoring configuration, use the following:
estimator.update_profiler(
system_monitor_interval_millis=500
)
For more information, see estimator.update_profiler in the Amazon SageMaker Python SDK.
Turn Off Debugger
If you want to completely turn off Debugger, do one of the following:
• To turn off profiling, add the disable_profiler parameter to your estimator and set it to True.
Warning
If you disable it, you won't be able to view the comprehensive Studio Debugger insights
dashboard and the autogenerated profiling report.
estimator=Estimator(
    ...
    disable_profiler=True,
    debugger_hook_config=False
)
For more information about the Debugger-specific parameters, see SageMaker Estimator in the
Amazon SageMaker Python SDK.
• While a training job is running, do the following:
To disable both monitoring and profiling while your training job is running, use the following
estimator classmethod:
estimator.disable_profiling()
To disable framework profiling only and keep system monitoring, use the update_profiler method:
estimator.update_profiler(disable_framework_metrics=True)
For more information about the estimator extension methods, see the estimator.disable_profiling and
estimator.update_profiler classmethods in the Amazon SageMaker Python SDK documentation.
In the following topics, learn how to use the Debugger built-in rules.
Topics
• Use SageMaker Debugger Built-in Profiler Rules with the Default Parameter Settings (p. 1720)
• Use Debugger Built-in Profiler Rules with Custom Parameter Values (p. 1720)
Use SageMaker Debugger Built-in Profiler Rules with the Default Parameter
Settings
To add SageMaker Debugger built-in rules in your estimator, you need to configure a rules list object.
The following example code shows the basic structure of listing the SageMaker Debugger built-in rules.
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_n()),
    ... # You can also append more debugging rules in the Rule.sagemaker(rule_configs.*()) format.
]

estimator=Estimator(
    ...
    rules=rules
)
For a complete list of available built-in rules, see List of Debugger Built-in Rules (p. 1748).
To use the profiling rules and inspect the computational performance and progress of your training job, add the ProfilerReport rule of SageMaker Debugger. This rule activates all built-in rules under the Debugger ProfilerRule family. Furthermore, this rule generates an aggregated profiling report. For more information, see Profiling Report Generated Using SageMaker Debugger. You can use the following code to add the profiling report rule to your training estimator.
the following code to add the profiling report rule to your training estimator.
rules=[
ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]
When you start the training job with the ProfilerReport rule, Debugger collects resource utilization data every 500 milliseconds. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. If the rules detect training anomalies, the rule evaluation status changes to IssueFound. You can set up automated actions, such as sending notifications about training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see Action on Amazon SageMaker Debugger Rules (p. 1698).
Use Debugger Built-in Profiler Rules with Custom Parameter Values
Use the following configuration template to customize parameter values of the built-in rules. By changing the rule parameters, you can adjust how sensitively each rule is initiated.
• The base_config argument is where you call the built-in rule methods.
• The rule_parameters argument is to adjust the default key values of the built-in rules listed in List
of Debugger Built-in Rules (p. 1748).
For more information about the Debugger rule class, methods, and parameters, see SageMaker
Debugger Rule class in the Amazon SageMaker Python SDK.
rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInProfilerRuleName(),
        rule_parameters={
            "key": "value"
        }
    )
]
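For instance, a concrete version of this template (a sketch; check the built-in rules list for the exact parameter keys and defaults) might tune the CPUBottleneck rule:

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.CPUBottleneck(),
        rule_parameters={
            # Flag the job when CPU usage stays above this percentage;
            # illustrative value, see the built-in rules list for the default.
            "cpu_threshold": "90"
        }
    )
]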
The parameter descriptions and value customization examples are provided for each rule at List of
Debugger Built-in Rules (p. 1748).
For a low-level JSON configuration of the Debugger built-in rules using the CreateTrainingJob API,
see Configure Debugger Using Amazon SageMaker API (p. 1799).
Topics
• Open the Amazon SageMaker Debugger Insights Dashboard (p. 1721)
• Amazon SageMaker Debugger Insights Dashboard Controller (p. 1722)
• Amazon SageMaker Debugger Insights Dashboard (p. 1724)
• Shut Down the Amazon SageMaker Debugger Insights Instance (p. 1728)
When you open the SageMaker Debugger Insights dashboard, Studio runs the dashboard on a dedicated ml.m5.4xlarge instance, and you accrue charges for the ml.m5.4xlarge instance usage. For information about pricing, see the Amazon SageMaker Pricing page.
Important
When you are done using the SageMaker Debugger Insights dashboard, you must shut down the
ml.m5.4xlarge instance to avoid accruing charges. For instructions on how to shut down the
instance, see Shut Down the Amazon SageMaker Debugger Insights Instance (p. 1728).
1. On the Studio Home page, choose Experiments in the left navigation pane.
2. Search your training job in the Experiments page. If your training job is set up with an Experiments
run, the job should appear in the Experiments tab; if you didn't set up an Experiments run, the job
should appear in the Unassigned runs tab.
3. Choose (click) the link of the training job name to see the job details.
4. Under the OVERVIEW menu, choose Debugger. This should show the following two sections.
• In the Debugger rules section, you can browse the status of the Debugger built-in rules associated
with the training job.
• In the Debugger insights section, you can find links to open SageMaker Debugger Insights on the
dashboard.
5. In the SageMaker Debugger Insights section, choose the link of the training job name to open the
SageMaker Debugger Insights dashboard. This opens a Debug [your-training-job-name] window. In
this window, Debugger provides an overview of the computational performance of your training job
on Amazon EC2 instances and helps you identify issues in compute resource utilization.
You can also download an aggregated profiling report by adding the built-in ProfilerReport rule of
SageMaker Debugger. For more information, see Configure Built-in Profiler Rules and Profiling Report
Generated Using SageMaker Debugger.
Using the Debugger controller located at the upper-left corner of the Insights dashboard, you can refresh
the dashboard, configure or update Debugger settings for monitoring system metrics, stop a training job,
and download a Debugger profiling report.
• If you want to manually refresh the dashboard, choose the refresh button (the round arrow at the
upper-left corner) as shown in the preceding screenshot.
• The Monitoring toggle button is on by default for any SageMaker training job initiated using the
SageMaker Python SDK. If not activated, you can use the toggle button to start monitoring. During
monitoring, Debugger only collects resource utilization metrics to detect computational problems
such as CPU bottlenecks and GPU underutilization. For a complete list of resource utilization problems
that Debugger monitors, see Debugger built-in rules for profiling hardware system resource utilization
(system metrics) (p. 1749).
• The Configure monitoring button opens a pop-up window that you can use to set or update the data
collection frequency and the S3 path to save the data.
Note
If you choose one of the lower time intervals, you increase the granularity of resource
utilization metrics, so you can capture spikes and anomalies with a higher time resolution.
However, the higher the resolution, the larger the volume of system metrics to process. This might
introduce additional overhead and impact the overall training and processing time.
• Using the Stop training button, you can stop the training job when you find anomalies in resource
utilization.
• Using the Download report button, you can download an aggregated profiling report by using the
built-in ProfilerReport rule of SageMaker Debugger. The button is activated when you add the built-
in ProfilerReport rule to the estimator. For more information, see Configure Built-in Profiler Rules and
Profiling Report Generated Using SageMaker Debugger.
Amazon SageMaker Debugger Insights Dashboard
Topics
• System Metrics (p. 1724)
• Rules (p. 1727)
System Metrics
In the System Metrics tab, you can use the summary table and time series plots to understand resource
utilization.
This summary table shows the statistics of compute resource utilization metrics of all nodes (denoted
as algo-n). The resource utilization metrics include the total CPU utilization, the total GPU utilization,
the total CPU memory utilization, the total GPU memory utilization, the total I/O wait time, and the
total network in bytes. The table shows the minimum and the maximum values, and p99, p90, and p50
percentiles.
Use the time series graphs to see more details of resource utilization and to identify the time intervals
in which an instance shows undesired utilization, such as low GPU utilization or CPU bottlenecks, that
waste the expensive instance.
The UI controller for adjusting the time series graphs provides the following elements:
• algo-1: Use this dropdown menu to choose the node that you want to look into.
• Zoom In: Use this button to zoom in the time series graphs and view shorter time intervals.
• Zoom Out: Use this button to zoom out the time series graphs and view wider time intervals.
• Pan Left: Move the time series graphs to an earlier time interval.
• Pan Right: Move the time series graphs to a later time interval.
• Fix Timeframe: Use this check box to reset the time series graphs to the whole view, from the first
data point to the last data point.
The first two graphs show CPU utilization and I/O wait time over time. By default, the graphs show the
average CPU utilization rate and I/O wait time spent on the CPU cores. You can select one or more
CPU cores by selecting their labels to graph them on a single chart and compare utilization across cores.
You can drag and zoom in and out to take a closer look at specific time intervals.
The following graphs show GPU utilization and GPU memory utilization over time. By default, the graphs
show the mean utilization rate over time. You can select the GPU core labels to see the utilization rate
of each core. Taking the mean of utilization rate over the total number of GPU cores shows the mean
utilization of the entire hardware system resource. By looking at the mean utilization rate, you can check
the overall system resource usage of an Amazon EC2 instance. The following figure shows an example
training job on an ml.p3.16xlarge instance with 8 GPU cores. You can monitor if the training job is
well distributed, fully utilizing all GPUs.
The following heatmap shows an example of the entire system utilization of an ml.p3.16xlarge
instance over time, projected onto the two-dimensional plot. Every CPU and GPU core is listed in the
vertical axis, and the utilization is recorded over time with a color scheme, where the bright colors
represent low utilization and the darker colors represent high utilization. See the labeled color bar on the
right side of the plot to find out which color level corresponds to which utilization rate.
Rules
Use the Rules tab to find a summary of the profiling rule analysis on your training job. If a profiling
rule is activated with the training job, its name appears in solid white text. Inactive rules are dimmed
in gray text. To activate these rules, follow the instructions at the section called “Configure
Built-in Profiler Rules” (p. 1719).
Shut Down the Amazon SageMaker Debugger Insights Instance
1. In Studio, select the Running Instances and Kernels icon.
2. Under the RUNNING APPS list, look for the sagemaker-debugger-1.0 app. Select the shutdown icon
next to the app. The SageMaker Debugger Insights dashboards run on an ml.m5.4xlarge
instance. This instance also disappears from the RUNNING INSTANCES list when you shut down the
sagemaker-debugger-1.0 app.
Important
In the report, plots and recommendations are provided for informational purposes and
are not definitive. You are responsible for making your own independent assessment of the
information.
1. Check the current job's default S3 output base URI.
estimator.output_path
2. Check the current job name.
estimator.latest_training_job.job_name
3. The Debugger profiling report is stored under <default-s3-output-base-uri>/<training-job-name>/rule-output.
4. To check if the report is generated, list directories and files recursively under the
rule_output_path using aws s3 ls with the --recursive option.
This should return a complete list of files under an autogenerated folder named
ProfilerReport-1234567890. The folder name is a combination of strings:
ProfilerReport and a unique 10-digit tag based on the Unix timestamp of when the
ProfilerReport rule was initiated.
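A sketch of that command in a notebook cell, assuming rule_output_path is composed from the estimator values checked in the previous steps:

rule_output_path = estimator.output_path + "/" + estimator.latest_training_job.job_name + "/rule-output"
# List the rule output files recursively; a ProfilerReport-* folder should appear.
! aws s3 ls {rule_output_path} --recursive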
5. Download the rule output files recursively to your current working directory, for example by running
aws s3 cp with the --recursive option.
Tip
If using a Jupyter notebook server, run !pwd to double-check the current working
directory.
6. Under the /ProfilerReport-1234567890/profiler-output directory, open profiler-
report.html. If using JupyterLab, choose Trust HTML to see the autogenerated Debugger
profiling report.
7. Open the profiler-report.ipynb file to explore how the report is generated. You can also
customize and extend the profiling report using the Jupyter notebook file.
1. Sign in to the AWS Management Console and open the Amazon S3 console at https://
console.aws.amazon.com/s3/.
2. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base
S3 bucket name should be in the following format: sagemaker-<region>-111122223333.
Look up the base S3 bucket through the Find bucket by name field.
3. In the base S3 bucket, look up the training job name by specifying your job name prefix into the
Find objects by prefix input field. Choose the training job name.
4. In the training job's S3 bucket, there must be three subfolders for training data collected by
Debugger: debug-output/, profiler-output/, and rule-output/. Choose rule-output/.
Note
If you started your training job without configuring the Debugger-specific parameters, Debugger
generates the report based only on the system monitoring rules because the Debugger
parameters are not configured to save framework metrics. To enable framework metrics
profiling and receive an extended Debugger profiling report, configure the profiler_config
parameter when constructing or updating SageMaker estimators.
To learn how to configure the profiler_config parameter before starting a training job, see
Configure Debugger for Framework Profiling (p. 1714).
To update the current training job and enable framework metrics profiling, see Update
Debugger Framework Profiling Configuration (p. 1718).
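As a sketch of what such a configuration can look like (the script name, role, and instance settings are placeholders, and framework profiling is deprecated for TensorFlow 2.11+ and PyTorch 2.0+ as noted later in this guide):

from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,  # collect system metrics every 0.5 seconds
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
)

estimator = PyTorch(
    entry_point="train.py",                # placeholder training script
    role="YourSageMakerExecutionRoleArn",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",
    py_version="py38",
    profiler_config=profiler_config,
)
estimator.fit(wait=False)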
This section walks you through the Debugger profiling report section by section. The profiling report is
generated based on the built-in rules for monitoring and profiling. The report shows result plots only for
the rules that found issues.
Important
In the report, plots and recommendations are provided for informational purposes and
are not definitive. You are responsible for making your own independent assessment of the
information.
Topics
• Training Job Summary (p. 1733)
• System Usage Statistics (p. 1734)
• Framework metrics summary (p. 1735)
• Rules Summary (p. 1736)
• Analyzing the Training Loop – Step Durations (p. 1737)
• GPU Utilization Analysis (p. 1737)
• Batch Size (p. 1737)
• CPU Bottlenecks (p. 1738)
• I/O Bottlenecks (p. 1739)
• LoadBalancing in Multi-GPU Training (p. 1739)
• GPU Memory Analysis (p. 1739)
Training Job Summary
At the beginning of the report, Debugger provides a summary of your training job. In this section, you
can review the durations and timestamps of the different training phases.
System Usage Statistics
The following list describes the columns of the system usage statistics table:
• node – Lists the names of nodes. If using distributed training on multiple nodes (multiple EC2 instances),
the node names are in the format algo-n.
• metric – The system metrics collected by Debugger: CPU, GPU, CPU memory, GPU memory, I/O, and
Network metrics.
• unit – The unit of the system metrics.
• max – The maximum value of each system metric.
• p99 – The 99th percentile of each system utilization.
• p95 – The 95th percentile of each system utilization.
• p50 – The 50th percentile (median) of each system utilization.
• min – The minimum value of each system metric.
Framework metrics summary
In this section, the following pie charts show the breakdown of framework operations on CPUs and
GPUs.
Each of the pie charts analyzes the collected framework metrics in various aspects as follows:
• Ratio between TRAIN/EVAL phase and others – Shows the ratio between time durations spent on
different training phases.
• Ratio between forward and backward pass – Shows the ratio between time durations spent on
forward and backward pass in the training loop.
• Ratio between CPU/GPU operators – Shows the ratio between time spent on operators running on
CPU or GPU, such as convolutional operators.
• General metrics recorded in framework – Shows the ratio between time spent on major framework
metrics, such as data loading, forward and backward pass.
This section provides detailed information about the CPU operators. The table shows the percentage of the
time and the absolute cumulative time spent on the most frequently called CPU operators.
This section provides detailed information about the GPU operators. The table shows the percentage of the
time and the absolute cumulative time spent on the most frequently called GPU operators.
Rules Summary
In this section, Debugger aggregates all of the rule evaluation results, analysis, rule descriptions, and
suggestions.
Analyzing the Training Loop – Step Durations
In this section, you can find detailed statistics of step durations on each GPU core of each node.
Debugger evaluates the mean, maximum, p99, p95, p50, and minimum values of step durations, and
evaluates step outliers. The following histogram shows the step durations captured on different worker
nodes and GPUs. You can enable or disable the histogram of each worker by choosing the legends on the
right side. You can check if there is a particular GPU that's causing step duration outliers.
GPU Utilization Analysis
This section shows the detailed statistics about GPU core utilization based on the LowGPUUtilization rule.
It also summarizes the GPU utilization statistics (mean, p95, and p5) to determine if the training job is
underutilizing GPUs.
Batch Size
This section shows the detailed statistics of total CPU utilization, individual GPU utilizations, and GPU
memory footprints. The BatchSize rule determines if you need to change the batch size to better utilize
the GPUs. You can check whether the batch size is too small, resulting in underutilization, or too large,
causing overutilization and out-of-memory issues. In the plot, the boxes show the p25 and p75 percentile
ranges (filled with dark purple and bright yellow, respectively) around the median (p50), and the error bars
show the 5th percentile for the lower bound and the 95th percentile for the upper bound.
CPU Bottlenecks
In this section, you can drill down into the CPU bottlenecks that the CPUBottleneck rule detected from
your training job. The rule checks if the CPU utilization is above cpu_threshold (90% by default) and
also if the GPU utilization is below gpu_threshold (10% by default).
• Low GPU usage caused by CPU bottlenecks – Shows the ratio of data points between the ones with
GPU utilization above and below the threshold and the ones that match the CPU bottleneck criteria.
• Ratio between TRAIN/EVAL phase and others – Shows the ratio between time durations spent on
different training phases.
• Ratio between forward and backward pass – Shows the ratio between time durations spent on
forward and backward pass in the training loop.
• Ratio between CPU/GPU operators – Shows the ratio between time durations spent on GPUs and
CPUs by Python operators, such as data loader processes and forward and backward pass operators.
• General metrics recorded in framework – Shows major framework metrics and the ratio between
time durations spent on the metrics.
I/O Bottlenecks
In this section, you can find a summary of I/O bottlenecks. The rule evaluates the I/O wait time and GPU
utilization rates and monitors if the time spent on the I/O requests exceeds a threshold percent of the
total training time. It might indicate I/O bottlenecks where GPUs are waiting for data to arrive from
storage.
LoadBalancing in Multi-GPU Training
In this section, you can identify workload balancing issues across GPUs.
GPU Memory Analysis
In this section, you can analyze the GPU memory utilization collected by the GPUMemoryIncrease rule.
In the plot, the boxes show the p25 and p75 percentile ranges (filled with dark purple and bright yellow
respectively) from the median (p50), and the error bars show the 5th percentile for the lower bound and
95th percentile for the upper bound.
To install the library and use the SMDebug analysis tools (in a JupyterLab notebook or an iPython
kernel)
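A minimal sketch of the installation, following the same pip upgrade pattern used elsewhere in this guide:

import sys
# Install or upgrade the SMDebug client library in the current kernel.
!{sys.executable} -m pip install -U smdebug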
The following topics walk you through how to use the SMDebug tools to visualize and analyze the
training data collected by Debugger.
To set up a TrainingJob object and retrieve profiling event files of a training job
Tip
You need to specify the training_job_name and region parameters to locate the training job.
There are two ways to specify the training job information:
• Use the SageMaker Python SDK while the estimator is still attached to the training job.
• Specify the training job name and AWS Region directly as strings.
import sagemaker

# Option 1: while the estimator is still attached to the training job
training_job_name = estimator.latest_training_job.job_name
region = sagemaker.Session().boto_region_name

# Option 2: specify the training job name and AWS Region directly
training_job_name = "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
region = "us-west-2"
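With the job name and Region set, you can construct the TrainingJob analysis object that the following steps use as tj. This is a sketch; the import path assumes the smdebug client library:

from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

# Create the analysis object that retrieves the profiling event files.
tj = TrainingJob(training_job_name, region)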
Note
By default, SageMaker Debugger collects system metrics to monitor hardware resource
utilization and system bottlenecks. Running the following functions, you might receive error
messages regarding unavailability of framework metrics. To retrieve framework profiling data
and gain insights into framework operations, you must enable framework profiling.
• If you use the SageMaker Python SDK to manipulate your training job request, pass the
framework_profile_params to the profiler_config argument of your estimator. To
learn more, see Configure SageMaker Debugger Framework Profiling.
• If you use Studio, turn on profiling using the Profiling toggle button in the Debugger insights
dashboard. To learn more, see SageMaker Debugger Insights Dashboard Controller.
To retrieve a description of the training job and the S3 bucket URI where the metric data
are saved
tj.describe_training_job()
tj.get_config_and_profiler_s3_output_path()
To check if the system and framework metrics are available from the S3 URI
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()
To create system and framework reader objects after the metric data become available
system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()
The reader objects have an extended method, refresh_event_file_list(), to retrieve the latest
training event files.
system_metrics_reader.refresh_event_file_list()
framework_metrics_reader.refresh_event_file_list()
The following visualization classes accept optional select_dimensions and select_events
parameters to filter the metrics to plot. For example, if you specify select_dimensions=["GPU"],
the plot methods filter the metrics that include the "GPU" keyword. If you specify
select_events=["total"], the plot methods filter the metrics that include the "total" event
tags at the end of the metric names. If you enable these parameters and give the keyword strings,
the visualization classes return the charts with filtered metrics.
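The visualization classes used in the following examples can be imported from the SMDebug notebook utilities. The module paths below are a sketch and assume a recent version of the smdebug client library:

from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram
from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart
from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap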
metrics_histogram = MetricsHistogram(system_metrics_reader)
metrics_histogram.plot(
    starttime=0,
    endtime=system_metrics_reader.get_timestamp_of_latest_available_file(),
    select_dimensions=["CPU", "GPU", "I/O"],  # optional
    select_events=["total"]  # optional
)

view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)

step_histogram = StepHistogram(framework_metrics_reader)
step_histogram.plot(
    starttime=step_histogram.last_timestamp - 5 * 1000 * 1000,
    endtime=step_histogram.last_timestamp,
    show_workers=True
)

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader,
    select_dimensions=["CPU", "GPU", "I/O"],  # optional
    select_events=["total"]  # optional
)
view_timeline_charts.plot_detailed_profiler_data([700,710])

view_heatmap = Heatmap(
    system_metrics_reader,
    framework_metrics_reader,
    select_dimensions=["CPU", "GPU", "I/O"],  # optional
    select_events=["total"],  # optional
    plot_height=450
)
Access the Profiling Data Using the Pandas Data Parsing Tool
The following PandasFrame class provides tools to convert the collected profiling data to a Pandas data
frame.
The PandasFrame class takes the tj object's S3 bucket output path, and its methods
get_all_system_metrics() and get_all_framework_metrics() return system metrics and
framework metrics as Pandas data frames.
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame

pf = PandasFrame(tj.profiler_s3_output_path)
system_metrics_df = pf.get_all_system_metrics()
framework_metrics_df = pf.get_all_framework_metrics(
    selected_framework_metrics=[
        'Step:ModeKeys.TRAIN',
        'Step:ModeKeys.GLOBAL'
    ]
)
Training Modes and Phases for Python Profiling
To profile specific intervals during training and partition statistics for each of those intervals,
Debugger provides tools to set modes and phases.
• PythonProfileModes.TRAIN – Use if you want to profile the target steps in the training phase. This
mode option is available only for TensorFlow.
• PythonProfileModes.EVAL – Use if you want to profile the target steps in the evaluation phase.
This mode option is available only for TensorFlow.
• PythonProfileModes.PREDICT – Use if you want to profile the target steps in the prediction phase.
This mode option is available only for TensorFlow.
• PythonProfileModes.GLOBAL – Use if you want to profile the target steps in the global phase,
which includes the previous three phases. This mode option is available only for PyTorch.
• PythonProfileModes.PRE_STEP_ZERO – Use if you want to profile the target steps in the
initialization stage, before the first training step of the first epoch starts. This phase includes the initial
job submission, uploading the training scripts to EC2 instances, preparing the EC2 instances, and
downloading input data. This mode option is available for both TensorFlow and PyTorch.
• PythonProfileModes.POST_HOOK_CLOSE – Use if you want to profile the target steps in the
finalization stage, after the training job is done and the Debugger hook is closed. This phase includes
profiling data while the training jobs are finalized and completed. This mode option is available for both
TensorFlow and PyTorch.
• cProfile – The standard Python profiler. cProfile collects framework metrics on CPU time for every
function called while profiling was enabled.
• Pyinstrument – A low-overhead Python profiler that samples profiling events every millisecond.
To learn more about the Python profiling options and what's collected, see Start a Training Job
with the Default System Monitoring and Customized Framework Profiling with Different Profiling
Options (p. 1717).
To set Python profiling objects for analysis, use the cProfileAnalysis or PyinstrumentAnalysis classes as
shown in the following example code. It shows how to set a cProfileAnalysis object, and if you want
to use PyinstrumentAnalysis, replace the class name.
from smdebug.profiler.analysis.python_profile_analysis import cProfileAnalysis, PyinstrumentAnalysis

python_analysis = cProfileAnalysis(
    local_profile_dir=tf_python_stats_dir,  # a local directory for the downloaded stats files
    s3_path=tj.profiler_s3_output_path
)
The following methods are available for the cProfileAnalysis and PyinstrumentAnalysis classes
to fetch the Python profiling stats data:
• python_analysis.fetch_python_profile_stats_by_time(start_time_since_epoch_in_secs,
end_time_since_epoch_in_secs) – Takes in a start time and end time, and returns the function
stats of step stats whose start or end times overlap with the provided interval.
• python_analysis.fetch_python_profile_stats_by_step(start_step, end_step,
mode, start_phase, end_phase) – Takes in a start step and end step and returns the function
stats of all step stats whose profiled step satisfies start_step <= step < end_step.
• start_step and end_step (str) – Specify the start step and end step to fetch the Python profiling
stats data.
• mode (str) – Specify the mode of training job using the PythonProfileModes enumerator class.
The default is PythonProfileModes.TRAIN. Available options are provided in the Training Modes
and Phases for Python Profiling (p. 1743) section.
• start_phase (str) – Specify the start phase in the target step(s) using the StepPhase enumerator
class. This parameter enables profiling between different phases of training. The default is
StepPhase.STEP_START. Available options are provided in the Training Modes and Phases for
Python Profiling (p. 1743) section.
• end_phase (str) – Specify the end phase in the target step(s) using the StepPhase enumerator
class. This parameter sets up the end phase of training. Available options are the same as the ones for
the start_phase parameter. The default is StepPhase.STEP_END. Available options are provided
in the Training Modes and Phases for Python Profiling (p. 1743) section.
• python_analysis.fetch_profile_stats_between_modes(start_mode, end_mode) –
Fetches stats from the Python profiling between the start and end modes.
• python_analysis.fetch_pre_step_zero_profile_stats() – Fetches the stats from the
Python profiling until step 0.
• python_analysis.fetch_post_hook_close_profile_stats() – Fetches stats from the Python
profiling after the hook is closed.
• python_analysis.list_profile_stats() – Returns a DataFrame of the Python profiling stats.
Each row holds the metadata for each instance of profiling and the corresponding stats file (one per
step).
• python_analysis.list_available_node_ids() – Returns a list of the available node IDs for the
Python profiling stats.
The MergedTimeline class provides tools to integrate and correlate different profiling information
in a single timeline. After Debugger captures profiling data and annotations from different phases of a
training job, JSON files of trace events are saved in a default tracefolder directory.
• For annotations in the Python layers, the trace files are saved in *pythontimeline.json.
• For annotations in the TensorFlow C++ layers, the trace files are saved in *model_timeline.json.
• The TensorFlow profiler saves events in a *trace.json.gz file.
Tip
If you want to list all of the JSON trace files, use the following AWS CLI command:
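A sketch of such a command in a notebook cell (the framework/ prefix is an assumption about where the trace files land under the profiler output path):

! aws s3 ls {tj.profiler_s3_output_path}/framework/ --recursive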
As shown in the following animated screenshot, putting and aligning the trace events captured from
the different profiling sources in a single plot can provide an overview of the entire events occurring in
different phases of the training job.
Tip
To interact with the merged timeline on the tracing app using a keyboard, use the W key for
zooming in, the A key for shifting to the left, the S key for zooming out, and the D key for
shifting to the right.
The multiple event trace JSON files can be merged into one trace event JSON file
using the following MergedTimeline API operation and class method from the
smdebug.profiler.analysis.utils.merge_timelines module.
• path (str) – Specify a root folder (/profiler-output) that contains system and framework profiling
trace files. You can locate the profiler-output using the SageMaker estimator classmethod or
the TrainingJob object. For example, estimator.latest_job_profiler_artifacts_path() or
tj.profiler_s3_output_path.
• file_suffix_filter (list) – Specify a list of file suffix filters to merge timelines. Available suffix
filters are ["model_timeline.json", "pythontimeline.json", "trace.json.gz"]. If this
parameter is not manually specified, all of the trace files are merged by default.
• output_directory (str) – Specify a path to save the merged timeline JSON file. The default is to the
directory specified for the path parameter.
The merge_timeline() classmethod takes the following parameters to execute the merging process:
• start (int) – Specify start time (in microseconds and in Unix time format) or start step to merge
timelines.
• end (int) – Specify end time (in microseconds and in Unix time format) or end step to merge timelines.
• unit (str) – Choose between "time" and "step". The default is "time".
Using the following example code, execute the merge_timeline() method and download the merged
JSON file.
• Merge timeline with the "time" unit option. The following example code merges all available trace
files between the Unix start time (the absolute zero Unix time) and the current Unix time, which means
that you can merge the timelines for the entire training duration.
import time
from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
from smdebug.profiler.profiler_constants import CONVERT_TO_MICROSECS
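A sketch of the invocation under these imports, assuming training_job_dir is a local directory that contains the downloaded profiler trace files:

combined_timeline = MergedTimeline(path=training_job_dir, output_directory="./")
# Merge from Unix time zero to now, covering the entire training duration.
combined_timeline.merge_timeline(0, int(time.time() * CONVERT_TO_MICROSECS))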
• Merge timeline with the "step" unit option. The following example code merges all available
timelines between step 3 and step 9.
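A sketch, reusing the same MergedTimeline object pattern with the unit parameter switched to "step":

combined_timeline = MergedTimeline(path=training_job_dir, output_directory="./")
# Merge the timelines captured between step 3 and step 9.
combined_timeline.merge_timeline(3, 9, unit="step")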
Open the Chrome tracing app at chrome://tracing on a Chrome browser, and open the JSON file. You can
explore the output to plot the merged timeline.
To use the PyTorch data loader profiling analysis tool, import the following PT_dataloader_analysis
class:
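A sketch of the import; the module path assumes the smdebug client library layout:

from smdebug.profiler.analysis.utils.pytorch_dataloader_analysis import PT_dataloader_analysis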
Pass the profiling data retrieved as a Pandas frame data object in the Access the Profiling Data Using the
Pandas Data Parsing Tool (p. 1743) section:
pt_analysis = PT_dataloader_analysis(pf)
The SMDebug S3SystemMetricsReader class reads the system metrics from the S3 bucket specified in
the s3_trial_path parameter.
• pt_analysis.analyze_dataloaderIter_initialization()
The analysis outputs the median and maximum duration for these initializations. If there are outliers
(that is, a duration greater than 2 * median), the function prints the start and end times for those
durations. These can be used to inspect system metrics during those time intervals.
The following list shows what analysis is available from this class method:
• The type of data loader iterators that were initialized.
• The number of workers per iterator.
• Whether the iterator was initialized with or without pin_memory.
• The number of times the iterators were initialized during training.
• pt_analysis.analyze_dataloaderWorkers()
The following list shows what analysis is available from this class method:
• The number of worker processes that were spun off during the entire training.
• Median and maximum duration for the worker processes.
• Start and end time for the worker processes that are outliers.
• pt_analysis.analyze_dataloader_getnext()
The following list shows what analysis is available from this class method:
• Number of GetNext calls made during the training.
• Median and maximum duration in microseconds for GetNext calls.
• Start time, end time, duration, and worker ID for the outlier GetNext calls.
• pt_analysis.analyze_batchtime(start_timestamp, end_timestamp,
select_events=[".*"], select_dimensions=[".*"])
Debugger collects the start and end times of all the GetNext calls. You can find the amount of time
spent by the training script on one batch of data. Within the specified time window, you can identify
the calls that are not directly contributing to the training. These calls can be from the following
operations: computing the accuracy, adding the losses for debugging or logging purposes, and printing
the debugging information. Operations like these can be compute intensive or time consuming.
We can identify such operations by correlating the Python profiler, system metrics, and framework
metrics.
The following list shows what analysis is available from this class method:
• Profile time spent on each data batch, BatchTime_in_seconds, by finding the difference between
start times of current and subsequent GetNext calls.
• Find the outliers in BatchTime_in_seconds and start and end time for those outliers.
• Obtain the system and framework metrics during those BatchTime_in_seconds timestamps. This
indicates where the time was spent.
• pt_analysis.plot_the_window()
Plots timeline charts between a start timestamp and an end timestamp.
List of Debugger Built-in Rules
You can deploy the Debugger built-in rules using the SageMaker Python SDK or the low-level SageMaker
API operations. There's no additional cost for using the built-in rules. For more information about billing,
see the Amazon SageMaker Pricing page.
Note
The maximum number of built-in rules that you can attach to a training job is 20 for
ProfilerRule and 20 for Rule. SageMaker Debugger fully manages the built-in rules and
analyzes your training job synchronously.
Important
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the
SMDebug client library. In your iPython kernel, Jupyter notebook, or JupyterLab environment,
run the following code to install the latest versions of the libraries and restart the kernel.
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
Debugger ProfilerRule
The following rules are the Debugger built-in rules that are callable using the
ProfilerRule.sagemaker classmethod.
• Profiling Report for any SageMaker training job – ProfilerReport (p. 1751)
Debugger built-in rules for profiling hardware system resource utilization (system metrics)
Warning
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11
and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and
SDKs.
With the deprecation, SageMaker Debugger also discontinues support for the three
ProfilerRules for framework profiling. See also Amazon SageMaker Debugger Release Notes:
March 16, 2023 (p. 1820).
Debugger Rule
The following rules are the Debugger built-in rules that are callable using the Rule.sagemaker
classmethod.
Debugger built-in rules for debugging model training data (output tensors)
To use the built-in rules with default parameter values – use the following configuration format:
rules = [
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_n()),
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n())
]
To use the built-in rules with customized parameter values – use the following configuration
format:
rules = [
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInRuleName(),
        rule_parameters={
            "key": "value"
        }
    ),
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]
To find available keys for the rule_parameters parameter, see the parameter description tables.
Sample rule configuration codes are provided for each built-in rule below the parameter description
tables.
• For a full instruction and examples of using the Debugger built-in rules, see Debugger Built-in Rules
Example Code (p. 1681).
• For a full instruction on using the built-in rules with the low-level SageMaker API operations, see
Configure Debugger Using Amazon SageMaker API (p. 1799).
ProfilerReport
The ProfilerReport rule invokes all of the built-in rules for monitoring and profiling. It creates a profiling
report and updates it when the individual rules are triggered. You can download a comprehensive profiling
report while a training job is running or after the training job is complete. You can adjust the rule
parameter values to customize the sensitivity of the built-in monitoring and profiling rules. The following
example code shows the basic format to adjust the built-in rule parameters through the ProfilerReport
rule.
rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            <BuiltInRuleName>_<parameter_name> = value
        )
    )
]
If you trigger this ProfilerReport rule without any customized parameter as shown in the following
example code, then the ProfilerReport rule triggers all of the built-in rules for monitoring and profiling
with their default parameter values.
rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]
The following example code shows how to specify and adjust the CPUBottleneck rule's cpu_threshold
parameter and the IOBottleneck rule's threshold parameter.
rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            CPUBottleneck_cpu_threshold = 90,
            IOBottleneck_threshold = 90
        )
    )
]
To explore what's in the profiler report, see SageMaker Debugger Profiling Report. Also, because this
rule activates all of the profiling rules, you can check the rule analysis status using the SageMaker
Debugger UI in SageMaker Studio Experiments.
BatchSize
The BatchSize rule helps detect if GPU is underutilized due to a small batch size. To detect this issue, this
rule monitors the average CPU utilization, GPU utilization, and GPU memory utilization. If utilization on
CPU, GPU, and GPU memory is low on average, it may indicate that the training job can either run on
a smaller instance type or can run with a bigger batch size. This analysis does not work for frameworks
that heavily overallocate memory. However, increasing the batch size can lead to processing or data
loading bottlenecks because more data preprocessing time is required in each iteration.
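A minimal sketch that attaches the rule with its default parameter values, following the configuration format shown earlier:

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.BatchSize()
    )
]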
CPUBottleneck
The CPUBottleneck rule helps detect if the GPU is underutilized due to CPU bottlenecks. The rule returns
True if the number of CPU bottlenecks exceeds a predefined threshold.
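A minimal sketch that sets the two thresholds described above; 90 and 10 are the defaults named earlier in this guide:

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.CPUBottleneck(),
        rule_parameters={
            "cpu_threshold": "90",  # CPU utilization above this suggests a bottleneck
            "gpu_threshold": "10"   # GPU utilization below this suggests underuse
        }
    )
]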
GPUMemoryIncrease
The GPUMemoryIncrease rule helps detect a large increase in memory usage on GPUs.
IOBottleneck
This rule helps detect if the GPU is underutilized due to data I/O bottlenecks. The rule returns True if
the number of I/O bottlenecks exceeds a predefined threshold.
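A minimal sketch using the threshold parameter, which the ProfilerReport example earlier customizes as IOBottleneck_threshold:

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.IOBottleneck(),
        rule_parameters={
            "threshold": "90"  # predefined threshold on I/O bottlenecks
        }
    )
]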
LoadBalancing
The LoadBalancing rule helps detect issues in workload balancing among multiple GPUs.
LowGPUUtilization
The LowGPUUtilization rule helps detect if GPU utilization is low or suffers from fluctuations. This is
checked for each GPU on each worker. The rule returns True if the 95th quantile is below threshold_p95,
which indicates underutilization. The rule also returns True if the 95th quantile is above threshold_p95
and the 5th quantile is below threshold_p5, which indicates fluctuations.
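A minimal sketch that names the two quantile thresholds described above; the values here are illustrative assumptions, not documented defaults:

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.LowGPUUtilization(),
        rule_parameters={
            "threshold_p95": "70",  # p95 below this indicates underutilization
            "threshold_p5": "10"    # p5 below this (with p95 above) indicates fluctuations
        }
    )
]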
OverallSystemUsage
The OverallSystemUsage rule measures overall system usage per worker node. The rule currently only
aggregates values per node and computes their percentiles.
MaxInitializationTime
The MaxInitializationTime rule helps detect if the training initialization is taking too much time. The rule
waits until the first step is available.
OverallFrameworkMetrics
The OverallFrameworkMetrics rule summarizes the time spent on framework metrics, such as forward
and backward pass, and data loading.
StepOutlier
The StepOutlier rule helps detect outliers in step durations. This rule returns True if there are step
durations larger than stddev standard deviations from the mean step duration in a time range.
CreateXgboostReport
The CreateXgboostReport rule collects output tensors from an XGBoost training job and autogenerates a
comprehensive training report. You can download a comprehensive profiling report while a training job
is running or after the training job is complete, and check the progress of training or the final result of
the training job. The CreateXgboostReport rule collects a default set of output tensors.
rules=[
    Rule.sagemaker(
        rule_configs.create_xgboost_report()
    )
]
DeadRelu
This rule detects when the percentage of rectified linear unit (ReLU) activation functions in a trial are
considered dead because their activation activity has dropped below a threshold. If the percent of
inactive ReLUs in a layer is greater than the threshold_layer value of inactive ReLUs, the rule returns
True.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.dead_relu(),
        rule_parameters={
            "tensor_regex": ".*relu_output|.*ReLU_output",
            "threshold_inactivity": "1.0",
            "threshold_layer": "50.0"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_relu_collection",
                parameters={
                    "include_regex": ".*relu_output|.*ReLU_output",
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.
ExplodingTensor
This rule detects whether the tensors emitted during training have non-finite values, either infinite or
NaN (not a number). If a non-finite value is detected, the rule returns True.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.exploding_tensor(),
        rule_parameters={
            "tensor_regex": ".*gradient",
            "only_nan": "False"
        },
        collections_to_save=[
            CollectionConfig(
                name="gradients",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.
PoorWeightInitialization
This rule detects if your model parameters have been poorly initialized.
Good initialization breaks the symmetry of the weights and gradients in a neural network and maintains
commensurate activation variances across layers. Otherwise, the neural network doesn't learn effectively.
Initializers like Xavier aim to keep variance constant across activations, which is especially relevant for
training very deep neural nets. Too small an initialization can lead to vanishing gradients. Too large an
initialization can lead to exploding gradients. This rule checks the variance of activation inputs across
layers, the distribution of gradients, and the loss convergence for the initial steps to determine if a neural
network has been poorly initialized.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.poor_weight_initialization(),
        rule_parameters={
            "activation_inputs_regex": ".*relu_input|.*ReLU_input",
            "threshold": "10.0",
            "distribution_range": "0.001",
            "patience": "5",
            "steps": "10"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_relu_collection",
                parameters={
                    "include_regex": ".*relu_input|.*ReLU_input",
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.
SaturatedActivation
This rule detects if the tanh and sigmoid activation layers are becoming saturated. An activation
layer is saturated when the input of the layer is close to the maximum or minimum of the activation
function. The minimum and maximum of the tanh and sigmoid activation functions are defined by their
respective min_threshold and max_threshold values. If the activity of a node drops below the
threshold_inactivity percentage, it is considered saturated. If more than a threshold_layer
percent of the nodes are saturated, the rule returns True.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.saturated_activation(),
        rule_parameters={
            "tensor_regex": ".*tanh_input|.*sigmoid_input",
            "threshold_tanh_min": "-9.4999",
            "threshold_tanh_max": "9.4999",
            "threshold_sigmoid_min": "-23",
            "threshold_sigmoid_max": "16.99999",
            "threshold_inactivity": "1.0",
            "threshold_layer": "50.0"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_activations_collection",
                parameters={
                    "include_regex": ".*tanh_input|.*sigmoid_input",
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.
VanishingGradient
This rule detects if the gradients in a trial become extremely small or drop to a zero magnitude. If the
mean of the absolute values of the gradients drops below a specified threshold, the rule returns True.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.vanishing_gradient(),
        rule_parameters={
            "threshold": "0.0000001"
        },
        collections_to_save=[
            CollectionConfig(
                name="gradients",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.
WeightUpdateRatio
This rule keeps track of the ratio of updates to weights during training and detects if that ratio gets too
large or too small. If the ratio of updates to weights is larger than the large_threshold value or if
this ratio is smaller than small_threshold, the rule returns True.
Conditions for training are best when the updates are commensurate to gradients. Excessively
large updates can push the weights away from optimal values, and very small updates result
in very slow convergence. This rule requires weights to be available for two training steps, and
train.save_interval needs to be set equal to num_steps.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.weight_update_ratio(),
        rule_parameters={
            "num_steps": "100",
            "large_threshold": "10.0",
            "small_threshold": "0.00000001",
            "epsilon": "0.000000001"
        },
        collections_to_save=[
            CollectionConfig(
                name="weights",
                parameters={
                    "train.save_interval": "100"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.
AllZero
This rule detects if all or a specified percentage of the tensor values are zero.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.all_zero(),
        rule_parameters={
            "tensor_regex": ".*",
            "threshold": "100"
        },
        collections_to_save=[
            CollectionConfig(
                name="all",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
ClassImbalance
This rule measures sampling imbalances between classes and throws errors if the imbalance exceeds a
threshold or if too many mispredictions for underrepresented classes occur as a result of the imbalance.
Classification models require well-balanced classes in the training dataset or a proper weighting/
sampling of classes during training. The rule performs the following checks:
• It counts the occurrences per class. If the ratio of number of samples between smallest and largest
class is larger than the threshold_imbalance, an error is thrown.
• It checks the prediction accuracy per class. If resampling or weighting has not been correctly applied,
then the model can reach high accuracy for the class with many training samples, but low accuracy
for the classes with few training samples. If a fraction of mispredictions for a certain class is above
threshold_misprediction, an error is thrown.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.class_imbalance(),
        rule_parameters={
            "threshold_imbalance": "10",
            "threshold_misprediction": "0.7",
            "samples": "500",
            "argmax": "False",
            "labels_regex": ".*labels",
            "predictions_regex": ".*predictions"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_output_collection",
                parameters={
                    "include_regex": ".*labels|.*predictions",
                    "save_interval": "500"
                }
            )
        ]
    )
]
LossNotDecreasing
This rule detects when the loss is not decreasing in value at an adequate rate. These losses must be
scalars.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        rule_parameters={
            "tensor_regex": ".*",
            "use_losses_collection": "True",
            "num_steps": "10",
            "diff_percent": "0.1",
            "increase_threshold_percent": "5",
            "mode": "GLOBAL"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
Overfit
This rule detects if your model is being overfit to the training data by comparing the validation and
training losses.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
A standard way to prevent overfitting is to regularize your model.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overfit(),
        rule_parameters={
            "tensor_regex": ".*",
            "start_step": "0",
            "patience": "1",
            "ratio_threshold": "0.1"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "100",
                    "eval.save_interval": "10"
                }
            )
        ]
    )
]
Overtraining
This rule detects if a model is being overtrained. After a number of training iterations on a well-behaved
model (both training and validation loss decrease), the model approaches a minimum of the loss
function and does not improve anymore. If the model continues training, the validation loss can start
increasing, because the model starts overfitting. This rule sets up thresholds and conditions to
determine if the model is not improving, and prevents overfitting problems due to overtraining.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
Overtraining can be avoided by early stopping. For information on early stopping, see Stop
Training Jobs Early (p. 1640). For an example that shows how to use spot training with
Debugger, see Enable Spot Training with Amazon SageMaker Debugger.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overtraining(),
        rule_parameters={
            "patience_train": "5",
            "patience_validation": "10",
            "delta": "0.01"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
SimilarAcrossRuns
This rule compares tensors gathered from a base trial with tensors from another trial.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.similar_across_runs(),
        rule_parameters={
            "other_trials": "<specify-another-job-name>",
            "collection_names": "losses",
            "tensor_regex": ".*"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
StalledTrainingRule
The StalledTrainingRule detects if there is no progress being made on a training job, and stops the
training job if the rule fires. This rule requires tensors to be saved periodically within a time interval
defined by its threshold parameter. The rule keeps monitoring for new tensors, and if no new tensor
has been emitted for the threshold interval, the rule fires.
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
            "threshold": "1800",
            "stop_training_on_fire": "True",
            "training_job_name_prefix": "<specify-training-base-job-name>"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
TensorVariance
This rule detects if you have tensors with very high or low variances. Very high or low variances in a
tensor could lead to neuron saturation, which reduces the learning ability of the neural network. Very
high variance in tensors can also eventually lead to exploding tensors. Use this rule to detect such issues
early.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tensor_variance(),
        rule_parameters={
            "collection_names": "weights",
            "max_threshold": "10",
            "min_threshold": "0.00001"
        },
        collections_to_save=[
            CollectionConfig(
                name="weights",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
UnchangedTensor
This rule detects whether a tensor is no longer changing across steps.
This rule runs the numpy.allclose method to check if the tensor isn't changing.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.unchanged_tensor(),
        rule_parameters={
            "collection_names": "losses",
            "tensor_regex": "",
            "num_steps": "3",
            "rtol": "1e-05",
            "atol": "1e-08",
            "equal_nan": "False"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
CheckInputImages
This rule checks if input images have been correctly normalized. Specifically, it detects if the mean of the
sample data differs by more than a threshold value from zero. Many computer vision models require that
input data has a zero mean and unit variance.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.check_input_images(),
        rule_parameters={
            "threshold_mean": "0.2",
            "threshold_samples": "500",
            "regex": ".*hybridsequential0_input_0",
            "channel": "1"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_inputs_collection",
                parameters={
                    "include_regex": ".*hybridsequential0_input_0",
                    "save_interval": "500"
                }
            )
        ]
    )
]
NLPSequenceRatio
This rule calculates the ratio of specific tokens to the rest of the input sequence, which is useful for optimizing performance. For example, you can calculate the percentage of padding end-of-sentence (EOS) tokens in your input sequence. If the number of EOS tokens is too high, consider an alternate bucketing strategy. You can also calculate the percentage of unknown tokens in your input sequence. If the number of unknown words is too high, consider an alternate vocabulary.
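For illustration, the ratio this rule computes can be sketched as follows, assuming token value 0 is the padding/EOS id; the batch values are made up.

import numpy as np

# Share of a specific token value (0) across a padded input batch.
batch = np.array([[5, 8, 2, 0, 0, 0],
                  [7, 3, 1, 0, 0, 0]])
ratio_percent = 100.0 * (batch == 0).mean()
# 50.0 -> at a token_thresholds_percent of 50, the rule would trigger.
print(ratio_percent)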
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
Rule.sagemaker(
base_config=rule_configs.nlp_sequence_ratio(),
rule_parameters={
"tensor_regex": ".*embedding0_input_0",
"token_values": "0",
"token_thresholds_percent": "50"
},
collections_to_save=[
CollectionConfig(
name="custom_inputs_collection",
parameters={
"include_regex": ".*embedding0_input_0"
}
)
]
)
]
Confusion
This rule evaluates the goodness of a confusion matrix for a classification problem. It checks that the fraction of correct predictions on the diagonal and the fraction of errors off the diagonal stay within the configured min_diag and max_off_diag thresholds.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
Rule.sagemaker(
base_config=rule_configs.confusion(),
rule_parameters={
"category_no": "10",
"labels": "labels",
"predictions": "predictions",
"labels_collection": "labels",
"predictions_collection": "predictions",
"min_diag": "0.9",
"max_off_diag": "0.1"
},
collections_to_save=[
CollectionConfig(
name="labels",
parameters={
"save_interval": "500"
}
),
CollectionConfig(
name="predictions",
parameters={
"include_regex": "500"
}
)
]
)
]
Note
This rule infers default values for the optional parameters if their values aren't specified.
FeatureImportanceOverweight
This rule accumulates the weights of the n largest feature importance values per step and ensures that
they do not exceed the threshold. For example, you can set the threshold for the top 3 features to not
hold more than 80 percent of the total weights of the model.
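The check this rule performs can be sketched as follows; the importance values are illustrative.

import numpy as np

# Do the n largest feature importance values exceed the threshold share of the total?
importances = np.array([0.35, 0.30, 0.20, 0.10, 0.05])
nfeatures, threshold = 3, 0.8
top_share = np.sort(importances)[-nfeatures:].sum() / importances.sum()
# 0.85 True -> the top-3 share exceeds 0.8, so the rule would trigger.
print(top_share, top_share > threshold)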
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
Rule.sagemaker(
base_config=rule_configs.feature_importance_overweight(),
rule_parameters={
"threshold": "0.8",
"nfeatures": "3",
"tensor_regex": ".*feature_importance/weight"
},
collections_to_save=[
CollectionConfig(
name="feature_importance",
parameters={
"save_interval": "500"
}
)
]
)
]
TreeDepth
This rule measures the depth of trees in an XGBoost model. XGBoost rejects splits if they do not improve
loss. This regularizes the training. As a result, the tree might not grow as deep as defined by the depth
parameter.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
built_in_rules = [
Rule.sagemaker(
base_config=rule_configs.tree_depth(),
rule_parameters={
"depth": "4"
},
collections_to_save=[
CollectionConfig(
name="tree",
parameters={
"save_interval": "500"
}
)
]
)
]
Create Custom Rules
Topics
• Prerequisites for Creating Debugger Custom Rules (p. 1793)
• Use the Debugger Client Library smdebug to Create a Custom Rule Python Script (p. 1794)
• Use the Debugger APIs to Run Your Own Custom Rules (p. 1794)
class CustomGradientRule(Rule):
def __init__(self, base_trial, threshold=10.0):
super().__init__(base_trial)
self.threshold = float(threshold)
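The constructor above only stores the threshold. A working rule class also implements an invoke_at_step method, which Debugger calls at every saved step and which returns True when the rule condition is met. The following continuation is a minimal sketch; the gradient check in its body is illustrative, not a prescribed implementation.

    def invoke_at_step(self, step):
        # Trigger the rule if the mean absolute value of any gradient
        # tensor exceeds the threshold at this step (illustrative check).
        for tname in self.base_trial.tensor_names(collection="gradients"):
            abs_mean = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False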
You can add as many custom rule classes as you want in the same Python script and deploy them to any training job trials by constructing custom rule objects, as shown in the following section.
custom_rule = Rule.custom(
name='MyCustomRule',
image_uri='759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-
evaluator:latest',
instance_type='ml.t3.medium',
source='path/to/my_custom_rule.py',
rule_to_invoke='CustomGradientRule',
collections_to_save=[CollectionConfig("gradients")],
rule_parameters={"threshold": "20.0"}
)
• collections_to_save (list of CollectionConfig objects): Specifies which tensor collections to save for the rule to run.
• rule_parameters (dict): Accepts parameter inputs in a dictionary format. You can adjust the parameters that you configured in the custom rule script.
After you set up the custom_rule object, you can use it to build a SageMaker estimator for any training job. Specify the entry_point to your training script. You do not need to make any changes to your training script.

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='smdebug-custom-rule-demo-tf-keras',
    entry_point='path/to/your_training_script.py',
    train_instance_type='ml.p2.xlarge',
    ...
    # Attach the custom rule configured in the previous step.
    rules=[custom_rule]
)

estimator.fit()
For more variations and advanced examples of using Debugger custom rules, see the following example
notebooks.
• Monitor your training job with Amazon SageMaker Debugger custom rules
• PyTorch iterative model pruning of ResNet and AlexNet
• Trigger Amazon CloudWatch Events using Debugger Rules to Take an Action Based on Training Status
with TensorFlow
Use Debugger with Custom Training Containers
You need the following resources to build a customized container with Debugger.
For an end-to-end example of using Debugger with a custom training container, see the following
example notebook.
• Build a Custom Training Container and Debug Training Jobs with Debugger
Tip
This custom container with Debugger guide is an extension of the Adapting your own training container (p. 2686) guide, which walks you through how to build and push your custom training container to Amazon ECR.
The following example code shows the structure of a training script using the Keras ResNet50 model and
how to pass the Debugger hook as a Keras callback for debugging. To find a complete training script, see
TensorFlow training script with SageMaker Debugger hook.
...
model.fit(X_train, Y_train,
          batch_size=batch_size,
          epochs=epoch,
          validation_data=(X_valid, Y_valid),
          shuffle=True,
          # Pass the Debugger hook as a Keras callback.
          callbacks=[hook])
def main():
parser=argparse.ArgumentParser(description="Train resnet50 cifar10")
# hyperparameter settings
parser.add_argument(...)
args = parser.parse_args()
# Add the following line to register the Debugger hook for Keras.
hook=smd.KerasHook.create_from_json_file()
if __name__ == "__main__":
main()
For more information about registering the Debugger hook for the supported frameworks and algorithms, see the SMDebug client library documentation. The example notebooks' training scripts show in more detail how to add the Debugger hooks to training scripts and collect output tensors.
To see the difference between using Debugger in a Deep Learning Container and in script mode, open this notebook side by side with the previous Debugger in a Deep Learning Container TensorFlow v2.1 notebook example.
In script mode, the hook configuration part is removed from the script in which you set the estimator.
Instead, the Debugger hook feature is merged into the training script, TensorFlow Keras ResNet
training script in script mode. The training script imports the smdebug library in the required
TensorFlow Keras environment to communicate with the TensorFlow ResNet50 algorithm. It also
manually implements the smdebug hook functionality by adding the callbacks=[hook] argument
inside the train function (in line 49), and by adding the manual hook configuration (in line 89)
provided through SageMaker Python SDK.
This script mode example runs the training job in the TF 2.1 framework for direct comparison with
the zero script change in the TF 2.1 example. The benefit of setting up Debugger in script mode is the
flexibility to choose framework versions not covered by AWS Deep Learning Containers.
• Using Amazon SageMaker Debugger in a PyTorch Container in Script Mode
This notebook enables Debugger in script mode in PyTorch v1.3.1 framework. PyTorch v1.3.1 is
supported by SageMaker containers, and this example shows details of how to modify a training script.
The SageMaker PyTorch estimator is already in script mode by default. In the notebook, the line to
activate script_mode is not included in the estimator configuration.
This notebook shows detailed steps to change an original PyTorch training script to a modified
version with Debugger enabled. Additionally, this example shows how you can use Debugger built-in
rules to detect training issues such as the vanishing gradients problem, and the Debugger trial features
to call and analyze the saved tensors.
# Install required packages to enable the SageMaker Python SDK and the smdebug library
RUN pip install sagemaker-training
RUN pip install smdebug
CMD ["bin/bash"]
If you want to use a pre-built AWS Deep Learning Container image, see Available AWS Deep Learning
Containers Images.
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-debugger-mnist-byoc-tf2'
tag = ':latest'
region = boto3.session.Session().region_name
uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
uri_suffix = 'amazonaws.com.cn'
byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix,
ecr_repository + tag)
Tip
If you use one of the AWS Deep Learning Container base images, run the following code to log in to Amazon ECR and access the Deep Learning Container image repository.
import sagemaker
profiler_config=ProfilerConfig(...)
debugger_hook_config=DebuggerHookConfig(...)
rules=[
Rule.sagemaker(rule_configs.built_in_rule()),
ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]
estimator=Estimator(
image_uri=byoc_image_uri,
entry_point="./debugger_custom_container_test_folder/your-training-script.py"
role=sagemaker.get_execution_role(),
base_job_name='debugger-custom-container-test',
instance_count=1,
instance_type='ml.p3.2xlarge',
# Debugger-specific parameters
profiler_config=profiler_config,
debugger_hook_config=debugger_hook_config,
rules=rules
)
# start training
estimator.fit()
Configure Debugger Using SageMaker API
Topics
• JSON (AWS CLI) (p. 1799)
• AWS Boto3 (p. 1804)
The following code shows a complete JSON template to run a training job with required settings and Debugger configurations. Save the template as a JSON file in your working directory and run the training job using the AWS CLI. For example, save the following code as debugger-training-job-cli.json.
Note
Ensure that you use the correct Docker container images. To find AWS Deep Learning Container
images, see Available Deep Learning Containers Images. To find a complete list of available
Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-in or
Custom Rules (p. 1813).
"TrainingJobName": "debugger-aws-cli-test",
"RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-
YYYYMMDDT123456",
"AlgorithmSpecification": {
// Specify a training Docker container image URI (Deep Learning Container or your own
training container) to TrainingImage.
"TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-
training:2.4.1-gpu-py37-cu110-ubuntu18.04",
"TrainingInputMode": "File",
"EnableSageMakerMetricsTimeSeries": false
},
"HyperParameters": {
"sagemaker_program": "entry_point/tf-hvd-train.py",
"sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-
profiling-test/source.tar.gz"
},
"OutputDataConfig": {
"S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
},
"DebugHookConfig": {
"S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-
output",
"CollectionConfigurations": [
{
"CollectionName": "losses",
"CollectionParameters" : {
"train.save_interval": "50"
}
}
]
},
"DebugRuleConfigurations": [
{
"RuleConfigurationName": "LossNotDecreasing",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
}
],
"ProfilerConfig": {
"S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/
profiler-output",
"ProfilingIntervalInMilliseconds": 500,
"ProfilingParameters": {
"DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex
\": \".*\", }",
"DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, }",
"PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\":
\"cprofile\", \"cProfileTimer\": \"total_time\"}",
"LocalPath": "/opt/ml/output/profiler/"
}
},
"ProfilerRuleConfigurations": [
{
"RuleConfigurationName": "ProfilerReport",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {"rule_to_invoke": "ProfilerReport"}
}
],
"ResourceConfig": {
"InstanceType": "ml.p3.8xlarge",
"InstanceCount": 1,
"VolumeSizeInGB": 30
},
"StoppingCondition": {
"MaxRuntimeInSeconds": 86400
}
}
After saving the JSON file, run the following command in your terminal. (Use ! at the beginning of the line if you use a Jupyter notebook.)

aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json

The following DebugHookConfig API example shows how to configure the training job to save the gradients tensor collection.
"DebugHookConfig": {
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
"CollectionConfigurations": [
{
"CollectionName": "gradients",
"CollectionParameters" : {
"save_interval": "500"
}
}
]
}
This configures the training job to save the gradients tensor collection every save_interval of 500 steps. To find available CollectionName values, see Debugger Built-in Collections in the SMDebug
client library documentation. To find available CollectionParameters parameter keys and values, see
the sagemaker.debugger.CollectionConfig class in the SageMaker Python SDK documentation.
The following DebugRuleConfigurations API example shows how to run the built-in
VanishingGradient rule on the saved gradients collection.
"DebugRuleConfigurations": [
{
"RuleConfigurationName": "VanishingGradient",
"RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {
"rule_to_invoke": "VanishingGradient",
"threshold": "20.0"
}
}
]
With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training
job using the VanishingGradient rule on the collection of gradients tensor. To find a complete list
of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-in or
Custom Rules (p. 1813). To find the key-value pairs for RuleParameters, see List of Debugger Built-in
Rules (p. 1748).
Target Step
"ProfilerConfig": {
// Optional. Path to an S3 bucket to save profiling outputs
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output",
// Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1
second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
"ProfilingIntervalInMilliseconds": 500,
"ProfilingParameters": {
"DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3,
\"MetricsRegex\": \".*\" }",
"DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
// For PythonProfilingConfig,
// available ProfilerName options: cProfile, Pyinstrument
// available cProfileTimer options only when using cProfile: cpu, off_cpu,
total_time
"PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName
\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
// Optional. Local path for profiling outputs
"LocalPath": "/opt/ml/output/profiler/"
}
}
"ProfilerConfig": {
// Optional. Path to an S3 bucket to save profiling outputs
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output",
// Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1
second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
"ProfilingIntervalInMilliseconds": 500,
"ProfilingParameters": {
"DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789,
\"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
"DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789,
\"DurationInSeconds\": 10 }",
// For PythonProfilingConfig,
// available ProfilerName options: cProfile, Pyinstrument
// available cProfileTimer options only when using cProfile: cpu, off_cpu,
total_time
"PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789,
\"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\":
\"total_time\" }",
// Optional. Local path for profiling outputs
"LocalPath": "/opt/ml/output/profiler/"
}
}
The following example code shows how to configure the ProfilerReport rule.
"ProfilerRuleConfigurations": [
{
"RuleConfigurationName": "ProfilerReport",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {
"rule_to_invoke": "ProfilerReport",
"CPUBottleneck_cpu_threshold": "90",
"IOBottleneck_threshold": "90"
}
}
]
To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).
{
"ProfilerConfig": {
"DisableProfiler": boolean,
"ProfilingIntervalInMilliseconds": number,
"ProfilingParameters": {
"string" : "string"
}
},
"ProfilerRuleConfigurations": [
{
"RuleConfigurationName": "string",
"RuleEvaluatorImage": "string",
"RuleParameters": {
"string" : "string"
}
}
],
"TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}
"DebugHookConfig": {
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
"CollectionConfigurations": [
{
"CollectionName": "relu_activations",
"CollectionParameters": {
"include_regex": "relu",
"save_interval": "500",
"end_step": "5000"
}
}
]
},
"DebugRulesConfigurations": [
{
"RuleConfigurationName": "improper_activation_job",
"RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-
debugger-rule-evaluator:latest",
"InstanceType": "ml.c4.xlarge",
"VolumeSizeInGB": 400,
"RuleParameters": {
"source_s3_uri": "s3://bucket/custom_rules.py",
"rule_to_invoke": "ImproperActivation",
"collection_names": "relu_activations"
}
}
]
To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).
AWS Boto3
Amazon SageMaker Debugger built-in rules can be configured for a training job using the
create_training_job() function of the AWS Boto3 SageMaker client. You need to specify the right
image URI in the RuleEvaluatorImage parameter, and the following examples walk you through how
to set up the request body for the create_training_job() function.
The following code shows a complete example of how to configure Debugger for the
create_training_job() request body and start a training job in us-west-2, assuming that a
training script entry_point/train.py is prepared using TensorFlow. To find an end-to-end example
notebook, see Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker
Debugger (Boto3).
Note
Ensure that you use the correct Docker container images. To find available AWS Deep Learning
Container images, see Available Deep Learning Containers Images. To find a complete list of
available Docker images for using the Debugger rules, see Use Debugger Docker Images for
Built-in or Custom Rules (p. 1813).
import boto3
import sagemaker

# Upload a training script to the default Amazon S3 bucket of the current SageMaker session
session = sagemaker.Session()
bucket = session.default_bucket()
source = 'source.tar.gz'
project = 'debugger-boto3-test'

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)
},
ProfilerRuleConfigurations=[
{
'RuleConfigurationName': 'ProfilerReport',
'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest',
'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
}
]
)
DebugHookConfig={
'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
'CollectionConfigurations': [
{
'CollectionName': 'gradients',
'CollectionParameters' : {
'train.save_interval': '500',
'eval.save_interval': '50'
}
}
]
}
This configures the training job to save the gradients tensor collection every 500 steps during training and every 50 steps during evaluation, per the train.save_interval and eval.save_interval parameters.
To find available CollectionName values, see Debugger Built-in Collections in the SMDebug client
library documentation. To find available CollectionParameters parameter keys and values, see the
sagemaker.debugger.CollectionConfig class in the SageMaker Python SDK documentation.
The following DebugRuleConfigurations API example shows how to run the built-in
VanishingGradient rule on the saved gradients collection.
DebugRuleConfigurations=[
{
'RuleConfigurationName': 'VanishingGradient',
'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest',
'RuleParameters': {
'rule_to_invoke': 'VanishingGradient',
'threshold': '20.0'
}
}
]
With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training
job using the VanishingGradient rule on the collection of gradients tensor. To find a complete list
of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-in or
Custom Rules (p. 1813). To find the key-value pairs for RuleParameters, see List of Debugger Built-in
Rules (p. 1748).
Target Step
ProfilerConfig={
    # Optional. Path to an S3 bucket to save profiling outputs
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output',
    # Available values for ProfilingIntervalInMilliseconds:
    # 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute)
    'ProfilingIntervalInMilliseconds': 500,
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
        # Available ProfilerName options: cprofile, pyinstrument.
        # Include cProfileTimer only when using cprofile;
        # available options: cpu, off_cpu, total_time.
        'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        # Optional. Local path for profiling outputs
        'LocalPath': '/opt/ml/output/profiler/'
    }
}
Target Time Duration
ProfilerConfig={
    # Optional. Path to an S3 bucket to save profiling outputs
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output',
    # Available values for ProfilingIntervalInMilliseconds:
    # 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute)
    'ProfilingIntervalInMilliseconds': 500,
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10}',
        # Available ProfilerName options: cprofile, pyinstrument.
        # Include cProfileTimer only when using cprofile;
        # available options: cpu, off_cpu, total_time.
        'PythonProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        # Optional. Local path for profiling outputs
        'LocalPath': '/opt/ml/output/profiler/'
    }
}
The following example code shows how to configure the ProfilerReport rule.
ProfilerRuleConfigurations=[
{
'RuleConfigurationName': 'ProfilerReport',
'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest',
'RuleParameters': {
'rule_to_invoke': 'ProfilerReport',
'CPUBottleneck_cpu_threshold': '90',
'IOBottleneck_threshold': '90'
}
}
]
To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).
ProfilerConfig={
'DisableProfiler': boolean,
'ProfilingIntervalInMilliseconds': number,
'ProfilingParameters': {
'string' : 'string'
}
},
ProfilerRuleConfigurations=[
{
'RuleConfigurationName': 'string',
'RuleEvaluatorImage': 'string',
'RuleParameters': {
'string' : 'string'
}
}
],
TrainingJobName='your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS'
The following code sample shows how to configure a custom ImproperActivation rule written with the smdebug library using this SageMaker API operation. This example assumes that you've written the custom rule in a custom_rules.py file and uploaded it to an Amazon S3 bucket. The example uses a pre-built Docker image for running custom rules; these images are listed at Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators (p. 1814). You specify the registry URL address of the pre-built Docker image in the RuleEvaluatorImage parameter.
DebugHookConfig={
'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
'CollectionConfigurations': [
{
'CollectionName': 'relu_activations',
'CollectionParameters': {
'include_regex': 'relu',
'save_interval': '500',
'end_step': '5000'
}
}
]
},
DebugRuleConfigurations=[
{
'RuleConfigurationName': 'improper_activation_job',
'RuleEvaluatorImage': '552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-
debugger-rule-evaluator:latest',
'InstanceType': 'ml.c4.xlarge',
'VolumeSizeInGB': 400,
'RuleParameters': {
'source_s3_uri': 's3://bucket/custom_rules.py',
'rule_to_invoke': 'ImproperActivation',
'collection_names': 'relu_activations'
}
}
]
To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).
Best Practices for Debugger
Topics
• Choose a Machine Learning Framework (p. 1810)
• Use Studio Debugger Insights Dashboard (p. 1810)
• Download Debugger Reports and Gain More Insights (p. 1810)
• Capture Data from Your Training Job and Save Data to Amazon S3 (p. 1810)
• Analyze the Data with a Fleet of Debugger Built-in Rules (p. 1810)
• Take Actions Based on the Built-in Rule Status (p. 1810)
• Dive Deep into the Data Using the SMDebug Client Library (p. 1811)
• Monitor and Analyze Training Job Metrics (p. 1811)
• Monitoring System Utilization and Detect Bottlenecks (p. 1811)
• Profiling Framework Operations (p. 1811)
• Debugging Model Output Tensors (p. 1812)
Capture Data from Your Training Job and Save Data to Amazon
S3
You can use a Debugger hook to save output tensors. After you choose a container and a framework that fit your training script, use a Debugger hook to configure which tensors to save and the directory to save them to, such as an Amazon S3 bucket. A Debugger hook helps you build the configuration and keep it in your account for use in subsequent analyses, where it is secured for use with the most privacy-sensitive applications. To learn more, see Configure SageMaker Debugger to Save Tensors (p. 1672).
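For instance, a minimal sketch of such a hook configuration with the SageMaker Python SDK might look like the following; the S3 path and collection choice are placeholders, not prescribed values.

from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# Save the "losses" built-in collection every 100 steps to an S3 path of your choice.
hook_config = DebuggerHookConfig(
    s3_output_path="s3://your-bucket/debugger-output",  # placeholder bucket
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "100"})
    ]
)
# Pass hook_config to the debugger_hook_config parameter of a SageMaker estimator.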
Dive Deep into the Data Using the SMDebug Client Library
You can use the SMDebug tools to access and analyze training data collected by Debugger. The
TrainingJob and create_trial classes load the metrics and tensors saved by Debugger. These
classes provide extended class methods to analyze the data in real time or after the training has
finished. The SMDebug library also provides visualization tools: merged timelines of framework metrics to aggregate different profiling data, line charts and heatmaps to track system utilization, and histograms to find step-duration outliers. To learn more about the SMDebug library tools, see Analyze Data Using the
SMDebug Client Library (p. 1740).
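A minimal sketch of this workflow follows; the S3 path and the "loss" tensor name are placeholders.

from smdebug.trials import create_trial

# Load the tensors a training job saved to its debug-output path.
trial = create_trial("s3://your-bucket/training-job-name/debug-output")
print(trial.tensor_names(collection="losses"))
for step in trial.steps():
    # "loss" is a placeholder; use a name returned by tensor_names() above.
    print(step, trial.tensor("loss").value(step))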
If you want to profile your training job with a finer resolution down to 100-millisecond (0.1 second)
granularity and store the training metrics indefinitely in Amazon S3 for custom analysis at any
time, consider using Amazon SageMaker Debugger. SageMaker Debugger provides built-in rules to
automatically detect common training issues; it detects hardware resource utilization issues (such as
CPU, GPU, and I/O bottlenecks) and non-converging model issues (such as overfit, vanishing gradients,
and exploding tensors).
SageMaker Debugger also provides visualizations through Studio and its profiling report. Unlike
CloudWatch metrics, which accumulates resource utilization rates of CPU and GPU cores and averages
those out across multiple instances, Debugger tracks the utilization rate of each core. This enables you to
identify unbalanced usage of hardware resources as you scale up to larger compute clusters. To explore
the Debugger visualizations, see SageMaker Debugger Insights Dashboard Walkthrough, Debugger
Profiling Report Walkthrough, and Analyze Data Using the SMDebug Client Library.
To learn how to enable Debugger system monitoring, see Configure Debugger Using Amazon SageMaker
Python SDK (p. 1710) and then Configure Debugger for Monitoring Resource Utilization (p. 1714).
For a full list of available built-in rules for monitoring, see Debugger built-in rules for profiling hardware
system resource utilization (system metrics) (p. 1749).
To learn how to configure Debugger for framework profiling, see Configure Debugger Using Amazon
SageMaker Python SDK (p. 1710) and then Configure Debugger for Framework Profiling (p. 1714).
For a complete list of available built-in rules for profiling, see Debugger built-in rules for profiling
framework metrics (p. 1749).
To learn how to configure Debugger for debugging output tensors, see Step 2: Launch and Debug
Training Jobs Using SageMaker Python SDK (p. 1669) and then Configure SageMaker Debugger to Save
Tensors (p. 1672).
For a full list of available built-in rules for debugging, see Debugger built-in rules for debugging model
training data (output tensors) (p. 1750).
Advanced Topics and Reference
Topics
• Amazon SageMaker Debugger API Operations (p. 1812)
• Use Debugger Docker Images for Built-in or Custom Rules (p. 1813)
• Amazon SageMaker Debugger Exceptions (p. 1815)
• Considerations for Amazon SageMaker Debugger (p. 1816)
• Amazon SageMaker Debugger Usage Statistics (p. 1818)
Amazon SageMaker Debugger also provides the open source sagemaker-debugger Python SDK that
is used to configure built-in rules, define custom rules, and register hooks to collect output tensor data
from training jobs.
The Amazon SageMaker Python SDK is a high-level SDK focused on machine learning experimentation.
The SDK can be used to deploy built-in or custom rules defined with the SMDebug Python library to
monitor and analyze these tensors using SageMaker estimators.
Debugger has added operations and types to the Amazon SageMaker API that enable the platform to
use Debugger when training a model and to manage the configuration of inputs and outputs.
• CreateTrainingJob and UpdateTrainingJob use the following Debugger APIs to configure tensor
collections, rules, rule images, and profiling options:
• CollectionConfiguration
• DebugHookConfig
• DebugRuleConfiguration
• TensorBoardOutputConfig
• ProfilerConfig
• ProfilerRuleConfiguration
• DescribeTrainingJob provides a full description of a training job, including the following Debugger
configurations and rule evaluation statuses:
• DebugHookConfig
• DebugRuleConfiguration
• DebugRuleEvaluationStatus
• ProfilerConfig
• ProfilerRuleConfiguration
• ProfilerRuleEvaluationStatus
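For example, a minimal sketch of reading the rule evaluation statuses back through DescribeTrainingJob with Boto3 follows; the training job name is a placeholder.

import boto3

sm = boto3.client("sagemaker")
resp = sm.describe_training_job(TrainingJobName="your-training-job-name")
# Print the latest status of each Debugger rule attached to the job.
for status in resp.get("DebugRuleEvaluationStatuses", []):
    print(status["RuleConfigurationName"], status["RuleEvaluationStatus"])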
The rule configuration API operations use the SageMaker Processing functionality when analyzing a
model training. For more information about SageMaker Processing, see Process Data (p. 1196).
If you use the Amazon SageMaker Python SDK, you can simply use the SageMaker high-level Debugger API operations with the SageMaker Estimator API operations, without having to manually retrieve the Debugger Docker images and configure the CreateTrainingJob API.
If you are not using the SageMaker Python SDK, you have to retrieve a relevant pre-built container base
image for the Debugger rules. Amazon SageMaker Debugger provides pre-built Docker images for built-
in and custom rules, and the images are stored in Amazon Elastic Container Registry (Amazon ECR). To
pull an image from an Amazon ECR repository (or to push an image to one), use the image's full registry URL with the CreateTrainingJob API. SageMaker uses the following URL pattern for the Debugger rule container image registry address.

<account_id>.dkr.ecr.<Region>.amazonaws.com/<ECR repository name>:<tag>
For the account ID in each AWS Region, Amazon ECR repository name, and tag value, see the following
topics.
Topics
• Amazon SageMaker Debugger Registry URLs for Built-in Rule Evaluators (p. 1813)
• Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators (p. 1814)
Tag: latest
904829902805.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rules:latest
Region account_id
af-south-1 314341159256
ap-east-1 199566480951
ap-northeast-1 430734990657
ap-northeast-2 578805364391
ap-south-1 904829902805
ap-southeast-1 972752614525
ap-southeast-2 184798709955
ca-central-1 519511493484
cn-north-1 618459771430
cn-northwest-1 658757709296
eu-central-1 482524230118
eu-north-1 314864569078
eu-south-1 563282790590
eu-west-1 929884845733
eu-west-2 250201462417
eu-west-3 447278800020
me-south-1 986000313247
sa-east-1 818342061345
us-east-1 503895931360
us-east-2 915447279597
us-west-1 685455198987
us-west-2 895741380848
us-gov-west-1 515509971035
Tag: latest
552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-
evaluator:latest
Region account_id
af-south-1 515950693465
ap-east-1 645844755771
ap-northeast-1 670969264625
ap-northeast-2 326368420253
ap-south-1 552407032007
ap-southeast-1 631532610101
ap-southeast-2 445670767460
ca-central-1 105842248657
cn-north-1 617202126805
cn-northwest-1 658559488188
eu-central-1 691764027602
eu-north-1 091235270104
eu-south-1 335033873580
eu-west-1 606966180310
eu-west-2 074613877050
eu-west-3 224335253976
me-south-1 050406412588
sa-east-1 466516958431
us-east-1 864354269164
us-east-2 840043622174
us-west-1 952348334681
us-west-2 759209512951
us-gov-west-1 515361955729
Debugger raises exceptions when, for example, a tensor required to run a rule is missing. These exceptions are available in the smdebug.exceptions module. You can import them as follows:

from smdebug.exceptions import *
• TensorUnavailableForStep – The tensor requested is not available for the step. This might mean
that this step might not be saved at all by the hook, or that this step might have saved some tensors
but the requested tensor is not part of them. Note that when you see this exception, it means that this
tensor can never become available for this step in the future. If the tensor has reductions saved for the
step, it notifies you they can be queried.
• TensorUnavailable – This tensor is not being saved or has not been saved by the smdebug API. This
means that this tensor is never seen for any step in smdebug.
• StepUnavailable – The step was not saved and Debugger has no data from the step.
• StepNotYetAvailable – The step has not yet been seen by smdebug. It might be available in the
future if the training is still going on. Debugger automatically loads new data as it becomes available.
• NoMoreData – Raised when the training ends. Once you see this, you know that there are no more
steps and no more tensors to be saved.
• IndexReaderException – The index reader is not valid.
• InvalidWorker – A worker was invoked that was not valid.
• RuleEvaluationConditionMet – Evaluation of the rule at the step resulted in the condition being
met.
• InsufficientInformationForRuleInvocation – Insufficient information was provided to invoke
the rule.
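A minimal sketch of handling these exceptions while polling a trial follows, assuming trial was created with smdebug.trials.create_trial and that step and the "loss" tensor name are placeholders.

from smdebug.exceptions import TensorUnavailableForStep, StepNotYetAvailable

try:
    value = trial.tensor("loss").value(step)
except TensorUnavailableForStep:
    # The tensor was not saved for this step and never will be.
    value = None
except StepNotYetAvailable:
    # The step has not been seen yet; retry while training is still running.
    value = None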
The following considerations apply when using Debugger with distributed training:
• Horovod and SageMaker distributed data parallel – Debugger support varies by framework and version; note that SageMaker distributed data parallel does not support TensorFlow 2.x with the Keras implementation.
• SageMaker distributed model parallel – Debugger does not support SageMaker distributed model
parallel training.
• Distributed training with SageMaker checkpoints – Debugger is not available for training jobs when
both the distributed training option and SageMaker checkpoints are enabled. You might see an error
that looks like the following:
SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
To use Debugger for training jobs with distributed training options, you need to disable SageMaker
checkpointing and add manual checkpointing functions to your training script. For more information
about using Debugger with distributed training options and checkpoints, see Using SageMaker
Distributed Data Parallel with Amazon SageMaker Debugger and Checkpoints (p. 1862) and Saving
Checkpoints (p. 1939).
• Parameter Server – Debugger does not support parameter server-based distributed training.
• Profiling distributed training framework operations, such as the AllReduce operation of SageMaker distributed data parallel and Horovod operations, is not available.
• Framework profiling output is written to the local path /opt/ml/output/profiler/, as set in the following configuration:
FrameworkProfile(local_path="/opt/ml/output/profiler/")
• For AWS TensorFlow, the data loader profiling configuration cannot be updated while a training job is
running.
• For AWS TensorFlow, a NoneType error might occur when you use analysis tools and notebook
examples with TensorFlow 2.3 training jobs and the detailed profiling option.
• Python profiling and detailed profiling are only supported for the Keras API.
• To access the deep profiling feature for TensorFlow and PyTorch, currently you must specify the latest
AWS deep learning container images with CUDA 11. For example, you must specify the specific image
URI in the TensorFlow and PyTorch estimator as follows:
• For TensorFlow
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-
gpu-py37-cu110-ubuntu18.04"
• For PyTorch
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:1.6.0-gpu-
py36-cu110-ubuntu18.04"
import mxnet as mx
from mxnet.gluon import HybridBlock, nn
class Model(HybridBlock):
def __init__(self, **kwargs):
super(Model, self).__init__(**kwargs)
# use name_scope to give child Blocks appropriate names.
with self.name_scope():
self.dense0 = nn.Dense(20)
self.dense1 = nn.Dense(20)
model = Model()
model.initialize(ctx=mx.cpu(0))
model.hybridize()
model(mx.nd.zeros((10, 10), ctx=mx.cpu(0)))
Debugger collects profiling report usage statistics by including code in the Jupyter notebook that
collects the unique ProfilerReport rule's processing job ARN if the user opens the final profiler-
report.html file.
Debugger only collects information about whether a user opens the final HTML report. It DOES NOT
collect any information from training jobs, training data, training scripts, processing jobs, logs, or the
content of the profiling report itself.
You can opt out of the collection of usage statistics using either of the following options.
To opt out, you need to add the following Debugger ProfilerReport rule configuration to your
training job request.
estimator=sagemaker.estimator.Estimator(
    ...
    rules=[
        ProfilerRule.sagemaker(
            base_config=rule_configs.ProfilerReport(),
            rule_parameters={"opt_out_telemetry": "True"}
        )
    ]
)
AWS CLI
"ProfilerRuleConfigurations": [
{
"RuleConfigurationName": "ProfilerReport-1234567890",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {
"rule_to_invoke": "ProfilerReport",
"opt_out_telemetry": "True"
}
}
]
AWS Boto3
ProfilerRuleConfigurations=[
{
'RuleConfigurationName': 'ProfilerReport-1234567890',
'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest',
'RuleParameters': {
'rule_to_invoke': 'ProfilerReport',
'opt_out_telemetry': 'True'
}
}
]
To opt out after training has completed, you need to modify the profiler-report.ipynb file.
Note
HTML reports autogenerated without Option 1 already added to your training job request still
report the usage statistics even after you opt out using Option 2.
1. Follow the instructions on downloading the Debugger profiling report files in the Download the
SageMaker Debugger Profiling Report (p. 1730) page.
2. In the /ProfilerReport-1234567890/profiler-output directory, open profiler-
report.ipynb.
3. Add opt_out=True to the setup_profiler_report() function in the fifth code cell as shown in
the following example code:
setup_profiler_report(processing_job_arn, opt_out=True)
SageMaker Debugger Release Notes
SageMaker Debugger launches TensorBoard on SageMaker, a capability that brings the TensorBoard app
to SageMaker with access control.
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in earlier versions of the frameworks and SDKs.
With the deprecation, SageMaker Debugger also discontinues support for the following three
ProfilerRules for framework profiling.
• MaxInitializationTime
• OverallFrameworkMetrics
• StepOutlier
• The XGBoost report tab has been removed from the SageMaker Debugger's profiler dashboard. You can still access the XGBoost report by downloading it as a Jupyter notebook or an HTML file. For more information, see SageMaker Debugger XGBoost Training Report.
• Starting from this release, the built-in profiler rules are not activated by default. To use the SageMaker
Debugger profiler rules to detect certain computational problems, you need to add the rules when you
configure a SageMaker training job launcher.
Distributed Training
The SageMaker distributed training libraries are optimized for the SageMaker training environment,
help adapt your distributed training jobs to SageMaker, and improve training speed and throughput.
The libraries offer both data parallel and model parallel training strategies. They combine software and
hardware technologies to improve inter-GPU and inter-node communications, and extend SageMaker’s
training capabilities with built-in options that require minimal code changes to your training scripts.
• To use SageMaker's data parallelism library, configure the distribution parameter of the
SageMaker framework estimators. Supported framework estimators are PyTorch and TensorFlow. The
following code example shows how to set up a framework estimator for distributed training with the data parallelism library on two ml.p4d.24xlarge instances.
estimator = Framework(
...,
instance_count=2,
instance_type="ml.p4d.24xlarge",
distribution={"smdistributed" : {"dataparallel" : {"enabled" : True}}}
)
To learn how to prepare your training script and launch a distributed training job, see SageMaker's
data parallelism library (p. 1831) (see also Distributed Training APIs in the SageMaker Python SDK
documentation).
• To use SageMaker's model parallelism library, configure the distribution parameter of the
SageMaker framework estimators. Supported framework estimators are PyTorch and TensorFlow. The
following code example shows how to construct a framework estimator for distributed training with
the model parallelism library on two ml.p4d.24xlarge instances.
distribution={
"smdistributed": {
"modelparallel": {
"enabled":True,
"parameters": {
... # enter parameter key-value pairs here
}
},
},
"mpi": {
"enabled" : True,
... # enter parameter key-value pairs here
}
}
estimator = Framework(
...,
instance_count=2,
instance_type="ml.p4d.24xlarge",
distribution=distribution
)
To learn how to prepare your training script, configure distribution parameters, and launch a
distributed training job, see SageMaker's model parallelism library (p. 1864) (see also Distributed
Training APIs in the SageMaker Python SDK documentation).
SageMaker also supports the following options to run mpirun and torchrun in the backend.
• To use PyTorch DistributedDataParallel (DDP) in SageMaker with the mpirun backend, add
distribution={"pytorchddp": {"enabled": True}} to your PyTorch estimator. For more
information, see also PyTorch Distributed Training and SageMaker PyTorch Estimator's distribution
argument in the SageMaker Python SDK documentation.
Note
This option is available for PyTorch 1.12.0 and later.
estimator = PyTorch(
...,
instance_count=2,
instance_type="ml.p4d.24xlarge",
distribution={"pytorchddp": {"enabled": True}} # runs mpirun in the backend
)
• SageMaker supports the PyTorch torchrun launcher for distributed training on GPU-based Amazon
EC2 instances, such as P3 and P4, as well as Trn1 powered by the AWS Trainium device.
To use PyTorch DistributedDataParallel (DDP) in SageMaker with the torchrun backend, add
distribution={"torch_distributed": {"enabled": True}} to the PyTorch estimator.
Note
This option is available for PyTorch 1.13.0 and later.
The following code snippet shows an example of constructing a SageMaker PyTorch estimator to run
distributed training on two ml.p4d.24xlarge instances with the torch_distributed distribution
option.
estimator = PyTorch(
...,
instance_count=2,
instance_type="ml.p4d.24xlarge",
distribution={"torch_distributed": {"enabled": True}} # runs torchrun in the
backend
)
For more information, see Distributed PyTorch Training and SageMaker PyTorch Estimator's
distribution argument in the SageMaker Python SDK documentation.
A Trn1 instance consists of up to 16 Trainium devices, and each Trainium device consists of two
NeuronCores. For specs of the AWS Trainium devices, see Trainium Architecture in the AWS Neuron
Documentation.
To train on the Trainium-powered instances, you only need to specify the Trn1 instance code,
ml.trn1.*, in string to the instance_type argument of the SageMaker PyTorch estimator class. To
find available Trn1 instance types, see AWS Trn1 Architecture in the AWS Neuron documentation.
Note
SageMaker Training on Amazon EC2 Trn1 instances is currently available only for the PyTorch
framework in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0. To find
a complete list of supported versions of PyTorch Neuron, see Neuron Containers in the AWS
Deep Learning Containers GitHub repository.
When you launch a training job on Trn1 instances using the SageMaker Python SDK, SageMaker
automatically picks up and runs the right container from Neuron Containers provided by AWS Deep
Learning Containers. The Neuron Containers are prepackaged with training environment settings
and dependencies for easier adaptation of your training job to the SageMaker Training platform and
Amazon EC2 Trn1 instances.
Note
To run your PyTorch training job on Trn1 instances with SageMaker, you should modify your
training script to initialize process groups with the xla backend and use PyTorch/XLA. To
support the XLA adoption process, the AWS Neuron SDK provides PyTorch Neuron that uses
XLA to convert PyTorch operations to Trainium instructions. To learn how to
modify your training script, see Developer Guide for Training with PyTorch Neuron (torch-
neuronx) in the AWS Neuron Documentation.
For more information, see Distributed Training with PyTorch Neuron on Trn1 instances and SageMaker
PyTorch Estimator's distribution argument in the SageMaker Python SDK documentation.
• To use MPI in SageMaker, add distribution={"mpi": {"enabled": True}} to your estimator.
The MPI distribution option is available for the following frameworks: MXNet, PyTorch, and
TensorFlow.
• To use a parameter server in SageMaker, add distribution={"parameter_server":
{"enabled": True}} to your estimator. The parameter server option is available for the following
frameworks: MXNet, PyTorch, and TensorFlow.
Tip
For more information about using the MPI and parameter server options per framework, use
the following links to the SageMaker Python SDK documentation.
• MXNet Distributed Training and SageMaker MXNet Estimator's distribution argument
• PyTorch Distributed Training and SageMaker PyTorch Estimator's distribution argument
• TensorFlow Distributed Training and SageMaker TensorFlow Estimator's distribution
argument.
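For instance, a minimal sketch of enabling the MPI option on a TensorFlow estimator follows; the entry point, framework versions, and instance settings are placeholders.

import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",  # placeholder training script
    role=sagemaker.get_execution_role(),
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    framework_version="2.11",  # placeholder versions
    py_version="py39",
    distribution={"mpi": {"enabled": True}}  # runs mpirun in the backend
)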
Basic Distributed Training Concepts
• Training Dataset: All of the data you use to train the model.
• Global batch size: The number of records selected from the training dataset in each iteration to send
to the GPUs in the cluster. This is the number of records over which the gradient is computed at each
iteration. If data parallelism is used, it is equal to the total number of model replicas multiplied by the
per-replica batch size: global batch size = (the number of model replicas) * (per-
replica batch size). A single batch of global batch size is often referred to as the mini-batch in
machine learning literature.
• Per-replica batch size: When data parallelism is used, this is the number of records sent to each model
replica. Each model replica performs a forward and backward pass with this batch to calculate weight
updates. The resulting weight updates are synchronized (averaged) across all replicas before the next
set of per-replica batches are processed.
• Micro-batch: A subset of the mini-batch or, if hybrid model and data parallelism is used, a subset of the per-replica batch. When you use SageMaker's distributed model parallelism library, each micro-batch is fed into the training pipeline one-by-one and follows an execution schedule defined by the library's runtime.
Training
• Epoch: One training cycle through the entire dataset. It is common to have multiple iterations per epoch. The number of epochs you use in training depends on your model and use case.
• Iteration: A single forward and backward pass performed using a mini-batch (a batch of global batch size) of training data. The number of iterations performed during training is determined by the global batch size and the number of epochs used for training. For example, if a dataset includes 5,000 samples, and you use a global batch size of 500, it takes 10 iterations to complete a single epoch. (A short arithmetic check of these relationships appears after this list.)
• Learning rate: A variable that influences the amount that weights are changed in response to the
calculated error of the model. The learning rate plays an important role in the model’s ability to
converge as well as the speed and optimality of convergence.
• Instances: An AWS machine learning compute instance. These are also referred to as nodes.
• Cluster size: When using SageMaker's distributed training library, this is the number of instances
multiplied by the number of GPUs in each instance. For example, if you use two ml.p3.8xlarge
instances in a training job, which have 4 GPUs each, the cluster size is 8. While increasing cluster size can lead to faster training times, communication between instances must be optimized; otherwise, communication between the nodes can add overhead and lead to slower training times. The
SageMaker distributed training library is designed to optimize communication between Amazon EC2
ML compute instances, leading to higher device utilization and faster training times.
• Data parallelism: A strategy in distributed training where a training dataset is split up across multiple
GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. Each GPU contains
a replica of the model, receives different batches of training data, performs a forward and backward
pass, and shares weight updates with the other nodes for synchronization before moving on to the
next batch and ultimately another epoch.
• Model parallelism: A strategy in distributed training where the model is partitioned across multiple
GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. The model might
be complex and have a large number of hidden layers and weights, making it unable to fit in the
memory of a single instance. Each GPU carries a subset of the model, through which the data flows
and the transformations are shared and compiled. The efficiency of model parallelism, in terms of GPU
utilization and training time, is heavily dependent on how the model is partitioned and the execution
schedule used to perform forward and backward passes.
• Pipeline Execution Schedule (Pipelining): The pipeline execution schedule determines the order in
which computations (micro-batches) are made and data is processed across devices during model
training. Pipelining is a technique to achieve true parallelization in model parallelism and overcome
the performance loss due to sequential computation by having the GPUs compute simultaneously on
different data samples. To learn more, see Pipeline Execution Schedule.
Advanced Concepts
Machine Learning (ML) practitioners commonly face two scaling challenges when training models:
scaling model size and scaling training data. While model size and complexity can result in better
accuracy, there is a limit to the model size you can fit into a single CPU or GPU. Furthermore, scaling
model size may result in more computations and longer training times.
Not all models scale equally well with more training data, because some need to ingest the entire training dataset in memory for training. Such models scale only vertically, to bigger and bigger instance types. In most cases, scaling training data results in longer training times.
Deep Learning (DL) is a specific family of ML algorithms consisting of several layers of artificial neural
networks. The most common training method is with mini-batch Stochastic Gradient Descent (SGD).
In mini-batch SGD, the model is trained by conducting small iterative changes of its coefficients in
the direction that reduces its error. Those iterations are conducted on equally sized subsamples of the
training dataset called mini-batches. For each mini-batch, the model is run on each record of the mini-batch, its error is measured, and the gradient of the error is estimated. Then the average gradient is computed across all the records of the mini-batch and provides an update direction for each model coefficient. One
full pass over the training dataset is called an epoch. Model trainings commonly consist of dozens to
hundreds of epochs. Mini-batch SGD has several benefits. First, its iterative design makes training time theoretically linear in dataset size. Second, in a given mini-batch each record is processed individually
by the model without need for inter-record communication other than the final gradient average. The
processing of a mini-batch is consequently particularly suitable for parallelization and distribution.
Parallelizing SGD training by distributing the records of a mini-batch over different computing devices is
called data parallel distributed training, and is the most commonly used DL distribution paradigm. Data
parallel training is a relevant distribution strategy to scale the mini-batch size and process each mini-
batch faster. However, data parallel training comes with the extra complexity of having to compute the
mini-batch gradient average with gradients coming from all the workers and communicating it to all the
workers, a step called allreduce, which can represent a growing overhead as the training cluster is scaled and which can drastically penalize training time if improperly implemented or implemented over improper hardware substrates.
Data parallel SGD still requires developers to be able to fit at least the model and a single record
in a computing device, such as a single CPU or GPU. When training very large models such as large
transformers in Natural Language Processing (NLP), or segmentation models over high-resolution
images, there may be situations in which this is not feasible. An alternative way to break up the workload
is to partition the model over multiple computing devices, an approach called model-parallel distributed
training.
Strategies
Distributed training is usually split by two approaches: data parallel and model parallel. Data parallel is
the most common approach to distributed training: You have a lot of data, batch it up, and send blocks
of data to multiple CPUs or GPUs (nodes) to be processed by the neural network or ML algorithm, then
combine the results. The neural network is the same on each node. A model parallel approach is used
with large models that won’t fit in a node’s memory in one piece; it breaks up the model and places
different parts on different nodes. In this situation, you need to send your batches of data out to each
node so that the data is processed on all parts of the model.
The terms network and model are often used interchangeably: A large model is really a large network
with many layers and parameters. Training with a large network produces a large model, and loading
the model back onto the network with all your pre-trained parameters and their weights loads a large
model into memory. When you break apart a model to split it across nodes, you’re also breaking apart
the underlying network. A network consists of layers, and to split up the network, you put layers on
different compute devices.
A common pitfall of naively splitting layers across devices is severe GPU under-utilization. Training is
inherently sequential in both forward and backward passes, and at a given time, only one GPU can
actively compute, while the others wait on the activations to be sent. Modern model parallel libraries
solve this problem by using pipeline execution schedules to improve device utilization. However, only
the Amazon SageMaker distributed model parallel library includes automatic model splitting. The two
core features of the library, automatic model splitting and pipeline execution scheduling, simplify the
process of implementing model parallelism by making automated decisions that lead to efficient device
utilization.
• On NVIDIA TensorCore-equipped hardware, using mixed precision training creates both speed-up and
memory consumption reduction.
• SageMaker's distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the
box. No extra action is needed to enable AMP other than the framework-level modifications to your
training script; a sketch follows this list. If gradients are in FP16, the SageMaker data parallelism library
runs its AllReduce operation in FP16. For more information about implementing AMP APIs to your
training script, see the following resources:
• Frameworks - PyTorch in the NVIDIA Deep Learning Performance documentation
• Frameworks - TensorFlow in the NVIDIA Deep Learning Performance documentation
• Automatic Mixed Precision for Deep Learning in the NVIDIA Developer Docs
• Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs in the
PyTorch Blog
• TensorFlow mixed precision APIs in the TensorFlow documentation
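The following is a minimal sketch of the framework-level AMP modifications for PyTorch, using the
native torch.cuda.amp APIs referenced above; the model, optimizer, and data are illustrative
placeholders, not part of the SageMaker library.

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Illustrative model, optimizer, and loss; any module and data loader would do
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 128, device="cuda")
    labels = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with autocast():                # run the forward pass in mixed precision
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)          # unscale gradients and step the optimizer
    scaler.update()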
Optimize Distributed Training
• Reduce the NLP sequence length. If you need to increase the sequence length, adjust the batch size
down, or increase the number of GPUs to spread the batch.
• Reduce image resolution.
Check whether you use batch normalization, since this can impact convergence. When you use distributed
training, your batch is split across GPUs, and the effect of a much lower per-GPU batch size can be a
higher error rate, preventing the model from converging. For example, if you prototyped your network on
a single GPU with a batch size of 64, then scaled up to four p3dn.24xlarge instances, you now have 32
GPUs and your per-GPU batch size drops from 64 to 2. This will likely break the convergence you saw with
a single node.
To learn more about the SageMaker distributed training libraries, see the following sections.
Batch Size
SageMaker distributed toolkits generally allow you to train on bigger batches. For example, if a model
fits within a single device but can only be trained with a small batch size, using either model-parallel
training or data parallel training enables you to experiment with larger batch sizes.
Be aware that batch size directly influences model accuracy by controlling the amount of noise in the
model update at each iteration. Increasing batch size reduces the amount of noise in the gradient
estimation, which can be beneficial when increasing from very small batch sizes, but can result in
degraded model accuracy as the batch size grows to large values.
Tip
Adjust your hyperparameters to ensure that your model trains to a satisfying convergence as
you increase its batch size.
A number of techniques have been developed to maintain good model convergence when the batch size
is increased.
Mini-Batch Size
In SGD, the mini-batch size quantifies the amount of noise present in the gradient estimation. A small
mini-batch results in a very noisy mini-batch gradient, which is not representative of the true gradient
over the dataset. A large mini-batch results in a mini-batch gradient close to the true gradient over the
dataset, but potentially not noisy enough, making it likely to stay locked in irrelevant minima.
Scenarios
The following sections cover scenarios in which you may want to scale up training, and how you can do
so using AWS resources.
Instance type    Number of GPUs
p3.2xlarge       1
p3.8xlarge       4
p3.16xlarge      8
p3dn.24xlarge    8
Note
The ml instance types used by SageMaker training have the same number of GPUs as the
corresponding p3 instance types. For example, ml.p3.8xlarge has the same number of GPUs
as p3.8xlarge: 4.
If you have made the jump from a single GPU on a p3.2xlarge to four GPUs on a p3.8xlarge, but
decide that you require more processing power, you may see better performance and incur lower costs
if you choose a p3.16xlarge before trying to increase instance count. Depending on the libraries you
use, when you keep your training on a single instance, performance is better and costs are lower than a
scenario where you use multiple instances.
When you are ready to scale the number of instances, you can do this with the SageMaker Python
SDK estimator by setting the instance_count parameter. For example, you can set
instance_type="ml.p3.16xlarge" and instance_count=2. Instead of the eight GPUs on a single
p3.16xlarge, you have 16 GPUs across two identical instances. The following chart shows scaling and
throughput starting with eight GPUs on a single instance and increasing to 64 instances for a total of
256 GPUs.
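A minimal sketch of this scale-out with the SageMaker Python SDK follows; the entry point, role, and
framework versions are placeholders.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # placeholder training script
    role="SageMakerRole",            # placeholder IAM role
    framework_version="1.12.0",
    py_version="py38",
    instance_type="ml.p3.16xlarge",  # 8 GPUs per instance
    instance_count=2,                # 16 GPUs across two identical instances
)
estimator.fit("s3://bucket/path/to/training/data")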
SageMaker's Data Parallelism Library
First, your instances need to be in the same Region and same Availability Zone. For example, instances in
us-west-2 must all be in us-west-2a. When you use the SageMaker Python SDK, this is handled for
you. If you use Amazon EC2 and orchestrate your own training clusters, you need to be aware of this, or
your training speeds suffer.
Your training data should also be in the same Availability Zone. When you use a SageMaker estimator,
you pass in the Region and the S3 bucket, and if the data is not in the Region you set, you get an error.
SageMaker also supports Horovod and implementations of distributed training native to each major
deep learning framework. If you choose to use examples from these frameworks, you can follow
SageMaker’s container guide for Deep Learning Containers, and various example notebooks that
demonstrate implementations.
When training a model on a large amount of data, machine learning practitioners often turn to
distributed training to reduce the time to train. In some cases, where time is of the essence, the business
requirement is to finish training as quickly as possible or at least within a constrained time period. Then,
distributed training is scaled to use a cluster of multiple nodes—not just multiple GPUs in a computing
instance, but multiple instances with multiple GPUs. As the cluster size increases, however, performance
can drop significantly. This drop in performance is primarily caused by the communications overhead
between nodes in a cluster.
To resolve such overhead problems, SageMaker offers two distributed training options: SageMaker
model parallelism and SageMaker data parallelism. This guide focuses on how to train models using the
SageMaker data parallelism library.
• The library optimizes your training job for AWS network infrastructure and Amazon EC2 instance
topology.
• The library takes advantage of gradient updates to communicate between nodes with a custom
AllReduce algorithm.
To track the latest updates of the library, see the SageMaker Distributed Data Parallel Release Notes in
the SageMaker Python SDK documentation.
For more information about training with a model-parallel strategy, see SageMaker's Model Parallelism
Library (p. 1864).
Topics
• Introduction to SageMaker's Distributed Data Parallel Library (p. 1832)
• Supported Frameworks, AWS Regions, and Instances Types (p. 1834)
• Run a SageMaker Distributed Training Job with Data Parallelism (p. 1839)
• SageMaker Distributed Data Parallel Configuration Tips and Pitfalls (p. 1858)
• Amazon SageMaker Data Parallel Library FAQ (p. 1860)
• Data Parallel Troubleshooting (p. 1862)
1. The library performs AllReduce, a key operation during distributed training that is responsible for a
large portion of communication overhead.
2. The library performs optimized node-to-node communication by fully utilizing AWS’s network
infrastructure and Amazon EC2 instance topology.
Use this data parallel library to increase speed by up to 25% in training models such as BERT. While
implementations like Horovod offer sub-linear performance at scale, this library offers near-linear
performance at scale. This means that you get a faster training time and a lower cost to train a model.
Note
The SageMaker distributed training libraries are available only through the AWS deep learning
containers for the TensorFlow, PyTorch, and HuggingFace frameworks within the SageMaker
training platform. To use the libraries, you must use the SageMaker Python SDK or the
SageMaker APIs through SDK for Python (Boto3) or AWS Command Line Interface. Throughout
the documentation, instructions and examples focus on how to use the distributed training
libraries with the SageMaker Python SDK.
Training Benchmarks
PyTorch with SageMaker's data parallel library
• BERT: When used with PyTorch, the SageMaker library is 41%, 52%, and 13% faster than PyTorch-
DDP.
• MaskRCNN: When used with PyTorch, the SageMaker library is 4%, 19%, and 15% faster than PyTorch-
DDP.
These benchmarks were run on PyTorch v1.6 using ml.p3dn.24xlarge instances. You can find the
training code on the SageMaker examples website. The examples website also has benchmark training
code for these models using TensorFlow 2.3.
One key disadvantage of traditional parameter servers is their suboptimal use of available network
bandwidth. Parameter servers treat variables as atomic units and place each variable on one server.
Since gradients become available sequentially during the backward pass, at any given instant, there
is imbalance in the volume of data being sent and received from different servers. Some servers are
receiving and sending more data, some less, and some none. This problem becomes worse as the number
of parameter servers increases.
The library addresses these problems by introducing balanced fusion buffers. A balanced fusion buffer is a
buffer in the GPU that holds the gradients until the size of the buffer exceeds a threshold. In a setup with
N parameter servers, when the buffer exceeds the threshold, the balanced fusion buffer is copied to CPU
memory, sharded into N parts, and the ith part is sent to the ith parameter server. Each server receives
exactly the same number of bytes from a balanced fusion buffer. The ith server receives the ith partition
of the balanced fusion buffer from all workers, sums them up, and sends the results back to all workers.
Since all the servers participate equally in averaging each balanced fusion buffer, server bandwidth is
efficiently utilized.
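The following toy NumPy sketch simulates the sharding arithmetic described above; the worker count,
server count, and buffer size are illustrative, and this is not the library's implementation.

import numpy as np

num_workers, num_servers = 8, 4
# Each worker holds a fused gradient buffer of equal size
buffers = [np.random.randn(1024).astype(np.float32) for _ in range(num_workers)]

# Server i receives shard i of every worker's buffer, so each server
# moves exactly the same number of bytes
reduced = []
for i in range(num_servers):
    shard_i = [np.array_split(b, num_servers)[i] for b in buffers]
    reduced.append(np.mean(shard_i, axis=0))  # server i averages its partition

# Concatenating the reduced shards reconstructs the averaged fused buffer,
# which the servers send back to all workers
avg_buffer = np.concatenate(reduced)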
• Leverages CPUs: The library uses CPUs to AllReduce gradients, offloading this task from the GPUs.
• Improved GPU usage: The cluster’s GPUs focus on computing gradients, improving their utilization
throughout training.
• The size of the global batch is (number of nodes in a cluster) * (number of GPUs per
node) * (size of a batch shard per GPU).
• A batch shard (small batch) is a subset of the dataset assigned to each GPU (worker) per iteration.
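For example, a cluster of 3 nodes with 8 GPUs per node and a batch shard of 32 samples per GPU yields
a global batch size of 3 * 8 * 32 = 768.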
3. The library launches a training script on each worker.
4. The library manages copies of model weights and gradients from the workers at the end of every
iteration.
5. The library synchronizes model weights and gradients across the workers to aggregate a single trained
model.
The following architecture diagram shows an example of how the library sets up data parallelism for a
cluster of 3 nodes.
To start using the SageMaker distributed data parallel library, see Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847) to set up a SageMaker estimator
through Amazon SageMaker Python SDK, and Run a SageMaker Distributed Training Job with Data
Parallelism (p. 1839) to adapt your training script using the SageMaker distributed data parallel library.
Supported Frameworks
The following tables show the deep learning frameworks and their versions that SageMaker and the
SageMaker data parallelism library support. The SageMaker data parallelism library is available in AWS
Deep Learning Containers (DLC) or downloadable as a binary file.
Note
To check the latest updates and release notes of the library, see also the SageMaker Data
Parallel Release Notes in the SageMaker Python SDK documentation.
Topics
• PyTorch (p. 1835)
• PyTorch Lightning (p. 1837)
• TensorFlow (p. 1837)
• Hugging Face Transformers (p. 1838)
PyTorch
Note
The SageMaker data parallelism library v1.4.0 and later works as a backend of PyTorch
distributed. In accordance with the change, the following smdistributed APIs for the PyTorch
distributed package are deprecated.
If you need to use the previous versions of the library (v1.3.0 or before), see the archived
SageMaker data parallelism library documentation in the SageMaker Python SDK
documentation.
** The URLs of the binary files are for installing the SageMaker data parallelism library in custom
containers. For more information, see Create Your Own Docker Container with the SageMaker Distributed
Data Parallel Library (p. 1850).
PyTorch Lightning
Note
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the
PyTorch DLCs. When you construct a SageMaker PyTorch estimator and submit a training job
request in Step 2, you need to provide requirements.txt to install pytorch-lightning
and lightning-bolts in the SageMaker PyTorch training container.
# requirements.txt
pytorch-lightning
lightning-bolts
For more information about specifying the source directory to place the requirements.txt
file along with your training script and a job submission, see Using third-party libraries in the
Amazon SageMaker Python SDK documentation.
TensorFlow
AWS Regions
The SageMaker data parallelism library is available in all of the AWS Regions where the AWS Deep
Learning Containers for SageMaker are in service. For more information, see Available Deep Learning
Containers Images.
Instance types:
• ml.p3.16xlarge
• ml.p3dn.24xlarge
• ml.p4d.24xlarge
• ml.p4de.24xlarge
For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance
Types page. For information about instance pricing, see Amazon SageMaker Pricing.
If you encounter a resource limit error message, follow the instructions at Request a service quota
increase for SageMaker resources.
• SageMaker Python SDK with the library API – In most cases, all you have to change in your training
script is the data parallel library import statements. Swap these out with the SageMaker data parallel
library equivalents.
• Focus on your model training without infrastructure management – When training a deep learning
model with the library on SageMaker, you can focus on writing your training script and model training.
You can run a training job using estimator classes provided by the SageMaker Python SDK. The
estimator classes help prepare ML instances, load datasets from specified data resources, submit the
training job using your training script, and shut down the instances after the training job is completed.
To begin, you need to adapt TensorFlow or PyTorch training scripts to use the library. The following
topics provide instructions on how to modify your training script.
Topics
• Step 1: Modify Your Own Training Script (p. 1839)
• Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK (p. 1847)
The training script examples provided in these sections are simplified and designed to highlight the
required changes you must make to use the library. For end-to-end, runnable notebook examples that
demonstrate how to use a TensorFlow or PyTorch training script with the SageMaker distributed data
parallel library, see Amazon SageMaker Distributed Training Notebook Examples (p. 1942).
Topics
• Modify a TensorFlow Training Script (p. 1839)
• Modify a PyTorch Training Script (p. 1842)
• Modify a PyTorch Lightning Script (p. 1845)
The library APIs are designed to be similar to Horovod APIs. For additional details on each API
that the library offers for TensorFlow, see the SageMaker distributed data parallel TensorFlow API
documentation.
Note
SageMaker distributed data parallel is adaptable to TensorFlow training scripts composed of tf
core modules except tf.keras modules. SageMaker distributed data parallel does not support
TensorFlow with Keras implementation.
Note
SageMaker's distributed data parallelism library supports Automatic Mixed Precision (AMP)
out of the box. No extra action is needed to enable AMP other than the framework-level
modifications to your training script. If gradients are in FP16, the SageMaker data parallelism
library runs its AllReduce operation in FP16. For more information about implementing AMP
APIs to your training script, see the resources listed earlier in this topic.
import smdistributed.dataparallel.tensorflow as sdp

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')
3. Scale the learning rate by the number of workers. The sdp.tensorflow.size() API provides you
the number of workers in the cluster. This is invoked in the following code block as sdp.size().
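For example, with an illustrative base learning rate for the Adam optimizer:

# Scale an illustrative base learning rate by the number of workers
opt = tf.optimizers.Adam(0.000125 * sdp.size())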
5. Broadcast the initial model variables from the leader node (rank 0) to all the worker nodes (ranks
1 through n). This is needed to ensure a consistent initialization across all the worker ranks. Use
the sdp.tensorflow.broadcast_variables API after the model and optimizer variables are
initialized. This is invoked in the following code block as sdp.broadcast_variables().
sdp.broadcast_variables(model.variables, root_rank=0)
sdp.broadcast_variables(opt.variables(), root_rank=0)
6. Finally, modify your script to save checkpoints only on the leader node. The leader node has a
synchronized model. This also avoids worker nodes overwriting the checkpoints and possibly
corrupting the checkpoints.
if sdp.rank() == 0:
    checkpoint.save(checkpoint_dir)
The following is an example TensorFlow training script for distributed training with the library.
import tensorflow as tf

# SageMaker data parallel: Import the library's TF API and initialize it
import smdistributed.dataparallel.tensorflow as sdp
sdp.init()

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    # SageMaker data parallel: Pin GPUs to a single library process
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Prepare Dataset
dataset = tf.data.Dataset.from_tensor_slices(...)

# Define Model
mnist_model = tf.keras.Sequential(...)
loss = tf.losses.SparseCategoricalCrossentropy()

# SageMaker data parallel: Scale the learning rate by the number of workers
opt = tf.optimizers.Adam(0.000125 * sdp.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # SageMaker data parallel: Wrap the tape with the library's DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    if first_batch:
        # SageMaker data parallel: Broadcast model and optimizer variables
        sdp.broadcast_variables(mnist_model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value

...
After you have completed adapting your training script, move on to Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847).
If you need to use the previous versions of the library (v1.3.0 or before), see the archived
SageMaker distributed data parallel library documentation in the SageMaker Python SDK
documentation.
Use the SageMaker Distributed Data Parallel Library as the Backend of torch.distributed
To use the SageMaker distributed data parallel library, the only thing you need
to do is to import the SageMaker distributed data parallel library’s PyTorch client
(smdistributed.dataparallel.torch.torch_smddp). The client registers smddp as
a backend for PyTorch. When you initialize the PyTorch distributed process group using the
torch.distributed.init_process_group API, make sure you specify 'smddp' to the backend
argument.
import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist
dist.init_process_group(backend='smddp')
Note
The smddp backend currently does not support creating subprocess groups with the
torch.distributed.new_group() API. You cannot use the smddp backend concurrently
with other process group backends such as NCCL and Gloo.
If you already have a working PyTorch script and only need to add the backend specification, you can
proceed to Using the SageMaker Framework Estimators For PyTorch and TensorFlow (p. 1847) in the
Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK (p. 1847) topic.
If you still need to modify your training script to properly use the PyTorch distributed package, follow
the rest of the procedures on this page.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
2. After parsing arguments and defining a batch size parameter (for example,
batch_size=args.batch_size), add two lines of code to resize the batch size per worker (GPU).
PyTorch's DataLoader operation does not automatically handle the batch resizing for distributed
training.
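A minimal sketch of that resizing, assuming args.batch_size holds the global batch size:

batch_size = args.batch_size
# PyTorch's DataLoader does not resize the batch for distributed training
batch_size //= dist.get_world_size()
batch_size = max(batch_size, 1)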
3. Pin each GPU to a single SageMaker data parallel library process with local_rank—this refers to
the relative rank of the process within a given node.
You can retrieve the rank of the process from the LOCAL_RANK environment variable.
import os

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
from torch.utils.data.distributed import DistributedSampler

model = ...
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank()
)
6. Modify your script to save checkpoints only on the leader process (rank 0). The leader process has
a synchronized model. This also avoids other processes overwriting the checkpoints and possibly
corrupting the checkpoints.
if dist.get_rank() == 0:
    torch.save(...)
The following example code shows the structure of a PyTorch training script with smddp as the backend.
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torch.distributed as dist
from torch.optim.lr_scheduler import StepLR

# SageMaker data parallel: Import the library's PyTorch client to register
# smddp as a torch.distributed backend, and import the DDP wrapper
import smdistributed.dataparallel.torch.torch_smddp
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(nn.Module):
    ...
    # Define model

def train(...):
    ...
    # Model training

def test(...):
    ...
    # Model evaluation

def main():
    # SageMaker data parallel: Initialize the process group with the smddp backend
    dist.init_process_group(backend='smddp')
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda")

    # Prepare dataset
    train_dataset = torchvision.datasets.MNIST(...)
    train_loader = torch.utils.data.DataLoader(..)

    # SageMaker data parallel: Wrap the PyTorch model with the library's DDP
    model = DDP(Net().to(device))

    # Train
    optimizer = optim.Adadelta(...)
    scheduler = StepLR(...)
    for epoch in range(1, args.epochs + 1):
        train(...)
        if rank == 0:
            test(...)
        scheduler.step()

if __name__ == '__main__':
    main()
After you have completed adapting your training script, proceed to Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847).
If you want to bring your PyTorch Lightning training script and run a distributed data parallel training job
in SageMaker, you can run the training job with minimal changes in your training script. The necessary
changes include the following: import the smdistributed.dataparallel library’s PyTorch modules,
set up the environment variables for PyTorch Lightning to accept the SageMaker environment variables
that are preset by the SageMaker training toolkit, and activate the SageMaker data parallel library by
setting the process group backend to "smddp". To learn more, walk through the following instructions
that break down the steps with code examples.
Note
The PyTorch Lightning support is available in the SageMaker data parallel library v1.5.0 and
later.
import pytorch_lightning as pl
import smdistributed.dataparallel.torch.torch_smddp
2. Set the world size and the rank for the LightningEnvironment class object. When launching a
training job in SageMaker, the SageMaker training toolkit sets up the environment variables "RANK",
"LOCAL_RANK", and "WORLD_SIZE". These environment variables represent the processes' global
ranks, their local ranks, and the world size, respectively. Use these SageMaker environment variables
to configure the LightningEnvironment.
import os
from pytorch_lightning.plugins.environments.lightning_environment \
import LightningEnvironment
env = LightningEnvironment()
env.world_size = lambda: int(os.environ["WORLD_SIZE"])
env.global_rank = lambda: int(os.environ["RANK"])
3. Set distributed training strategy using the PyTorch Lightning DDPStrategy module, create a PyTorch
Lightning Trainer object, and adapt them to use the SageMaker data parallel library.
Create an object (ddp in the following code example) of the DDPStrategy class, and specify "smddp"
to the process_group_backend parameter. When configuring a PyTorch Lightning Trainer object,
use the SageMaker environment variables to specify the scale of the GPU cluster and the ddp object
to set up the distributed training strategy.
Note
We recommend that you check the versions of PyTorch Lightning tested for compatibility
with the SageMaker data parallel library in the section called “Supported Frameworks and
AWS Regions” (p. 1872).
from pytorch_lightning.strategies import DDPStrategy

ddp = DDPStrategy(
    cluster_environment=env,
    process_group_backend="smddp",
    accelerator="gpu"
)

world_size = int(os.environ["WORLD_SIZE"])
num_gpus = int(os.environ["SM_NUM_GPUS"])
num_nodes = int(world_size/num_gpus)

trainer = pl.Trainer(
    devices=num_gpus,
    num_nodes=num_nodes,
    max_epochs=10,
    strategy=ddp
)
If you are using DDPPlugin, which is a deprecated functionality, set the distributed strategy as shown
in the following code example.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "smddp"
ddp = DDPPlugin(
parallel_devices=[torch.device("cuda", d) for d in range(num_gpus)],
cluster_environment=env
)
world_size = int(os.environ["WORLD_SIZE"])
num_gpus = int(os.environ["SM_NUM_GPUS"])
num_nodes = int(world_size/num_gpus)
trainer = pl.Trainer(
gpus=num_gpus,
num_nodes=num_nodes,
max_epochs=10,
strategy=ddp
)
4. Run trainer.fit to start the training job of a PyTorch model. The following code example shows
a PyTorch model object wrapped by the PyTorch Lightning Trainer’s fit method with the PyTorch
Lightning MNIST data module.
trainer.fit(model, datamodule=MNISTDataModule(batch_size=32))
After you have completed adapting your training script, proceed to Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847).
Note
When you construct a SageMaker PyTorch estimator and submit a training job request in Step
2, you need to provide requirements.txt to install pytorch-lightning and lightning-
bolts in the SageMaker PyTorch training container.
# requirements.txt
pytorch-lightning
lightning-bolts
For more information about specifying the source directory to place the requirements.txt
file along with your training script and a job submission, see Using third-party libraries in the
Amazon SageMaker Python SDK documentation.
• If you want to achieve a quick adoption of your distributed training job in SageMaker, configure a
SageMaker PyTorch or TensorFlow framework estimator class. The framework estimator picks up your
training script and automatically matches the right image URI of the pre-built PyTorch or TensorFlow
Deep Learning Containers (DLC), given the value specified to the framework_version parameter.
• If you want to extend one of the pre-built containers or build a custom container to create your own
ML environment with SageMaker, use the SageMaker generic Estimator class and specify the image
URI of the custom Docker container hosted in your Amazon Elastic Container Registry (Amazon ECR).
Your training datasets should be stored in Amazon S3 or Amazon FSx for Lustre in the AWS Region in
which you are launching your training job. If you use Jupyter notebooks, you should have a SageMaker
notebook instance or a SageMaker Studio app running in the same AWS Region. For more information
about storing your training data, see the SageMaker Python SDK data inputs documentation.
Tip
We highly recommend that you use Amazon FSx for Lustre instead of Amazon S3 to increase
training performance. Amazon FSx has higher throughput and lower latency than Amazon S3.
Choose one of the following topics for instructions on how to run your TensorFlow or PyTorch training
scripts. After you launch a training job, you can monitor system utilization and model performance using
Debug and Profile Training Jobs Using Amazon SageMaker Debugger (p. 1649) or Amazon CloudWatch.
While you follow instructions in the following topics to learn more about technical details, we also
recommend that you try the Amazon SageMaker Distributed Training Notebook Examples (p. 1942) to
get started.
Topics
• Using the SageMaker Framework Estimators For PyTorch and TensorFlow (p. 1847)
• Using the SageMaker Generic Estimator to Extend Prebuilt Containers (p. 1849)
• Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library (p. 1850)
SageMaker PyTorch
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="sub-folder-for-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py38",
    framework_version="1.12.0",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Activate distributed training with the SageMaker data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}}
)

pt_estimator.fit("s3://bucket/path/to/training/data")
Note
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the
SageMaker PyTorch DLCs. Create the following requirements.txt file and save in the
source directory where you save the training script.
# requirements.txt
pytorch-lightning
lightning-bolts
For example, the tree-structured directory should look like the following.
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt
For more information about specifying the source directory to place the
requirements.txt file along with your training script and a job submission, see Using
third-party libraries in the Amazon SageMaker Python SDK documentation.
SageMaker TensorFlow
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.9.1",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Activate distributed training with the SageMaker data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}}
)

tf_estimator.fit("s3://bucket/path/to/training/data")
The following two parameters of the SageMaker framework estimator are required to activate
SageMaker data parallelism.
distribution (dict): A dictionary with information on how to run distributed training (default: None).
distribution = {
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
• If you use the smdistributed distribution strategy with dataparallel, you must use one of the
following instance types: ml.p4d.24xlarge, ml.p3dn.24xlarge, or ml.p3.16xlarge. For best
performance, we recommend that you use an EFA-enabled instance type, which is ml.p3dn.24xlarge
or ml.p4d.24xlarge.
To extend a prebuilt container or adapt your own container to use the library, you must use one of the
images listed in Supported Frameworks (p. 1834).
Important
Starting with TensorFlow 2.4.1 and PyTorch 1.8.1, the framework DLCs support EFA-enabled
instance types (ml.p3dn.24xlarge, ml.p4d.24xlarge). We recommend that you use the DLC
images that contain TensorFlow 2.4.1 or later and PyTorch 1.8.1 or later.
For example, if you use PyTorch, your Dockerfile should contain a FROM statement similar to the
following:

# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/pytorch-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# /opt/ml and all subdirectories are utilized by SageMaker; use the /code subdirectory to store your user code.
COPY cifar10.py /opt/ml/code/cifar10.py
You can further customize your own Docker container to work with SageMaker using the SageMaker
Training toolkit and the binary file of the SageMaker distributed data parallel library. To learn more, see
the instructions in the following section.
Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library
To build your own Docker container for training and use the SageMaker data parallel library, you must
include the correct dependencies and the binary files of the SageMaker distributed parallel libraries
in your Dockerfile. This section provides instructions on how to create a complete Dockerfile with the
minimum set of dependencies for distributed training in SageMaker using the data parallel library.
Note
This custom Docker option with the SageMaker data parallel library as a binary is available only
for PyTorch.
To create a Dockerfile with the SageMaker training toolkit and the data parallel library
1. Start with a Docker image from NVIDIA CUDA. Use the cuDNN developer versions, which contain the
CUDA runtime and development tools (headers and libraries), to build PyTorch from source.
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
Tip
The official AWS Deep Learning Container (DLC) images are built from the NVIDIA CUDA base
images. If you want to use the prebuilt DLC images as references while following the rest of
the instructions, see the AWS Deep Learning Containers for PyTorch Dockerfiles.
2. Add the following arguments to specify versions of PyTorch and other packages. Also, indicate the
Amazon S3 bucket paths to the SageMaker data parallel library and other software to use AWS
resources, such as the Amazon S3 plug-in.
To use versions of the third party libraries other than the ones provided in the following code
example, we recommend you look into the official Dockerfiles of AWS Deep Learning Container for
PyTorch to find versions that are tested, compatible, and suitable for your application.
To find URLs for the SMDATAPARALLEL_BINARY argument, see the look up tables at Supported
Frameworks (p. 1834).
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws
3. Set the following environment variables to properly build SageMaker training components and run
the data parallel library. You use these variables for the components in the subsequent steps.
4. Install or update curl, wget, and git to download and build packages in the subsequent steps.
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
    curl \
    wget \
    git \
    && rm -rf /var/lib/apt/lists/*
5. Install Elastic Fabric Adapter (EFA) software for Amazon EC2 network communication.
7. Get, build, and install PyTorch and its dependencies. We build PyTorch from the source code because
we need to have control of the NCCL version to guarantee compatibility with the AWS OFI NCCL plug-
in.
a. Following the steps in the PyTorch official dockerfile, install build dependencies and set up ccache
to speed up recompilation.
RUN DEBIAN_FRONTEND=noninteractive \
    apt-get install -y --no-install-recommends \
    build-essential \
    ca-certificates \
    ccache \
    cmake \
    git \
    libjpeg-dev \
    libpng-dev \
    && rm -rf /var/lib/apt/lists/*
# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
RUN --mount=type=cache,target=/opt/ccache \
    cd / \
    && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
d. Install and build a specific NCCL version. To do this, replace the content in PyTorch's default
NCCL folder (/pytorch/third_party/nccl) with the specific NCCL version from the NVIDIA
repository. The NCCL version was set in step 3 of this guide.

RUN cd /pytorch/third_party/nccl \
    && rm -rf nccl \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    && make pkg.txz.build \
    && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
e. Build and install PyTorch. This process usually takes slightly more than one hour to complete. It is
built using the NCCL version downloaded in a previous step.
RUN cd /pytorch \
    && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install \
    && rm -rf /pytorch
8. Build and install AWS OFI NCCL plugin. This enables libfabric support for the SageMaker data parallel
library.
&& make \
&& make install \
&& rm -rf /tmp/efa-ofi-nccl
10. Install and configure OpenSSH. OpenSSH is required for MPI to communicate between containers.
Allow OpenSSH to talk to containers without asking for confirmation.
12. Install the libboost library. This package is needed for the asynchronous IO networking functionality
of the SageMaker data parallel library.
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
    && tar -xzf boost_1_73_0.tar.gz \
    && cd boost_1_73_0 \
    && ./bootstrap.sh \
    && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
    && cd .. \
    && rm -rf boost_1_73_0.tar.gz \
    && rm -rf boost_1_73_0 \
    && cd ${CONDA_PREFIX}/include/boost
WORKDIR /root
RUN pip install --no-cache-dir -U \
    smclarify \
    "sagemaker>=2,<3" \
    sagemaker-experiments==0.* \
    sagemaker-pytorch-training
14. Finally, install the SageMaker data parallel binary and the remaining dependencies.

RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
    jq \
    libhwloc-dev \
    libnuma1 \
    libnuma-dev \
    libssl1.1 \
    libtool \
    hwloc \
    && rm -rf /var/lib/apt/lists/*
15. After you finish creating the Dockerfile, see Adapting Your Own Training Container to learn how to
build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python
SDK.
The following example code shows a complete Dockerfile after combining all the previous code blocks.
# This file creates a docker image with minimum dependencies to run SageMaker data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
ENV DLC_CONTAINER_TYPE=training
# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
    && cd /tmp/efa \
    && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
    && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod -g \
    && rm -rf /tmp/efa
# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p $CONDA_PREFIX && \
    rm ~/miniconda.sh && \
    $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
    $CONDA_PREFIX/bin/conda clean -ya
# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
    apt-get install -y --no-install-recommends \
    build-essential \
    ca-certificates \
    ccache \
    cmake \
    git \
    libjpeg-dev \
    libpng-dev && \
    rm -rf /var/lib/apt/lists/*
# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
    cd / \
    && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.
# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
    && rm -rf nccl \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    && make pkg.txz.build \
    && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
RUN ccache -C
# Install OpenSSH.
# Required for MPI to communicate between containers; allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
    && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    && apt-get install -y --no-install-recommends openssh-client openssh-server \
    && mkdir -p /var/run/sshd \
    && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
    && rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other
# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
Tip
For more general information about creating a custom Dockerfile for training in SageMaker, see
Use Your Own Training Algorithms.
Tip
If you want to extend the custom Dockerfile to incorporate the SageMaker model parallel
library, see Create Your Own Docker Container with the SageMaker Distributed Model Parallel
Library (p. 1925).
SageMaker Distributed Data Parallel Configuration Tips and Pitfalls

Topics
• Data Preprocessing (p. 1858)
• Single Versus Multiple Nodes (p. 1858)
• Debug Scaling Efficiency with Debugger (p. 1858)
• Batch Size (p. 1858)
• Custom MPI Options (p. 1859)
• Use Amazon FSx and set up an optimal storage and throughput capacity (p. 1859)
Data Preprocessing
If you preprocess data during training using an external library that utilizes the CPU, you may run into a
CPU bottleneck because SageMaker distributed data parallel uses the CPU for AllReduce operations.
You may be able to improve training time by moving preprocessing steps to a library that uses GPUs or
by completing all preprocessing before training.
To see an example using Debugger in a SageMaker training job, you can reference one of the notebook
examples in the SageMaker Notebook Examples GitHub repository. To learn more about Debugger, see
Amazon SageMaker Debugger.
Batch Size
In distributed training, as more nodes are added, batch sizes should increase proportionally. To improve
convergence speed as you add more nodes to your training job and increase the global batch size,
increase the learning rate.
One way to achieve this is by using a gradual learning rate warmup where the learning rate is ramped up
from a small to a large value as the training job progresses. This ramp avoids a sudden increase of the
learning rate, allowing healthy convergence at the start of training. For example, you can use a Linear
Scaling Rule where each time the mini-batch size is multiplied by k, the learning rate is also multiplied by
k. To learn more about this technique, see the research paper, Accurate, Large Minibatch SGD: Training
ImageNet in 1 Hour, Sections 2 and 3.
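A minimal sketch of such a warmup in PyTorch, assuming illustrative values for the base rate, the
scaling factor k, and the warmup length:

import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # illustrative model
base_lr = 0.1
k = 8                       # the mini-batch size grew by 8x
target_lr = base_lr * k     # linear scaling rule
warmup_steps = 500

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)

def warmup_lr(step):
    # Ramp linearly from base_lr to target_lr over warmup_steps
    if step < warmup_steps:
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr

for step in range(1000):    # stands in for iterating over the data loader
    for group in optimizer.param_groups:
        group["lr"] = warmup_lr(step)
    # ... forward pass, backward pass, and optimizer.step() go here ...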
Custom MPI Options

You can set custom MPI options using the custom_mpi_options parameter in the Estimator.
Any mpirun flags passed in this field are added to the mpirun command and executed by SageMaker
for training. For example, you may define the distribution parameter of an Estimator using the
following to use the NCCL_DEBUG variable to print the NCCL version at the start of the program:
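That example, matching the distribution dictionary shown earlier in this chapter, would look like the
following:

distribution = {
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}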
Use Amazon FSx and set up an optimal storage and throughput capacity
When training a model on multiple nodes with distributed data parallelism, it is highly recommended to
use FSx for Lustre. Amazon FSx is a scalable, high-performance storage service that supports shared
file storage with high throughput. Using Amazon FSx storage at scale, you can achieve a faster data
loading speed across the compute nodes.
Typically, with distributed data parallelism, you would expect that the total training throughput scales
near-linearly with the number of GPUs. However, if you use suboptimal Amazon FSx storage, the training
performance might slow down due to a low Amazon FSx throughput.
For example, if you use the SCRATCH_2 deployment type of the Amazon FSx file system with the minimum
1.2 TiB storage capacity, the I/O throughput capacity is 240 MB/s. Amazon FSx storage works in a way
that you can assign physical storage devices, and the more devices assigned, the larger the throughput
you get. The smallest storage increment for the SCRATCH_2 type is 1.2 TiB, and each increment yields a
throughput gain of 240 MB/s.
Assume that you have a model to train on a 4-node cluster over a 100 GB data set. With a given batch
size that’s optimized to the cluster, assume that the model can complete one epoch in about 30 seconds.
In this case, the minimum required I/O speed is approximately 3 GB/s (100 GB / 30 s), which is a much
higher throughput requirement than 240 MB/s. With such a limited Amazon FSx capacity, scaling
your distributed training job up to larger clusters might aggravate I/O bottleneck problems; model
training throughput might improve in later epochs as the cache builds up, but Amazon FSx throughput can
still be a bottleneck.
To alleviate such I/O bottleneck problems, you should increase the Amazon FSx storage size to obtain
a higher throughput capacity. Typically, to find an optimal I/O throughput, you may experiment with
different Amazon FSx throughput capacities, assigning an equal to or slightly lower throughput than
your estimate, until you find that it is sufficient to resolve the I/O bottleneck problems. In the case of the
preceding example, Amazon FSx storage with 2.4 GB/s throughput and 67 GB of RAM cache would
be sufficient. If the file system has an optimal throughput, the model training throughput should reach
its maximum either immediately or after the first epoch as the cache builds up.
To learn more about how to increase Amazon FSx storage and deployment types, see the following
pages in the Amazon FSx for Lustre documentation:
Q: When using the library, how are the allreduce-supporting CPU instances managed? Do I have to
create heterogeneous CPU-GPU clusters, or does the SageMaker service create extra C5s for jobs that
use the library?
The library uses the CPUs available with GPU instances. No additional C5 or CPU instances are launched;
if your SageMaker training job runs on an 8-node ml.p3dn.24xlarge cluster, only 8 ml.p3dn.24xlarge
instances are used. No additional instances are provisioned.
Q: I have a training job taking 5 days on a single ml.p3.24xlarge instance with a set of
hyperparameters H1 (learning rate, batch size, optimizer, etc). Is using SageMaker's data parallelism
library and a five-time bigger cluster enough to achieve an approximate five-time speedup? Or do I
have to revisit its training hyperparameters after activating the library?
The library changes the overall batch size. The new overall batch size is scaled linearly with the number
of training instances used. As a result of this, hyperparameters, such as learning rate, have to be changed
to ensure convergence.
Q: Can the library be used with managed spot training?

Yes. You can use managed spot training. You specify the path to the checkpoint file in the SageMaker
training job. You save and restore checkpoints in your training script as mentioned in the last steps of the
section called “TensorFlow” (p. 1839) and the section called “PyTorch” (p. 1842).
Q: Is the library useful for single-host, multi-device training?

The library can be used in single-host, multi-device training, but the library offers performance
improvements only in multi-host training.
Q: Where can the training dataset be stored?

The training dataset can be stored in an Amazon S3 bucket or on an Amazon FSx drive. See this
document for various supported input file systems for a training job.
Q: When using the library, is it mandatory to have training data in FSx for Lustre? Can Amazon EFS
and Amazon S3 be used?
We generally recommend you use Amazon FSx because of its lower latency and higher throughput. If you
prefer, you can use Amazon EFS or Amazon S3.
Q: What frameworks and framework versions are currently supported by the library at launch?
The library currently supports PyTorch v1.6.0 or later and TensorFlow v2.3.0 or later. It doesn't support
TensorFlow 1.x. For more information about which version of the library is packaged within AWS deep
learning containers, see Release Notes for Deep Learning Containers.
Q: Does the library support Automatic Mixed Precision (AMP)?

Yes, SageMaker's distributed data parallelism library supports Automatic Mixed Precision (AMP) out of
the box. No extra action is needed to use AMP other than the framework-level modifications to your
training script. If gradients are in FP16, the SageMaker data parallelism library runs its AllReduce
operation in FP16. For more information about implementing AMP APIs to your training script, see the
following resources:
Q: How do I identify if my distributed training job is slowed down due to I/O bottleneck?
With a larger cluster, the training job requires more I/O throughput, and therefore the training
throughput might take longer (more epochs) to ramp up to the maximum performance. This indicates
that I/O is being bottlenecked and cache is harder to build up as you scale nodes up (higher throughput
requirement and more complex network topology). For more information about monitoring the Amazon
FSx throughput on CloudWatch, see Monitoring FSx for Lustre in the FSx for Lustre User Guide.
Q: How do I resolve I/O bottlenecks when running a distributed training job with data parallelism?
We highly recommend that you use Amazon FSx as your data channel if you are using Amazon S3. If
you are already using Amazon FSx but still having I/O bottleneck problems, you might have set up your
Amazon FSx file system with a low I/O throughput and a small storage capacity. For more information
about how to estimate and choose the right size of I/O throughput capacity, see Use Amazon FSx and set
up an optimal storage and throughput capacity (p. 1859).
Q: (For the library v1.4.0 or later) How do I resolve the Invalid backend error while initializing a
process group?
If you encounter the error message ValueError: Invalid backend: 'smddp' when calling
init_process_group, this is due to the breaking change in the library v1.4.0 and later. You must
import the PyTorch client of the library, smdistributed.dataparallel.torch.torch_smddp,
which registers smddp as a backend for PyTorch. To learn more, see Use the SageMaker Distributed Data
Parallel Library as the Backend of torch.distributed (p. 1842).
Q: (For the library v1.4.0 or later) I would like to call the collective primitives of the torch.distributed
interface. Which primitives does the smddp backend support?
In v1.4.0, the library supports all_reduce, broadcast, reduce, all_gather, and barrier.
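As a minimal sketch, those collectives are called through the standard torch.distributed interface
once the smddp backend is initialized; the tensor here is illustrative.

import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the smddp backend

dist.init_process_group(backend="smddp")
t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # sums the tensor across all workers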
Q: (For the library v1.4.0 or later) Does this new API work with other custom DDP classes or libraries
like Apex DDP?
The SageMaker data parallel library is tested with other third-party distributed data parallel libraries and
framework implementations that use the torch.distributed modules. Using the SageMaker data
parallel library with custom DDP classes works as long as the collectives used by the custom DDP classes
are supported by the library. See the preceding question for a list of supported collectives. If you have
these use cases and need further support, reach out to the SageMaker team through the AWS Support
Center or AWS Developer Forums for Amazon SageMaker.
Q: Does the library support the bring-your-own-container (BYOC) option? If so, how do I install the
library and run a distributed training job by writing a custom Dockerfile?
If you want to integrate the SageMaker data parallel library and its minimum dependencies in your
own Docker container, BYOC is the right approach. You can build your own container using the binary
file of the library. The recommended process is to write a custom Dockerfile with the library and its
dependencies, build the Docker container, host it in Amazon ECR, and use the ECR image URI to launch
a training job using the SageMaker generic estimator class. For more instructions on how to prepare a
custom Dockerfile for distributed training in SageMaker with the SageMaker data parallel library, see
Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library (p. 1850).
Topics
• Using SageMaker Distributed Data Parallel with Amazon SageMaker Debugger and
Checkpoints (p. 1862)
• An Unexpected Prefix Attached to Model Parameter Keys (p. 1863)
• SageMaker Distributed Training Job Stalling During Initialization (p. 1863)
• SageMaker Distributed Training Job Stalling at the End of Training (p. 1863)
• Observing Scaling Efficiency Degradation Due to Amazon FSx Throughput Bottlenecks (p. 1864)
• SageMaker Distributed Training Job with PyTorch Returns Deprecation Warnings (p. 1864)
Using SageMaker Distributed Data Parallel with Amazon SageMaker Debugger and Checkpoints
When you use SageMaker Debugger, SageMaker distributed data parallel, and SageMaker
checkpoints, you might see an error that looks like the following example.
SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
This is due to an internal error between Debugger and checkpoints, which occurs when you enable
SageMaker distributed data parallel.
• If you enable all three features, SageMaker Python SDK automatically turns off Debugger by passing
debugger_hook_config=False, which is equivalent to the following framework estimator
example.
bucket = sagemaker.Session().default_bucket()
base_job_name = "sagemaker-checkpoint-test"
checkpoint_in_bucket = "checkpoints"

estimator = TensorFlow(
    ...
    checkpoint_s3_uri=f"s3://{bucket}/{base_job_name}/{checkpoint_in_bucket}",
    debugger_hook_config=False  # Set automatically by the SageMaker Python SDK
)
• If you want to keep using both SageMaker distributed data parallel and SageMaker Debugger, a
workaround is to manually add checkpointing functions to your training script instead of specifying
the checkpoint_s3_uri and checkpoint_local_path parameters from the estimator.
For more information about setting up manual checkpointing in a training script, see Saving
Checkpoints (p. 1939).
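As a rough illustration, a manual-checkpointing sketch in a PyTorch training script might look like the
following; the helper name and checkpoint layout are assumptions, not library APIs, and
/opt/ml/checkpoints is the default local checkpoint path in SageMaker training containers.
import os
import torch

# Illustrative helper: write a checkpoint at the end of an epoch.
def save_checkpoint(model, optimizer, epoch, checkpoint_dir="/opt/ml/checkpoints"):
    os.makedirs(checkpoint_dir, exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        os.path.join(checkpoint_dir, f"checkpoint-{epoch}.pt"),
    )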
An Unexpected Prefix Attached to Model Parameter Keys
A common workaround takes each state_dict key as a string value, splits it at the first occurrence of
'model.', and keeps the third item (with index 2) of the partitioned string, as in the following sketch.
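The one-liner below assumes the loaded checkpoint dictionary is named state_dict.
state_dict = {key.partition("model.")[2]: value for key, value in state_dict.items()}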
For more information about the prefix issue, see a discussion thread at Prefix parameter names in saved
model if trained by multi-GPU? in the PyTorch discussion forum.
For more information about the PyTorch methods for saving and loading models, see Saving & Loading
Model Across Devices in the PyTorch documentation.
SageMaker Distributed Training Job Stalling During Initialization
If your training job stalls during initialization on EFA-enabled instances, check that the security group of
the VPC subnet you use for training allows all inbound and outbound traffic to and from the security
group itself.
1. Sign in to the AWS Management Console and open the Amazon VPC console at https://console.aws.amazon.com/vpc/.
2. Choose Security Groups in the left navigation pane.
3. Select the security group that's tied to the VPC subnet you use for training.
4. In the Details section, copy the Security group ID.
5. On the Inbound rules tab, choose Edit inbound rules.
6. On the Edit inbound rules page, do the following:
a. Choose Add rule.
b. For Type, choose All traffic.
c. For Source, choose Custom, paste the security group ID into the search box, and select the security
group that pops up.
7. Choose Save rules to finish configuring the inbound rule for the security group.
8. On the Outbound rules tab, choose Edit outbound rules.
9. Repeat steps 6 and 7 to add the same rule as an outbound rule.
After you complete the preceding steps for configuring the security group with the inbound and
outbound rules, rerun the training job and verify if the stalling issue is resolved.
For more information about configuring security groups for VPC and EFA, see Security groups for your
VPC and Elastic Fabric Adapter.
SageMaker Distributed Training Job Stalling at the End of Training
The SageMaker data parallel library synchronizes gradients across workers in the backward pass to
ensure they all have the same copy of the model at the end of the batch
iteration. If the batch sizes are unevenly assigned to different worker groups during the final epoch of
training, the training job stalls. For example, while a group of workers (group A) finishes processing all
batches and exits the training loop, another group of workers (group B) starts processing another batch
and still expects communication from group A to synchronize the gradients. This causes group B to wait
for group A, which already completed training and does not have any gradients to synchronize.
Therefore, when setting up your training dataset, it is important that each worker gets the same number
of data samples so that each worker goes through the same number of batches while training. Make sure
each rank gets the same number of batches to avoid this stalling issue.
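A sketch of one way to enforce this in a PyTorch training script, assuming a map-style dataset; the
drop_last option makes the sampler drop the tail of the data so that every rank receives the same
number of samples and, hence, the same number of batches.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10))  # illustrative dataset
# Requires torch.distributed to be initialized. drop_last=True equalizes
# the number of samples assigned to each rank.
sampler = DistributedSampler(dataset, drop_last=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)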
SageMaker Distributed Training Job with PyTorch Returns Deprecation Warnings
In v1.4.0 and later, the library only needs to be imported once at the top of your training script and set
as the backend during the PyTorch distributed initialization. With the single line of backend specification,
you can keep your PyTorch training script unchanged and directly use the PyTorch distributed modules.
See Modify a PyTorch Training Script (p. 1842) to learn about the breaking changes and the new way to
use the library with PyTorch.
SageMaker's Model Parallelism Library
You can use the library to automatically partition your own TensorFlow and PyTorch models across
multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through
the SageMaker Python SDK.
Use the following sections to learn more about model parallelism and the SageMaker model parallel
library. This library's API documentation is located at Distributed Training APIs in the SageMaker Python
SDK documentation.
To track the latest updates of the library, see the SageMaker Model Parallel Release Notes in the
SageMaker Python SDK documentation.
Introduction to Model Parallelism
Training on a single GPU imposes two practical limits:
• It caps the size of the model you can train, since the memory footprint of a model scales
proportionally to the number of parameters.
• It caps the per-GPU batch size during training, driving down GPU utilization and training efficiency.
To overcome the limitations of training a model on a single GPU, SageMaker provides
the model parallel library to help distribute and train DL models efficiently on multiple compute
nodes. Furthermore, with the library, you can achieve highly optimized distributed training using EFA-
supported devices, which enhance inter-node communication with low latency, high
throughput, and OS bypass.
For a training job that uses AMP (FP16) and Adam optimizers, the required GPU memory per parameter
is about 20 bytes, which we can break down as follows:
• An FP16 copy of the parameter (2 bytes)
• An FP16 copy of the gradient (2 bytes)
• An FP32 copy of the optimizer state based on the Adam optimizers (8 bytes)
• An FP32 copy of the parameter (4 bytes)
• An FP32 copy of the gradient (4 bytes)
Even a relatively small DL model with 10 billion parameters can therefore require at least 200 GB of
memory, which is much larger than the GPU memory typically available on a single device (for example,
NVIDIA A100 with 40 GB or 80 GB, and V100 with 16 GB or 32 GB). Note that on top of the memory
requirements for model and optimizer states, there are other memory consumers, such as activations
generated in the forward pass, so the memory required in practice can be far greater than 200 GB.
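As a quick sanity check of that arithmetic (plain Python, illustrative values):
params = 10_000_000_000                  # 10 billion parameters
bytes_per_param = 2 + 2 + 8 + 4 + 4      # FP16 param/grad + FP32 Adam state/param/grad
print(params * bytes_per_param / 1e9)    # ~200 GB, before counting activations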
For distributed training, we recommend that you use Amazon EC2 P3 and P4 instances that have NVIDIA
V100 and A100 Tensor Core GPUs respectively. For more details about specifications such as CPU cores,
RAM, attached storage volume, and network bandwidth, see the Accelerated Computing section in the
Amazon EC2 Instance Types page.
Even with these accelerated computing instances, models with tens of billions of parameters, such as
Megatron-LM and T5, and even larger models with hundreds of billions of parameters, such as GPT-3,
cannot fit a model replica in each GPU device.
How the Library Employs Model Parallelism and Memory Saving Techniques
The library consists of various types of model parallelism features and memory-saving features such as
optimizer state sharding, activation checkpointing, and activation offloading. All these techniques can be
combined to efficiently train large models that consist of hundreds of billions of parameters.
Topics
• Sharded data parallelism (available for PyTorch) (p. 1866)
• Pipeline parallelism (available for PyTorch and TensorFlow) (p. 1866)
• Tensor parallelism (available for PyTorch) (p. 1868)
• Optimizer state sharding (available for PyTorch) (p. 1870)
• Activation offloading and checkpointing (available for PyTorch) (p. 1872)
• Choosing the right techniques for your model (p. 1872)
SageMaker implements sharded data parallelism through the implementation of MiCS, a library
that minimizes communication scale, as discussed in the blog post Near-linear scaling of gigantic-
model training on AWS.
You can apply sharded data parallelism to your model as a stand-alone strategy. Furthermore, if
you are using the most performant GPU instances equipped with NVIDIA A100 Tensor Core GPUs,
ml.p4d.24xlarge, you can take advantage of the improved training speed of the AllGather
operation offered by SMDDP Collectives.
To dive deep into sharded data parallelism and learn how to set it up or use a combination of sharded
data parallelism with other techniques like tensor parallelism and FP16 training, see the section called
“Sharded Data Parallelism” (p. 1876).
The library takes care of calculating the number of model replicas (also called
data_parallel_degree) given the two input parameters you provide.
For example, given eight GPUs and a pipeline parallel degree of 2, the library creates a two-way
distributed model and four-way data parallelism. The following image illustrates how
a model is distributed across the eight GPUs achieving four-way data parallelism and two-way pipeline
parallelism. Each model replica, where we define it as a pipeline parallel group and label it as PP_GROUP,
is partitioned across two GPUs. Each partition of the model is assigned to four GPUs, where the four
partition replicas are in a data parallel group and labeled as DP_GROUP. Without tensor parallelism, the
pipeline parallel group is essentially the model parallel group.
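As a rough sketch of the arithmetic (the variable names are illustrative, not library API):
instance_count, processes_per_host = 1, 8
pipeline_parallel_degree = 2
total_gpus = instance_count * processes_per_host               # 8 GPUs
data_parallel_degree = total_gpus // pipeline_parallel_degree  # 8 // 2 = 4 model replicas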
To dive deep into pipeline parallelism, see Core Features of the SageMaker Model Parallelism
Library (p. 1875).
To get started with running your model using pipeline parallelism, see Run a SageMaker Distributed
Training Job with the SageMaker Model Parallel Library.
Tensor parallelism splits individual layers, or nn.Modules, across devices, to be run in parallel. The
following figure shows the simplest example of how the library splits a model with four layers to achieve
two-way tensor parallelism ("tensor_parallel_degree": 2). The layers of each model replica are
bisected and distributed into two GPUs. In this example case, the model parallel configuration also
includes "pipeline_parallel_degree": 1 and "ddp": True (uses PyTorch DistributedDataParallel
package in the background), so the degree of data parallelism becomes eight. The library manages
communication across the tensor-distributed model replicas.
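A configuration sketch matching this example; assuming a 16-GPU cluster, eight-way data parallelism
results.
smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 1,
        "tensor_parallel_degree": 2,  # two-way tensor parallelism
        "ddp": True
    }
}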
This feature is useful in that you can apply tensor parallelism to specific layers or a subset of layers.
To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to
set a combination of pipeline and tensor parallelism, see Tensor Parallelism (p. 1890).
To understand how the library performs optimizer state sharding, consider a simple example model
with four layers. The key idea of optimizer state sharding is that you don't need to replicate your
optimizer state in all of your GPUs. Instead, a single replica of the optimizer state is sharded across
data-parallel ranks, with no redundancy across devices. For example, GPU 0 holds the optimizer state
for the first layer, GPU 1 holds it for the second layer, and so on. The following animated figure shows
a backward propagation with the optimizer state sharding technique. At the end of the backward
propagation, there's compute and network time for the optimizer apply (OA) operation to update
optimizer states and the all-gather (AG) operation to update the model parameters for the next
iteration. Most importantly, the reduce operation can overlap with the compute on GPU 0, resulting
in a more memory-efficient and faster backward propagation. In the current implementation, AG and
OA operations do not overlap with compute. It can result in an extended computation during the AG
operation, so there might be a tradeoff.
For more information about how to use this feature, see Optimizer State Sharding.
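As a sketch, the feature is activated through the shard_optimizer_state parameter in the model parallel
configuration.
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,
        "ddp": True,
        "shard_optimizer_state": True  # shard optimizer state across data-parallel ranks
    }
}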
Supported Frameworks
The SageMaker model parallelism library supports the following deep learning frameworks and is
available in AWS Deep Learning Containers (DLC) or downloadable as a binary file.
PyTorch versions supported by SageMaker and the SageMaker model parallelism library
• PyTorch v1.10.2 — smdistributed-modelparallel==v1.7.0 — DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
• PyTorch v1.10.0 — smdistributed-modelparallel==v1.5.0 — DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
• PyTorch v1.9.1 — smdistributed-modelparallel==v1.4.0 — DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
• PyTorch v1.8.1* — smdistributed-modelparallel==v1.6.0 — DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
Note
The SageMaker model parallelism library v1.6.0 and later provides extended features for
PyTorch. For more information, see Core Features of the SageMaker Model Parallelism
Library (p. 1875).
** The URLs of the binary files are for installing the SageMaker model parallelism library in custom
containers. For more information, see the section called “Create Your Own Docker Container with the
Library” (p. 1925).
TensorFlow versions supported by SageMaker and the SageMaker model parallelism library
Hugging Face Transformers versions supported by SageMaker and the SageMaker distributed data
parallel library
The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch
and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and
paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers and the Prior Hugging
Face Container Versions.
AWS Regions
The SageMaker data parallel library is available in all of the AWS Regions where the AWS Deep Learning
Containers for SageMaker are in service. For more information, see Available Deep Learning Containers
Images.
Instance type
ml.g4dn.12xlarge
ml.p3.16xlarge
ml.p3dn.24xlarge
ml.p4d.24xlarge
ml.p4de.24xlarge
For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance
Types page. For information about instance pricing, see Amazon SageMaker Pricing.
If you encounter a resource limit error message, follow the instructions at Request a service
quota increase for SageMaker resources.
When you implement model parallelism in your training job, you keep the same two-step workflow
shown in the Run a SageMaker Distributed Training Job with Model Parallelism section. To adapt
your training script, you add few or no additional lines of code. To launch a
training job with the adapted training script, you set the distribution configuration parameters to
activate the memory-saving features or to pass values for the degrees of parallelism.
To get started with examples, see the following Jupyter notebooks that demonstrate how to use the
SageMaker model parallelism library.
To dive deep into the core features of the library, see the following topics.
Note
The SageMaker distributed training libraries are available through the AWS deep learning
containers for PyTorch, Hugging Face, and TensorFlow within the SageMaker Training platform.
To utilize the features of the distributed training libraries, we recommend that you use the
SageMaker Python SDK. You can also manually configure them in JSON request syntax if you use
SageMaker APIs through SDK for Python (Boto3) or AWS Command Line Interface. Throughout
the documentation, instructions and examples focus on how to use the distributed training
libraries with the SageMaker Python SDK.
Important
The SageMaker model parallelism library supports all the core features for PyTorch, and
supports pipeline parallelism for TensorFlow.
Topics
• Sharded Data Parallelism (p. 1876)
• Pipelining a Model (p. 1887)
• Tensor Parallelism (p. 1890)
• Optimizer State Sharding (p. 1901)
• Activation Checkpointing (p. 1902)
• Activation Offloading (p. 1903)
• FP16 Training with Model Parallelism (p. 1904)
• Support for FlashAttention (p. 1906)
When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint
of the model by sharding the training state of the model over multiple GPUs. This provides two benefits:
you can fit larger models, which would otherwise run out of memory with standard data parallelism, or
you can increase the batch size using the freed-up GPU memory.
The standard data parallelism technique replicates the training states across the GPUs in the data
parallel group, and performs gradient aggregation based on the AllReduce operation. Sharded data
parallelism modifies the standard data-parallel distributed training procedure to account for the sharded
nature of the optimizer states. A group of ranks over which the model and optimizer states are sharded
is called a sharding group. The sharded data parallelism technique shards the trainable parameters of a
model and corresponding gradients and optimizer states across the GPUs in the sharding group.
SageMaker achieves sharded data parallelism through the implementation of MiCS, which is discussed
in the AWS blog post Near-linear scaling of gigantic-model training on AWS. In this implementation, you
can set the sharding degree as a configurable parameter, which must be less than the data parallelism
degree. During each forward and backward pass, MiCS temporarily recombines the model parameters
in all GPUs through the AllGather operation. After the forward or backward pass of each layer, MiCS
shards the parameters again to save GPU memory. During the backward pass, MiCS reduces gradients
and simultaneously shards them across GPUs through the ReduceScatter operation. Finally, MiCS
applies the local reduced and sharded gradients to their corresponding local parameter shards, using
the local shards of optimizer states. To bring down communication overhead, the SageMaker model
parallelism library prefetches the upcoming layers in the forward or backward pass, and overlaps the
network communication with the computation.
The training state of the model is replicated across the sharding groups. This means that before
gradients are applied to the parameters, the AllReduce operation must take place across the sharding
groups, in addition to the ReduceScatter operation that takes place within the sharding group.
In effect, sharded data parallelism introduces a tradeoff between the communication overhead and GPU
memory efficiency. Using sharded data parallelism increases the communication cost, but the memory
footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data
parallelism degree, thus larger models can be fit in the GPU cluster.
When you select a value for the degree of sharded data parallelism, the value must evenly divide the
degree of data parallelism. For example, for an 8-way data parallelism job, choose 2, 4, or 8 for the
sharded data parallelism degree. While choosing the sharded data parallelism degree, we recommend
that you start with a small number, and gradually increase it until the model fits in the memory together
with the desired batch size.
After setting up sharded data parallelism, make sure you find the most optimal training configuration
that can successfully run on the GPU cluster. For training large language models (LLMs), start from a
batch size of 1 and gradually increase it until you hit an out-of-memory (OOM) error. If you encounter
the OOM error even with the smallest batch size, apply a higher degree of sharded data parallelism or a
combination of sharded data parallelism and tensor parallelism.
Topics
• How to apply sharded data parallelism to your training job (p. 1877)
• Reference configurations (p. 1878)
• Sharded data parallelism with SMDDP Collectives (p. 1879)
• Mixed precision training with sharded data parallelism (p. 1882)
• Sharded data parallelism with tensor parallelism (p. 1883)
• Tips and considerations for using sharded data parallelism (p. 1886)
To get started with sharded data parallelism, apply the required modifications to your training script, and
set up the SageMaker PyTorch estimator with the sharded-data-parallelism-specific parameters. Also
consider using the reference configurations and example notebooks as a starting point.
Follow the instructions at Step 1: Modify a PyTorch Training Script (p. 1915) to wrap the model
and optimizer objects with the smdistributed.modelparallel.torch wrappers of the
torch.nn.parallel and torch.distributed modules.
If your model is built with torch.nn.Module and uses parameters that are not defined within the
module class, you should register them to the module manually so that SMP can gather the full
parameters during training. To register parameters to a module, use smp.register_parameter(module,
parameter).
import smdistributed.modelparallel.torch as smp

class Module(torch.nn.Module):
    def __init__(self, *args):
        super().__init__()
        self.layer1 = Layer1()
        self.layer2 = Layer2()
        # Register the externally defined parameter so SMP can gather it in full.
        smp.register_parameter(self, self.layer1.weight)
When configuring a SageMaker PyTorch estimator in the section called “Step 2: Launch a Training
Job” (p. 1921), add the parameters for sharded data parallelism.
• "sdp_reduce_bucket_size" (int, default: 5e8) – Specifies the size of PyTorch DDP gradient buckets
in number of elements of the default dtype.
• "sdp_param_persistence_threshold" (int, default: 1e6) – Specifies the size of a parameter tensor
in number of elements that can persist at each GPU. Sharded data parallelism splits each parameter
tensor across GPUs of a data parallel group. If the number of elements in the parameter tensor
is smaller than this threshold, the parameter tensor is not split; this helps reduce communication
overhead because the parameter tensor is replicated across data-parallel GPUs.
The following code shows an example of how to configure sharded data parallelism.
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        # "pipeline_parallel_degree": 1,    # Optional, default is 1
        # "tensor_parallel_degree": 1,      # Optional, default is 1
        "ddp": True,
        # Parameters for sharded data parallelism
        "sharded_data_parallel_degree": 2,            # Add this to activate sharded data parallelism
        "sdp_reduce_bucket_size": int(5e8),           # Optional
        "sdp_param_persistence_threshold": int(1e6),  # Optional
        "sdp_max_live_parameters": int(1e9),          # Optional
        "sdp_hierarchical_allgather": True,           # Optional
        "sdp_gradient_clipping": 1.0                  # Optional
    }
}

mpi_options = {
    "enabled": True,           # Required
    "processes_per_host": 8    # Required
}

smp_estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your training script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-job"
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
Reference configurations
The SageMaker distributed training team provides the following reference configurations that you
can use as a starting point. You can extrapolate from the following configurations to experiment and
estimate the GPU memory usage for your model configuration.
Model          Nodes    Instance type      Sequence length    Global batch size    Batch size per GPU    Sharded data parallel degree
GPT-NEOX-20B   2        ml.p4d.24xlarge    2048               64                   4                     16
GPT-NEOX-20B   8        ml.p4d.24xlarge    2048               768                  12                    32
For example, if you increase the sequence length for a 20-billion-parameter model or increase the
size of the model to 65 billion parameters, you need to try reducing the batch size first. If the model
still doesn’t fit with the smallest batch size (the batch size of 1), try increasing the degree of model
parallelism.
Model          Nodes    Instance type      Sequence length    Global batch size    Batch size per GPU    Sharded data parallel degree    Tensor parallel degree    Activation offloading
GPT-NEOX-65B   64       ml.p4d.24xlarge    2048               512                  8                     16                              8                         Y
GPT-NEOX-65B   64       ml.p4d.24xlarge    4096               512                  2                     64                              2                         Y
The combined use of sharded data parallelism and tensor parallelism is useful when you want to fit
a large language model (LLM) into a large-scale cluster while training on text data with a longer
sequence length. Longer sequences force a smaller batch size, so combining the two techniques helps
manage the GPU memory usage when training LLMs on longer text sequences. To learn more, see the
section called “Sharded data parallelism with tensor parallelism” (p. 1883).
For case studies, benchmarks, and more configuration examples, see the blog post New performance
improvements in Amazon SageMaker model parallel library.
The SageMaker data parallelism library offers collective communication primitives (SMDDP Collectives)
optimized for the AWS infrastructure. The collectives achieve this optimization by adopting an
all-to-all-type communication pattern that makes use of Elastic Fabric Adapter (EFA), resulting in
high-throughput and less latency-sensitive collectives, and by offloading the communication-related
processing to the CPU, freeing up GPU cycles for computation. On large clusters, SMDDP Collectives can offer improvements
in distributed training performance by up to 40% compared to NCCL. For case studies and benchmark
results, see the blog New performance improvements in the Amazon SageMaker model parallelism
library.
Note
Sharded data parallelism with SMDDP Collectives is available in the SageMaker model
parallelism library v1.13.0 and later, and the SageMaker data parallelism library v1.6.0 and
later. See also Supported configurations (p. 1880) to use sharded data parallelism with SMDDP
Collectives.
In sharded data parallelism, which is a commonly used technique in large-scale distributed training, the
AllGather collective is used to reconstitute the sharded layer parameters for forward and backward
pass computations, in parallel with GPU computation. For large models, performing the AllGather
operation efficiently is critical to avoid GPU bottleneck problems and slowdowns in training speed.
When sharded data parallelism is activated, SMDDP Collectives replace these performance-critical
AllGather collectives, improving training throughput.
When your training job has sharded data parallelism activated and meets the Supported
configurations (p. 1880), SMDDP Collectives are automatically activated. Internally, SMDDP Collectives
optimize the AllGather collective to be performant on the AWS infrastructure and fall back to NCCL
for all other collectives. Furthermore, under unsupported configurations, all collectives, including
AllGather, automatically use the NCCL backend.
Since the SageMaker model parallelism library version 1.13.0, the "ddp_dist_backend" parameter is
added to the modelparallel options. The default value for this configuration parameter is "auto",
which uses SMDDP Collectives whenever possible, and falls back to NCCL otherwise. To force the library
to always use NCCL, specify "nccl" to the "ddp_dist_backend" configuration parameter.
The following code example shows how to set up a PyTorch estimator using the sharded data parallelism
with the "ddp_dist_backend" parameter, which is set to "auto" by default and, therefore, optional
to add.
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 1,
        "ddp": True,
        "sharded_data_parallel_degree": 64,
        "bf16": True,
        "ddp_dist_backend": "auto"  # Specify "nccl" to force the NCCL backend.
    }
}

mpi_options = {
    "enabled": True,           # Required
    "processes_per_host": 8    # Required
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your training script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
Supported configurations
The AllGather operation with SMDDP Collectives is activated in training jobs when all of the following
configuration requirements are met.
SMDDP Collectives utilize additional GPU memory. There are two environment variables to configure the
GPU memory usage depending on different model training use cases.
The default values for the environment variables should work well for most use cases. We recommend
tuning these variables only if training runs into the out-of-memory (OOM) error.
The following list discusses some tuning tips to reduce the GPU memory footprint of SMDDP Collectives
while retaining the performance gain from them.
• Tuning SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES
• The AllGather input buffer size is smaller for smaller models. Hence, the required size for
SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES can be smaller for models with fewer parameters.
• The AllGather input buffer size decreases as sharded_data_parallel_degree
increases, because the model gets sharded across more GPUs. Hence, the required size for
SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES can be smaller for training jobs with large values for
sharded_data_parallel_degree.
• Tuning SMDDP_AG_SORT_BUFFER_SIZE_BYTES
• The amount of data gathered from inter-node communication is less for models with fewer
parameters. Hence, the required size for SMDDP_AG_SORT_BUFFER_SIZE_BYTES can be smaller for
such models.
Some collectives might fall back to NCCL if the buffer sizes are set too small; hence, you might not get
the performance gain from the optimized SMDDP Collectives. If additional GPU memory is available,
consider increasing the buffer sizes to keep the performance benefit.
The following code shows how you can configure the environment variables by appending them to the
custom_mpi_options string in the distribution parameter for the PyTorch estimator.

import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    ....  # All modelparallel configuration options go here
}

# Use the following two lines to tune the values of the environment variables for the buffers
mpioptions = "-x SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES=8192 "
mpioptions += "-x SMDDP_AG_SORT_BUFFER_SIZE_BYTES=8192"

mpi_options = {
    "enabled": True,           # Required
    "processes_per_host": 8,   # Required
    "custom_mpi_options": mpioptions
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your training script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo-with-tuning",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
To further save GPU memory with half-precision floating point numbers and sharded data parallelism,
you can activate 16-bit floating point format (FP16) or Brain floating point format (BF16) by adding one
additional parameter to the distributed training configuration.
Note
Mixed precision training with sharded data parallelism is available in the SageMaker model
parallelism library v1.11.0 and later.
To run FP16 training with sharded data parallelism, add "fp16": True to the smp_options
configuration dictionary. In your training script, you can choose between the static and dynamic loss
scaling options through the smp.DistributedOptimizer module. For more information, see the
section called “FP16 Training with Model Parallelism” (p. 1904).
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "fp16": True
    }
}
The sharded data parallelism feature of SageMaker supports training in the BF16 data type. The BF16
data type uses 8 bits to represent the exponent of a floating point number, while the FP16 data type
uses 5 bits. Preserving the 8 bits for the exponent keeps the same exponent representation as a 32-bit
single-precision floating point (FP32) number. This makes the conversion between FP32 and BF16
simpler and significantly less prone to the overflow and underflow issues that arise often in FP16
training, especially when training larger models. While both data types use 16 bits in total, this
increased representation range for the exponent in the BF16 format comes at the expense of reduced
precision. For training large models, this reduced precision is often considered an acceptable trade-off
for the range and training stability.
Note
Currently, BF16 training works only when sharded data parallelism is activated.
To run BF16 training with sharded data parallelism, add "bf16": True to the smp_options
configuration dictionary.
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "bf16": True
    }
}
If you use sharded data parallelism and also need to reduce the global batch size, consider combining
tensor parallelism with sharded data parallelism. When training a large model with sharded data
parallelism on a very large compute cluster (typically 128 nodes or more), even a small batch size per
GPU results in a very large global batch size, which can lead to convergence issues or low computational
performance. Reducing the batch size per GPU is sometimes not possible with sharded data parallelism
alone, because the per-GPU batch size is already at its minimum and cannot be reduced further. In such
cases, using sharded data parallelism in combination with tensor parallelism helps reduce the global
batch size.
Choosing the optimal sharded data parallel and tensor parallel degrees depends on the scale of the
model, the instance type, and the global batch size that is reasonable for the model to converge. We
recommend that you start from a low tensor parallel degree to fit the global batch size into the compute
cluster to resolve CUDA out-of-memory errors and achieve the best performance. See the following two
example cases to learn how the combination of tensor parallelism and sharded data parallelism helps
you adjust the global batch size by grouping GPUs for model parallelism, resulting in a lower number of
model replicas and a smaller global batch size.
Note
This feature is available from the SageMaker model parallelism library v1.15, and supports
PyTorch v1.13.1.
Note
This feature is available for the models supported by the library's tensor parallelism
functionality. To find the list of supported models, see Support for Hugging Face
Transformer Models. Also note that you need to pass tensor_parallelism=True to the
smp.model_creation argument while modifying your training script. To learn more, see the
training script train_gpt_simple.py in the SageMaker Examples GitHub repository.
Example 1
Assume that we want to train a model over a cluster of 1536 GPUs (192 nodes with 8 GPUs in each),
setting the degree of sharded data parallelism to 32 (sharded_data_parallel_degree=32) and
the batch size per GPU to 1, where each batch has a sequence length of 4096 tokens. In this case, there
are 1536 model replicas, the global batch size becomes 1536, and each global batch contains about 6
million tokens.
Adding tensor parallelism to it can lower the global batch size. One configuration example can be setting
the tensor parallel degree to 8 and the batch size per GPU to 4. This forms 192 tensor parallel groups
or 192 model replicas, where each model replica is distributed across 8 GPUs. The batch size of 4 is the
amount of training data per iteration and per tensor parallel group; that is, each model replica consumes
4 batches per iteration. In this case, the global batch size becomes 768, and each global batch contains
about 3 million tokens. Hence, the global batch size is reduced by half compared to the previous case
with sharded data parallelism only.
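The arithmetic in both configurations, as a quick sketch:
gpus = 192 * 8                          # 1536 GPUs in total
seq_len = 4096

# Sharded data parallelism only: every GPU hosts one model replica
global_batch = gpus * 1                 # per-GPU batch size 1 -> 1536
tokens = global_batch * seq_len         # ~6.3 million tokens per global batch

# With 8-way tensor parallelism and a per-replica batch size of 4
replicas = gpus // 8                    # 192 model replicas
global_batch_tp = replicas * 4          # 768
tokens_tp = global_batch_tp * seq_len   # ~3.1 million tokens per global batch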
Example 2
When both sharded data parallelism and tensor parallelism are activated, the library first applies
tensor parallelism and shards the model across this dimension. For each tensor parallel rank, the data
parallelism is applied as per sharded_data_parallel_degree.
For example, assume that we want to set 32 GPUs with a tensor parallel degree of 4 (forming
groups of 4 GPUs), a sharded data parallel degree of 4, ending up with a replication degree of
2. The assignment creates eight GPU groups based on the tensor parallel degree as follows:
(0,1,2,3), (4,5,6,7), (8,9,10,11), (12,13,14,15), (16,17,18,19), (20,21,22,23),
(24,25,26,27), (28,29,30,31). That is, four GPUs form one tensor parallel group. In this
case, the reduced data parallel group for the 0th rank GPUs of the tensor parallel groups would be
(0,4,8,12,16,20,24,28). The reduced data parallel group is sharded based on the sharded data
parallel degree of 4, resulting in two replication groups for data parallelism. GPUs (0,4,8,12) form
one sharding group, which collectively hold a complete copy of all parameters for the 0th tensor parallel
rank, and GPUs (16,20,24,28) form another such group. Other tensor parallel ranks also have similar
sharding and replication groups.
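The following sketch reproduces the group assignment described above in plain Python; it is
illustrative only, since the library computes these groups internally.
tp_degree, sdp_degree, total_gpus = 4, 4, 32

# Tensor parallel groups: consecutive blocks of tp_degree GPUs
tp_groups = [list(range(i, i + tp_degree)) for i in range(0, total_gpus, tp_degree)]

# Reduced data parallel group for the 0th tensor parallel rank
rdp_group0 = list(range(0, total_gpus, tp_degree))   # [0, 4, 8, ..., 28]

# Shard the reduced group by sdp_degree, yielding two sharding groups
sharding_groups = [rdp_group0[:sdp_degree], rdp_group0[sdp_degree:]]
# -> [[0, 4, 8, 12], [16, 20, 24, 28]]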
Figure 1: Tensor parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4,
4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form tensor parallelism
groups from TPG0 to TPG7. Replication groups are ({TPG0, TPG4}, {TPG1, TPG5}, {TPG2, TPG6} and {TPG3,
TPG7}); each replication group pair shares the same color but filled differently.
Figure 2: Sharded data parallelism groups for (nodes, sharded data parallel degree, tensor parallel
degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form
sharded data parallelism groups from SDPG0 to SDPG7. Replication groups are ({SDPG0, SDPG4}, {SDPG1,
SDPG5}, {SDPG2, SDPG6} and {SDPG3, SDPG7}); each replication group pair shares the same color but
filled differently.
To use sharded data parallelism with tensor parallelism, you need to set both
sharded_data_parallel_degree and tensor_parallel_degree in the configuration for
distribution while creating an object of the SageMaker PyTorch estimator class.
You also need to activate prescaled_batch. This means that, instead of each GPU reading its
own batch of data, each tensor parallel group collectively reads a combined batch of the chosen
batch size. Effectively, instead of dividing the dataset into parts equal to the number of GPUs (or
data parallel size, smp.dp_size()), it divides into parts equal to the number of GPUs divided by
tensor_parallel_degree (also called reduced data parallel size, smp.rdp_size()). For more details
on prescaled batch, see Prescaled Batch in the SageMaker Python SDK documentation. See also the
example training script train_gpt_simple.py for GPT-2 in the SageMaker Examples GitHub repository.
The following code snippet shows an example of creating a PyTorch estimator object based on the
aforementioned scenario in the section called “Example 2” (p. 1884).
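The snippet assumes that the smp_parameters and mpi_options variables referenced below are defined
along the following lines; the values are illustrative for this scenario.
smp_parameters = {
    "ddp": True,
    "tensor_parallel_degree": 4,
    "sharded_data_parallel_degree": 4,
    "prescaled_batch": True,
}
mpi_options = "-verbose -x orte_base_help_aggregate=0"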
pytorch_estimator = PyTorch(
    entry_point="your_training_script.py",
    role=role,
    instance_type="ml.p4d.24xlarge",
    volume_size=200,
    instance_count=4,
    sagemaker_session=sagemaker_session,
    py_version="py3",
    framework_version="1.13.1",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": smp_parameters,
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,
            "custom_mpi_options": mpi_options,
        },
    },
    source_dir="source_directory_of_your_code",
    output_path=s3_output_location
)
Consider the following when using the SageMaker model parallelism library's sharded data parallelism.
• Sharded data parallelism is compatible with FP16 training. To run FP16 training, see the section
called “FP16 Training with Model Parallelism” (p. 1904).
• Sharded data parallelism is compatible with tensor parallelism. The following items are what you
might need to consider for using sharded data parallelism with tensor parallelism.
• When using sharded data parallelism with tensor parallelism, the embedding layers
are also automatically distributed across the tensor parallel group. In other words, the
distribute_embedding parameter is automatically set to True. For more information about
tensor parallelism, see the section called “Tensor Parallelism” (p. 1890).
• Note that sharded data parallelism with tensor parallelism currently uses the NCCL collectives as the
backend of the distributed training strategy.
To learn more, see the section called “Sharded data parallelism with tensor parallelism” (p. 1883).
• Sharded data parallelism currently is not compatible with pipeline parallelism (p. 1866) or optimizer
state sharding (p. 1901). To activate sharded data parallelism, turn off optimizer state sharding and
set the pipeline parallel degree to 1.
• The activation checkpointing (p. 1902) and activation offloading (p. 1903) features are compatible
with sharded data parallelism.
• To use sharded data parallelism with gradient accumulation, set the backward_passes_per_step
argument to the number of accumulation steps while wrapping your model with the
smdistributed.modelparallel.torch.DistributedModel module, as shown in the sketch after
this list. This ensures that the gradient AllReduce operation across the model replication groups
(sharding groups) takes place at the boundary of gradient accumulation.
• You can checkpoint your models trained with sharded data parallelism using the library's
checkpointing APIs, smp.save_checkpoint and smp.resume_from_checkpoint. For more
information, see the section called “Checkpointing a distributed PyTorch model (for the SageMaker
model parallelism library v1.10.0 and later)” (p. 1927).
• The behavior of the delayed_parameter_initialization configuration parameter changes under
sharded data parallelism. When these two features are simultaneously turned on, parameters are
immediately initialized upon model creation in a sharded manner instead of delaying the parameter
initialization, so that each rank initializes and stores its own shard of parameters.
• When sharded data parallelism is activated, the library performs gradient clipping internally when
the optimizer.step() call runs. You don't need to use utility APIs for gradient clipping, such as
torch.nn.utils.clip_grad_norm_(). To adjust the threshold value for gradient clipping, you can
set it through the sdp_gradient_clipping parameter for the distribution parameter configuration
when you construct the SageMaker PyTorch estimator, as shown in the the section called “How to
apply sharded data parallelism to your training job” (p. 1877) section.
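A minimal sketch of the gradient accumulation setup mentioned in the preceding list; the accumulation
step count of 4 is an illustrative value.
import smdistributed.modelparallel.torch as smp

smp.init()
model = Net()  # your torch.nn.Module
# Synchronize gradients across sharding groups only every 4 backward passes
model = smp.DistributedModel(model, backward_passes_per_step=4)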
Pipelining a Model
One of the core features of SageMaker's model parallelism library is pipeline parallelism, which
determines the order in which computations are made and data is processed across devices during
model training. Pipelining is a technique for achieving true parallelization in model parallelism, by
having the GPUs compute simultaneously on different data samples, and for overcoming the
performance loss due to sequential computation. When you use pipeline parallelism, the training job is
executed in a pipelined fashion over microbatches to maximize GPU usage.
Note
Pipeline parallelism, also called model partitioning, is available for both PyTorch and
TensorFlow. For supported versions of the frameworks, see the section called “Supported
Frameworks and AWS Regions” (p. 1872).
Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline
one-by-one and follow an execution schedule defined by the library runtime. A microbatch is a smaller
subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by
which device for every time slot.
For example, depending on the pipeline schedule and the model partition, GPU i might perform
(forward or backward) computation on microbatch b while GPU i+1 performs computation on
microbatch b+1, thereby keeping both GPUs active at the same time. During a single forward or
backward pass, execution flow for a single microbatch might visit the same device multiple times,
depending on the partitioning decision. For instance, an operation that is at the beginning of the model
might be placed on the same device as an operation at the end of the model, while the operations in
between are on different devices, which means this device is visited twice.
The library offers two different pipeline schedules, simple and interleaved, which can be configured using
the pipeline parameter in the SageMaker Python SDK. In most cases, interleaved pipeline can achieve
better performance by utilizing the GPUs more efficiently.
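For example, a minimal configuration sketch selecting the schedule; the pipeline parameter accepts
"interleaved" or "simple".
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "partitions": 2,
        "pipeline": "interleaved"  # or "simple"
    }
}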
Interleaved Pipeline
In an interleaved pipeline, backward execution of the microbatches is prioritized whenever possible. This
allows quicker release of the memory used for activations, using memory more efficiently. It also allows
for scaling the number of microbatches higher, reducing the idle time of the GPUs. At steady-state, each
device alternates between running forward and backward passes. This means that the backward pass of
one microbatch may run before the forward pass of another microbatch finishes.
The preceding figure illustrates an example execution schedule for the interleaved pipeline over 2 GPUs.
In the figure, F0 represents the forward pass for microbatch 0, and B1 represents the backward pass
for microbatch 1. Update represents the optimizer update of the parameters. GPU0 always prioritizes
backward passes whenever possible (for instance, executes B0 before F2), which allows for clearing of
the memory used for activations earlier.
Simple Pipeline
A simple pipeline, by contrast, finishes running the forward pass for each microbatch before starting
the backward pass. This means that it only pipelines the forward pass and backward pass stages within
themselves. The following figure illustrates an example of how this works, over 2 GPUs.
Use the following sections to learn about the framework-specific pipeline scheduling decisions
SageMaker's model parallelism library makes for TensorFlow and PyTorch.
The following image is an example of a TensorFlow graph partitioned by the model parallelism library,
using automated model splitting. When a graph is split, each resulting subgraph is replicated B times
(except for the variables), where B is the number of microbatches. In this figure, each subgraph is
replicated 2 times (B=2). An SMPInput operation is inserted at each input of a subgraph, and an
SMPOutput operation is inserted at each output. These operations communicate with the library
backend to transfer tensors to and from each other.
The following image is an example of 2 subgraphs split with B=2 with gradient operations added.
The gradient of a SMPInput op is a SMPOutput op, and vice versa. This enables the gradients to flow
backwards during back-propagation.
This GIF demonstrates an example interleaved pipeline execution schedule with B=2 microbatches and
2 subgraphs. Each device sequentially executes one of the subgraph replicas to improve GPU utilization.
As B grows larger, the fraction of idle time slots goes to zero. Whenever it is time to do (forward or
backward) computation on a specific subgraph replica, the pipeline layer signals to the corresponding
blue SMPInput operations to start executing.
Once the gradients from all microbatches in a single mini-batch are computed, the library combines the
gradients across microbatches, which can then be applied to the parameters.
As in TensorFlow, each batch is split into a number of microbatches, which are executed one at a time on
each device. However, the execution schedule is handled via execution servers launched on each device.
Whenever the output of a submodule that is placed on another device is needed on the current device,
an execution request is sent to the execution server of the remote device along with the input tensors to
the submodule. The server then executes this module with the given inputs and returns the response to
the current device.
Since the current device is idle during the remote submodule execution, the local execution for the
current microbatch pauses, and the library runtime switches execution to another microbatch which
the current device can actively work on. The prioritization of microbatches is determined by the chosen
pipeline schedule. For an interleaved pipeline schedule, microbatches that are in the backward stage of
the computation are prioritized whenever possible.
Tensor Parallelism
Tensor parallelism is a type of model parallelism in which specific model weights, gradients, and
optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual
weights intact but partitions the set of weights, tensor parallelism splits individual weights. This typically
involves distributed computation of specific operations, modules, or layers of the model.
Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory
(such as large embedding tables with a large vocabulary size or a large softmax layer with a large
number of classes). In this case, treating this large tensor or operation as an atomic unit is inefficient and
impedes balance of the memory load.
Tensor parallelism is also useful for extremely large models for which pipelining alone is simply not
enough. For example, with GPT-3-scale models that require partitioning over tens of instances, pure
microbatch pipelining is inefficient because the pipeline depth becomes too high and the overhead
becomes prohibitively large.
Note
Tensor parallelism is available for PyTorch in the SageMaker model parallelism library v1.6.0 and
later.
Topics
• How Tensor Parallelism Works (p. 1890)
• Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism (p. 1892)
• Support for Hugging Face Transformer Models (p. 1897)
• Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor
Parallelism (p. 1899)
Tensor parallelism takes place at the level of nn.Modules; it partitions specific modules in the model
across tensor parallel ranks. This is in addition to the existing partition of the set of modules used in
pipeline parallelism.
When a module is partitioned through tensor parallelism, its forward and backward propagation
are distributed. The library handles the necessary communication across devices to implement the
distributed execution of these modules. The modules are partitioned across multiple data parallel ranks.
Contrary to the traditional distribution of workloads, each data parallel rank does not have the complete
model replica when the library’s tensor parallelism is used. Instead, each data parallel rank may have
only a partition of the distributed modules, in addition to the entirety of the modules that are not
distributed.
Example: Consider tensor parallelism across data parallel ranks, where the degree of data parallelism is
4 and the degree of tensor parallelism is 2. Assume that you have a data parallel group that holds the
following module tree, after partitioning the set of modules.
A
├── B
│   ├── E
│   └── F
├── C
└── D
    ├── G
    └── H
Assume that tensor parallelism is supported for the modules B, G, and H. One possible outcome of
tensor parallel partition of this model could be:

dp_rank 0 (tp_rank 0): A, B:0, C, D, G:0, H
dp_rank 1 (tp_rank 1): A, B:1, C, D, G:1, H
dp_rank 2 (tp_rank 0): A, B:0, C, D, G:0, H
dp_rank 3 (tp_rank 1): A, B:1, C, D, G:1, H
Each line represents the set of modules stored in that dp_rank, and the notation X:y represents the yth
fraction of the module X. Note the following:
1. Partitioning takes place across subsets of data parallel ranks, which we call TP_GROUP, not the entire
DP_GROUP, so that the exact model partition is replicated across dp_rank 0 and dp_rank 2, and
similarly across dp_rank 1 and dp_rank 3.
2. The modules E and F are no longer part of the model, since their parent module B is partitioned, and
any execution that is normally a part of E and F takes place within the (partitioned) B module.
3. Even though H is supported for tensor parallelism, in this example it is not partitioned, which
highlights that whether to partition a module depends on user input. The fact that a module is
supported for tensor parallelism does not necessarily mean it is partitioned.
When tensor parallelism is performed over data parallel ranks, a subset of the parameters, gradients,
and optimizer states are partitioned across the tensor parallel devices for the modules that are
partitioned. For the rest of the modules, the tensor parallel devices operate in a regular data parallel
manner. To execute the partitioned module, a device first collects the necessary parts of all data samples
across peer devices in the same tensor parallelism group. The device then runs the local fraction of the
module on all these data samples, followed by another round of synchronization which both combines
the parts of the output for each data sample and returns the combined data samples to the GPUs from
which the data sample first originated. The following figure shows an example of this process over a
partitioned nn.Linear module.
The first figure shows a small model with a large nn.Linear module with data parallelism over the two
tensor parallelism ranks. The nn.Linear module is replicated into the two parallel ranks.
The second figure shows tensor parallelism applied on a larger model while splitting the nn.Linear
module. Each tp_rank holds half the linear module, and the entirety of the rest of the operations. While
the linear module runs, each tp_rank collects the relevant half of all data samples and passes it through
their half of the nn.Linear module. The result needs to be reduce-scattered (with summation as the
reduction operation) so that each rank has the final linear output for their own data samples. The rest of
the model runs in the typical data parallel manner.
Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism
In this section, you learn:
• How to configure a SageMaker PyTorch estimator and the SageMaker model parallelism option to use
tensor parallelism.
• How to adapt your training script using the extended smdistributed.modelparallel modules for
tensor parallelism.
To learn more about the smdistributed.modelparallel modules, see the SageMaker model parallel
APIs in the SageMaker Python SDK documentation.
Topics
• Tensor parallelism alone (p. 1893)
• Tensor parallelism combined with pipeline parallelism (p. 1895)
The following is an example of a distributed training option to activate tensor parallelism alone, without
pipeline parallelism. Configure the mpi_options and smp_options dictionaries to specify distributed
training options to the SageMaker PyTorch estimator.
Note
Extended memory-saving features are available through Deep Learning Containers for PyTorch,
which implements the SageMaker model parallelism library v1.6.0 or later.
mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # 8 processes
    "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 1,  # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 4,    # tensor parallelism over 4 devices
        "ddp": True
    }
}

smp_estimator = PyTorch(
    entry_point='your_training_script.py',  # Specify your training script
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py3',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
Tip
To find a complete list of parameters for distribution, see Configuration Parameters for
Model Parallelism in the SageMaker Python SDK documentation.
The following example training script shows how to adapt the SageMaker model parallelism library to a
training script. In this example, it is assumed that the script is named your_training_script.py.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

# smdistributed: Initialize
smp.init()

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Enable tensor parallelism for all supported modules in the model
# i.e., nn.Linear in this case. Alternatively, we can use
# smp.set_tensor_parallelism(model.fc1, True)
# to enable it only for model.fc1
with smp.tensor_parallelism():
    model = Net()
Tensor parallelism combined with pipeline parallelism
The following is an example of a distributed training option that enables tensor parallelism combined
with pipeline parallelism. Set up the mpi_options and smp_options parameters to specify model
parallel options with tensor parallelism when you configure a SageMaker PyTorch estimator.
Note
Extended memory-saving features are available through Deep Learning Containers for PyTorch, which implement the SageMaker model parallelism library v1.6.0 or later.
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,      # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,        # tp over 2 devices
        "ddp": True
    }
}

smp_estimator = PyTorch(
    entry_point='your_training_script.py',  # Specify your train script
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
The following example training script shows how to adapt the SageMaker model parallelism library to a
training script. Note that the training script now includes the smp.step decorator:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

# smdistributed: Initialize
smp.init()

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

# smdistributed: Define the computation to be pipelined over microbatches.
# The weight update (optimizer.step()) happens outside of this decorated
# function, after the full backward pass is complete.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

model = Net()
The SageMaker model parallelism library's tensor parallelism offers out-of-the-box support for the
following Hugging Face Transformer models:
• GPT-2, BERT, and RoBERTa (Available in the SageMaker model parallelism library v1.7.0 and later)
• GPT-J (Available in the SageMaker model parallelism library v1.8.0 and later)
• GPT-Neo (Available in the SageMaker model parallelism library v1.10.0 and later)
Note
For any other Transformers models, you need to use the
smdistributed.modelparallel.torch.tp_register_with_module() API to apply tensor parallelism.
Note
To use tensor parallelism for training Hugging Face Transformer models, make sure you use Hugging Face Deep Learning Containers for PyTorch that have the SageMaker model parallelism library v1.7.0 and later. For more information, see the SageMaker model parallelism library release notes.
For the Hugging Face transformer models supported by the library out of the box, you don't need to manually implement hooks to translate Transformer APIs to smdistributed transformer layers. You can activate tensor parallelism by using the context manager smdistributed.modelparallel.torch.tensor_parallelism() and wrapping the model with smdistributed.modelparallel.torch.DistributedModel(). You don't need to manually register hooks for tensor parallelism using the smp.tp_register API.
The library provides the following state_dict translation functions for the supported models:
• smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_state_dict_to_hf_gpt2(state_dict, max_seq_len=None)
• smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_hf_state_dict_to_smdistributed(state_dict)
• smdistributed.modelparallel.torch.nn.huggingface.bert.translate_state_dict_to_hf_bert(state_dict, max_seq_len=None)
• smdistributed.modelparallel.torch.nn.huggingface.bert.translate_hf_state_dict_to_smdistributed(state_dict)
• smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_state_dict_to_hf_roberta(state_dict, max_seq_len=None)
• smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_hf_state_dict_to_smdistributed(state_dict)
• smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_state_dict_to_hf_gptj(state_dict, max_seq_len=None) (Available in the SageMaker model parallelism library v1.8.0 and later)
• smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_hf_gptj_state_dict_to_smdistributed(state_dict) (Available in the SageMaker model parallelism library v1.8.0 and later)
• smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_state_dict_to_hf_gptneo(state_dict, max_seq_len=None) (Available in the SageMaker model parallelism library v1.10.0 and later)
• smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_hf_state_dict_to_smdistributed(state_dict) (Available in the SageMaker model parallelism library v1.10.0 and later)
The following example shows how to activate tensor parallelism for a supported GPT-2 model:

with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(hf_gpt2_config)

model = smp.DistributedModel(model)
Given a state_dict from the DistributedModel object, you can load the weights into the original
Hugging Face GPT-2 model using the translate_state_dict_to_hf_gpt2 function as shown in the
following code.
from smdistributed.modelparallel.torch.nn.huggingface.gpt2 \
    import translate_state_dict_to_hf_gpt2

max_seq_len = 1024

if smp.rdp_rank() == 0:
    state_dict = dist_model.state_dict()
    hf_state_dict = translate_state_dict_to_hf_gpt2(state_dict, max_seq_len)
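The translated state_dict can then be loaded into a standard Hugging Face GPT-2 model. The following is a minimal sketch, assuming hf_gpt2_config is the configuration used to build the original model:

from transformers import AutoModelForCausalLM

# Rebuild the original (non-distributed) model and load the translated weights.
hf_model = AutoModelForCausalLM.from_config(hf_gpt2_config)
hf_model.load_state_dict(hf_state_dict)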
Similarly, given a supported Hugging Face model state_dict, you can use the translate_hf_state_dict_to_smdistributed function to convert it to a format readable by smp.DistributedModel. This can be useful in transfer learning use cases, where a pre-trained model is loaded into a smp.DistributedModel for model-parallel fine-tuning:

from smdistributed.modelparallel.torch.nn.huggingface.roberta \
    import translate_hf_state_dict_to_smdistributed

model = AutoModelForMaskedLM.from_config(roberta_config)
model = smp.DistributedModel(model)

pretrained_model = AutoModelForMaskedLM.from_pretrained("roberta-large")
translated_state_dict = translate_hf_state_dict_to_smdistributed(pretrained_model.state_dict())

# Load the translated weights into the distributed model before fine-tuning.
model.load_state_dict(translated_state_dict)

# start fine-tuning...
Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism
This section explains how the ranking mechanism of model parallelism works with tensor parallelism.
This is extended from the Ranking Basics for Core Features of the SageMaker Model Parallelism
Library (p. 1875). With tensor parallelism, the library introduces three types of ranking and process
group APIs: smp.tp_rank() for tensor parallel rank, smp.pp_rank() for pipeline parallel rank, and
smp.rdp_rank() for reduced-data parallel rank. The corresponding communication process groups are
tensor parallel group (TP_GROUP), pipeline parallel group (PP_GROUP), and reduced-data parallel group
(RDP_GROUP). These groups are defined as follows:
• A tensor parallel group (TP_GROUP) is an evenly divisible subset of the data parallel group, over which
tensor parallel distribution of modules takes place. When the degree of pipeline parallelism is 1,
TP_GROUP is the same as model parallel group (MP_GROUP).
• A pipeline parallel group (PP_GROUP) is the group of processes over which pipeline parallelism takes
place. When the tensor parallelism degree is 1, PP_GROUP is the same as MP_GROUP.
• A reduced-data parallel group (RDP_GROUP) is a set of processes that hold both the same pipeline parallelism partitions and the same tensor parallel partitions, and perform data parallelism among themselves. This is called the reduced data parallel group because it is a subset of the entire data parallelism group, DP_GROUP. For the model parameters that are distributed within the TP_GROUP, the gradient allreduce operation is performed only within the reduced-data parallel group, while for the parameters that are not distributed, the gradient allreduce takes place over the entire DP_GROUP.
• A model parallel group (MP_GROUP) refers to a group of processes that collectively store the entire
model. It consists of the union of the PP_GROUPs of all the ranks that are in the TP_GROUP of the
current process. When the degree of tensor parallelism is 1, MP_GROUP is equivalent to PP_GROUP. It
is also consistent with the existing definition of MP_GROUP from previous smdistributed releases.
Note that the current TP_GROUP is a subset of both the current DP_GROUP and the current MP_GROUP.
To learn more about the communication process APIs in the SageMaker model parallelism library, see the
Common API and the PyTorch-specific APIs in the SageMaker Python SDK documentation.
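As a quick illustration of these APIs, the following minimal snippet (a hypothetical debugging script, not from the SageMaker examples) prints each process's position in every group discussed above after smp.init() is called:

import smdistributed.modelparallel.torch as smp

smp.init()
# Each process prints its position in every communication group.
print(
    f"global rank {smp.rank()}: "
    f"tp_rank={smp.tp_rank()}, pp_rank={smp.pp_rank()}, "
    f"rdp_rank={smp.rdp_rank()}, dp_rank={smp.dp_rank()}, mp_rank={smp.mp_rank()}"
)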
This figure shows the ranking mechanism, parameter distribution, and associated AllReduce operations of tensor parallelism.
For example, consider process groups for a single node with 8 GPUs, where the degree of tensor
parallelism is 2, the degree of pipeline parallelism is 2, and the degree of data parallelism is 4. The upper
center part of the preceding figure shows an example of a model with 4 layers. The lower left and lower right parts of the figure illustrate the 4-layer model distributed across 4 GPUs using both pipeline parallelism and tensor parallelism, where tensor parallelism is used for the middle two layers. These two lower
figures are simple copies to illustrate different group boundary lines. The partitioned model is replicated
for data parallelism across GPUs 0-3 and 4-7. The lower left figure shows the definitions of MP_GROUP,
PP_GROUP, and TP_GROUP. The lower right figure shows RDP_GROUP, DP_GROUP, and WORLD over the
same set of GPUs. The gradients for the layers and layer slices that have the same color are allreduced
together for data parallelism. For example, the first layer (light blue) gets the allreduce operations
across DP_GROUP, whereas the dark orange slice in the second layer only gets the allreduce operations
within the RDP_GROUP of its process. The bold dark red arrows represent tensors that carry the batch of their entire TP_GROUP.
In this example, pipeline parallelism occurs across the GPU pairs (0,1); (2,3); (4,5) and (6,7). In addition,
data parallelism (allreduce) takes place across GPUs 0, 2, 4, 6, and independently over GPUs 1, 3, 5, 7.
Tensor parallelism happens over subsets of DP_GROUPs, across the GPU pairs (0,2); (1,3); (4,6) and (5,7).
Optimizer State Sharding
When optimizer state sharding is turned on, the library partitions the set of model parameters based on the data parallelism degree. The gradients corresponding to the ith partition get reduced only at the ith data parallel rank. At the end of the first call to an smp.step decorator function, the optimizer wrapped by smp.DistributedOptimizer redefines its parameters to be limited to only those parameters corresponding to the partition of the current data parallel rank. The redefined parameters are called
virtual parameters and share underlying storage with the original parameters. During the first call to
optimizer.step, the optimizer states are created based on these redefined parameters, which are
sharded because of the original partition. After the optimizer update, the AllGather operation (as part of
the optimizer.step call) runs across the data parallel ranks to achieve consistent parameter states.
Tip
Optimizer state sharding can be useful when the degree of data parallelism is greater than 1 and the model has more than a billion parameters.
The degree of data parallelism is calculated by (processes_per_host * instance_count / pipeline_parallel_degree), and the smp.dp_size() function handles the sizing in the background. For example, with 8 processes per host, 1 instance, and a pipeline parallel degree of 2, the data parallelism degree is 8 * 1 / 2 = 4.
The following example configuration turns on this feature by setting "shard_optimizer_state" to True:
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,      # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,        # tp over 2 devices
        "ddp": True,
        "shard_optimizer_state": True
    }
}
See Adapt your PyTorch training script (p. 1895) in the Tensor parallelism combined with pipeline
parallelism section. There’s no additional modification required for the script.
Activation Checkpointing
Activation checkpointing (or gradient checkpointing) is a technique to reduce memory usage by clearing
activations of certain layers and recomputing them during a backward pass. Effectively, this trades extra
computation time for reduced memory usage. If a module is checkpointed, at the end of a forward
pass, the inputs to and outputs from the module stay in memory. Any intermediate tensors that would
have been part of the computation inside that module are freed up during the forward pass. During
the backward pass of checkpointed modules, these tensors are recomputed. At this point, the layers
beyond this checkpointed module have finished their backward pass, so the peak memory usage with
checkpointing can be lower.
Note
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.
When you use automated model partitioning, you can find the partitioning assignment logs starting with
Partition assignments: in the training job logs. If a module is partitioned across multiple ranks
(for example, with one descendant on one rank and another descendant on a different rank), the library
ignores the attempt to checkpoint the module and raises a warning message that the module won't be
checkpointed.
Note
The SageMaker model parallelism library supports both overlapping and non-overlapping allreduce operations in combination with checkpointing.
Note
PyTorch’s native checkpointing API is not compatible with smdistributed.modelparallel.
Example 1: The following sample code shows how to use activation checkpointing when you have a
model definition in your script.
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
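The following is a minimal sketch of how a module of this model could then be checkpointed with the library's smp.set_activation_checkpointing API; the smp.DistributedModel wrapping, the .module attribute access, and the choice of fc1 are illustrative assumptions:

import smdistributed.modelparallel.torch as smp

smp.init()
model = smp.DistributedModel(Net())

# Checkpoint a single submodule (illustrative choice): its internal activations
# are freed after the forward pass and recomputed during the backward pass.
smp.set_activation_checkpointing(model.module.fc1)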
Example 2: The following sample code shows how to use activation checkpointing when you have a
sequential model in your script.
import torch.nn as nn
from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint_sequential

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(1, 20, 5),
            nn.ReLU(),
            nn.Conv2d(20, 64, 5),
            nn.ReLU()
        )
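A forward pass invoking the imported checkpoint_sequential might then look like the following minimal sketch; the call site inside forward is an assumption, mirroring how torch.utils.checkpoint.checkpoint_sequential is typically used:

    # Inside the Net class above (sketch): run the sequential block with
    # activation checkpointing instead of calling self.seq(x) directly.
    def forward(self, x):
        return checkpoint_sequential(self.seq, x)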
Example 3: The following sample code shows how to use activation checkpointing when you import a prebuilt model from a library, such as PyTorch or Hugging Face Transformers. Whether you checkpoint sequential modules or not, do the following:

smp.init()
model = AutoModelForCausalLM(*args, **kwargs)
model = smp.DistributedModel(model)
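For a model wrapped this way, the checkpointing call can then target the transformer layers of the distributed model. The attribute path below is an illustrative assumption that depends on the wrapped model's structure:

# Sketch: checkpoint the distributed transformer layers. The attribute path is
# illustrative; inspect your wrapped model to find the actual path.
transformer_layers = model.module.transformer.seq_layers
smp.set_activation_checkpointing(
    transformer_layers, pack_args_as_tuple=True, strategy="each"
)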
Activation Offloading
When activation checkpointing and pipeline parallelism are turned on and the number of microbatches is greater than one, activation offloading is an additional feature that can further reduce memory usage. Activation offloading asynchronously moves to the CPU the checkpointed activations for the microbatches that are not currently running. Right before the GPU needs the activations for the microbatch's backward pass, this functionality prefetches the offloaded activations back from the CPU.
Note
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.
Use activation offloading to reduce memory usage when the number of microbatches is greater than 1 and activation checkpointing is turned on (see Activation Checkpointing (p. 1902)). When activation checkpointing is not used, activation offloading has no effect. When it is used with only one microbatch, it does not save memory.
To adjust how early the activations are loaded back into the GPU, you can use the configuration parameter "activation_loading_horizon" (default: 4; must be an integer greater than 0). A larger activation loading horizon causes the activations to be loaded back to the GPU earlier. If the horizon is too large, the memory-saving impact of activation offloading might be diminished. If the horizon is too small, the activations may not be loaded back in time, reducing the amount of overlap and degrading performance.
Tip
Activation offloading can be useful for large models with over a hundred billion parameters.
The following example configuration turns on activation offloading and sets the activation loading horizon:
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,      # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,        # tp over 2 devices
        "ddp": True,
        "offload_activations": True,
        "activation_loading_horizon": 4     # optional. default is 4.
    }
}
FP16 Training with Model Parallelism
For FP16 training, wrap the model creation in the smdistributed.modelparallel.torch.model_creation context manager and pass the target dtype, as shown in the following script.

# fp16_training_script.py

import torch
import smdistributed.modelparallel.torch as smp

with smp.model_creation(
    dtype=torch.float16 if args.fp16 else torch.get_default_dtype()
):
    model = ...
Tip
If you are using tensor parallelism, add tensor_parallelism=smp.tp_size() > 1 to the
smp.model_creation context manager. Adding this line also helps automatically detect
whether tensor parallelism is activated or not.
with smp.model_creation(
    ...,
    tensor_parallelism=smp.tp_size() > 1
):
    model = ...
The following code is an example of wrapping an Adadelta optimizer object with dynamic loss
scaling for FP16 training.
optimizer = torch.optim.Adadelta(...)
optimizer = smp.DistributedOptimizer(
    optimizer,
    static_loss_scale=None,
    dynamic_loss_scale=True,
    dynamic_loss_args={
        "scale_window": 1000,
        "min_scale": 1,
        "delayed_shift": 2
    }
)
Add the FP16 parameter ("fp16") to the distribution configuration for model parallelism when creating
a SageMaker PyTorch estimator object. For a complete list of the configuration parameters for model
parallelism, see Parameters for smdistributed.
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,
        "tensor_parallel_degree": 2,
        ...,
        "fp16": True
    }
}

fp16_estimator = PyTorch(
    entry_point="fp16_training_script.py",  # Specify your train script
    ...,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {...}
    }
)

fp16_estimator.fit(...)
When FP16 training starts, the model and the optimizer are wrapped by FP16_Module and
FP16_Optimizer respectively, which are modified smdistributed versions of the Apex utils.
FP16_Module converts the model to FP16 dtype and deals with the forward pass in FP16.
Tip
You can apply gradient clipping by calling clip_master_grads before optimizer.step.
Tip
When using torch.optim.lr_scheduler and FP16 training, you need to pass
optimizer.optimizer to the LR scheduler rather than the optimizer. See the following
example code.
scheduler = StepLR(
    optimizer.optimizer if smp.state.cfg.fp16 else optimizer,
    step_size=1,
    gamma=args.gamma
)
Support for FlashAttention
The FlashAttention library only supports models when attention_head_size is set to a value that's a multiple of 8 and less than 128. Therefore, to make sure that FlashAttention works properly when you train a distributed transformer, you should adjust parameters so that the attention head size complies with the requirements. For more information, see also Installation and features in the FlashAttention GitHub repository.
For example, assume that you configure a Transformer model with hidden_width=864 and
num_heads=48. The head size of FlashAttention is calculated as attention_head_size =
hidden_width / num_heads = 864 / 48 = 18. To enable FlashAttention, you need to adjust the
num_heads parameter to 54, so that attention_head_size = hidden_width / num_heads =
864 / 54 = 16, which is a multiple of 8.
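The constraint and the arithmetic above can be captured in a small helper; the function below is purely illustrative and not part of the library:

def flashattention_compatible(hidden_width: int, num_heads: int) -> bool:
    # FlashAttention requires attention_head_size to be a multiple of 8
    # and less than 128.
    if hidden_width % num_heads != 0:
        return False
    head_size = hidden_width // num_heads
    return head_size % 8 == 0 and head_size < 128

assert not flashattention_compatible(864, 48)   # head size 18: not a multiple of 8
assert flashattention_compatible(864, 54)       # head size 16: compatible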
There are three use-case scenarios for running a SageMaker training job.
1. You can use one of the pre-built AWS Deep Learning Containers for TensorFlow and PyTorch. This option is recommended if this is your first time using the model parallel library. To find a tutorial for how to run a SageMaker model parallel training job, see the example notebooks at PyTorch training with Amazon SageMaker's model parallelism library.
2. You can extend the pre-built containers to handle any additional functional requirements for your
algorithm or model that the pre-built SageMaker Docker image doesn't support. To find an example of
how you can extend a pre-built container, see Extend a Pre-built Container (p. 2675).
3. You can adapt your own Docker container to work with SageMaker using the SageMaker Training
toolkit. For an example, see Adapting Your Own Training Container.
For options 2 and 3 in the preceding list, refer to Extend a Pre-built Docker Container that Contains
SageMaker's Distributed Model Parallel Library (p. 1924) to learn how to install the model parallel
library in an extended or customized Docker container.
In all cases, you launch your training job by configuring a SageMaker TensorFlow or PyTorch estimator to activate the library. To learn more, see the following topics.
Topics
• Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model Parallel
Library (p. 1907)
• Step 2: Launch a Training Job Using the SageMaker Python SDK (p. 1921)
Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model
Parallel Library
Use this section to learn how to customize your training script to use the core features of the Amazon
SageMaker model parallelism library. To use the library-specific API functions and parameters, we
recommend you use this documentation alongside the SageMaker model parallel library APIs in the
SageMaker Python SDK documentation.
The training script examples provided in these sections are simplified and designed to highlight the
required changes you must make to use the library. For end-to-end, runnable notebook examples that
demonstrate how to use a TensorFlow or PyTorch training script with the SageMaker model parallelism
library, see Amazon SageMaker Distributed Training Notebook Examples (p. 1942).
Topics
• Split the model of your training script using the SageMaker model parallelism library (p. 1907)
• Modify a TensorFlow training script (p. 1909)
• Modify a PyTorch Training Script (p. 1915)
Split the model of your training script using the SageMaker model parallelism library
There are two ways to modify your training script to set up model splitting: automated splitting or manual splitting. We recommend automated model splitting, unless you are very familiar with the model architecture and have a good idea of how to efficiently partition your model.
How it works
Auto-partitioning occurs during the first training step, when the smp.step-decorated function is first called. During this call, the library first constructs a version of the model on the CPU RAM (to avoid GPU memory limitations), then analyzes the model graph and makes a partitioning decision. Based on this decision, each model partition is loaded onto a GPU, and only then is the first step executed. Because of these analysis and partitioning steps, the first training step might take longer.
In either framework, the library manages the communication between devices through its own backend,
which is optimized for AWS infrastructure.
The auto-partition design adapts to the characteristics of the framework, and the library does the partitioning at the granularity level that is most natural in each framework. For instance, in TensorFlow, each specific operation can be assigned to a different device, whereas in PyTorch, the assignment is done at the module level, where each module consists of multiple operations. The following sections review the specifics of the design in each framework.
Automated model splitting with PyTorch
During the first training step, the model parallelism library internally runs a tracing step that is meant
to construct the model graph and determine the tensor and parameter shapes. After this tracing step,
the library constructs a tree, which consists of the nested nn.Module objects in the model, as well as
additional data gathered from tracing, such as the amount of stored nn.Parameters, and execution
time for each nn.Module.
Next, the library traverses this tree from the root and runs a partitioning algorithm that assigns each
nn.Module to a device, which balances computational load (measured by module execution time)
and memory use (measured by the total stored nn.Parameter size and activations). If multiple
nn.Modules share the same nn.Parameter, then these modules are placed on the same device to
avoid maintaining multiple versions of the same parameter. Once the partitioning decision is made, the
assigned modules and weights are loaded to their devices.
For instructions on how to register the smp.step decorator in your PyTorch training script, see the section called “Automated splitting with PyTorch” (p. 1915).
Automated model splitting with TensorFlow
The model parallelism library analyzes the sizes of the trainable variables and the graph structure, and internally uses a graph partitioning algorithm. This algorithm comes up with a device assignment for each operation, with the objective of minimizing the amount of communication needed across devices, subject to two constraints: balancing computation and balancing memory use across devices.
If you specify speed for optimize (in the model parallelism parameters in the Python SDK), the library tries to balance the number of operations and tf.Variable objects in each device. Otherwise, it tries to balance the total size of tf.Variables.
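For reference, the optimize choice is passed through the modelparallel parameters of the estimator's distribution argument (see the estimator templates later in this guide); the values below are illustrative:

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,
        # "speed" balances operation count and tf.Variable objects per device;
        # otherwise the library balances the total size of tf.Variables.
        "optimize": "speed",
    },
}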
Once the partitioning decision is made, the library creates a serialized representation of the subgraph
that each device needs to execute and imports them onto each device. While partitioning, the library
places operations that consume the same tf.Variable and operations that are part of the same Keras
layer onto the same device. It also respects the colocation constraints imposed by TensorFlow. This
means that, for example, if there are two Keras layers that share a tf.Variable, then all operations
that are part of these layers are placed on a single device.
For instructions on how to register the smp.step decorator in your TensorFlow training script, see the section called “Automated splitting with TensorFlow” (p. 1910).
PyTorch, on the other hand, does not have an equivalent notion of operation that is sufficiently rich and universal. The closest unit of computation in PyTorch that has these characteristics is an nn.Module, which is at a much higher granularity level, and this is why the library does partitioning at this level in PyTorch.
Manual model splitting
If you want to manually specify how to partition your model across devices, use the smp.partition context manager. For instructions on how to set the context manager for manual partitioning, see the following pages.
To use this option after making modifications, in Step 2, you'll need to set auto_partition to False,
and define a default_partition in the framework estimator class of the SageMaker Python SDK.
Any operation that is not explicitly placed on a partition through the smp.partition context manager
is executed on the default_partition. In this case, the automated splitting logic is bypassed, and
each operation is placed based on your specification. Based on the resulting graph structure, the model
parallelism library creates a pipelined execution schedule automatically.
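A sketch of the estimator-side configuration for manual partitioning follows; the parameter names come from this section, and the values are illustrative:

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,
        "auto_partition": False,   # bypass the automated splitting logic
        "default_partition": 0,    # ops outside smp.partition contexts run here
    },
}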
Modify a TensorFlow training script
In this section, you learn how to modify TensorFlow training scripts to configure the SageMaker model parallelism library for auto-partitioning and manual partitioning. This selection of examples also includes an example integrated with Horovod for hybrid model and data parallelism.
Note
To find which TensorFlow versions are supported by the library, see the section called
“Supported Frameworks and AWS Regions” (p. 1872).
The required modifications you must make to your training script to use the library are listed in
Automated splitting with TensorFlow (p. 1910).
To learn how to modify your training script to use hybrid model and data parallelism with Horovod, see
Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism (p. 1911).
If you want to use manual partitioning, also review Manual splitting with TensorFlow (p. 1913).
Tip
For end-to-end notebook examples that demonstrate how to use a TensorFlow training script
with the SageMaker model parallelism library, see TensorFlow Examples (p. 1943).
The following topics show examples of training scripts that you can use to configure SageMaker's model
parallelism library for auto-partitioning and manual partitioning TensorFlow models.
Note
Auto-partitioning is enabled by default. Unless otherwise specified, the example scripts use
auto-partitioning.
Topics
• Automated splitting with TensorFlow (p. 1910)
• Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism (p. 1911)
• Manual splitting with TensorFlow (p. 1913)
Automated splitting with TensorFlow
The following training script changes are required to run a TensorFlow model with SageMaker's model parallelism library; the example script that follows shows them in context.
To learn more about the SageMaker model parallelism library API, refer to the API documentation.
The following Python script is an example of a training script after the changes are made.

import tensorflow as tf
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# ... (dataset creation elided)
    .batch(256, drop_remainder=True)
)

model = MyModel()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)
    # ... (gradient application elided)
If you are done preparing your training script, proceed to Step 2: Launch a Training Job Using the
SageMaker Python SDK (p. 1921). If you want to run a hybrid model and data parallel training job,
continue to the next section.
Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism
You can use the SageMaker model parallelism library with Horovod for hybrid model and data
parallelism. To read more about how the library splits a model for hybrid parallelism, see Pipeline
parallelism (available for PyTorch and TensorFlow) (p. 1866).
In this step, we focus on how to modify your training script to adapt the SageMaker model parallelism
library.
To properly set up your training script to pick up the hybrid parallelism configuration that you'll set
in Step 2: Launch a Training Job Using the SageMaker Python SDK (p. 1921), use the library's helper
functions, smp.dp_rank() and smp.mp_rank(), which automatically detect the data parallel rank and
model parallel rank respectively.
To find all MPI primitives the library supports, see MPI Basics in the SageMaker Python SDK
documentation.
In addition to these helper functions, the Horovod-specific changes include:
• Adding hvd.allreduce
• Broadcasting variables after the first batch, as required by Horovod
• Seeding shuffling and/or sharding operations in the data pipeline with smp.dp_rank().
Note
When you use Horovod, you must not directly call hvd.init in your training script. Instead,
you'll have to set "horovod" to True in the SageMaker Python SDK modelparallel
parameters in Step 2: Launch a Training Job Using the SageMaker Python SDK (p. 1921). This
allows the library to internally initialize Horovod based on the device assignments of model
partitions. Calling hvd.init() directly in your training script can cause problems.
Note
Using the hvd.DistributedOptimizer API directly in your training script might result in poor training performance and speed, because the API implicitly places the AllReduce operation inside smp.step. We recommend that you use the model parallelism library with Horovod by directly calling hvd.allreduce after calling accumulate() or reduce_mean() on the gradients returned from smp.step, as shown in the following example.
To learn more about the SageMaker model parallelism library API, refer to the API documentation.

import tensorflow as tf
import horovod.tensorflow as hvd
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

model = MyModel()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

@tf.function
def train_step(images, labels, first_batch):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches, then
    # AllReduce them across model replicas with Horovod.
    gradients = [hvd.allreduce(g.accumulate()) for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Horovod: Broadcast model and optimizer variables after the first batch.
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
Manual splitting with TensorFlow
Use smp.partition context managers to place operations in a specific partition. Any operation not placed in any smp.partition context is placed in the default_partition. To learn more about the SageMaker model parallelism library API, refer to the API documentation.

import tensorflow as tf
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

model = MyModel()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)
Modify a PyTorch Training Script
In this section, you learn how to modify PyTorch training scripts to configure the SageMaker model parallelism library for auto-partitioning and manual partitioning.
Note
To find which PyTorch versions are supported by the library, see the section called “Supported
Frameworks and AWS Regions” (p. 1872).
Tip
For end-to-end notebook examples that demonstrate how to use a PyTorch training script with
the SageMaker model parallelism library, see PyTorch Examples (p. 1943).
Note that auto-partitioning is enabled by default. Unless otherwise specified, the following scripts use
auto-partitioning.
Topics
• Automated splitting with PyTorch (p. 1915)
• Manual splitting with PyTorch (p. 1917)
• Considerations (p. 1918)
• Unsupported framework features (p. 1920)
Automated splitting with PyTorch
The following training script changes are required to run a PyTorch training script with SageMaker's model parallelism library; the example script that follows shows them in context.
To learn more about the SageMaker model parallelism library API, refer to the API documentation.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

# smdistributed: Initialize
smp.init()

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        # define layers

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
if smp.local_rank() == 0:
    datasets.MNIST("../data", train=True, download=True)
smp.barrier()
dataset = datasets.MNIST("../data", train=True, download=False)

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)
# The training loop, in which optimizer.step() runs outside the
# smp.step-decorated function, is omitted here.
Manual splitting with PyTorch
Use smp.partition context managers to place modules on specific devices. Any module not placed in any smp.partition context is placed on the default_partition. The default_partition needs to be provided if auto_partition is set to False. The modules that are created within a specific smp.partition context are placed on the corresponding partition.
To learn more about the SageMaker model parallelism library API, refer to the API documentation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

# smdistributed: Initialize
smp.init()

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        with smp.partition(0):
            # define child modules on device 0
            ...
        with smp.partition(1):
            # define child modules on device 1
            ...

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)
# The training loop, in which optimizer.step() runs outside the
# smp.step-decorated function, is omitted here.
Considerations
When you configure a PyTorch training script using SageMaker's model parallelism library, you should be
aware of the following:
• If you are using an optimization technique that relies on global gradient norms, for example the gradient norm of the entire model, such as some variants of the LAMB optimizer or global gradient clipping, you need to gather all the norms across the model partitions for correctness. You can use the library's basic communication data types to do this.
• All torch.Tensor arguments to the forward methods of the nn.Modules in your model must be
used in the computation of the module output. In other words, the library does not support the case
where there is a torch.Tensor argument to a module on which the module output does not depend.
• The argument to the smp.DistributedModel.backward() call must depend on all model outputs.
In other words, there cannot be an output from the smp.DistributedModel.forward call that is
not used in the computation of the tensor that is fed into the smp.DistributedModel.backward
call.
• If there are torch.cuda.synchronize() calls in your code, you might need to call
torch.cuda.set_device(smp.local_rank()) immediately before the synchronize call.
Otherwise unnecessary CUDA contexts might be created in device 0, which will needlessly consume
memory.
• Since the library places nn.Modules on different devices, the modules in the model must not depend
on any global state that is modified inside smp.step. Any state that remains fixed throughout
training, or that is modified outside smp.step in a way that is visible to all processes, is allowed.
• You don’t need to move the model to GPU (for example, using model.to(device)) when using
the library. If you try to move the model to GPU before the model is partitioned (before the first
smp.step call), the move call is ignored. The library automatically moves the part of the model
assigned to a rank to its GPU. Once training with the library starts, don’t move the model to CPU
and use it, as it won’t have correct parameters for modules not assigned to the partition held by
the process. If you want to retrain a model or use it for inference without the library after it was
trained using the model parallelism library, the recommended way is to save the full model using our
checkpointing API and load it back to a regular PyTorch Module.
• If you have a list of modules such that output of one feeds into another, replacing that list with
nn.Sequential can significantly improve performance.
• The weight update (optimizer.step()) needs to happen outside of smp.step because that is when
the entire backward pass is done and gradients are ready. When using a hybrid model with model and
data parallelism, at this point, AllReduce of gradients is also guaranteed to finish.
• When using the library in combination with data parallelism, make sure that the number of batches
on all data parallel ranks is the same so that AllReduce does not hang waiting for a rank which is not
participating in the step.
• If you launch a training job using an ml.p4d instance type (such as ml.p4d.24xlarge), you must set num_workers=0 in the data loader. For example, you may define your DataLoader as follows:
dataloader = torch.utils.data.DataLoader(
    data,
    batch_size=batch_size,
    num_workers=0,
    pin_memory=True,
    drop_last=True,
    shuffle=shuffle,
)
• The inputs to smp.step must be the model inputs generated by DataLoader. This is because
smp.step internally splits the input tensors along the batch dimension and pipelines them. This
means that passing DataLoader itself to the smp.step function to generate the model inputs inside
does not work.
You should access the model inputs generated by train_loader and pass those to an smp.step
decorated function. Do not pass train_loader directly to the smp.step function.
@smp.step
def train_step(model, data, target):
    ...
    return output, loss
• The input tensors to smp.step must be moved to the current device using the .to() API, which must take place after the torch.cuda.set_device(local_rank()) call.
For example, you may define the train function as in the sketch that follows. This function adds data and target to the current device using the .to() API before using those input tensors to call train_step.
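The following is a minimal sketch of such a train function; the loop structure and the reduce_mean() call on the returned StepOutput object are illustrative assumptions:

def train(model, device, train_loader, optimizer):
    model.train()
    for data, target in train_loader:
        # Move the input tensors to the device set via torch.cuda.set_device.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # train_step is the @smp.step-decorated function shown below; it
        # returns per-microbatch values as StepOutput objects.
        _, loss_mb = train_step(model, data, target)
        # Average the loss across microbatches.
        loss = loss_mb.reduce_mean()
        # The weight update runs outside smp.step, after the backward pass
        # for all microbatches is complete.
        optimizer.step()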
The input tensors to this smp.step-decorated function have been moved to the current device in the train function above. The model does not need to be moved to the current device. The library automatically moves the part of the model assigned to a rank to its GPU.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss
The following PyTorch features are unsupported by SageMaker's model parallelism library:
• If you use data parallelism with the native PyTorch DDP, the torch.nn.parallel.DistributedDataParallel wrapper module is not supported by the library. The library internally manages integrating with PyTorch DDP, including parameter broadcast and gradient AllReduce. When using the library, module buffers are only broadcast once at the start of training. If your model has module buffers that need to be synchronized across data parallel groups at each step, you can do so through the torch.distributed API, using the process group that can be obtained via smp.get_dp_process_group() (see the sketch after this list).
• For mixed precision training, the apex.amp module is not supported. The recommended way to use the library with automatic mixed-precision is to use torch.cuda.amp, with the exception of using smp.amp.GradScaler instead of the implementation in torch.
• torch.jit.ScriptModules or ScriptFunctions are not supported by smp.DistributedModel.
• apex: FusedLayerNorm, FusedAdam, FusedLAMB, and FusedNovoGrad from apex are not supported. You can use the library implementations of these through the smp.optimizers and smp.nn APIs instead.
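The buffer synchronization mentioned in the first item can be sketched as follows; broadcasting from global rank 0 is an illustrative assumption:

import torch.distributed as dist
import smdistributed.modelparallel.torch as smp

def sync_buffers(module):
    # Synchronize module buffers across the data parallel group at each step.
    group = smp.get_dp_process_group()
    for buf in module.buffers():
        # src must be a global rank; rank 0 is assumed to belong to this
        # group here, purely for illustration.
        dist.broadcast(buf, src=0, group=group)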
Step 2: Launch a Training Job Using the SageMaker Python SDK
Topics
• Using the SageMaker TensorFlow and PyTorch Estimators (p. 1921)
• Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel
Library (p. 1924)
• Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library (p. 1925)
Using the SageMaker TensorFlow and PyTorch Estimators
The TensorFlow and PyTorch estimator classes contain the distribution parameter, which you can use to specify configuration parameters for using distributed training frameworks. The SageMaker model parallel library internally uses MPI for hybrid data and model parallelism, so you must use the MPI option with the library.
The following template of a TensorFlow or PyTorch estimator shows how to configure the
distribution parameter for using the SageMaker model parallel library with MPI.
import sagemaker
from sagemaker.tensorflow import TensorFlow

smp_options = {
    "enabled": True,               # Required
    "parameters": {
        "partitions": 2,           # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "horovod": True,           # Use this for hybrid model and data parallelism
    }
}

mpi_options = {
    "enabled" : True,              # Required
    "processes_per_host" : 8,      # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = TensorFlow(
    entry_point="your_training_script.py",  # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='2.6.3',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {                    # Required
        "pipeline_parallel_degree": 2, # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,
    }
}

mpi_options = {
    "enabled" : True,                  # Required
    "processes_per_host" : 8,          # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py39',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
To enable the library, you need to pass configuration dictionaries to the "smdistributed" and "mpi"
keys through the distribution argument of the SageMaker estimator constructors.
• For the "smdistributed" key, pass a dictionary with the "modelparallel" key and the following
inner dictionaries.
Note
Using "modelparallel" and "dataparallel" in one training job is not supported.
• "enabled" – Required. To enable model parallelism, set "enabled": True.
• "parameters" – Required. Specify a set of parameters for SageMaker model parallelism.
• For a complete list of common parameters, see Parameters for smdistributed in the SageMaker
Python SDK documentation.
Note
You do not need to explicitly specify the default "--mca btl_vader_single_copy_mechanism none" flag in the "custom_mpi_options" key. If you explicitly specify it, your distributed model parallel training job might fail with the following error:

The following MCA parameter has been listed multiple times on the command line:
MCA param: btl_vader_single_copy_mechanism
MCA parameters can only be listed once on a command line to ensure there is no
ambiguity as to its value. Please correct the situation and try again.
Tip
If you launch a training job using an EFA-enabled instance type, such as ml.p4d.24xlarge and ml.p3dn.24xlarge, set the appropriate EFA-related flag in "custom_mpi_options" for best performance.
To launch the training job using the estimator and your SageMaker model parallel configured training
script, run the estimator.fit() function.
Use the following resources to learn more about using the model parallelism features in the SageMaker
Python SDK:
• We recommend that you use a SageMaker notebook instance if you are a new user. To see an example of how you can launch a training job using a SageMaker notebook instance, see Amazon SageMaker Distributed Training Notebook Examples (p. 1942).
• You can also submit a distributed training job from your machine using AWS CLI. To set up AWS CLI on
your machine, see set up your AWS credentials and Region for development.
Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel
Library
To extend a pre-built container and use SageMaker's model parallelism library, you must use one of
the available AWS Deep Learning Containers (DLC) images for PyTorch or TensorFlow. The SageMaker
model parallelism library is included in the TensorFlow (2.3.0 and later) and PyTorch (1.6.0 and later) DLC
images with CUDA (cuxyz). For a complete list of DLC images, see Available Deep Learning Containers
Images in the AWS Deep Learning Containers GitHub repository.
Tip
We recommend that you use the image that contains the latest version of TensorFlow or
PyTorch to access the most up-to-date version of the SageMaker model parallelism library.
For example, your Dockerfile should contain a FROM statement similar to the following (the Region and image tag below are placeholders for the Deep Learning Container image you choose):

# Use an AWS Deep Learning Container image as the base image.
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/pytorch-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

Additionally, when you define a PyTorch or TensorFlow estimator, you must specify the entry_point for your training script. This should be the same path identified with ENV SAGEMAKER_SUBMIT_DIRECTORY in your Dockerfile.
Tip
You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR)
and use the image URI (image_uri) to define a SageMaker estimator for training. For more
information, see Extend a Pre-built Container (p. 2675).
After you finish hosting the Docker container and retrieving the image URI of the container, create a
SageMaker PyTorch estimator object as follows. This example assumes that you have already defined
smp_options and mpi_options.
smd_mp_estimator = Estimator(
    entry_point="your_training_script.py",
    role=sagemaker.get_execution_role(),
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    image_uri='your_aws_account_id.dkr.ecr.region.amazonaws.com/name:tag',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library
To build your own Docker container for training and use the SageMaker model parallel library, you must
include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in
your Dockerfile. This section provides the minimum set of code blocks you must include to properly
prepare a SageMaker training environment and the model parallel library in your own Docker container.
Note
This custom Docker option with the SageMaker model parallel library as a binary is available
only for PyTorch.
To create a Dockerfile with the SageMaker training toolkit and the model parallel library
1. Start from an NVIDIA CUDA base image:

FROM <cuda-cudnn-base-image>
Tip
The official AWS Deep Learning Container (DLC) images are built from the NVIDIA CUDA
base images. We recommend you look into the official Dockerfiles of AWS Deep Learning
Container for PyTorch to find which versions of the libraries you need to install and how to
configure them. The official Dockerfiles are complete, benchmark tested, and managed by
the SageMaker and Deep Learning Container service teams. In the provided link, choose the
PyTorch version you use, choose the CUDA (cuxyz) folder, and choose the Dockerfile ending
with .gpu or .sagemaker.gpu.
2. To set up a distributed training environment, you need to install software for communication and
network devices, such as Elastic Fabric Adapter (EFA), NVIDIA Collective Communications Library
(NCCL), and Open MPI. Depending on the PyTorch and CUDA versions you choose, you must install
compatible versions of the libraries.
Important
Because the SageMaker model parallel library requires the SageMaker data parallel library
in the subsequent steps, we highly recommend that you follow the instructions at Create
Your Own Docker Container with the SageMaker Distributed Data Parallel Library (p. 1850) to
properly set up a SageMaker training environment for distributed training.
For more information about setting up EFA with NCCL and Open MPI, see Get started with EFA and
MPI and Get started with EFA and NCCL.
3. Add the following arguments to specify the URLs of the SageMaker distributed training packages for
PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use the
cross-node Remote Direct Memory Access (RDMA).
ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG METIS=metis-5.1.0
a. Install the METIS library:

RUN rm /etc/apt/sources.list.d/* \
&& wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
&& gunzip -f ${METIS}.tar.gz \
&& tar -xvf ${METIS}.tar \
&& cd ${METIS} \
&& apt-get update \
&& make config shared=1 \
&& make install \
&& cd .. \
&& rm -rf ${METIS}.tar* \
&& rm -rf ${METIS} \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
b. Install the RAPIDS Memory Manager library. This requires CMake 3.14 or later.
ARG RMM_VERSION=0.15.0
7. Install the sagemaker-training toolkit. The toolkit contains the common functionality that's necessary
to create a container compatible with the SageMaker training platform and the SageMaker Python
SDK.
8. After you finish creating the Dockerfile, see Adapting Your Own Training Container to learn how to
build the Docker container and host it in Amazon ECR.
Tip
For more general information about creating a custom Dockerfile for training in SageMaker, see
Use Your Own Training Algorithms.
Checkpointing and Fine-Tuning a Model with Model Parallelism
Topics
• Checkpointing a distributed model (p. 1927)
• Fine-tuning a distributed model (p. 1931)
Checkpointing a distributed model
Topics
• Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and
later) (p. 1927)
• Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between
v1.6.0 and v1.9.0) (p. 1929)
• Checkpointing a distributed TensorFlow model (p. 1930)
Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0
and later)
The SageMaker model parallelism library provides checkpoint APIs to save and load full or partial
checkpoints of the distributed model state and its optimizer state.
Note
This checkpointing method is recommended if you use PyTorch and the SageMaker model
parallelism library v1.10.0 or later.
Partial checkpointing
To save checkpoints of a model trained with model parallelism, use the library's checkpoint APIs with the partial=True argument. The library saves checkpoint files in the following directory structure:

- path
  - ${tag}_partial           (folder for partial checkpoints)
    - model_rankinfo.pt
    - optimizer_rankinfo.pt
    - fp16_states_rankinfo.pt
    - user_content.pt
  - $tag                     (checkpoint file for full checkpoints)
  - user_content_$tag        (user_content file for full checkpoints)
  - newest                   (a file that indicates the newest checkpoint)
When saving a partial checkpoint, the library also saves the model partition decision as files with .pt
file extension. Conversely, when resuming from the partial checkpoint, the library loads the partition
decision files together. Once the partition decision is loaded, you can't change the partition.
The following code snippet shows how to set the checkpoint APIs in a PyTorch training script.
model = ...
model = smp.DistributedModel(model)

optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)

user_content = ...  # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=True,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path,
    partial=True
)
Full checkpointing
To save the final model artifact for inference purposes, use the
smdistributed.modelparallel.torch.save_checkpoint API with partial=False, which
combines the model partitions to create a single model artifact. Note that this does not combine the
optimizer states.
To initialize training with particular weights, given a full model checkpoint, you can use the
smdistributed.modelparallel.torch.resume_from_checkpoint API with partial=False.
Note that this does not load optimizer states.
Note
With tensor parallelism, in general, the state_dict must be translated between
the original model implementation and the DistributedModel implementation.
Optionally, you can provide the state_dict translation function as an argument to the
smdistributed.modelparallel.torch.resume_from_checkpoint. However, for the
section called “Supported Models Out of the Box” (p. 1897), the library takes care of this
translation automatically.
The following code shows an example of how to use the checkpoint APIs for fully checkpointing a
PyTorch model trained with model parallelism.
model = ...
model = smp.DistributedModel(model)

optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)

user_content = ...  # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=False,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path,
    partial=False
)
Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library
between v1.6.0 and v1.9.0)
The SageMaker model parallelism library provides Python functions for saving partial or full checkpoints
for training jobs with tensor parallelism. The following procedure shows how to use smp.save() and
smp.load() to save and load a checkpoint when you use tensor parallelism.
Note
This checkpointing method is recommended if you use PyTorch, the section called “Tensor
Parallelism” (p. 1890), and the SageMaker model parallelism library between v1.6.0 and v1.9.0.
1. Prepare a model object and wrap it with the library's wrapper function smp.DistributedModel().
model = MyModel(...)
model = smp.DistributedModel(model)
2. Prepare an optimizer for the model. Optimizer functions require an iterable of model parameters. To prepare one, you must process model.parameters() so that individual model parameters have unique IDs.

If there are parameters with duplicated IDs in the model parameter iterable, loading the checkpointed optimizer state fails. To create an iterable of model parameters with unique IDs for your optimizer, see the following:
unique_params = []
unique_params_set = set()
for p in model.parameters():
    if p not in unique_params_set:
        unique_params.append(p)
        unique_params_set.add(p)
del unique_params_set

# Build the optimizer from the de-duplicated parameter list. MyOpt stands in
# for the optimizer class of your choice.
optimizer = MyOpt(unique_params, ...)

3. Wrap the optimizer with the library's wrapper function smp.DistributedOptimizer().

optimizer = smp.DistributedOptimizer(optimizer)
4. Save the model and the optimizer state using smp.save(). Depending on how you want to save
checkpoints, choose one of the following two options:
• Option 1: Save a partial model on each mp_rank for a single MP_GROUP.
if smp.rdp_rank() == 0:
    model_dict = model.local_state_dict()  # save a partial model
    opt_dict = optimizer.local_state_dict()  # save a partial optimizer state
    smp.save(
        {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict},
        "/checkpoint.pt",
        partial=True,
    )
With tensor parallelism, the library saves checkpointed files named in the following format:
checkpoint.pt_{pp_rank}_{tp_rank}.
Note
With tensor parallelism, make sure you set the if statement as if smp.rdp_rank()
== 0 instead of if smp.dp_rank() == 0. When the optimizer state is sharded with
tensor parallelism, all reduced-data parallel ranks must save their own partition of the
optimizer state. Using the wrong if statement for checkpointing might cause the training
job to stall. For more information about using if smp.dp_rank() == 0 without tensor
parallelism, see General Instruction for Saving and Loading in the SageMaker Python SDK
documentation.
• Option 2: Save the full model.
if smp.rdp_rank() == 0:
    model_dict = model.state_dict(gather_to_rank0=True)  # save the full model
    if smp.rank() == 0:
        smp.save(
            {"model_state_dict": model_dict},
            "/checkpoint.pt",
            partial=False,
        )
Note
Consider the following for full checkpointing:
• If you set gather_to_rank0=True, all ranks other than 0 return empty dictionaries.
• For full checkpointing, you can only checkpoint the model. Full checkpointing of
optimizer states is currently not supported.
• The full model only needs to be saved at smp.rank() == 0.
5. Load the checkpoints using smp.load(). Depending on how you checkpointed in the previous step,
choose one of the following two options:
• Option 1: Load the partial checkpoints.

if smp.rdp_rank() == 0:
    checkpoint = smp.load("/checkpoint.pt", partial=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

• Option 2: Load the full checkpoints.

if smp.rdp_rank() == 0:
    checkpoint = smp.load("/checkpoint.pt", partial=False)
    model.load_state_dict(checkpoint["model_state_dict"])

The if smp.rdp_rank() == 0 condition is not required, but it can help avoid redundant loading among different MP_GROUPs. Loading a full optimizer state dict is currently not supported with tensor parallelism.
Checkpointing a distributed TensorFlow model

To save a TensorFlow model trained with model parallelism, use the following APIs provided by the SageMaker model parallelism library:

• smdistributed.modelparallel.tensorflow.DistributedModel.save_model
• smdistributed.modelparallel.tensorflow.CheckpointManager
Fine-tuning a distributed model

import argparse

from transformers import AutoModelForCausalLM

import smdistributed.modelparallel
import smdistributed.modelparallel.torch as smp

def parse_args():
    parser = argparse.ArgumentParser()
    ...  # set up numerous args to parse from the configuration dictionary to the script for training

def main():
    """Main function to train GPT."""
    args = parse_args()
    ...
For a complete example of training scripts and Jupyter notebooks, see the GPT-2 examples for PyTorch
in the SageMaker Examples GitHub repository.
• For model parallelism, it is best to use powerful instances with large GPU memory to handle the overhead of model parallelism operations such as partitioning models across multiple GPUs. We recommend using ml.p4d or ml.p3dn instances for training large DL models. These instances are also equipped with Elastic Fabric Adapter (EFA), which provides higher network bandwidth and enables large-scale training with model parallelism.
• The impact of sharding optimizer state depends on the number of data parallel ranks. Typically, a higher degree of data parallelism (which grows with the size of the compute cluster) improves the efficiency of memory usage.
When you want to downsize a cluster, make sure you check the optimizer state sharding configuration.
For example, a large DL model with optimizer state sharding that fits on a node with 16 GPUs won't fit
on a node with 8 GPUs because there are simply not enough GPUs across which to shard the optimizer
state.
Activation checkpointing
• Memory efficiency can be improved by using activation checkpointing for a group of modules. The more modules you group together, the more efficient the memory usage. When checkpointing sequential modules for layers, the strategy argument of the smp.set_activation_checkpointing function groups the layers together for checkpointing. For example, grouping two or more layers together for checkpointing is more memory efficient than checkpointing one layer at a time; this trades extra computation time for reduced memory usage. See the sketch after this bullet.
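For illustration, a minimal sketch of grouped activation checkpointing follows. The module path and the strategy value are hypothetical; check the smdistributed.modelparallel API documentation for the exact arguments your model needs.

import smdistributed.modelparallel.torch as smp

# Checkpoint a sequential block of layers, grouping every two consecutive
# layers into one checkpointed unit (the module path and strategy value are
# hypothetical examples).
smp.set_activation_checkpointing(
    model.transformer.seq_layers,
    strategy="group_2",
)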
Tensor parallelism
• The degree of tensor parallelism should be a power of two (2, 4, 8, ..., 2^n), where the maximum
degree must be equal to the number of GPUs per node. For example, if you use a node with 8 GPUs,
possible numbers for the degree of tensor parallelism are 2, 4, and 8. We don’t recommend arbitrary
numbers (such as 3, 5, 6, and 7) for the degree of tensor parallelism. When you use multiple nodes,
misconfiguring the degree of tensor parallelism might result in running tensor parallelism across the
nodes; this adds significant overhead from communication of activations across the nodes and can
become computationally expensive.
• You can run pipeline parallelism both within a single node and across multiple nodes. When you
use pipeline parallelism in combination with tensor parallelism, we recommend running pipeline
parallelism across multiple nodes and keeping tensor parallelism within individual nodes.
• Pipeline parallelism comes with the following three knobs: microbatches, active_microbatches,
and prescaled_batch.
• When you use tensor parallelism with pipeline parallelism, we recommend activating
prescaled_batch so that the batch size per model parallel group can be increased for efficient
pipelining. With prescaled_batch activated, the batch size set in the training script becomes
tp_size times the batch size set for each rank without prescaled_batch.
• Increasing the number of microbatches helps achieve efficient pipelining and better performance.
Note that the effective microbatch size is the batch size divided by number of microbatches. If you
increase the number of microbatches while keeping batch size constant, each microbatch processes
fewer samples.
• The number of active_microbatches is the maximum number of microbatches that are
simultaneously in process during pipelining. For each active microbatch in process, its activations
and gradients take up GPU memory. Therefore, increasing active_microbatches takes up more
GPU memory.
• If both GPU and GPU memory are underutilized, increase active_microbatches for better
parallelization during pipelining.
• For more information about how to use tensor parallelism with pipeline parallelism, see Tensor
parallelism combined with pipeline parallelism (p. 1895).
• To find descriptions of the aforementioned parameters, see Parameters for smdistributed in the SageMaker Python SDK documentation. A configuration sketch follows these notes.
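For orientation, the following is a hedged sketch of how these knobs are typically passed through the SageMaker Python SDK's distribution configuration; the values are illustrative only, so verify the parameter names and valid ranges against Parameters for smdistributed.

# Illustrative values only.
smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 2,
        "tensor_parallel_degree": 4,
        "microbatches": 8,
        "active_microbatches": 4,
        "prescaled_batch": True,
    },
}

distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": {"enabled": True, "processes_per_host": 8},
}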
Activation offloading

• Make sure that activation offloading is used in combination with activation checkpointing and pipeline parallelism. To ensure that the offloading and preloading happen in the background, specify a value greater than 1 for the microbatches parameter.
• When offloading activations, you might be able to increase active_microbatches and sometimes match the total number of microbatches. This depends on which modules are checkpointed and how the model is partitioned.
Reference configurations

The SageMaker model parallelism training team provides the following reference points based on experiments with the GPT-2 model, a sequence length of 512, and a vocabulary size of 50,000.

You can extrapolate from these configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower the batch size first. If the model still doesn't fit, try increasing the degree of tensor parallelism.
To experiment quickly, first reduce the size of the model, for example by decreasing the number of layers and attention heads. Validate that the reduced model runs well on the notebook instance before using a large cluster for training the full model.
Monitoring and Logging a Training Job Using the SageMaker Console and
Amazon CloudWatch
To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU
utilization, use visualization provided through the SageMaker console.
For more information, see Monitor and Analyze Training Jobs Using Amazon CloudWatch
Metrics (p. 2127).
Permissions

To run a SageMaker training job with model parallelism or the SageMaker distributed training example notebooks, make sure you have the right permissions in your IAM role.
limit, and then experiment with larger batch sizes and numbers of microbatches. As the number of
microbatches is increased, larger batch sizes might become feasible if an interleaved pipeline is used.
• Your batch size must always be divisible by the number of microbatches. Note that depending on the size of the dataset, the last batch of every epoch can sometimes be smaller than the rest, and this smaller batch also needs to be divisible by the number of microbatches. If it is not, you can set drop_remainder=True in the tf.data.Dataset.batch() call (in TensorFlow), or set drop_last=True in DataLoader (in PyTorch), so that this last, small batch is not used; see the sketch after this bullet. If you are using a different API for the data pipeline, you might need to manually skip the last batch whenever it is not divisible by the number of microbatches.
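For example, a minimal TensorFlow sketch (the dataset and batch_size objects are assumed to already exist); the PyTorch equivalent is passing drop_last=True to DataLoader:

# Drop the final partial batch so that every batch divides evenly into
# microbatches.
dataset = dataset.batch(batch_size, drop_remainder=True)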
Manual Partitioning
• If you use manual partitioning, be mindful of the parameters that are consumed by multiple
operations and modules in your model, such as the embedding table in transformer architectures.
Modules that share the same parameter must be placed in the same device for correctness. When
auto-partitioning is used, the library automatically enforces this constraint.
Data Preparation
• If the model takes multiple inputs, make sure you seed the random operations in your data pipeline (for example, shuffling) with smp.dp_rank(), as in the sketch after this bullet. If the dataset is being deterministically sharded across data parallel devices, make sure that the shard is indexed by smp.dp_rank(). This is to make sure that the order of the data seen on all ranks that form a model partition is consistent.
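A minimal sketch of this seeding, assuming a PyTorch data pipeline and an existing dataset object:

import torch
import smdistributed.modelparallel.torch as smp

# Seed the shuffling generator with smp.dp_rank() so that every rank in the
# same model partition (which shares a dp_rank) sees the samples in the same
# order.
generator = torch.Generator()
generator.manual_seed(smp.dp_rank())
dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, generator=generator)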
To delay parameter initialization until after the model is partitioned (for example, when the full model is too large to initialize up front), wrap the model creation in the smp.delay_param_initialization context manager:

with smp.delay_param_initialization(enabled=True):
    model = MyModel()
Do not create the optimizer from the original module's parameters before calling smp.DistributedModel(), because wrapping the model replaces its parameters:

## WRONG
model = MyModel()
optimizer = SomeOptimizer(model.parameters())
model = smp.DistributedModel(model)  # optimizer now has outdated parameters!

Instead, the optimizer should be created with the parameters of the smp.DistributedModel as follows:

## CORRECT
model = smp.DistributedModel(MyModel())
optimizer = SomeOptimizer(model.parameters())
• When a module is replaced with its distributed counterpart through tensor parallelism, the distributed
module does not inherit its weights from the original module, and initializes new weights. This means
that, for instance, if the weights need to be initialized in a particular call (for example, through a
load_state_dict call), this needs to happen after the smp.DistributedModel call, once the
module distribution takes place.
• When accessing the parameters of distributed modules directly, note that the weight does not have the same shape as in the original module. For instance:

with smp.tensor_parallelism():
    linear = nn.Linear(60, 60)

# will pass
assert tuple(linear.weight.shape) == (60, 60)

distributed_linear = smp.DistributedModel(linear)

# a corresponding assertion on the distributed module's weight would fail:
# the number of input channels will have been divided by smp.tp_size()
For more information about checkpointing a model with tensor parallelism, see the section called
“Checkpointing a distributed model” (p. 1927).
Topics
• Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism
Library (p. 1938)
• Saving Checkpoints (p. 1939)
• Convergence Using Model Parallel and TensorFlow (p. 1940)
• Stalling or Crashing Distributed Training Jobs (p. 1940)
• Receiving NCCL Error for a PyTorch Training Job (p. 1941)
• Receiving RecursionError for a PyTorch Training Job (p. 1942)
bucket = sagemaker.Session().default_bucket()
base_job_name = "sagemaker-checkpoint-test"
checkpoint_in_bucket = "checkpoints"

estimator = TensorFlow(
    ...
Saving Checkpoints

You might run into an error when saving checkpoints of a large model on SageMaker. This can be caused by a SageMaker limitation while uploading the local checkpoint to Amazon S3 during training.

If you run into this error, do not use checkpoint_s3_uri with the SageMaker estimator call, which disables SageMaker's built-in checkpoint upload. When saving checkpoints for larger models, we recommend saving checkpoints to a custom directory and passing that directory to the following helper functions (as the local_path argument) to upload them explicitly.
import os

def sync_local_checkpoints_to_s3(
    local_path="/opt/ml/checkpoints",
    s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) + '/checkpoints',
):
    """Sample function to sync checkpoints from a local path to Amazon S3."""
    import boto3

    # Check if the local path exists.
    if not os.path.exists(local_path):
        raise RuntimeError(f"Provided local path {local_path} does not exist. Please check")

    # Check if the target bucket exists.
    s3 = boto3.resource("s3")
    s3_bucket = s3_uri.replace('s3://', '').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e

    aws_s3_sync(local_path, s3_uri)
    return

def sync_s3_checkpoints_to_local(
    local_path="/opt/ml/checkpoints",
    s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) + '/checkpoints',
):
    """Sample function to sync checkpoints from Amazon S3 to a local path."""
    import boto3

    # Try to create the local path if it does not exist.
    if not os.path.exists(local_path):
        print(f"Provided local path {local_path} does not exist. Creating...")
        try:
            os.makedirs(local_path)
        except Exception as e:
            raise RuntimeError(f"Failed to create {local_path}")

    # Check if the source bucket exists.
    s3 = boto3.resource("s3")
    s3_bucket = s3_uri.replace('s3://', '').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e

    aws_s3_sync(s3_uri, local_path)
    return
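Both helpers call an aws_s3_sync routine that is not shown in this excerpt. A minimal sketch, assuming the AWS CLI is installed on the training container, could look like the following.

import subprocess

def aws_s3_sync(source, destination):
    """Sync files between a local directory and an S3 URI by shelling out to the AWS CLI."""
    subprocess.run(["aws", "s3", "sync", source, destination], check=True)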
• If you see a distributed training job stalling at the NCCL initialization step, consider the following:
  • If you are using one of the EFA-enabled instances (ml.p4d or ml.p3dn instances) with a custom VPC and its subnet, ensure that the security group used has inbound and outbound connections for all ports to and from the same security group. You also generally need outbound connections to any IP as a separate rule (for internet access). To find instructions on how to add inbound and outbound rules for EFA communication, see SageMaker Distributed Training Job Stalling During Initialization (p. 1863).
• If you see a distributed training job stalling when checkpointing the full model, this might
be because the state_dict() call on the model or optimizer was not made on all ranks with
rdp_rank()==0 (when using tensor parallelism) or dp_rank()==0 (when using only pipeline
parallelism). These ranks need to communicate to construct the checkpoint to be saved. Similar
stalling issues can also happen when checkpointing a partial optimizer state if shard_optimizer_state is enabled.
For more information about checkpointing a model with model parallelism, see General Instruction
for Saving and Loading and Checkpointing a distributed PyTorch model (for the SageMaker model
parallelism library between v1.6.0 and v1.9.0) (p. 1929).
• If the training job crashes with a CUDA Out of Memory error, this means that the distributed training
configuration needs to be adjusted to fit the model on the GPU cluster. For more information and best
practices, see Setting Up the Right Configuration for a Given Model (p. 1932).
• If the training job crashes with an uncorrectable ECC error, this means that one of the GPUs in the
cluster has gone bad. If you need technical support, share the job ARN with the AWS team and restart
your training job from a checkpoint if possible.
• In rare cases, a job configuration that worked previously but is close to the limits of GPU memory
might fail later with a different cluster due to a CUDA Out of Memory error. This could be because
some GPU has lower available memory than usual due to ECC errors.
• A network timeout crash might happen when running a multinode job that doesn't use all GPUs in the node. To work around this, use all GPUs on the node by setting the processes_per_host parameter to the number of GPUs in each instance. For example, this is processes_per_host=8 for ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge instances.
• If you find that your training job takes a long time during the data downloading stage, make sure
the Amazon S3 path you provided to checkpoint_s3_uri for the SageMaker Estimator class
is unique for the current training job. If this path is reused across multiple training jobs running
simultaneously, all those checkpoints are uploaded and downloaded to the same Amazon S3 path and
might significantly increase checkpoint loading time.
• Use FSx for Lustre when you deal with large data and models.
• If your dataset is large and fetching it takes a long time, we recommend keeping your dataset in FSx for Lustre.
• When training a model beyond 10 billion parameters, we recommend using FSx for Lustre for checkpointing.
• After you create a file system, make sure to wait for its status to become available before starting a training job that uses it.
You can resolve this by reducing the batch size or active_microbatches. If auto-partitioning does not result in a well-balanced partition, you might have to consider manual partitioning. For more information, see Pipeline parallelism across nodes (p. 1933).
SageMaker Distributed Training Notebook Examples
These notebooks are provided in the SageMaker examples GitHub repository. You can also browse them
on the SageMaker examples website.
The examples are set up to use ml.p3.16xlarge instances for the worker nodes, but you can also choose the ml.p3dn.24xlarge or ml.p4d.24xlarge instance types, for which the SageMaker distributed training libraries are optimized. You can test the notebooks using a cluster of a single node; however, to see a performance improvement as shown in the Training Benchmarks section, use a cluster of multiple nodes (two or more). The examples call out the section in which you modify this configuration.
• How I trained 10TB for Stable Diffusion on SageMaker, Medium (November 29, 2022)
• Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search, AWS Machine Learning Blog (August 18, 2022)
• Training YOLOv5 on AWS with PyTorch and the SageMaker distributed data parallel library, Medium
(May 6, 2022)
• Speed up EfficientNet model training on SageMaker with PyTorch and the SageMaker distributed data
parallel library, Medium (March 21, 2022)
• Speed up EfficientNet training on AWS with the SageMaker distributed data parallel library, Towards
Data Science (January 12, 2022)
• Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker,
AWS Machine Learning Blog (June 25, 2021)
• Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker,
the Hugging Face website (April 8, 2021)
• New performance improvements in the Amazon SageMaker model parallelism library, AWS Machine
Learning Blog (December 16, 2022)
• Train gigantic models with near-linear scaling using sharded data parallelism on Amazon SageMaker,
AWS Machine Learning Blog (October 31, 2022)
1942
Amazon SageMaker Developer Guide
SageMaker Distributed Training Notebook Examples
PyTorch Examples
The SageMaker data parallelism library
• CNN with PyTorch 1.6 and the SageMaker data parallelism library
• MaskRCNN with PyTorch 1.6 and the SageMaker data parallelism library
• BERT with PyTorch 1.6 and the SageMaker data parallelism library

The SageMaker model parallelism library

• Train GPT-2 with PyTorch 1.8.1 and Tensor Parallelism Using the SageMaker model parallelism library
• BERT with PyTorch 1.6 and the SageMaker model parallelism library
TensorFlow Examples
The SageMaker data parallelism library
• CNN with TensorFlow 2.3.1 and the SageMaker data parallelism library
• MaskRCNN with TensorFlow 2.3.1 and the SageMaker data parallelism library
• BERT with TensorFlow 2.3.1 and the SageMaker data parallelism library

The SageMaker model parallelism library

• CNN with TensorFlow 2.3.1 and the SageMaker model parallelism library
HuggingFace Examples
The following HuggingFace on SageMaker examples are available in the HuggingFace notebooks
repository.
Distributed computing with SageMaker best practices
SageMaker notebook instances are optimized for machine learning. If you do not have an active notebook instance, follow the instructions in Create a Notebook Instance (p. 209) in the SageMaker developer guide to create one. After you have created an instance, in the Notebook instances page of the SageMaker console, do the following:
1. Open JupyterLab.
2. Select the examples icon in the left tray.
3. Browse the examples for Training and look for notebooks titled Distributed Data Parallel or Distributed Model Parallel.
1. Open a terminal.
2. In the command line, navigate to the SageMaker folder.
cd SageMaker
Note
To download the HuggingFace example notebooks (p. 1943), clone the HuggingFace notebooks GitHub repository:

git clone https://github.com/huggingface/notebooks.git
You can configure ML tasks to run in a distributed manner across multiple nodes (instances), accelerators
(NVIDIA GPUs, AWS Trainium chips), and vCPU cores. By running distributed computation, you can
achieve a variety of goals such as computing operations faster, handling large datasets, or training large
ML models.
The following list covers common challenges that you might face when you run an ML training job at
scale.
• You need to make decisions on how to distribute computation depending on ML tasks, software
libraries you want to use, and compute resources.
• Not all ML tasks are straightforward to distribute. Also, not all ML libraries support distributed
computation.
• Distributed computation might not always result in a linear increase in compute efficiency. In
particular, you need to identify if data I/O and inter-GPU communication have bottlenecks or cause
overhead.
• Distributed computation might disturb numerical processes and change model accuracy. Specifically, for data-parallel neural network training, when you change the global batch size while scaling up to a larger compute cluster, you also need to adjust the learning rate accordingly.
SageMaker provides distributed training solutions to ease such challenges for various use cases. Choose
one of the following options that best fits your use case.
Topics
• Option 1: Use a SageMaker built-in algorithm that supports distributed training (p. 1945)
• Option 2: Run custom ML code in the SageMaker managed training or processing environment (p. 1945)
• Option 3: Write your own custom distributed training code (p. 1947)
• Option 4: Launch multiple jobs in parallel or sequentially (p. 1947)
A subset of the SageMaker built-in algorithms support distributed training. To check if the algorithm
of your choice supports distributed training, see the Parallelizable column in the Common Information
About Built-in Algorithms table. Some of the algorithms support multi-instance distributed training,
while the rest of the parallelizable algorithms support parallelization across multiple GPUs in a single
instance, as indicated in the Parallelizable column.
The SageMaker distributed training libraries provide AWS-managed implementations of neural network data parallelism and model parallelism. SageMaker distributed training also comes with launcher clients built into the SageMaker Python SDK, so you don't need to author parallel launch code. To learn more, see SageMaker's data parallelism library and SageMaker's model parallelism library.
• Open-source distributed training libraries
Open-source frameworks have their own distribution mechanisms, such as DistributedDataParallel (DDP) in PyTorch or the tf.distribute modules in TensorFlow. You can choose to run these distributed
training frameworks in the SageMaker-managed framework containers. For example, the sample code
for training MaskRCNN in SageMaker shows how to use both PyTorch DDP in the SageMaker PyTorch
framework container and Horovod in the SageMaker TensorFlow framework container.
SageMaker ML containers also come with MPI preinstalled, so you can parallelize your entry point script using mpi4py, as in the brief sketch that follows. Using the MPI-integrated training containers is a great option when you launch a third-party distributed training launcher or write ad hoc parallel code in the SageMaker managed training environment.
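As a minimal sketch, an entry point script parallelized with mpi4py only needs the MPI communicator to discover its rank; SageMaker launches one copy of the script per MPI process when MPI is enabled in the estimator's distribution settings.

from mpi4py import MPI

# Each MPI process discovers its rank and partitions the workload accordingly.
comm = MPI.COMM_WORLD
print(f"Worker {comm.Get_rank()} of {comm.Get_size()} starting")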
We often run neural network training jobs on multiple-CPU or multiple-GPU instances. Each GPU-
based instance usually contains multiple GPU devices. Consequently, distributed GPU computing can
happen either within a single GPU instance with multiple GPUs (single-node multi-GPU training),
or across multiple GPU instances with multiple GPU cores in each (multi-node multi-GPU training).
Writing and debugging code is easier with single-instance training, and intra-node GPU-to-GPU throughput is usually faster than inter-node GPU-to-GPU throughput. Therefore, it is a good idea to scale data parallelism vertically first (use one GPU instance with multiple GPUs) and expand to multiple
GPU instances if needed. This might not apply to cases where the CPU budget is high (for example, a
massive workload for data pre-processing) and when the CPU-to-GPU ratio of a multi-GPU instance is
too low. In all cases, you need to experiment with different combinations of instance types based on
your own ML training needs and workload.
• Monitor the quality of convergence
When training a neural network with data parallelism, increasing the number of GPUs while keeping the mini-batch size per GPU constant increases the global mini-batch size for the mini-batch stochastic gradient descent (MSGD) process. The size of the mini-batches for MSGD is known to
impact the descent noise and convergence. For properly scaling while preserving accuracy, you need to
adjust other hyperparameters such as the learning rate [Goyal et al. (2017)].
• Monitor I/O bottlenecks
As you increase the number of GPUs, the throughput for reading and writing storage should also
increase. Make sure that your data source and pipeline don’t become bottlenecks.
• Modify your training script as needed
Training scripts written for single-GPU training must be modified for multi-node multi-GPU training. In
most data parallelism libraries, script modification is required to do the following.
• Assign batches of training data to each GPU.
• Use an optimizer that can deal with gradient computation and parameter updates across multiple
GPUs.
• Assign responsibility of checkpointing to a specific host and GPU.
You can also use SageMaker Training and SageMaker Processing to run custom distributed computations
that do not require inter-worker communication. In the computing literature, those tasks are often
described as embarrassingly parallel or share-nothing. Examples include parallel processing of data
files, training models in parallel on different configurations, or running batch inference on a collection
of records. You can trivially parallelize such share-nothing use cases with Amazon SageMaker. When you launch a SageMaker Training or SageMaker Processing job on a cluster with multiple nodes, SageMaker by default replicates and launches your training code (in Python or Docker) on all the nodes. Tasks that require spreading input data across such multiple nodes can be facilitated by setting S3DataDistributionType=ShardedByS3Key in the data input configuration of the SageMaker TrainingInput API, as in the sketch that follows.
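A minimal sketch with the SageMaker Python SDK follows; the bucket name, prefix, and estimator object are placeholders.

from sagemaker.inputs import TrainingInput

# Each node receives a distinct shard of the objects under this prefix
# instead of a full replica of the dataset.
train_input = TrainingInput(
    s3_data="s3://amzn-s3-demo-bucket/training-data/",  # placeholder bucket/prefix
    distribution="ShardedByS3Key",
)
estimator.fit({"train": train_input})  # estimator is assumed to be defined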
Launching multiple jobs in parallel or sequentially (Option 4) is a good fit in cases such as the following:

• When you have specific data channels and metadata entries (such as hyperparameters, model configuration, or instance types) for each sub-task.
• When you implement retry steps at a sub-task level.
• When you vary the configuration of the sub-tasks over the course of the workload, such as when
training on increasing batch sizes.
• When you need to run an ML task that takes longer than the maximum training time allowed for a
single training job (28 days maximum).
• When different steps of a compute workflow require different instance types.
For the specific case of hyperparameter search, use SageMaker Automated Model Tuning. SageMaker
Automated Model Tuning is a serverless parameter search orchestrator that launches multiple training
jobs on your behalf, according to a search logic that can be random, Bayesian, or HyperBand.
Additionally, to orchestrate multiple training jobs, you can also consider workflow orchestration tools,
such as SageMaker Pipelines, AWS Step Functions, and Apache Airflow supported by Amazon Managed
Workflows for Apache Airflow (MWAA) and SageMaker Workflows.
Training Compiler

SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler-enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to accelerate your training jobs on SageMaker ML instances for accelerated computing.
How It Works
SageMaker Training Compiler converts DL models from their high-level language representation
to hardware-optimized instructions. Specifically, SageMaker Training Compiler applies graph-level
optimizations, dataflow-level optimizations, and backend optimizations to produce an optimized model
that efficiently uses hardware resources. As a result, you can train your models faster than when you train
them without compilation.
It is a two-step process to activate SageMaker Training Compiler for your training job:
1. Bring your own DL script and, if needed, adapt it to compile and train with SageMaker Training Compiler. To learn more, see Bring Your Own Deep Learning Model (p. 1967).
2. Create a SageMaker estimator object with the compiler configuration parameter using the SageMaker
Python SDK.
a. Turn on SageMaker Training Compiler by adding
compiler_config=TrainingCompilerConfig() to the SageMaker estimator class.
b. Adjust hyperparameters (batch_size and learning_rate) to maximize the benefit that
SageMaker Training Compiler provides.
Compilation through SageMaker Training Compiler changes the memory footprint of the model.
Most commonly, this manifests as a reduction in memory utilization and a consequent increase in
the largest batch size that can fit on the GPU. In some cases, the compiler intelligently promotes
caching which leads to a decrease in the largest batch size that can fit on the GPU. Note that if you
want to change the batch size, you must adjust the learning rate appropriately.
For a reference for batch_size tested for popular models, see Tested Models (p. 1952).
When you adjust the batch size, you also have to adjust the learning_rate appropriately. For
best practices for adjusting the learning rate along with the change in batch size, see the section
called “Best Practices and Considerations” (p. 1989).
c. By running the estimator.fit() class method, SageMaker compiles your model and starts the training job. A minimal sketch of steps 2a through 2c follows this list.

For instructions on how to launch a training job, see Enable SageMaker Training Compiler (p. 1975).
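A minimal sketch of steps 2a through 2c with the Hugging Face estimator; the entry point script, role ARN, and hyperparameter values are placeholders.

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",  # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),  # step 2a: turn on the compiler
    hyperparameters={"batch_size": 24, "learning_rate": 5e-5},  # step 2b
)
estimator.fit()  # step 2c: compile and start the training job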
SageMaker Training Compiler does not alter the final trained model, while allowing you to accelerate the
training job by more efficiently using the GPU memory and fitting a larger batch size per iteration. The
final trained model from the compiler-accelerated training job is identical to the one from the ordinary
training job.
Tip
SageMaker Training Compiler only compiles DL models for training on supported GPU instances
managed by SageMaker. To compile your model for inference and deploy it to run anywhere in
the cloud and at the edge, use SageMaker Neo compiler.
Topics
• Supported Frameworks, AWS Regions, Instance Types, and Tested Models (p. 1949)
• Bring Your Own Deep Learning Model (p. 1967)
• Enable SageMaker Training Compiler (p. 1975)
• SageMaker Training Compiler Example Notebooks and Blogs (p. 1989)
• SageMaker Training Compiler Best Practices and Considerations (p. 1989)
• SageMaker Training Compiler FAQ (p. 1992)
• SageMaker Training Compiler Troubleshooting (p. 1993)
• Amazon SageMaker Training Compiler Release Notes (p. 1999)
Supported Frameworks, AWS Regions, Instance Types, and Tested Models

Supported Frameworks
SageMaker Training Compiler supports the following deep learning frameworks and is available through
AWS Deep Learning Containers.
Topics
• PyTorch (p. 1950)
• TensorFlow (p. 1951)
PyTorch
PyTorch v1.12.0 | No | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-trcomp-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
Transformers v4.17.0, PyTorch v1.10.2 | No | 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-trcomp-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
Transformers v4.11.0, PyTorch v1.9.0 | No | 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-training-comp:1.9.0-transformers4.11.0-gpu-py38-cu111-ubuntu20.04
TensorFlow
Transformers v4.11.0, TensorFlow v2.5.1 | No | 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-tensorflow-training-comp:2.5.1-transformers4.11.0-gpu-py37-cu112-ubuntu18.04
For more information, see Available Images in the AWS Deep Learning Containers GitHub repository.
AWS Regions
The SageMaker Training Compiler Containers are available in the AWS Regions where AWS Deep
Learning Containers are in service except the China regions.
Instance Types

SageMaker Training Compiler supports the following GPU instance types:

• P4 instances
• P3 instances
• G4dn instances
• G5 instances
For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance
Types page. For information about instance pricing, see Amazon SageMaker Pricing.
If you run into a resource limit error when launching a training job, follow the instructions at Request a service quota increase for SageMaker resources.
Tested Models
The following table includes a list of the models that have been tested with SageMaker Training
Compiler. For reference, the largest batch size that is able to fit into memory is also included alongside
other training parameters. SageMaker Training Compiler can change the memory footprint of the model
training process; as a result, a larger batch size can often be used during the training process, further
decreasing total training time. In some cases, SageMaker Training Compiler intelligently promotes
caching which leads to a decrease in the largest batch size that can fit on the GPU. You must retune your
model hyperparameters and find an optimal batch size for your case. To save time, use the following
reference tables to look up a batch size that can be a good starting point for your use case.
Note
The batch sizes are local batch sizes that fit into each individual GPU in the respective instance type. You should also adjust the learning rate when changing the batch size.
PyTorch 1.13.1
Natural language processing (NLP) models
The following models are tested for training jobs for all combinations of single-node and multi-node
with single or multi GPU cores and Automatic Mixed Precision (AMP) as indicated.
Single-node/multi-node single-GPU/multi-GPU
Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP) as indicated.
Single/multi-node single/multi-GPU
PyTorch 1.12.0
The following models are tested for training jobs for all combinations of single-node and multi-node
with single or multi GPU cores and Automatic Mixed Precision (AMP) as indicated.
Single-node/multi-node single-GPU/multi-GPU
TensorFlow 2.11.0
Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP) as indicated.
Single/multi-node single/multi-GPU
Tested using Transformer models with Sequence_Len=128 and Automatic Mixed Precision (AMP) as
indicated.
Single/multi-node single/multi-GPU
TensorFlow 2.10.0
Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP) as indicated.
Single-node single-GPU/multi-GPU
Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for Training Compiler
DetectionTransformer-ResNet50 | COCO-2017 | ml.g4dn.2xlarge | float32 | 2 | 4
DetectionTransformer-ResNet50 | COCO-2017 | ml.g5.2xlarge | float32 | 3 | 6
DetectionTransformer-ResNet50 | COCO-2017 | ml.p3.2xlarge | float32 | 2 | 4
Tested using Transformer models with Sequence_Len=128 and Automatic Mixed Precision (AMP) as
indicated.
Single-node single-GPU/multi-GPU
TensorFlow 2.9.1
Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP).
Single-node single-GPU/multi-GPU
ml.p3.2xlarge | 56 | 128*
DetectionTransformer-ResNet50 | COCO-2017 | ml.g4dn.2xlarge | 2 | 2
ml.g5.2xlarge | 3 | 6
ml.p3.2xlarge | 2 | 4
ml.p3.16xlarge | 8 | 32
ml.p3.2xlarge | 4 | 6
* The batch sizes marked with the asterisk symbol (*) indicate the largest batch size tested by the
SageMaker Training Compiler developer team. For the marked cells, the instance may be able to fit a
larger batch size than what is indicated.
Single-node single-GPU
ml.g5.2xlarge 1 18 40
ml.p3.2xlarge 1 14 32
ml.p3.2xlarge 1 16 20
ml.p3.2xlarge 1 16 24
ml.p3.2xlarge 1 32 48
ml.g5.2xlarge 1 12 28
ml.p3.2xlarge 1 6 16
ml.p3.2xlarge 1 24 40
ml.p3.2xlarge 1 4 10
ml.g5.2xlarge 1 6 16
ml.p3.2xlarge 1 4 10
wikitext-103-v1 ml.p4d.24xlarge 4 13 25
ml.g5.2xlarge 1 24 36
ml.p3.2xlarge 1 12 20
Single-node single-GPU
wikitext-103-v1 ml.p4d.24xlarge 4 36 64
ml.p3.2xlarge 1 2 8
8 32 64
16 32 64
Single-node single-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | ml.p3.2xlarge | 14 | 28
albert-base-v2 | ml.g4dn.2xlarge | 14 | 24
bert-base-cased | ml.p3.2xlarge | 16 | 24
bert-base-cased | ml.g4dn.2xlarge | 12 | 24
bert-base-uncased | ml.p3.2xlarge | 16 | 24
bert-base-uncased | ml.g4dn.2xlarge | 12 | 28
camembert-base | ml.p3.2xlarge | 12 | 24
camembert-base | ml.g4dn.2xlarge | 12 | 28
distilbert-base-uncased | ml.p3.2xlarge | 28 | 48
distilbert-base-uncased | ml.g4dn.2xlarge | 24 | 52
distilgpt2 | ml.p3.2xlarge | 6 | 12
distilgpt2 | ml.g4dn.2xlarge | 6 | 14
distilroberta-base | ml.p3.2xlarge | 20 | 40
distilroberta-base | ml.g4dn.2xlarge | 12 | 40
Single-node single-GPU
EleutherAI/gpt- ml.p3.2xlarge 2 10
neo-125M
ml.g4dn.2xlarge 2 8
facebook/bart-base ml.p3.2xlarge 2 6
ml.g4dn.2xlarge 2 6
gpt2 ml.p3.2xlarge 4 8
ml.g4dn.2xlarge 2 8
roberta-base ml.p3.2xlarge 12 20
ml.g4dn.2xlarge 12 20
xlnet-base-cased ml.p3.2xlarge 2 8
ml.g4dn.2xlarge 4 6
Single-node single-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | ml.p3.2xlarge | 12 | 32
bert-base-cased | ml.p3.2xlarge | 14 | 24
bert-base-chinese | ml.p3.2xlarge | 16 | 24
bert-base-multilingual-cased | ml.p3.2xlarge | 4 | 16
bert-base-multilingual-uncased | ml.p3.2xlarge | 8 | 16
bert-base-uncased | ml.p3.2xlarge | 12 | 24
cl-tohoku/bert-base-japanese-whole-word-masking | ml.p3.2xlarge | 12 | 24
cl-tohoku/bert-base-japanese | ml.p3.2xlarge | 12 | 24
distilbert-base-uncased | ml.p3.2xlarge | 28 | 32
distilbert-base-uncased-finetuned-sst-2-english | ml.p3.2xlarge | 28 | 32
distilgpt2 | ml.p3.2xlarge | 16 | 32
facebook/bart-base | ml.p3.2xlarge | 4 | 8
gpt2 | ml.p3.2xlarge | 6 | 20
nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large | ml.p3.2xlarge | 20 | 32
roberta-base | ml.p3.2xlarge | 12 | 20
Single-node multi-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
bert-base-chinese | ml.p3.8xlarge | 16 | 26
bert-base-multilingual-cased | ml.p3.8xlarge | 6 | 16
bert-base-multilingual-uncased | ml.p3.8xlarge | 6 | 16
bert-base-uncased | ml.p3.8xlarge | 14 | 24
distilbert-base-uncased | ml.p3.8xlarge | 14 | 32
distilgpt2 | ml.p3.8xlarge | 6 | 32
facebook/bart-base | ml.p3.8xlarge | 8 | 16
gpt2 | ml.p3.8xlarge | 8 | 20
roberta-base | ml.p3.8xlarge | 12 | 20
Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
bert-large-uncased | ml.g4dn.16xlarge | 37 | 28
bert-large-uncased | ml.g5.4xlarge | 64 | 55
bert-large-uncased | ml.p3.2xlarge | 40 | 32
gpt2 | ml.g4dn.16xlarge | 89 | 64
gpt2 | ml.p3.2xlarge | 94 | 96
gpt2 | ml.p3.8xlarge | 96 | 88
jplu_tf-xlm-roberta-base | ml.g4dn.16xlarge | 52 | 16
jplu_tf-xlm-roberta-base | ml.g5.4xlarge | 64 | 44
Single-node single-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
bart-base | ml.p3.2xlarge | 12 | 64
bart-large | ml.p3.2xlarge | 4 | 28
bert-base-multilingual-cased | ml.p3.2xlarge | 12 | 64
bert-base-multilingual-uncased | ml.p3.2xlarge | 16 | 96
bert-base-uncased | ml.p3.2xlarge | 16 | 96
bert-large-uncased | ml.p3.2xlarge | 4 | 24
gpt2 | ml.p3.2xlarge | 12 | 64
gpt2-large | ml.p3.2xlarge | 2 | 24
jplu/tf-xlm-roberta-base | ml.p3.2xlarge | 12 | 32
roberta-base | ml.p3.2xlarge | 4 | 64
roberta-large | ml.p3.2xlarge | 4 | 64
t5-base | ml.p3.2xlarge | 64 | 64
Bring Your Own Deep Learning Model

Choose one of the following topics depending on the framework you use.
Topics
• PyTorch (p. 1968)
• TensorFlow (p. 1973)
Note
After you finish preparing your training script, you can run a SageMaker training job using the SageMaker framework estimator classes. For more information, see Enable SageMaker Training Compiler (p. 1975).
PyTorch
Bring your own PyTorch model to SageMaker, and run the training job with SageMaker Training Compiler.
Topics
• PyTorch Models with Hugging Face Transformers (p. 1968)
Topics
• Large Language Models Using the Hugging Face Transformers Trainer Class (p. 1968)
• Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer
API) (p. 1969)
Large Language Models Using the Hugging Face Transformers Trainer Class

If you use the transformers library's Trainer class, you don't need to make any additional changes to your training script. SageMaker Training Compiler automatically compiles your Trainer model if you enable it through the estimator class. The following code shows the basic form of a PyTorch training script with the Hugging Face Trainer API.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(**kwargs)
trainer = Trainer(args=training_args, **kwargs)
Topics
• For single GPU training (p. 1969)
• For distributed training (p. 1969)
1968
Amazon SageMaker Developer Guide
Bring Your Own Deep Learning Model
• Best Practices to Use SageMaker Training Compiler with Trainer (p. 1969)
For distributed training

For PyTorch v1.11.0 and later

To run distributed training with SageMaker Training Compiler, you must add the following _mp_fn() function in your training script and wrap the main() function. It redirects the _mp_fn(index) function calls from the SageMaker distributed runtime for PyTorch (pytorchxla) to the main() function of your training script.
def _mp_fn(index):
main()
This function accepts the index argument to indicate the rank of the current GPU in the cluster
for distributed training. To find more example scripts, see the Hugging Face Transformers language
modeling example scripts.
For Transformers v4.17 and before with PyTorch v1.10.2 and before
SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job, and
you don't need to make any modification in your training script. Instead, SageMaker Training Compiler
requires you to pass a SageMaker distributed training launcher script to the entry_point argument and
pass your training script to the hyperparameters argument in the SageMaker Hugging Face estimator.
Best Practices to Use SageMaker Training Compiler with Trainer

• Make sure that you use SyncFree optimizers by setting the optim argument to adamw_torch_xla while setting up transformers.TrainingArguments. See also Optimizer in the Hugging Face Transformers documentation.
• Ensure that the throughput of the data processing pipeline is higher than the training throughput. You can tweak the dataloader_num_workers and preprocessing_num_workers arguments of the transformers.TrainingArguments class to achieve this. Typically, these need to be greater than or equal to the number of GPUs but less than the number of CPUs.
After you have completed adapting your training script, proceed to the section called “Run PyTorch
Training Jobs with Training Compiler” (p. 1976).
Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer
API)
If you have a training script that uses PyTorch directly, you need to make additional changes to your PyTorch training script to implement PyTorch/XLA. Follow the instructions to modify your script to properly set up the PyTorch/XLA primitives.
Topics
• For single GPU training (p. 1969)
• For distributed training (p. 1970)
• Best Practices to Use SageMaker Training Compiler with PyTorch/XLA (p. 1972)
1. Import the PyTorch/XLA core modules at the top of your training script.

import torch_xla
import torch_xla.core.xla_model as xm

2. Acquire the XLA device.

device = xm.xla_device()

3. If you use Automatic Mixed Precision (AMP), import the PyTorch/XLA AMP module.

import torch_xla.amp

4. Replace the regular optimizer step with the XLA-aware step.

xm.optimizer_step(optimizer)
5. If you're using a distributed dataloader, wrap your dataloader in the PyTorch/XLA's ParallelLoader
class:
import torch_xla.distributed.parallel_loader as pl
parallel_loader = pl.ParallelLoader(dataloader, [device]).per_device_loader(device)
6. Add mark_step at the end of the training loop when you're not using parallel_loader:

xm.mark_step()

7. To checkpoint your trained model, use xm.save():

xm.save(model.state_dict(), path_to_save)
After you have completed adapting your training script, proceed to the section called “Run PyTorch
Training Jobs with Training Compiler” (p. 1976).
In addition to the changes listed in the previous For single GPU training (p. 1969) section, add the following changes to properly distribute the workload across GPUs.

1. Fetch and all-reduce the gradients across workers before the optimizer step.

gradients = xm._fetch_gradients(optimizer)
xm.all_reduce('sum', gradients, scale=1.0/xm.xrt_world_size())
2. If you need to set variables for local_ranks and world_size, use code similar to the following:

local_rank = xm.get_local_ordinal()
world_size = xm.xrt_world_size()
3. For any world_size (num_gpus_per_node * num_nodes) greater than 1, you must define a train sampler, which should look similar to the following:

import torch_xla.core.xla_model as xm

train_sampler = None
if xm.xrt_world_size() > 1:
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=xm.xrt_world_size(),
        rank=xm.get_ordinal(),
        shuffle=True
    )

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    drop_last=args.drop_last,
    shuffle=False if train_sampler else True,
    num_workers=args.num_workers
)
4. Make the following changes to make sure you use the parallel_loader provided by the torch_xla distributed module.

import torch_xla.distributed.parallel_loader as pl
train_device_loader = pl.MpDeviceLoader(train_loader, device)
With all of these changes, you should be able to launch distributed training with any PyTorch model
without the Transformer Trainer API. Note that these instructions can be used for both single-node
multi-GPU and multi-node multi-GPU.
5. For PyTorch v1.11.0 and later
To run distributed training with SageMaker Training Compiler, you must add the following _mp_fn()
function in your training script and wrap the main() function. It redirects the _mp_fn(index)
function calls from the SageMaker distributed runtime for PyTorch (pytorchxla) to the main()
function of your training script.
def _mp_fn(index):
main()
This function accepts the index argument to indicate the rank of the current GPU in the cluster
for distributed training. To find more example scripts, see the Hugging Face Transformers language
modeling example scripts.
For Transformers v4.17 and before with PyTorch v1.10.2 and before
SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job
and requires you to pass a SageMaker distributed training launcher script to the entry_point
argument and pass your training script to the hyperparameters argument in the SageMaker
Hugging Face estimator.
After you have completed adapting your training script, proceed to the section called “Run PyTorch
Training Jobs with Training Compiler” (p. 1976).
If you want to leverage the SageMaker Training Compiler on your native PyTorch training script, you may
want to first get familiar with PyTorch on XLA devices. The following sections list some best practices to
enable XLA for PyTorch.
Note
This section for best practices assumes that you use the following PyTorch/XLA modules:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
Understand the lazy mode in PyTorch/XLA

One significant difference between PyTorch/XLA and native PyTorch is that the PyTorch/XLA system
runs in lazy mode while the native PyTorch runs in eager mode. Tensors in lazy mode are placeholders
for building the computational graph until they are materialized after the compilation and evaluation
are complete. The PyTorch/XLA system builds the computational graph on the fly when you call PyTorch
APIs to build the computation using tensors and operators. The computational graph gets compiled
and executed when xm.mark_step() is called explicitly or implicitly by pl.MpDeviceLoader/
pl.ParallelLoader, or when you explicitly request the value of a tensor such as by calling
loss.item() or print(loss).
Minimize the number of compilation-and-executions

For best performance, you should keep in mind the possible ways to initiate compilation-and-executions
as described in Understand the lazy mode in PyTorch/XLA (p. 1972) and should try to minimize the
number of compilation-and-executions. Ideally, only one compilation-and-execution is necessary per
training iteration and is initiated automatically by pl.MpDeviceLoader/pl.ParallelLoader. The
MpDeviceLoader is optimized for XLA and should always be used if possible for best performance.
During training, you might want to examine some intermediate results such as loss values. In such case,
the printing of lazy tensors should be wrapped using xm.add_step_closure() to avoid unnecessary
compilation-and-executions.
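For example, to log the loss without forcing an extra compilation-and-execution (assuming a loss tensor from the training loop):

# The closure runs at the end of the step, after the tensor is materialized.
xm.add_step_closure(lambda loss_value: print(f"loss: {loss_value}"), args=(loss,))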
Training in Automatic Mixed Precision (AMP) mode significantly accelerates your training speed
by leveraging the Tensor cores of NVIDIA GPUs. SageMaker Training Compiler provides syncfree
optimizers that are optimized for XLA to improve AMP performance. Currently, the following three
syncfree optimizers are available and should be used if possible for best performance.
• torch_xla.amp.syncfree.SGD
• torch_xla.amp.syncfree.Adam
• torch_xla.amp.syncfree.AdamW
These syncfree optimizers should be paired with torch_xla.amp.GradScaler for gradient scaling/
unscaling.
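As a minimal sketch, assuming a model already placed on an XLA device and an illustrative learning rate, pairing a syncfree optimizer with the gradient scaler looks like the following:

import torch_xla.amp
from torch_xla.amp.syncfree import AdamW

# `model` and the learning rate are placeholders for your own setup.
optimizer = AdamW(model.parameters(), lr=5e-5)
scaler = torch_xla.amp.GradScaler()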
Tip
Starting with PyTorch v1.13.1, SageMaker Training Compiler improves performance by letting
PyTorch/XLA automatically override the optimizers (such as SGD, Adam, AdamW)
in torch.optim or transformers.optimization with their syncfree versions
in torch_xla.amp.syncfree (such as torch_xla.amp.syncfree.SGD,
torch_xla.amp.syncfree.Adam, torch_xla.amp.syncfree.AdamW). You don't need to
change the code lines where you define optimizers in your training script.
TensorFlow
Bring your own TensorFlow model to SageMaker, and run the training job with SageMaker Training
Compiler.
TensorFlow Models
SageMaker Training Compiler automatically optimizes model training workloads that are built on top of
the native TensorFlow API or the high-level Keras API.
Tip
For preprocessing your input dataset, ensure that you use a static input shape. Dynamic input
shape can initiate recompilation of the model and might increase total training time.
For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras
(tf.keras.Model).
Without Keras
SageMaker Training Compiler does not support eager execution in TensorFlow. Accordingly, you should
wrap your model and training loops with the TensorFlow function decorator (@tf.function) to
leverage compiler acceleration.
SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure
your TensorFlow functions are set to run in graph mode.
TensorFlow 2.0 and later have eager execution on by default, so you should add the @tf.function
decorator in front of every function that you use for constructing a TensorFlow model.
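As an illustration, a custom training step wrapped for graph mode might look like the following sketch; model, loss_fn, and optimizer are assumed to be defined elsewhere in your script:

import tensorflow as tf

@tf.function
def train_step(inputs, labels):
    # Runs in graph mode because of the @tf.function decorator.
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)
        loss = loss_fn(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss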
SageMaker Training Compiler automatically optimizes model training workloads that are built on top of
the native TensorFlow API or the high-level Keras API, such as the TensorFlow transformer models.
Tip
When you create a tokenizer for an NLP model using Transformers in your training script, make
sure that you use a static input tensor shape by specifying padding='max_length'. Do not
use padding='longest' because padding to the longest sequence in the batch can change
the tensor shape for each training batch. The dynamic input shape can initiate recompilation of
the model and might increase total training time. For more information about padding options
of the Transformers tokenizers, see Padding and truncation in the Hugging Face Transformers
documentation.
Topics
• Using Keras (p. 1974)
• Without Keras (p. 1975)
Using Keras
For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras
(tf.keras.Model). As noted in the Quick tour page in the Hugging Face Transformers documentation, you
can use the models as regular TensorFlow Keras models.
SageMaker Training Compiler acceleration works transparently for multi-GPU workloads when the model
is constructed and trained using Keras APIs within the scope of tf.distribute.Strategy.scope()
call.
a. For single-node multi-GPU, use the following mirrored strategy.

strategy = tf.distribute.MirroredStrategy()

b. For multi-node multi-GPU, add the following code to properly set the TensorFlow distributed
training configuration before creating the strategy.
import json
import os

def set_sm_dist_config():
    DEFAULT_PORT = '8890'
    DEFAULT_CONFIG_FILE = '/opt/ml/input/config/resourceconfig.json'
    with open(DEFAULT_CONFIG_FILE) as f:
        config = json.loads(f.read())
    current_host = config['current_host']
    tf_config = {
        'cluster': {
            'worker': []
        },
        'task': {'type': 'worker', 'index': -1}
    }
    for i, host in enumerate(config['hosts']):
        tf_config['cluster']['worker'].append("%s:%s" % (host, DEFAULT_PORT))
        if current_host == host:
            tf_config['task']['index'] = i
    os.environ['TF_CONFIG'] = json.dumps(tf_config)

set_sm_dist_config()

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    # create a model and do fit
Without Keras
If you want to bring custom models with custom training loops using TensorFlow without Keras, you
should wrap the model and the training loop with the TensorFlow function decorator (@tf.function)
to leverage compiler acceleration.
SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure
your TensorFlow functions are set to run in graph mode.
TensorFlow 2.0 and later have eager execution on by default, so you should add the @tf.function
decorator in front of every function that you use for constructing a TensorFlow model.
In addition to the changes needed for Using Keras for distributed training, you need to ensure that
functions to be run on each GPU are annotated with @tf.function, while cross-GPU communication
functions are not annotated. An example training code should look like the following:
@tf.function()
def compiled_step(inputs, outputs):
    with tf.GradientTape() as tape:
        pred = model(inputs, training=True)
        total_loss = loss_object(outputs, pred) / args.batch_size
    gradients = tape.gradient(total_loss, model.trainable_variables)
    return total_loss, pred, gradients

def train_step(inputs, outputs):
    total_loss, pred, gradients = compiled_step(inputs, outputs)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss.update_state(total_loss)
    train_accuracy.update_state(outputs, pred)

@tf.function()
def train_step_dist(inputs, outputs):
    strategy.run(train_step, args=(inputs, outputs))
Note that this instruction can be used for both single-node multi-GPU and multi-node multi-GPU.
Topics
• Run PyTorch Training Jobs with SageMaker Training Compiler (p. 1976)
• Run TensorFlow Training Jobs with SageMaker Training Compiler (p. 1983)
Topics
• Using the SageMaker Python SDK (p. 1976)
• Using the SageMaker CreateTrainingJob API Operation (p. 1983)
For information that fits your use case, see one of the following options.
To compile and train a PyTorch model, configure a SageMaker PyTorch estimator with SageMaker
Training Compiler as shown in the following code example.
Note
This native PyTorch support is available in the SageMaker Python SDK v2.120.0 and later.
Make sure that you update the SageMaker Python SDK.
from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the learning rate in proportion to the batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='train.py',
    source_dir='path-to-requirements-file',  # Optional. Add this if you need to install additional packages.
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()
To compile and train a transformer model with PyTorch, configure a SageMaker Hugging Face
estimator with SageMaker Training Compiler as shown in the following code example.
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the learning rate in proportion to the batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
• For single GPU training (p. 1969) of a PyTorch model using Hugging Face Transformers' Trainer
API
• For single GPU training (p. 1969) of a PyTorch model without Hugging Face Transformers' Trainer
API
• Compile and Train a Hugging Face Transformers Trainer Model for Question and Answering with
the SQuAD dataset
• Compile and Train a Hugging Face Transformer BERT Model with the SST Dataset using SageMaker
Training Compiler
• Compile and Train a Binary Classification Trainer Model with the SST2 Dataset for Single-Node
Single-GPU Training
PyTorch v1.12
For PyTorch v1.12, you can run distributed training with SageMaker Training Compiler by adding
the pytorch_xla option to the distribution parameter of the SageMaker PyTorch
estimator class.
Note
This native PyTorch support is available in the SageMaker Python SDK v2.121.0 and later.
Make sure that you update the SageMaker Python SDK.
from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update the global learning rate in proportion to the global batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='your_training_script.py',
    source_dir='path-to-requirements-file',  # Optional. Add this if you need to install additional packages.
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()
Tip
To prepare your training script, see PyTorch (p. 1968)
Transformers v4.21 with PyTorch v1.11
For PyTorch v1.11 and later, SageMaker Training Compiler is available for distributed training with
the pytorch_xla option specified in the distribution parameter.
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update the global learning rate in proportion to the global batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='your_training_script.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
Tip
To prepare your training script, see the following pages.
• For distributed training (p. 1969) of a PyTorch model using Hugging Face Transformers'
Trainer API
• For distributed training (p. 1970) of a PyTorch model without Hugging Face Transformers'
Trainer API
For the supported version of PyTorch v1.10.2 and before, SageMaker Training Compiler requires an
alternate mechanism for launching a distributed training job. To run distributed training, SageMaker
Training Compiler requires you to pass a SageMaker distributed training launcher script to the
entry_point argument, and pass your training script to the hyperparameters argument. The
following code example shows how to configure a SageMaker Hugging Face estimator applying the
required changes.
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update the global learning rate in proportion to the global batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

training_script="your_training_script.py"

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate,
    "training_script": training_script  # Specify the file name of your training script.
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='distributed_training_launcher.py',  # Specify the distributed training launcher script.
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
The launcher script should look like the following. It wraps your training script and configures the
distributed training environment depending on the size of the training instance of your choice.
#!/bin/python
# distributed_training_launcher.py

import subprocess
import sys

if __name__ == "__main__":
    arguments_command = " ".join([arg for arg in sys.argv[1:]])
    """
    The following line takes care of setting up an inter-node communication
    as well as managing intra-node workers for each GPU.
    """
    subprocess.check_call(
        "python -m torch_xla.distributed.sm_dist " + arguments_command,
        shell=True,
    )
Tip
To prepare your training script, see the following pages.
• For distributed training (p. 1969) of a PyTorch model using Hugging Face Transformers'
Trainer API
• For distributed training (p. 1970) of a PyTorch model without Hugging Face Transformers'
Trainer API
Tip
To find end-to-end examples, see the following notebooks:
• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Single-Node Multi-GPU Training
• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Multi-Node Multi-GPU Training
The following list is the minimal set of parameters required to run a SageMaker training job with the
compiler.
Note
When using the SageMaker Hugging Face estimator, you must specify the
transformers_version, pytorch_version, hyperparameters, and compiler_config
parameters to enable SageMaker Training Compiler. You cannot use image_uri to manually
specify the Training Compiler integrated Deep Learning Containers that are listed at Supported
Frameworks (p. 1950).
• entry_point (str) – Required. Specify the file name of your training script.
Note
To run a distributed training with SageMaker Training Compiler and PyTorch v1.10.2 and
before, specify the file name of a launcher script to this parameter. The launcher script should
be prepared to wrap your training script and configure the distributed training environment.
For more information, see the following example notebooks:
• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Single-Node Multi-GPU Training
• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Multi-Node Multi-GPU Training
• source_dir (str) – Optional. Add this if you need to install additional packages. To install packages,
you need to prepare a requirements.txt file under this directory.
• instance_count (int) – Required. Specify the number of instances.
• instance_type (str) – Required. Specify the instance type.
• transformers_version (str) – Required only when using the SageMaker Hugging Face estimator.
Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To
find available versions, see Supported Frameworks (p. 1950).
Warning
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training
Compiler. We recommend that you turn off Debugger when running SageMaker Training
Compiler to make sure there's no impact on performance. For more information, see the section
called “Considerations” (p. 1990). To turn the Debugger functionalities off, add the following
two arguments to the estimator:
disable_profiler=True,
debugger_hook_config=False
If the training job with the compiler is launched successfully, you receive the following logs during the
job initialization phase:
• With TrainingCompilerConfig(debug=False)
• With TrainingCompilerConfig(debug=True)
"AlgorithmSpecification": {
"TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},
"HyperParameters": {
"sagemaker_training_compiler_enabled": "true",
"sagemaker_training_compiler_debug_mode": "false",
"sagemaker_pytorch_xla_multi_worker_enabled": "false" // set to "true" for
distributed training
}
To find a complete list of deep learning container image URIs that have SageMaker Training Compiler
implemented, see Supported Frameworks (p. 1950).
Topics
• Using the SageMaker Python SDK (p. 1983)
• Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning
Containers (p. 1987)
• Enable SageMaker Training Compiler Using the SageMaker CreateTrainingJob API
Operation (p. 1989)
For information that fits your use case, see one of the following options.
TensorFlow
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the learning rate in proportion to the batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_estimator=TensorFlow(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.9.1',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_estimator.fit()
• For single GPU training (p. 1973) of a model constructed using TensorFlow Keras (tf.keras.*).
• For single GPU training (p. 1973) of a model constructed using TensorFlow modules (tf.*
excluding the TensorFlow Keras modules).
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the learning rate in proportion to the batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()
• For single GPU training (p. 1974) of a TensorFlow Keras model with Hugging Face Transformers
• For single GPU training (p. 1975) of a TensorFlow model with Hugging Face Transformers
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update the global learning rate in proportion to the global batch size increase
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()
Tip
To prepare your training script, see the following pages.
• For distributed training (p. 1974) of a TensorFlow Keras model with Hugging Face
Transformers
• For distributed training (p. 1975) of a TensorFlow model with Hugging Face Transformers
The following list is the minimal set of parameters required to run a SageMaker training job with the
compiler.
Note
When using the SageMaker Hugging Face estimator, you must specify the
transformers_version, tensorflow_version, hyperparameters, and
compiler_config parameters to enable SageMaker Training Compiler. You cannot use
image_uri to manually specify the Training Compiler integrated Deep Learning Containers that
are listed at Supported Frameworks (p. 1950).
• entry_point (str) – Required. Specify the file name of your training script.
• instance_count (int) – Required. Specify the number of instances.
• instance_type (str) – Required. Specify the instance type.
• transformers_version (str) – Required only when using the SageMaker Hugging Face estimator.
Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To
find available versions, see Supported Frameworks (p. 1950).
• framework_version or tensorflow_version (str) – Required. Specify the TensorFlow
version supported by SageMaker Training Compiler. To find available versions, see Supported
Frameworks (p. 1950).
Note
When using the SageMaker TensorFlow estimator, you must specify framework_version.
When using the SageMaker Hugging Face estimator, you must specify both
transformers_version and tensorflow_version.
• hyperparameters (dict) – Optional. Specify hyperparameters for the training job, such as n_gpus,
batch_size, and learning_rate. When you enable SageMaker Training Compiler, try larger batch
sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted
batch sizes to improve training speed, see the section called “Tested Models” (p. 1952) and SageMaker
Training Compiler Example Notebooks and Blogs (p. 1989).
• compiler_config (TrainingCompilerConfig object) – Required. Include this parameter to turn on
SageMaker Training Compiler. The following are parameters for the TrainingCompilerConfig class.
• enabled (bool) – Optional. Specify True or False to turn on or turn off SageMaker Training
Compiler. The default value is True.
• debug (bool) – Optional. To receive more detailed training logs from your compiler-accelerated
training jobs, change it to True. However, the additional logging might add overhead and slow
down the compiled training job. The default value is False.
Warning
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training
Compiler. We recommend that you turn off Debugger when running SageMaker Training
Compiler to make sure there's no impact on performance. For more information, see the section
called “Considerations” (p. 1990). To turn the Debugger functionalities off, add the following
two arguments to the estimator:
disable_profiler=True,
debugger_hook_config=False
If the training job with the compiler is launched successfully, you receive the following logs during the
job initialization phase:
• With TrainingCompilerConfig(debug=False)
• With TrainingCompilerConfig(debug=True)
Using the SageMaker Python SDK and Extending SageMaker Framework Deep
Learning Containers
AWS Deep Learning Containers (DLC) for TensorFlow use adapted versions of TensorFlow that include
changes on top of the open-source TensorFlow framework. The SageMaker Framework Deep Learning
Containers are optimized for the underlying AWS infrastructure and Amazon SageMaker. Building on
the DLCs, the SageMaker Training Compiler integration adds further performance improvements over
native TensorFlow. Furthermore, you can create a custom training container by extending the DLC
image.
Note
This Docker customization feature is currently available only for TensorFlow.
To extend and customize the SageMaker TensorFlow DLCs for your use-case, use the following
instructions.
Create a Dockerfile
Use the following Dockerfile template to extend the SageMaker TensorFlow DLC. You must use the
SageMaker TensorFlow DLC image as the base image of your Docker container. To find the SageMaker
TensorFlow DLC image URIs, see Supported Frameworks.
ENV PATH="/opt/ml/code:${PATH}"
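The preceding line is only a fragment of the template. A fuller sketch, assuming the standard SageMaker training toolkit environment variables and a hypothetical train.py training script, might look like the following; replace the base image placeholders with a DLC URI from Supported Frameworks.

# Hedged sketch of an extended SageMaker TensorFlow DLC Dockerfile.
# The <aws-region>, <image-tag>, and train.py values are placeholders.
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/tensorflow-training:<image-tag>

# Install additional packages here without changing the TensorFlow version.
# RUN pip install <your-package>

ENV PATH="/opt/ml/code:${PATH}"

# The SageMaker training toolkit uses these to locate and run your script.
COPY train.py /opt/ml/code/train.py
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM train.py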
For more information, see Step 2: Create and upload the Dockerfile and Python training scripts.
• Do not explicitly uninstall or change the version of TensorFlow packages in SageMaker containers.
Doing so causes the AWS optimized TensorFlow packages to be overwritten by open-source
TensorFlow packages, which might result in performance degradation.
• Watch out for packages that have a particular TensorFlow version or flavor as a dependency. These
packages might implicitly uninstall the AWS optimized TensorFlow and install open-source TensorFlow
packages.
For example, there’s a known issue that the tensorflow/models and tensorflow/text libraries always
attempt to reinstall open-source TensorFlow. If you need to install these libraries to choose a specific
version for your use case, we recommend that you look into the SageMaker TensorFlow DLC Dockerfiles
for v2.9 or later. The paths to the Dockerfiles are typically in the following format: tensorflow/
training/docker/<tensorflow-version>/py3/<cuda-version>/Dockerfile.gpu. In the
Dockerfiles, you should find the code lines that reinstall the AWS managed TensorFlow binary (specified
by the TF_URL environment variable) and other dependencies, in order. Replicate that reinstallation
section in your custom Dockerfile so that the AWS optimized TensorFlow binary is restored after such
packages are installed.
Build your Docker container and push it to Amazon ECR.
Use the SageMaker TensorFlow framework estimator as usual. You must specify image_uri to use the
new container you hosted in Amazon ECR.
import boto3
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'tf-custom-container-test'
tag = ':latest'
region = boto3.session.Session().region_name
uri_suffix = 'amazonaws.com'

byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(
    account_id, region, uri_suffix, ecr_repository + tag
)
byoc_image_uri
# This should return something like
# 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest

estimator = TensorFlow(
    image_uri=byoc_image_uri,
    role=get_execution_role(),
    base_job_name='tf-custom-container-test-job',
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

# Start training
estimator.fit()
"AlgorithmSpecification": {
"TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},
"HyperParameters": {
"sagemaker_training_compiler_enabled": "true",
"sagemaker_training_compiler_debug_mode": "false"
}
To find a complete list of deep learning container image URIs that have SageMaker Training Compiler
implemented, see Supported Frameworks (p. 1950).
Example notebooks are provided in the SageMaker examples GitHub repository, and you can also browse
them on the SageMaker examples website.
Example Notebooks
To find examples of using SageMaker Training Compiler, see the Training Compiler page in the Amazon
SageMaker Example Read the Docs website.
Best Practices
Use the following guidelines to achieve the best results when you run training jobs with SageMaker
Training Compiler.
• Make sure that you use one of the Supported Instance Types (p. 1951) and Tested Models (p. 1952).
• When you create a tokenizer for an NLP model using the Hugging Face Transformers library
in your training script, make sure that you use a static input tensor shape by specifying
padding='max_length'. Do not use padding='longest' because padding to the longest
sequence in the batch can change the tensor shape for each training batch. The dynamic input shape
can initiate recompilation of the model and might increase total training time. For more information
about padding options of the Transformers tokenizers, see Padding and truncation in the Hugging Face
Transformers documentation.
• Measure GPU memory utilization to make sure that you use the maximum batch size that can fit into
the GPU memory. Amazon SageMaker Training Compiler reduces the memory footprint of your model
during training, which typically allows you to fit a larger batch_size in the GPU memory. Using a
larger batch_size results in a better GPU utilization and reduces the total training time.
When you adjust the batch size, you also have to adjust the learning_rate appropriately. For
example, if you increase the batch size by a factor of k, adjust learning_rate linearly (multiply
by k) or by the square root of k. This achieves the same or similar convergence behavior in the
reduced training time; see the learning-rate scaling sketch after this list. For a reference of
batch_size values tested for popular models, see Tested Models (p. 1952).
• To debug the compiler-accelerated training job, enable the debug flag in the compiler_config
parameter. This enables SageMaker to put the debugging logs into SageMaker training job logs.
huggingface_estimator=HuggingFace(
    ...
    compiler_config=TrainingCompilerConfig(debug=True)
)
Note that if you enable full debugging of the training job with the compiler, this might add some
overhead.
• If you bring a PyTorch model and want to checkpoint it, make sure you use PyTorch/XLA's model
save function to properly checkpoint your model; see the checkpointing sketch after this list. For more
information about the function, see torch_xla.core.xla_model.save in the PyTorch on XLA Devices
documentation.
To learn how to add the modifications to your PyTorch script, see Large Language Models Using
PyTorch Directly (without the Hugging Face Transformers Trainer API) (p. 1969).
For more information about the actual application of using the model save function, see Checkpoint
Writing and Loading in the Hugging Face on PyTorch/XLA TPUs: Faster and cheaper training blog.
• To achieve the most optimal training time for distributed training, consider the following.
• Use instances with multiple GPUs instead of single-GPU instances. For example, a single
ml.p3dn.24xlarge instance has faster training time compared to 8 x ml.p3.2xlarge instances.
• Use instances with EFA support such as ml.p3dn.24xlarge and ml.p4d.24xlarge. These
instance types have accelerated networking speed and reduce training time.
• Tune the preprocessing_num_workers parameter for datasets, so that model training is not
delayed by slow preprocessing.
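The following is a minimal sketch of the learning-rate scaling described in the list above; all values are illustrative, not recommendations:

# Scale the learning rate when increasing the batch size with the compiler.
batch_size_native = 16        # assumed max batch size without the compiler
learning_rate_native = 5e-5   # learning rate tuned for the native batch size

batch_size = 64               # larger batch size that fits with the compiler
k = batch_size / batch_size_native

learning_rate = learning_rate_native * k          # linear scaling
# learning_rate = learning_rate_native * k ** 0.5 # or square-root scaling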
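Checkpointing with the PyTorch/XLA model save function might look like the following sketch; the model object and the checkpoint path are assumptions:

import torch_xla.core.xla_model as xm

# `model` and the output path are placeholders for your own training job.
checkpoint = {"state_dict": model.state_dict()}
xm.save(checkpoint, "/opt/ml/checkpoints/checkpoint.pt")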
Considerations
Consider the following when using SageMaker Training Compiler.
• Avoid unnecessary explicit evaluations of tensors during training. The compiler optimizes lazy tensor
code by fusing operations. For example, consider the following lines of code:

a = b+c
e = a+d

A compiler interprets the code as follows and reduces the memory footprint for the variable a:

e = b+c+d

Now consider the following case in which the code is changed to add a print function for the variable
a:

a = b+c
e = a+d
print(a)

The compiler makes an explicit evaluation of the variable a as follows:

e = b+c+d
a = b+c    # Explicit evaluation
print(a)
In PyTorch, for example, avoid using torch.Tensor.item(), which introduces an explicit evaluation. In
deep learning, such explicit evaluations can cause overhead because they break fused operations in a
compilation graph of a model and lead to recomputation of the tensors.
If you still want to periodically evaluate the model during training while using SageMaker Training
Compiler, we recommend logging and checkpointing at a lower frequency to reduce overhead due to
explicit evaluations. For example, log every 10 epochs instead of every epoch.
• Graph compilation runs during the first few steps of training. As a result, the first few steps are
expected to be exceptionally slow. However, this is a one-time compilation cost and can be amortized
by training for a longer duration because compilation makes future steps much faster. The initial
compilation overhead depends on the size of the model, the size of the input tensors, and the
distribution of input tensor shapes.
• One of the most typical errors when compiling a PyTorch model is due to a wrong device type
for operators and tensors. To properly compile a PyTorch model, make sure you use XLA devices
(xm.xla_device()) instead of using CUDA or mixing CUDA devices and XLA devices; see the sketch at the
end of this section.
• mark_step() is a barrier just for XLA. Failing to set it correctly causes a training job to stall.
• PyTorch/XLA provides additional distributed training APIs. Failing to program the APIs properly causes
gradients to be collected incorrectly, which causes a training convergence failure.
To properly set up your PyTorch script and avoid the aforementioned incorrect API uses, see Large
Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API) (p. 1969).
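As a sketch, placing the model and input tensors on the XLA device, rather than a CUDA device, looks like the following; model and inputs are assumed to be defined elsewhere:

import torch_xla.core.xla_model as xm

device = xm.xla_device()   # use the XLA device, not torch.device("cuda")
model = model.to(device)
inputs = inputs.to(device)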
If you successfully launched your training job with SageMaker Training Compiler, you receive the
following log messages:
• With TrainingCompilerConfig(debug=False)
• With TrainingCompilerConfig(debug=True)
SageMaker Training Compiler supports the most popular deep learning models from the Hugging
Face transformers library. With most of the operators that the compiler supports, these models can
be trained faster with SageMaker Training Compiler. Compilable models include but are not limited
to the following: bert-base-cased, bert-base-chinese, bert-base-uncased, distilbert-
base-uncased, distilbert-base-uncased-finetuned-sst-2-english, gpt2, roberta-base,
roberta-large, t5-base, and xlm-roberta-base. The compiler works with most DL operators and
data structures and can accelerate many other DL models beyond those that have been tested.
Q. What happens if I enable SageMaker Training Compiler with a model that isn't tested?
For an untested model, you might need to first modify the training script to be compatible with
SageMaker Training Compiler. For more information, see Bring Your Own Deep Learning Model (p. 1967)
and follow the instructions on how to prepare your training script.
Once you have updated your training script, you can start the training job. The compiler proceeds to
compile the model. However, training speed may not increase and might even decrease relative to the
baseline with an untested model. You might need to retune training parameters such as batch_size
and learning_rate to achieve any speedup benefits.
If compilation of the untested model fails, the compiler returns an error. See SageMaker Training
Compiler Troubleshooting (p. 1993) for detailed information about the failure types and error messages.
Q. Will I always get a faster training job with SageMaker Training Compiler?
No, not necessarily. First, SageMaker Training Compiler adds some compilation overhead before the
ongoing training process can be accelerated. The optimized training job must run sufficiently long to
amortize and make up for this incremental compilation overhead at the beginning of the training job.
Additionally, as with any model training process, training with suboptimal parameters can increase
training time. SageMaker Training Compiler can change the characteristics of the training job by, for
example, changing the memory footprint of the job. Because of these differences, you might need
to retune your training job parameters to speed up training. A reference table specifying the best
performing parameters for training jobs with different instance types and models can be found at Tested
Models (p. 1952).
Finally, some code in a training script might add additional overhead or disrupt the compiled
computation graph and slow training. If working with a customized or untested model, see the
instructions at Best Practices to Use SageMaker Training Compiler with PyTorch/XLA (p. 1972).
Q. Can I always use a larger batch size with SageMaker Training Compiler?
Batch size increases in most, but not all, cases. The optimizations made by SageMaker Training Compiler
can change the characteristics of your training job, such as the memory footprint. Typically, a Training
Compiler job occupies less memory than an uncompiled training job with the native framework, which
allows for a larger batch size during training. A larger batch size, and a corresponding adjustment to the
learning rate, increases training throughput and can decrease total training time.
However, there could be cases where SageMaker Training Compiler might actually increase memory
footprint based on its optimization scheme. The compiler uses an analytical cost model to predict the
execution schedule with the lowest cost of execution for any compute-intensive operator. This model
could find an optimal schedule that increases memory use. In this case, you won’t be able to increase
batch sizes, but your sample throughput is still higher.
Q. Does SageMaker Training Compiler work with other SageMaker training features, such as the
SageMaker distributed training libraries and SageMaker Debugger?
SageMaker Training Compiler is currently not compatible with SageMaker’s distributed training libraries.
SageMaker Training Compiler is compatible with SageMaker Debugger, but Debugger might degrade
computational performance by adding overhead.
Q. Does SageMaker Training Compiler support custom containers (bring your own container)?
SageMaker Training Compiler is provided through AWS Deep Learning Containers, and you can
extend a subset of the containers to customize for your use-case. Containers that are extended from
AWS DLCs are supported by SageMaker Training Compiler. For more information, see Supported
Frameworks and Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning
Containers (p. 1987). If you need further support, reach out to the SageMaker team through AWS
Support or AWS Developer Forums for Amazon SageMaker.
When faced with convergence issues, the first step is to identify whether the issue is limited to
distributed training or stems from single-GPU training. Distributed training with SageMaker Training
Compiler is an extension of single-GPU training with additional steps.
Therefore, any convergence issue in single-GPU training propagates to distributed training with multiple
workers.
A flow chart to troubleshoot convergence issues in training jobs when using SageMaker Training
Compiler. Descriptions are in the following sections.
Training with SageMaker Training Compiler changes the memory footprint of a model. The compiler
intelligently arbitrates between re-use and re-compute, leading to a corresponding increase or
decrease in memory consumption. To leverage this, it is essential to re-tune the batch size and associated
hyperparameters when migrating a training job to SageMaker Training Compiler. However, incorrect
hyperparameter settings often cause oscillation in training loss and possibly slower convergence
as a result. In rare cases, aggressive hyperparameters might result in the model not learning (the
training loss metric doesn’t decrease or returns NaN). To identify whether the convergence issue is due to
the hyperparameters, do a side-by-side test of two training jobs with and without SageMaker Training
Compiler while keeping all the hyperparameters the same.
Check if the torch_xla APIs are properly set up for single-GPU training
If the convergence issue persists with the baseline hyperparameters, you need to check whether there’s
any improper usage of the torch_xla APIs, specifically the ones for updating the model. Fundamentally,
torch_xla continues to accumulate instructions (deferring execution) in the form of a graph until it is
explicitly instructed to run the accumulated graph. The torch_xla.core.xla_model.mark_step()
function facilitates the execution of the accumulated graph. The graph execution should be synchronized
using this function after each model update and before printing and logging any variables. If the
synchronization step is missing, the model might use stale values from memory during prints, logs, and
the subsequent forward passes, instead of using the most recent values that must be synchronized after
every iteration and model update.
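A minimal single-GPU loop with the synchronization step might look like the following sketch; model, loader, optimizer, and loss_fn are assumed to be defined elsewhere:

import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = model.to(device)

for inputs, targets in loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # execute the accumulated graph before printing or logging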
It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly
from the use of AMP) or gradient clipping techniques. With AMP, gradients must be computed on the
scaled loss, unscaled, clipped, applied, and then synchronized, in that order. To find the right APIs for
these operations, see the guide for migrating your training script to SageMaker Training Compiler; a
hedged sketch follows.
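The following sketch shows one such ordering, assuming a syncfree optimizer and a loss computed in an autocast region:

import torch
import torch_xla.amp
import torch_xla.core.xla_model as xm

# `model`, `optimizer` (a syncfree optimizer), and `loss` are assumptions.
scaler = torch_xla.amp.GradScaler()

scaler.scale(loss).backward()    # 1. compute gradients on the scaled loss
scaler.unscale_(optimizer)       # 2. unscale gradients before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 3. clip
scaler.step(optimizer)           # 4. apply the model update (skipped on overflow)
scaler.update()                  # 5. adjust the loss scale
xm.mark_step()                   # 6. synchronize the graph execution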
If the convergence issue arises when re-tuning the batch size and associated hyperparameters such as
the learning rate while using SageMaker Training Compiler, consider using Automatic Model Tuning to
tune your hyperparameters. You can refer to the example notebook on tuning hyperparameters with
SageMaker Training Compiler.
If the convergence issue arises when running a distributed training job with multiple workers, ensure
there is a uniform deterministic behavior across all workers by setting a constant seed where applicable.
Beware of techniques such as weight initialization, which involves randomization. Each worker might end
up training a different model in the absence of a constant seed.
Check if the torch_xla APIs are properly set up for distributed training
If the issue still persists, this is likely due to improper use of the torch_xla APIs for distributed training.
Make sure that you add the following parameter in your estimator to set up a cluster for distributed
training with SageMaker Training Compiler:

distribution={'pytorchxla': {'enabled': True}}
This should be accompanied by a function _mp_fn(index) in your training script, which is invoked once
per worker. Without the _mp_fn(index) function, you might end up letting each of the workers train
the model independently without sharing model updates.
Additionally, make sure that you use the torch.utils.data.distributed.DistributedSampler()
API. This ensures that the input data is properly distributed across all workers; a sketch follows.
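A sketch of wiring the sampler into a data loader follows; the dataset object and batch size are assumptions:

import torch
import torch_xla.core.xla_model as xm

# `dataset` and the batch size are placeholders for your own setup.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset,
    num_replicas=xm.xrt_world_size(),  # total number of workers
    rank=xm.get_ordinal(),             # rank of the current worker
)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)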
It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly
from the use of AMP) or gradient clipping techniques. The appropriate order of gradient computation
with AMP is the same as in single-GPU training, with one additional step for synchronizing gradients
across all workers.
XLA requires additional environment variables to compile the training job. The most common missing
environment variable is GPU_NUM_DEVICES. For the compiler to work properly, you must set this
environment variable equal to the number of GPUs per instance.
• Approach 1 – Use the environment argument of the SageMaker estimator class. For example, if you
use an ml.p3.8xlarge instance that has four GPUs, do the following:
hf_estimator=HuggingFace(
    ...
    instance_type="ml.p3.8xlarge",
    hyperparameters={...},
    environment={
        ...
        "GPU_NUM_DEVICES": "4"  # corresponds to the number of GPUs on the specified instance
    },
)
• Approach 2 – Use the hyperparameters argument of the SageMaker estimator class and parse it in
your training script.
1. To specify the number of GPUs, add a key-value pair to the hyperparameters argument.
For example, if you use an ml.p3.8xlarge instance that has four GPUs, do the following:
hf_estimator=HuggingFace(
    ...
    entry_point="train.py",
    instance_type="ml.p3.8xlarge",
    hyperparameters={
        ...
        "n_gpus": 4  # corresponds to the number of GPUs on the specified instance
    }
)

hf_estimator.fit()
2. In your training script, parse the n_gpus hyperparameter and specify it as an input for the
GPU_NUM_DEVICES environment variable.
# train.py
import os, argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    ...
    # Data, model, and output directories
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])

    args, _ = parser.parse_known_args()

    os.environ["GPU_NUM_DEVICES"] = args.n_gpus
• Approach 3 – Hard-code the GPU_NUM_DEVICES environment variable in your training script. For
example, add the following to your script if you use an instance that has four GPUs.
# train.py
import os

os.environ["GPU_NUM_DEVICES"] = "4"  # the value must be a string
Tip
To find the number of GPU devices on machine learning instances that you want to use, see
Accelerated Computing in the Amazon EC2 Instance Types page.
Bug Fixes
• Fixed a race condition issue on GPU that was causing NaN loss in some models, such as vision
transformer (ViT) models.
Other Changes
• PyTorch v1.13.1
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-trcomp-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see
Supported Frameworks, AWS Regions, Instance Types, and Tested Models (p. 1949).
For a more detailed list of breaking changes from the optimizer changes, see the official TensorFlow
v2.11.0 release notes in the TensorFlow GitHub repository.
This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
• TensorFlow v2.11.0
763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-
ubuntu20.04-sagemaker
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see
Supported Frameworks, AWS Regions, Instance Types, and Tested Models (p. 1949).
• Fixed the seed for PyTorch training jobs starting with PyTorch v1.12 to ensure that there is no
discrepancy in model initialization across different processes. See also PyTorch Reproducibility.
• Fixed the issue that caused PyTorch distributed training jobs on G4dn and G5 instances not to default
to communication through PCIe.
Known Issues