
Amazon SageMaker

Developer Guide

Amazon SageMaker: Developer Guide


Copyright © 2023 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents
What Is Amazon SageMaker? ............................................................................................................... 1
Amazon SageMaker Pricing ......................................................................................................... 1
Are You a First-time User of Amazon SageMaker? .......................................................................... 1
How It Works .................................................................................................................... 2
SageMaker Features ................................................................................................................... 2
New features ..................................................................................................................... 2
Machine learning environments ........................................................................................... 3
Major features ................................................................................................................... 3
Machine Learning with Amazon SageMaker ................................................................................... 5
Explore, Analyze, and Process Data .............................................................................................. 7
Fairness and Model Explainability ................................................................................................. 8
Best Practices for Evaluating Fairness and Explainability in the ML Lifecycle ............................... 8
Sample Notebooks ............................................................................................................. 9
Guide to the SageMaker Clarify Documentation ................................................................... 10
Model Training ......................................................................................................................... 10
Model Deployment ................................................................................................................... 13
Validating Models ..................................................................................................................... 13
Model Monitoring ..................................................................................................................... 14
ML Frameworks and Toolkits ..................................................................................................... 15
Apache MXNet ................................................................................................................. 15
Apache Spark ................................................................................................................... 16
Chainer ........................................................................................................................... 24
Hugging Face ................................................................................................................... 25
PyTorch ........................................................................................................................... 27
R .................................................................................................................................... 28
Scikit-learn ...................................................................................................................... 30
SparkML Serving .............................................................................................................. 31
TensorFlow ...................................................................................................................... 32
Triton Inference Server ...................................................................................................... 32
Supported Regions and Quotas .................................................................................................. 33
Quotas ............................................................................................................................ 34
Get Started ..................................................................................................................................... 35
Set Up Amazon SageMaker Prerequisites ..................................................................................... 35
Create an AWS Account ..................................................................................................... 35
Create an Administrative User and Group ............................................................................ 36
AWS CLI Prerequisites ....................................................................................................... 37
Onboard to Domain .................................................................................................................. 37
Onboard Using Quick setup ............................................................................................... 38
Onboard Using IAM Identity Center .................................................................................... 39
Onboard Using IAM .......................................................................................................... 43
Choose an Amazon VPC .................................................................................................... 46
SageMaker JumpStart ............................................................................................................... 47
Open and use JumpStart .................................................................................................. 47
Solution Templates ........................................................................................................... 50
Foundation Models ........................................................................................................... 58
Task-Specific Models ......................................................................................................... 66
Shared Models and Notebooks ........................................................................................... 79
SageMaker JumpStart Industry: Financial ............................................................................ 83
Get Started with Notebook Instances .......................................................................................... 87
Machine Learning with the SageMaker Python SDK .............................................................. 87
Tutorial Overview ............................................................................................................. 87
Step 1: Create an Amazon SageMaker Notebook Instance ...................................................... 88
Step 2: Create a Jupyter Notebook ..................................................................................... 89
Step 3: Download, Explore, and Transform Data ................................................................... 90

Step 4: Train a Model ....................................................... 94
Step 5: Deploy the Model .................................................................................................. 98
Step 6: Evaluate the Model .............................................................................................. 100
Step 7: Clean Up ............................................................................................................ 103
Machine Learning Environments ....................................................................................................... 105
SageMaker Domain ................................................................................................................. 105
Prerequisites .................................................................................................................. 108
Multiple Domains Overview ............................................................................................. 108
Domain resource isolation ............................................................................................... 110
Setting Defaults for a Domain .......................................................................................... 112
Environment .................................................................................................................. 114
View and Edit Domains ................................................................................................... 114
Delete a Domain ............................................................................................................ 116
Domain User Profiles ...................................................................................................... 118
IAM Identity Center Groups in a Domain ........................................................................... 122
Collaborate with shared spaces ........................................................................................ 123
SageMaker Studio ................................................................................................................... 128
Studio Features .............................................................................................................. 128
UI Overview ................................................................................................................... 129
Launch Amazon SageMaker Studio ................................................................................... 133
JupyterLab Versioning ..................................................................................................... 135
Use the Studio Launcher ................................................................................................. 141
Use Studio Notebooks ..................................................................................................... 144
Customize Studio ........................................................................................................... 168
Perform Common Tasks .................................................................................................. 194
Studio Pricing ................................................................................................................ 200
Troubleshooting ............................................................................................................. 201
SageMaker Notebook Instances ................................................................................................ 204
AL2 vs AL1 instances ...................................................................................................... 205
JupyterLab versioning ..................................................................................................... 208
Create a Notebook Instance ............................................................................................. 209
Access Notebook Instances .............................................................................................. 212
Update a Notebook Instance ............................................................................................ 212
Customize a Notebook Instance ....................................................................................... 213
Example Notebooks ........................................................................................................ 220
Set the Notebook Kernel ................................................................................................. 221
Git Repos ...................................................................................................................... 222
Notebook Instance Metadata ........................................................................................... 229
Monitor Jupyter Logs in Amazon CloudWatch Logs ............................................................. 229
SageMaker Studio Lab ............................................................................................................. 230
Studio Lab components overview ..................................................................................... 231
Onboard to Studio Lab ................................................................................................... 234
Manage your account ...................................................................................................... 235
Launch Studio Lab .......................................................................................................... 236
Use Studio Lab starter assets ........................................................................................... 237
Use the Studio Lab project runtime .................................................................................. 239
Troubleshooting ............................................................................................................. 256
SageMaker Canvas .................................................................................................................. 258
Are you a first-time SageMaker Canvas user? ..................................................................... 259
Getting started ............................................................................................................... 259
Setting Up and Managing Amazon SageMaker Canvas (for IT Administrators) .......................... 264
Use Ready-to-use models ................................................................................................ 289
Use custom models ........................................................................................................ 297
Logging out ................................................................................................................... 392
Limitations and troubleshooting ....................................................................................... 393
Manage billing and cost .................................................................................................. 400
SageMaker geospatial capabilities ............................................................................................. 401

Getting Started .............................................................. 402
Earth Observation Jobs ................................................................................................... 405
Vector Enrichment Jobs ................................................................................................... 412
Visualization Using SageMaker geospatial capabilities ......................................................... 413
Amazon SageMaker geospatial Map SDK ........................................................................... 418
SageMaker geospatial capabilities FAQ .............................................................................. 423
Security and Permissions ................................................................................................. 424
RStudio on Amazon SageMaker ................................................................................................ 432
Region availability .......................................................................................................... 433
RStudio components ....................................................................................................... 434
Differences from RStudio Workbench ................................................................................ 434
Manage RStudio on SageMaker ........................................................................................ 434
Use RStudio on Amazon SageMaker .................................................................................. 463
Autopilot: Automated ML ................................................................................................................ 467
Get started ............................................................................................................................ 468
Samples ........................................................................................................................ 468
Videos ........................................................................................................................... 469
Tutorials ........................................................................................................................ 470
Create an Autopilot experiment ................................................................................................ 470
Create an Autopilot experiment using Studio .................................................................... 470
Create an Autopilot experiment programmatically ............................................................. 472
Datasets and problem types .................................................................................................... 475
Datasets and formats ...................................................................................................... 475
Problem types ................................................................................................................ 475
Training modes and algorithm support ...................................................................................... 476
Training modes .............................................................................................................. 476
Algorithm support .......................................................................................................... 477
Metrics and validation ............................................................................................................. 478
Autopilot metrics ............................................................................................................ 478
Autopilot weighted metrics .............................................................................................. 480
Cross-validation in Autopilot ............................................................................................ 481
Model Deployment and Prediction ............................................................................................ 483
Real-time inferencing ...................................................................................................... 483
Batch inferencing ........................................................................................................... 490
Explainability ......................................................................................................................... 496
Models generated ................................................................................................................... 497
Prerequisites .................................................................................................................. 497
Share your Autopilot model ............................................................................................. 497
View model details ......................................................................................................... 497
Model Performance Report .............................................................................................. 498
Notebooks generated .............................................................................................................. 509
Data exploration report ................................................................................................... 510
Candidate definition notebook ......................................................................................... 516
Configure inference output ...................................................................................................... 517
Inference container definitions ......................................................................................... 517
Inference responses ........................................................................................................ 517
Quotas .................................................................................................................................. 522
Quotas that you can increase ........................................................................................... 522
Resource quotas ............................................................................................................. 523
API reference ......................................................................................................................... 524
Label Data ..................................................................................................................................... 526
Ground Truth ......................................................................................................................... 526
Are You a First-time User of Ground Truth? ........................................................................ 527
Getting started ............................................................................................................... 527
Label Images ................................................................................................................. 532
Label Text .................................................................................................................... 552

Label Videos and Video Frames ........................................ 562
Label 3D Point Clouds ..................................................................................................... 595
Verify and Adjust Labels .................................................................................................. 664
Creating Custom Labeling Workflows ................................................................................ 671
Create a Labeling Job ..................................................................................................... 703
Use Input and Output Data ............................................................................................. 734
Enhanced Data Labeling .................................................................................................. 804
Security and Permissions ................................................................................................. 816
Monitor Labeling Job Status ............................................................................................ 842
SageMaker Ground Truth Plus .................................................................................................. 844
Getting Started with Amazon SageMaker Ground Truth Plus ............................................... 845
Request a Project ........................................................................................................... 847
Create a Project Team ..................................................................................................... 848
Open the Project Portal .................................................................................................. 849
Create a Batch ............................................................................................................... 851
Review Metrics ............................................................................................................... 851
Review Batches .............................................................................................................. 852
Accept or Reject Batches ................................................................................................. 855
Ground Truth Synthetic Data ................................................................................................... 855
Getting Started with Amazon SageMaker Ground Truth Synthetic Data .................................. 856
Share Data from Your Amazon S3 Bucket .......................................................................... 858
Project Portal ................................................................................................................. 859
Review Batches .............................................................................................................. 861
Accept or Reject Batches ................................................................................................. 862
Create and Manage Workforces ................................................................................................ 863
Using the Amazon Mechanical Turk Workforce ................................................................... 863
Managing Vendor Workforces .......................................................................................... 867
Use a Private Workforce .................................................................................................. 868
Crowd HTML Elements Reference ............................................................................................. 889
SageMaker Crowd HTML Elements .................................................................................... 889
Augmented AI Crowd HTML Elements ............................................................................... 961
Prepare and Analyze Datasets .......................................................................................................... 968
Detect Pre-training Data Bias ................................................................................................... 968
Amazon SageMaker Clarify Terms for Bias and Fairness ....................................................... 969
Sample Notebooks ......................................................................................................... 970
Measure Pre-training Bias ................................................................................................ 970
Generate Reports for Bias in Pre-training Data in SageMaker Studio ...................................... 980
Prepare Data with Data Wrangler ............................................................................................. 981
Get Started with Data Wrangler ....................................................................................... 983
Import ........................................................................................................................... 991
Create and Use a Data Wrangler Flow ............................................................................. 1034
Get Insights On Data and Data Quality ............................................................................ 1045
Automatically Train Models on Your Data Flow ................................................................. 1057
Transform Data ............................................................................................................ 1058
Analyze and Visualize .................................................................................................... 1101
Reusing Data Flows for Different Datasets ....................................................................... 1109
Export ......................................................................................................................... 1116
Use Data Preparation in a Studio Notebook to Get Data Insights ......................................... 1138
Security and Permissions ............................................................................................... 1141
Release Notes ............................................................................................................... 1152
Troubleshoot ................................................................................................................ 1156
Increase Amazon EC2 Instance Limit ............................................................................... 1160
Update Data Wrangler .................................................................................................... 985
Shut Down Data Wrangler ............................................................................................. 1162
Prepare data at scale with Studio notebooks ............................................................................ 1163
Prepare data using Amazon EMR .................................................................................... 1164
Prepare data using Glue Interactive Sessions .................................................................... 1192

Process Data ................................................................ 1196
Sample Notebooks ................................................................................................................ 1196
CloudWatch Logs and Metrics ................................................................................................ 1197
Data Processing with Apache Spark ......................................................................................... 1197
Running a Spark Processing Job ..................................................................................... 1197
Data Processing with scikit-learn ............................................................................................ 1198
Data Processing with Framework Processors ............................................................................. 1198
Hugging Face Framework Processor ................................................................................ 1199
MXNet Framework Processor .......................................................................................... 1200
PyTorch Framework Processor ........................................................................................ 1201
TensorFlow Framework Processor .................................................................................... 1202
XGBoost Framework Processor ........................................................................................ 1203
Use Your Own Processing Code .............................................................................................. 1204
Run Scripts with a Processing Container .......................................................................... 1204
Build Your Own Processing Container .............................................................................. 1205
Create, Store, and Share Features ................................................................................................... 1210
How Feature Store works ....................................................................................................... 1210
Create feature groups ........................................................................................................... 1211
Find, discover, and share features ........................................................................................... 1211
Real-time inference for features stored in the online store ........................................................ 1211
Offline store for model training and batch inference ................................................................. 1212
Feature data ingestion ........................................................................................................... 1212
Resilience in Feature Store ..................................................................................................... 1212
Get started with Amazon SageMaker Feature Store ................................................................... 1212
Feature Store concepts .................................................................................................. 1213
Adding required policies to your IAM role ........................................................................ 1215
Create feature groups ................................................................................................... 1215
Use Feature Store with Studio ........................................................................................ 1227
Delete a feature group .................................................................................................. 1227
Data sources and ingestion .................................................................................................... 1230
Stream ingestion .......................................................................................................... 1230
Data Wrangler with Feature Store ................................................................................... 1230
Feature Store Spark ...................................................................................................... 1232
Add features to a feature group ............................................................................................. 1238
Example code ............................................................................................................... 1239
Find features in your feature groups ....................................................................................... 1241
Find feature groups in your Feature Store ................................................................................ 1244
Adding searchable metadata to your features ........................................................................... 1248
Example code ............................................................................................................... 1250
Create a dataset from your feature groups ............................................................................... 1252
Using the Amazon SageMaker Python SDK to get your data from your feature groups ............ 1253
Sample Amazon Athena queries ..................................................................................... 1256
Cross-account offline store access ........................................................................................... 1257
Step 1: Set up the offline store access role in Account A .................................................... 1258
Step 2: Set up an offline store Amazon S3 bucket in Account B ........................................... 1259
Step 3: Set up an offline store AWS KMS encryption key in Account A .................................. 1259
Step 4: Create a feature group in Account A ..................................................................... 1261
Logging Feature Store operations by using AWS CloudTrail ........................................................ 1261
Management events ...................................................................................................... 1261
Data events .................................................................................................................. 1261
Security and access control .................................................................................................... 1262
Using AWS KMS permissions for Amazon SageMaker Feature Store ..................................... 1263
Authorizing use of a customer managed Key for your online store ....................................... 1263
Using grants to authorize Feature Store .......................................................................... 1265
Monitoring Feature Store interaction with AWS KMS ......................................................... 1265
Accessing data in your online store ................................................................................. 1265
Authorizing use of a customer managed key for your offline store ....................................... 1265

Quotas, naming rules and data types ...................................................... 1266
Quota terminologies ..................................................................................................... 1266
Limits and quotas ......................................................................................................... 1266
Naming rules ................................................................................................................ 1266
Data types ................................................................................................................... 1266
Amazon SageMaker Feature Store offline store data format ....................................................... 1267
Amazon SageMaker Feature Store notebook examples .............................................................. 1268
Feature Store sample notebooks ..................................................................................... 1268
Training ....................................................................................................................................... 1270
The simplest training workflow in SageMaker ........................................................................... 1270
Full view of the SageMaker Training workflow and features ....................................................... 1270
Before training ............................................................................................................. 1271
During training ............................................................................................................. 1273
After training ............................................................................................................... 1275
Choose an Algorithm ............................................................................................................ 1276
Choose an algorithm implementation .............................................................................. 1277
Problem types for the basic machine learning paradigms ................................................... 1279
Use Built-in Algorithms ................................................................................................. 1281
Use Reinforcement Learning .......................................................................................... 1559
Run local code as a remote job .............................................................................................. 1565
Set up your environment ............................................................................................... 1566
Invoking a function ....................................................................................................... 1572
Configuration file .......................................................................................................... 1576
Customize your runtime environment .............................................................................. 1577
Container image compatibility ........................................................................................ 1578
Logging parameters and metrics with Amazon SageMaker Experiments ................................ 1581
Using modular code with the @remote decorator ............................................................. 1584
Private repository for runtime dependencies .................................................................... 1585
Example notebooks ....................................................................................................... 1586
Experiments ......................................................................................................................... 1587
Supported AWS Regions ................................................................................................ 1587
Create an experiment .................................................................................................... 1587
View, search, and compare experiment runs ..................................................................... 1592
SageMaker integrations ................................................................................................. 1596
Tutorials ...................................................................................................................... 1598
CloudTrail metrics ......................................................................................................... 1599
Clean up experiment resources ....................................................................................... 1600
Additional supported SDK .............................................................................................. 1602
Experiments FAQs ......................................................................................................... 1605
Search using the console and API ................................................................................... 1607
Automatic Model Tuning ........................................................................................................ 1612
How Hyperparameter Tuning Works ................................................................................ 1613
Define metrics and environment variables ........................................................................ 1615
Define Hyperparameter Ranges ...................................................................................... 1617
Track and set completion criteria .................................................................................... 1620
Tune Multiple Algorithms .............................................................................................. 1623
Example: Hyperparameter Tuning Job ............................................................................. 1628
Stop Training Jobs Early ................................................................................................ 1640
Run a Warm Start Hyperparameter Tuning Job ................................................................. 1641
Resource Limits for Automatic Model Tuning .................................................................... 1645
Best Practices for Hyperparameter Tuning ....................................................................... 1647
Debug and Profile ................................................................................................................. 1649
Debugger Features ........................................................................................................ 1649
Supported Frameworks and Algorithms ........................................................................... 1650
Debugger Architecture ................................................................................................... 1653
Tutorials ...................................................................................................................... 1654
Debug Training Jobs ..................................................................................................... 1664

Profile Training Jobs ..................................................... 1709
List of Built-in Rules ..................................................................................................... 1748
Create Custom Rules ..................................................................................................... 1793
Use Debugger with Custom Training Containers ................................................................ 1795
Configure Debugger Using SageMaker API ....................................................................... 1799
Best Practices for Debugger ........................................................................................... 1809
Advanced Topics and Reference ...................................................................................... 1812
SageMaker Debugger Release Notes ................................................................................ 1820
Distributed Training .............................................................................................................. 1821
Get Started with Distributed Training .............................................................................. 1821
Basic Distributed Training Concepts ................................................................................. 1824
Advanced Concepts ....................................................................................................... 1825
Strategies .................................................................................................................... 1826
Optimize Distributed Training ......................................................................................... 1827
Scenarios ..................................................................................................................... 1828
SageMaker's Data Parallelism Library .............................................................................. 1831
SageMaker's Model Parallelism Library ............................................................................ 1864
SageMaker Distributed Training Notebook Examples ......................................................... 1942
Distributed computing with SageMaker best practices ....................................................... 1944
Training Compiler ................................................................................................................. 1948
What Is SageMaker Training Compiler? ............................................................................ 1948
How It Works ............................................................................................................... 1948
Supported Frameworks, AWS Regions, Instance Types, and Tested Models ............................ 1949
Bring Your Own Deep Learning Model ............................................................................. 1967
Enable Training Compiler ............................................................................................... 1975
Example Notebooks and Blogs ....................................................................................... 1989
Best Practices and Considerations ................................................................................... 1989
Training Compiler FAQ .................................................................................................. 1992
Troubleshooting ............................................................................................................ 1993
Release Notes ............................................................................................................... 1999
SageMaker Clarify Bias Detection and Model Explainability ........................................................ 2003
SageMaker Clarify Processing Jobs .................................................................................. 2003
Configure a SageMaker Clarify Processing Job .................................................................. 2004
Run SageMaker Clarify Processing Jobs ........................................................................... 2049
Get Analysis Results ...................................................................................................... 2061
Troubleshoot Jobs ......................................................................................................... 2069
Sample Notebooks ........................................................................................................ 2071
Detect Post-training Data and Model Bias ........................................................................ 2072
Model Explainability ...................................................................................................... 2093
Access Training Data ............................................................................................................. 2095
SageMaker Input Modes and AWS Cloud Storage .............................................................. 2096
Choosing Data Input Mode Using the SageMaker Python SDK ............................................. 2099
Configure Data Input Channel to Use Amazon FSx for Lustre .............................................. 2100
Best Practices for Choosing Data Source and Input Mode ................................................... 2102
Heterogeneous Cluster Training .............................................................................................. 2105
How to Configure a Heterogeneous Cluster ...................................................................... 2105
Distributed Training with a Heterogeneous Cluster ............................................................ 2108
Modify Your Training Script to Assign Instance Groups ....................................................... 2110
Considerations .............................................................................................................. 2112
Examples, Blogs, and Case Studies .................................................................................. 2112
Incremental Training ............................................................................................................. 2113
Perform Incremental Training (Console) ........................................................................... 2113
Perform Incremental Training (API) ................................................................................. 2115
Managed Spot Training ......................................................................................................... 2117
Using Managed Spot Training ......................................................................................... 2118
Managed Spot Training Lifecycle .................................................................................... 2118
Managed Warm Pools ........................................................................................................... 2119

How it works ................................................................ 2119
Warm pool resource limits ............................................................................................. 2122
How to use SageMaker managed warm pools ................................................................... 2123
Considerations .............................................................................................................. 2127
Monitor and Analyze Using CloudWatch Metrics ....................................................................... 2127
Defining Training Metrics ............................................................................................... 2128
Monitoring Training Job Metrics (CloudWatch Console) ...................................................... 2130
Monitoring Training Job Metrics (SageMaker Console) ........................................................ 2130
Example: Viewing a Training and Validation Curve ............................................................ 2132
Training Storage Folders ........................................................................................................ 2133
Overview ..................................................................................................................... 2134
SageMaker Environment Variables and Default Paths for Training Storage Folders ................. 2136
Tips and Considerations for Setting Up Storage Paths ....................................................... 2137
Use Augmented Manifest Files ............................................................................................... 2138
Augmented Manifest File format .................................................................................... 2138
Stream Augmented Manifest File Data ............................................................................ 2139
Use an Augmented Manifest File (Console) ....................................................................... 2140
Use an Augmented Manifest File (API) ............................................................................. 2141
Use Checkpoints ................................................................................................................... 2142
Frameworks and algorithms ........................................................................................... 2142
Enable Checkpointing .................................................................................................... 2143
Browse Checkpoint Files ................................................................................................ 2144
Resume Training From a Checkpoint ............................................................................... 2145
Considerations for Checkpointing ................................................................................... 2145
Use TensorBoard ................................................................................................................... 2146
Supported frameworks and AWS Regions ........................................................................ 2146
Prerequisites ................................................................................................................ 2147
Prepare a training job with a TensorBoard output data configuration ................................... 2147
How to access TensorBoard on SageMaker ....................................................................... 2149
Access and visualize training output data in TensorBoard ................................................... 2150
Explore training output data visualized in TensorBoard ...................................................... 2151
Delete unused TensorBoard applications .......................................................................... 2154
Considerations .............................................................................................................. 2154
Deploy Models for Inference .......................................................................................................... 2155
Before you begin .................................................................................................................. 2155
Steps for model deployment .................................................................................................. 2155
Inference options .................................................................................................................. 2156
Advanced endpoint options .................................................................................................... 2157
Bring your own model ........................................................................................................... 2157
Next steps ........................................................................................................................... 2158
Monitoring ................................................................................................................... 2158
CI/CD for model deployment ......................................................................................... 2158
Deployment guardrails .................................................................................................. 2158
Inferentia ..................................................................................................................... 2158
Optimize model performance ......................................................................................... 2158
Autoscaling .................................................................................................................. 2159
Get an endpoint instance type recommendation ....................................................................... 2159
How it Works ............................................................................................................... 2159
How to Get Started ...................................................................................................... 2159
Example notebooks ....................................................................................................... 2159
Prerequisites ................................................................................................................ 2160
Recommendation jobs ................................................................................................... 2167
Real-time inference ............................................................................................................... 2195
Hosting options ............................................................................................................ 2195
Automatically scale models ............................................................................................ 2253
Host instance storage volumes ....................................................................................... 2269
Safely validate models in production ............................................................................... 2269
Clarify online explainability ............................................ 2280
Invoke real-time endpoints ............................................................................................ 2297
Monitor models for data and model quality, bias, and explainability ............................................ 2299
How It Works ............................................................................................................... 2300
Capture data ................................................................................................................ 2301
Monitor data quality ..................................................................................................... 2310
Monitor model quality ................................................................................................... 2316
Monitor bias drift ......................................................................................................... 2325
Monitor Feature Attribution Drift .................................................................................... 2334
Schedule monitoring jobs .............................................................................................. 2343
Prebuilt container ......................................................................................................... 2347
Interpret results ............................................................................................................ 2348
Visualize results for real-time endpoints .......................................................................... 2350
Advanced topics ........................................................................................................... 2356
Serverless Inference .............................................................................................................. 2371
How it works ................................................................................................................ 2372
Getting started ............................................................................................................. 2374
Create, invoke, update, and delete a serverless endpoint .................................................... 2375
Monitor a serverless endpoint ........................................................................................ 2387
Automatically scale Provisioned Concurrency for a serverless endpoint ................................ 2389
Troubleshooting ............................................................................................................ 2397
Asynchronous inference ......................................................................................................... 2398
How It Works ............................................................................................................... 2398
How Do I Get Started? .................................................................................................. 2399
Create, invoke, and update an Asynchronous Endpoint ...................................................... 2399
Monitor asynchronous endpoint ..................................................................................... 2408
Check prediction results ................................................................................................ 2411
Autoscale an asynchronous endpoint ............................................................................... 2413
Troubleshooting ............................................................................................................ 2416
Batch Transform ................................................................................................................... 2421
Use Batch Transform to Get Inferences from Large Datasets ............................................... 2421
Speed up a Batch Transform Job .................................................................................... 2423
Use Batch Transform to Test Production Variants .............................................................. 2423
Sample Notebooks ........................................................................................................ 2423
Associate Prediction Results with Input ............................................................................ 2423
Storage in Batch Transform ........................................................................................... 2429
Troubleshooting ............................................................................................................ 2429
Model parallelism and large model inference ........................................................................... 2430
Deep learning containers for LMI .................................................................................... 2430
SageMaker endpoint parameters for LMI ......................................................................... 2432
LMI tutorials ................................................................................................................ 2433
Configurations and settings ........................................................................................... 2440
Choosing instance types for LMI ..................................................................................... 2443
LMI FAQs ..................................................................................................................... 2447
LMI troubleshooting ...................................................................................................... 2447
Release notes for LMI deep learning containers ................................................................ 2448
Update models in production ................................................................................................. 2450
How to Get Started ...................................................................................................... 2451
Auto-Rollback Configuration and Monitoring .................................................................... 2451
Blue/Green Deployments ............................................................................................... 2454
Exclusions .................................................................................................................... 2467
Shadow tests ....................................................................................................................... 2467
Create a shadow test .................................................................................................... 2468
View, monitor, and edit shadow tests .............................................................................. 2471
Complete a shadow test ................................................................................................ 2479
Best Practices ............................................................................................................... 2481
Exclusions .................................................................................................................... 2481
Access containers through SSM .............................................. 2481
Allowlist ...................................................................................................................... 2482
Enable SSM access ........................................................................................................ 2482
IAM configuration ......................................................................................................... 2482
SSM access with AWS PrivateLink ................................................................................... 2483
Logging with Amazon CloudWatch Logs .......................................................................... 2483
Accessing model containers ............................................................................................ 2484
Catalog models with Model Registry ....................................................................................... 2484
Models, Model Versions, and Model Groups ...................................................................... 2485
Collections ................................................................................................................... 2502
Model Registry FAQ ...................................................................................................... 2508
Deploy models at the edge with SageMaker Edge Manager ........................................................ 2510
Why Use Edge Manager? ............................................................................................... 2510
How Does it Work? ....................................................................................................... 2510
How Do I Use SageMaker Edge Manager? ........................................................................ 2511
Getting Started ............................................................................................................ 2511
Set Up Devices and Fleets .............................................................................................. 2526
Package Model ............................................................................................................. 2533
The Edge Manager Agent .............................................................................................. 2539
Manage Model ............................................................................................................. 2553
SageMaker Edge Manager end of life .............................................................................. 2561
Optimize model performance using Neo .................................................................................. 2562
What is SageMaker Neo? ............................................................................................... 2562
How it Works ............................................................................................................... 2563
Sample Notebooks ........................................................................................................ 2564
Compile Models ............................................................................................................ 2564
Cloud Instances ............................................................................................................ 2577
Edge Devices ................................................................................................................ 2604
Troubleshoot Errors ....................................................................................................... 2622
Elastic Inference ................................................................................................................... 2628
Migrate from Amazon Elastic Inference to other instances .................................................. 2630
How EI Works ............................................................................................................... 2634
Choose an EI Accelerator Type ........................................................................................ 2634
Use EI in a SageMaker Notebook Instance ........................................................................ 2635
Use EI on a Hosted Endpoint ......................................................................................... 2635
Frameworks that Support EI ........................................................................................... 2635
Use EI with SageMaker Built-in Algorithms ....................................................................... 2635
EI Sample Notebooks .................................................................................................... 2636
Set Up to Use EI ........................................................................................................... 2636
Attach EI to a Notebook Instance ................................................................................... 2639
Endpoints with Elastic Inference ..................................................................................... 2641
Best practices ....................................................................................................................... 2644
Best practices for deploying models on SageMaker Hosting Services .................................... 2645
Monitor Security Best Practices ...................................................................................... 2645
Low latency real-time inference with AWS PrivateLink ....................................................... 2646
Migrate inference workload from x86 to AWS Graviton ..................................................... 2647
Troubleshoot deployments ............................................................................................. 2650
Inference cost optimization best practices ........................................................................ 2651
Best practices to minimize interruptions during GPU driver upgrades ................................... 2653
Supported features ............................................................................................................... 2655
Resources ............................................................................................................................. 2659
Blogs, example notebooks, and additional resources .......................................................... 2659
Troubleshooting and reference ....................................................................................... 2661
Model Hosting FAQs ...................................................................................................... 2662
Docker containers with SageMaker .................................................................................................. 2668
Scenarios and Guidance ......................................................................................................... 2668
Use cases for using pre-built Docker containers with SageMaker ......................................... 2668
Use cases for extending a pre-built Docker container ......................... 2669
Use case for building your own container ........................................................................ 2669
Docker Container Basics ......................................................................................................... 2670
Use Pre-built SageMaker Docker images .................................................................................. 2671
Prebuilt Deep Learning Images ....................................................................................... 2671
Prebuilt Scikit-learn and Spark ML Images ....................................................................... 2672
Deep Graph Networks ................................................................................................... 2673
Extend a Pre-built Container .......................................................................................... 2675
Adapting your own Docker container to work with SageMaker .................................................... 2684
Individual Framework Libraries ....................................................................................... 2684
SageMaker Training and Inference Toolkits ....................................................................... 2684
Adapting your own training container ............................................................................. 2686
Adapting Your Own Inference Container .......................................................................... 2698
Create a container with your own algorithms and models .......................................................... 2702
Use Your Own Training Algorithms ................................................................................. 2702
Use Your Own Inference Code ........................................................................................ 2712
Example Notebooks .............................................................................................................. 2721
Setup .......................................................................................................................... 2722
Host Models Trained in Scikit-learn ................................................................................. 2722
Package TensorFlow and Scikit-learn Models for Use in SageMaker ...................................... 2722
Train and Deploy a Neural Network on SageMaker ............................................................ 2722
Training Using Pipe Mode .............................................................................................. 2722
Bring Your Own R Model ............................................................................................... 2723
Extend a Prebuilt PyTorch Container Image ...................................................................... 2723
Train and Debug Training Jobs on a Custom Container ...................................................... 2723
Troubleshooting .................................................................................................................... 2723
Workflows .................................................................................................................................... 2725
Amazon SageMaker Model Building Pipelines ........................................................................... 2726
Pipeline Overview ......................................................................................................... 2726
Create and Manage Pipelines ......................................................................................... 2776
Projects ............................................................................................................................... 2801
Why MLOps? ................................................................................................................ 2802
SageMaker Projects ....................................................................................................... 2804
SageMaker Studio Permissions Required to Use Projects .................................................... 2806
Create an MLOps Project ............................................................................................... 2808
Templates .................................................................................................................... 2808
View Resources ............................................................................................................. 2817
Update an MLOps Project .............................................................................................. 2817
Delete an MLOps Project ............................................................................................... 2819
Project walkthrough ...................................................................................................... 2819
Project Walkthrough Using Third-party Git Repos ............................................................. 2824
ML Lineage Tracking ............................................................................................................. 2828
Tracking Entities ........................................................................................................... 2829
SageMaker-Created Entities ............................................................................................ 2831
Manually Create Entities ................................................................................................ 2833
Querying Lineage Entities .............................................................................................. 2836
Cross-Account Tracking .................................................................................................. 2842
Kubernetes Orchestration ...................................................................................................... 2844
SageMaker Operators for Kubernetes .............................................................................. 2845
SageMaker Components for Kubeflow Pipelines ................................................................ 2891
Notebook Jobs ..................................................................................................................... 2908
Installation Guide ......................................................................................................... 2909
Create and manage your scheduled notebook jobs ............................................................ 2918
Troubleshooting guide ................................................................................................... 2929
Constraints and considerations ....................................................................................... 2930
Pricing for SageMaker Notebook Jobs ............................................................................. 2931
Workflows FAQ ..................................................................................................................... 2931
Augmented AI .............................................................................. 2937
Get Started with Amazon Augmented AI ................................................................................. 2938
Core Components of Amazon A2I ................................................................................... 2938
Prerequisites to Using Augmented AI .............................................................................. 2942
Tutorial: Get Started in the Amazon A2I Console .............................................................. 2942
Tutorial: Get Started Using the Amazon A2I API ............................................................... 2948
Use Cases and Examples ........................................................................................................ 2958
Use SageMaker Notebook Instance with Amazon A2I Jupyter Notebook ............................... 2960
Use with Amazon Textract ............................................................................................. 2960
Use with Amazon Rekognition ........................................................................................ 2963
Use With Custom Task Types ......................................................................................... 2964
Create a Human Review Workflow .......................................................................................... 2966
Create a Human Review Workflow (Console) .................................................................... 2967
Create a Human Review Workflow (API) ........................................................................... 2969
JSON Schema for Human Loop Activation Conditions in Amazon Augmented AI .................... 2972
Delete a Human Review Workflow .......................................................................................... 2984
Delete a Flow Definition Using the Console or the SageMaker API ....................................... 2985
Create and Start a Human Loop ............................................................................................. 2985
Create and Start a Human Loop for a Built-in Task Type .................................................... 2986
Create and Start a Human Loop for a Custom Task Type .................................................... 2989
Next Steps: .................................................................................................................. 2990
Delete a Human Loop ........................................................................................................... 2990
Human Loop Data Retention and Deletion ....................................................................... 2991
Stop and Delete a Flow Definition Using the Console or the Amazon A2I API ......................... 2991
Create and Manage Worker Task Templates ............................................................................. 2993
Create and Delete Worker Task Templates ....................................................................... 2993
Create Custom Worker Task Templates ............................................................................ 2995
Creating Good Worker Instructions .................................................................................. 3002
Monitor and Manage Your Human Loop .................................................................................. 3003
Output Data ......................................................................................................................... 3004
Output Data From Built-In Task Types ............................................................................ 3004
Output Data From Custom Task Types ............................................................................. 3010
Track Worker Activity .................................................................................................... 3012
Permissions and Security ....................................................................................................... 3013
CORS Permission Requirement ....................................................................................... 3014
Add Permissions to the IAM Role Used to Create a Flow Definition ...................................... 3014
Create a User That Can Invoke Amazon A2I API Operations ................................................ 3016
Create a User With Permissions to Invoke Amazon A2I, Amazon Textract, and Amazon Rekognition API Operations ........................... 3016
Enable Worker Task Template Previews ........................................................................... 3017
Using Amazon A2I with AWS KMS Encrypted Buckets ........................................................ 3018
Additional Permissions and Security Resources ................................................................. 3018
CloudWatch Events ............................................................................................................... 3018
Send Events from Your Human Loop to CloudWatch Events ............................................... 3019
Set Up a Target to Process Events .................................................................................. 3020
Use Human Review Output ............................................................................................ 3020
More Information ......................................................................................................... 3020
API References ..................................................................................................................... 3021
Programmatic Tutorials ................................................................................................. 3020
Marketplace ................................................................................................................................. 3022
Topics .................................................................................................................................. 3022
SageMaker Algorithms ........................................................................................................... 3022
SageMaker Model Packages ................................................................................................... 3022
Use your own algorithms and models with the AWS Marketplace ................................................ 3023
Create Algorithm and Model Package Resources ............................................................... 3023
Use Algorithm and Model Package Resources ................................................................... 3029
Sell Amazon SageMaker Algorithms and Model Packages ........................................................... 3036
Topics .......................................................................... 3037
Develop Algorithms and Models in Amazon SageMaker ..................................................... 3037
List Your Algorithm or Model Package on AWS Marketplace ............................................... 3038
Find and Subscribe to Algorithms and Model Packages on AWS Marketplace ................................. 3039
Use Algorithms and Model Packages ............................................................................... 3039
Security ....................................................................................................................................... 3040
Access Control ...................................................................................................................... 3040
Access control and Studio notebooks .............................................................................. 3040
Control root access to a Notebook instance ...................................................................... 3042
Data Protection .................................................................................................................... 3042
Protect Data at Rest Using Encryption ............................................................................. 3043
Protecting Data in Transit with Encryption ....................................................................... 3045
Key Management .......................................................................................................... 3047
Internetwork Traffic Privacy ........................................................................................... 3048
Identity and Access Management ............................................................................................ 3048
Audience ...................................................................................................................... 3048
Authenticating with Identities ........................................................................................ 3049
Managing Access Using Policies ...................................................................................... 3051
How Amazon SageMaker Works with IAM ........................................................................ 3053
Identity-Based Policy Examples ...................................................................................... 3055
Cross-Service Confused Deputy Prevention ...................................................................... 3080
SageMaker Roles .......................................................................................................... 3086
Role Manager ............................................................................................................... 3108
Amazon SageMaker API Permissions Reference ................................................................. 3116
AWS Managed Policies for SageMaker ............................................................................. 3136
Troubleshooting ............................................................................................................ 3205
Logging and Monitoring ........................................................................................................ 3206
Compliance validation ........................................................................................................... 3207
Resilience ............................................................................................................................. 3208
Infrastructure Security ........................................................................................................... 3208
SageMaker Scans AWS Marketplace Training and Inference Containers for Security Vulnerabilities .............................. 3208
Connect to Resources From Within a VPC ........................................................................ 3208
Run Training and Inference Containers in Internet-Free Mode ............................................. 3212
Connect to SageMaker Through a VPC Interface Endpoint .................................................. 3213
Give SageMaker Access to Resources in your Amazon VPC .................................................. 3223
Governance .................................................................................................................................. 3243
Amazon SageMaker Role Manager .......................................................................................... 3243
Amazon SageMaker Model Cards ............................................................................................ 3243
Amazon SageMaker Model Dashboard ..................................................................................... 3243
Model Cards ......................................................................................................................... 3243
Prerequisites ................................................................................................................ 3244
Intended uses of a model .............................................................................................. 3244
Risk ratings .................................................................................................................. 3244
Model card JSON schema .............................................................................................. 3245
SageMaker Console ....................................................................................................... 3254
SageMaker Python SDK ................................................................................................. 3256
SageMaker APIs ............................................................................................................ 3259
Model card FAQs .......................................................................................................... 3260
Model Dashboard .................................................................................................................. 3261
Model Dashboard elements ............................................................................................ 3262
View Model Monitor schedules and alerts ........................................................................ 3263
View a model lineage graph ........................................................................................... 3265
View Endpoint Status .................................................................................................... 3267
Model Dashboard FAQ ................................................................................................... 3268
Monitoring ................................................................................................................................... 3271
Monitoring with CloudWatch .................................................................................................. 3271
Endpoint Invocation Metrics ........................................................... 3272
Multi-Model Endpoint Metrics ........................................................................................ 3273
Jobs and Endpoint Metrics ............................................................................................. 3275
Inference Recommender Metrics ..................................................................................... 3278
Ground Truth Metrics .................................................................................................... 3279
Feature Store Metrics .................................................................................................... 3281
Pipelines Metrics ........................................................................................................... 3282
Logging with CloudWatch ...................................................................................................... 3284
Log SageMaker API Calls with CloudTrail ................................................................................. 3285
SageMaker Information in CloudTrail ............................................................................... 3286
Operations Performed by Automatic Model Tuning ........................................................... 3286
Understanding SageMaker Log File Entries ....................................................................... 3287
Monitoring user resource access from Amazon SageMaker Studio ................................................ 3288
Prerequisites ................................................................................................................ 3288
Turn on sourceIdentity ............................................................................................ 3289
Turn off sourceIdentity ............................................................................................ 3290
Automating with EventBridge ................................................................................................. 3290
Training job state change .............................................................................................. 3291
HyperParameter tuning job state change ......................................................................... 3292
Transform job state change ........................................................................................... 3293
Endpoint state change .................................................................................................. 3294
Feature group state change ........................................................................................... 3295
Model package state change .......................................................................................... 3296
Pipeline execution state change ...................................................................................... 3297
Pipeline step state change ............................................................................................. 3297
SageMaker image state change ...................................................................................... 3298
SageMaker image version state change ........................................................................... 3298
Endpoint deployment state change ................................................................................. 3299
Model card state change ............................................................................................... 3301
API and SDK Reference .................................................................................................................. 3302
Overview ............................................................................................................................. 3302
Programming Model for Amazon SageMaker ............................................................................ 3302
SageMaker Document History ........................................................................................................ 3304
AWS glossary ............................................................................................................................... 3308

What Is Amazon SageMaker?


Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and
developers can quickly and easily build and train machine learning models, and then directly deploy
them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook
instance for easy access to your data sources for exploration and analysis, so you don't have to manage
servers. It also provides common machine learning algorithms that are optimized to run efficiently
against extremely large data in a distributed environment. With native support for bring-your-own-
algorithms and frameworks, SageMaker offers flexible distributed training options that adjust to your
specific workflows. Deploy a model into a secure and scalable environment by launching it with a few
clicks from SageMaker Studio or the SageMaker console.

This guide includes information and tutorials on SageMaker features. For additional information, see
Amazon SageMaker developer resources.

Topics

• Amazon SageMaker Features (p. 2)
• Amazon SageMaker Pricing (p. 1)
• Are You a First-time User of Amazon SageMaker? (p. 1)

Amazon SageMaker Pricing


As with other AWS products, there are no contracts or minimum commitments for using Amazon
SageMaker. Training and hosting are billed by minutes of usage, with no minimum fees and no upfront
commitments. For more information about the cost of using SageMaker, see SageMaker Pricing.

Are You a First-time User of Amazon SageMaker?


If you are a first-time user of SageMaker, we recommend that you do the following:

1. Read How Amazon SageMaker Works (p. 2) – This section provides an overview of SageMaker,
explains key concepts, and describes the core components involved in building AI solutions with
SageMaker. We recommend that you read this topic in the order presented.
2. Set Up Amazon SageMaker Prerequisites (p. 35) – This section explains how to set up your AWS
account.
3. Amazon SageMaker Autopilot simplifies the machine learning experience by automating machine
learning tasks. If you are new to SageMaker, it provides the easiest learning path. It also serves as an
excellent ML learning tool that provides visibility into the code with notebooks generated for each
of the automated ML tasks. For an introduction to its capabilities, see Automate model development
with Amazon SageMaker Autopilot (p. 467). To get started building, training, and deploying machine
learning models, Autopilot provides:
• Samples: Explore modeling with Amazon SageMaker Autopilot (p. 468)
• Videos: Use Autopilot to automate and explore the machine learning process (p. 469)
• Tutorials: Get started with Amazon SageMaker Autopilot (p. 470)
4. Get Started with Amazon SageMaker (p. 35) – This section walks you through training your first
model using SageMaker Studio, or the SageMaker console and the SageMaker API. You use training
algorithms provided by SageMaker.

5. Explore other topics – Depending on your needs, do the following:
• Submit Python code to train with deep learning frameworks – In SageMaker, you can use your
own training scripts to train models (see the sketch after this list). For information, see Use Machine
Learning Frameworks, Python, and R with Amazon SageMaker (p. 15).
• Use SageMaker directly from Apache Spark – For information, see Use Apache Spark with Amazon
SageMaker (p. 16).
• Use SageMaker to train and deploy your own custom algorithms – Package your custom
algorithms with Docker so you can train and/or deploy them in SageMaker. To learn how SageMaker
interacts with Docker containers, and for the SageMaker requirements for Docker images, see Using
Docker containers with SageMaker (p. 2668).
6. View the API Reference – This section describes the SageMaker API operations.
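
As referenced in item 5, a minimal sketch of submitting your own training script with the SageMaker
Python SDK might look like the following. The role ARN, script name, S3 locations, and instance
settings are placeholder assumptions; substitute your own values.

from sagemaker.pytorch import PyTorch

# Hypothetical execution role; replace with your own.
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"

estimator = PyTorch(
    entry_point="train.py",        # your own training script (hypothetical name)
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Launch a managed training job against data in Amazon S3 (hypothetical path).
estimator.fit({"training": "s3://your-bucket/training-data"})

# Deploy the trained model to a real-time endpoint; the returned predictor
# can then be called with predictor.predict(...) to evaluate the model.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")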

How Amazon SageMaker Works


SageMaker is a fully managed service that enables you to quickly and easily integrate machine learning-
based models into your applications. This section provides an overview of machine learning and explains
how SageMaker works. If you are a first-time user of SageMaker, we recommend that you read the
following sections in order:

1. Machine Learning with Amazon SageMaker (p. 5)
2. Explore, Analyze, and Process Data (p. 7)
3. Train a Model with Amazon SageMaker (p. 10)
4. Deploy a Model in Amazon SageMaker (p. 13)
5. Use Machine Learning Frameworks, Python, and R with Amazon SageMaker (p. 15)
6. Get Started with Amazon SageMaker (p. 35)

Amazon SageMaker Features


Amazon SageMaker includes the following features.

Topics
• New features for re:Invent 2022 (p. 2)
• Machine learning environments (p. 3)
• Major features (p. 3)

New features for re:Invent 2022


SageMaker includes the following new features for re:Invent 2022.

SageMaker geospatial capabilities (p. 401)

Build, train, and deploy ML models using geospatial data.


SageMaker Model Cards (p. 3243)

Document information about your ML models in a single place for streamlined governance and
reporting throughout the ML lifecycle.
SageMaker Model Dashboard (p. 3261)

A pre-built, visual overview of all the models in your account. Model Dashboard integrates
information from SageMaker Model Monitor, transform jobs, endpoints, lineage tracking, and
CloudWatch so you can access high-level model information and track model performance in one
unified view.
SageMaker Role Manager (p. 3108)

Administrators can define least-privilege permissions for common ML activities using custom and
preconfigured persona-based IAM roles.
AutoML step (p. 2733)

Create an AutoML job to automatically train a model in SageMaker Pipelines.


Collaboration with shared spaces (p. 123)

A shared space consists of a shared JupyterServer application and a shared directory. All user profiles
in a Domain have access to all shared spaces in the Domain.
Data Wrangler data preparation widget (p. 1138)

Interact with your data, get visualizations, explore actionable insights, and fix data quality issues.
Inference shadow tests (p. 2467)

Evaluate any changes to your model-serving infrastructure by comparing its performance against
the currently deployed infrastructure.
Notebook-based Workflows (p. 2908)

Run your SageMaker Studio notebook as a non-interactive, scheduled job.


Studio Git extension (p. 190)

A Git extension to enter the URL of a Git repository, clone it into your environment, push changes,
and view commit history.

Machine learning environments


SageMaker includes the following machine learning environments.

SageMaker Studio (p. 128)

An integrated machine learning environment where you can build, train, deploy, and analyze your
models all in the same application.
SageMaker Studio Lab (p. 230)

A free service that gives customers access to AWS compute resources in an environment based on
open-source JupyterLab.
SageMaker Canvas (p. 258)

An auto ML service that gives people with no coding experience the ability to build models and make
predictions with them.
RStudio on Amazon SageMaker (p. 432)

An integrated development environment for R, with a console, a syntax-highlighting editor
that supports direct code execution, and tools for plotting, history, debugging, and workspace
management.

Major features

SageMaker includes the following major features, in alphabetical order (ignoring any SageMaker prefix).

Amazon Augmented AI (p. 2937)

Build the workflows required for human review of ML predictions. Amazon A2I brings human review
to all developers, removing the undifferentiated heavy lifting associated with building human review
systems or managing large numbers of human reviewers.
SageMaker Autopilot (p. 467)

Users without machine learning knowledge can quickly build classification and regression models.
Batch Transform (p. 2421)

Preprocess datasets, run inference when you don't need a persistent endpoint, and associate input
records with inferences to assist the interpretation of results.
SageMaker Clarify (p. 8)

Improve your machine learning models by detecting potential bias and help explain the predictions
that models make.
SageMaker Data Wrangler (p. 981)

Import, analyze, prepare, and featurize data in SageMaker Studio. You can integrate Data Wrangler
into your machine learning workflows to simplify and streamline data pre-processing and feature
engineering using little to no coding. You can also add your own Python scripts and transformations
to customize your data prep workflow.
SageMaker Debugger (p. 1649)

Inspect training parameters and data throughout the training process. Automatically detect and
alert users to commonly occurring errors such as parameter values getting too large or small.
SageMaker Edge Manager (p. 2510)

Optimize custom models for edge devices, create and manage fleets and run models with an
efficient runtime.
SageMaker Elastic Inference (p. 2628)

Speed up the throughput and decrease the latency of getting real-time inferences.
SageMaker Experiments (p. 1587)

Experiment management and tracking. You can use the tracked data to reconstruct an experiment,
incrementally build on experiments conducted by peers, and trace model lineage for compliance and
audit verifications.
SageMaker Feature Store (p. 1210)

A centralized store for features and associated metadata so features can be easily discovered and
reused. You can create two types of stores: an Online store for low-latency, real-time inference use
cases, and an Offline store for training and batch inference.
SageMaker Ground Truth (p. 526)

High-quality training datasets by using workers along with machine learning to create labeled
datasets.
SageMaker Ground Truth Plus (p. 844)

A turnkey data labeling feature to create high-quality training datasets without having to build
labeling applications and manage the labeling workforce on your own.
SageMaker Inference Recommender (p. 2159)

Get recommendations on inference instance types and configurations (for example, instance count,
container parameters, and model optimizations) for your ML models and workloads.

SageMaker JumpStart (p. 47)

Learn about SageMaker features and capabilities through curated 1-click solutions, example
notebooks, and pretrained models that you can deploy. You can also fine-tune the models and
deploy them.
SageMaker ML Lineage Tracking (p. 2828)

Track the lineage of machine learning workflows.


SageMaker Model Building Pipelines (p. 2726)

Create and manage machine learning pipelines integrated directly with SageMaker jobs.
SageMaker Model Monitor (p. 2299)

Monitor and analyze models in production (endpoints) to detect data drift and deviations in model
quality.
SageMaker Model Registry (p. 2484)

Versioning, artifact and lineage tracking, approval workflow, and cross account support for
deployment of your machine learning models.
SageMaker Neo (p. 2562)

Train machine learning models once, then run anywhere in the cloud and at the edge.
Preprocessing (p. 1196)

Analyze and preprocess data, tackle feature engineering, and evaluate models.
SageMaker Projects (p. 2801)

Create end-to-end ML solutions with CI/CD by using SageMaker projects.


Reinforcement Learning (p. 1559)

Maximize the long-term reward that an agent receives as a result of its actions.
SageMaker Serverless Endpoints (p. 2371)

A serverless endpoint option for hosting your ML model. Automatically scales in capacity to serve
your endpoint traffic. Removes the need to select instance types or manage scaling policies on an
endpoint.
SageMaker Studio Notebooks (p. 144)

The next generation of SageMaker notebooks that include AWS IAM Identity Center (successor to
AWS Single Sign-On) integration, fast start-up times, and single-click sharing.
SageMaker Studio Notebooks and Amazon EMR (p. 1164)

Easily discover, connect to, create, terminate and manage Amazon EMR clusters in single account
and cross account configurations directly from SageMaker Studio.
SageMaker Training Compiler (p. 1948)

Train deep learning models faster on scalable GPU instances managed by SageMaker.

Machine Learning with Amazon SageMaker


This section describes a typical machine learning workflow and summarizes how you accomplish those
tasks with Amazon SageMaker.

In machine learning, you "teach" a computer to make predictions, or inferences. First, you use an
algorithm and example data to train a model. Then you integrate your model into your application to
generate inferences in real time and at scale. In a production environment, a model typically learns from
millions of example data items and produces inferences in anywhere from hundreds of milliseconds
down to less than 20 milliseconds.

The typical workflow for creating a machine learning model includes the following activities:

1. Generate example data—To train a model, you need example data. The type of data that you need
depends on the business problem that you want the model to solve (the inferences that you want
the model to generate). For example, suppose that you want to create a model to predict a number
given an input image of a handwritten digit. To train such a model, you need example images of
handwritten numbers.

Data scientists often spend a lot of time exploring and preprocessing, or "wrangling," example data
before using it for model training. To preprocess data, you typically do the following:
a. Fetch the data— You might have in-house example data repositories, or you might use datasets
that are publicly available. Typically, you pull the dataset or datasets into a single repository.
b. Clean the data—To improve model training, inspect the data and clean it as needed. For example, if
your data has a country name attribute with values United States and US, you might want to
edit the data to be consistent.
c. Prepare or transform the data—To improve performance, you might perform additional data
transformations. For example, you might choose to combine attributes. If your model predicts the
conditions that require de-icing an aircraft, instead of using temperature and humidity attributes
separately, you might combine those attributes into a new attribute to get a better model.

In SageMaker, you preprocess example data in a Jupyter notebook on your notebook instance. You
use your notebook to fetch your dataset, explore it, and prepare it for model training (a minimal
data-cleaning sketch appears at the end of this section). For more information, see Explore, Analyze,
and Process Data (p. 7). For more information about preparing data in AWS Marketplace, see data
preparation.
2. Train a model—Model training includes both training and evaluating the model, as follows:
• Training the model— To train a model, you need an algorithm or a pre-trained base model. The
algorithm you choose depends on a number of factors. For a quick, out-of-the-box solution, you
might be able to use one of the algorithms that SageMaker provides. For a list of algorithms
provided by SageMaker and related considerations, see Use Amazon SageMaker Built-in Algorithms
or Pre-trained Models (p. 1281). For a UI-based training solution that provides algorithms and
models, see SageMaker JumpStart (p. 47).


You also need compute resources for training. Depending on the size of your training dataset and
how quickly you need the results, you can use resources ranging from a single general-purpose
instance to a distributed cluster of GPU instances. For more information, see Train a Model with
Amazon SageMaker (p. 10).

• Evaluating the model—After you've trained your model, you evaluate it to determine whether
the accuracy of the inferences is acceptable. In SageMaker, you use either the AWS SDK for Python
(Boto) or the high-level Python library that SageMaker provides to send requests to the model for
inferences.

You use a Jupyter notebook in your SageMaker notebook instance to train and evaluate your model.

3. Deploy the model— You traditionally re-engineer a model before you integrate it with your
application and deploy it. With SageMaker hosting services, you can deploy your model
independently, decoupling it from your application code. For more information, see Deploy Models for
Inference (p. 2155).

Machine learning is a continuous cycle. After deploying a model, you monitor the inferences, collect
"ground truth," and evaluate the model to identify drift. You then increase the accuracy of your
inferences by updating your training data to include the newly collected ground truth. You do this by
retraining the model with the new dataset. As more and more example data becomes available, you
continue retraining your model to increase accuracy.

Explore, Analyze, and Process Data


Before using a dataset to train a model, data scientists typically explore, analyze, and preprocess it.

Amazon SageMaker Processing enables you to run jobs that preprocess and postprocess data, perform
feature engineering, and evaluate models on SageMaker easily and at scale. When combined with the
other critical machine learning tasks provided by SageMaker, such as training and hosting, Processing
provides you with the benefits of a fully managed machine learning environment, including all the
security and compliance support built into SageMaker. With Processing, you have the flexibility to use
the built-in data processing containers or to bring your own containers and submit custom jobs to run on
managed infrastructure. After you submit a job, SageMaker launches the compute instances, processes
and analyzes the input data, and releases the resources upon completion. For more information, see
Process Data (p. 1196). A minimal sketch of submitting a Processing job follows the list below.

• For information about how to run your own data processing scripts, see Data Processing with scikit-
learn (p. 1198).
• For information about how to build your own processing container to run scripts, see Build Your Own
Processing Container (Advanced Scenario) (p. 1205).
• For information about how to perform exploratory data analysis (EDA) with a visual no-code interface,
see Prepare ML Data with Amazon SageMaker Data Wrangler (p. 981).
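
For example, the following is a minimal sketch of submitting a Processing job with the SKLearnProcessor
from the SageMaker Python SDK; the script name, S3 paths, and IAM role are hypothetical placeholders.

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="0.23-1",  # scikit-learn container version
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # hypothetical role
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# SageMaker launches the instance, runs the script against the S3 input,
# writes the results to S3, and releases the resources when the job completes.
processor.run(
    code="preprocessing.py",  # your local preprocessing script
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw-data",  # hypothetical input location
        destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/processed")],
)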


What Is Fairness and Model Explainability for Machine Learning Predictions?
Amazon SageMaker Clarify helps improve your machine learning (ML) models by detecting potential
bias and helping explain the predictions that models make. It helps you identify various types of bias
in pretraining data, as well as posttraining bias that can emerge during model training or when the
model is in production. SageMaker Clarify helps explain how these models make predictions using a
feature attribution approach. It also monitors inferences that models make in production for bias or
feature attribution drift. The fairness and explainability functionality provided by SageMaker Clarify
includes components that help AWS customers build less biased and more understandable machine
learning models. It also provides tools to help you generate model governance reports that you can use
to inform risk and compliance teams, and external regulators.

Machine learning models and data-driven systems are being increasingly used to help make decisions
across domains such as financial services, healthcare, education, and human resources. Machine learning
applications provide benefits such as improved accuracy, increased productivity, and cost savings to help
meet regulatory requirements, improve business decisions, and provide better insights into data science
procedures.

• Regulatory – In many situations, it is important to understand why an ML model made a specific
prediction and also whether the prediction it made was impacted by any bias, either during training
or at inference. Recently, policymakers, regulators, and advocates have raised awareness about the
ethical and policy challenges posed by ML and data-driven systems. In particular, they have expressed
concerns about the potentially discriminatory impact of such systems (for example, inadvertently
encoding of bias into automated decisions).
• Business – The adoption of AI systems in regulated domains requires trust, which can be built by
providing reliable explanations of the behavior of trained models and how the deployed models make
predictions. Model explainability may be particularly important to certain industries with reliability,
safety, and compliance requirements, such as financial services, human resources, healthcare, and
automated transportation. To take a common financial example, lending applications that incorporate
the use of ML models might need to provide explanations about how those models made certain
predictions to internal teams of loan officers, customer service representatives, and forecasters, in
addition to end users/customers.
• Data Science – Data scientists and ML engineers need tools to generate the insights required to debug
and improve ML models through better feature engineering, to determine whether a model is making
inferences based on noisy or irrelevant features, and to understand the limitations of their models and
failure modes their models may encounter.

For a blog post that shows how to architect and build a complete machine learning use case involving
fraudulent automobile claims that integrates SageMaker Clarify into a SageMaker pipeline, see Architect
and build the full machine learning lifecycle with AWS: An end-to-end Amazon SageMaker demo. This
blog post discusses how to assess pre- and post-training bias, how to mitigate the bias, and how the
data features impact the prediction. There are links to the relevant code for each task in the ML
lifecycle, including the creation of an automated workflow that integrates the fairness and explainability
functionality of SageMaker Clarify into a SageMaker Pipeline.

Best Practices for Evaluating Fairness and Explainability in the ML Lifecycle
Fairness as a Process – The notions of bias and fairness are highly dependent on the application. Further,
the choice of the attributes for which bias is to be measured, as well as the choice of the bias metrics,
may need to be guided by social, legal, and other non-technical considerations. Building consensus and
achieving collaboration across key stakeholders (such as product, policy, legal, engineering, and AI/ML
teams, as well as end users and communities) is a prerequisite for the successful adoption of fairness-
aware ML approaches in practice.

Fairness and Explainability by Design in the ML Lifecycle – You should consider fairness and
explainability during each stage of the ML lifecycle: problem formation, dataset construction, algorithm
selection, model training process, testing process, deployment, and monitoring/feedback. It is important
to have the right tools to do this analysis. To encourage engaging with these considerations, here are a
few example questions we recommend you ask during each of these stages.

Sample Notebooks
Amazon SageMaker Clarify provides the following sample notebooks:

• Explainability and bias detection with Amazon SageMaker Clarify – Use SageMaker Clarify to create a
processing job for detecting bias and explaining model predictions with feature attributions.
• Monitoring bias drift and feature attribution drift with Amazon SageMaker Clarify – Use Amazon SageMaker
Model Monitor to monitor bias drift and feature attribution drift over time.
• Fairness and Explainability with SageMaker Clarify (Bring Your Own Container) – This sample notebook
introduces key terms and concepts needed to understand SageMaker Clarify, and it walks you through
an end-to-end data science workflow demonstrating how to build your own model and container
that can work seamlessly with your Clarify jobs, use the model and SageMaker Clarify to measure
bias, explain the importance of the various input features on the model's decision, and then access the
reports through SageMaker Studio if you have an instance set up.
• Fairness and Explainability with SageMaker Clarify - Spark Distributed Processing – This sample
notebook walks you through key terms and concepts needed to understand SageMaker Clarify,
measures the pre-training bias of a dataset and post-training bias of a model, explains the importance
of the various input features on the model's decision, and accesses the reports through SageMaker
Studio if you have an instance set up.
• Mitigate Bias, Train another unbiased Model and Put in the Model Registry – This notebook describes
how to detect bias using SageMaker Clarify, mitigate it with Synthetic Minority Over-sampling
Technique (SMOTE), train another model, then put it in the Model Registry along with all the lineage
of the artifacts created along the way: data, code, and model metadata. This notebook forms part of a
series that shows how to integrate SageMaker Clarify into a SageMaker Pipeline that is described in the
Architect and build the full machine learning lifecycle with AWS blog.

These notebooks have been verified to run in Amazon SageMaker Studio only. If you need instructions on
how to open a notebook in Amazon SageMaker Studio, see Create or Open an Amazon SageMaker Studio
Notebook (p. 148). If you're prompted to choose a kernel, choose Python 3 (Data Science).


Guide to the SageMaker Clarify Documentation


Bias can occur and be measured in the data at each stage of the machine learning lifecycle: before
training a model and after model training. SageMaker Clarify can provide feature attribution
explanations of model predictions for trained models and for models deployed to production, where
models can be monitored for any drift from their baseline explanatory attributions. Clarify calculates
baselines when needed. The documentation for SageMaker Clarify is embedded throughout the larger
SageMaker documentation set at the relevant ML stages as follows:

• For further information on detecting bias in preprocessing data before it's used to train a model, see
Detect Pre-training Data Bias (p. 968).
• For further information on detecting posttraining data and model bias, see Detect Post-training Data
and Model Bias with Amazon SageMaker Clarify (p. 2072).
• For further information on the model-agnostic feature attribution approach to explain model
predictions after training, see Amazon SageMaker Clarify Model Explainability (p. 2093).
• For further information on monitoring for bias in production model inferences due to the drift
of data away from the baseline used to train the model, see Monitor Bias Drift for Models in
Production (p. 2325).
• For further information on monitoring for the drift of features' contributions away from the baseline
that was established during model training, see Monitor Feature Attribution Drift for Models in
Production (p. 2334).
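
As an illustration of the pre-training bias detection mentioned in the first item above, the following is a
hedged sketch using the Clarify processor in the SageMaker Python SDK; the S3 paths, column names,
and IAM role are hypothetical placeholders.

from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train/train.csv",  # hypothetical dataset
    s3_output_path="s3://my-bucket/clarify/bias-report",
    label="approved",  # hypothetical target column
    headers=["approved", "age", "income", "gender"],  # hypothetical columns
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # the favorable outcome
    facet_name="gender",  # the attribute checked for bias
)

# Runs a processing job and writes a bias report to the output S3 path.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
)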

Train a Model with Amazon SageMaker


The following diagram shows how you train and deploy a model with Amazon SageMaker:


The area labeled SageMaker highlights the two components of SageMaker: model training and model
deployment.

To train a model in SageMaker, you create a training job. The training job includes the following
information:

• The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training
data.
• The compute resources that you want SageMaker to use for model training. Compute resources are
machine learning (ML) compute instances that are managed by SageMaker.
• The URL of the S3 bucket where you want to store the output of the job.
• The Amazon Elastic Container Registry path where the training code is stored. For more information,
see Docker Registry Paths and Example Code.

Note
Your input dataset must be in the same AWS Region as your training job.

You have the following options for a training algorithm:

• Use an algorithm provided by SageMaker—SageMaker provides dozens of built-in training
algorithms and hundreds of pre-trained models. If one of these meets your needs, it's a great out-
of-the-box solution for quick model training. For a list of algorithms provided by SageMaker, see Use
Amazon SageMaker Built-in Algorithms or Pre-trained Models (p. 1281). To try an exercise that uses
an algorithm provided by SageMaker, see Get Started with Amazon SageMaker (p. 35). You can also
use SageMaker JumpStart (p. 47) to use algorithms and models through the Studio UI.
• Use SageMaker Debugger—Inspect training parameters and data throughout the training process
when working with the TensorFlow, PyTorch, and Apache MXNet learning frameworks or the XGBoost
algorithm. Debugger automatically detects and alerts users to commonly occurring errors such as
parameter values getting too large or small. For more information about using Debugger, see Debug
and Profile Training Jobs Using Amazon SageMaker Debugger (p. 1649). Debugger sample notebooks
are available at Amazon SageMaker Debugger Samples.
• Use Apache Spark with SageMaker—SageMaker provides a library that you can use in Apache Spark
to train models with SageMaker. Using the library provided by SageMaker is similar to using Apache
Spark MLLib. For more information, see Use Apache Spark with Amazon SageMaker (p. 16).
• Submit custom code to train with deep learning frameworks—You can submit custom Python code
that uses TensorFlow, PyTorch, or Apache MXNet for model training. For more information, see Use
TensorFlow with Amazon SageMaker (p. 32), Use PyTorch with Amazon SageMaker (p. 27), and
Use Apache MXNet with Amazon SageMaker (p. 15).
• Use your own custom algorithms—Put your code together as a Docker image and specify the registry
path of the image in a SageMaker CreateTrainingJob API call. For more information, see Using
Docker containers with SageMaker (p. 2668).
• Use an algorithm that you subscribe to from AWS Marketplace—For information, see Find and
Subscribe to Algorithms and Model Packages on AWS Marketplace (p. 3039).

After you create the training job, SageMaker launches the ML compute instances and uses the training
code and the training dataset to train the model. It saves the resulting model artifacts and other output
in the S3 bucket you specified for that purpose.

You can create a training job with the SageMaker console or the API. For information about creating a
training job with the API, see the CreateTrainingJob API.
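
For example, the following is a minimal sketch of creating a training job with the generic Estimator in
the SageMaker Python SDK; the image URI, IAM role, and S3 paths are hypothetical placeholders.

import sagemaker
from sagemaker.estimator import Estimator

# The ECR image path, IAM role, and S3 locations below are placeholders.
estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/training-output",  # where model artifacts land
    sagemaker_session=sagemaker.Session(),
)

# Starts the training job; SageMaker launches the instances, runs the training
# code against the S3 data, and saves the artifacts to output_path.
estimator.fit({"train": "s3://my-bucket/training-data"})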

When you create a training job with the API, SageMaker replicates the entire dataset on ML compute
instances by default. To make SageMaker replicate a subset of the data on each ML compute instance,
you must set the S3DataDistributionType field to ShardedByS3Key. You can set this field using the
low-level SDK. For more information, see S3DataDistributionType in S3DataSource.
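
With the SageMaker Python SDK, a hedged sketch of this setting might look like the following; the
channel name and S3 path are placeholders.

from sagemaker.inputs import TrainingInput

# Each training instance receives only a subset (shard) of the S3 objects
# instead of a full replicated copy of the dataset.
train_input = TrainingInput(
    s3_data="s3://my-bucket/training-data",
    distribution="ShardedByS3Key",
)
# estimator.fit({"train": train_input})
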
Important
To prevent your algorithm container from contending for memory, SageMaker reserves memory
for critical system processes on your ML compute instances. Therefore, you cannot expect to see
all of your instance type's memory available to your algorithm.

Deploy a Model in Amazon SageMaker


After you train your machine learning model, you can deploy it using Amazon SageMaker to get
predictions in any of the following ways, depending on your use case:

• For persistent, real-time endpoints that make one prediction at a time, use SageMaker real-time
hosting services (see the sketch after this list). See Real-time inference (p. 2195).
• For workloads that have idle periods between traffic spurts and can tolerate cold starts, use Serverless
Inference. See Serverless Inference (p. 2371).
• For requests with large payload sizes up to 1GB, long processing times, and near real-time latency
requirements, use Amazon SageMaker Asynchronous Inference. See Asynchronous inference (p. 2398).
• To get predictions for an entire dataset, use SageMaker batch transform. See Use Batch
Transform (p. 2421).
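
For example, the following is a minimal sketch of deploying a trained model to a real-time endpoint with
the SageMaker Python SDK; the image URI, model artifact path, and IAM role are hypothetical
placeholders.

from sagemaker.model import Model
from sagemaker.predictor import Predictor

model = Model(
    image_uri="<ecr-path-to-inference-image>",  # placeholder inference image
    model_data="s3://my-bucket/training-output/model.tar.gz",  # placeholder artifacts
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    predictor_cls=Predictor,  # so deploy() returns a Predictor we can invoke
)

# Creates the SageMaker model, endpoint configuration, and endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

print(predictor.predict(b"sample payload"))  # invoke the endpoint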

SageMaker also provides features to manage resources and optimize inference performance when
deploying machine learning models:

• To manage models on edge devices so that you can optimize, secure, monitor, and maintain machine
learning models on fleets of edge devices such as smart cameras, robots, personal computers, and
mobile devices, see Deploy models at the edge with SageMaker Edge Manager (p. 2510).
• To optimize Gluon, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, and ONNX models for
inference on Android, Linux, and Windows machines based on processors from Ambarella, ARM,
Intel, Nvidia, NXP, Qualcomm, Texas Instruments, and Xilinx, see Optimize model performance using
Neo (p. 2562).

For more information about all deployment options, see Deploy Models for Inference (p. 2155).

Validate a Machine Learning Model


After training a model, evaluate it to determine whether its performance and accuracy enable you to
achieve your business goals. You might generate multiple models using different methods and evaluate
each. For example, you could apply different business rules for each model, and then apply various
measures to determine each model's suitability. You might consider whether your model needs to be
more sensitive than specific (or vice versa).

You can evaluate your model using historical data (offline) or live data:

• Offline testing—Use historical, not live, data to send requests to the model for inferences.

Deploy your trained model to an alpha endpoint, and use historical data to send inference requests to
it. To send the requests, use a Jupyter notebook in your Amazon SageMaker notebook instance and
either the AWS SDK for Python (Boto) or the high-level Python library provided by SageMaker.
• Online testing with live data—SageMaker supports A/B testing for models in production by using
production variants. Production variants are models that use the same inference code and are
deployed on the same SageMaker endpoint. You configure the production variants so that a small
portion of the live traffic goes to the model that you want to validate. For example, you might choose
to send 10% of the traffic to a model variant for evaluation. After you are satisfied with the model's
performance, you can route 100% of the traffic to the updated model. For an example of testing
models in production, see Production variants (p. 2270). A minimal sketch of configuring two variants
follows this list.
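
For illustration, the following is a hedged sketch of configuring two production variants with the AWS
SDK for Python (Boto3); the model names, instance types, and traffic weights are hypothetical.

import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "current-model",
            "ModelName": "model-a",  # existing SageMaker model
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,  # 90% of live traffic
        },
        {
            "VariantName": "candidate-model",
            "ModelName": "model-b",  # model under evaluation
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,  # 10% of live traffic
        },
    ],
)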

For more information, see articles and books about how to evaluate models, for example, Evaluating
Machine Learning Models.

Options for offline model evaluation include:

• Validating using a holdout set—Machine learning practitioners often set aside a part of the data as a
"holdout set." They don’t use this data for model training.

With this approach, you evaluate how well your model provides inferences on the holdout set. You
then assess how effectively the model generalizes what it learned in the initial training, as opposed to
using model memory. This approach to validation gives you an idea of how often the model is able to
infer the correct answer.

In some ways, this approach is similar to teaching elementary school students. First, you provide them
with a set of examples to learn, and then test their ability to generalize from their learning. With
homework and tests, you pose problems that were not included in the initial learning and determine
whether they are able to generalize effectively. Students with perfect memories could memorize the
problems, instead of learning the rules.

Typically, the holdout dataset is 20-30% of the training data.

• k-fold validation—In this validation approach, you split the example dataset into k parts. You treat
each of these parts as a holdout set for k training runs, and use the other k-1 parts as the training set
for that run. You produce k models using a similar process, and aggregate the models to generate your
final model. The value k is typically in the range of 5-10. A small illustration follows this list.
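
To make the mechanics concrete, here is a small, SageMaker-independent illustration of k-fold validation
with scikit-learn (k=5); the data arrays are synthetic placeholders.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 4)  # synthetic features
y = np.random.randint(0, 2, size=100)  # synthetic binary labels

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on k-1 parts, evaluate on the held-out part.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean accuracy across 5 folds: {np.mean(scores):.3f}")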

Monitoring a Model in Production


After you deploy a model into your production environment, use Amazon SageMaker Model Monitor
to continuously monitor the quality of your machine learning models in real time. Amazon SageMaker
Model Monitor enables you to set up an automated system that alerts you when there are deviations
in the model quality, such as data drift and anomalies. Amazon CloudWatch Logs collects the monitoring
log files and notifies you when the quality of your model hits thresholds that you preset. CloudWatch
stores the log files in an Amazon S3 bucket that you specify. Early and proactive detection of these
deviations enables you to take prompt actions to maintain and improve the quality of your deployed
model.

For more information about SageMaker model monitoring products, see Monitor models for data and
model quality, bias, and explainability (p. 2299).
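
As an illustration, the following is a minimal sketch using the SageMaker Python SDK, assuming an
existing endpoint with data capture enabled and baseline data in Amazon S3; the paths, IAM role, and
endpoint name are hypothetical.

from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Profile the training data to establish baseline statistics and constraints.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)

# Check captured endpoint traffic against the baseline every hour.
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-monitor-schedule",
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)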

To start your machine learning journey with SageMaker, sign up for an AWS account at Set Up
SageMaker.


Use Machine Learning Frameworks, Python, and R with Amazon SageMaker
You can use Python and R natively in Amazon SageMaker notebook kernels. There are also kernels that
support specific frameworks. A very popular way to get started with SageMaker is to use the Amazon
SageMaker Python SDK. It provides open source Python APIs and containers that make it easy to train
and deploy models in SageMaker, as well as examples for use with several different machine learning and
deep learning frameworks.

For information about using specific frameworks or how to use R in SageMaker, see the following topics.

Language SDKs and user guides:

• Amazon SageMaker Python SDK
• R (p. 28)
• API Reference Guide for Amazon SageMaker (p. 3302)

Machine learning and deep learning frameworks guides:

• Apache MXNet (p. 15)
• Apache Spark (p. 16)
• Chainer (p. 24)
• Hugging Face (p. 25)
• PyTorch (p. 27)
• Scikit-learn (p. 30)
• SparkML Serving (p. 31)
• TensorFlow (p. 32)
• Triton Inference Server (p. 32)

Use Apache MXNet with Amazon SageMaker


You can use SageMaker to train and deploy a model using custom MXNet code. The Amazon SageMaker
Python SDK MXNet estimators and models and the SageMaker open-source MXNet container make
writing an MXNet script and running it in SageMaker easier.

What do you want to do?


I want to train a custom MXNet model in SageMaker.

For a sample Jupyter notebook, see the MXNet example notebooks in the Amazon SageMaker
Examples GitHub repository.

For documentation, see Train a Model with MXNet.


I have an MXNet model that I trained in SageMaker, and I want to deploy it to a hosted endpoint.

For more information, see Deploy MXNet models.


I have an MXNet model that I trained outside of SageMaker, and I want to deploy it to a SageMaker
endpoint

For more information, see Deploy Endpoints from Model Data.


I want to see the API documentation for Amazon SageMaker Python SDK MXNet classes.

For more information, see MXNet Classes.


I want to find the SageMaker MXNet container repository.

For more information, see SageMaker MXNet Container GitHub repository.


I want to find information about MXNet versions supported by AWS Deep Learning Containers.

For more information, see Available Deep Learning Container Images.

For general information about writing MXNet script mode training scripts and using MXNet script mode
estimators and models with SageMaker, see Using MXNet with the SageMaker Python SDK.

Use Apache Spark with Amazon SageMaker


This section provides information for developers who want to use Apache Spark for preprocessing data
and Amazon SageMaker for model training and hosting. For information about supported versions of
Apache Spark, see the Getting SageMaker Spark page in the SageMaker Spark GitHub repository.

SageMaker provides an Apache Spark library, in both Python and Scala, that you can use to easily train
models in SageMaker using org.apache.spark.sql.DataFrame data frames in your Spark clusters.
After model training, you can also host the model using SageMaker hosting services.

The SageMaker Spark library, com.amazonaws.services.sagemaker.sparksdk, provides the
following classes, among others:

• SageMakerEstimator—Extends the org.apache.spark.ml.Estimator interface. You can use
this estimator for model training in SageMaker.
• KMeansSageMakerEstimator, PCASageMakerEstimator, and XGBoostSageMakerEstimator—
Extend the SageMakerEstimator class.
• SageMakerModel—Extends the org.apache.spark.ml.Model class. You can use this
SageMakerModel for model hosting and obtaining inferences in SageMaker.

With SageMaker Studio, you can easily connect to an Amazon EMR cluster. For more information, see
Prepare data at Scale with Studio Notebooks.

Download the SageMaker Spark Library


You have the following options for downloading the Spark library provided by SageMaker:

• You can download the source code for both PySpark and Scala libraries from the SageMaker Spark
GitHub repository.
• For the Python Spark library, you have the following additional options:
• Use pip install:

pip install sagemaker_pyspark

• In a notebook instance, create a new notebook that uses either the Sparkmagic (PySpark) or the
Sparkmagic (PySpark3) kernel and connect to a remote Amazon EMR cluster.
Note
The EMR cluster must be configured with an IAM role that has the
AmazonSageMakerFullAccess policy attached. For information about configuring roles
for an EMR cluster, see Configure IAM Roles for Amazon EMR Permissions to AWS Services
in the Amazon EMR Management Guide.


• You can get the Scala library from Maven. Add the Spark library to your project by adding the
following dependency to your pom.xml file:

<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>sagemaker-spark_2.11</artifactId>
<version>spark_2.2.0-1.0</version>
</dependency>

Integrate Your Apache Spark Application with SageMaker


The following is a high-level summary of the steps for integrating your Apache Spark application with
SageMaker.

1. Continue data preprocessing using the Apache Spark library that you are familiar with. Your dataset
remains a DataFrame in your Spark cluster. Load your data into a DataFrame and preprocess it so
that you have a features column with org.apache.spark.ml.linalg.Vector of Doubles,
and an optional label column with values of Double type.
2. Use the estimator in the SageMaker Spark library to train your model. For example, if you
choose the k-means algorithm provided by SageMaker for model training, you call the
KMeansSageMakerEstimator.fit method.

Provide your DataFrame as input. The estimator returns a SageMakerModel object.


Note
SageMakerModel extends the org.apache.spark.ml.Model.

The fit method does the following:

a. Converts the input DataFrame to the protobuf format by selecting the features and label
columns from the input DataFrame and uploading the protobuf data to an Amazon S3 bucket.
The protobuf format is efficient for model training in SageMaker.
b. Starts model training in SageMaker by sending a SageMaker CreateTrainingJob request.
After model training has completed, SageMaker saves the model artifacts to an S3 bucket.

SageMaker assumes the IAM role that you specified for model training to perform tasks on your
behalf. For example, it uses the role to read training data from an S3 bucket and to write model
artifacts to a bucket.
c. Creates and returns a SageMakerModel object. The constructor does the following tasks, which
are related to deploying your model to SageMaker.

i. Sends a CreateModel request to SageMaker.


ii. Sends a CreateEndpointConfig request to SageMaker.
iii. Sends a CreateEndpoint request to SageMaker, which then launches the specified
resources, and hosts the model on them.
3. You can get inferences from your model hosted in SageMaker with the
SageMakerModel.transform method.

Provide an input DataFrame with features as input. The transform method transforms it
to a DataFrame containing inferences. Internally, the transform method sends a request to
the InvokeEndpoint SageMaker API to get inferences. The transform method appends the
inferences to the input DataFrame.


Example 1: Use Amazon SageMaker for Training and Inference with Apache Spark
Topics
• Use Custom Algorithms for Model Training and Hosting on Amazon SageMaker with Apache
Spark (p. 22)
• Use the SageMakerEstimator in a Spark Pipeline (p. 23)

Amazon SageMaker provides an Apache Spark library (in both Python and Scala) that you can use to
integrate your Apache Spark applications with SageMaker. For example, you might use Apache Spark
for data preprocessing and SageMaker for model training and hosting. For more information, see Use
Apache Spark with Amazon SageMaker (p. 16). This section provides example code that uses the
Apache Spark Scala library provided by SageMaker to train a model in SageMaker using DataFrames
in your Spark cluster. The example also hosts the resulting model artifacts using SageMaker hosting
services. Specifically, this example does the following:

• Uses the KMeansSageMakerEstimator to fit (or train) a model on data

Because the example uses the k-means algorithm provided by SageMaker to train a model, you
use the KMeansSageMakerEstimator. You train the model using images of handwritten single-
digit numbers (from the MNIST dataset). You provide the images as an input DataFrame. For your
convenience, SageMaker provides this dataset in an S3 bucket.

In response, the estimator returns a SageMakerModel object.

• Obtains inferences using the trained SageMakerModel

To get inferences from a model hosted in SageMaker, you call the SageMakerModel.transform
method. You pass a DataFrame as input. The method transforms the input DataFrame to another
DataFrame containing inferences obtained from the model.

For a given input image of a handwritten single-digit number, the inference identifies a cluster that the
image belongs to. For more information, see K-Means Algorithm (p. 1485).

This is the example code:

import org.apache.spark.sql.SparkSession
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator

val spark = SparkSession.builder.getOrCreate

// load mnist data as a dataframe from libsvm
val region = "us-east-1"
val trainingData = spark.read.format("libsvm")
.option("numFeatures", "784")
.load(s"s3://sagemaker-sample-data-$region/spark/mnist/train/")


val testData = spark.read.format("libsvm")
.option("numFeatures", "784")
.load(s"s3://sagemaker-sample-data-$region/spark/mnist/test/")

val roleArn = "arn:aws:iam::account-id:role/rolename"

val estimator = new KMeansSageMakerEstimator(
sagemakerRole = IAMRole(roleArn),
trainingInstanceType = "ml.p2.xlarge",
trainingInstanceCount = 1,
endpointInstanceType = "ml.c4.xlarge",
endpointInitialInstanceCount = 1)
.setK(10).setFeatureDim(784)

// train
val model = estimator.fit(trainingData)

val transformedData = model.transform(testData)
transformedData.show

The code does the following:

• Loads the MNIST dataset from an S3 bucket provided by SageMaker (sagemaker-sample-data-<region>)
into a Spark DataFrame (trainingData):

// Get a Spark session.

val spark = SparkSession.builder.getOrCreate

// load mnist data as a dataframe from libsvm
val region = "us-east-1"
val trainingData = spark.read.format("libsvm")
.option("numFeatures", "784")
.load(s"s3://sagemaker-sample-data-$region/spark/mnist/train/")
val testData = spark.read.format("libsvm")
.option("numFeatures", "784")
.load(s"s3://sagemaker-sample-data-$region/spark/mnist/test/")

val roleArn = "arn:aws:iam::account-id:role/rolename"


trainingData.show()

The show method displays the first 20 rows in the data frame:

+-----+--------------------+
|label| features|
+-----+--------------------+
| 5.0|(784,[152,153,154...|
| 0.0|(784,[127,128,129...|
| 4.0|(784,[160,161,162...|
| 1.0|(784,[158,159,160...|
| 9.0|(784,[208,209,210...|
| 2.0|(784,[155,156,157...|
| 1.0|(784,[124,125,126...|
| 3.0|(784,[151,152,153...|
| 1.0|(784,[152,153,154...|
| 4.0|(784,[134,135,161...|
| 3.0|(784,[123,124,125...|
| 5.0|(784,[216,217,218...|
| 3.0|(784,[143,144,145...|
| 6.0|(784,[72,73,74,99...|
| 1.0|(784,[151,152,153...|
| 7.0|(784,[211,212,213...|
| 2.0|(784,[151,152,153...|
| 8.0|(784,[159,160,161...|
| 6.0|(784,[100,101,102...|
| 9.0|(784,[209,210,211...|
+-----+--------------------+
only showing top 20 rows

In each row:
• The label column identifies the image's label. For example, if the image of the handwritten number
is the digit 5, the label value is 5.
• The features column stores a vector (org.apache.spark.ml.linalg.Vector) of Double
values. These are the 784 features of the handwritten number. (Each handwritten number is a 28 x
28-pixel image, making 784 features.)

• Creates a SageMaker estimator (KMeansSageMakerEstimator)

The fit method of this estimator uses the k-means algorithm provided by SageMaker to train models
using an input DataFrame. In response, it returns a SageMakerModel object that you can use to get
inferences.
Note
The KMeansSageMakerEstimator extends the SageMaker SageMakerEstimator, which
extends the Apache Spark Estimator.

val estimator = new KMeansSageMakerEstimator(
sagemakerRole = IAMRole(roleArn),
trainingInstanceType = "ml.p2.xlarge",
trainingInstanceCount = 1,
endpointInstanceType = "ml.c4.xlarge",
endpointInitialInstanceCount = 1)
.setK(10).setFeatureDim(784)

The constructor parameters provide information that is used for training a model and deploying it on
SageMaker:
• trainingInstanceType and trainingInstanceCount—Identify the type and number of ML
compute instances to use for model training.

• endpointInstanceType—Identifies the ML compute instance type to use when hosting the model
in SageMaker. By default, one ML compute instance is assumed.

• endpointInitialInstanceCount—Identifies the number of ML compute instances initially


backing the endpoint hosting the model in SageMaker.

• sagemakerRole—SageMaker assumes this IAM role to perform tasks on your behalf. For example,
for model training, it reads data from S3 and writes training results (model artifacts) to S3.
Note
This example implicitly creates a SageMaker client. To create this client, you must provide
your credentials. The API uses these credentials to authenticate requests to SageMaker. For
example, it uses the credentials to authenticate requests to create a training job and API
calls for deploying the model using SageMaker hosting services.
• After the KMeansSageMakerEstimator object has been created, you set the following parameters,
which are used in model training:

• The number of clusters that the k-means algorithm should create during model training. You
specify 10 clusters, one for each digit, 0 through 9.
• The number of features for each input image. Each handwritten number is a 28 x 28-pixel
image, making 784 features.

• Calls the estimator fit method

// train
val model = estimator.fit(trainingData)

You pass the input DataFrame as a parameter. The fit method does all the work of training the model
and deploying it to SageMaker. For more information, see Integrate Your Apache Spark Application
with SageMaker (p. 17). In response, you get a SageMakerModel object, which you can use to get
inferences from your model deployed in SageMaker.

You provide only the input DataFrame. You don't need to specify the registry path to the k-means
algorithm used for model training because the KMeansSageMakerEstimator knows it.

• Calls the SageMakerModel.transform method to get inferences from the model deployed in
SageMaker.

The transform method takes a DataFrame as input, transforms it, and returns another DataFrame
containing inferences obtained from the model.

val transformedData = model.transform(testData)


transformedData.show

For simplicity, we use the same DataFrame as input to the transform method that we used for
model training in this example. The transform method does the following:
• Serializes the features column in the input DataFrame to protobuf and sends it to the SageMaker
endpoint for inference.
• Deserializes the protobuf response into the two additional columns (distance_to_cluster and
closest_cluster) in the transformed DataFrame.

The show method displays the inferences for the first 20 rows in the input DataFrame:

+-----+--------------------+-------------------+---------------+
|label| features|distance_to_cluster|closest_cluster|
+-----+--------------------+-------------------+---------------+
| 5.0|(784,[152,153,154...| 1767.897705078125| 4.0|
| 0.0|(784,[127,128,129...| 1392.157470703125| 5.0|
| 4.0|(784,[160,161,162...| 1671.5711669921875| 9.0|
| 1.0|(784,[158,159,160...| 1182.6082763671875| 6.0|
| 9.0|(784,[208,209,210...| 1390.4002685546875| 0.0|
| 2.0|(784,[155,156,157...| 1713.988037109375| 1.0|
| 1.0|(784,[124,125,126...| 1246.3016357421875| 2.0|
| 3.0|(784,[151,152,153...| 1753.229248046875| 4.0|
| 1.0|(784,[152,153,154...| 978.8394165039062| 2.0|
| 4.0|(784,[134,135,161...| 1623.176513671875| 3.0|
| 3.0|(784,[123,124,125...| 1533.863525390625| 4.0|
| 5.0|(784,[216,217,218...| 1469.357177734375| 6.0|
| 3.0|(784,[143,144,145...| 1736.765869140625| 4.0|
| 6.0|(784,[72,73,74,99...| 1473.69384765625| 8.0|
| 1.0|(784,[151,152,153...| 944.88720703125| 2.0|
| 7.0|(784,[211,212,213...| 1285.9071044921875| 3.0|
| 2.0|(784,[151,152,153...| 1635.0125732421875| 1.0|
| 8.0|(784,[159,160,161...| 1436.3162841796875| 6.0|
| 6.0|(784,[100,101,102...| 1499.7366943359375| 7.0|
| 9.0|(784,[209,210,211...| 1364.6319580078125| 6.0|
+-----+--------------------+-------------------+---------------+

You can interpret the data as follows:


• A handwritten number with the label 5 belongs to cluster 4 (closest_cluster).
• A handwritten number with the label 0 belongs to cluster 5.
• A handwritten number with the label 4 belongs to cluster 9.
• A handwritten number with the label 1 belongs to cluster 6.

For more information on how to run these examples, see
https://github.com/aws/sagemaker-spark/blob/master/README.md on GitHub.

Use Custom Algorithms for Model Training and Hosting on Amazon SageMaker
with Apache Spark
In Example 1: Use Amazon SageMaker for Training and Inference with Apache Spark (p. 18), you
use the kMeansSageMakerEstimator because the example uses the k-means algorithm provided by
Amazon SageMaker for model training. You might choose to use your own custom algorithm for model
training instead. Assuming that you have already created a Docker image, you can create your own
SageMakerEstimator and specify the Amazon Elastic Container Registry path for your custom image.

The following example shows how to create the equivalent of the KMeansSageMakerEstimator from the
generic SageMakerEstimator. In the new estimator, you explicitly specify the Docker registry path to your
training and inference code images.

import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.SageMakerEstimator
import com.amazonaws.services.sagemaker.sparksdk.transformation.serializers.ProtobufRequestRowSerializer
import com.amazonaws.services.sagemaker.sparksdk.transformation.deserializers.KMeansProtobufResponseRowDeserializer

val estimator = new SageMakerEstimator(
trainingImage =
"811284229777.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
modelImage =
"811284229777.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
requestRowSerializer = new ProtobufRequestRowSerializer(),
responseRowDeserializer = new KMeansProtobufResponseRowDeserializer(),
hyperParameters = Map("k" -> "10", "feature_dim" -> "784"),
sagemakerRole = IAMRole(roleArn),
trainingInstanceType = "ml.p2.xlarge",
trainingInstanceCount = 1,
endpointInstanceType = "ml.c4.xlarge",
endpointInitialInstanceCount = 1,
trainingSparkDataFormat = "sagemaker")

In the code, the parameters in the SageMakerEstimator constructor include:

• trainingImage —Identifies the Docker registry path to the training image containing your custom
code.
• modelImage —Identifies the Docker registry path to the image containing inference code.


• requestRowSerializer—Implements
com.amazonaws.services.sagemaker.sparksdk.transformation.RequestRowSerializer.
This parameter serializes rows in the input DataFrame to send them to the model hosted in
SageMaker for inference.
• responseRowDeserializer—Implements
com.amazonaws.services.sagemaker.sparksdk.transformation.ResponseRowDeserializer.
This parameter deserializes responses from the model, hosted in SageMaker, back into a DataFrame.
• trainingSparkDataFormat—Specifies the data format that Spark uses when uploading training
data from a DataFrame to S3. For example, "sagemaker" for protobuf format, "csv" for comma-
separated values, and "libsvm" for LibSVM format.

You can implement your own RequestRowSerializer and ResponseRowDeserializer to serialize
and deserialize rows from a data format that your inference code supports, such as .libsvm or .csv.

Use the SageMakerEstimator in a Spark Pipeline


You can use org.apache.spark.ml.Estimator estimators and org.apache.spark.ml.Model
models, and SageMakerEstimator estimators and SageMakerModel models in
org.apache.spark.ml.Pipeline pipelines, as shown in the following example:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator
import com.amazonaws.services.sagemaker.sparksdk.transformation.serializers.ProtobufRequestRowSerializer

val spark = SparkSession.builder.getOrCreate

// load mnist data as a dataframe from libsvm
val region = "us-east-1"
val trainingData = spark.read.format("libsvm")
.option("numFeatures", "784")
.load(s"s3://sagemaker-sample-data-$region/spark/mnist/train/")
val testData = spark.read.format("libsvm")
.option("numFeatures", "784")
.load(s"s3://sagemaker-sample-data-$region/spark/mnist/test/")

// substitute your SageMaker IAM role here
val roleArn = "arn:aws:iam::account-id:role/rolename"

val pcaEstimator = new PCA()
.setInputCol("features")
.setOutputCol("projectedFeatures")
.setK(50)

val kMeansSageMakerEstimator = new KMeansSageMakerEstimator(
sagemakerRole = IAMRole(roleArn),
requestRowSerializer =
new ProtobufRequestRowSerializer(featuresColumnName = "projectedFeatures"),
trainingSparkDataFormatOptions = Map("featuresColumnName" -> "projectedFeatures"),
trainingInstanceType = "ml.p2.xlarge",
trainingInstanceCount = 1,
endpointInstanceType = "ml.c4.xlarge",
endpointInitialInstanceCount = 1)
.setK(10).setFeatureDim(50)

val pipeline = new Pipeline().setStages(Array(pcaEstimator, kMeansSageMakerEstimator))


// train
val pipelineModel = pipeline.fit(trainingData)

val transformedData = pipelineModel.transform(testData)
transformedData.show()

The parameter trainingSparkDataFormatOptions configures Spark to serialize to protobuf the
"projectedFeatures" column for model training. Additionally, Spark serializes to protobuf the "label"
column by default.

Because we want to make inferences using the "projectedFeatures" column, we pass the column name
into the ProtobufRequestRowSerializer.

The following example shows a transformed DataFrame:

+-----+--------------------+--------------------+-------------------+---------------+
|label| features| projectedFeatures|distance_to_cluster|closest_cluster|
+-----+--------------------+--------------------+-------------------+---------------+
| 5.0|(784,[152,153,154...|[880.731433034386...| 1500.470703125| 0.0|
| 0.0|(784,[127,128,129...|[1768.51722024166...| 1142.18359375| 4.0|
| 4.0|(784,[160,161,162...|[704.949236329314...| 1386.246826171875| 9.0|
| 1.0|(784,[158,159,160...|[-42.328192193771...| 1277.0736083984375| 5.0|
| 9.0|(784,[208,209,210...|[374.043902028333...| 1211.00927734375| 3.0|
| 2.0|(784,[155,156,157...|[941.267714528850...| 1496.157958984375| 8.0|
| 1.0|(784,[124,125,126...|[30.2848596410594...| 1327.6766357421875| 5.0|
| 3.0|(784,[151,152,153...|[1270.14374062052...| 1570.7674560546875| 0.0|
| 1.0|(784,[152,153,154...|[-112.10792566485...| 1037.568359375| 5.0|
| 4.0|(784,[134,135,161...|[452.068280676606...| 1165.1236572265625| 3.0|
| 3.0|(784,[123,124,125...|[610.596447285397...| 1325.953369140625| 7.0|
| 5.0|(784,[216,217,218...|[142.959601818422...| 1353.4930419921875| 5.0|
| 3.0|(784,[143,144,145...|[1036.71862533658...| 1460.4315185546875| 7.0|
| 6.0|(784,[72,73,74,99...|[996.740157435754...| 1159.8631591796875| 2.0|
| 1.0|(784,[151,152,153...|[-107.26076167417...| 960.963623046875| 5.0|
| 7.0|(784,[211,212,213...|[619.771820430940...| 1245.13623046875| 6.0|
| 2.0|(784,[151,152,153...|[850.152101817161...| 1304.437744140625| 8.0|
| 8.0|(784,[159,160,161...|[370.041887230547...| 1192.4781494140625| 0.0|
| 6.0|(784,[100,101,102...|[546.674328209335...| 1277.0908203125| 2.0|
| 9.0|(784,[209,210,211...|[-29.259112927426...| 1245.8182373046875| 6.0|
+-----+--------------------+--------------------+-------------------+---------------+

SDK examples: Use Amazon SageMaker with Apache Spark


The following list is a subset of available examples. Visit the examples website to see more.

• sagemaker-spark: a Spark library for SageMaker
• SageMaker PySpark K-Means Clustering MNIST Example
• Distributed Data Processing using Apache Spark and SageMaker Processing

Note
To run the notebooks on a notebook instance, see Example Notebooks (p. 220). To run the
notebooks on Studio, see Create or Open an Amazon SageMaker Studio Notebook (p. 148).

Use Chainer with Amazon SageMaker


You can use SageMaker to train and deploy a model using custom Chainer code. The SageMaker Python
SDK Chainer estimators and models and the SageMaker open-source Chainer container make writing a
Chainer script and running it in SageMaker easier.


What do you want to do?


I want to train a custom Chainer model in SageMaker.

For a sample Jupyter notebook, see the Chainer example notebooks in the Amazon SageMaker
Examples GitHub repository.

For documentation, see Train a Model with Chainer.


I have a Chainer model that I trained in SageMaker, and I want to deploy it to a hosted endpoint.

For more information, see Deploy Chainer models.


I have a Chainer model that I trained outside of SageMaker, and I want to deploy it to a SageMaker
endpoint

For more information, see Deploy Endpoints from Model Data.


I want to see the API documentation for Amazon SageMaker Python SDK Chainer classes.

For more information, see Chainer Classes.


I want to find information about SageMaker Chainer containers.

For more information, see the SageMaker Chainer Container GitHub repository.

For information about supported Chainer versions, and for general information about writing Chainer
training scripts and using Chainer estimators and models with SageMaker, see Using Chainer with the
SageMaker Python SDK.

Use Hugging Face with Amazon SageMaker


Amazon SageMaker enables customers to train, fine-tune, and run inference using Hugging Face models
for Natural Language Processing (NLP) on SageMaker. You can use Hugging Face for both training and
inference. This functionality is available through the development of Hugging Face AWS Deep Learning
Containers. These containers include the Hugging Face Transformers, Tokenizers, and Datasets libraries,
which allow you to use these resources for your training and inference jobs. For a list of the available
Deep Learning Containers images, see Available Deep Learning Containers Images. These Deep Learning
Containers images are maintained and regularly updated with security patches.

To use the Hugging Face Deep Learning Containers with the SageMaker Python SDK for training, see the
Hugging Face SageMaker Estimator. With the Hugging Face Estimator, you can use the Hugging Face
models as you would any other SageMaker Estimator. However, using the SageMaker Python SDK is
optional. You can also orchestrate your use of the Hugging Face Deep Learning Containers with the AWS
CLI and AWS SDK for Python (Boto3).

For more information on Hugging Face and the models available in it, see the Hugging Face
documentation.

Training
To run training, you can use any of the thousands of models available in Hugging Face and fine-tune
them for your specific use case with additional training. With SageMaker, you can use standard training
or take advantage of SageMaker Distributed Data and Model Parallel training. As with other SageMaker
training jobs using custom code, you can capture your own metrics by passing a metrics definition to the
SageMaker Python SDK as shown in Defining Training Metrics (SageMaker Python SDK). The captured
metrics are then accessible via CloudWatch and as a Pandas DataFrame via the TrainingJobAnalytics
method. Once your model is trained and fine-tuned, you can use it like any other model to run inference
jobs.

How to run training with the Hugging Face Estimator


You can implement the Hugging Face Estimator for training jobs using the SageMaker Python SDK. The
SageMaker Python SDK is an open source library for training and deploying machine learning models
on SageMaker. For more information on the Hugging Face Estimator, see the SageMaker Python SDK
documentation.

With the SageMaker Python SDK, you can run training jobs using the Hugging Face Estimator in the
following environments (a minimal training sketch follows the list):

• SageMaker Studio: Amazon SageMaker Studio is the first fully integrated development environment
(IDE) for machine learning (ML). SageMaker Studio provides a single, web-based visual interface where
you can perform all ML development steps required to prepare, build, train and tune, deploy and
manage models. For information on using Jupyter Notebooks in Studio, see Use Amazon SageMaker
Studio Notebooks.
• SageMaker Notebook Instances: An Amazon SageMaker notebook instance is a machine learning
(ML) compute instance running the Jupyter Notebook App. This app lets you run Jupyter Notebooks
in your notebook instance to prepare and process data, write code to train models, deploy models
to SageMaker hosting, and test or validate your models without SageMaker Studio features like
Debugger, Model Monitoring, and a web-based IDE.
• Locally: If you have connectivity to AWS and have appropriate SageMaker permissions, you can use
the SageMaker Python SDK locally to launch remote training and inference jobs for Hugging Face in
SageMaker on AWS. This works on your local machine, as well as other AWS services with a connected
SageMaker Python SDK and appropriate permissions.
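
The following is a minimal sketch of such a training job; the entry point script, hyperparameters, IAM
role, and framework versions are assumptions, so check the Hugging Face Deep Learning Containers
documentation for currently supported version combinations.

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",  # your fine-tuning script
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # hypothetical role
    transformers_version="4.26",  # assumed supported version
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 3, "model_name_or_path": "distilbert-base-uncased"},
)

# Starts a SageMaker training job using the Hugging Face container.
estimator.fit({"train": "s3://my-bucket/train-dataset"})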

Inference
For inference, you can use your trained Hugging Face model or one of the pretrained Hugging Face
models to deploy an inference job with SageMaker. With this collaboration, you only need one line of
code to deploy both your trained models and pre-trained models with SageMaker. You can also run
inference jobs without having to write any custom inference code. With custom inference code, you can
customize the inference logic by providing your own Python script.

How to deploy an inference job using the Hugging Face Deep Learning
Containers
You have two options for running inference with SageMaker. You can run inference using a model that
you trained, or deploy a pre-trained Hugging Face model (a deployment sketch follows the list below).

• Run inference with your trained model: You have two options for running inference with your own
trained model. You can run inference with a model that you trained using an existing Hugging Face
model with the SageMaker Hugging Face Deep Learning Containers, or you can bring your own existing
Hugging Face model and deploy it using SageMaker. When you run inference with a model that you
trained with the SageMaker Hugging Face Estimator, you can deploy the model immediately after
training completes or you can upload the trained model to an Amazon S3 bucket and ingest it when
running inference later. If you bring your own existing Hugging Face model, you must upload the
trained model to an Amazon S3 bucket and ingest that bucket when running inference as shown in
Deploy your Hugging Face Transformers for inference example.
• Run inference with a pre-trained Hugging Face model: You can use one of the thousands of pre-
trained Hugging Face models to run your inference jobs with no additional training needed. To run
inference, you select the pre-trained model from the list of Hugging Face models, as outlined in
Deploy pre-trained Hugging Face Transformers for inference example.
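
For example, the following is a hedged sketch of deploying a pre-trained Hub model as a SageMaker
endpoint; the model ID, task, IAM role, and framework versions are assumptions chosen for illustration.

from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # Hub model
        "HF_TASK": "text-classification",  # inference task
    },
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # hypothetical role
    transformers_version="4.26",  # assumed supported version
    pytorch_version="1.13",
    py_version="py39",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
print(predictor.predict({"inputs": "I love using SageMaker with Hugging Face!"}))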


What do you want to do?


The following Jupyter Notebooks in the Hugging Face notebooks repository illustrate how to use the
Hugging Face Deep Learning Containers with SageMaker in various use cases.

I want to train and deploy a text classification model using Hugging Face in SageMaker with PyTorch.

For a sample Jupyter Notebook, see the PyTorch Getting Started Demo.
I want to train and deploy a text classification model using Hugging Face in SageMaker with TensorFlow.

For a sample Jupyter Notebook, see the TensorFlow Getting Started example.
I want to run distributed training with data parallelism using Hugging Face and SageMaker Distributed.

For a sample Jupyter Notebook, see the Distributed Training example.


I want to run distributed training with model parallelism using Hugging Face and SageMaker Distributed.

For a sample Jupyter Notebook, see the Model Parallelism example.


I want to use a spot instance to train and deploy a model using Hugging Face in SageMaker.

For a sample Jupyter Notebook, see the Spot Instances example.


I want to capture custom metrics and use SageMaker Checkpointing when training a text classification
model using Hugging Face in SageMaker.

For a sample Jupyter Notebook, see the Training with Custom Metrics example.
I want to train a distributed question-answering TensorFlow model using Hugging Face in SageMaker.

For a sample Jupyter Notebook, see the Distributed TensorFlow Training example.
I want to train a distributed summarization model using Hugging Face in SageMaker.

For a sample Jupyter Notebook, see the Distributed Summarization Training example.
I want to train an image classification model using Hugging Face in SageMaker.

For a sample Jupyter Notebook, see the Vision Transformer Training example.
I want to deploy my trained Hugging Face model in SageMaker.

For a sample Jupyter Notebook, see the Deploy your Hugging Face Transformers for inference
example.
I want to deploy a pre-trained Hugging Face model in SageMaker.

For a sample Jupyter Notebook, see the Deploy pre-trained Hugging Face Transformers for inference
example.

Use PyTorch with Amazon SageMaker


You can use Amazon SageMaker to train and deploy a model using custom PyTorch code. The SageMaker
Python SDK PyTorch estimators and models and the SageMaker open-source PyTorch container make
writing a PyTorch script and running it in SageMaker easier.

What do you want to do?


I want to train a custom PyTorch model in SageMaker.

For a sample Jupyter notebook, see the PyTorch example notebook in the Amazon SageMaker
Examples GitHub repository.

For documentation, see Train a Model with PyTorch.


I have a PyTorch model that I trained in SageMaker, and I want to deploy it to a hosted endpoint.

For more information, see Deploy PyTorch models.


I have a PyTorch model that I trained outside of SageMaker, and I want to deploy it to a SageMaker
endpoint.

For more information, see Deploy Endpoints from Model Data.


I want to see the API documentation for Amazon SageMaker Python SDK PyTorch classes.

For more information, see PyTorch Classes.


I want to find the SageMaker PyTorch container repository.

For more information, see SageMaker PyTorch Container GitHub repository.


I want to find information about PyTorch versions supported by AWS Deep Learning Containers.

For more information, see Available Deep Learning Container Images.

For general information about writing PyTorch training scripts and using PyTorch estimators and models
with SageMaker, see Using PyTorch with the SageMaker Python SDK.
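
As a minimal sketch of these workflows with the SageMaker Python SDK, the following trains a custom
script with the PyTorch estimator and deploys model artifacts from Amazon S3 with PyTorchModel. The
script names, role ARN, S3 paths, and framework versions shown are placeholders.

from sagemaker.pytorch import PyTorch, PyTorchModel

role = "arn:aws:iam::111122223333:role/service-role/example-sagemaker-role"  # placeholder

# Train a custom PyTorch script in SageMaker.
estimator = PyTorch(
    entry_point="train.py",              # your training script (placeholder)
    role=role,
    framework_version="1.13",            # example version
    py_version="py39",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
estimator.fit({"training": "s3://DOC-EXAMPLE-BUCKET/pytorch/data"})

# Deploy a model trained in or outside of SageMaker from its artifacts in Amazon S3.
model = PyTorchModel(
    model_data="s3://DOC-EXAMPLE-BUCKET/pytorch/model.tar.gz",
    role=role,
    entry_point="inference.py",          # script implementing the inference handlers (placeholder)
    framework_version="1.13",
    py_version="py39",
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")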

R User Guide to Amazon SageMaker


This guide walks you through ways of using Amazon SageMaker features with R. It introduces
SageMaker's built-in R kernel, explains how to get started with R on SageMaker, and closes with
several example notebooks.

The examples are organized into three levels: Beginner, Intermediate, and Advanced. They start
with Getting Started with R on SageMaker, continue to end-to-end machine learning with R on
SageMaker, and finish with more advanced topics such as SageMaker Processing with an R script and
Bring-Your-Own (BYO) R algorithm to SageMaker.

For information on how to bring your own custom R image to Studio, see Bring your own SageMaker
image (p. 169). For a similar blog article, see Bringing your own R environment to Amazon SageMaker
Studio.

RStudio Support in SageMaker


Amazon SageMaker supports RStudio as a fully-managed integrated development environment
(IDE) integrated with Amazon SageMaker Domain. With RStudio integration, you can launch an
RStudio environment in the Domain to run your RStudio workflows on SageMaker resources. For more
information, see RStudio on Amazon SageMaker (p. 432).

R Kernel in SageMaker
SageMaker notebook instances support R using a pre-installed R kernel. The R kernel also includes the
reticulate library, an R-to-Python interface, so you can use the features of the SageMaker Python SDK
from within an R script.

• reticulate library: provides an R interface to the Amazon SageMaker Python SDK. The reticulate
package translates between R and Python objects.

Get Started with R in SageMaker


• Create a Notebook Instance using the t2.medium instance type and default storage size. You can
pick a faster instance and more storage if you plan to continue using the instance for more advanced
examples, or create a bigger instance later.


• Wait until the status of the notebook is In Service, and then choose Open Jupyter.

• Create a new notebook with R kernel from the list of available environments.

• When the new notebook is created, you should see an R logo in the upper right corner of the notebook
environment, and also R as the kernel under that logo. This indicates that SageMaker has successfully
launched the R kernel for this notebook.

• Alternatively, when you are in a Jupyter notebook, you can use the Kernel menu, and then select R
from the Change Kernel option.


Example Notebooks
Prerequisites

Getting Started with R on SageMaker: This sample notebook describes how you can develop R scripts
using Amazon SageMaker's R kernel. In this notebook, you set up your SageMaker environment and
permissions, download the abalone dataset from the UCI Machine Learning Repository, do some basic
processing and visualization on the data, then save the data in .csv format to Amazon S3.

Beginner Level

SageMaker Batch Transform using R Kernel: This sample notebook describes how to conduct a batch
transform job using SageMaker’s Transformer API and the XGBoost algorithm. The notebook also uses
the Abalone dataset.

Intermediate Level

Hyperparameter Optimization for XGBoost in R: This sample notebook extends the previous
beginner notebooks that use the abalone dataset and XGBoost. It describes how to do model tuning
with hyperparameter optimization. You will also learn how to use batch transform for batching
predictions, as well as how to create a model endpoint to make real-time predictions.

Amazon SageMaker Processing with R: SageMaker Processing lets you run preprocessing, post-processing,
and model evaluation workloads. This example shows you how to create an R script to orchestrate a
Processing job.

Advanced Level

Train and Deploy Your Own R Algorithm in SageMaker: Do you already have an R algorithm, and you
want to bring it into SageMaker to tune, train, or deploy it? This example walks you through how to
customize SageMaker containers with custom R packages, all the way to using a hosted endpoint for
inference on your R-origin model.

Use Scikit-learn with Amazon SageMaker


You can use Amazon SageMaker to train and deploy a model using custom Scikit-learn code. The
SageMaker Python SDK Scikit-learn estimators and models and the SageMaker open-source Scikit-learn
containers make writing a Scikit-learn script and running it in SageMaker easier.

Requirements

Scikit-learn 1.2 has the following dependencies.

Dependency        Minimum version
Python            3.8
NumPy             1.17.3
SciPy             1.3.2
joblib            1.1.1
threadpoolctl     2.0.0

The SageMaker Scikit-learn container supports the following Scikit-learn versions.


Supported Scikit-learn version    Minimum Python version
1.2-1                             3.8
1.0-1                             3.7
0.23-1                            3.6
0.20.0                            2.7 or 3.4

For general information about writing Scikit-learn training scripts and using Scikit-learn estimators and
models with SageMaker, see Using Scikit-learn with the SageMaker Python SDK.

What do you want to do?


Note
Matplotlib v2.2.3 or newer is required to run the SageMaker Scikit-learn example notebooks.

I want to use Scikit-learn for data processing, feature engineering, or model evaluation in SageMaker.

For a sample Jupyter notebook, see https://github.com/awslabs/amazon-sagemaker-examples/
tree/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation.

For documentation, see ReadTheDocs.


I want to train a custom Scikit-learn model in SageMaker.

For a sample Jupyter notebook, see https://github.com/awslabs/amazon-sagemaker-examples/
tree/master/sagemaker-python-sdk/scikit_learn_iris.

For documentation, see Train a Model with Scikit-learn.


I have a Scikit-learn model that I trained in SageMaker, and I want to deploy it to a hosted endpoint.

For more information, see Deploy Scikit-learn models.


I have a Scikit-learn model that I trained outside of SageMaker, and I want to deploy it to a SageMaker
endpoint.

For more information, see Deploy Endpoints from Model Data.


I want to see the API documentation for Amazon SageMaker Python SDK Scikit-learn classes.

For more information, see Scikit-learn Classes.


I want to see information about SageMaker Scikit-learn containers.

For more information, see SageMaker Scikit-learn Container GitHub repository.
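
The following is a minimal sketch of training and deploying a custom Scikit-learn script with the
SageMaker Python SDK. The script name, role ARN, and S3 paths are placeholders; the framework
version comes from the supported versions table above.

from sagemaker.sklearn import SKLearn

role = "arn:aws:iam::111122223333:role/service-role/example-sagemaker-role"  # placeholder

estimator = SKLearn(
    entry_point="train.py",              # your Scikit-learn training script (placeholder)
    role=role,
    instance_type="ml.m5.xlarge",
    framework_version="1.2-1",           # a supported container version from the table above
    py_version="py3",
)
estimator.fit({"train": "s3://DOC-EXAMPLE-BUCKET/sklearn/train"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")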

Use SparkML Serving with Amazon SageMaker


The Amazon SageMaker Python SDK SparkML Serving model and predictor and the Amazon SageMaker
open-source SparkML Serving container support deploying Apache Spark ML pipelines serialized with
MLeap in SageMaker to get inferences.

For information about using the SparkML Serving container to deploy models to SageMaker, see
SageMaker Spark ML Container GitHub repository. For information about the Amazon SageMaker
Python SDK SparkML Serving model and predictors, see the SparkML Serving Model and Predictor API
documentation.
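
As a minimal sketch, assuming an MLeap-serialized Spark ML pipeline already stored in Amazon S3,
you might deploy it with the SparkMLModel class from the SageMaker Python SDK as follows. The role
ARN, S3 path, and schema columns are placeholders; the schema is passed to the container through the
SAGEMAKER_SPARKML_SCHEMA environment variable.

import json
from sagemaker.sparkml import SparkMLModel

role = "arn:aws:iam::111122223333:role/service-role/example-sagemaker-role"  # placeholder

# Input/output schema that the MLeap-serialized pipeline expects (hypothetical columns).
schema = {
    "input": [{"name": "feature_1", "type": "double"}],
    "output": {"name": "prediction", "type": "double"},
}

model = SparkMLModel(
    model_data="s3://DOC-EXAMPLE-BUCKET/sparkml/model.tar.gz",  # MLeap-serialized pipeline
    role=role,
    env={"SAGEMAKER_SPARKML_SCHEMA": json.dumps(schema)},
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")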


Use TensorFlow with Amazon SageMaker


You can use Amazon SageMaker to train and deploy a model using custom TensorFlow code. The
SageMaker Python SDK TensorFlow estimators and models and the SageMaker open-source TensorFlow
containers make writing a TensorFlow script and running it in SageMaker easier.

Use TensorFlow Version 1.11 and Later


For TensorFlow versions 1.11 and later, the Amazon SageMaker Python SDK supports script mode
training scripts.

What do you want to do?


I want to train a custom TensorFlow model in SageMaker.

For a sample Jupyter notebook, see TensorFlow script mode training and serving.

For documentation, see Train a Model with TensorFlow.


I have a TensorFlow model that I trained in SageMaker, and I want to deploy it to a hosted endpoint.

For more information, see Deploy TensorFlow Serving models.


I have a TensorFlow model that I trained outside of SageMaker, and I want to deploy it to a SageMaker
endpoint.

For more information, see Deploying directly from model artifacts.


I want to see the API documentation for Amazon SageMaker Python SDK TensorFlow classes.

For more information, see TensorFlow Estimator.


I want to find the SageMaker TensorFlow container repository.

For more information, see SageMaker TensorFlow Container GitHub repository.


I want to find information about TensorFlow versions supported by AWS Deep Learning Containers.

For more information, see Available Deep Learning Container Images.

For general information about writing TensorFlow script mode training scripts and using TensorFlow
script mode estimators and models with SageMaker, see Using TensorFlow with the SageMaker Python
SDK.
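
The following is a minimal sketch of script mode training and deployment with the SageMaker Python
SDK. The script name, role ARN, S3 path, and framework versions are placeholders.

from sagemaker.tensorflow import TensorFlow

role = "arn:aws:iam::111122223333:role/service-role/example-sagemaker-role"  # placeholder

estimator = TensorFlow(
    entry_point="train.py",              # your script mode training script (placeholder)
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.11",            # example version
    py_version="py39",
)
estimator.fit({"training": "s3://DOC-EXAMPLE-BUCKET/tensorflow/data"})

# Deploys a TensorFlow Serving endpoint from the trained model.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")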

Use TensorFlow Legacy Mode for Versions 1.11 and Earlier


The Amazon SageMaker Python SDK provides a legacy mode that supports TensorFlow versions 1.11 and
earlier. Use legacy mode TensorFlow training scripts to run TensorFlow jobs in SageMaker if:

• You have existing legacy mode scripts that you do not want to convert to script mode.
• You want to use a TensorFlow version earlier than 1.11.

For information about writing legacy mode TensorFlow scripts to use with the SageMaker Python SDK,
see TensorFlow SageMaker Estimators and Models.

Use Triton Inference Server with Amazon SageMaker


SageMaker enables customers to deploy a model using custom code with NVIDIA Triton Inference Server.
This functionality is available through the development of Triton Inference Server Containers. These
containers include NVIDIA Triton Inference Server, support for common ML frameworks, and useful
environment variables that let you optimize performance on SageMaker. For a list of all available Deep
Learning Containers images, see Available Deep Learning Containers Images. Deep Learning Containers
images are maintained and regularly updated with security patches.

You can use the Triton Inference Server Container with SageMaker Python SDK as you would any other
container in your SageMaker models. However, using the SageMaker Python SDK is optional. You can use
Triton Inference Server Containers with the AWS CLI and AWS SDK for Python (Boto3).

For more information on NVIDIA Triton Inference Server, see the Triton documentation.

Inference
Note
The Triton Python backend uses shared memory (SHMEM) to connect your code to Triton.
SageMaker Inference provides up to half of the instance memory as SHMEM, so if you need a
larger SHMEM size, you can choose an instance type with more memory.

For inference, you can use your trained ML models with Triton Inference Server to deploy an inference
job with SageMaker.

Some of the key features of Triton Inference Server Container are:

• Support for multiple frameworks: Triton can be used to deploy models from all major ML
frameworks. Triton supports TensorFlow GraphDef and SavedModel, ONNX, PyTorch TorchScript,
TensorRT, and custom Python/C++ model formats.
• Model pipelines: A Triton model ensemble represents a pipeline of one or more models, with pre/post-
processing logic and the connection of input and output tensors between them. A single inference
request to an ensemble triggers the execution of the entire pipeline.
• Concurrent model execution: Multiple instances of the same model can run simultaneously on the
same GPU or on multiple GPUs.
• Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and
batching algorithms that combine individual inference requests together to improve inference
throughput. These scheduling and batching decisions are transparent to the client requesting
inference.
• Diverse CPU and GPU support: The models can be executed on CPUs or GPUs for maximum flexibility
and to support heterogeneous computing requirements.

What do you want to do?


I want to deploy my trained PyTorch model in SageMaker.

For a sample Jupyter Notebook, see the Deploy your PyTorch Resnet50 model with Triton Inference
Server example.
I want to deploy my trained Hugging Face model in SageMaker.

For a sample Jupyter Notebook, see the Deploy your PyTorch BERT model with Triton Inference
Server example.
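
Because a Triton Inference Server Container behaves like any other container in your SageMaker
models, deployment can look like the following minimal sketch with the SageMaker Python SDK. The
image URI, S3 path, and role ARN are placeholders: the model.tar.gz is assumed to contain a Triton
model repository, and you would look up an actual image URI in the Available Deep Learning
Containers Images list.

from sagemaker.model import Model

role = "arn:aws:iam::111122223333:role/service-role/example-sagemaker-role"  # placeholder

# Placeholder URI; use a real Triton image from the Available Deep Learning Containers Images list.
triton_image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:EXAMPLE-TAG"

model = Model(
    image_uri=triton_image_uri,
    model_data="s3://DOC-EXAMPLE-BUCKET/triton/model.tar.gz",  # archive containing a Triton model repository
    role=role,
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")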

Supported Regions and Quotas


For the AWS Regions supported by Amazon SageMaker and the Amazon Elastic Compute Cloud (Amazon
EC2) instance types that are available in each Region, see Amazon SageMaker Pricing.


For a list of the SageMaker service endpoints for each Region, see Amazon SageMaker endpoints and
quotas in the AWS General Reference.

Quotas
For a list of SageMaker quotas, see Amazon SageMaker endpoints and quotas in the AWS General
Reference.

The Service Quotas console provides information about your service quotas. You can use the Service
Quotas console to view your default service quotas or to request quota increases. To request a quota
increase for adjustable quotas, see Requesting a quota increase.

You can set up a quota request template for your AWS Organization that automatically requests quota
increases during account creation. For more information, see Using Service Quotas request templates.


Get Started with Amazon SageMaker


Before you can use Amazon SageMaker, you must sign up for an AWS account and create an
administrative user by following the steps in Set Up Amazon SageMaker Prerequisites (p. 35).

Amazon SageMaker Studio Lab does not require an AWS account or IAM integration.

After you complete these tasks, continue to one of the following topics, depending on your use case.

• Onboard to Amazon SageMaker Domain (p. 37): Follow these steps to create a Domain, which gives
you access to Amazon SageMaker Studio and RStudio on Amazon SageMaker. For more information
about Domains, see Amazon SageMaker Domain (p. 105).
• SageMaker JumpStart (p. 47): Follow these steps to start working with SageMaker JumpStart
and learn about SageMaker features and capabilities through curated one-click solutions, example
notebooks, and pretrained models that you can deploy. To use SageMaker JumpStart, which is a
feature of Amazon SageMaker Studio, you must first onboard to an Amazon SageMaker Domain.
• Get Started with Amazon SageMaker Notebook Instances (p. 87): Follow these steps to train and
deploy Machine Learning (ML) models using SageMaker notebook instances. SageMaker notebook
instances help create the environment by initiating Jupyter servers on Amazon Elastic Compute Cloud
(Amazon EC2) and providing preconfigured kernels. For more information, see Amazon SageMaker
Notebook Instances (p. 204).
• Amazon SageMaker Studio Lab (p. 230): Follow these steps to start working with Amazon SageMaker
Studio Lab. Studio Lab is a free service that gives you access to AWS compute resources, in an
environment based on open-source JupyterLab, without requiring an AWS account.

Topics
• Set Up Amazon SageMaker Prerequisites (p. 35)
• Onboard to Amazon SageMaker Domain (p. 37)
• SageMaker JumpStart (p. 47)
• Get Started with Amazon SageMaker Notebook Instances (p. 87)

Set Up Amazon SageMaker Prerequisites


In this section, you sign up for an AWS account and create an AWS Identity and Access Management
(IAM) admin user.

If you're new to SageMaker, we recommend that you read How Amazon SageMaker Works (p. 2).

Topics
• Create an AWS Account (p. 35)
• Create an Administrative User and Group (p. 36)
• AWS CLI Prerequisites (p. 37)

Create an AWS Account


In this section, you sign up for an AWS account. If you already have an AWS account, skip this step.

When you sign up for Amazon Web Services (AWS), your AWS account is automatically signed up for all
AWS services, including SageMaker. You are charged only for the services that you use.


To create an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.

When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.

Write down your AWS account ID because you'll need it for the next task.

Create an Administrative User and Group


When you create an AWS account, you get a single sign-in identity that has complete access to all of the
AWS services and resources in the account. This identity is called the AWS account root user. Signing in
to the AWS console using the email address and password that you used to create the account gives you
complete access to all of the AWS resources in your account.

We strongly recommend that you not use the root user for everyday tasks, even the administrative
ones. Instead, adhere to the Security best practices in IAM, and create an administrative user. Then
securely lock away the root user credentials and use them to perform only a few account and service
management tasks.

To create an administrative user

1. Create an administrative user in your AWS account. For instructions, see Create an administrative
user in the IAM User Guide.
Note
We assume that you use administrator user credentials for the exercises and procedures
in this guide. If you choose to create and use another user, grant that user minimum
permissions. For more information, see Authenticating with Identities (p. 3049).
2. Ensure that your administrator user has the AmazonSageMakerFullAccess policy, as well as a policy
with the following content needed to create a SageMaker domain. For more information about
creating IAM policies, see Creating IAM policies.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:*"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:app/*",
                "arn:aws:sagemaker:*:*:flow-definition/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "servicecatalog:*"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

AWS CLI Prerequisites


The following prerequisites are required to onboard to a Domain using the AWS CLI.

• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.

Onboard to Amazon SageMaker Domain


An Amazon SageMaker Domain consists of an associated Amazon Elastic File System (Amazon EFS)
volume; a list of authorized users; and a variety of security, application, policy, and Amazon Virtual
Private Cloud (Amazon VPC) configurations. To use Amazon SageMaker Studio, Amazon SageMaker
Studio notebooks, and RStudio, you must complete the Amazon SageMaker Domain onboarding process
using the SageMaker console or the AWS CLI. For more information about Amazon SageMaker Domains,
see Amazon SageMaker Domain (p. 105).

When onboarding, you can choose either AWS IAM Identity Center (successor to AWS Single Sign-
On) (IAM Identity Center) or AWS Identity and Access Management (IAM) as your authentication
method. When you use IAM authentication, you can choose either the Quick setup or the Standard setup
procedure. RStudio setup is only available when using the Standard setup procedure.
Note
If you onboard using IAM authentication and want to switch to authentication using IAM
Identity Center later, you must delete the Domain that you created. Then, you need to manually
re-import all notebooks and other user data that you created. For more information, see Delete
an Amazon SageMaker Domain (p. 116).

The simplest way to create an Amazon SageMaker Domain is to follow the Quick setup procedure from
the SageMaker console. Quick setup uses the same default settings as the Standard setup procedures.
These settings include shareable notebooks and public internet access. For more control, including
the option of using authentication using IAM Identity Center and RStudio, use the Standard setup
procedures.

Authentication using IAM Identity Center

To use authentication using IAM Identity Center with Studio and RStudio, your account must belong to
an organization in AWS Organizations.
Note
The AWS Organizations account must be in the same AWS Region as Studio and RStudio.

Authentication using IAM Identity Center provides the following benefits over IAM authentication:

• Members given access to Studio have a unique sign-in URL that directly opens Studio, and they sign in
with their IAM Identity Center credentials. When you use IAM authentication, you must sign in through
the SageMaker console.


• Organizations manage their members in IAM Identity Center instead of the Domain. You can assign
multiple members access to the Domain at the same time. When you use IAM authentication, you must
add and manage members manually, one at a time, using the Domain Control Panel.

Topics
• Onboard to Amazon SageMaker Domain Using Quick setup (p. 38)
• Onboard to Amazon SageMaker Domain Using IAM Identity Center (p. 39)
• Onboard to Amazon SageMaker Domain Using IAM (p. 43)
• Choose an Amazon VPC (p. 46)

Onboard to Amazon SageMaker Domain Using Quick setup

This topic describes how to onboard to Amazon SageMaker Domain using the Quick setup procedure
from the SageMaker console, which uses AWS Identity and Access Management (IAM) authentication. For
information on how to onboard using the standard IAM procedure, see Onboard Using IAM (p. 43).

RStudio support is not currently available when onboarding using the Quick setup procedure.

For information on how to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On)
(IAM Identity Center), see Onboard Using IAM Identity Center (p. 39).

To onboard to the Domain using Quick setup

1. Open the SageMaker console.


2. Choose Domains at the left of the page.
3. From the Domains page, choose Create domain.
4. On the Setup SageMaker Domain page, choose Quick setup.
5. For Domain Name, enter a unique name for your Domain.
6. Under User profile, for Name keep the default name or create a new name. The name can be up to
63 characters. Valid characters: A-Z, a-z, 0-9, and - (hyphen).
7. For Execution role, choose an option from the role selector. This is the default role that is assigned
to user profiles in the Amazon SageMaker Domain.

If you choose Enter a custom IAM role ARN, the role must have at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).

If you choose Create a new role, the Create an IAM role dialog opens:

• For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
• Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.

If you choose Create role using the role creation wizard, the Amazon SageMaker Role Manager
page opens. For more information about using SageMaker Role Manager, see Amazon SageMaker
Role Manager (p. 3108).
8. Turn on Enable SageMaker Canvas permissions (by default this option is turned on).
9. Choose Submit.


10. From the pop-up window, select an Amazon Virtual Private Cloud (Amazon VPC) and subnet to use.
11. Choose Save and continue.
Note
If you receive an error message that you need to create an Amazon VPC, see Choose an
Amazon VPC (p. 46).

When Status is Ready, the user name that you specified is enabled.

Now that you've onboarded to the Domain, you can launch an app following the steps in Launch Amazon
SageMaker Studio (p. 133). For information about adding users to your Domain, see Add and Remove
User Profiles (p. 119).

For information about using SageMaker Studio, see SageMaker Studio (p. 128).

Onboard to Amazon SageMaker Domain Using IAM Identity Center

This topic describes how to onboard to Amazon SageMaker Domain using authentication using IAM
Identity Center from the SageMaker console or the AWS CLI. For information about setting up IAM
Identity Center for use with a Domain, see Set Up IAM Identity Center for use with Amazon SageMaker
Domain (p. 42). For information on how to onboard using AWS Identity and Access Management (IAM)
authentication, see Onboard Using Quick setup (p. 38) or Onboard Using IAM (p. 43).

Onboard from the console


To onboard to a Domain using IAM Identity Center

1. Open the SageMaker console.


2. Choose Domains on the left of the page.
3. From the Domains page, choose Create domain.
4. On the Setup SageMaker Domain page, choose Standard setup.
5. Select Configure.

Step 1: General settings

1. For Domain Name, enter a unique name for your Domain.


2. For Authentication, choose AWS IAM Identity Center (successor to AWS Single Sign-On).
3. If you don't have IAM Identity Center created in the same Region as your SageMaker Domain,
you must create IAM Identity Center in the same Region as your SageMaker Domain before
proceeding. To continue to onboard without IAM Identity Center, choose the AWS Identity and
Access Management (IAM) authentication method or the Quick setup procedure, which also uses
IAM.

For information about setting up IAM Identity Center for use with Domain, see Set Up IAM Identity
Center for use with Amazon SageMaker Domain (p. 42).
4. Under Permission, for Default execution role, choose an option from the role selector.

If you choose Enter a custom IAM role ARN, the role must have at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).

If you choose Create a new role, the Create an IAM role dialog opens:


a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.
5. Under Network and storage, specify the following:

• Your Amazon Virtual Private Cloud (Amazon VPC) information – For more information, see Choose
an Amazon VPC (p. 46).
• (Optional) Encryption key – SageMaker uses an AWS KMS key to encrypt your Amazon Elastic File
System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) file systems. By default, it
uses an AWS managed key. To use a customer managed key, enter its key ID or Amazon Resource
Name (ARN). For more information, see Protect Data at Rest Using Encryption (p. 3043).
Note
Encryption in transit is only available for Amazon SageMaker Studio.
6. Select Next.

Step 2: Studio settings

1. Under Default JupyterLab version, select a JupyterLab version from the dropdown to use as
the default for your Domain. For information on selecting a JupyterLab version, see JupyterLab
Versioning (p. 135).
2. Under Notebook Sharing Configuration, accept the default notebook sharing configuration or
customize the options.
3. Under SageMaker Projects and JumpStart, accept the default Project and JumpStart settings,
or customize whether administrators and users can create projects and use JumpStart. For more
information, see SageMaker Studio Permissions Required to Use Projects (p. 2806).
4. Select Next.

Step 3: RStudio settings

1. Under RStudio Workbench, verify that your RStudio license is automatically detected. For more
information about getting an RStudio license and activating it with SageMaker, see RStudio
license (p. 435).
2. Select an instance type to launch your RStudio Server on. For more information, see
RStudioServerPro instance type (p. 437).
3. Under Permission, create your role or select an existing role. The role must have the following
permissions policy. This policy allows the RStudioServerPro app to access necessary resources and
allows Amazon SageMaker to automatically launch an RStudioServerPro app when the existing
RStudioServerPro app is in a Deleted or Failed status. For information about adding permissions
to a role, see Modifying a role permissions policy (console).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "license-manager:ExtendLicenseConsumption",
                "license-manager:ListReceivedLicenses",
                "license-manager:GetLicense",
                "license-manager:CheckoutLicense",
                "license-manager:CheckInLicense",
                "logs:CreateLogDelivery",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DeleteLogDelivery",
                "logs:Describe*",
                "logs:GetLogDelivery",
                "logs:GetLogEvents",
                "logs:ListLogDeliveries",
                "logs:PutLogEvents",
                "logs:PutResourcePolicy",
                "logs:UpdateLogDelivery",
                "sagemaker:CreateApp"
            ],
            "Resource": "*"
        }
    ]
}

4. Under RStudio Connect, add the URL for your RStudio Connect server. RStudio Connect is a
publishing platform for Shiny applications, R Markdown reports, dashboards, plots, and more.
When you onboard to RStudio on SageMaker, an RStudio Connect server is not created. For more
information, see RStudio Connect URL (p. 438).
5. Under RStudio Package Manager, add the URL for your RStudio Package Manager. SageMaker
creates a default package repository for the Package Manager when you onboard RStudio. For more
information about RStudio Package Manager, see RStudio Package Manager (p. 438).
6. Select Next.

Step 4: SageMaker Canvas settings

1. For the Canvas base permissions configuration, leave the Enable Canvas base permissions option
turned on (it is turned on by default). This establishes the minimum required permissions to use the
SageMaker Canvas app.
2. (Optional) For the Time series forecasting configuration, leave the Enable time series forecasting
option turned on to give your users permissions to do time series forecasting in SageMaker Canvas
(it is turned on by default).
3. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role. However, if you already have an IAM role with the required Amazon Forecast
permissions attached, select Use an existing execution role. For more information, see the IAM role
setup method (p. 278).
4. Use the default IAM role suffix or provide a custom suffix for the role.
5. For Local file upload configuration, select Enable local file upload to enable users to upload local
files into their SageMaker Canvas application (it's already checked by default).
6. Choose Submit.

Onboard from the AWS CLI


Use the following commands to onboard to a Domain using authentication using IAM Identity Center
from the AWS CLI.

1. Create an execution role that is used to create a Domain and attach the
AmazonSageMakerFullAccess policy. You can also use an existing role that has, at a minimum, an
attached trust policy that grants SageMaker permission to assume the role. For more information,
see SageMaker Roles (p. 3086).

aws iam create-role --role-name execution-role-name


aws iam attach-role-policy --role-name execution-role-name \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess


2. Get the default Amazon Virtual Private Cloud (Amazon VPC) of your account.

aws --region region ec2 describe-vpcs --filters Name=isDefault,Values=true \
    --query "Vpcs[0].VpcId" --output text

3. Get the list of subnets in the default Amazon VPC.

aws --region region ec2 describe-subnets --filters Name=vpc-id,Values=default-vpc-id \
    --query "Subnets[*].SubnetId" --output json

4. Create a Domain by passing the default Amazon VPC ID, subnets, and execution role ARN. You must
also pass a SageMaker image ARN. For information on the available JupyterLab version ARNs, see
Setting a default JupyterLab version (p. 137).

aws --region region sagemaker create-domain --domain-name domain-name \
    --vpc-id default-vpc-id --subnet-ids subnet-ids --auth-mode SSO \
    --default-user-settings "ExecutionRole=arn:aws:iam::account-number:role/execution-role-name,JupyterServerAppSettings={DefaultResourceSpec={InstanceType=system,SageMakerImageArn=image-arn}}" \
    --query DomainArn --output text

5. Verify that the Domain has been created.

aws --region region sagemaker list-domains

To access the Domain after onboarding

After you are given access to the Domain, you are sent an email inviting you to create a password and
use IAM Identity Center. The email also contains the URL to sign in to the Domain. For more information
about signing in and session duration, see How to sign in to the user portal.

After you activate your account, go to the Domain URL, sign in, and wait for your user profile to be
created. On subsequent visits, you only need to wait for the Studio or RStudio app to load.

Bookmark the URL. The URL is also available on the Domain settings page.

For information about using Studio, see SageMaker Studio (p. 128).

For information about using RStudio, see RStudio on Amazon SageMaker (p. 432).

Set Up IAM Identity Center for use with Amazon SageMaker Domain

To use authentication in IAM Identity Center, you must belong to an organization in AWS
Organizations. If you don't belong to one, you can create an organization by following the steps in
Tutorial: Creating and configuring an organization.

After you have created your organization and user, you can create a SageMaker user profile for that user
in IAM Identity Center as follows.

1. From the Amazon SageMaker console – You can use the Amazon SageMaker console to create a
user profile for the user in IAM Identity Center. If the user in IAM Identity Center hasn't already been
associated with the Domain, it is automatically associated.
2. Using the AWS CLI or AWS CloudFormation – A user in IAM Identity Center assigned to the Domain
can create a user profile using the AWS CLI or AWS CloudFormation.
• The user in IAM Identity Center, or a group in IAM Identity Center containing that user, must first
be assigned to the Domain from the IAM Identity Center console. For more information about
application assignment, see Assign user access.


• A user profile can then be created for the user in IAM Identity Center with the AWS CLI or AWS
CloudFormation.

Note
To simplify administration of access permissions, we recommend assigning groups in IAM
Identity Center to the Domain instead of assigning users in IAM Identity Center. Groups allow
permissions to be granted or denied to multiple users at once. A user can be moved out of a
group or to a different group if needed. When assigning user access to applications, IAM Identity
Center does not currently support users being added to nested groups. If a user is added to a
nested group, they may receive a "You do not have any applications" error message during sign-in.
Assignments must be made to the immediate group the user is a member of.

Return to the Domains page to continue to onboard using authentication using IAM Identity Center.

Onboard to Amazon SageMaker Domain Using IAM


This topic describes how to onboard to Amazon SageMaker Domain using the standard setup procedure
for AWS Identity and Access Management (IAM) authentication from the SageMaker console or the AWS
CLI. To onboard faster using IAM, see Onboard Using Quick setup (p. 38).

For information on how to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On)
(IAM Identity Center), see Onboard Using IAM Identity Center (p. 39).

Onboard Using console


To onboard to a Domain using IAM

1. Open the SageMaker console.


2. Choose Domains at the top left of the page.
3. From the Domains page, choose Create domain.
4. On the Setup SageMaker Domain page, choose Standard setup.
5. Select Configure.

Step 1: General settings

1. For Domain Name, enter a unique name for your Domain.


2. For Authentication, choose AWS Identity and Access Management (IAM).
3. Under Permission, for Default execution role, choose an option from the role selector.

If you choose Enter a custom IAM role ARN, the role must have at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).

If you choose Create a new role, the Create an IAM role dialog opens:

a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.
4. For Space default execution role, choose an option from the role selector.

If you choose Enter a custom IAM role ARN, the role must have at a minimum, an attached trust
policy that grants SageMaker permission to assume the role. For more information, see SageMaker
Roles (p. 3086).


If you choose Create a new role, the Create an IAM role dialog opens:

a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role
with the AmazonSageMakerFullAccess policy attached.
5. Under Network and storage, specify the following:

• Your Amazon Virtual Private Cloud (Amazon VPC) information – For more information, see Choose
an Amazon VPC (p. 46).
• (Optional) Encryption key – SageMaker uses an AWS KMS key to encrypt your Amazon Elastic File
System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) file systems. By default, it
uses an AWS managed key. To use a customer managed key, enter its key ID or Amazon Resource
Name (ARN). For more information, see Protect Data at Rest Using Encryption (p. 3043).
Note
Encryption in transit is only available for Amazon SageMaker Studio.
6. Select Next.

Step 2: Studio settings

1. Under Default JupyterLab version, select a JupyterLab version from the dropdown to use as
the default for your Domain. For information on selecting a JupyterLab version, see JupyterLab
Versioning (p. 135).
2. Under Notebook Sharing Configuration, accept the default notebook sharing configuration or
customize the options.
3. Under SageMaker Projects and JumpStart, accept the default Project and JumpStart settings
or customize whether administrators and users can create projects and use JumpStart. For more
information, see SageMaker Studio Permissions Required to Use Projects (p. 2806).
4. Select Next.

Step 3: RStudio settings

1. Under RStudio Workbench, verify that your RStudio license is automatically detected. For more
information about getting an RStudio license and activating it with SageMaker, see RStudio
license (p. 435).
2. Select an instance type to launch your RStudio Server on. For more information, see
RStudioServerPro instance type (p. 437).
3. Under Permission, create your role or select an existing role. The role must have the following
permissions policy. This policy allows the RStudioServerPro app to access necessary resources and
allows Amazon SageMaker to automatically launch an RStudioServerPro app when the existing
RStudioServerPro app is in a Deleted or Failed status. For information on adding permissions to a
role, see Modifying a role permissions policy (console).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "license-manager:ExtendLicenseConsumption",
                "license-manager:ListReceivedLicenses",
                "license-manager:GetLicense",
                "license-manager:CheckoutLicense",
                "license-manager:CheckInLicense",
                "logs:CreateLogDelivery",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DeleteLogDelivery",
                "logs:Describe*",
                "logs:GetLogDelivery",
                "logs:GetLogEvents",
                "logs:ListLogDeliveries",
                "logs:PutLogEvents",
                "logs:PutResourcePolicy",
                "logs:UpdateLogDelivery",
                "sagemaker:CreateApp"
            ],
            "Resource": "*"
        }
    ]
}

4. Under RStudio Connect, add the URL for your RStudio Connect Server. RStudio Connect is a
publishing platform for Shiny applications, R Markdown reports, dashboards, plots, and more. When
you onboard to RStudio on Amazon SageMaker, an RStudio Connect server is not created. You must
create an RStudio Connect server on an EC2 instance to use Connect with Amazon SageMaker. For
more information, see RStudio Connect URL (p. 438).
5. Under RStudio Package Manager, add the URL for your RStudio Package Manager. SageMaker
creates a default package repository for the Package Manager when you onboard RStudio. For more
information about RStudio Package Manager, see RStudio Package Manager (p. 438).
6. Select Next.

Step 4: SageMaker Canvas settings

1. For the Canvas base permissions configuration, leave the Enable Canvas base permissions option
turned on (it is turned on by default). This establishes the minimum required permissions to use the
SageMaker Canvas app.
2. (Optional) For the Time series forecasting configuration, leave the Enable time series forecasting
option turned on to give your users permissions to do time series forecasting in SageMaker Canvas
(it is turned on by default).
3. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role, or select Use an existing execution role if you already have an IAM role with the
required Amazon Forecast permissions attached (for more information, see the IAM role setup
method (p. 278)).
4. Use the default IAM role suffix or provide a custom suffix for the role.
5. For Local file upload configuration, select Enable local file upload to enable users to upload local
files into their SageMaker Canvas application (it's already checked by default).
6. Choose Submit.

Onboard Using the AWS CLI


Use the following commands to onboard to a Domain using authentication using IAM from the AWS CLI.

1. Create an execution role that is used to create a Domain and attach the
AmazonSageMakerFullAccess policy. You can also use an existing role that has, at a minimum, an
attached trust policy that grants SageMaker permission to assume the role. For more information,
see SageMaker Roles (p. 3086).

aws iam create-role --role-name execution-role-name

aws iam attach-role-policy --role-name execution-role-name \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

2. Get the default Amazon Virtual Private Cloud (Amazon VPC) of your account.

aws --region region ec2 describe-vpcs --filters Name=isDefault,Values=true \
    --query "Vpcs[0].VpcId" --output text

3. Get the list of subnets in the default Amazon VPC.

aws --region region ec2 describe-subnets --filters Name=vpc-id,Values=default-vpc-id \
    --query "Subnets[*].SubnetId" --output json

4. Create a Domain by passing the default Amazon VPC ID, subnets, and execution role ARN. You must
also pass a SageMaker image ARN. For information on the available JupyterLab version ARNs, see
Setting a default JupyterLab version (p. 137).

aws --region region sagemaker create-domain --domain-name domain-name \
    --vpc-id default-vpc-id --subnet-ids subnet-ids --auth-mode IAM \
    --default-user-settings "ExecutionRole=arn:aws:iam::account-number:role/execution-role-name,JupyterServerAppSettings={DefaultResourceSpec={InstanceType=system,SageMakerImageArn=image-arn}}" \
    --query DomainArn --output text

5. Verify that the Domain has been created.

aws --region region sagemaker list-domains

For information about using Amazon SageMaker Studio, see SageMaker Studio (p. 128).

For information about using RStudio, see RStudio on Amazon SageMaker (p. 432).

Choose an Amazon VPC


This topic provides detailed information about choosing an Amazon Virtual Private Cloud (Amazon
VPC) when you onboard to Amazon SageMaker Domain. For more information about onboarding to
SageMaker Domain, see Onboard to Amazon SageMaker Domain (p. 37).

By default, SageMaker Domain uses two Amazon VPCs. One Amazon VPC is managed by Amazon
SageMaker and provides direct internet access. You specify the other Amazon VPC, which provides
encrypted traffic between the Domain and your Amazon Elastic File System (Amazon EFS) volume.

You can change this behavior so that SageMaker sends all traffic over your specified Amazon VPC. When
you choose this option, you must provide the subnets, security groups, and interface endpoints that are
necessary to communicate with the SageMaker API and SageMaker runtime, and various AWS services,
such as Amazon Simple Storage Service (Amazon S3) and Amazon CloudWatch, that are used by Amazon
SageMaker Studio and your Studio notebooks.

When you onboard to SageMaker Domain, you tell SageMaker to send all traffic over your Amazon VPC
by setting the network access type to VPC only.
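
For example, the following minimal sketch creates a Domain in VPC only mode with the AWS SDK for
Python (Boto3). The Region, Domain name, resource IDs, and role ARN are placeholders for your own
resources.

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-east-1")  # example Region

response = sagemaker_client.create_domain(
    DomainName="example-domain",                               # placeholder
    AuthMode="IAM",
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/execution-role-name",  # placeholder
        "SecurityGroups": ["sg-0123456789abcdef0"],            # required in VPC only mode
    },
    SubnetIds=["subnet-0123456789abcdef0"],                    # subnets without direct internet access
    VpcId="vpc-0123456789abcdef0",
    AppNetworkAccessType="VpcOnly",                            # send all traffic over your VPC
)
print(response["DomainArn"])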

To specify the Amazon VPC information


When you specify the Amazon VPC entities (that is, the Amazon VPC, subnet, or security group) in the
following procedure, one of three options is presented based on the number of entities you have in the
current AWS Region. The behavior is as follows:

• One entity – SageMaker uses that entity. This can't be changed.


• Multiple entities – You must choose the entities from the dropdown list.


• No entities – You must create one or more entities in order to use Domain. Choose Create <entity> to
open the VPC console in a new browser tab. After you create the entities, return to the Domain Get
started page to continue the onboarding process.

This procedure is part of the Amazon SageMaker Domain onboarding process when you choose Standard
setup. Your Amazon VPC information is specified under the Network section.

1. Choose the Amazon VPC.


2. Choose one or more subnets. If you don't choose any subnets, SageMaker uses all the subnets in
the Amazon VPC. We recommend that you use multiple subnets that are not created in constrained
Availability Zones. Using subnets in these constrained Availability Zones can result in insufficient
capacity errors and longer application creation times. For more information about constrained
Availability Zones, see Availability Zones.
3. Select the network access type.
Note
If VPC only is selected, SageMaker automatically applies the security group settings defined
for the Domain to all shared spaces created in the Domain. If Public internet only is
selected, SageMaker does not apply the security group settings to shared spaces created in
the Domain.

• Public internet only – Non-Amazon EFS traffic goes through a SageMaker managed Amazon VPC,
which allows internet access. Traffic between the Domain and your Amazon EFS volume is through
the specified Amazon VPC.
• VPC only – All SageMaker traffic is through the specified Amazon VPC and subnets. You must use
a subnet that does not have direct internet access in VPC only mode. Internet access is disabled by
default.
4. Choose the security groups. If you chose Public internet only, this step is optional. If you chose VPC
only, this step is required.
Note
For the maximum number of allowed security groups, see UserSettings.

For Amazon VPC requirements in VPC only mode, see Connect SageMaker Studio Notebooks in a VPC to
External Resources (p. 3209).

SageMaker JumpStart
SageMaker JumpStart provides pretrained, open-source models for a wide range of problem types to
help you get started with machine learning. You can incrementally train and tune these models before
deployment. JumpStart also provides solution templates that set up infrastructure for common use
cases, and executable example notebooks for machine learning with SageMaker.

You can access the pretrained models, solution templates, and examples through the JumpStart landing
page in Amazon SageMaker Studio. The following steps show how to access JumpStart models and
solutions using Amazon SageMaker Studio.

You can also access JumpStart models using the SageMaker Python SDK. For information about how
to use JumpStart models programmatically, see Use SageMaker JumpStart Algorithms with Pretrained
Models.
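
For example, recent versions of the SageMaker Python SDK provide a JumpStartModel class; the
following minimal sketch, assuming such a version is installed, deploys a JumpStart model by its
model ID. The model ID shown is an example; replace it with the ID of the model you want.

from sagemaker.jumpstart.model import JumpStartModel

# Example JumpStart model ID; browse JumpStart in Studio for the full list of available IDs.
model = JumpStartModel(model_id="huggingface-text2text-flan-t5-base")
predictor = model.deploy()
# Invoke the endpoint with a payload in the format the chosen model expects.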

Open and use JumpStart


The following sections give information on how to open, use, and manage JumpStart from the Amazon
SageMaker Studio UI.


Open JumpStart
In Amazon SageMaker Studio, open the JumpStart landing page either through the Home page or the
Home menu on the left-side panel.

• From the Home page you can either:


• Choose Quick start solutions in the Prebuilt and automated solutions pane. This opens the
SageMaker JumpStart landing page.
• Choose a model directly in the Quick start solutions pane, or choose Browse all Quick start
solutions. This opens the SageMaker JumpStart landing page.
• From the Home menu in the left panel you can either:
• Navigate to the Quick start solutions node, then choose Solutions, models, example notebooks.
This opens the SageMaker JumpStart landing page.
• Navigate to the Quick start solutions node, then choose Launched Quick start assets.

The Launched Quick start assets page lists your currently launched solutions, deployed model
endpoints, and training jobs created with Quick start. You can access the JumpStart landing page
from this tab by choosing the Browse Quick start solutions button at the top right of the tab.

The JumpStart landing page lists available end-to-end machine learning solutions, pretrained models,
and example notebooks. From any individual solution or model page, you can choose the Browse
JumpStart button at the top right of the tab to return to the SageMaker JumpStart page.


Important
Before downloading or using third-party content: You are responsible for reviewing and
complying with any applicable license terms and making sure that they are acceptable for your
use case.

Use JumpStart
From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and
other resources.

You can find JumpStart resources by using the search bar, or by browsing each category. Use the tabs to
filter the available solutions by categories:

• Solutions – In one step, launch comprehensive machine learning solutions that tie SageMaker to other
AWS services. Select Explore All Solutions to view all available solutions.
• ML tasks – Find a model by problem type (e.g., Image Classification, Image Embedding, Object
Detection, Text Generation). Select Explore All Models to view all available models.
• Data types – Find a model by data type (e.g., Vision, Text, Tabular, Audio). Select Explore All Models to
view all available models.
• Notebooks – Find example notebooks that use SageMaker features across multiple model types and
use cases. Select Explore All Notebooks to view all available example notebooks.
• Frameworks – Find a model by framework (e.g., PyTorch, TensorFlow, Hugging Face).
• Resources – Use example notebooks, blogs, and video tutorials to learn about and get a head start on
your problem types.
• Blogs – Read details and solutions from machine learning experts.
• Video tutorials – Watch video tutorials for SageMaker features and machine learning use cases from
machine learning experts.
• Example notebooks – Run example notebooks that use SageMaker features like Spot Instance
training and experiments over a large variety of model types and use cases.


Manage JumpStart
From the Home menu in the left panel, navigate to Quick start solutions, then choose Launched Quick
start assets to list your currently launched solutions, deployed model endpoints, and training jobs
created with Quick start.

Topics
• Solution Templates (p. 50)
• JumpStart Foundation Models (p. 58)
• Task-Specific Models (p. 66)
• Shared Models and Notebooks (p. 79)
• Amazon SageMaker JumpStart Industry: Financial (p. 83)

Solution Templates
SageMaker JumpStart provides one-click, end-to-end solutions for many common machine learning use
cases. Explore the following use cases for more information on available solution templates.

• Demand forecasting (p. 51)


• Credit rating prediction (p. 51)
• Fraud detection (p. 52)
• Computer vision (p. 52)
• Extract and analyze data from documents (p. 53)
• Predictive maintenance (p. 53)
• Churn prediction (p. 54)
• Personalized recommendations (p. 54)
• Reinforcement learning (p. 55)
• Healthcare and life sciences (p. 55)
• Financial pricing (p. 55)
• Causal inference (p. 56)


Choose the solution template that best fits your use case from the JumpStart landing page. When you
choose a solution template, JumpStart opens a new tab showing a description of the solution and a
Launch button. When you select Launch, JumpStart creates all of the resources that you need to run the
solution, including training and model hosting instances. For more information on launching a JumpStart
solution, see the section called “Launch a Solution” (p. 56).

After launching the solution, you can explore solution features and any generated artifacts in JumpStart.
Use the Launched Quick start assets menu to find your solution. In your solution's tab, select Open
Notebook to use provided notebooks and explore the solution’s features. When artifacts are generated
during launch or after running the provided notebooks, they're listed in the Generated Artifacts table.

You can delete individual artifacts with the trash icon. You can delete all of the solution's
resources by choosing Delete solution resources.

Demand forecasting
Demand forecasting uses historical time series data in order to make future estimations in relation to
customer demand over a specific period and streamline the supply-demand decision-making process
across businesses.

Demand forecasting use cases include predicting ticket sales in the transportation industry, stock prices,
number of hospital visits, number of customer representatives to hire for multiple locations in the next
month, product sales across multiple regions in the next quarter, cloud server usage for the next day for
a video streaming service, electricity consumption for multiple regions over the next week, usage of IoT
devices and sensors such as energy consumption, and more.

Time series data is categorized as univariate and multi-variate. For example, the total electricity
consumption for a single household is a univariate time series over a period of time. When multiple
univariate time series are stacked on each other, it’s called a multi-variate time series. For example, the
total electricity consumption of 10 different (but correlated) households in a single neighborhood make
up a multi-variate time series dataset.

• Demand forecasting – Demand forecasting for multivariate time series data using three state-of-the-
art time series forecasting algorithms: LSTNet, Prophet, and SageMaker DeepAR. Get started: GitHub »

Credit rating prediction


Use JumpStart's credit rating prediction solutions to predict corporate credit ratings or to explain credit
prediction decisions made by machine learning models. Compared to traditional credit rating modeling
methods, machine learning models can automate and improve the accuracy of credit prediction.

• Corporate credit rating prediction – Multimodal (long text and tabular) machine learning for quality
credit predictions using AWS AutoGluon Tabular. Get started: GitHub »

• Graph-based credit scoring – Predict corporate credit ratings using tabular data and a corporate
network by training a Graph Neural Network GraphSAGE and AWS AutoGluon Tabular model. Get
started: Find in Amazon SageMaker Studio.

• Explain credit decisions – Predict credit default in credit applications and provide explanations using
LightGBM and SHAP (SHapley Additive exPlanations). Get started: GitHub »

Fraud detection
Many businesses lose billions annually to fraud. Machine learning-based fraud detection models can help
systematically identify likely fraudulent activities from a tremendous amount of data. The following
solutions use transaction and user identity datasets to identify fraudulent transactions.

• Detect malicious users and transactions – Automatically detect potentially fraudulent activity in transactions using SageMaker XGBoost with the over-sampling technique Synthetic Minority Over-sampling (SMOTE). Get started: GitHub »
• Fraud detection in financial transactions using deep graph library – Detect fraud in financial transactions by training a graph convolutional network with the deep graph library and a SageMaker XGBoost model. Get started: GitHub »
• Financial payment classification – Classify financial payments based on transaction information using SageMaker XGBoost. Use this solution template as an intermediate step in fraud detection, personalization, or anomaly detection. Find in Amazon SageMaker Studio.

Computer vision
With the rise of business use cases such as autonomous vehicles, smart video surveillance, healthcare
monitoring and various object counting tasks, fast and accurate object detection systems are rising
in demand. These systems involve not only recognizing and classifying every object in an image, but also
localizing each one by drawing the appropriate bounding box around it. In the last decade, the rapid
advances of deep learning techniques greatly accelerated the momentum of object detection.


• Visual product defect detection – Identify defective regions in product images either by training an object detection model from scratch or fine-tuning pretrained SageMaker models. Get started: GitHub »
• Handwriting recognition – Recognize handwritten text in images by training an object detection model and handwriting recognition model. Label your own data using SageMaker Ground Truth. Get started: GitHub »
• Object detection for bird species – Identify bird species in a scene using a SageMaker object detection model. Find in Amazon SageMaker Studio.

Extract and analyze data from documents


JumpStart provides solutions for you to uncover valuable insights and connections in business-critical
documents. Use cases include text classification, document summarization, handwriting recognition,
relationship extraction, question answering, and filling in missing values in tabular records.

• Privacy for sentiment classification – Anonymize text to better preserve user privacy in sentiment classification. Get started: GitHub »
• Document understanding – Document summarization, entity, and relationship extraction using the transformers library in PyTorch. Get started: GitHub »
• Handwriting recognition – Recognize handwritten text in images by training an object detection model and handwriting recognition model. Label your own data using SageMaker Ground Truth. Get started: GitHub »
• Filling in missing values in tabular records – Fill missing values in tabular records by training a SageMaker AutoPilot model. Get started: GitHub »

Predictive maintenance
Predictive maintenance aims to optimize the balance between corrective and preventative maintenance
by facilitating the timely replacement of components. The following solutions use sensor data from
industrial assets to predict machine failures, unplanned downtime, and repair costs.


• Predictive maintenance for vehicle fleets – Predict vehicle fleet failures using vehicle sensor and maintenance information with a convolutional neural network model. Get started: GitHub »
• Predictive maintenance for manufacturing – Predict the remaining useful life for each sensor by training a stacked Bidirectional LSTM neural network model using historical sensor readings. Get started: GitHub »

Churn prediction
Customer churn, or rate of attrition, is a costly problem faced by a wide range of companies. To reduce
churn, companies can identify customers who are likely to leave their service and focus their efforts
on customer retention. Use a JumpStart churn prediction solution to analyze data sources
such as user behavior and customer support chat logs to identify customers that are at a high risk of
cancelling a subscription or service.

• Churn prediction with text – Predict churn using numerical, categorical, and textual features with a BERT encoder and RandomForestClassifier. Get started: GitHub »
• Churn prediction for mobile phone customers – Identify unhappy mobile phone customers using SageMaker XGBoost. Find in Amazon SageMaker Studio.

Personalized recommendations
You can use JumpStart solutions to analyze customer identity graphs or user sessions to
better understand and predict customer behavior. Use the following solutions for personalized
recommendations to model customer identity across multiple devices, to determine the likelihood of
a customer making a purchase, or to create a custom movie recommender based on past customer
behavior.

• Entity resolution in identity graphs with deep graph library – Perform cross-device entity linking for online advertising by training a graph convolutional network with deep graph library. Get started: GitHub »
• Purchase modeling – Predict whether a customer will make a purchase by training a SageMaker XGBoost model. Get started: GitHub »
• Customized recommender system – Train and deploy a custom recommender system that generates movie suggestions for a customer based on past behavior using Neural Collaborative Filtering in SageMaker. Find in Amazon SageMaker Studio.

Reinforcement learning
Reinforcement learning (RL) is a type of learning that is based on interaction with the environment. This
type of learning is used by an agent that must learn behavior through trial-and-error interactions with
a dynamic environment in which the goal is to maximize the long-term rewards that the agent receives
as a result of its actions. Rewards are maximized by trading off exploring actions that have uncertain
rewards with exploiting actions that have known rewards.

RL is well-suited for solving large, complex problems, such as supply chain management, HVAC systems,
industrial robotics, game artificial intelligence, dialog systems, and autonomous vehicles.

• Reinforcement learning for Battlesnake AI competitions – Provide a reinforcement learning workflow for training and inference with the BattleSnake AI competitions. Get started: GitHub »
• Distributed reinforcement learning for Procgen challenge – Distributed reinforcement learning starter kit for the NeurIPS 2020 Procgen Reinforcement Learning challenge. Get started: GitHub »

Healthcare and life sciences


Clinicians and researchers can use JumpStart solutions to analyze medical imagery, genomic information,
and clinical health records.

• Lung cancer survival prediction – Predict non-small cell lung cancer patient survival status with 3-dimensional lung computerized tomography (CT) scans, genomic data, and clinical health records using SageMaker XGBoost. Get started: GitHub »

Financial pricing
Many businesses dynamically adjust pricing on a regular basis in order to maximize their returns. Use
the following JumpStart solutions for price optimization, dynamic pricing, option pricing, or portfolio
optimization use cases.


• Price optimization – Estimate price elasticity using Double Machine Learning (ML) for causal inference and the Prophet forecasting procedure. Use these estimates to optimize daily prices. Find in Amazon SageMaker Studio.

Causal inference
Researchers can use machine learning models such as Bayesian networks to represent causal
dependencies and draw causal conclusions based on data. Use the following JumpStart solution to
understand the causal relationship between Nitrogen-based fertilizer application and corn crop yields.

• Crop yield counterfactuals – Generate a counterfactual analysis of corn response to nitrogen. This solution learns the crop phenology cycle in its entirety using multi-spectral satellite imagery and ground-level observations. Find in Amazon SageMaker Studio.

Launch a Solution
First, choose a solution through the SageMaker JumpStart landing page in the Amazon SageMaker
Studio UI. For information on the onboarding steps to sign in to Amazon SageMaker Studio, see Onboard
to Amazon SageMaker Domain. For details on getting to the SageMaker JumpStart landing page, see
Open and use JumpStart (p. 47).

After you choose a solution, a solution's tab opens showing a description of the solution and a Launch
button. To launch a solution, select Launch in the Launch Solution section. JumpStart then creates all
of the resources needed to run the solution. This includes training and model hosting instances.

Advanced parameters
The solution that you choose may have advanced parameters that you can select. Choose Advanced
Parameters to specify the AWS Identity and Access Management role for the solution.

Solutions can launch resources across nine AWS services that interact with each other. For the
solution to work as expected, newly created components from one service must be able to act on newly
created components from another service. We recommend that you use the default IAM role to ensure
that all needed permissions are added. For more information about IAM roles, see Identity and Access
Management for Amazon SageMaker (p. 3048).

Default IAM role

If you select this option, the default IAM roles that are required by this solution are used. Each solution
requires different resources. The following list describes the default roles that are used for the solutions
based on the service needed. For a description of the permissions required for each service, see AWS
Managed Policies for SageMaker projects and JumpStart (p. 3172).


• API Gateway – AmazonSageMakerServiceCatalogProductsApiGatewayRole
• CloudFormation – AmazonSageMakerServiceCatalogProductsCloudformationRole
• CodeBuild – AmazonSageMakerServiceCatalogProductsCodeBuildRole
• CodePipeline – AmazonSageMakerServiceCatalogProductsCodePipelineRole
• Events – AmazonSageMakerServiceCatalogProductsEventsRole
• Firehose – AmazonSageMakerServiceCatalogProductsFirehoseRole
• Glue – AmazonSageMakerServiceCatalogProductsGlueRole
• Lambda – AmazonSageMakerServiceCatalogProductsLambdaRole
• SageMaker – AmazonSageMakerServiceCatalogProductsExecutionRole

If you are using a new SageMaker Domain with JumpStart project templates enabled, these roles are
automatically created in your account.

If you are using an existing SageMaker domain, these roles may not exist in your account. If this is the
case, you will receive the following error when launching the solution.

Unable to locate the updated roles required to launch this solution, a general role '/
service-role/AmazonSageMakerServiceCatalogProductsUseRole' will be used. Please update your
studio domain to generate these roles.

You can still launch a solution without the needed role, but the legacy default role
AmazonSageMakerServiceCatalogProductsUseRole is used in place of the needed role. The legacy
default role has trust relationships with all of the services that JumpStart solutions need to interact with.
For the best security, we recommend that you update your domain to have the newly created default
roles for each AWS service.

If you have already onboarded to a SageMaker domain, you can update your domain to generate the
default roles using the following procedure.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.
2. Choose Control Panel at the top left of the page.
3. From the Domain page, choose the Settings icon to edit the domain settings.
4. On General Settings, choose Next.
5. Under SageMaker Projects and JumpStart, select both Enable Amazon SageMaker project templates
   and Amazon SageMaker JumpStart for this account and Enable Amazon SageMaker project
   templates and Amazon SageMaker JumpStart for Studio users, and then choose Next.
6. Select Submit.

You should be able to see the default roles listed in Projects - Amazon SageMaker project templates
enabled for this account under the Apps - Studio tab.

Find IAM role

If you select this option, you must select an existing IAM role from the dropdown list for each of the
required services. The selected role must have at least the minimum permissions required for the
corresponding service. For a description of the permissions required for each service, see AWS Managed
Policies for SageMaker projects and JumpStart (p. 3172).

Input IAM role

If you select this option, you must manually enter the ARN for an existing IAM role. The selected role
must have at least the minimum permissions required for the corresponding service. For a description
of the permissions required for each service, see AWS Managed Policies for SageMaker projects and
JumpStart (p. 3172).

JumpStart Foundation Models


Amazon SageMaker JumpStart offers state-of-the-art, built-in foundation models for use cases such
as content writing, code generation, question answering, copywriting, summarization, classification,
information retrieval, and more. Use JumpStart foundation models to build your own generative AI
solutions and integrate custom solutions with additional SageMaker features. For more information, see
Getting started with Amazon SageMaker JumpStart.

A foundation model is a large pre-trained model that is adaptable to many downstream tasks and often
serves as the starting point for developing more specialized models. Examples of foundation models
include AlexaTM, BLOOM, and FLAN, which are pre-trained on massive amounts of text data and can
be fine-tuned for specific language tasks. Amazon SageMaker JumpStart onboards and maintains open
source, community, and third-party foundation models for you to access, customize, and integrate into
your machine learning lifecycles.

To get started exploring and experimenting with available models, see How to use JumpStart
foundation models (p. 60). All foundation models are available to use programmatically with the
SageMaker Python SDK. For more information, see Use foundation models with the SageMaker Python
SDK (p. 60).

For more information on considerations to make when choosing a model, see Choose a foundation
model (p. 61).

For specifics about customization and fine-tuning foundation models, see Customize a foundation
model (p. 63).

For more general information on foundation models, see the paper On the Opportunities and Risks of
Foundation Models.

Featured foundation models


Explore featured foundation models for a variety of use cases:

• AlexaTM 20B – Model ID: pytorch-textgeneration1-alexa20b. Use case: text generation. Fine-tunable: No. Source: Alexa (publicly available).
• Bloom 1b7 – Model ID: huggingface-textgeneration-bloom-1b7. Use case: text generation. Fine-tunable: No. Source: Hugging Face (publicly available).
• FLAN-T5 XL – Model ID: huggingface-text2text-flan-t5-xl. Use case: text2text generation. Fine-tunable: Yes. Source: Hugging Face (publicly available).
• Stable Diffusion 2.1 base – Model ID: model-txt2img-stabilityai-stable-diffusion-v2-1-base. Use case: text to image. Fine-tunable: Yes. Source: Stability AI (publicly available).
• Stable Diffusion 2 – Model ID: model-txt2img-stabilityai-stable-diffusion-v2. Use case: text to image. Fine-tunable: No. Source: Stability AI (publicly available).
• Stable Diffusion – Model ID: model-txt2img-stabilityai-stable-diffusion-v1-4. Use case: text to image. Fine-tunable: No. Source: Stability AI (publicly available).
• Stable Diffusion x4 upscaler FP16 – Model ID: model-upscaling-stabilityai-stable-diffusion-x4-upscaler-fp16. Use case: text to image. Fine-tunable: No. Source: Stability AI (publicly available).

To get started with one of these featured models, see How to use JumpStart foundation
models (p. 60) or explore one of the available Example notebooks (p. 59). In a given example
notebook, try switching out the model ID to experiment with different models within the same model
family.

Example notebooks
For step-by-step examples on how to use JumpStart foundation models with the SageMaker Python
SDK, refer to the following notebooks on text generation, image generation, and model customization.

If a notebook is associated with a specific foundation model, you can find the foundation model in
SageMaker Studio, navigate to the Run in notebook section, and choose Open notebook.

Alternatively, for instructions on how to create and access Jupyter notebook instances that you can
use to run the example in SageMaker, see Amazon SageMaker Notebook Instances. After you have
created a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of
the SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.

Text generation
Explore text generation example notebooks, including guidance on general text generation workflows,
multilingual text classification, real-time batch inference, few-shot learning, chatbot interactions, and
more.

• In-context learning with AlexaTM 20B in SageMaker JumpStart
• SageMaker JumpStart Foundation Models - HuggingFace Text2Text Generation
• SageMaker JumpStart Foundation Models - BloomZ: Multilingual Text Classification, Question and
Answering, Code Generation, Paragraph rephrase, and More
• SageMaker JumpStart Foundation Models - HuggingFace Text2Text Generation Batch Transform and
Real-Time Batch Inference
• SageMaker JumpStart Foundation Models - GPT-J, GPT-Neo Few-shot learning
• SageMaker JumpStart Foundation Models - OpenChatKit GPT-NeoXT-Chat-Base-20B Chatbot

Image generation
Get started with text-to-image Stable Diffusion models, learn how to run image generation inference,
and experiment with a simple workflow to generate images of your dog.

• Introduction to JumpStart - Text to Image
• Introduction to JumpStart - Text to Image (Inference only)
• Introduction to JumpStart Image editing - Stable Diffusion Inpainting


• Generate fun images of your dog

Model customization
Sometimes your use case requires greater foundation model customization for specific tasks. For more
information on model customization approaches, see Customize a foundation model (p. 63) or
explore one of the following example notebooks.

• SageMaker JumpStart Foundation Models - Fine-tuning text generation GPT-J 6B model on domain
specific dataset
• SageMaker JumpStart Foundation Models - HuggingFace Text2Text Instruction Fine-Tuning
• Retrieval-Augmented Generation: Question Answering based on Custom Dataset
• Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced
LangChain Library

How to use JumpStart foundation models


Discover JumpStart foundation models directly through the console, choose, train, or deploy foundation
models through Amazon SageMaker Studio, or use JumpStart foundation models programmatically with
the SageMaker Python SDK.

Use foundation models with the SageMaker Python SDK


All JumpStart foundation models are available to deploy programmatically using the SageMaker Python
SDK. For more information, see Deploy a Pre-Trained Model Directly to a SageMaker Endpoint.

To reference available model IDs, see the Built-in Algorithms with pre-trained Model Table. Search for
the name of the foundation model of your choice in the Search bar, change the number of entries shown
using the Show entries dropdown menu, or choose the Next text highlighted in blue on the lefthand
side of the page to navigate through the available models.

For example notebooks with detailed steps on using JumpStart foundation models with the SageMaker
Python SDK, see Example notebooks (p. 59).
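
For orientation, the following is a minimal deployment sketch. It assumes a recent version of the
SageMaker Python SDK with JumpStart support and an execution role available in your environment;
the request payload format varies by model, so confirm it in the model's example notebook.

from sagemaker.jumpstart.model import JumpStartModel

# Model ID taken from the Built-in Algorithms with pre-trained Model Table.
model = JumpStartModel(model_id="huggingface-text2text-flan-t5-xl")

# deploy() creates a SageMaker endpoint backed by billable hosting instances.
predictor = model.deploy()

# "text_inputs" is the payload key used by many JumpStart text2text models;
# check the model's example notebook for the exact request format.
response = predictor.predict({"text_inputs": "Summarize: Amazon SageMaker JumpStart offers..."})
print(response)

# Delete the endpoint when finished to stop incurring charges.
predictor.delete_endpoint()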

Discover foundation models in the SageMaker Console


You can explore JumpStart foundation models directly through the Amazon SageMaker Console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Find JumpStart on the left-hand navigation panel and choose Foundation models.
3. Browse models or search for a specific model. If you need guidance, see Choose a foundation
   model (p. 61). Choose View model for the foundation model of your choice. If you have access to
   the proprietary foundation models in preview, you can toggle off the Show models in preview
   option to display only publicly available models. If you don't have access to proprietary foundation
   models in preview, you can request access.
Note
Proprietary foundation models are currently in preview. You need the
AmazonSageMakerFullAccess policy attached to your role to access proprietary
foundation models. If you don’t have access to the proprietary models, choose Request
access. You can reach out to your account administrator or Amazon SageMaker JumpStart
support [email protected] for further details.
4. To view an example notebook, choose View code in the upper right-hand corner.
5. To view and run an example notebook directly in Amazon SageMaker Studio, choose Open
notebook in Studio in the upper right-hand corner.


Use foundation models in Amazon SageMaker Studio


You can deploy, train, and fine-tune JumpStart foundation models directly through the Studio UI.

In the SageMaker JumpStart section of the navigation pane, choose Models, notebooks, solutions.
Then, scroll down to find the Foundation Models section. You can choose a model from here by choosing
View model, or choose Explore All Foundation Models to see all available foundation models. If you
choose to see all available foundation models, you can further filter them by task, data type, content
type, or framework. You can also search for a model name in the Search bar. If you need guidance on
selecting a model, see Choose a foundation model (p. 61).

After you choose View model for the foundation model of your choice in Studio, you can deploy the
model. For more information, see Deploy a Model (p. 69). You can also choose Open notebook in
the Run in notebook section to run an example notebook for the foundation model directly in Studio.
If the model is fine-tunable, you can also fine-tune the model. For more information, see Fine-Tune
a Model (p. 76). For a list of which JumpStart foundation models are fine-tunable, see Fine-tune a
foundation model (p. 64).

Choose a foundation model


Amazon SageMaker JumpStart provides access to hundreds of publicly available and proprietary
foundation models from third-party sources and partners. You can explore the JumpStart foundation
model selection directly in the SageMaker Console or Studio. When selecting a foundation model, we
recommend starting with a specific task in mind and exploring potential models from there.

Foundation model tasks


By definition, foundation models are adaptable to many downstream tasks. Foundation models are
trained on massive amounts of general domain data, and the same model can be implemented or
customized for multiple use cases. When choosing your foundation model, start with defining a specific
task.

Choose a text generation model


Text generation foundation models can be used for text summarization, text classification, question
answering, long-form content generation, short-form copywriting, information extraction, and more.

Try out trending publicly available text generation and text-to-text models for the following tasks:

• Text summarization: GPT-J 6B, FLAN-T5 XL
• Question answering: FLAN-T5 XL, GPT NeoX 20B
• Chatbot: GPT NeoX 20B Chat Base, FLAN-T5 XL
• Content generation: BloomZ 7B1 FP16
• Copywriting: GPT NeoX 20B
• Information extraction: GPT NeoX 20B
• Multilingual tasks: AlexaTM 20B, BloomZ 7B1 FP16

Choose an image generation model


JumpStart provides a wide variety of Stable Diffusion image generation foundation models including
base models from Stability AI as well as pre-trained models for specific text-to-image tasks from
Hugging Face. If you need to fine-tune your text-to-image foundation model, you can use Stable
Diffusion 2.1 base from Stability AI. If you want to explore models that are already trained on specific art
styles, you can explore one of the many third-party models from Hugging Face directly in the SageMaker
Console or Studio.

Try out some recommended text-to-image models for the following tasks:

• Image generation: Stable Diffusion, Stable Diffusion 2
• Fine-tuning: Stable Diffusion 2.1 base
• In-painting: Stable Diffusion 2 Inpainting

Licenses and model sources


Amazon SageMaker JumpStart provides access to both publicly available and proprietary foundation
models. Foundation models are onboarded and maintained from third-party open source and
proprietary providers. As such, they are released under different licenses as designated by the model
source. Be sure to review the license for any foundation model that you use. You are responsible for
reviewing and complying with any applicable license terms and making sure they are acceptable for
your use case before downloading or using the content. Some examples of common foundation model
licenses include:

• Alexa Teacher Model
• Apache 2.0
• BigScience Responsible AI License v1.0
• CreativeML Open RAIL++-M license

Note
Proprietary foundation models are currently in preview. You need the
AmazonSageMakerFullAccess policy attached to your role to access proprietary foundation
models. If you don’t have access to the proprietary models, choose Request access. You can
reach out to your account administrator or Amazon SageMaker JumpStart support
sagemaker-[email protected] for further details.


Customize a foundation model


Foundation models are extremely powerful models able to solve a wide array of tasks. To solve most
tasks effectively, these models require some form of customization.

The recommended way to first customize a foundation model to a specific use case is through prompt
engineering. Providing your foundation model with well-engineered, context-rich prompts can help
achieve desired results without any fine-tuning or changing of model weights. For more information, see
Prompt engineering for foundation models (p. 63).

If prompt engineering alone is not enough to customize your foundation model to a specific task, you
can fine-tune a foundation model on additional domain-specific data. For more information, see Fine-
tune a foundation model (p. 64). The fine-tuning process involves changing model weights.

If you want to customize your model with information from a knowledge library without any retraining,
see Retrieval Augmented Generation (RAG) (p. 65).

Prompt engineering for foundation models


Prompt engineering is the process of designing and refining the prompts or input stimuli for a language
model to generate specific types of output. Prompt engineering involves selecting appropriate keywords,
providing context, and shaping the input in a way that encourages the model to produce the desired
response and is a vital technique to actively shape the behavior and output of foundation models.

Effective prompt engineering is crucial for directing model behavior and achieving desired responses.
Through prompt engineering, you can control a model’s tone, style, and domain expertise without
more involved customization measures like fine-tuning. We recommend dedicating time to prompt
engineering before you consider fine-tuning a model on additional data. The goal is to provide sufficient
context and guidance to the model so that it can generalize and perform well on unseen or limited data
scenarios.

Zero-shot learning
Zero-shot learning involves training a model to generalize and make predictions on unseen classes or
tasks. To perform prompt engineering in zero-shot learning environments, we recommend constructing
prompts that explicitly provide information about the target task and the desired output format. For
example, if you want to use a foundation model for zero-shot text classification on a set of classes
that the model did not see during training, a well-engineered prompt could be: "Classify the
following text as either sports, politics, or entertainment: [input text]." By
explicitly specifying the target classes and the expected output format, you can guide the model to make
accurate predictions even on unseen classes.
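
As a minimal sketch, the zero-shot prompt above could be sent to a deployed JumpStart text
generation endpoint as follows. This assumes a predictor like the one in the earlier SDK deployment
sketch and a {"text_inputs": ...} payload, which varies by model.

# `predictor` is a deployed JumpStart text generation endpoint (see the
# earlier deploy sketch). The prompt names the target classes explicitly.
prompt = (
    "Classify the following text as either sports, politics, or entertainment: "
    "The team clinched the championship title in overtime."
)
response = predictor.predict({"text_inputs": prompt})
print(response)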

Few-shot learning
Few-shot learning involves training a model with a limited amount of data for new classes or tasks.
Prompt engineering in few-shot learning environments focuses on designing prompts that effectively
use the limited available training data. For example, if you use a foundation model for an image
classification task and only have a few examples of a new image class, you can engineer a prompt that
includes the available labeled examples with a placeholder for the target class. For example, the prompt
could be: "[image 1], [image 2], and [image 3] are examples of [target class].
Classify the following image as [target class]". By incorporating the limited labeled
examples and explicitly specifying the target class, you can guide the model to generalize and make
accurate predictions even with minimal training data.
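
The following is a minimal text-based sketch of assembling a few-shot prompt from labeled examples.
The example texts, labels, and payload format are illustrative assumptions; adapt them to your model.

# `predictor` is a deployed JumpStart text generation endpoint (see the
# earlier deploy sketch). Prepend labeled examples to the query.
examples = [
    ("The stadium erupted after the final goal.", "sports"),
    ("The senate passed the new budget bill.", "politics"),
]
query = "The film festival announced its award winners."

prompt = "".join(f"Text: {text}\nLabel: {label}\n\n" for text, label in examples)
prompt += f"Text: {query}\nLabel:"

response = predictor.predict({"text_inputs": prompt})
print(response)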

If prompt engineering is not sufficient to adapt your foundation model to specific business needs,
domain-specific language, target tasks, or other requirements, you can consider fine-tuning your
model on additional data or using Retrieval Augmented Generation (RAG) to augment your model
architecture with enhanced context from archived knowledge sources. For more information, see Fine-
tune a foundation model (p. 64) or Retrieval Augmented Generation (RAG) (p. 65).


Fine-tune a foundation model


Foundation models are computationally expensive and trained on a large, unlabeled corpus. Fine-tuning
a pre-trained foundation model is an affordable way to take advantage of its broad capabilities while
customizing it on your own smaller corpus. Fine-tuning is a customization method that involves further
training and changes the weights of your model.

Fine-tuning might be useful to you if you need:

• to customize your model to specific business needs
• your model to successfully work with domain-specific language, such as industry jargon, technical
  terms, or other specialized vocabulary
• enhanced performance for specific tasks
• accurate, relevant, and context-aware responses in applications
• responses that are more factual, less toxic, and better aligned with specific requirements

There are two main approaches that you can take for fine-tuning depending on your use case and chosen
foundation model. If you're interested in fine-tuning your model on domain-specific data, see Domain
adaptation fine-tuning (p. 64). If you're interested in instruction-based fine-tuning using prompt and
response examples, see Instruction-based fine-tuning (p. 64).

Domain adaptation fine-tuning

Domain adaptation fine-tuning allows you to leverage pre-trained foundation models and adapt them
to specific tasks using limited domain-specific data. If prompt engineering efforts do not provide enough
customization, you can use domain adaptation fine-tuning to get your model working with domain-specific
language, such as industry jargon, technical terms, or other specialized data. This fine-tuning process
modifies the weights of the model. For more information, see the SageMaker JumpStart Foundation
Models - Fine-tuning text generation GPT-J 6B model on domain specific dataset example notebook.

Domain adaptation fine-tuning is available with the following foundation models:

• GPT-J 6B
• GPT Neo 2.7B
• BloomZ 7b1
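
As a minimal programmatic sketch, domain adaptation fine-tuning can also be started with the
SageMaker Python SDK. The S3 path and channel name below are illustrative assumptions; consult the
model's fine-tuning notebook for the expected dataset format.

from sagemaker.jumpstart.estimator import JumpStartEstimator

# GPT-J 6B supports domain adaptation fine-tuning on a plain-text corpus.
estimator = JumpStartEstimator(model_id="huggingface-textgeneration1-gpt-j-6b")
estimator.fit({"training": "s3://your-bucket/domain-corpus/"})

# Deploy the fine-tuned model to a real-time endpoint.
predictor = estimator.deploy()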

Instruction-based fine-tuning

Instruction-based fine-tuning uses labeled examples to improve the performance of a pre-trained
foundation model on a specific task. The labeled examples are formatted as prompt-response pairs
and phrased as instructions. This fine-tuning process modifies the weights of the model. For more
information on instruction-based fine-tuning, see the papers Introducing FLAN: More generalizable
Language Models with Instruction Fine-Tuning and Scaling Instruction-Finetuned Language Models.

Fine-tuned LAnguage Net (FLAN) models use instruction tuning to make models more amenable to
solving general downstream NLP tasks. Amazon SageMaker JumpStart provides a number of foundation
models in the FLAN model family. For example, FLAN-T5 models are instruction fine-tuned on a wide
range of tasks to increase zero-shot performance for a variety of common use cases. With additional
data and fine-tuning, instruction-based models can be further adapted to more specific tasks that
weren’t considered during pre-training. For more information, see the SageMaker JumpStart Foundation
Models - HuggingFace Text2Text Instruction Fine-Tuning example notebook.

Instruction-based fine-tuning is available with the following foundation models:

• FLAN-T5 XL


• FLAN-T5 Large
• FLAN-T5 Small
• FLAN-T5 Base

Retrieval Augmented Generation (RAG)


Foundation models are usually trained offline, making the model agnostic to any data that is created
after the model was trained. Additionally, foundation models are trained on very general domain
corpora, making them less effective for domain-specific tasks. You can use Retrieval Augmented
Generation (RAG) to retrieve data from outside a foundation model and augment your prompts by
adding the relevant retrieved data in context. For more information about RAG model architectures, see
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

With RAG, the external data used to augment your prompts can come from multiple data sources, such
as document repositories, databases, or APIs. The first step is to convert your documents and any
user queries into a compatible format to perform relevancy search. To make the formats compatible,
a document collection, or knowledge library, and user-submitted queries are converted to numerical
representations using embedding language models. Embedding is the process by which text is given
numerical representation in a vector space. RAG model architectures compare the embeddings of user
queries within the vector of the knowledge library. The original user prompt is then appended with
relevant context from similar documents within the knowledge library. This augmented prompt is then
sent to the foundation model. You can update knowledge libraries and their relevant embeddings
asynchronously.
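
The following is a minimal sketch of the retrieval step described above. The embed() helper is a
stand-in for calls to a deployed text embedding model, and the two-document knowledge library is
illustrative.

import numpy as np

def embed(text):
    # Stand-in embedding for illustration only; in practice, call a deployed
    # text embedding model endpoint to convert text to a vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

documents = [
    "SageMaker JumpStart provides pre-trained foundation models.",
    "RAG appends retrieved context from a knowledge library to prompts.",
]
doc_vectors = np.array([embed(d) for d in documents])

def retrieve(query, k=1):
    # Rank documents by cosine similarity between query and document embeddings.
    q = embed(query)
    scores = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "What does RAG do?"
context = "\n".join(retrieve(query))
# The augmented prompt is what gets sent to the foundation model endpoint.
augmented_prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"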


For more information, see the following example notebooks:

• Retrieval-Augmented Generation: Question Answering based on Custom Dataset
• Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced
  LangChain Library

Task-Specific Models
JumpStart supports task-specific models across fifteen of the most popular problem types. Thirteen of
the supported problem types are vision- and NLP-related, and eight problem types support incremental
training and fine-tuning. For more information about incremental training and hyperparameter tuning,
see SageMaker Automatic Model Tuning. JumpStart also supports four popular algorithms for tabular
data modeling.

You can search and browse models from the JumpStart landing page in Studio. When you select a
model, the model detail page provides information about the model, and you can train and deploy your
model in a few steps. The description section describes what you can do with the model, the expected
types of inputs and outputs, and the data type needed for fine-tuning your model.

You can also programmatically utilize models with the SageMaker Python SDK. For a list of all available
models, see the JumpStart Available Model Table.

The problem types and links to their example Jupyter notebooks are summarized in the following list.
Each entry notes whether the problem type supports inference with pre-trained models, whether it is
trainable on a custom dataset, and its supported frameworks.

• Image classification – Pre-trained inference: Yes. Custom training: Yes. Frameworks: PyTorch, TensorFlow. Example notebook: Introduction to JumpStart - Image Classification.
• Object detection – Pre-trained inference: Yes. Custom training: Yes. Frameworks: PyTorch, TensorFlow, MXNet. Example notebook: Introduction to JumpStart - Object Detection.
• Semantic segmentation – Pre-trained inference: Yes. Custom training: Yes. Frameworks: MXNet. Example notebook: Introduction to JumpStart - Semantic Segmentation.
• Instance segmentation – Pre-trained inference: Yes. Custom training: Yes. Frameworks: MXNet. Example notebook: Introduction to JumpStart - Instance Segmentation.
• Image embedding – Pre-trained inference: Yes. Custom training: No. Frameworks: TensorFlow, MXNet. Example notebook: Introduction to JumpStart - Image Embedding.
• Text classification – Pre-trained inference: Yes. Custom training: Yes. Frameworks: TensorFlow. Example notebook: Introduction to JumpStart - Text Classification.
• Sentence pair classification – Pre-trained inference: Yes. Custom training: Yes. Frameworks: TensorFlow, Hugging Face. Example notebook: Introduction to JumpStart - Sentence Pair Classification.
• Question answering – Pre-trained inference: Yes. Custom training: Yes. Frameworks: PyTorch, Hugging Face. Example notebook: Introduction to JumpStart - Question Answering.
• Named entity recognition – Pre-trained inference: Yes. Custom training: No. Frameworks: Hugging Face. Example notebook: Introduction to JumpStart - Named Entity Recognition.
• Text summarization – Pre-trained inference: Yes. Custom training: No. Frameworks: Hugging Face. Example notebook: Introduction to JumpStart - Text Summarization.
• Text generation – Pre-trained inference: Yes. Custom training: No. Frameworks: Hugging Face. Example notebook: Introduction to JumpStart - Text Generation.
• Machine translation – Pre-trained inference: Yes. Custom training: No. Frameworks: Hugging Face. Example notebook: Introduction to JumpStart - Machine Translation.
• Text embedding – Pre-trained inference: Yes. Custom training: No. Frameworks: TensorFlow, MXNet. Example notebook: Introduction to JumpStart - Text Embedding.
• Tabular classification – Pre-trained inference: Yes. Custom training: Yes. Frameworks: LightGBM, CatBoost, XGBoost, AutoGluon-Tabular, TabTransformer, Linear Learner. Example notebooks: Introduction to JumpStart - Tabular Classification - LightGBM, CatBoost; Introduction to JumpStart - Tabular Classification - XGBoost, Linear Learner; Introduction to JumpStart - Tabular Classification - AutoGluon Learner; Introduction to JumpStart - Tabular Classification - TabTransformer Learner.
• Tabular regression – Pre-trained inference: Yes. Custom training: Yes. Frameworks: LightGBM, CatBoost, XGBoost, AutoGluon-Tabular, TabTransformer, Linear Learner. Example notebooks: Introduction to JumpStart - Tabular Regression - LightGBM, CatBoost; Introduction to JumpStart - Tabular Regression - XGBoost, Linear Learner; Introduction to JumpStart - Tabular Regression - AutoGluon Learner; Introduction to JumpStart - Tabular Regression - TabTransformer Learner.


Deploy a Model
When you deploy a model from JumpStart, SageMaker hosts the model and deploys an endpoint that
you can use for inference. JumpStart also provides an example notebook that you can use to access the
model after it's deployed.

Model deployment configuration


After you choose a model, the model's tab opens. In the Deploy Model pane, choose Deployment
Configuration to configure your model deployment.

The default instance type for deploying a model depends on the model. The instance type is the
hardware that your model runs on. For example, the ml.p2.xlarge instance is the default for a
particular BERT model.

You can also change the endpoint name, add key-value resource tags, activate or deactivate the
jumpstart- prefix for any JumpStart resources related to the model, and specify an Amazon S3 bucket
for storing model artifacts used by your SageMaker endpoint.


Choose Security Settings to specify the AWS Identity and Access Management (IAM) role, Amazon
Virtual Private Cloud (Amazon VPC), and encryption keys for the model.


Model deployment security

When you deploy a model with JumpStart, you can specify an IAM role, Amazon VPC, and encryption
keys for the model. If you don't specify values for these entries, the default IAM role is your Studio
runtime role, default encryption is used, and no Amazon VPC is used.

IAM role

You can select an IAM role that is passed as part of training jobs and hosting jobs. SageMaker uses this
role to access training data and model artifacts. If you don't select an IAM role, SageMaker deploys the
model using your Studio runtime role. For more information about IAM roles, see Identity and Access
Management for Amazon SageMaker (p. 3048).

The role that you pass must have access to the resources that the model needs, and must include all of
the following.

• For training jobs: CreateTrainingJob API: Execution Role Permissions.


• For hosting jobs: CreateModel API: Execution Role Permissions.


Note
You can scope down the Amazon S3 permissions granted in each of the following roles. Do
this by using the ARN of your Amazon Simple Storage Service (Amazon S3) bucket and the
JumpStart Amazon S3 bucket.

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListMultipartUploadParts",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::jumpstart-cache-prod-<region>/*",
    "arn:aws:s3:::jumpstart-cache-prod-<region>",
    "arn:aws:s3:::bucket/*"
  ]
}

Find IAM role

If you select this option, you must select an existing IAM role from the dropdown list.

Input IAM role

If you select this option, you must manually enter the ARN for an existing IAM role. If your Studio
runtime role or Amazon VPC blocks the iam:list* call, you must use this option to use an existing IAM
role.


Amazon VPC

All JumpStart models run in network isolation mode. After the model container is created, no more calls
can be made. You can select an Amazon VPC that is passed as part of training jobs and hosting jobs.
SageMaker uses this Amazon VPC to push and pull resources from your Amazon S3 bucket. This Amazon
VPC is different from the Amazon VPC that limits access to the public internet from your Studio instance.
For more information about the Studio Amazon VPC, see Connect SageMaker Studio Notebooks in a VPC
to External Resources (p. 3209).

The Amazon VPC that you pass does not need access to the public internet, but it does need access
to Amazon S3. The Amazon VPC endpoint for Amazon S3 must allow access to at least the following
resources that the model needs.

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListMultipartUploadParts",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::jumpstart-cache-prod-<region>/*",
    "arn:aws:s3:::jumpstart-cache-prod-<region>",
    "arn:aws:s3:::bucket/*"
  ]
}

If you do not select an Amazon VPC, no Amazon VPC is used.

Find VPC

If you select this option, you must select an existing Amazon VPC from the dropdown list. After you
select an Amazon VPC, you must select a subnet and security group for your Amazon VPC. For more
information about subnets and security groups, see Overview of VPCs and subnets.


Input VPC

If you select this option, you must manually select the subnet and security group that compose your
Amazon VPC. If your Studio runtime role or Amazon VPC blocks the ec2:list* call, you must use this
option to select the subnet and security group.

Encryption keys
You can select an AWS KMS key that is passed as part of training jobs and hosting jobs. SageMaker uses
this key to encrypt the Amazon EBS volume for the container, and the repackaged model in Amazon S3
for hosting jobs and the output for training jobs. For more information about AWS KMS keys, see AWS
KMS keys.

The key that you pass must trust the IAM role that you pass. If you do not specify an IAM role, the AWS
KMS key must trust your Studio runtime role.

If you do not select an AWS KMS key, SageMaker provides default encryption for the data in the Amazon
EBS volume and the Amazon S3 artifacts.


Find encryption keys

If you select this option, you must select existing AWS KMS keys from the dropdown list.

Input encryption keys

If you select this option, you must manually enter the AWS KMS keys. If your Studio execution role or
Amazon VPC blocks the kms:list* call, you must use this option to select existing AWS KMS keys.
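
If you deploy programmatically rather than through the Studio UI, the same security settings can be
passed with the SageMaker Python SDK. The following is a minimal sketch; the role ARN, subnet,
security group, and KMS key are placeholders.

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="huggingface-text2text-flan-t5-xl",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    vpc_config={
        "Subnets": ["subnet-0123456789abcdef0"],
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# kms_key encrypts the storage volume attached to the hosting instance.
predictor = model.deploy(
    kms_key="arn:aws:kms:us-east-1:111122223333:key/example-key-id"
)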


Fine-Tune a Model
Fine-tuning trains a pretrained model on a new dataset without training from scratch. This process, also
known as transfer learning, can produce accurate models with smaller datasets and less training time.
You can fine-tune a model if its card shows a fine-tunable attribute set to Yes.

Fine-Tuning data source


When you fine-tune a model, you can use the default dataset or choose your own data, which is located
in an Amazon S3 bucket.

To browse the buckets available to you, choose Find S3 bucket. These buckets are limited by the
permissions used to set up your Studio account. You can also specify an Amazon S3 URI by choosing
Enter Amazon S3 bucket location.


Tip
To find out how to format the data in your bucket, choose Learn more. The description section
for the model has detailed information about inputs and outputs.

For text models:

• The bucket must have a data.csv file.
• The first column must be a unique integer for the class label. For example: 1, 2, 3, 4, n.
• The second column must be a string.
• The second column should have the corresponding text that matches the type and language for the
  model.
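
For example, a minimal data.csv for a two-class English text model might look like the following; the
labels and sentences are illustrative:

1,"This product stopped working after a week."
2,"The delivery was fast and the quality is excellent."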

For vision models:

• The bucket must have as many subdirectories as the number of classes.
• Each subdirectory should contain images that belong to that class in .jpg format.

Note
The Amazon S3 bucket must be in the same AWS Region where you're running SageMaker
Studio because SageMaker doesn't allow cross-Region requests.

Fine-Tuning deployment configuration


The p3 instance family is recommended as the fastest for deep learning training, and is recommended
for fine-tuning a model. The following chart shows the number of GPUs in each instance type. There are
other available options that you can choose from, including p2 and g4 instance types.


Instance type GPUs

p3.2xlarge 1

p3.8xlarge 4

p3.16xlarge 8

p3dn.24xlarge 8

Hyperparameters
You can customize the hyperparameters of the training job that are used to fine-tune the model. The
hyperparameters available for each fine-tunable model differ depending on the model. For information
on each available hyperparameter, reference the hyperparameters documentation for the model of your
choosing in Use Amazon SageMaker Built-in Algorithms or Pre-trained Models (p. 1281). For example,
see Image Classification - TensorFlow Hyperparameters (p. 1526) for details on the fine-tunable Image
Classification - TensorFlow hyperparameters.

If you use the default dataset for text models without changing the hyperparameters, you get a nearly
identical model as a result. For vision models, the default dataset is different from the dataset used to
train the pretrained models, so your model is different as a result.

The following hyperparameters are common among models:

• Epochs – One epoch is one full cycle through the entire dataset. The dataset is processed in batches,
  and multiple batches complete an epoch. Multiple epochs are run until the accuracy of the model
  reaches an acceptable level, or the error rate drops below an acceptable level.
• Learning rate – The amount by which the model's weights are changed between updates. As the
  model is refined, its internal weights are nudged and error rates are checked to see if the model
  improves. A typical learning rate is 0.1 or 0.01, where 0.01 is a much smaller adjustment and could
  cause the training to take a long time to converge, whereas 0.1 is much larger and can cause the
  training to overshoot. It is one of the primary hyperparameters that you might adjust for training
  your model. Note that for text models, a much smaller learning rate (5e-5 for BERT) can result in a
  more accurate model.
• Batch size – The number of records from the dataset to select for each interval to send to the GPUs
  for training.

  In an image example, you might send 32 images per GPU, so 32 would be your batch size. If you
  choose an instance type with more than one GPU, the batch is divided by the number of GPUs.
  Suggested batch size varies depending on the data and the model that you are using. For example,
  how you optimize for image data differs from how you handle language data.

  In the instance type chart in the deployment configuration section, you can see the number of GPUs
  per instance type. Start with a standard recommended batch size (for example, 32 for a vision model).
  Then, multiply this by the number of GPUs in the instance type that you selected. For example, if
  you're using a p3.8xlarge, this would be 32 (batch size) multiplied by 4 (GPUs), for a total batch size
  of 128. For a text model like BERT, try starting with a batch size of 64, and then reduce as needed.
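
A minimal sketch of this batch size heuristic follows; the GPU counts come from the chart above, and
the starting batch sizes are suggested defaults, not fixed requirements.

# Scale a per-GPU starting batch size by the number of GPUs in the instance.
gpus_per_instance = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
}

base_batch_size = 32  # suggested starting point for a vision model
instance_type = "ml.p3.8xlarge"

effective_batch_size = base_batch_size * gpus_per_instance[instance_type]
print(effective_batch_size)  # 128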

Training output
When the fine-tuning process is complete, JumpStart provides information about the model: parent
model, training job name, training job ARN, training time, and output path. The output path is where
you can find your new model in an Amazon S3 bucket. The folder structure uses the model name that
you provided, and the model file is in an /output subfolder where it's always named model.tar.gz.


Example: s3://bucket/model-name/output/model.tar.gz
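
As a minimal sketch, you could retrieve and unpack the artifacts with boto3; the bucket and key below
mirror the example path and are placeholders.

import tarfile
import boto3

s3 = boto3.client("s3")
s3.download_file("bucket", "model-name/output/model.tar.gz", "model.tar.gz")

# The archive contains the trained model files produced by fine-tuning.
with tarfile.open("model.tar.gz") as tar:
    tar.extractall("model")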

Share Models
You can share JumpStart models through the Studio UI directly from the Launched Quick start assets
page using the following procedure:

1. Open Amazon SageMaker Studio and choose Launched Quick start assets in the Quick start
solutions section of the lefthand navigation pane.
2. Select the Training jobs tab to view the list of your model training jobs.
3. Under the Training jobs list, select the training job that you want to share. This opens the training job
details page. You cannot share more than one training job at a time.
4. In the header for the training job, choose Share, and select either Share to Canvas or Share with my
organization.

For more information about how to share a model with a SageMaker Canvas user, see Bring Your Own
Model Into Canvas.
Note
Only tabular models can be shared to SageMaker Canvas. Trying to share a non-tabular model
to SageMaker Canvas throws the error Unsupported Data Type.

For more information about sharing models with your organization, see Shared Models and
Notebooks (p. 79).

Shared Models and Notebooks


Share your models and notebooks to centralize model artifacts, facilitate discoverability, and increase
the reuse of models within your organization. When sharing your models, you can provide training and
inference environment information and allow collaborators to use these environments for their own
training and inference jobs.

All models that you share and models that are shared with you are searchable in a centralized location
directly in Amazon SageMaker Studio. For information on the onboarding steps to sign into Amazon
SageMaker Studio, see Onboard to Amazon SageMaker Domain.

Access shared models and notebooks


To access your shared content, choose Shared models in the left navigation pane of the Amazon
SageMaker Studio UI.

Add shared content


You can share models or notebooks through the Shared models section of the Studio UI. For details
about each step, see Share models and notebooks through the Studio UI (p. 80).

Filter shared content


There are three main options for filtering shared models and notebooks:

1. Shared by me – Models and notebooks that you shared to either JumpStart or SageMaker Canvas.
2. Shared with me – Models and notebooks shared with you.
3. Shared by my organization – All models and notebooks that are shared with anyone in your
   organization.


You can also sort your models and notebooks based on the time they were last updated, or by ascending
or descending alphabetical order. Choose the filter icon to further sort your selections.

Share tabular models with SageMaker Canvas users


In addition to sharing models with your organization, you can also share models with collaborators that
use SageMaker Canvas. If you share models to SageMaker Canvas, your collaborators can import those
models into SageMaker Canvas and use them to generate predictions.
Important
You can only share tabular models to SageMaker Canvas.

You can filter for models and notebooks shared to and from SageMaker Canvas by selecting the filter
icon in the Shared by me or Shared with me tabs. For more information about how to share a
model to SageMaker Canvas, see Bring Your Own Model Into Canvas.

Share models and notebooks through the Studio UI


To share models and notebooks, navigate to the Shared models section in
Amazon SageMaker Studio, choose Shared by my organization, and then
select the Add dropdown list. Choose to either add a model or add a notebook.

Add a model
To add a model, choose Shared by my organization, and then select Add model from the Add
dropdown list. Enter the basic information for your model, and add any training or inference information
that you want to share with collaborators to train or deploy your model. After you enter all the necessary
information, choose Add model in the lower right corner.

Basic information

First, add the basic descriptive information about your model. This information is used to improve the
searchability of your model.

1. Add a title for this model. Adding a title automatically populates a unique identifier in the ID field
based on the model title.
2. Add a description of the model.
3. Select a data type from the options: text, vision, tabular, or audio.
4. Select a machine learning task from the list of available tasks, such as image classification or text
generation.
5. Select a machine learning framework.


6. Add metadata information with keywords or phrases to use when searching for a model. Use commas
to separate keywords. Any spaces are automatically replaced with commas.

Enable training

When adding a model to share, you can optionally provide a training environment and allow
collaborators in your organization to train the shared model.
Note
If you are adding a tabular model, you also need to specify a column format and target column
to enable training. For more information, see Amazon SageMaker Canvas in the Amazon
SageMaker Developer Guide.

1. Add a container to use for model training. You can select a container used for an existing training job,
bring your own container in Amazon ECR, or use an Amazon SageMaker Deep Learning Container.
2. Add environment variables.
3. Provide a training script location.
4. Provide a script mode entry point.
5. Provide an Amazon S3 URI for model artifacts generated during training.
6. Provide the Amazon S3 URI to the default training dataset.
7. Provide a model output path. The model output path should be the Amazon S3 URI path for any
model artifacts generated from training. SageMaker saves the model artifacts as a single compressed
TAR file in Amazon S3.
8. Provide a validation dataset to use for evaluating your model during training. Validation datasets must
contain the same number of columns and the same feature headers as the training dataset.
9. Turn on network isolation. Network isolation isolates the model container so that no inbound or
outbound network calls can be made to or from the model container.
10. Provide training channels through which SageMaker can access your data. For example, you might specify input channels named train or test. For each channel, specify a channel name and a URI to the location of your data. Choose Browse to search for Amazon S3 locations.
11. Provide hyperparameters. Add any hyperparameters with which collaborators should experiment during training. Provide a range of valid values for these hyperparameters. This range is used for training job hyperparameter validation. You can define ranges based on the datatype of the hyperparameter.
12. Select an instance type. We recommend a GPU instance with more memory for training with large batch sizes. For a comprehensive list of SageMaker training instances across AWS Regions, see the On-Demand Pricing table in Amazon SageMaker Pricing.
13. Provide metrics. Define metrics for a training job by specifying a name and a regular expression for each metric that your training monitors. Design the regular expressions to capture the values of metrics that your algorithm emits. For example, the metric loss might have the regular expression "Loss =(.*?);". A minimal sketch of how such an expression extracts metric values follows this list.

Enable deployment

When adding a model to share, you can optionally provide an inference environment in which
collaborators in your organization can deploy the shared model for inference.

1. Add a container to use for inference. You can bring your own container in Amazon ECR or use an
Amazon SageMaker Deep Learning Container.
2. Provide the Amazon S3 URI to an inference script. Custom inference scripts run inside your chosen container. Your inference script should include a function for model loading, and optionally, functions for generating predictions and for input and output processing. For more information on creating inference scripts for the framework of your choice, see Frameworks in the SageMaker Python SDK documentation. For example, for TensorFlow, see How to implement the pre- and/or post-processing
handler(s).
3. Provide an Amazon S3 URI for model artifacts. Model artifacts are the output that results from training a model, and typically consist of trained parameters, a model definition that describes how to compute inferences, and other metadata. If you trained your model in SageMaker, the model artifacts are saved as a single compressed TAR file in Amazon S3. If you trained your model outside SageMaker, you need to create this single compressed TAR file and save it in an Amazon S3 location; a minimal packaging sketch follows this list.
4. Select an instance type. We recommend a GPU instance with more memory for inference with large batch sizes. For a comprehensive list of SageMaker instances across AWS Regions, see the On-Demand Pricing table in Amazon SageMaker Pricing.
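For example, the following sketch packages artifacts from a model trained outside SageMaker into the single compressed TAR file that SageMaker expects, and uploads it to Amazon S3. The artifact file name and bucket name are hypothetical placeholders.

import tarfile

import boto3

# Package the trained model file into a single compressed TAR file.
# "model.pth" is a hypothetical artifact produced outside SageMaker.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth")

# Upload the archive to a hypothetical Amazon S3 location.
boto3.Session().resource("s3").Bucket("DOC-EXAMPLE-BUCKET").Object(
    "models/model.tar.gz").upload_file("model.tar.gz")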

Add a notebook
To add a notebook, choose Shared by my organization, and then select Add notebook from the Add dropdown list. Enter the basic information for your notebook and provide an Amazon S3 URI for the
location of that notebook.

Basic information

First, add the basic descriptive information about your notebook. This information is used to improve the
searchability of your notebook.

1. Add a title for this notebook. Adding a title automatically populates a unique identifier in the ID field
based on the notebook title.
2. Add a description of the notebook.
3. Select a data type from the options: text, vision, tabular, or audio.
4. Select an ML task from the list of available tasks, such as image classification or text generation.
5. Select an ML framework.
6. Add metadata information with keywords or phrases to use when searching for a notebook. Use
commas to separate keywords. Any spaces are automatically replaced with commas.

Add notebook

Provide an Amazon S3 URI for the location of that notebook. You can choose Browse to search through
your Amazon S3 buckets for your notebook file location. After you find your notebook, copy the Amazon
S3 URI, choose Cancel, and then add the Amazon S3 URI to the Notebook Location field.

After you enter all the necessary information, choose Add notebook in the lower right corner.

Amazon SageMaker JumpStart Industry: Financial


Use SageMaker JumpStart Industry: Financial solutions, models, and example notebooks to learn about
SageMaker features and capabilities through curated one-step solutions and example notebooks of
industry-focused machine learning (ML) problems. The notebooks also walk through how to use the
SageMaker JumpStart Industry Python SDK to enhance industry text data and fine-tune pretrained
models.

Topics
• Amazon SageMaker JumpStart Industry Python SDK (p. 84)
• Amazon SageMaker JumpStart Industry: Financial Solution (p. 84)
• Amazon SageMaker JumpStart Industry: Financial Models (p. 84)
• Amazon SageMaker JumpStart Industry: Financial Example Notebooks (p. 86)
• Amazon SageMaker JumpStart Industry: Financial Blog Posts (p. 86)
• Amazon SageMaker JumpStart Industry: Financial Related Research (p. 86)
• Amazon SageMaker JumpStart Industry: Financial Additional Resources (p. 87)

Amazon SageMaker JumpStart Industry Python SDK


SageMaker JumpStart provides processing tools for curating industry datasets and fine-tuning
pretrained models through its client library called SageMaker JumpStart Industry Python SDK. For
detailed API documentation of the SDK, and to learn more about processing and enhancing industry text
datasets for improving the performance of state-of-the-art models on SageMaker JumpStart, see the
SageMaker JumpStart Industry Python SDK open source documentation.

Amazon SageMaker JumpStart Industry: Financial Solution


SageMaker JumpStart Industry: Financial provides the following solution notebooks:

• Corporate Credit Rating Prediction

This SageMaker JumpStart Industry: Financial solution provides a template for a text-enhanced
corporate credit rating model. It shows how to take a model based on numeric features (in this case,
Altman's famous 5 financial ratios) combined with texts from SEC filings to achieve an improvement in
the prediction of credit ratings. In addition to the 5 Altman ratios, you can add more variables as needed
or set custom variables. This solution notebook shows how SageMaker JumpStart Industry Python
SDK helps process Natural Language Processing (NLP) scoring of texts from SEC filings. Furthermore,
the solution demonstrates how to train a model using the enhanced dataset to achieve a best-in-class
model, deploy the model to a SageMaker endpoint for production, and receive improved predictions in
real time.

• Graph-Based Credit Scoring

Credit ratings are traditionally generated using models that use financial statement data and market
data, which is tabular only (numeric and categorical). This solution constructs a network of firms using
SEC filingsand shows how to use the network of firm relationships with tabular data to generate accurate
rating predictions. This solution demonstrates a methodology to use data on firm linkages to extend
the traditionally tabular-based credit scoring models, which have been used by the ratings industry for
decades, to the class of machine learning models on networks.
Note
The solution notebooks are for demonstration purposes only. They should not be relied on as
financial or investment advice.

You can find these financial services solutions through the SageMaker JumpStart page in Studio.
Note
The SageMaker JumpStart Industry: Financial solutions, model cards, and example notebooks
are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and
launch SageMaker Studio. For more information about how to find the solution card, see the
previous topic at SageMaker JumpStart.

Amazon SageMaker JumpStart Industry: Financial Models


SageMaker JumpStart Industry: Financial provides the following pretrained Robustly Optimized BERT
approach (RoBERTa) models:

• Financial Text Embedding (RoBERTa-SEC-Base)


• RoBERTa-SEC-WIKI-Base
• RoBERTa-SEC-Large
• RoBERTa-SEC-WIKI-Large

The RoBERTa-SEC-Base and RoBERTa-SEC-Large models are the text embedding models based on
GluonNLP's RoBERTa model and pretrained on S&P 500 SEC 10-K/10-Q reports from the 2010s (2010 through 2019). In addition to these, SageMaker JumpStart Industry: Financial provides two
more RoBERTa variations, RoBERTa-SEC-WIKI-Base and RoBERTa-SEC-WIKI-Large, which are pretrained
on the SEC filings and common texts of Wikipedia.

You can find these models in SageMaker JumpStart by navigating to the Text Models node, choosing
Explore All Text Models, and then filtering for the ML Task Text Embedding. You can access any
corresponding notebooks after selecting the model of your choice. The paired notebooks will walk you
through how the pretrained models can be fine-tuned for specific classification tasks on multimodal
datasets, which are enhanced by the SageMaker JumpStart Industry Python SDK.
Note
The model notebooks are for demonstration purposes only. They should not be relied on as
financial or investment advice.

The following screenshot shows the pretrained model cards provided through the SageMaker JumpStart
page on Studio.

Note
The SageMaker JumpStart Industry: Financial solutions, model cards, and example notebooks
are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and launch SageMaker Studio. For more information about how to find the model cards, see the
previous topic at SageMaker JumpStart.

Amazon SageMaker JumpStart Industry: Financial Example Notebooks
SageMaker JumpStart Industry: Financial provides the following example notebooks to demonstrate
solutions to industry-focused ML problems:

• Financial TabText Data Construction – This example introduces how to use the SageMaker JumpStart
Industry Python SDK for processing the SEC filings, such as text summarization and scoring texts based
on NLP score types and their corresponding word lists. To preview the content of this notebook, see
Simple Construction of a Multimodal Dataset from SEC Filings and NLP Scores.
• Multimodal ML on TabText Data – This example shows how to merge different types of datasets
into a single dataframe called TabText and perform multimodal ML. To preview the content of this
notebook, see Machine Learning on a TabText Dataframe – An Example Based on the Paycheck
Protection Program.
• Multi-category ML on SEC filings data – This example shows how to train an AutoGluon NLP model
over the multimodal (TabText) datasets curated from SEC filings for a multiclass classification task.
To preview the content of this notebook, see Classify SEC 10K/Q Filings to Industry Codes Based on the MDNA Text Column.

Note
The example notebooks are for demonstrative purposes only. They should not be relied on as
financial or investment advice.
Note
The SageMaker JumpStart Industry: Financial solutions, model cards, and example notebooks
are hosted and runnable only through SageMaker Studio. Log in to the SageMaker console, and
launch SageMaker Studio. For more information about how to find the example notebooks, see
the previous topic at SageMaker JumpStart.

To preview the content of the example notebooks, see Tutorials – Finance in the SageMaker JumpStart
Industry Python SDK documentation.

Amazon SageMaker JumpStart Industry: Financial Blog Posts


For thorough applications of using SageMaker JumpStart Industry: Financial solutions, models,
examples, and the SDK, see the following blog posts:

• Use pre-trained financial language models for transfer learning in Amazon SageMaker JumpStart
• Use SEC text for ratings classification using multimodal ML in Amazon SageMaker JumpStart
• Create a dashboard with SEC text for financial NLP in Amazon SageMaker JumpStart
• Build a corporate credit ratings classifier using graph machine learning in Amazon SageMaker
JumpStart
• Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial
data

Amazon SageMaker JumpStart Industry: Financial Related Research
For research related to SageMaker JumpStart Industry: Financial solutions, see the following papers:

• Context, Language Modeling, and Multimodal Data in Finance


• Multimodal Machine Learning for Credit Modeling
• On the Lack of Robust Interpretability of Neural Text Classifiers


• FinLex: An Effective Use of Word Embeddings for Financial Lexicon Generation

Amazon SageMaker JumpStart Industry: Financial Additional Resources
For additional documentation and tutorials, see the following resources:

• The SageMaker JumpStart Industry: Financial Python SDK


• SageMaker JumpStart Industry: Financial Python SDK Tutorials
• The SageMaker JumpStart Industry: Financial GitHub repository
• Getting started with Amazon SageMaker - Machine Learning Tutorials

Get Started with Amazon SageMaker Notebook Instances
One of the best ways for machine learning (ML) practitioners to use Amazon SageMaker is to train and
deploy ML models using SageMaker notebook instances. The SageMaker notebook instances help create
the environment by initiating Jupyter servers on Amazon Elastic Compute Cloud (Amazon EC2) and
providing preconfigured kernels with the following packages: the Amazon SageMaker Python SDK,
AWS SDK for Python (Boto3), AWS Command Line Interface (AWS CLI), Conda, Pandas, deep learning
framework libraries, and other libraries for data science and machine learning.

Machine Learning with the SageMaker Python SDK


To train, validate, deploy, and evaluate an ML model in a SageMaker notebook instance, use the SageMaker Python SDK. The SageMaker Python SDK abstracts AWS SDK for Python (Boto3) and SageMaker API operations. It enables you to integrate with and orchestrate other AWS services, such as Amazon Simple Storage Service (Amazon S3) for saving data and model artifacts, Amazon Elastic Container Registry (Amazon ECR) for importing and serving the ML models, and Amazon Elastic Compute Cloud (Amazon EC2) for training and inference.
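As a small illustration of that abstraction (a sketch, not a required tutorial step), creating a session object wires up the underlying Boto3 clients and resolves defaults such as the Region and the default S3 bucket:

import sagemaker

# The session wraps Boto3 clients for the SageMaker APIs and
# resolves account-level defaults.
session = sagemaker.Session()
print(session.boto_region_name)
print(session.default_bucket())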

You can also take advantage of SageMaker features that help you deal with every stage of a complete
ML cycle: data labeling, data preprocessing, model training, model deployment, evaluation on prediction
performance, and monitoring the quality of model in production.

If you're a first-time SageMaker user, we recommend that you use the SageMaker Python SDK by following the end-to-end ML tutorial. To find the open source documentation, see the Amazon SageMaker Python SDK.

Tutorial Overview
This Get Started tutorial walks you through how to create a SageMaker notebook instance, open a
Jupyter notebook with a preconfigured kernel with the Conda environment for machine learning, and
start a SageMaker session to run an end-to-end ML cycle. You'll learn how to save a dataset to a default
Amazon S3 bucket automatically paired with the SageMaker session, submit a training job of an ML
model to Amazon EC2, and deploy the trained model for prediction by hosting or batch inferencing
through Amazon EC2.

This tutorial explicitly shows a complete ML flow of training the XGBoost model from the SageMaker
built-in model pool. You use the US Adult Census dataset, and you evaluate the performance of the
trained SageMaker XGBoost model on predicting individuals' income.


• SageMaker XGBoost – The XGBoost model is adapted to the SageMaker environment and
preconfigured as Docker containers. SageMaker provides a suite of built-in algorithms that are
prepared for using SageMaker features. To learn more about what ML algorithms are adapted to
SageMaker, see Choose an Algorithm and Use Amazon SageMaker Built-in Algorithms. For the
SageMaker built-in algorithm API operations, see First-Party Algorithms in the Amazon SageMaker
Python SDK.
• Adult Census dataset – The dataset from the 1994 Census bureau database by Ronny Kohavi and Barry
Becker (Data Mining and Visualization, Silicon Graphics). The SageMaker XGBoost model is trained
using this dataset to predict if an individual makes over $50,000 a year or less.

Topics
• Step 1: Create an Amazon SageMaker Notebook Instance (p. 88)
• Step 2: Create a Jupyter Notebook (p. 89)
• Step 3: Download, Explore, and Transform a Dataset (p. 90)
• Step 4: Train a Model (p. 94)
• Step 5: Deploy the Model to Amazon EC2 (p. 98)
• Step 6: Evaluate the Model (p. 100)
• Step 7: Clean Up (p. 103)

Step 1: Create an Amazon SageMaker Notebook Instance
An Amazon SageMaker notebook instance is a fully managed machine learning (ML) Amazon Elastic
Compute Cloud (Amazon EC2) compute instance that runs the Jupyter Notebook App. You use the
notebook instance to create and manage Jupyter notebooks for preprocessing data and to train and
deploy machine learning models.

To create a SageMaker notebook instance

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Notebook instances, and then choose Create notebook instance.
3. On the Create notebook instance page, provide the following information (if a field is not
mentioned, leave the default values):

a. For Notebook instance name, type a name for your notebook instance.
b. For Notebook Instance type, choose ml.t2.medium. This is the least expensive instance type
that notebook instances support, and it suffices for this exercise. If a ml.t2.medium instance
type isn't available in your current AWS Region, choose ml.t3.medium.
c. For Platform Identifier, choose a platform type to create the notebook instance on. This
platform type dictates the Operating System and the JupyterLab version that your notebook
instance is created with. For information about platform identifier type, see Amazon Linux 2 vs
Amazon Linux notebook instances (p. 205). For information about JupyterLab versions, see
JupyterLab versioning (p. 208).
d. For IAM role, choose Create a new role, and then choose Create role. This IAM role
automatically gets permissions to access any S3 bucket that has sagemaker in the name. It
gets these permissions through the AmazonSageMakerFullAccess policy, which SageMaker
attaches to the role.
Note
If you want to grant the IAM role permission to access S3 buckets without sagemaker
in the name, you need to attach the S3FullAccess policy or limit the permissions to specific S3 buckets to the IAM role. For more information and examples of adding
bucket policies to the IAM role, see Bucket Policy Examples.
e. Choose Create notebook instance.

In a few minutes, SageMaker launches an ML compute instance—in this case, a notebook instance—and attaches a 5 GB Amazon EBS storage volume to it. The notebook instance has a preconfigured Jupyter notebook server, SageMaker and AWS SDK libraries, and a set of Anaconda libraries.

For more information about creating a SageMaker notebook instance, see Create a Notebook
Instance.

(Optional) Change SageMaker Notebook Instance Settings


If you want to change the ML compute instance type or the size of the Amazon EBS storage of a
SageMaker notebook instance that's already created, you can edit the notebook instance settings.

To change and update the SageMaker Notebook instance type and the EBS volume

1. On the Notebook instances page in the SageMaker console, choose your notebook instance.
2. Choose Actions, choose Stop, and then wait until the notebook instance fully stops.
3. After the notebook instance status changes to Stopped, choose Actions, and then choose Update
settings.

a. For Notebook instance type, choose a different ML instance type.


b. For Volume size in GB, type a different integer to specify a new EBS volume size.
Note
EBS storage volumes are encrypted, so SageMaker can't determine the amount of
available free space on the volume. Because of this, you can increase the volume size
when you update a notebook instance, but you can't decrease the volume size. If you
want to decrease the size of the ML storage volume in use, create a new notebook
instance with the desired size.
4. At the bottom of the page, choose Update notebook instance.
5. When the update is complete, Start the notebook instance with the new settings.

For more information about updating SageMaker notebook instance settings, see Update a Notebook
Instance.

(Optional) Advanced Settings for SageMaker Notebook Instances
The following tutorial video shows how to set up and use SageMaker notebook instances through the
SageMaker console with advanced options, such as SageMaker lifecycle configuration and importing
GitHub repositories. (Length: 26:04)

For complete documentation about SageMaker notebook instances, see Use Amazon SageMaker Notebook Instances.

Step 2: Create a Jupyter Notebook


To start scripting for training and deploying your model, create a Jupyter notebook in the SageMaker
notebook instance. Using the Jupyter notebook, you can conduct machine learning (ML) experiments for
training and inference while accessing the SageMaker features and the AWS infrastructure.


To create a Jupyter notebook

1. Open the notebook instance as follows:

a. Sign in to the SageMaker console at https://console.aws.amazon.com/sagemaker/.


b. On the Notebook instances page, open your notebook instance by choosing either Open
JupyterLab for the JupyterLab interface or Open Jupyter for the classic Jupyter view.
Note
If the notebook instance status shows Pending in the Status column, your notebook
instance is still being created. The status will change to InService when the notebook
instance is ready for use.
2. Create a notebook as follows:

• If you opened the notebook in the JupyterLab view, on the File menu, choose New, and then
choose Notebook. For Select Kernel, choose conda_python3. This preinstalled environment
includes the default Anaconda installation and Python 3.
• If you opened the notebook in the classic Jupyter view, on the Files tab, choose New, and then
choose conda_python3. This preinstalled environment includes the default Anaconda installation
and Python 3.
3. Save the notebooks as follows:

• In the JupyterLab view, choose File, choose Save Notebook As..., and then rename the notebook.
• In the Jupyter classic view, choose File, choose Save as..., and then rename the notebook.

Step 3: Download, Explore, and Transform a Dataset


In this step, you load the Adult Census dataset to your notebook instance using the SHAP (SHapley
Additive exPlanations) Library, review the dataset, transform it, and upload it to Amazon S3. SHAP is a
game theoretic approach to explain the output of any machine learning model. For more information
about SHAP, see Welcome to the SHAP documentation.

To run the following example, paste the sample code into a cell in your notebook instance.

Load Adult Census Dataset Using SHAP


Using the SHAP library, import the Adult Census dataset as shown following:

import shap
X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)
feature_names = list(X.columns)
feature_names

Note
If the current Jupyter kernel does not have the SHAP library, install it by running the following
conda command:

%conda install -c conda-forge shap

If you're using JupyterLab, you must manually refresh the kernel after the installation and
updates have completed. Run the following IPython script to shut down the kernel (the kernel
will restart automatically):

import IPython
IPython.Application.instance().kernel.do_shutdown(True)

The feature_names list object should return the following list of features:

['Age',
'Workclass',
'Education-Num',
'Marital Status',
'Occupation',
'Relationship',
'Race',
'Sex',
'Capital Gain',
'Capital Loss',
'Hours per week',
'Country']

Tip
If you're starting with unlabeled data, you can use Amazon SageMaker Ground Truth to create a
data labeling workflow in minutes. To learn more, see Label Data.

Overview of the Dataset


Run the following script to display the statistical overview of the dataset and histograms of the numeric
features.

display(X.describe())
hist = X.hist(bins=30, sharey=True, figsize=(20, 10))


Tip
If you want to use a dataset that needs to be cleaned and transformed, you can simplify and
streamline data preprocessing and feature engineering using Amazon SageMaker Data Wrangler.
To learn more, see Prepare ML Data with Amazon SageMaker Data Wrangler.

Split the Dataset into Train, Validation, and Test Datasets


Using Sklearn, split the dataset into a training set and a test set. The training set is used to train the
model, while the test set is used to evaluate the performance of the final trained model. The dataset is randomly shuffled with a fixed random seed: 80 percent of the dataset is used for the training set and 20 percent for the test set.

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train_display = X_display.loc[X_train.index]

Split the training set to separate out a validation set. The validation set is used to evaluate the
performance of the trained model while tuning the model's hyperparameters. 75 percent of the training
set becomes the final training set, and the rest is the validation set.

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
X_train_display = X_display.loc[X_train.index]
X_val_display = X_display.loc[X_val.index]

Using the pandas package, explicitly align each dataset by concatenating the numeric features with the
true labels.

import pandas as pd
train = pd.concat([pd.Series(y_train, index=X_train.index,
name='Income>50K', dtype=int), X_train], axis=1)
validation = pd.concat([pd.Series(y_val, index=X_val.index,
name='Income>50K', dtype=int), X_val], axis=1)
test = pd.concat([pd.Series(y_test, index=X_test.index,
name='Income>50K', dtype=int), X_test], axis=1)

Check if the dataset is split and structured as expected:

train


validation

test

Convert the Train and Validation Datasets to CSV Files


Convert the train and validation dataframe objects to CSV files to match the input file format for
the XGBoost algorithm.

# Use 'csv' format to store the data
# The first column is expected to be the output column
train.to_csv('train.csv', index=False, header=False)
validation.to_csv('validation.csv', index=False, header=False)

Upload the Datasets to Amazon S3


Using the SageMaker Python SDK and Boto3, upload the training and validation datasets to the default Amazon S3 bucket. The datasets in the S3 bucket will be used by a compute-optimized SageMaker instance on Amazon EC2 for training.


The following code sets up the default S3 bucket URI for your current SageMaker session, creates a
new demo-sagemaker-xgboost-adult-income-prediction folder, and uploads the training and
validation datasets to the data subfolder.

import sagemaker, boto3, os


bucket = sagemaker.Session().default_bucket()
prefix = "demo-sagemaker-xgboost-adult-income-prediction"

boto3.Session().resource('s3').Bucket(bucket).Object(
os.path.join(prefix, 'data/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(
os.path.join(prefix, 'data/validation.csv')).upload_file('validation.csv')

Run the following AWS CLI command to check whether the CSV files are successfully uploaded to the S3 bucket.

! aws s3 ls {bucket}/{prefix}/data --recursive

This should list the train.csv and validation.csv objects under the data prefix.

Step 4: Train a Model


The Amazon SageMaker Python SDK provides framework estimators and generic estimators to train your model while orchestrating the machine learning (ML) lifecycle, accessing SageMaker features for training and AWS infrastructure such as Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Simple Storage Service (Amazon S3). For more information about SageMaker built-in framework estimators, see Frameworks in the Amazon SageMaker Python SDK documentation. For more information about built-in algorithms, see Use Amazon SageMaker Built-in Algorithms or Pre-trained Models (p. 1281).

Topics
• Choose the Training Algorithm (p. 94)
• Create and Run a Training Job (p. 94)

Choose the Training Algorithm


To choose the right algorithm for your dataset, you typically need to evaluate different models to find the one most suitable for your data. For simplicity, this tutorial uses the SageMaker XGBoost Algorithm (p. 1369) built-in algorithm without pre-evaluating models.
Tip
If you want SageMaker to find an appropriate model for your tabular dataset, use Amazon SageMaker Autopilot, which automates a machine learning solution. For more information, see
Automate model development with Amazon SageMaker Autopilot (p. 467).

Create and Run a Training Job


After you have decided which model to use, start constructing a SageMaker estimator for training. This
tutorial uses the XGBoost built-in algorithm for the SageMaker generic estimator.

To run a model training job

1. Import the Amazon SageMaker Python SDK and start by retrieving the basic information from your
current SageMaker session.


import sagemaker

region = sagemaker.Session().boto_region_name
print("AWS Region: {}".format(region))

role = sagemaker.get_execution_role()
print("RoleArn: {}".format(role))

This returns the following information:

• region – The current AWS Region where the SageMaker notebook instance is running.
• role – The IAM role used by the notebook instance.

Note
Check the SageMaker Python SDK version by running sagemaker.__version__. This
tutorial is based on sagemaker>=2.20. If the SDK is outdated, install the latest version by
running the following command:

! pip install -qU sagemaker

If you run this installation in your existing SageMaker Studio or notebook instances, you
need to manually refresh the kernel to finish applying the version update.
2. Create an XGBoost estimator using the sagemaker.estimator.Estimator class. In the following
example code, the XGBoost estimator is named xgb_model.

from sagemaker.debugger import Rule, ProfilerRule, rule_configs


from sagemaker.session import TrainingInput

s3_output_location='s3://{}/{}/{}'.format(bucket, prefix, 'xgboost_model')

container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")


print(container)

xgb_model=sagemaker.estimator.Estimator(
image_uri=container,
role=role,
instance_count=1,
instance_type='ml.m4.xlarge',
volume_size=5,
output_path=s3_output_location,
sagemaker_session=sagemaker.Session(),
rules=[
Rule.sagemaker(rule_configs.create_xgboost_report()),
ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]
)

To construct the SageMaker estimator, specify the following parameters:

• image_uri – Specify the training container image URI. In this example, the SageMaker XGBoost
training container URI is specified using sagemaker.image_uris.retrieve.
• role – The AWS Identity and Access Management (IAM) role that SageMaker uses to perform
tasks on your behalf (for example, reading training results, calling model artifacts from Amazon S3,
and writing training results to Amazon S3).
• instance_count and instance_type – The type and number of Amazon EC2 ML compute
instances to use for model training. For this training exercise, you use a single ml.m4.xlarge

instance, which has 4 CPUs, 16 GB of memory, Amazon Elastic Block Store (Amazon EBS) storage, and high network performance. For more information about EC2 compute instance types, see Amazon EC2 Instance Types. For more information about billing, see Amazon SageMaker pricing.
• volume_size – The size, in GB, of the EBS storage volume to attach to the training instance. This
must be large enough to store training data if you use File mode (File mode is on by default). If
you don't specify this parameter, its value defaults to 30.
• output_path – The path to the S3 bucket where SageMaker stores the model artifact and
training results.
• sagemaker_session – The session object that manages interactions with SageMaker API
operations and other AWS services that the training job uses.
• rules – Specify a list of SageMaker Debugger built-in rules. In this example, the
create_xgboost_report() rule creates an XGBoost report that provides insights into the
training progress and results, and the ProfilerReport() rule creates a report regarding the EC2
compute resource utilization. For more information, see SageMaker Debugger XGBoost Training
Report (p. 1685).

Tip
If you want to run distributed training of large deep learning models, such as
convolutional neural networks (CNN) and natural language processing (NLP) models, use
SageMaker Distributed for data parallelism or model parallelism. For more information, see
Distributed Training in Amazon SageMaker (p. 1821).
3. Set the hyperparameters for the XGBoost algorithm by calling the set_hyperparameters
method of the estimator. For a complete list of XGBoost hyperparameters, see XGBoost
Hyperparameters (p. 1377).

xgb_model.set_hyperparameters(
max_depth = 5,
eta = 0.2,
gamma = 4,
min_child_weight = 6,
subsample = 0.7,
objective = "binary:logistic",
num_round = 1000
)

Tip
You can also tune the hyperparameters using the SageMaker hyperparameter
optimization feature. For more information, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
4. Use the TrainingInput class to configure a data input flow for training. The following example
code shows how to configure TrainingInput objects to use the training and validation
datasets you uploaded to Amazon S3 in the Split the Dataset into Train, Validation, and Test
Datasets (p. 92) section.

from sagemaker.session import TrainingInput

train_input = TrainingInput(
"s3://{}/{}/{}".format(bucket, prefix, "data/train.csv"), content_type="csv"
)
validation_input = TrainingInput(
"s3://{}/{}/{}".format(bucket, prefix, "data/validation.csv"), content_type="csv"
)

5. To start model training, call the estimator's fit method with the training and validation datasets.
By setting wait=True, the fit method displays progress logs and waits until training is complete.


xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)

For more information about model training, see Train a Model with Amazon SageMaker (p. 10). This
tutorial training job might take up to 10 minutes.

After the training job has completed, you can download an XGBoost training report and a profiling
report generated by SageMaker Debugger. The XGBoost training report offers you insights into the
training progress and results, such as the loss function with respect to iteration, feature importance,
confusion matrix, accuracy curves, and other statistical results of training. For example, you can find
the following loss curve from the XGBoost training report, which clearly indicates that there is an
overfitting problem.

Run the following code to specify the S3 bucket URI where the Debugger training reports are
generated and check if the reports exist.

rule_output_path = xgb_model.output_path + "/" + xgb_model.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive

Download the Debugger XGBoost training and profiling reports to the current workspace:

! aws s3 cp {rule_output_path} ./ --recursive

Run the following IPython script to get the file link of the XGBoost training report:

from IPython.display import FileLink, FileLinks

display("Click link below to view the XGBoost Training report",
        FileLink("CreateXgboostReport/xgboost_report.html"))

The following IPython script returns the file link of the Debugger profiling report that shows
summaries and details of the EC2 instance resource utilization, system bottleneck detection results,
and Python operation profiling results:

profiler_report_name = [rule["RuleConfigurationName"]
for rule in xgb_model.latest_training_job.rule_job_summary()
if "Profiler" in rule["RuleConfigurationName"]][0]
profiler_report_name
display("Click link below to view the profiler report", FileLink(profiler_report_name
+"/profiler-output/profiler-report.html"))

Tip
If the HTML reports do not render plots in the JupyterLab view, you must choose Trust
HTML at the top of the reports.
To identify training issues, such as overfitting, vanishing gradients, and other problems
that prevent your model from converging, use SageMaker Debugger and take automated
actions while prototyping and training your ML models. For more information, see Debug
and Profile Training Jobs Using Amazon SageMaker Debugger (p. 1649). To find a complete
analysis of model parameters, see the Explainability with Amazon SageMaker Debugger
example notebook.

You now have a trained XGBoost model. SageMaker stores the model artifact in your S3 bucket. To
find the location of the model artifact, run the following code to print the model_data attribute of the
xgb_model estimator:

xgb_model.model_data

Tip
To measure biases that can occur during each stage of the ML lifecycle (data collection, model
training and tuning, and monitoring of ML models deployed for prediction), use SageMaker
Clarify. For more information, see Amazon SageMaker Clarify Model Explainability (p. 2093). For
an end-to-end example of it, see the Fairness and Explainability with SageMaker Clarify example
notebook.

Step 5: Deploy the Model to Amazon EC2


To get predictions, deploy your model to Amazon EC2 using Amazon SageMaker.

Topics
• Deploy the Model to SageMaker Hosting Services (p. 98)
• (Optional) Use SageMaker Predictor to Reuse the Hosted Endpoint (p. 99)
• (Optional) Make Prediction with Batch Transform (p. 99)

Deploy the Model to SageMaker Hosting Services


To host a model through Amazon EC2 using Amazon SageMaker, deploy the model that you trained in
Create and Run a Training Job (p. 94) by calling the deploy method of the xgb_model estimator.
When you call the deploy method, you must specify the number and type of EC2 ML instances that you
want to use for hosting an endpoint.

import sagemaker


from sagemaker.serializers import CSVSerializer


xgb_predictor=xgb_model.deploy(
initial_instance_count=1,
instance_type='ml.t2.medium',
serializer=CSVSerializer()
)

• initial_instance_count (int) – The number of instances on which to deploy the model.
• instance_type (str) – The type of instance that you want to use to operate your deployed model.
• serializer – A serializer object that serializes input data of various formats (a NumPy array, list, file, or buffer) to a CSV-formatted string. We use this because the XGBoost algorithm accepts input files in CSV format.

The deploy method creates a deployable model, configures the SageMaker hosting services endpoint,
and launches the endpoint to host the model. For more information, see the SageMaker generic
Estimator's deploy class method in the Amazon SageMaker Python SDK. To retrieve the name of the
endpoint that's generated by the deploy method, run the following code:

xgb_predictor.endpoint_name

This should return the endpoint name of the xgb_predictor. The format of the endpoint name is
"sagemaker-xgboost-YYYY-MM-DD-HH-MM-SS-SSS". This endpoint stays active in the ML instance,
and you can make instantaneous predictions at any time unless you shut it down later. Copy this
endpoint name and save it to reuse and make real-time predictions elsewhere in SageMaker Studio or
SageMaker notebook instances.
Tip
To learn more about compiling and optimizing your model for deployment to Amazon EC2
instances or edge devices, see Compile and Deploy Models with Neo.

(Optional) Use SageMaker Predictor to Reuse the Hosted Endpoint
After you deploy the model to an endpoint, you can set up a new SageMaker predictor by pairing it with the endpoint, and then continuously make real-time predictions from any other notebook. The following example
code shows how to use the SageMaker Predictor class to set up a new predictor object using the same
endpoint. Re-use the endpoint name that you used for the xgb_predictor.

import sagemaker
xgb_predictor_reuse=sagemaker.predictor.Predictor(
endpoint_name="sagemaker-xgboost-YYYY-MM-DD-HH-MM-SS-SSS",
sagemaker_session=sagemaker.Session(),
serializer=sagemaker.serializers.CSVSerializer()
)

The xgb_predictor_reuse Predictor behaves exactly the same as the original xgb_predictor. For
more information, see the SageMaker Predictor class in the Amazon SageMaker Python SDK.

(Optional) Make Prediction with Batch Transform


Instead of hosting an endpoint in production, you can run a one-time batch inference job to make
predictions on a test dataset using the SageMaker batch transform. After your model training has
completed, you can extend the estimator to a transformer object, which is based on the SageMaker
Transformer class. The batch transformer reads in input data from a specified S3 bucket and makes
predictions.


To run a batch transform job

1. Run the following code to convert the feature columns of the test dataset to a CSV file and upload it to the S3 bucket:

X_test.to_csv('test.csv', index=False, header=False)

boto3.Session().resource('s3').Bucket(bucket).Object(
os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')

2. Specify S3 bucket URIs of input and output for the batch transform job as shown following:

# The location of the test dataset
batch_input = 's3://{}/{}/test'.format(bucket, prefix)

# The location to store the results of the batch transform job
batch_output = 's3://{}/{}/batch-prediction'.format(bucket, prefix)

3. Create a transformer object specifying the minimal number of parameters: the instance_count
and instance_type parameters to run the batch transform job, and the output_path to save
prediction data as shown following:

transformer = xgb_model.transformer(
instance_count=1,
instance_type='ml.m4.xlarge',
output_path=batch_output
)

4. Initiate the batch transform job by executing the transform() method of the transformer object
as shown following:

transformer.transform(
data=batch_input,
data_type='S3Prefix',
content_type='text/csv',
split_type='Line'
)
transformer.wait()

5. When the batch transform job is complete, SageMaker creates the test.csv.out prediction data
saved in the batch_output path, which should be in the following format: s3://sagemaker-
<region>-111122223333/demo-sagemaker-xgboost-adult-income-prediction/batch-
prediction. Run the following AWS CLI to download the output data of the batch transform job:

! aws s3 cp {batch_output} ./ --recursive

This should create the test.csv.out file under the current working directory. You'll be able to see
the float values that are predicted based on the logistic regression of the XGBoost training job.

Step 6: Evaluate the Model


Now that you have trained and deployed a model using Amazon SageMaker, evaluate the model to
ensure that it generates accurate predictions on new data. For model evaluation, use the test dataset
that you created in Step 3: Download, Explore, and Transform a Dataset (p. 90).


Evaluate the Model Deployed to SageMaker Hosting Services


To evaluate the model and use it in production, invoke the endpoint with the test dataset and check
whether the inferences you get return the target accuracy you want to achieve.

To evaluate the model

1. Set up the following function to predict each line of the test set. In the following example code, the
rows argument specifies the number of lines to predict at a time. You can change its value to perform batch inference that fully utilizes the instance's hardware resources.

import numpy as np
def predict(data, rows=1000):
split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
predictions = ''
for array in split_array:
predictions = ','.join([predictions,
xgb_predictor.predict(array).decode('utf-8')])
return np.fromstring(predictions[1:], sep=',')

2. Run the following code to make predictions on the test dataset and plot a histogram. Use only the feature columns of the test dataset, excluding the 0th column, which holds the actual values.

import matplotlib.pyplot as plt

predictions=predict(test.to_numpy()[:,1:])
plt.hist(predictions)
plt.show()

3. The predicted values are float type. To determine True or False based on the float values, you
need to set a cutoff value. As shown in the following example code, use the Scikit-learn library to
return the confusion matrix and classification report with a cutoff of 0.5.

import sklearn

cutoff=0.5
print(sklearn.metrics.confusion_matrix(test.iloc[:, 0], np.where(predictions > cutoff,
1, 0)))
print(sklearn.metrics.classification_report(test.iloc[:, 0], np.where(predictions >
cutoff, 1, 0)))

This should return the confusion matrix and classification report for the 0.5 cutoff.


4. To find the best cutoff with the given test set, compute the log loss function of the logistic
regression. The log loss function is defined as the negative log-likelihood of a logistic model that
returns prediction probabilities for its ground truth labels. The following example code numerically
and iteratively calculates the log loss values -(y*log(p) + (1-y)*log(1-p)), where y is the true
label and p is a probability estimate of the corresponding test sample. It returns a log loss versus
cutoff graph.

import matplotlib.pyplot as plt

cutoffs = np.arange(0.01, 1, 0.01)


log_loss = []
for c in cutoffs:
log_loss.append(
sklearn.metrics.log_loss(test.iloc[:, 0], np.where(predictions > c, 1, 0))
)

plt.figure(figsize=(15,10))
plt.plot(cutoffs, log_loss)
plt.xlabel("Cutoff")
plt.ylabel("Log loss")
plt.show()

This should return the following log loss curve.

5. Find the minimum point of the error curve using the NumPy argmin and min functions:

print(
    'Log loss is minimized at a cutoff of ', cutoffs[np.argmin(log_loss)],
    ', and the log loss value at the minimum is ', np.min(log_loss)
)

This should return: Log loss is minimized at a cutoff of 0.53, and the log loss
value at the minimum is 4.348539186773897.

Instead of computing and minimizing the log loss function, you can estimate a cost function as an alternative. For example, if you want to train a model to perform binary classification for a business problem such as customer churn prediction, you can assign weights to the elements of the confusion matrix and calculate the cost function accordingly, as in the sketch that follows.
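The following sketch illustrates that idea under assumed business costs. The weights in cost_matrix are hypothetical, and test and predictions come from the earlier evaluation steps.

import numpy as np
import sklearn.metrics

# Hypothetical business costs: rows are true labels, columns are
# predictions. Here a false negative costs five times more than a
# false positive, and correct predictions cost nothing.
cost_matrix = np.array([[0, 1],
                        [5, 0]])

costs = []
cutoffs = np.arange(0.01, 1, 0.01)
for c in cutoffs:
    cm = sklearn.metrics.confusion_matrix(
        test.iloc[:, 0], np.where(predictions > c, 1, 0))
    costs.append((cm * cost_matrix).sum())

print('Cost is minimized at a cutoff of', cutoffs[np.argmin(costs)])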

You have now trained, deployed, and evaluated your first model in SageMaker.
Tip
To monitor model quality, data quality, and bias drift, use Amazon SageMaker Model Monitor
and SageMaker Clarify. To learn more, see Amazon SageMaker Model Monitor, Monitor Data
Quality, Monitor Model Quality, Monitor Bias Drift, and Monitor Feature Attribution Drift.
Tip
To get human review of low confidence ML predictions or a random sample of predictions, use
Amazon Augmented AI human review workflows. For more information, see Using Amazon
Augmented AI for Human Review.

Step 7: Clean Up
To avoid incurring unnecessary charges, use the AWS Management Console to delete the endpoints and
resources that you created while running the exercises.
Note
Training jobs and logs cannot be deleted and are retained indefinitely.
Note
If you plan to explore other exercises in this guide, you might want to keep some of these
resources, such as your notebook instance, S3 bucket, and IAM role.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/ and delete the following resources:
• The endpoint. Deleting the endpoint also deletes the ML compute instance or instances that
support it.

1. Under Inference, choose Endpoints.


2. Choose the endpoint that you created in the example, choose Actions, and then choose Delete.
• The endpoint configuration.

1. Under Inference, choose Endpoint configurations.


2. Choose the endpoint configuration that you created in the example, choose Actions, and then
choose Delete.
• The model.

1. Under Inference, choose Models.


2. Choose the model that you created in the example, choose Actions, and then choose Delete.
• The notebook instance. Before deleting the notebook instance, stop it.

1. Under Notebook, choose Notebook instances.


2. Choose the notebook instance that you created in the example, choose Actions, and then
choose Stop. The notebook instance takes several minutes to stop. When the Status changes to
Stopped, move on to the next step.
3. Choose Actions, and then choose Delete.
2. Open the Amazon S3 console at https://console.aws.amazon.com/s3/, and then delete the bucket
that you created for storing model artifacts and the training dataset.
3. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/, and then
delete all of the log groups that have names starting with /aws/sagemaker/.

Amazon SageMaker Machine Learning Environments
Amazon SageMaker supports the following machine learning environments.

• Amazon SageMaker Studio: Lets you build, train, debug, deploy, and monitor your machine learning
models.
• Amazon SageMaker Notebook Instances: Lets you prepare and process data, and train and deploy
machine learning models from a compute instance running the Jupyter Notebook application.
• Amazon SageMaker Studio Lab: Studio Lab is a free service that gives you access to AWS compute
resources, in an environment based on open-source JupyterLab, without requiring an AWS account.
• Amazon SageMaker Canvas: Gives you the ability to use machine learning to generate predictions
without needing to code.
• Amazon SageMaker geospatial: Gives you the ability to build, train, and deploy geospatial models.
• RStudio on Amazon SageMaker: RStudio is an IDE for R, with a console, syntax-highlighting editor
that supports direct code execution, and tools for plotting, history, debugging and workspace
management.

To use these machine learning environments, except Studio Lab and SageMaker Notebook Instances,
you or your organization's administrator must create an Amazon SageMaker Domain. Studio Lab has a
separate onboarding process.

Topics
• Amazon SageMaker Domain (p. 105)
• Amazon SageMaker Studio (p. 128)
• Amazon SageMaker Notebook Instances (p. 204)
• Amazon SageMaker Studio Lab (p. 230)
• Amazon SageMaker Canvas (p. 258)
• Amazon SageMaker geospatial capabilities (p. 401)
• RStudio on Amazon SageMaker (p. 432)

Amazon SageMaker Domain


Amazon SageMaker Domain supports SageMaker machine learning (ML) environments. A SageMaker
Domain is composed of the following entities. For onboarding steps to create a Domain, see Onboard to
Amazon SageMaker Domain (p. 37).

• Domain: An Amazon SageMaker Domain consists of an associated Amazon Elastic File System (Amazon
EFS) volume; a list of authorized users; and a variety of security, application, policy, and Amazon
Virtual Private Cloud (Amazon VPC) configurations. Users within a Domain can share notebook files
and other artifacts with each other. An account can have multiple Domains. For more information
about multiple Domains, see Multiple Domains Overview (p. 108).
• UserProfile: A user profile represents a single user within a Domain. It is the main way to reference a
user for the purposes of sharing, reporting, and other user-oriented features. This entity is created
when a user onboards to the Amazon SageMaker Domain. For more information about user profiles,
see Domain User Profiles (p. 118).


• shared space: A shared space consists of a shared JupyterServer application and a shared directory. All user profiles in a Domain have access to all shared spaces in the Domain. For more information about shared spaces, see Collaborate with shared spaces (p. 123).
• App: An app represents an application that supports the reading and execution experience of the
user’s notebooks, terminals, and consoles. The type of app can be JupyterServer, KernelGateway,
RStudioServerPro, or RSession. A user may have multiple apps active simultaneously.

The following tables describe the status values for the Domain, UserProfile, shared space, and App
entities. Where applicable, they also give troubleshooting steps.

Domain status values

Pending – Ongoing creation of the Domain.
InService – Successful creation of the Domain.
Updating – Ongoing update of the Domain.
Deleting – Ongoing deletion of the Domain.
Failed – Unsuccessful creation of the Domain. Call the DescribeDomain API to see the failure reason for Domain creation. Delete the failed Domain and recreate it after fixing the error mentioned in FailureReason.
Update_Failed – Unsuccessful update of the Domain. Call the DescribeDomain API to see the failure reason for Domain update. Call the UpdateDomain API after fixing the error mentioned in FailureReason.
Delete_Failed – Unsuccessful deletion of the Domain. Call the DescribeDomain API to see the failure reason for Domain deletion. Because deletion failed, you might have some resources that are still running, but you cannot use or update the Domain. Call the DeleteDomain API again after fixing the error mentioned in FailureReason.

UserProfile status values

Pending – Ongoing creation of the UserProfile.
InService – Successful creation of the UserProfile.
Updating – Ongoing update of the UserProfile.
Deleting – Ongoing deletion of the UserProfile.
Failed – Unsuccessful creation of the UserProfile. Call the DescribeUserProfile API to see the failure reason for UserProfile creation. Delete the failed UserProfile and recreate it after fixing the error mentioned in FailureReason.
Update_Failed – Unsuccessful update of the UserProfile. Call the DescribeUserProfile API to see the failure reason for UserProfile update. Call the UpdateUserProfile API again after fixing the error mentioned in FailureReason.
Delete_Failed – Unsuccessful deletion of the UserProfile. Call the DescribeUserProfile API to see the failure reason for UserProfile deletion. Because deletion failed, you might have some resources that are still running, but you cannot use or update the UserProfile. Call the DeleteUserProfile API again after fixing the error mentioned in FailureReason.

shared space status values

Pending – Ongoing creation of the shared space.
InService – Successful creation of the shared space.
Deleting – Ongoing deletion of the shared space.
Failed – Unsuccessful creation of the shared space. Call the DescribeSpace API to see the failure reason for shared space creation. Delete the failed shared space and recreate it after fixing the error mentioned in FailureReason.
Update_Failed – Unsuccessful update of the shared space. Call the DescribeSpace API to see the failure reason for shared space update. Call the UpdateSpace API again after fixing the error mentioned in FailureReason.
Delete_Failed – Unsuccessful deletion of the shared space. Call the DescribeSpace API to see the failure reason for shared space deletion. Because deletion failed, you might have some resources that are still running, but you cannot use or update the shared space. Call the DeleteSpace API again after fixing the error mentioned in FailureReason.
Deleted – Successful deletion of the shared space.

App status values

Pending – Ongoing creation of the App.
InService – Successful creation of the App.
Deleting – Ongoing deletion of the App.
Failed – Unsuccessful creation of the App. Call the DescribeApp API to see the failure reason for App creation. Call the CreateApp API again after fixing the error mentioned in FailureReason.
Deleted – Successful deletion of the App.

Topics
• Prerequisites (p. 108)
• Multiple Domains Overview (p. 108)
• Domain resource isolation (p. 110)
• Setting Defaults for a Domain (p. 112)
• Environment (p. 114)
• View and Edit Domains (p. 114)
• Delete an Amazon SageMaker Domain (p. 116)
• Domain User Profiles (p. 118)
• IAM Identity Center Groups in a Domain (p. 122)
• Collaborate with shared spaces (p. 123)

Prerequisites
To use the features available in an Amazon SageMaker Domain, you must first onboard to a Domain. For
more information, see Onboard to Amazon SageMaker Domain.

If you are interacting with your Domain using the AWS CLI, you must also complete the following
prerequisites.

• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.

Multiple Domains Overview


Amazon SageMaker supports the creation of multiple Amazon SageMaker Domains in a single AWS
Region for each account. Additional Domains in a Region have the same features and capabilities as the
first Domain in a Region. Each Domain can have distinct Domain settings. The same user profile cannot
be added to multiple Domains in a single Region within the same account. For more information about
Domain limits, see Amazon SageMaker endpoints and quotas.

Topics
• Automatic tag propagation (p. 109)
• Scoping each Domain (p. 109)
• Backfilling Domain tags (p. 110)


Automatic tag propagation


By default, any SageMaker resources that support tagging and are created from within the Studio
UI after 11/30/2022 are automatically tagged with a Domain ARN tag. The Domain ARN tag is based
on the Domain ID of the Domain that the resource is created in. The following table lists the only
SageMaker resources that do not support automatic tag propagation, as well as the impacted API calls
where the tag is not returned because it was not automatically set.
Note
All SageMaker List APIs do not support tag-based resource isolation.
The default app, which manages the Studio UI, is not automatically tagged.

SageMaker resource: Affected API calls

• ImageVersionArn: describe-image-version, update-image-version, delete-image-version
• ModelCardExportJobArn: describe-model-card-export-job
• PipelineExecutionArn: retry-pipeline-execution, update-pipeline-execution, describe-pipeline-execution, describe-pipeline-definition-for-execution
• ModelPackageArn: describe-action
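To verify that a resource created in the Domain carries the Domain ARN tag, you can list its tags from the AWS CLI. The following is a minimal sketch; the model ARN is a placeholder for any tagged SageMaker resource in your account.

aws sagemaker list-tags \
    --region region \
    --resource-arn arn:aws:sagemaker:region:account-id:model/model-name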

Scoping each Domain


You can use Domain ARN tags to enable Domain-level SageMaker resource isolation by modifying the
IAM execution role of your Domain. With resource isolation, SageMaker resources, such as models,
experiments, training jobs, and pipelines created in one Domain, cannot be accessed from other
Domains.

You can also use these tags for cost allocation using AWS Billing and Cost Management. For more
information, see Using AWS cost allocation tags.

To enable resource isolation, you must modify the IAM execution role of your Domain, as follows.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CreateAPIs",
            "Effect": "Allow",
            "Action": "sagemaker:Create*",
            "NotResource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:space/*"
            ]
        },
        {
            "Sid": "ResourceAccessRequireDomainTag",
            "Effect": "Allow",
            "Action": [
                "sagemaker:Update*",
                "sagemaker:Delete*",
                "sagemaker:Describe*"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:domain-arn": "domain-arn"
                }
            }
        },
        {
            "Sid": "AllowActionsThatDontSupportTagging",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeImageVersion",
                "sagemaker:UpdateImageVersion",
                "sagemaker:DeleteImageVersion",
                "sagemaker:DescribeModelCardExportJob",
                "sagemaker:RetryPipelineExecution",
                "sagemaker:DescribePipelineExecution",
                "sagemaker:UpdatePipelineExecution",
                "sagemaker:DescribeAction"
            ],
            "Resource": "*"
        },
        {
            "Sid": "DeleteDefaultApp",
            "Effect": "Allow",
            "Action": "sagemaker:DeleteApp",
            "Resource": "arn:aws:sagemaker:*:*:app/domain-ID/*/jupyterserver/default"
        }
    ]
}
}

Backfilling Domain tags


If you have created resources in a Domain before 11/30/2022, those resources are not automatically
tagged with the Domain Amazon Resource Name (ARN) tag.

To accurately attribute resources to their respective Domain, you must add the Domain tag to existing
resources using the AWS CLI, as follows.

1. Map all existing SageMaker resources and their respective ARNs to the Domains that exist in your
account.
2. Run the following command from your local machine to tag the resource with the ARN of the
resource's respective Domain. This must be repeated for every SageMaker resource in your account.

aws resourcegroupstaggingapi tag-resources \
    --resource-arn-list arn:aws:sagemaker:region:account-id:space/domain-id/space-name \
    --tags sagemaker:domain-arn=arn:aws:sagemaker:region:account-id:domain/domain-id
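For step 1, you can enumerate the SageMaker resources in your account with the Resource Groups Tagging API as a starting point. This is a minimal sketch; it only returns resources that support tagging.

aws resourcegroupstaggingapi get-resources \
    --region region \
    --resource-type-filters sagemaker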

Domain resource isolation


You can isolate resources between each of the Domains in your account and Region using an AWS
Identity and Access Management policy. With this resource isolation, resources in one Domain cannot be
accessed from within another Domain. The following topic shows how to create a new IAM policy that
limits access to resources in the Domain to user profiles with the Domain tag, as well as how to attach
this policy to the IAM execution role of the Domain. You must repeat this process for each Domain in
your account.


Console
The following section shows how to create a new IAM policy that limits access to resources in the Domain
to user profiles with the Domain tag, as well as how to attach this policy to the IAM execution role of the
Domain, from the Amazon SageMaker console.

1. Create an IAM policy named StudioDomainResourceIsolationPolicy-domain-id with the following JSON policy document by completing the steps in Creating IAM policies (console).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CreateAPIs",
            "Effect": "Allow",
            "Action": [
                "sagemaker:Create*"
            ],
            "NotResource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:space/*"
            ]
        },
        {
            "Sid": "ResourceAccessRequireDomainTag",
            "Effect": "Allow",
            "Action": [
                "sagemaker:Update*",
                "sagemaker:Delete*",
                "sagemaker:Describe*"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:domain-arn": "arn:aws:sagemaker:region:account-id:domain/domain-id"
                }
            }
        }
    ]
}

2. Attach the StudioDomainResourceIsolationPolicy-domain-id policy to the Domain's execution role by completing the steps in Modifying a role (console).

AWS CLI
The following section shows how to create a new IAM policy that limits access to resources in the Domain
to user profiles with the Domain tag, as well as how to attach this policy to the execution role of the
Domain, from the AWS CLI.

1. Create a file named StudioDomainResourceIsolationPolicy-domain-id with the following content from your local machine.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CreateAPIs",
            "Effect": "Allow",
            "Action": [
                "sagemaker:Create*"
            ],
            "NotResource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:space/*"
            ]
        },
        {
            "Sid": "ResourceAccessRequireDomainTag",
            "Effect": "Allow",
            "Action": [
                "sagemaker:Update*",
                "sagemaker:Delete*",
                "sagemaker:Describe*"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sagemaker:domain-arn": "arn:aws:sagemaker:region:account-id:domain/domain-id"
                }
            }
        }
    ]
}

2. Create a new IAM policy using the StudioDomainResourceIsolationPolicy-domain-id file.

aws iam create-policy \
    --policy-name StudioDomainResourceIsolationPolicy-domain-id \
    --policy-document file://StudioDomainResourceIsolationPolicy-domain-id

3. Attach the newly created policy to a new or existing role that is used as the Domain's execution role.

aws iam attach-role-policy \
    --policy-arn arn:aws:iam::account-id:policy/StudioDomainResourceIsolationPolicy-domain-id \
    --role-name domain-execution-role

Setting Defaults for a Domain


With SageMaker, you can set default settings for your resources at the Amazon SageMaker Domain level.
These default settings are used in the creation of resources within the Domain. The following sections list
default settings for Domain and give information on using context keys when setting defaults.

Topics
• Domain default settings (p. 112)
• Context keys (p. 113)

Domain default settings


You can set the following defaults when creating or updating a Domain. Values passed at the user profile
and shared space level override defaults set at the Domain level.

• DefaultUserSettings
• DefaultSpaceSettings


Note
DefaultSpaceSettings only supports the use of JupyterLab 3 image ARNs for
SageMakerImageArn. For more information, see JupyterLab Versioning (p. 135).

"DefaultSpaceSettings": {
"ExecutionRole": "string",
"JupyterServerAppSettings": {
"DefaultResourceSpec": {
"InstanceType": "string",
"LifecycleConfigArn": "string",
"SageMakerImageArn": "string",
"SageMakerImageVersionArn": "string"
},
"LifecycleConfigArns": [ "string" ]
},
"KernelGatewayAppSettings": {
"CustomImages": [
{
"AppImageConfigName": "string",
"ImageName": "string",
"ImageVersionNumber": number
}
],
"DefaultResourceSpec": {
"InstanceType": "string",
"LifecycleConfigArn": "string",
"SageMakerImageArn": "string",
"SageMakerImageVersionArn": "string"
},
"LifecycleConfigArns": [ "string" ]
},
"SecurityGroups": [ "string" ]
}
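As an illustration, the following AWS CLI call creates a Domain with a default execution role for user profiles. This is a minimal sketch; the role ARN, VPC ID, and subnet ID are placeholders that you must replace with your own values.

aws sagemaker create-domain \
    --region region \
    --domain-name domain-name \
    --auth-mode IAM \
    --vpc-id vpc-id \
    --subnet-ids subnet-id \
    --default-user-settings '{
        "ExecutionRole": "arn:aws:iam::account-id:role/execution-role"
    }'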

Context keys
You can add context keys to the IAM policy that creates a Domain. This restricts the values that users can
pass for those fields. The following list shows the context keys that Domain supports and where they're
implemented.

• sagemaker:ImageArns
  • Implemented as part of DefaultUserSettings: SageMakerImageArn in DefaultUserSettings.JupyterServerAppSettings and DefaultUserSettings.KernelGatewayAppSettings, and CustomImages in DefaultUserSettings.KernelGatewayAppSettings.
  • Implemented as part of DefaultSpaceSettings: SageMakerImageArn in DefaultSpaceSettings.JupyterServerAppSettings and DefaultSpaceSettings.KernelGatewayAppSettings, and CustomImages in DefaultSpaceSettings.KernelGatewayAppSettings.
• sagemaker:VpcSecurityGroupIds
  • Implemented as part of DefaultUserSettings: SecurityGroups in DefaultUserSettings.
  • Implemented as part of DefaultSpaceSettings: SecurityGroups in DefaultSpaceSettings.
• sagemaker:DomainSharingOutputKmsKey
  • Implemented as part of DefaultUserSettings: S3KmsKeyId in DefaultUserSettings.SharingSettings.
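As an illustration of a context key, the following IAM policy statement uses sagemaker:ImageArns to allow CreateDomain and UpdateDomain calls only when every image ARN passed in the defaults is on an allowed list. This is a sketch; the allowed image ARN is a placeholder.

{
    "Effect": "Allow",
    "Action": [
        "sagemaker:CreateDomain",
        "sagemaker:UpdateDomain"
    ],
    "Resource": "*",
    "Condition": {
        "ForAllValues:StringEquals": {
            "sagemaker:ImageArns": [
                "arn:aws:sagemaker:region:account-id:image/allowed-image-name"
            ]
        }
    }
}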


Context keys cannot prevent users from passing incompatible values for these defaults. The values for SageMakerImageArn set as part of DefaultUserSettings and DefaultSpaceSettings must be compatible, so you cannot set the following incompatible default values. For more information about the available JupyterLab version ARNs, see Setting a default JupyterLab version (p. 137).

• Only a JupyterLab version 1 ARN can be used for the SageMakerImageArn value in DefaultUserSettings
• Only a JupyterLab version 3 ARN can be used for the SageMakerImageArn value in DefaultSpaceSettings

Environment
This page gives information about modifications to the Amazon SageMaker Domain environment. This
includes custom images, lifecycle configurations, and Git repositories attached to a Domain environment.
These can also be attached to a shared space using the AWS CLI by passing values to the create-space
command using the space-settings parameter.

For more information about bringing a custom Amazon SageMaker Studio image, see Bring your own
SageMaker image.

For more information about bringing a custom RStudio image, see Bring your own image to RStudio on
SageMaker.

For instructions on using a lifecycle configuration with Studio, see Use Lifecycle Configurations with
Amazon SageMaker Studio.

For information about attaching a Git repository to a Domain, see Attach Suggested Git Repos to
SageMaker.

Complete the following procedure to view the custom images, lifecycle configurations, and git
repositories attached to a Domain environment.

Open the Environment page

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select a Domain to open the Domain details page.
4. On the Domain details page, choose the Environment tab.

View and Edit Domains


This topic shows how to view a list of your Amazon SageMaker Domains, view the details of a Domain,
and edit Domain settings from the Amazon SageMaker console or AWS Command Line Interface (AWS
CLI).

Topics
• View Domains (p. 114)
• Edit Domain settings (p. 115)

View Domains
The following section shows how to view a list of your Domains, and details of an individual Domain
from the SageMaker console or the AWS CLI.


Console
The console's Domain overview page gives information about the structure of a Domain, and it provides
a list of your Domains. The page's Domain structure diagram describes Domain components and how
they interact with each other.

The following procedure shows how to view a list of your Domains from the SageMaker console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.

To view the details of the Domain, complete the following procedure. This page gives information about
the general settings for the Domain, including the name, Domain ID, execution role used to create the
Domain, and the authentication method of the Domain.

1. From the list of Domains, select the Domain that you want to open the Domain settings page for.
2. On the Domain details page, choose the Domain settings tab.

AWS CLI
Run the following command from the terminal of your local machine to view a list of Domains from the
AWS CLI.

aws sagemaker list-domains --region region

Edit Domain settings


You can edit the settings of a Domain from the SageMaker console or the AWS CLI. The following
considerations apply when updating the settings of a Domain.

• If DefaultUserSettings and DefaultSpaceSettings are set, they cannot be unset.


• DefaultUserSettings.ExecutionRole can only be updated if there are no applications running in
any user profile within the Domain. This value cannot be unset.
• DefaultSpaceSettings.ExecutionRole can only be updated if there are no applications running
in any of shared spaces within the Domain. This value cannot be unset.
• If the Domain was created in VPC only mode, SageMaker automatically applies updates to the security
group settings defined for the Domain to all shared spaces created in the Domain.
• DomainId cannot be updated.

The following section shows how to edit Domain settings from the SageMaker console or the AWS CLI.

Console
You can edit the Domain from the SageMaker console using the following procedure.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to open the Domain settings page for.
4. On the Domain details page, choose the Domain settings tab.
5. Choose Edit.


AWS CLI
Run the following command from the terminal of your local machine to update a Domain from the AWS
CLI. For more information about the structure of default-user-settings, see CreateDomain.

aws sagemaker update-domain \
    --domain-id domain-id \
    --default-user-settings default-user-settings \
    --default-space-settings default-space-settings \
    --domain-settings-for-update settings-for-update \
    --region region

Delete an Amazon SageMaker Domain


A Domain consists of a list of authorized users, configuration settings, and an Amazon Elastic File System
(Amazon EFS) volume. The Amazon EFS volume contains data for the users, including notebooks,
resources, and artifacts. A user can have multiple applications (apps) which support the reading and
execution experience of the user’s notebooks, terminals, and consoles.

You can delete your Domain using one of the following:

• AWS console
• AWS Command Line Interface (AWS CLI)
• SageMaker SDK

The following sections explain how to delete a Domain and the requirements for doing so.

Requirements
You must satisfy the following requirements to delete a Domain.

• You must have admin permission to delete a Domain.


• You can only delete apps whose status is InService, which displays as Ready in the console. You don't need to delete apps whose status is Failed in order to delete the containing Domain. Attempting to delete an app that is in the Failed state results in an error.
• To delete a Domain, the Domain cannot contain any user profiles or shared spaces. To delete a user profile or shared space, the user profile or space cannot contain any non-failed apps.

When you delete these resources, the following occurs:

• App – The data (files and notebooks) in a user's home directory is saved. Unsaved notebook data is lost.
• User profile – The user can no longer sign in to the Domain. The user loses access to their home directory, but the data is not deleted. An admin can retrieve the data from the Amazon EFS volume where it is stored under the user's AWS account.

Additionally, to switch authentication modes from IAM to IAM Identity Center, you must delete the Domain.

EFS files
Your files are kept in an Amazon EFS volume as a backup. This backup includes the files in the mounted
directory, which is /home/sagemaker-user for Jupyter and /root for your kernel.

When you delete files from these mounted directories, the kernel or app may move the deleted files
into a hidden trash folder. If the trash folder is inside the mounted directory, those files are copied
into the Amazon EFS volume and will incur charges. To avoid these Amazon EFS charges, you must
identify and clean the trash folder location. The trash folder location for default apps and kernels is
~/.local/. This may vary depending on the Linux distribution used for custom apps or kernels. For
more information about the Amazon EFS volume, see Manage Your Amazon EFS Storage Volume in
SageMaker Studio (p. 198).
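For example, you can check for and empty a trash folder from a Studio terminal. This is a minimal sketch that assumes the common Linux trash location under ~/.local/share/Trash; verify the path used by your app or kernel image before deleting anything.

# Check how much space the trash folder uses (the exact path is an assumption).
du -sh ~/.local/share/Trash 2>/dev/null

# Permanently remove trashed files so they stop consuming Amazon EFS storage.
rm -rf ~/.local/share/Trash/*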

When you use the SageMaker console to delete the Domain, the Amazon EFS volume is detached but not
deleted. The same behavior occurs by default when you use the AWS CLI or the SageMaker Python SDK
to delete the Domain. However, when you use the AWS CLI or the SageMaker Python SDK, you can set
the RetentionPolicy to HomeEfsFileSystem=Delete to delete the Amazon EFS volume along with
the Domain.

Delete an Amazon SageMaker Domain (console)


To delete a Domain

1. Open the SageMaker console.


2. On the left navigation pane, choose Domains.
3. Select the Domain that you want to delete.
4. Repeat the following steps for each user in the User profiles list.

a. Choose the user.


b. On the User Details page, for each non-failed app in the Apps list, choose Action.
c. From the dropdown list, choose Delete.
d. On the Delete app dialog box, choose Yes, delete app. Then enter delete in the confirmation
field, and choose Delete.
e. When Status shows as Deleted for all apps, choose Edit.
f. On the Edit User page, choose Delete user.
g. On the Delete user dialog box, choose Yes, delete user. Then enter delete in the confirmation
field, and choose Delete.

Important
When a user is deleted, they lose access to the Amazon EFS volume that contains their data,
including notebooks and other artifacts. The data is not deleted and can be accessed by an
administrator.
5. When all users are deleted, choose the Space management tab.
6. Repeat the following steps for each shared space in the Spaces list.

a. Select the name of the shared space.


b. Choose Delete app for every app.
c. On the Delete app dialog box, choose Yes, delete app. Then enter delete in the confirmation
field, and choose Delete.
d. Choose Cancel.
e. Select the shared space.
f. Choose Delete.
g. On the Delete space dialog box, choose Yes, delete space. Then enter delete in the confirmation
field, and choose Delete space.
7. When all users and shared spaces are deleted, choose the Domain settings tab.
8. Choose Edit.
9. On the General settings page, choose Delete Domain.
10. On the Delete Domain dialog box, choose Yes, delete Domain. Then enter delete in the
confirmation field, and choose Delete.


Delete an Amazon SageMaker Domain (AWS CLI)


To delete a Domain

1. Retrieve the list of Domains in your account.

aws --region Region sagemaker list-domains

2. Retrieve the list of applications for the Domain to be deleted.

aws --region Region sagemaker list-apps \
    --domain-id-equals DomainId

3. Delete each application in the list.

aws --region Region sagemaker delete-app \
    --domain-id DomainId \
    --app-name AppName \
    --app-type AppType \
    --user-profile-name UserProfileName

4. Retrieve the list of user profiles in the Domain.

aws --region Region sagemaker list-user-profiles \
    --domain-id-equals DomainId

5. Delete each user profile in the list.

aws --region Region sagemaker delete-user-profile \
    --domain-id DomainId \
    --user-profile-name UserProfileName

6. Retrieve the list of shared spaces in the Domain.

aws --region Region sagemaker list-spaces \
    --domain-id DomainId

7. Delete each shared space in the list.

aws --region Region sagemaker delete-space \
    --domain-id DomainId \
    --space-name SpaceName

8. Delete the Domain. To also delete the Amazon EFS volume, specify HomeEfsFileSystem=Delete.

aws --region Region sagemaker delete-domain \
    --domain-id DomainId \
    --retention-policy HomeEfsFileSystem=Retain

Domain User Profiles


A user profile represents a single user within an Amazon SageMaker Domain. The user profile is the main
way to reference a user for the purposes of sharing, reporting, and other user-oriented features. This
entity is created when a user onboards to the Amazon SageMaker Domain. A user profile can have (at
most) a single JupyterServer application outside the context of a shared space. The user profile's Studio application is directly associated with the user profile and has an isolated Amazon EFS directory, an execution role associated with the user profile, and Kernel Gateway applications.

Topics
• Add and Remove User Profiles (p. 119)
• View User Profiles and User Profile Details (p. 121)

Add and Remove User Profiles


The following sections demonstrate how to add and remove user profiles from an Amazon SageMaker
Domain using the SageMaker console or the AWS Command Line Interface (AWS CLI).

Topics
• Add user profiles (p. 119)
• Remove user profiles (p. 120)

Add user profiles


The following section shows how to add user profiles to a Domain using the SageMaker console or the
AWS CLI.

Add user profiles from the console


You can add user profiles to a Domain from the SageMaker console by following this procedure.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to add a user profile to.
4. On the Domain details page, choose the User profiles tab.
5. Choose Add user. This opens a new page.
6. Use the default name for your user profile or add a custom name.
7. For Execution role, choose an option from the role selector. If you choose Enter a custom IAM role
ARN, the role must have, at a minimum, an attached trust policy that grants SageMaker permission
to assume the role (see the example trust policy after this procedure). For more information, see SageMaker Roles.

If you choose Create a new role, the Create an IAM role dialog box opens:

a. For S3 buckets you specify, specify additional Amazon S3 buckets that users of your notebooks
can access. If you don't want to add access to more buckets, choose None.
b. Choose Create role. SageMaker creates a new IAM role, AmazonSageMaker-
ExecutionPolicy, with the AmazonSageMakerFullAccess policy attached.
8. (Optional) Add tags to the user profile. All resources that the user profile creates will have a Domain
ARN tag and a user profile ARN tag. The Domain ARN tag is based on Domain ID, while the user
profile ARN tag is based on the user profile name.
9. Choose Next.
10. Under Default JupyterLab version, select a JupyterLab version from the dropdown to use as the
default for your user profile. For information about selecting a JupyterLab version, see JupyterLab
Versioning.
11. In the SageMaker Projects and JumpStart section, you have two options. You can accept the default
Project and JumpStart settings, or you can customize whether the user profile can create projects
and use JumpStart. For more information, see SageMaker Studio Permissions Required to Use
Projects.
12. Choose Next.


13. (Optional) If the Domain has an RStudio license associated, select whether you want to create the
user with one of the following authorizations:

• Unauthorized
• RStudio Admin
• RStudio User
14. Choose Next.
15. For the Canvas base permissions configuration, select whether to establish the minimum required
permissions to use the SageMaker Canvas application.
16. (Optional) For the Time series forecasting configuration: To grant user permissions for time series
forecasting in SageMaker Canvas, leave the Enable time series forecasting option turned on. It is
turned on by default.
17. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role. Alternatively, if you already have an IAM role with the required Amazon Forecast
permissions attached, select Use an existing execution role. For more information, see the IAM role
setup method (p. 278).
18. Choose Submit.
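For reference, a custom IAM execution role must allow the SageMaker service to assume it. The following is the standard trust policy for this purpose; attach it to any role that you enter as a custom execution role ARN.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}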

Create user profiles from the AWS CLI

To create a user profile in a Domain from the AWS CLI, run the following command from the terminal of
your local machine. For information about the available JupyterLab version ARNs, see Setting a default
JupyterLab version (p. 137).

aws --region region \
    sagemaker create-user-profile \
    --domain-id domain-id \
    --user-profile-name user-name \
    --user-settings '{
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                "SageMakerImageArn": "sagemaker-image-arn",
                "InstanceType": "system"
            }
        }
    }'

Remove user profiles


All apps launched by a user profile must be deleted to delete the user profile. The following section
shows how to remove user profiles from a Domain using the SageMaker console or AWS CLI.

Remove user profiles from the console

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to remove a user profile from.
4. On the Domain details page, choose the User profiles tab.
5. Select the user profile that you want to delete. The user profile must not contain any non-failed
apps.
6. On the User details page, choose Edit.
7. Choose Delete user. This opens a new pop-up.
8. On the Delete user pop-up, choose Yes, delete user.
9. Enter delete in the field to confirm deletion.


10. Choose Delete.

Remove user profiles from the AWS CLI


To delete a user profile from the AWS CLI, run the following command from the terminal of your local
machine.

aws sagemaker delete-user-profile \
    --region region \
    --domain-id domain-id \
    --user-profile-name user-name

View User Profiles and User Profile Details


This topic shows how to view a list of user profiles in an Amazon SageMaker Domain, and view details for
a user profile from the SageMaker console or the AWS Command Line Interface (AWS CLI).

Topics
• View user profiles (p. 121)
• View user profile details (p. 121)

View user profiles


The following section describes how to view a list of user profiles in a Domain from the SageMaker
console or the AWS CLI.

View user profiles from the console


Complete the following procedure to view a list of user profiles in the Domain from the SageMaker
console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to view a list of user profiles for.
4. On the Domain details page, choose the User profiles tab.

View user profiles from the AWS CLI


To view the user profiles in a Domain from the AWS CLI, run the following command from the terminal
of your local machine.

aws sagemaker list-user-profiles \
    --region region \
    --domain-id domain-id

View user profile details


The following section describes how to view the details of a user profile from the SageMaker console or
the AWS CLI.

View user profile details from the console


Complete the following procedure to view the details of a user profile from the SageMaker console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.


3. From the list of Domains, select the Domain that you want to view a list of user profiles for.
4. On the Domain details page, choose the User profiles tab.
5. Select the user profile that you want to view details for.

View user profile details from the AWS CLI

To describe a user profile from the AWS CLI, run the following command from the terminal of your local
machine.

aws sagemaker describe-user-profile \
    --region region \
    --domain-id domain-id \
    --user-profile-name user-name

IAM Identity Center Groups in a Domain


If you use AWS IAM Identity Center (successor to AWS Single Sign-On) authentication for your Amazon
SageMaker Domain, you can add and edit group and user access to a Domain. For more information
about IAM Identity Center authentication, see What is IAM Identity Center?. The following topics show
how to manage IAM Identity Center users and groups that have access to a Domain.

Topics
• View groups and users (p. 122)
• Add groups and users (p. 122)
• Remove groups (p. 122)

View groups and users


Complete the following procedure to view a list of IAM Identity Center groups and users from the
Amazon SageMaker console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to open the Domain settings page for.
4. On the Domain details page, choose the Groups tab.

Add groups and users


Complete the following procedure to add groups and users to your Domain from the SageMaker console.

1. On the Groups tab, choose Assign users and groups.


2. On the Assign users and groups page, select the users and groups that you want to add.
3. Choose Assign users and groups.

Remove groups
Complete the following procedure to remove groups from your Domain from the SageMaker console. For
information about deleting a user, see Remove user profiles (p. 120).


1. On the Groups tab, choose the group that you want to remove.
2. Choose Unassign groups.
3. On the pop-up window, choose Yes, unassign groups.
4. Enter unassign in the field.
5. Choose Unassign groups.

Collaborate with shared spaces


A shared space consists of a shared JupyterServer application and a shared directory. All user profiles
in a Domain have access to all shared spaces in the Domain. Amazon SageMaker automatically scopes
resources in a shared space within the context of the Amazon SageMaker Studio application that you
launch in that shared space. Resources in a shared space include notebooks, files, experiments, and
models.

A shared space only supports Studio and KernelGateway applications. A shared space only supports
the use of a JupyterLab 3 image Amazon Resource Name (ARN). For more information, see JupyterLab
Versioning (p. 135).

Amazon SageMaker automatically tags all SageMaker resources that you create within the scope of
a shared space. You can use these tags to monitor costs and plan budgets using tools, such as AWS
Budgets.

A shared space uses the same VPC settings as the Domain that it's created in.
Note
Domains with AWS IAM Identity Center (successor to AWS Single Sign-On) authentication do not
currently support the use of shared spaces. Shared spaces do not support the use of Amazon
SageMaker Data Wrangler or Amazon EMR cross-account clusters.

Automatic tagging

All resources created in a shared space are automatically tagged with a Domain ARN tag and shared
space ARN tag. The Domain ARN tag is based on the Domain ID, while the shared space ARN tag is based
on the shared space name.

You can use these tags to monitor AWS CloudTrail usage. For more information, see Log Amazon
SageMaker API Calls with AWS CloudTrail.

You can also use these tags to monitor costs with AWS Billing and Cost Management. For more
information, see Using AWS cost allocation tags.

Real time co-editing of notebooks

A key benefit of a shared space is that it facilitates collaboration between members of the shared space
in real time. Users collaborating in a workspace get access to a shared Studio application where they
can access, read, and edit their notebooks in real time. Real time collaboration is only supported for
JupyterServer applications within a shared space.

Users with access to a shared space can simultaneously open, view, edit, and execute Jupyter notebooks
in the shared Studio application in that space.

The notebook indicates each co-editing user with a different cursor that shows the user profile name.
While multiple users can view the same notebook, co-editing is best suited for small groups of two to
five users.

To track changes being made by multiple users, we strongly recommend using Studio's built-in Git-based version control.


JupyterServer 2

To use shared spaces, Jupyter Server version 2 is required. Certain JupyterLab extensions and packages
can forcefully downgrade Jupyter Server to version 1, which prevents the use of shared spaces. Run the
following from the command prompt to change the version number and continue using shared spaces.

conda activate studio
pip install jupyter-server==2.0.0rc3
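To confirm that Jupyter Server version 2 is now installed in the studio environment, you can inspect the package metadata; for example:

# Prints the installed jupyter-server package version; it should report 2.x.
pip show jupyter-server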

Customize a shared space

To attach a lifecycle configuration or custom image to a shared space, you must use the AWS CLI. For
more information about creating and attaching lifecycle configurations, see Creating and Associating a
Lifecycle Configuration (p. 183). For more information about creating and attaching custom images,
see Bring your own SageMaker image (p. 169).

Create a shared space


The following topic demonstrates how to create a shared space in an existing Amazon SageMaker
Domain. If you created your Domain without support for shared spaces, you must add support for shared
spaces to your existing Domain before you can create a shared space.

Topics
• Add shared space support to an existing Domain (p. 124)
• Create from the console (p. 125)
• Create from AWS CLI (p. 125)

Add shared space support to an existing Domain


You can use the SageMaker console or the AWS CLI to add support for shared spaces to an existing
Domain.

Console

Complete the following procedure to add support for shared spaces to an existing Domain from the
SageMaker console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation, choose Domains.
3. From the list of Domains, select the Domain that you want to open the Domain settings page for.
4. On the Domain details page, choose the Domain settings tab.
5. Choose Edit.
6. For Space default execution role, set an IAM role that is used by default for all shared spaces
created in the Domain.
7. Choose Next.
8. Choose Next.
9. Choose Next.
10. Choose Submit.

AWS CLI

Run the following command from the terminal of your local machine to add default shared space
settings to a Domain from the AWS CLI. If you are adding default shared space settings to a Domain
within an Amazon VPC, you must also include a list of security groups. Shared spaces only support the
use of JupyterLab 3 image ARNs. For more information, see JupyterLab Versioning (p. 135).

# Public Internet domain
aws --region region \
    sagemaker update-domain \
    --domain-id domain-id \
    --default-space-settings "ExecutionRole=execution-role-arn,JupyterServerAppSettings={DefaultResourceSpec={InstanceType=system,SageMakerImageArn=sagemaker-image-arn}}"

# VPCOnly domain
aws --region region \
    sagemaker update-domain \
    --domain-id domain-id \
    --default-space-settings "ExecutionRole=execution-role-arn,JupyterServerAppSettings={DefaultResourceSpec={InstanceType=system,SageMakerImageArn=sagemaker-image-arn}},SecurityGroups=[security-groups]"

Verify that the default shared space settings have been updated.

aws --region region \
    sagemaker describe-domain \
    --domain-id domain-id

Create from the console


Complete the following procedure to create a shared space in the Domain from the SageMaker console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to create a shared space for.
4. On the Domain details page, choose the Space management tab.
5. Choose Create.
6. Enter a name for your shared space. Shared space names within a Domain must be unique. The
execution role for the shared space is set to the Domain IAM execution role.

Create from AWS CLI


This section shows how to create a shared space from the AWS CLI.

You cannot set the execution role of a shared space when creating or updating it.
The DefaultDomainExecRole can only be set when creating or updating the Domain. Shared
spaces only support the use of JupyterLab 3 image ARNs. For more information, see JupyterLab
Versioning (p. 135).

To create a shared space from the AWS CLI, run the following command from the terminal of your local
machine.

aws --region region \
    sagemaker create-space \
    --domain-id domain-id \
    --space-name space-name \
    --space-settings '{
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                "SageMakerImageArn": "sagemaker-image-arn",
                "InstanceType": "system"
            }
        }
    }'

List and Describe shared spaces


This guide shows how to access a list of shared spaces in an Amazon SageMaker Domain with the
Amazon SageMaker console or the AWS CLI. It also shows how to view details of a shared space from the
AWS CLI.

Topics
• List shared spaces (p. 126)
• View shared space details (p. 126)

List shared spaces


The following topic describes how to view a list of shared spaces within a Domain from the SageMaker
console or the AWS CLI.

List shared spaces from the console


Complete the following procedure to view a list of the shared spaces in a Domain from the SageMaker
console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to view the list of shared spaces for.
4. On the Domain details page, choose the Space management tab.

List shared spaces from the AWS CLI


To list the shared spaces in a Domain from the AWS CLI, run the following command from the terminal
of your local machine.

aws --region region \
    sagemaker list-spaces \
    --domain-id domain-id

View shared space details


The following section describes how to view shared space details from the SageMaker console or the
AWS CLI.

View shared space details from the console


You can view the details of a shared space from the SageMaker console using the following procedure.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to view the list of shared spaces for.
4. On the Domain details page, choose the Space management tab.
5. Select the name of the space to open a new page that lists details about the shared space.


View shared space details from the AWS CLI

To view the details of a shared space from the AWS CLI, run the following command from the terminal of
your local machine.

aws --region region \
    sagemaker describe-space \
    --domain-id domain-id \
    --space-name space-name

Edit a shared space


You can only edit the details of a shared space using the AWS CLI. This is not currently supported from
the Amazon SageMaker console. You can only update shared space attributes when there are no running
applications in the shared space.

To edit the details of a shared space from the AWS CLI, run the following command from the terminal
of your local machine. Shared spaces only support the use of JupyterLab 3 image ARNs. For more
information, see JupyterLab Versioning (p. 135).

aws --region region \
    sagemaker update-space \
    --domain-id domain-id \
    --space-name space-name \
    --query SpaceArn --output text \
    --space-settings '{
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                "SageMakerImageArn": "sagemaker-image-arn",
                "InstanceType": "system"
            }
        }
    }'

Delete a shared space


The following topic shows how to delete a shared space from the Amazon SageMaker console or AWS
CLI. A shared space can only be deleted if it has no running applications.

Topics
• Console (p. 127)
• AWS CLI (p. 128)

Console
Complete the following procedure to delete a shared space in the Amazon SageMaker Domain from the
SageMaker console.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to delete a shared space from.
4. On the Domain details page, choose the Space management tab.
5. Select the shared space that you want to delete. The shared space must not contain any non-failed
apps.


6. Choose Delete. This opens a new window.


7. Choose Yes, delete space.
8. Enter delete in the field.
9. Choose Delete space.

AWS CLI
To delete a shared space from the AWS CLI, run the following command from the terminal of your local
machine.

aws --region region \
    sagemaker delete-space \
    --domain-id domain-id \
    --space-name space-name

Amazon SageMaker Studio


Amazon SageMaker Studio is a web-based, integrated development environment (IDE) for machine
learning that lets you build, train, debug, deploy, and monitor your machine learning models. SageMaker
Studio provides all the tools you need to take your models from data preparation to experimentation to
production while boosting your productivity. In a single unified visual interface, customers can perform
the following tasks:

• Write and execute code in Jupyter notebooks


• Prepare data for machine learning
• Build and train machine learning models
• Deploy the models and monitor the performance of their predictions
• Track and debug the machine learning experiments

For information on the onboarding steps to sign in to SageMaker Studio, see Onboard to Amazon
SageMaker Domain (p. 37).

For the AWS Regions supported by SageMaker Studio, see Supported Regions and Quotas (p. 33).

Topics
• Studio Features (p. 128)
• Amazon SageMaker Studio UI Overview (p. 129)
• Launch Amazon SageMaker Studio (p. 133)
• JupyterLab Versioning (p. 135)
• Use the Amazon SageMaker Studio Launcher (p. 141)
• Use Amazon SageMaker Studio Notebooks (p. 144)
• Customize Amazon SageMaker Studio (p. 168)
• Perform Common Tasks in Amazon SageMaker Studio (p. 194)
• Amazon SageMaker Studio Pricing (p. 200)
• Troubleshooting Amazon SageMaker Studio (p. 201)

Studio Features
Studio includes the following features:


• SageMaker Autopilot
• SageMaker Clarify
• SageMaker Data Wrangler
• SageMaker Debugger
• SageMaker Experiments
• SageMaker Feature Store
• SageMaker JumpStart
• Amazon SageMaker Model Building Pipelines
• SageMaker Model Registry
• SageMaker Projects
• SageMaker Studio Notebooks
• SageMaker Studio Universal Notebook

Amazon SageMaker Studio UI Overview


Amazon SageMaker Studio extends the capabilities of JupyterLab with custom resources that can speed
up your Machine Learning (ML) process by harnessing the power of AWS compute. Previous users of
JupyterLab will notice the similarity of the user interface. The most prominent additions are detailed
in the following sections. For an overview of the original JupyterLab interface, see The JupyterLab
Interface.

The following image shows the default view upon launching Amazon SageMaker Studio. The left
navigation panel displays all top-level categories of features, and a Studio Home page (p. 130) is open in
the main working area. Come back to this central point of orientation by choosing the Home icon
at any time, then selecting the Home node in the navigation menu.

Try the Getting started notebook for an in-product hands-on guide on how to set up and get familiar
with Amazon SageMaker Studio features. On the Quick actions section of the Studio Home page, choose
Open the Getting started notebook.


Note
This chapter is based on Studio's updated user interface (UI), available in version v5.38.x and
above on JupyterLab 3.

• To retrieve your version of the Studio UI, open a System Terminal from the Studio Launcher, then:

1. Run conda activate studio
2. Run jupyter labextension list
3. Search for the version displayed after @amzn/sagemaker-ui in the output.

• For information about updating Amazon SageMaker Studio, see Shut down and Update
SageMaker Studio (p. 199).

Topics
• Studio Home page (p. 130)
• Studio layout (p. 130)

Studio Home page


The Home page provides access to common tasks and workflows. In particular, it includes a list of Quick
actions for common tasks, such as Open Launcher to create notebooks and other resources, and Import
& prepare data visually to create a new flow in Data Wrangler. The Home page also offers tooltips on key
controls in the UI.

The Prebuilt and automated solutions help you get started quickly with SageMaker's low-code solutions,
such as Amazon SageMaker JumpStart and Autopilot.

In Workflows and tasks, you can find a list of relevant tasks for each step of your ML workflow that
takes you to the right tool for the job. For example, Transform, analyze, and export data takes you
to Amazon SageMaker Data Wrangler and opens the workflow to create a new data flow, and View all
experiments takes you to SageMaker Experiments and opens the experiments list view.

Upon Studio launch, the Home page is open in the main working area. You can customize your
SageMaker Home page by choosing Customize Layout at the top right of the Home tab.

Studio layout
The Amazon SageMaker Studio interface consists of a menu bar at the top, a collapsible left sidebar
displaying a variety of icons such as the Home icon and the File Browser, a status bar at the bottom
of the screen, and a central area divided horizontally into two panes. The left pane is a collapsible
navigation panel. The right pane, or main working area, contains one or more tabs for resources such as
launchers, notebooks, terminals, metrics, and graphs, and can be further divided.

On the right corner of the menu bar, you can report a bug in Studio or choose the notification icon to
view notifications from Studio, such as new Studio versions and new SageMaker features. To update to a
new version of Studio, see Shut Down and Update SageMaker Studio and Studio Apps (p. 198).

The following sections describe the Studio main user interface areas.

Left sidebar
The left sidebar includes the following icons. When hovering over an icon, a tooltip displays the icon
name. A single click on an icon opens up the left navigation panel with the described functionality. A
double click minimizes the left navigation panel.


Home

Choose the Home icon to open a top-level navigation menu in the left navigation panel.

Using the Home navigation menu, you can discover and navigate to the right tools for each step of your ML workflow. The menu also provides shortcuts to quick-start solutions and learning resources such as documentation and guided tutorials.

The menu categories group relevant features together. Choosing Data, for example, expands the relevant SageMaker capabilities for your data preparation tasks. From here, you can prepare your data with Data Wrangler, create and store ML features with Amazon SageMaker Feature Store, and manage Amazon EMR clusters for large-scale data processing. The categories are ordered following a typical ML workflow from preparing data, to building, training, and deploying ML models (data, pipelines, models, and deployments).

When you choose a specific node (such as Data Wrangler), a corresponding page opens in the main working area.

Choose Home in the navigation menu to open the Studio Home page (p. 130).

File Browser

The File Browser displays lists of your notebooks, experiments, trials, trial components, endpoints, and low-code solutions.

Whether you are in a personal or shared space determines who has access to your files. You can identify which type of space you are in by looking at the top right corner. If you are in a personal app, you see a user icon followed by "[user_name] / Personal Studio", and if you are in a collaborative space, you see a globe icon followed by "[user_name] / [space_name]".

• Personal Studio app: A private Amazon EFS directory that only you can access.
• Collaborative space: A shared Amazon EFS directory with other members of your team for group access to notebooks and resources. Working in a shared space allows for real-time team collaboration on notebooks.
• Studio launcher: Choose the plus (+) sign on the menu at the top of the file browser to open the Amazon SageMaker Studio Launcher.
• Upload files: Choose the Upload Files icon to add files to Studio, or drag and drop them from your desktop.
• Open files: Double-click a file to open the file in a new tab, or right-click and select Open.
• Panel management: To work in adjacent files, choose a tab that contains a notebook, Python, or text file, then choose New View for File.

For hierarchical entries, a selectable breadcrumb at the top of the browser shows your location in the hierarchy.

Property Inspector

The Property Inspector is a notebook cell tools inspector that displays contextual property settings when open.

Running Terminals and Kernels

You can check the list of all the kernels and terminals currently running across all notebooks, code consoles, and directories. You can shut down individual resources, including notebooks, terminals, kernels, apps, and instances. You can also shut down all resources in one of these categories at the same time. For more information, see Shut Down Resources (p. 159).

Git

You can connect to a Git repository and then access a full range of Git tools and operations. For more information, see Clone a Git Repository in SageMaker Studio (p. 194).

Table of Contents

You can navigate the structure of a document when a notebook or Python file is open. A table of contents is auto-generated in the left navigation panel when you have a notebook, Markdown file, or Python file open. The entries are clickable and scroll the document to the heading in question.

Extensions

You can turn on and manage third-party JupyterLab extensions. You can check the already installed extensions and search for extensions by typing the name in the search bar. When you have found the extension you want to install, choose Install. After installing new extensions, be sure to restart JupyterLab by refreshing your browser. For more information, see the JupyterLab Extensions documentation.

Left navigation panel


The left navigation panel content varies with the icon selected in the left sidebar.

For example, choosing the Home icon displays the navigation menu. Choosing File browser lists all
the files and directories available in your workspace (notebooks, experiments, data flows, trials, trial
components, endpoints, or low-code solutions).


In the navigation menu, choosing a node brings up the corresponding feature page in the main working
area. For example, choosing Data Wrangler in the Data menu opens up the Data Wrangler tab listing all
existing flows.

Main working area


The main working area consists of multiple tabs that contain your open notebooks, terminals, and
detailed information about your experiments and endpoints. In the main working area, you can arrange
documents (such as notebooks and text files) and other activities (such as terminals and code consoles)
into panels of tabs that you can resize or subdivide. Drag a tab to the center of a tab panel to move the
tab to the panel. Subdivide a tab panel by dragging a tab to the left, right, top, or bottom of the panel.
The tab for the current activity is marked with a colored top border (blue by default).
Note
All feature pages provide in-product contextual help. To access help, choose Show information.
The help interface provides a brief introduction to the tool and links to additional resources,
such as videos, tutorials, or blogs.

Launch Amazon SageMaker Studio


After you have onboarded to an Amazon SageMaker Domain, you can launch an Amazon SageMaker
Studio application from either the SageMaker console or the AWS CLI. For more information about
onboarding to a Domain, see Onboard to Amazon SageMaker Domain (p. 37).

Topics
• Launch Studio Using the Amazon SageMaker Console (p. 133)
• Launch Studio Using the AWS CLI (p. 134)

Launch Studio Using the Amazon SageMaker Console


When launching an Amazon SageMaker Studio application from the SageMaker console, you can use the
Studio landing page or the Amazon SageMaker Domain details page. The following sections demonstrate
how to launch the Studio application from the SageMaker console.

Topics
• Prerequisite (p. 133)
• Launch Studio from the Domain details page (p. 133)
• Launch Studio from the Studio landing page (p. 133)

Prerequisite
To complete this procedure, you must onboard to a Domain by following the steps in Onboard to
Amazon SageMaker Domain.

Launch Studio from the Domain details page


The following sections describe how to launch a Studio application from the Domain details page.
The steps to launch the Studio application after you have navigated to the Domain details page differ
depending on if you’re launching a personal application or a shared space.

Navigate to the Domain details page

The following procedure shows how to navigate to the Domain details page.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.


3. From the list of Domains, select the Domain that you want to launch the Studio application in.

Launch a user profile app

The following procedure shows how to launch a Studio application that is scoped to a user profile.

1. On the Domain details page, choose the User profiles tab.


2. Identify the user profile that you want to launch the Studio application for.
3. Choose Launch for your selected user profile, then choose Studio.

Launch a shared space app

The following procedure shows how to launch a Studio application that is scoped to a shared space.

1. On the Domain details page, choose the Space management tab.


2. Identify the shared space that you want to launch the Studio application for.
3. Choose Launch Studio for your selected shared space.

Launch Studio from the Studio landing page


The following procedure describes how to launch a Studio application from the Studio landing page.

Launch Studio

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Studio.
3. Under Get started, select the Domain that you want to launch the Studio application in. If your user
profile only belongs to one Domain, you do not see the option for selecting a Domain.
4. Select the user profile that you want to launch the Studio application for. If there is no user profile
in the Domain, choose Create user profile. For more information, see Add and Remove User
Profiles (p. 119).
5. Choose Launch Studio. If the user profile belongs to a shared space, choose Open Spaces.
6. To launch a Studio application scoped to a user profile, choose Launch personal Studio.
7. To launch a shared Studio application, choose the Launch shared Studio button next to the shared
space that you want to launch into.

Launch Studio Using the AWS CLI


You can use the AWS Command Line Interface (AWS CLI) to launch Amazon SageMaker Studio by
creating a presigned Domain URL.

Prerequisites

Before you begin, complete the following prerequisites:

• Onboard to Amazon SageMaker Domain. For more information, see Onboard to Amazon SageMaker
Domain.
• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.

The following code snippet demonstrates how to launch Amazon SageMaker Studio from the AWS CLI
using a presigned Domain URL. For more information, see create-presigned-domain-url.


aws sagemaker create-presigned-domain-url \
    --region region \
    --domain-id domain-id \
    --space-name space-name \
    --user-profile-name user-profile-name \
    --session-expiration-duration-in-seconds 43200
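
The command returns the presigned URL in the AuthorizedUrl field of the response; opening that URL in a
browser signs you in to Studio. As a minimal sketch (the parameter values are placeholders, as above),
you can extract only the URL with a --query filter:

aws sagemaker create-presigned-domain-url \
    --region region \
    --domain-id domain-id \
    --user-profile-name user-profile-name \
    --query AuthorizedUrl --output text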

JupyterLab Versioning
The Amazon SageMaker Studio interface is based on JupyterLab, which is a web-based interactive
development environment for notebooks, code, and data. Studio now supports using both JupyterLab
1 and JupyterLab 3. The default version of JupyterLab in Studio is JupyterLab 3. If you created your
Amazon SageMaker Domain and user profile using the AWS Management Console before 08/31/2022
or using the AWS Command Line Interface before 02/22/2023, then your Studio instance defaults to
JupyterLab 1. After 08/31/2022, JupyterLab version 1 on Amazon SageMaker Studio only receives
security fixes. You can choose the version that you want to run. However, you can run only a single
instance of JupyterLab at one time per user profile. You can’t run multiple versions of JupyterLab
simultaneously.

After 03/31/2023, Studio only supports the creation of JupyterLab 3 applications and no longer supports
JupyterLab 1 application creation. On 04/30/2023, Studio removes all existing applications that run
JupyterLab 1. Update your existing JupyterLab 1 applications to JupyterLab 3 before 04/30/2023 by
following the steps in View and update the JupyterLab version of an application from the console (p. 140).

Topics
• JupyterLab 3 (p. 135)
• Restricting default JupyterLab version using an IAM policy condition key (p. 136)
• Setting a default JupyterLab version (p. 137)
• View and update the JupyterLab version of an application from the console (p. 140)
• Installing JupyterLab and Jupyter Server extensions (p. 140)

JupyterLab 3
JupyterLab 3 includes the following features that are not available in previous versions. For more
information about these features, see JupyterLab 3.0 is released!.

• Visual debugger when using the Base Python 2.0 and Data Science 2.0 kernels.
• File browser filter
• Table of Contents (TOC)
• Multi-language support
• Simple mode
• Single interface mode

Important changes to JupyterLab 3


Consider the following when using JupyterLab 3:

• When setting the JupyterLab version using the AWS CLI, select the corresponding image for your
Region and JupyterLab version from the image list in From the AWS CLI (p. 137).
• In JupyterLab 3, you must activate the studio conda environment before installing extensions. For
more information, see Installing JupyterLab and Jupyter Server extensions (p. 140).


• Debugger is only supported when using the following images:


• Base Python 2.0
• Data Science 2.0
• Base Python 3.0
• Data Science 3.0

Restricting default JupyterLab version using an IAM policy condition key

You can use IAM policy condition keys to restrict the version of JupyterLab that your users can launch.

The following policy shows how to limit the JupyterLab version at the Domain level.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Block users from creating JupyterLab 3 apps at the domain level",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateDomain",
                "sagemaker:UpdateDomain"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringLike": {
                    "sagemaker:ImageArns": "*image/jupyter-server-3"
                }
            }
        }
    ]
}

The following policy shows how to limit the JupyterLab version at the user profile level.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Block users from creating JupyterLab 3 apps at the user profile level",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateUserProfile",
                "sagemaker:UpdateUserProfile"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringLike": {
                    "sagemaker:ImageArns": "*image/jupyter-server-3"
                }
            }
        }
    ]
}

The following policy shows how to limit the JupyterLab version at the application level. The CreateApp
request must include the image ARN for this policy to apply.


"Version": "2012-10-17",
"Statement": [
{
"Sid": "Block users from creating JupyterLab 3 apps at the application level",
"Effect": "Deny",
"Action": "sagemaker:CreateApp",
"Resource": "*",
"Condition": {
"ForAnyValue:StringLike": {
"sagemaker:ImageArns": "*image/jupyter-server-3"
}
}
}
]
}
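
To put one of these policies into effect, attach it to the IAM identities that your Studio users assume.
The following is a minimal sketch using the AWS CLI; the file name deny-jupyterlab-3.json, the policy
name, and the role name are illustrative assumptions, not values defined in this guide:

# Create a managed policy from a local JSON file, then attach it to a role.
aws iam create-policy \
    --policy-name DenyJupyterLab3 \
    --policy-document file://deny-jupyterlab-3.json

aws iam attach-role-policy \
    --role-name <YOUR_USER_ROLE> \
    --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/DenyJupyterLab3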

Setting a default JupyterLab version


The following sections show how to set a default JupyterLab version for Studio using either the console
or the AWS CLI.

From the console


You can select the default JupyterLab version to use on either the Domain or user profile level during
resource creation. To set the default JupyterLab version using the console, see Onboard to Amazon
SageMaker Domain (p. 37).

From the AWS CLI


You can select the default JupyterLab version to use on either the Domain or user profile level using the
AWS CLI.

To set the default JupyterLab version using the AWS CLI, you must include the ARN of the desired default
JupyterLab version as part of an AWS CLI command. This ARN differs based on the version and the
Region of the SageMaker Domain.

The following table lists the ARNs of the available JupyterLab versions for each Region. In every Region,
the JupyterLab 3 (JL3) image ARN is the JupyterLab 1 (JL1) ARN with -3 appended (that is,
image/jupyter-server-3 instead of image/jupyter-server).

Region          JL1 image ARN
us-east-1       arn:aws:sagemaker:us-east-1:081325390199:image/jupyter-server
us-east-2       arn:aws:sagemaker:us-east-2:429704687514:image/jupyter-server
us-west-1       arn:aws:sagemaker:us-west-1:742091327244:image/jupyter-server
us-west-2       arn:aws:sagemaker:us-west-2:236514542706:image/jupyter-server
af-south-1      arn:aws:sagemaker:af-south-1:559312083959:image/jupyter-server
ap-east-1       arn:aws:sagemaker:ap-east-1:493642496378:image/jupyter-server
ap-south-1      arn:aws:sagemaker:ap-south-1:394103062818:image/jupyter-server
ap-northeast-2  arn:aws:sagemaker:ap-northeast-2:806072073708:image/jupyter-server
ap-southeast-1  arn:aws:sagemaker:ap-southeast-1:492261229750:image/jupyter-server
ap-southeast-2  arn:aws:sagemaker:ap-southeast-2:452832661640:image/jupyter-server
ap-northeast-1  arn:aws:sagemaker:ap-northeast-1:102112518831:image/jupyter-server
ca-central-1    arn:aws:sagemaker:ca-central-1:310906938811:image/jupyter-server
eu-central-1    arn:aws:sagemaker:eu-central-1:936697816551:image/jupyter-server
eu-west-1       arn:aws:sagemaker:eu-west-1:470317259841:image/jupyter-server
eu-west-2       arn:aws:sagemaker:eu-west-2:712779665605:image/jupyter-server
eu-west-3       arn:aws:sagemaker:eu-west-3:615547856133:image/jupyter-server
eu-north-1      arn:aws:sagemaker:eu-north-1:243637512696:image/jupyter-server
eu-south-1      arn:aws:sagemaker:eu-south-1:592751261982:image/jupyter-server
sa-east-1       arn:aws:sagemaker:sa-east-1:782484402741:image/jupyter-server
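
To confirm which default JupyterServer image a Domain is currently configured with, one option is to
query the Domain settings with the AWS CLI. This is a sketch that assumes the default was set through
the Domain's default user settings; the query returns nothing if no default image was explicitly
configured:

aws sagemaker describe-domain \
    --domain-id <YOUR_DOMAIN_ID> \
    --query "DefaultUserSettings.JupyterServerAppSettings.DefaultResourceSpec.SageMakerImageArn" \
    --output text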

Create or update Domain

You can set a default JupyterServer version at the Domain level by invoking CreateDomain or
UpdateDomain and passing the
UserSettings.JupyterServerAppSettings.DefaultResourceSpec.SageMakerImageArn field.

The following shows how to create a Domain with JupyterLab 3 as the default, using the AWS CLI:

aws --region <REGION> \
    sagemaker create-domain \
    --domain-name <NEW_DOMAIN_NAME> \
    --auth-mode <AUTHENTICATION_MODE> \
    --subnet-ids <SUBNET-IDS> \
    --vpc-id <VPC-ID> \
    --default-user-settings '{
      "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
          "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
          "InstanceType": "system"
        }
      }
    }'

The following shows how to update a Domain to use JupyterLab 3 as the default, using the AWS CLI:

aws --region <REGION> \
    sagemaker update-domain \
    --domain-id <YOUR_DOMAIN_ID> \
    --default-user-settings '{
      "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
          "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
          "InstanceType": "system"
        }
      }
    }'

Create or update user profile

You can set a default JupyterServer version at the user profile level
by invoking CreateUserProfile or UpdateUserProfile and passing
the UserSettings.JupyterServerAppSettings.DefaultResourceSpec.SageMakerImageArn
field.

The following shows how to create a user profile with JupyterLab 3 as the default on an existing Domain,
using the AWS CLI:

aws --region <REGION> \
    sagemaker create-user-profile \
    --domain-id <YOUR_DOMAIN_ID> \
    --user-profile-name <NEW_USERPROFILE_NAME> \
    --query UserProfileArn --output text \
    --user-settings '{
      "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
          "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
          "InstanceType": "system"
        }
      }
    }'

The following shows how to update a user profile to use JupyterLab 3 as the default, using the AWS CLI:

aws --region <REGION> \
    sagemaker update-user-profile \
    --domain-id <YOUR_DOMAIN_ID> \
    --user-profile-name <EXISTING_USERPROFILE_NAME> \
    --user-settings '{
      "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
          "SageMakerImageArn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:image/jupyter-server-3",
          "InstanceType": "system"
        }
      }
    }'

View and update the JupyterLab version of an application from the console

The following shows how to view and update the JupyterLab version of an application.

1. Navigate to the SageMaker Domains page.


2. Select a domain to view its user profiles.
3. Select a user to view their applications.
4. To view the JupyterLab version of an application, select the application's name.
5. To update the JupyterLab version, select Action.
6. From the dropdown menu, select Change JupyterLab version.
7. From the Studio settings page, select the JupyterLab version from the dropdown menu.
8. After the JupyterLab version for the user profile has been successfully updated, restart the
JupyterServer application so that the version change takes effect. For more information about
restarting a JupyterServer application, see Shut down and Update SageMaker Studio (p. 199).
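
You can also check JupyterLab versions programmatically. As a sketch, listing the apps for a user
profile with the AWS CLI shows each app's name, type, and status; you can then inspect the
JupyterServer app's image with describe-app:

aws sagemaker list-apps \
    --domain-id-equals <YOUR_DOMAIN_ID> \
    --user-profile-name-equals <YOUR_USERPROFILE_NAME>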

Installing JupyterLab and Jupyter Server extensions


The process for installing JupyterLab and Jupyter Server extensions differs depending on the JupyterLab
version of your Studio instance. In JupyterLab 1, you can open the terminal and install extensions
without activating any conda environment. In JupyterLab 3, you must activate the studio conda
environment before installing extensions. The method also differs depending on whether you're
installing the extensions from within Studio or using a lifecycle configuration script.

Installing Extensions from within Studio


To install extensions from within Studio, you must activate the studio environment before you install
extensions.

# Before installing extensions
conda activate studio

# Install your extensions
pip install <JUPYTER_EXTENSION>

# After installing extensions
conda deactivate

Installing Extensions using a lifecycle configuration script


If you're installing JupyterLab and Jupyter Server extensions in your lifecycle configuration script, you
must modify your script so that it works with JupyterLab 3. The following sections show the code needed
for existing and new lifecycle configuration scripts.

Existing lifecycle configuration script

If you're reusing an existing lifecycle configuration script that must work with both versions of
JupyterLab, use the following code in your script:

# Before installing extension
export AWS_SAGEMAKER_JUPYTERSERVER_IMAGE="${AWS_SAGEMAKER_JUPYTERSERVER_IMAGE:-'jupyter-server'}"
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
    eval "$(conda shell.bash hook)"
    conda activate studio
fi;

# Install your extensions
pip install <JUPYTER_EXTENSION>

# After installing extension
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ]; then
    conda deactivate
fi;

New lifecycle configuration script

If you're writing a new lifecycle configuration script that only uses JupyterLab 3, you can use the
following code in your script:

# Before installing extension
eval "$(conda shell.bash hook)"
conda activate studio

# Install your extensions
pip install <JUPYTER_EXTENSION>

conda deactivate
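
To use one of these scripts with Studio, register it as a lifecycle configuration and then reference it
from your Domain or user profile settings. The following is a minimal sketch; the script file name and
the configuration name are placeholders, and the script content must be base64-encoded:

# Register the script as a JupyterServer lifecycle configuration.
# base64 -w0 is the GNU coreutils form; on macOS, use base64 -i instead.
aws sagemaker create-studio-lifecycle-config \
    --studio-lifecycle-config-name install-extensions \
    --studio-lifecycle-config-app-type JupyterServer \
    --studio-lifecycle-config-content "$(base64 -w0 install-extensions.sh)"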

Use the Amazon SageMaker Studio Launcher


You can use the Amazon SageMaker Studio Launcher to create notebooks and text files, and to launch
terminals and interactive Python shells.

You can open Studio Launcher in any of the following ways:

• Choose Amazon SageMaker Studio at the top left of the Studio interface.
• Use the keyboard shortcut Ctrl + Shift + L.
• From the Studio menu, choose File and then choose New Launcher.
• If the SageMaker file browser is open, choose the plus (+) sign in the Studio file browser menu.


• In the Quick actions section of the Home tab, choose Open Launcher. The Launcher opens in a new
tab. The Quick actions section is visible by default but can be toggled off. Choose Customize Layout
to turn this section back on.

The Launcher consists of the following two sections:

Topics
• Notebooks and compute resources (p. 142)
• Utilities and files (p. 143)

Notebooks and compute resources


In this section, you can create a notebook, open an image terminal, or open a Python console.

To create or launch one of those items:

1. Choose Change environment to select a SageMaker image, a kernel, an instance type, and, optionally,
a lifecycle configuration script that runs on image start-up. For more information on lifecycle
configuration scripts, see Use Lifecycle Configurations with Amazon SageMaker Studio (p. 182). For
more information about kernel updates, see Change an Image or a Kernel (p. 159).
2. Select an item.

Note
When you choose an item from this section, you might incur additional usage charges. For more
information, see Usage Metering (p. 161).

The following items are available:

• Notebook

Launches the notebook in a kernel session on the chosen SageMaker image.


Creates the notebook in the folder that you have currently selected in the file browser. To view the file
browser, in the left sidebar of Studio, choose the File Browser icon.
• Console

Launches the shell in a kernel session on the chosen SageMaker image.

Opens the shell in the folder that you have currently selected in the file browser.
• Image terminal

Launches the terminal in a terminal session on the chosen SageMaker image.

Opens the terminal in the root folder for the user (as shown by the Home folder in the file browser).

Note
By default, CPU-based images launch on an ml.t3.medium instance, while GPU-based images launch on
an ml.g4dn.xlarge instance.

Utilities and files


In this section, you can add contextual help in a notebook; create Python, Markdown and text files; and
open a system terminal.
Note
Items in this section run in the context of Amazon SageMaker Studio and don't incur usage
charges.

The following items are available:

• Show Contextual Help

Opens a new tab that displays contextual help for functions in a Studio notebook. To display the help,
choose a function in an active notebook. To make it easier to see the help in context, drag the help tab
so that it's adjacent to the notebook tab. To open the help tab from within a notebook, press Ctrl +
I.

The following screenshot shows the contextual help for the Experiment.create method.


• System terminal

Opens a bash shell in the root folder for the user (as shown by the Home folder in the file browser).
• Text File and Markdown File

Creates a file of the associated type in the folder that you have currently selected in the file browser.

To view the file browser, in the left sidebar, choose the File Browser icon.

Use Amazon SageMaker Studio Notebooks


Amazon SageMaker Studio notebooks are collaborative notebooks that you can launch quickly because
you don't need to set up compute instances and file storage beforehand. A set of instance types, known
as Fast launch types, is designed to launch in under two minutes. Studio notebooks provide persistent
storage, which enables you to view and share notebooks even if the instances that the notebooks run on
are shut down.

You can share your notebooks with others, so that they can easily reproduce your results and collaborate
while building models and exploring your data. You provide access to a read-only copy of the notebook
through a secure URL. Dependencies for your notebook are included in the notebook's metadata. When
your colleagues copy the notebook, it opens in the same environment as the original notebook.

A Studio notebook runs in an environment defined by the following:

• Amazon EC2 instance type – The hardware configuration the notebook runs on. The configuration
includes the number and type of processors (vCPU and GPU), and the amount and type of memory.
The instance type determines the pricing rate.


• SageMaker image – A container image that is compatible with SageMaker Studio. The image consists
of the kernels, language packages, and other files required to run a notebook in Studio. There can be
multiple images in an instance. For more information, see Bring your own SageMaker image (p. 169).
• KernelGateway app – A SageMaker image runs as a KernelGateway app. The app provides access to
the kernels in the image. There is a one-to-one correspondence between a SageMaker image and a
SageMaker app.
• Kernel – The process that inspects and runs the code contained in the notebook. A kernel is defined by
a kernel spec in the image. There can be multiple kernels in an image.

You can change any of these resources from within the notebook.

The following diagram outlines how a notebook kernel runs in relation to the KernelGateway App, User,
and Domain.

Sample SageMaker Studio notebooks are available in the aws_sagemaker_studio folder of the Amazon
SageMaker examples GitHub repository. Each notebook comes with the necessary SageMaker image that
opens the notebook with the appropriate kernel.

We recommend that you familiarize yourself with the SageMaker Studio interface and the Studio
notebook toolbar before creating or using a Studio notebook. For more information, see Amazon
SageMaker Studio UI Overview (p. 129) and Use the Studio Notebook Toolbar (p. 150).

Topics
• How Are Amazon SageMaker Studio Notebooks Different from Notebook Instances? (p. 146)
• Get Started (p. 146)
• Amazon SageMaker Studio Tour (p. 147)
• Create or Open an Amazon SageMaker Studio Notebook (p. 148)
• Use the Studio Notebook Toolbar (p. 150)
• Install External Libraries and Kernels in Amazon SageMaker Studio (p. 152)
• Share and Use an Amazon SageMaker Studio Notebook (p. 154)
• Get Studio Notebook and App Metadata (p. 155)
• Get Notebook Differences (p. 157)


• Manage Resources (p. 158)


• Usage Metering (p. 161)
• Available Resources (p. 162)

How Are Amazon SageMaker Studio Notebooks Different from Notebook Instances?
When you're starting a new notebook, we recommend that you create the notebook in Amazon
SageMaker Studio instead of launching a notebook instance from the Amazon SageMaker console. There
are many benefits to using a Studio notebook, including the following:

• Faster: Starting a Studio notebook is faster than launching an instance-based notebook. Typically, it is
5-10 times faster than instance-based notebooks.
• Easy notebook sharing: Notebook sharing is an integrated feature in Studio. Users can generate a
shareable link that reproduces the notebook code and also the SageMaker image required to execute
it, in just a few clicks.
• Latest Python SDK: Studio notebooks come pre-installed with the latest Amazon SageMaker Python
SDK.
• Access all Studio features: Studio notebooks are accessed from within Studio. This enables you to
build, train, debug, track, and monitor your models without leaving Studio.
• Persistent user directories: Each member of a Studio team gets their own home directory to store
their notebooks and other files. The directory is automatically mounted onto all instances and kernels
as they're started, so their notebooks and other files are always available. The home directories are
stored in Amazon Elastic File System (Amazon EFS) so that you can access them from other services.
• Direct access: When using IAM Identity Center, you use your IAM Identity Center credentials through a
unique URL to directly access Studio. You don't have to interact with the AWS Management Console to
run your notebooks.
• Optimized images: Studio notebooks are equipped with a set of predefined SageMaker image settings
to get you started faster.

Note
Studio notebooks don't support local mode. However, you can use a notebook instance to train a
sample of your dataset locally, and then use the same code in a Studio notebook to train on the
full dataset.

When you open a notebook in SageMaker Studio, the view is an extension of the JupyterLab interface.
The primary features are the same, so you'll find the typical features of a Jupyter notebook and
JupyterLab. For more information about the Studio interface, see Amazon SageMaker Studio UI
Overview (p. 129).

Get Started
To get started, you or your organization's administrator need to complete the Amazon SageMaker Studio
onboarding process. For more information, see Onboard to Amazon SageMaker Domain (p. 37).

You can access a Studio notebook in any of the following ways:

• You receive an email invitation to access Studio through your organization's IAM Identity Center, which
includes a direct link to log in to Studio without having to use the Amazon SageMaker console. You can
proceed to the section called “Next Steps” (p. 147).
• You receive a link to a shared Studio notebook, which includes a direct link to log in to Studio
without having to use the SageMaker console. You can proceed to the section called “Next
Steps” (p. 147).
• You onboard to Studio and then log in to the SageMaker console. For more information, see Onboard
to Amazon SageMaker Domain (p. 37).


Launch Amazon SageMaker Studio


Complete the steps in Launch Amazon SageMaker Studio (p. 133) to launch Studio.

Next Steps
Now that you're in Studio, you can try any of the following options:

• To create a Studio notebook or explore Studio end-to-end tutorial notebooks – See Amazon
SageMaker Studio Tour (p. 147) in the next section.
• To familiarize yourself with the Studio interface – See Amazon SageMaker Studio UI
Overview (p. 129) or try the Getting started notebook by selecting Open the Getting started
notebook in the Quick actions section of the Studio Home page.

Amazon SageMaker Studio Tour


For a walkthrough that takes you on a tour of the main features of Amazon SageMaker Studio, see
the xgboost_customer_churn_studio.ipynb sample notebook from the aws/amazon-sagemaker-
examples GitHub repository. The code in the notebook trains multiple models and sets up the SageMaker
Debugger and SageMaker Model Monitor. The walkthrough shows you how to view the trials, compare
the resulting models, show the debugger results, and deploy the best model using the Studio UI. You
don't need to understand the code to follow this walkthrough.

Prerequisites

To run the notebook for this tour, you need:

• An IAM account to sign in to Studio. For information, see Onboard to Amazon SageMaker
Domain (p. 37).
• Basic familiarity with the Studio user interface and Jupyter notebooks. For information, see Amazon
SageMaker Studio UI Overview (p. 129).
• A copy of the aws/amazon-sagemaker-examples repository in your Studio environment.

To clone the repository

1. Sign in to Studio. For users in IAM Identity Center, sign in using the URL from your invitation email.
For IAM users, follow these steps.

a. Sign in to the SageMaker console.


b. Choose Studio in the left navigation pane.
c. Under Get Started, select your domain and user profile.
d. Choose Open Studio.
2. On the top menu, choose File, then New, then Terminal.
3. At the command prompt, run the following command to clone the aws/amazon-sagemaker-
examples GitHub repository.

$ git clone https://github.com/aws/amazon-sagemaker-examples.git

To navigate to the sample notebook

1. From the File Browser on the left menu, select amazon-sagemaker-examples.


2. Navigate to the example notebook with the following path.

   ~/amazon-sagemaker-examples/aws_sagemaker_studio/getting_started/xgboost_customer_churn_studio.ipynb

3. Follow the notebook to learn about Studio's main features.

Note
If you encounter an error when you run the sample notebook, and some time has passed since
you cloned the repository, review the notebook on the remote repository for updates.

Create or Open an Amazon SageMaker Studio Notebook


When you Create a Notebook from the File Menu (p. 149) in Amazon SageMaker Studio or Open
a notebook in Studio (p. 148) for the first time, you are prompted to set up your environment by
choosing a SageMaker image, a kernel, an instance type, and, optionally, a lifecycle configuration script
that runs on image start-up. SageMaker launches the notebook on an instance of the chosen type. By
default, the instance type is set to ml.t3.medium (available as part of the AWS Free Tier) for CPU-based
images. For GPU-based images, the default instance type is ml.g4dn.xlarge.

If you create or open additional notebooks that use the same instance type, whether or not the
notebooks use the same kernel, the notebooks run on the same instance of that instance type.

After you launch a notebook, you can change its instance type, SageMaker image, and kernel from within
the notebook. For more information, see Change an Instance Type (p. 158) and Change an Image or a
Kernel (p. 159).
Note
You can have only one instance of each instance type. Each instance can have multiple
SageMaker images running on it. Each SageMaker image can run multiple kernels or terminal
instances.

Billing occurs per instance and starts when the first instance of a given instance type is launched. If
you want to create or open a notebook without the risk of incurring charges, open the notebook from
the File menu and choose No Kernel from the Select Kernel dialog. You can read and edit a notebook
without a running kernel but you can't run cells.

Billing ends when the SageMaker image for the instance is shut down. For more information, see Usage
Metering (p. 161).

For information about shutting down the notebook, see Shut Down Resources (p. 160).

Topics
• Open a notebook in Studio (p. 148)
• Create a Notebook from the File Menu (p. 149)
• Create a Notebook from the Launcher (p. 149)
• List of the available instance types, images, and kernels (p. 150)

Open a notebook in Studio


Amazon SageMaker Studio can only open notebooks listed in the Studio file browser. For instructions on
uploading a notebook to the file browser, see Upload Files to SageMaker Studio (p. 194) or Clone a Git
Repository in SageMaker Studio (p. 194).

To open a notebook

1. In the left sidebar, choose the File Browser icon to display the file browser.
2. Browse to a notebook file and double-click it to open the notebook in a new tab.


Create a Notebook from the File Menu


To create a notebook from the File menu

1. From the Studio menu, choose File, choose New, and then choose Notebook.
2. In the Change environment dialog, use the dropdown menus to select your Image, Kernel, Instance
type, and Start-up script, then choose Select. Your notebook launches and opens in a new Studio
tab.

Create a Notebook from the Launcher


To create a notebook from the Launcher

1. To open the Launcher, choose Amazon SageMaker Studio at the top left of the Studio interface or
use the keyboard shortcut Ctrl + Shift + L.

To learn about all the available ways to open the Launcher, see Use the Amazon SageMaker Studio
Launcher (p. 141).
2. In the Launcher, in the Notebooks and compute resources section, choose Change environment.

3. In the Change environment dialog, use the dropdown menus to select your Image, Kernel, Instance
type, and Start-up script, then choose Select.
4. In the Launcher, choose Create notebook. Your notebook launches and opens in a new Studio tab.

To view the notebook's kernel session, in the left sidebar, choose the Running Terminals and Kernels
icon. You can stop the notebook's kernel session from this view.


List of the available instance types, images, and kernels


For a list of all available resources, see:

• Available Studio Instance Types (p. 162)


• Available Amazon SageMaker Images (p. 164)
• Available Amazon SageMaker Kernels (p. 167)

Use the Studio Notebook Toolbar


Amazon SageMaker Studio notebooks extend the JupyterLab interface. For an overview of the original
JupyterLab interface, see The JupyterLab Interface.

The following image shows the toolbar and an empty cell from a Studio notebook.

When you pause on a toolbar icon, a tooltip displays the icon function. Additional notebook commands
are found in the Studio main menu. The toolbar includes the following icons:

Icon Description

Save and checkpoint

Saves the notebook and updates the checkpoint file. For more information,
see Get the Difference Between the Last Checkpoint (p. 157).

Insert cell

Inserts a code cell below the current cell. The current cell is noted by the
blue vertical marker in the left margin.

Cut, copy, and paste cells

Cuts, copies, and pastes the selected cells.

Run cells

Runs the selected cells and then makes the cell that follows the last selected
cell the new selected cell.

Interrupt kernel

Interrupts the kernel, which cancels the currently running operation. The
kernel remains active.

Restart kernel

Restarts the kernel. Variables are reset. Unsaved information is not affected.

Restart kernel and run all cells

Restarts the kernel, then runs all the cells of the notebook.


Cell type

Displays or changes the current cell type. The cell types are:

• Code – Code that the kernel runs.


• Markdown – Text rendered as markdown.
• Raw – Content, including Markdown markup, that's displayed as text.

Launch terminal

Launches a terminal in the SageMaker image hosting the notebook. For an example, see Get App
Metadata (p. 156).

Checkpoint diff

Opens a new tab that displays the difference between the notebook and the
checkpoint file. For more information, see Get the Difference Between the
Last Checkpoint (p. 157).

Git diff

Only enabled if the notebook is opened from a Git repository. Opens a new tab that displays the
difference between the notebook and the last Git commit. For more information, see Get the Difference
Between the Last Commit (p. 158).

2 vCPU + 4 GiB – Instance type

Displays or changes the instance type the notebook runs on. The format is as follows:

number of vCPUs + amount of memory + number of GPUs

Unknown indicates the notebook was opened without specifying a kernel. The notebook runs on the
SageMaker Studio instance and doesn't accrue runtime charges. You can't assign the notebook to an
instance type. You must specify a kernel, and then Studio assigns the notebook to a default type.

For more information, see Create or Open an Amazon SageMaker Studio Notebook (p. 148) and Change
an Instance Type (p. 158).

Cluster

Connect your notebook to an Amazon EMR cluster and scale your ETL jobs
or run large-scale model training using Apache Spark, Hive, or Presto.

For more information, see Prepare data using Amazon EMR (p. 1164).


Python 3 (Data Science) – Kernel and SageMaker Image

Displays or changes the kernel that processes the cells in the notebook. The format is as follows:

Kernel (SageMaker Image)

No Kernel indicates the notebook was opened without specifying a kernel. You can edit the notebook
but you can't run any cells.

For more information, see Change an Image or a Kernel (p. 159).

Kernel busy status

Displays the busy status of the kernel. When the edge of the circle and
its interior are the same color, the kernel is busy. The kernel is busy when
it is starting and when it is processing cells. Additional kernel states are
displayed in the status bar at the bottom-left corner of SageMaker Studio.

Share notebook

Shares the notebook. For more information, see Share and Use an Amazon
SageMaker Studio Notebook (p. 154).

To select multiple cells, click in the left margin outside of a cell. Hold down the Shift key and use K or
the Up key to select previous cells, or use J or the Down key to select following cells.

Install External Libraries and Kernels in Amazon SageMaker Studio
Amazon SageMaker Studio notebooks come with multiple images already installed. These images
contain kernels and Python packages including scikit-learn, Pandas, NumPy, TensorFlow, PyTorch, and
MXNet. You can also install your own images that contain your choice of packages and kernels. For more
information on installing your own image, see Bring your own SageMaker image (p. 169).

The different Jupyter kernels in Amazon SageMaker Studio notebooks are separate conda environments.
For information about conda environments, see Managing environments.

Package installation tools


The method that you use to install Python packages from the terminal differs depending on the image.
Studio supports the following package installation tools:

• Notebooks – The following commands are supported. If one of the following does not work on your
image, try the other one.
• %conda install
• %pip install
• The Jupyter terminal – You can install packages using pip and conda directly. You can also use apt-
get install to install system packages from the terminal.

Note
We do not recommend using pip install -u or pip install --user, because those
commands install packages on the user's Amazon EFS volume and can potentially block
JupyterServer app restarts. Instead, use a lifecycle configuration to reinstall the required
packages on app restarts, as shown in Install packages using lifecycle configurations (p. 154).

We recommend using %pip and %conda to install packages from within a notebook because they
correctly take into account the active environment or interpreter being used. For more information, see
Add %pip and %conda magic functions. You can also use the system command syntax (lines starting
with !) to install packages. For example, !pip install and !conda install.
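
For example, from a Jupyter terminal the same installs look like the following sketch; the package
names are only illustrative:

# Run from a Jupyter (image) terminal; package names are illustrative.
pip install pandas
conda install --yes -c conda-forge numpy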

Conda

Conda is an open source package management system and environment management system that can
install packages and their dependencies. SageMaker supports using conda with either of these two main
channels: the default channel or the conda-forge channel. For more information, see Conda channels.
The conda-forge channel is a community channel where contributors can upload packages.
Note
Installing packages from conda-forge can take up to 10 minutes. Timing relates to how conda
resolves the dependency graph.

All of the SageMaker-provided environments are functional. User-installed packages may not function
correctly.

Conda has two methods for activating environments: conda activate and source activate. For
more information, see Managing environments.

Supported conda operations

• conda install of a package in a single environment


• conda install of a package in all environments
• Installing a package from the main conda repository
• Installing a package from conda-forge
• Changing the conda install location to use Amazon EBS
• Supporting both conda activate and source activate

Pip

Pip is the tool for installing and managing Python packages. Pip searches for packages on the Python
Package Index (PyPI) by default. Unlike conda, pip doesn't have built-in environment support. Therefore,
pip isn't as thorough as conda when it comes to packages with native or system library dependencies. Pip
can be used to install packages in conda environments. You can use alternative package repositories with
pip instead of PyPI.

Supported pip operations

• Using pip to install a package without an active conda environment


• Using pip to install a package in a conda environment
• Using pip to install a package in all conda environments
• Changing the pip install location to use Amazon EBS
• Using an alternative repository to install packages with pip

Unsupported

SageMaker aims to support as many package installation operations as possible. However, if the
packages were installed by SageMaker and you use the following operations on these packages, it might
make your environment unstable:


• Uninstalling
• Downgrading
• Upgrading

Due to potential issues with network conditions or configurations, or the availability of conda or PyPI,
packages may not install in a fixed or deterministic amount of time.
Note
Attempting to install a package in an environment with incompatible dependencies can result
in a failure. If issues occur, you can contact the library maintainer about updating the package
dependencies. Modifying the environment, such as removing or updating existing packages,
may result in instability of that environment.

Install packages using lifecycle configurations


Install custom images and kernels on the Studio instance's Amazon EBS volume so that they persist
when you stop and restart the notebook, and so that any external libraries you install are not updated by
SageMaker. To do that, use a lifecycle configuration that includes both a script that runs when you create
the notebook (on-create) and a script that runs each time you restart the notebook (on-start). For
more information about using lifecycle configurations with Studio, see Use Lifecycle Configurations with
Amazon SageMaker Studio (p. 182). For sample lifecycle configuration scripts, see SageMaker Studio
Lifecycle Configuration Samples.

Share and Use an Amazon SageMaker Studio Notebook


You can share your Amazon SageMaker Studio notebooks with your colleagues. The shared notebook is
a copy. After you share your notebook, any changes you make to your original notebook aren't reflected
in the shared notebook, and any changes your colleagues make in their shared copies of the notebook
aren't reflected in your original notebook. If you want to share your latest version, you must create a new
snapshot and then share it.

Topics
• Share a Notebook (p. 154)
• Use a Shared Notebook (p. 155)

Share a Notebook
The following screenshot shows the menu from a Studio notebook.

To share a notebook

1. In the upper-right corner of the notebook, choose Share.


2. (Optional) In Create shareable snapshot, choose any of the following items:

• Include Git repo information – Includes a link to the Git repository that contains the notebook.
This enables you and your colleague to collaborate and contribute to the same Git repository.


• Include output – Includes all notebook output that has been saved.

Note
If you're a user in IAM Identity Center and you don't see these options, your IAM Identity
Center administrator probably disabled the feature. Contact your administrator.
3. Choose Create.
4. After the snapshot is created, choose Copy link and then choose Close.
5. Share the link with your colleague.

After selecting your sharing options, you are provided with a URL. You can share this link with users that
have access to Amazon SageMaker Studio. When the user opens the URL, they're prompted to log in
using IAM Identity Center or IAM authentication. This shared notebook becomes a copy, so changes made
by the recipient will not be reproduced in your original notebook.

Use a Shared Notebook


You use a shared notebook in the same way you would use a notebook that you created yourself. You
must first log in to your account, then open the shared link. If you don't have an active session, you
receive an error.

When you choose a link to a shared notebook for the first time, a read-only version of the notebook
opens. To edit the shared notebook, choose Create a Copy. This copies the shared notebook to your
personal storage.

The copied notebook launches on an instance of the instance type and SageMaker image that the
notebook was using when the sender shared it. If you aren't currently running an instance of the instance
type, a new instance is started. Customization to the SageMaker image isn't shared. You can also inspect
the notebook snapshot by choosing Snapshot Details.

The following are some important considerations about sharing and authentication:

• If you have an active session, you see a read-only view of the notebook until you choose Create a
Copy.
• If you don't have an active session, you need to log in.
• If you use IAM to log in: after you log in, select your user profile, then choose Open Studio. Then
choose the link you were sent.
• If you use IAM Identity Center to log in, the shared notebook opens automatically in Studio after you
log in.

Get Studio Notebook and App Metadata


You can access notebook metadata and App metadata using the Amazon SageMaker Studio UI.

Topics
• Get Studio Notebook Metadata (p. 155)
• Get App Metadata (p. 156)

Get Studio Notebook Metadata


Jupyter notebooks contain optional metadata that you can access through the Amazon SageMaker
Studio UI.


To view the notebook metadata:

1. In the right sidebar, choose the Property Inspector icon.
2. Open the Advanced Tools section.

The metadata should look similar to the following.

{
    "instance_type": "ml.t3.medium",
    "kernelspec": {
        "display_name": "Python 3 (Data Science)",
        "language": "python",
        "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:<acct-id>:image/datascience-1.0"
    },
    "language_info": {
        "codemirror_mode": {
            "name": "ipython",
            "version": 3
        },
        "file_extension": ".py",
        "mimetype": "text/x-python",
        "name": "python",
        "nbconvert_exporter": "python",
        "pygments_lexer": "ipython3",
        "version": "3.7.10"
    }
}

Get App Metadata


When you create a notebook in Amazon SageMaker Studio, the App metadata is written to a file named
resource-metadata.json in the folder /opt/ml/metadata/. You can get the App metadata
by opening an Image terminal from within the notebook. The metadata gives you the following
information, which includes the SageMaker image and instance type the notebook runs in:

• AppType – KernelGateway
• DomainId – Same as the StudioID
• UserProfileName – The profile name of the current user
• ResourceArn – The Amazon Resource Name (ARN) of the App, which includes the instance type
• ResourceName – The name of the SageMaker image

Additional metadata might be included for internal use by Studio and is subject to change.

To get the App metadata

1. In the center of the notebook menu, choose the Launch Terminal icon. This opens a terminal in the
SageMaker image that the notebook runs in.
2. Run the following commands to display the contents of the resource-metadata.json file.

$ cd /opt/ml/metadata/
cat resource-metadata.json


The file should look similar to the following.

{
    "AppType": "KernelGateway",
    "DomainId": "d-xxxxxxxxxxxx",
    "UserProfileName": "profile-name",
    "ResourceArn": "arn:aws:sagemaker:us-east-2:account-id:app/d-xxxxxxxxxxxx/profile-name/KernelGateway/datascience--1-0-ml-t3-medium",
    "ResourceName": "datascience--1-0-ml",
    "AppImageVersion": ""
}
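
If you need a single field from this file, such as the app's ResourceArn, one option is a command-line
filter. This sketch assumes the jq utility is available in your image, which may not be the case by
default:

# Print one field from the app metadata file (requires jq).
jq -r '.ResourceArn' /opt/ml/metadata/resource-metadata.json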

Get Notebook Differences


You can display the difference between the current notebook and the last checkpoint or the last Git
commit using the Amazon SageMaker UI.

The following screenshot shows the menu from a Studio notebook.

Topics
• Get the Difference Between the Last Checkpoint (p. 157)
• Get the Difference Between the Last Commit (p. 158)

Get the Difference Between the Last Checkpoint


When you create a notebook, a hidden checkpoint file that matches the notebook is created. You can
view changes between the notebook and the checkpoint file or revert the notebook to match the
checkpoint file.

By default, a notebook is auto-saved every 120 seconds and also when you close the notebook.
However, the checkpoint file isn't updated to match the notebook. To save the notebook and update the
checkpoint file to match, you must choose the Save notebook and create checkpoint icon on the left of
the notebook menu or use the Ctrl + S keyboard shortcut.

To view the changes between the notebook and the checkpoint file, choose the Checkpoint diff icon in
the center of the notebook menu.

To revert the notebook to the checkpoint file, from the main Studio menu, choose File then Revert
Notebook to Checkpoint.


Get the Difference Between the Last Commit


If a notebook is opened from a Git repository, you can view the difference between the notebook and the
last Git commit.

To view the changes in the notebook from the last Git commit, choose the Git diff icon in the center of
the notebook menu.

Manage Resources
You can change the instance type, SageMaker image, and kernel from within an Amazon SageMaker
Studio notebook. To create a custom kernel to use with your notebooks, see Bring your own SageMaker
image (p. 169).

Topics
• Change an Instance Type (p. 158)
• Change an Image or a Kernel (p. 159)
• Shut Down Resources (p. 159)

Change an Instance Type


When you open a new Studio notebook for the first time, you are assigned a default Amazon Elastic
Compute Cloud (Amazon EC2) instance type to run the notebook. When you open additional notebooks
on the same instance type, the notebooks run on the same instance as the first notebook, even if the
notebooks use different kernels.

You can change the instance type that your Studio notebook runs on from within the notebook.

The following information only applies to Studio notebooks. For information about how to change the
instance type of an Amazon SageMaker notebook instance, see Update a Notebook Instance (p. 212).
Important
If you change the instance type, unsaved information and existing settings for the notebook are
lost, and installed packages must be re-installed.
The previous instance type continues to run even if no kernel sessions or apps are active. You
must explicitly stop the instance to stop accruing charges. To stop the instance, see Shut Down
Resources (p. 160).

The following screenshot shows the menu from a Studio notebook. The processor and memory of the
instance type powering the notebook are displayed as 2 vCPU + 4 GiB.

To change the instance type

1. Choose the instance type.


2. In Select instance, choose one of the fast launch instance types that are listed. Or to see all instance
types, switch off Fast launch only. The list can be sorted by any column.


3. After choosing a type, choose Save and continue.


4. Wait for the new instance to become enabled, and then the new instance type information is
displayed.

For a list of the available instance types, see Available Studio Instance Types (p. 162).

Change an Image or a Kernel


With Amazon SageMaker Studio notebooks, you can change the notebook's image or kernel from within
the notebook.

The following screenshot shows the menu from a Studio notebook. The current SageMaker kernel
and image are displayed as Python 3 (Data Science), where Python 3 denotes the kernel and Data
Science denotes the SageMaker image that contains the kernel. The color of the circle to the right
indicates whether the kernel is idle or busy. The kernel is busy when the center and the edge of the circle
are the same color.

To change a notebook's image or kernel

1. Choose the image/kernel name in the notebook menu.


2. From the drop-down list, choose an image and/or a kernel.
3. Choose Select.
4. Wait for the kernel's status to show as idle, which indicates the kernel has started.

For a list of available SageMaker images, see Available Amazon SageMaker Images (p. 164).

For a list of available SageMaker kernels, see Available Amazon SageMaker Kernels (p. 167).

Shut Down Resources


You can shut down individual resources, including notebooks, terminals, kernels, apps, and instances. You
can also shut down all resources in one of these categories at the same time.


Note
Amazon SageMaker Studio does not support shutting down resources from within a notebook.

Topics
• Shut Down an Open Notebook (p. 160)
• Shut Down Resources (p. 160)

Shut Down an Open Notebook

You can shut down an open notebook from the Amazon SageMaker Studio File menu or from the
Running Terminal and Kernels pane.
Note
When you shut down a notebook, any unsaved information in the notebook is lost. The
notebook is not deleted.

To shut down an open notebook from the File menu

1. Optionally, save the notebook contents by choosing the Disk icon on the left of the notebook menu.
2. Choose File then Close and Shutdown Notebook.
3. Choose OK.

Shut Down Resources

You can reach the Running Terminals and Kernels pane on the left side of Amazon SageMaker Studio
with the Running Terminals and Kernels icon. The pane consists of four sections. Each section
lists all the resources of that type. You can shut down each resource individually or shut down all the
resources in a section at the same time.

When you choose to shut down all resources in a section, the following occurs:

• RUNNING INSTANCES/RUNNING APPS – All instances, apps, notebooks, kernel sessions, consoles/
shells, and image terminals are shut down. System terminals aren't shut down.
Note
When you shut down the Studio notebook instances, any additional resources created from
Studio, such as SageMaker endpoints, Amazon EMR clusters, and Amazon S3 buckets, are
not deleted. Delete those resources to stop accrual of charges.


• KERNEL SESSIONS – All kernels, notebooks and consoles/shells are shut down.
• TERMINAL SESSIONS – All image terminals and system terminals are shut down.

To shut down resources

1. In the left sidebar, choose the Running Terminals and Kernels icon.
2. Do either of the following:


• To shut down a specific resource, choose the Shut Down icon on the same row as the resource.

For running instances, a confirmation dialog lists all the resources that will be shut down. For
running apps, a confirmation dialog is displayed. Choose Shut down all to proceed.
Note
No confirmation dialog is displayed for kernel sessions or terminal sessions.
• To shut down all resources in a section, choose the X to the right of the section label. A
confirmation dialog is displayed. Choose Shut down all to proceed.
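
You can also shut down an app without using the Studio UI. As a sketch, the following AWS CLI call
deletes a named KernelGateway app, which shuts down the kernels running on it; all of the values shown
are placeholders:

aws sagemaker delete-app \
    --domain-id <YOUR_DOMAIN_ID> \
    --user-profile-name <YOUR_USERPROFILE_NAME> \
    --app-type KernelGateway \
    --app-name <APP_NAME>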

Usage Metering
There is no additional charge for using Amazon SageMaker Studio. The costs incurred for running
Amazon SageMaker Studio notebooks, interactive shells, consoles, and terminals are based on Amazon
Elastic Compute Cloud (Amazon EC2) instance usage.

When you run the following resources, you must choose a SageMaker image and kernel:

From the Studio Launcher

• Notebook
• Interactive Shell
• Image Terminal

From the File menu

• Notebook
• Console

When launched, the resource is run on an Amazon EC2 instance of the chosen instance type. If an
instance of that type was previously launched and is available, the resource is run on that instance.

For CPU based images, the default suggested instance type is ml.t3.medium. For GPU based images,
the default suggested instance type is ml.g4dn.xlarge.

The costs incurred are based on the instance type. You are billed separately for each instance.

Metering starts when an instance is created. Metering ends when all the apps on the instance are shut
down, or the instance is shut down. For information about how to shut down an instance, see Shut Down
Resources (p. 159).
Important
You must shut down the instance to stop incurring charges. If you shut down the notebook
running on the instance but don't shut down the instance, you will still incur charges. When
you shut down the Studio notebook instances, any additional resources created from Studio,
such as SageMaker endpoints, Amazon EMR clusters, and Amazon S3 buckets, are not deleted.
Delete those resources to stop accrual of charges.

When you open multiple notebooks on the same instance type, the notebooks run on the same instance
even if they are using different kernels. You are billed only for the time that one instance is running.

You can change the instance type from within the notebook after you open it. For more information, see
Change an Instance Type (p. 158).

For information about billing along with pricing examples, see Amazon SageMaker Pricing.

Available Resources
The following sections list the available resources for Amazon SageMaker Studio notebooks.

Topics
• Available Studio Instance Types (p. 162)
• Available Amazon SageMaker Images (p. 164)
• Available Amazon SageMaker Kernels (p. 167)

Available Studio Instance Types


The following Amazon Elastic Compute Cloud (Amazon EC2) instance types are available for use with
Studio notebooks.

For detailed information on which instance types fit your use case, and their performance capabilities,
see Amazon Elastic Compute Cloud Instance types.

For information about available Amazon SageMaker Notebook Instance types, see
CreateNotebookInstance.
Note
For most use cases, you should use a ml.t3.medium. This is the default instance type for CPU-
based SageMaker images, and is available as part of the AWS Free Tier.

>> Fast launch instance types are optimized to start in under two minutes.

Default instance types

• CPU-based images: ml.t3.medium >> Fast launch


• GPU-based images: ml.g4dn.xlarge >> Fast launch

General purpose (no GPUs)

• ml.t3.medium >> Fast launch


• ml.t3.large
• ml.t3.xlarge
• ml.t3.2xlarge
• ml.m5.large >> Fast launch
• ml.m5.xlarge
• ml.m5.2xlarge
• ml.m5.4xlarge
• ml.m5.8xlarge


• ml.m5.12xlarge
• ml.m5.16xlarge
• ml.m5.24xlarge
• ml.m5d.large
• ml.m5d.xlarge
• ml.m5d.2xlarge
• ml.m5d.4xlarge
• ml.m5d.8xlarge
• ml.m5d.12xlarge
• ml.m5d.16xlarge
• ml.m5d.24xlarge

Compute optimized (no GPUs)

• ml.c5.large >> Fast launch


• ml.c5.xlarge
• ml.c5.2xlarge
• ml.c5.4xlarge
• ml.c5.9xlarge
• ml.c5.12xlarge
• ml.c5.18xlarge
• ml.c5.24xlarge

Memory optimized (no GPUs)

• ml.r5.large
• ml.r5.xlarge
• ml.r5.2xlarge
• ml.r5.4xlarge
• ml.r5.8xlarge
• ml.r5.12xlarge
• ml.r5.16xlarge
• ml.r5.24xlarge

Accelerated computing (1+ GPUs)

• ml.p3.2xlarge
• ml.p3.8xlarge
• ml.p3.16xlarge
• ml.p3dn.24xlarge
• ml.g4dn.xlarge >> Fast launch
• ml.g4dn.2xlarge
• ml.g4dn.4xlarge
• ml.g4dn.8xlarge
• ml.g4dn.12xlarge
• ml.g4dn.16xlarge
• ml.g5.xlarge


• ml.g5.2xlarge
• ml.g5.4xlarge
• ml.g5.8xlarge
• ml.g5.12xlarge
• ml.g5.24xlarge
• ml.g5.48xlarge

Available Amazon SageMaker Images


The following SageMaker images are available in Amazon SageMaker Studio. SageMaker images contain
the latest Amazon SageMaker Python SDK and the latest version of the kernel. The name in brackets ([ ])
is the resource identifier of the SageMaker image as specified in the Amazon Resource Name (ARN) for
the SageMaker image. For more information, see Deep Learning Containers Images.

• Base Python [python-3.6]

Official Python 3.6 image from DockerHub with boto3 and AWS CLI included.
• Base Python 2.0 [sagemaker-base-python-38]

Official Python 3.8 image from DockerHub with boto3 and AWS CLI included.
• Base Python 3.0 [sagemaker-base-python-310-v1]

Official Python 3.10 image from DockerHub with boto3 and AWS CLI included.
• Data Science [datascience-1.0]

Data Science is a Python 3.7 conda image with the most commonly used Python packages and
libraries, such as NumPy and SciKit Learn.
• Data Science 2.0 [sagemaker-data-science-38]

Data Science 2.0 is a Python 3.8 conda image based on anaconda version 2021.11 with the most
commonly used Python packages and libraries, such as NumPy and SciKit Learn.
• Data Science 3.0 [sagemaker-data-science-310-v1]

Data Science 3.0 is a Python 3.10 conda image based on anaconda version 2022.10 with the most
commonly used Python packages and libraries, such as NumPy and SciKit Learn.
• Amazon SageMaker geospatial [sagemaker-geospatial-1.0]

Amazon SageMaker geospatial is a Python image consisting of commonly used geospatial libraries
such as GDAL, Fiona, GeoPandas, Shapely, and Rasterio, and allows you to visualize geospatial data
within SageMaker. For more information, see Amazon SageMaker geospatial Notebook SDK.
• SparkMagic [sagemaker-sparkmagic]

Anaconda Individual Edition with PySpark and Spark kernels. For more information, see sparkmagic.
• SparkAnalytics 1.0 [sagemaker-sparkanalytics-v1]

Anaconda Individual Edition with PySpark and Spark kernels. For more information, see sparkmagic.
• SparkAnalytics 2.0 [sagemaker-sparkanalytics-310-v1]

Anaconda Individual Edition with PySpark and Spark kernels. For more information, see sparkmagic.
• MXNet 1.6 Python 3.6 (optimized for CPU) [mxnet-1.6-cpu-py36]

The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.6 include containers for
training on CPU, optimized for performance and scale on AWS. For more information, see AWS Deep
Learning Containers for MXNet 1.6.0 .


• MXNet 1.6 Python 3.6 (optimized for GPU) [mxnet-1.6-gpu-py36]

The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.6 with CUDA 10.1
include containers for training on GPU, optimized for performance and scale on AWS. For more
information, see AWS Deep Learning Containers for MXNet 1.6.0 .
• MXNet 1.8 Python 3.7 (optimized for CPU) [mxnet-1.8-cpu-py37-ubuntu16.04-v1]

The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.8 include containers for
training on CPU, optimized for performance and scale on AWS. For more information, see AWS Deep
Learning Containers for AWS MX 1.8.0 .
• MXNet 1.8 Python 3.7 (optimized for GPU) [mxnet-1.8-gpu-py37-cu110-ubuntu16.04-v1]

The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.8 with CUDA 11.0
include containers for training on GPU, optimized for performance and scale on AWS. For more
information, see AWS Deep Learning Containers for AWS MX 1.8.0 .
• MXNet 1.9 Python 3.8 (optimized for CPU) [mxnet-1.9-cpu-py38-ubuntu20.04-sagemaker-v1.0]

The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.9 include containers for
training on CPU, optimized for performance and scale on AWS. For more information, see AWS Deep
Learning Containers for MX 1.9.0 on SageMaker .
• MXNet 1.9 Python 3.8 (optimized for GPU) [mxnet-1.9-gpu-py38-cu112-ubuntu20.04-sagemaker-v1.0]

The AWS Deep Learning Containers for AWS MX powered by Apache MXNet 1.9 with CUDA 11.2
include containers for training on GPU, optimized for performance and scale on AWS. For more
information, see AWS Deep Learning Containers for MX 1.9.0 on SageMaker .
• PyTorch 1.10 Python 3.8 (optimized for CPU) [pytorch-1.10-cpu-py38]

The AWS Deep Learning Containers for PyTorch 1.10 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers for
PyTorch 1.10.2 on SageMaker .
• PyTorch 1.10 Python 3.8 (optimized for GPU) [pytorch-1.10-gpu-py38]

The AWS Deep Learning Containers for PyTorch 1.10 with CUDA 11.3 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.10.2 on SageMaker .
• PyTorch 1.4 Python 3.6 (optimized for CPU) [pytorch-1.4-cpu-py36]

The AWS Deep Learning Containers for PyTorch 1.4 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers v3.2 for
PyTorch .
• PyTorch 1.4 Python 3.6 (optimized for GPU) [pytorch-1.4-gpu-py36]

The AWS Deep Learning Containers for PyTorch 1.4 with CUDA 10.1 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v3.2 for PyTorch .
• PyTorch 1.6 Python 3.6 (optimized for CPU) [pytorch-1.6-cpu-py36-ubuntu16.04-v1]

The AWS Deep Learning Containers for PyTorch 1.6 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers for
PyTorch 1.6.0 .
• PyTorch 1.6 Python 3.6 (optimized for GPU) [pytorch-1.6-gpu-py36-cu110-ubuntu18.04-v3]

The AWS Deep Learning Containers for PyTorch 1.6 with CUDA 11.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.6.0 with CUDA 11.0 .
• PyTorch 1.8 Python 3.6 (optimized for CPU) [1.8.1-cpu-py36]


The AWS Deep Learning Containers for PyTorch 1.8 include containers for training on CPU, optimized
for performance and scale on AWS. For more information, see AWS Deep Learning Containers for
PyTorch 1.8.0 .
• PyTorch 1.8 Python 3.6 (optimized for GPU) [pytorch-1.8-gpu-py36]

The AWS Deep Learning Containers for PyTorch 1.8 with CUDA 11.1 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.8.0 .
• PyTorch 1.12 Python 3.8 (optimized for CPU) [pytorch-1.12-cpu-py38]

The AWS Deep Learning Containers for PyTorch 1.12 include containers for training
on CPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.12.0 .
• PyTorch 1.12 Python 3.8 (optimized for GPU) [pytorch-1.12-gpu-py38]

The AWS Deep Learning Containers for PyTorch 1.12 with CUDA 11.3 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for PyTorch 1.12.0.
• TensorFlow 1.15 Python 3.6 (optimized for CPU) [tensorflow-1.15-cpu-py36]

The AWS Deep Learning Containers for TensorFlow 1.15 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers with TensorFlow 1.15.3 .
• TensorFlow 1.15 Python 3.6 (optimized for GPU) [tensorflow-1.15-gpu-py36]

The AWS Deep Learning Containers for TensorFlow 1.15 with CUDA 10.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers with TensorFlow 1.15.3 .
• TensorFlow 1.15 Python 3.7 (optimized for CPU) [tensorflow-1.15-cpu-py37-ubuntu18.04-v7]

The AWS Deep Learning Containers for TensorFlow 1.15 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v7.0 for TensorFlow .
• TensorFlow 1.15 Python 3.7 (optimized for GPU) [tensorflow-1.15-gpu-py37-cu110-ubuntu18.04-v8]

The AWS Deep Learning Containers for TensorFlow 1.15 with CUDA 11.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v7.0 for TensorFlow .
• TensorFlow 2.1 Python 3.6 (optimized for CPU) [tensorflow-2.1-cpu-py36]

The AWS Deep Learning Containers for TensorFlow 2.1 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v6.2 for Tensorflow .
• TensorFlow 2.1 Python 3.6 (optimized for GPU) [tensorflow-2.1-gpu-py36]

The AWS Deep Learning Containers for TensorFlow 2.1 with CUDA 10.1 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers v6.2 for Tensorflow .
• TensorFlow 2.3 Python 3.7 (optimized for CPU) [tensorflow-2.3-cpu-py37-ubuntu18.04-v1]

The AWS Deep Learning Containers for TensorFlow 2.3 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers with TensorFlow 2.3.0 .
• TensorFlow 2.3 Python 3.7 (optimized for GPU) [tensorflow-2.3-gpu-py37-cu110-ubuntu18.04-v3]


The AWS Deep Learning Containers for TensorFlow 2.3 with CUDA 11.0 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for TensorFlow 2.3.1 with CUDA 11.0 .
• TensorFlow 2.6 Python 3.8 (optimized for CPU) [tensorflow-2.6-cpu-py38-ubuntu20.04-v1]

The AWS Deep Learning Containers for TensorFlow 2.6 include containers for training on CPU,
optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for TensorFlow 2.6 .
• TensorFlow 2.6 Python 3.8 (optimized for GPU) [tensorflow-2.6-gpu-py38-cu112-ubuntu20.04-v1]

The AWS Deep Learning Containers for TensorFlow 2.6 with CUDA 11.2 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see AWS Deep Learning
Containers for TensorFlow 2.6 .
• TensorFlow 2.10 Python 3.9 (optimized for CPU) [2.10.0-cpu-py39-ubuntu20.04-sagemaker-v1.0]

The AWS Deep Learning Containers for TensorFlow 2.10 include containers for training
on CPU, optimized for performance and scale on AWS. For more information, see Release Notes for
Deep Learning Containers.
• TensorFlow 2.10 Python 3.9 (optimized for GPU) [tensorflow-2.10-gpu-py39-cu112-ubuntu20.04-
sagemaker-v1]

The AWS Deep Learning Containers for TensorFlow 2.10 with CUDA 11.2 include containers for training
on GPU, optimized for performance and scale on AWS. For more information, see Release Notes for
Deep Learning Containers.

Available Amazon SageMaker Kernels


The following Amazon SageMaker kernels are available in Studio. The name in parentheses is the
SageMaker image hosting the kernel.

Data Science is a conda image with the most commonly used Python packages and libraries, such as
NumPy and scikit-learn.

• Python 3 (Base Python) with Python 3.6


• Python 3 (Base Python 2.0) with Python 3.8
• Python 3 (Data Science) with Python 3.7
• Python 3 (Data Science 2.0) with Python 3.8
• PySpark (SparkMagic) with Python 3.7
• Spark (SparkMagic) with Python 3.7
• Python 3 (MXNet 1.6 Python 3.6 CPU Optimized)
• Python 3 (MXNet 1.6 Python 3.6 GPU Optimized)
• Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)
• Python 3 (MXNet 1.8 Python 3.7 GPU Optimized)
• Python 3 (MXNet 1.9 Python 3.8 CPU Optimized)
• Python 3 (MXNet 1.9 Python 3.8 GPU Optimized)
• Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized)
• Python 3 (PyTorch 1.10 Python 3.8 GPU Optimized)
• Python 3 (PyTorch 1.4 Python 3.6 CPU Optimized)
• Python 3 (PyTorch 1.4 Python 3.6 GPU Optimized)
• Python 3 (PyTorch 1.6 Python 3.6 CPU Optimized)
• Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)


• Python 3 (PyTorch 1.8 Python 3.6 CPU Optimized)


• Python 3 (PyTorch 1.8 Python 3.6 GPU Optimized)
• Python 3 (SageMaker JumpStart Data Science 1.0) with Python 3.7
• Python 3 (SageMaker JumpStart MXNet 1.0) with Python 3.7
• Python 3 (SageMaker JumpStart PyTorch 1.0) with Python 3.7
• Python 3 (SageMaker JumpStart TensorFlow 1.0) with Python 3.7
• Python 3 (TensorFlow 1.15 Python 3.6 CPU Optimized)
• Python 3 (TensorFlow 1.15 Python 3.6 GPU Optimized)
• Python 3 (TensorFlow 1.15 Python 3.7 CPU Optimized)
• Python 3 (TensorFlow 1.15 Python 3.7 GPU Optimized)
• Python 3 (TensorFlow 2.1 Python 3.6 CPU Optimized)
• Python 3 (TensorFlow 2.1 Python 3.6 GPU Optimized)
• Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)
• Python 3 (TensorFlow 2.3 Python 3.7 GPU Optimized)
• Python 3 (TensorFlow 2.6 Python 3.8 CPU Optimized)
• Python 3 (TensorFlow 2.6 Python 3.8 GPU Optimized)

Customize Amazon SageMaker Studio


There are four options for customizing your Amazon SageMaker Studio environment: bring your own
SageMaker image, use a Lifecycle Configuration script, attach suggested Git repos to Studio, or create
kernels using persistent Conda environments in Amazon EFS. These four options can be used individually
or together.

• Bring your own SageMaker image: A SageMaker image is a file that identifies the kernels, language
packages, and other dependencies required to run a Jupyter notebook in Amazon SageMaker Studio.
Amazon SageMaker provides many built-in images for you to use. If you need different functionality,
you can bring your own custom images to Studio.
• Use Lifecycle Configurations with Amazon SageMaker Studio: Lifecycle Configurations are
shell scripts triggered by Amazon SageMaker Studio lifecycle events, such as starting a new
Studio notebook. You can use Lifecycle Configurations to automate customization for your Studio
environment. For example, you can install custom packages, configure notebook extensions, preload
datasets, and set up source code repositories.
• Attach suggested Git repos to Studio: You can attach suggested Git repository URLs at the Amazon
SageMaker Domain or user profile level. Then, you can select the repo URL from the list of suggestions
and clone that into your environment using the Git extension in Studio.
• Persist Conda environments to the Studio Amazon EFS volume: Studio uses an Amazon EFS volume
as a persistent storage layer. You can save your Conda environment on this Amazon EFS volume, then
use the saved environment to create kernels, as shown in the sketch after this list. Studio automatically
picks up all valid environments saved in Amazon EFS as KernelGateway kernels. These kernels persist
through restart of the kernel, app, and Studio. For more information, see the Persist Conda
environments to the Studio EFS volume section in Four approaches to manage Python packages in
Amazon SageMaker Studio notebooks.
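The following is a minimal sketch of persisting a Conda environment to the Amazon EFS home
directory from a Studio terminal. The environment path, name, and packages are hypothetical
examples; any valid environment saved under the home directory works the same way.

# Create the environment under the persistent home directory on Amazon EFS
conda create --yes --prefix /home/sagemaker-user/.conda/envs/my-env python=3.10 ipykernel

# Install any additional packages into the environment
conda run --prefix /home/sagemaker-user/.conda/envs/my-env pip install numpy pandas

Because the environment lives on the Amazon EFS volume, it persists across restarts, and Studio
discovers it as a KernelGateway kernel.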

The following topics show how to use the first three of these options to customize your Amazon
SageMaker Studio environment.

Topics
• Bring your own SageMaker image (p. 169)
• Use Lifecycle Configurations with Amazon SageMaker Studio (p. 182)


• Attach Suggested Git Repos to Studio (p. 190)

Bring your own SageMaker image


A SageMaker image is a file that identifies the kernels, language packages, and other dependencies
required to run a Jupyter notebook in Amazon SageMaker Studio. These images are used to create an
environment that you then run Jupyter notebooks from. Amazon SageMaker provides many built-in
images for you to use. For the list of built-in images, see Available Amazon SageMaker Images (p. 164).

If you need different functionality, you can bring your own custom images to Studio. You can create
images and image versions, and attach image versions to your domain or shared space, using the
SageMaker control panel, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS
CLI). You can also create images and image versions using the SageMaker console, even if you haven't
onboarded to a SageMaker domain. SageMaker provides sample Dockerfiles to use as a starting point for
your custom SageMaker images in the SageMaker Studio Custom Image Samples repository.

The following topics explain how to bring your own image using the SageMaker console or AWS CLI,
then launch the image in Studio. For a similar blog article, see Bringing your own R environment to
Amazon SageMaker Studio. For notebooks that show how to bring your own image for use in training
and inference, see Amazon SageMaker Studio Container Build CLI.

Key terminology
The following section defines key terms for bringing your own image to use with Studio.

• Dockerfile: A Dockerfile is a file that identifies the language packages and other dependencies for your
Docker image.
• Docker image: A Docker image is the image built from a Dockerfile. This image is pushed to Amazon
ECR and serves as the basis of the SageMaker image.
• SageMaker image: A SageMaker image is a holder for a set of SageMaker image versions based on
Docker images. Each image version is immutable.
• Image version: An image version of a SageMaker image represents a Docker image and is stored in an
Amazon ECR repository. Each image version is immutable. These image versions can be attached to a
domain or shared space and used with Studio.

Topics
• Custom SageMaker image specifications (p. 169)
• Prerequisites (p. 171)
• Add a Docker image compatible with Studio to Amazon ECR (p. 171)
• Create a custom SageMaker image (p. 172)
• Attach a custom SageMaker image (p. 175)
• Launch a custom SageMaker image in Amazon SageMaker Studio (p. 180)
• Clean up resources (p. 181)

Custom SageMaker image specifications


The following specifications apply to the container image that is represented by a SageMaker image
version.

Running the image

ENTRYPOINT and CMD instructions are overridden to enable the image to run as a KernelGateway
app.


Port 8888 in the image is reserved for running the KernelGateway web server.
Stopping the image

The DeleteApp API issues the equivalent of a docker stop command. Other processes in the
container won’t get the SIGKILL/SIGTERM signals.
Kernel discovery

SageMaker recognizes kernels as defined by Jupyter kernel specs.

You can specify a list of kernels to display before running the image. If not specified, python3 is
displayed. Use the DescribeAppImageConfig API to view the list of kernels.

Conda environments are recognized as kernel specs by default.


File system

The /opt/.sagemakerinternal and /opt/ml directories are reserved. Any data in these
directories might not be visible at runtime.
User data

Each user in a domain gets a user directory on a shared Amazon Elastic File System volume in the
image. The location of the current user's directory on the Amazon EFS volume is configurable. By
default, the location of the directory is /home/sagemaker-user.

SageMaker configures POSIX UID/GID mappings between the image and the host. This defaults to
mapping the root user's UID/GID (0/0) to the UID/GID on the host.

You can specify these values using the CreateAppImageConfig API.


GID/UID limits

Amazon SageMaker Studio only supports the following DefaultUID and DefaultGID
combinations:
• DefaultUID: 1000 and DefaultGID: 100, which corresponds to a non-privileged user.
• DefaultUID: 0 and DefaultGID: 0, which corresponds to root access.
Metadata

A metadata file is located at /opt/ml/metadata/resource-metadata.json. No additional
environment variables are added to the variables defined in the image. For more information, see
Get App Metadata (p. 156).
GPU

On a GPU instance, the image is run with the --gpus option. Only the CUDA toolkit should be
included in the image, not the NVIDIA drivers. For more information, see NVIDIA User Guide.
Metrics and logging

Logs from the KernelGateway process are sent to Amazon CloudWatch in the customer’s account.
The name of the log group is /aws/sagemaker/studio. The name of the log stream is
$domainID/$userProfileName/KernelGateway/$appName.
Image size

Limited to 25 GB. To view the size of your image, run docker image ls.
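To verify how SageMaker will interpret your image, you can describe the associated app image
config from the AWS CLI. A minimal sketch, assuming an app image config named
custom-image-config (created as shown later in this topic):

aws sagemaker describe-app-image-config \
    --app-image-config-name custom-image-config

The response includes the kernel specs and the FileSystemConfig values described above.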

Sample Dockerfile

The following sample Dockerfile creates an image based on Amazon Linux 2, installs third-party
packages and the python3 kernel, and sets the scope to the non-privileged user.


FROM public.ecr.aws/amazonlinux/amazonlinux:2

ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"

RUN \
    yum install --assumeyes python3 shadow-utils && \
    useradd --create-home --shell /bin/bash --gid "${NB_GID}" --uid ${NB_UID} ${NB_USER} && \
    yum clean all && \
    python3 -m pip install ipykernel && \
    python3 -m ipykernel install

USER ${NB_UID}
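Before pushing the image, you can build and inspect it locally with Docker. A minimal sketch,
assuming the sample Dockerfile above is in the current directory; the jupyter-kernelspec
command is expected to be available because ipykernel installs it as a dependency:

# Build the image locally
docker build -t smstudio-custom:custom .

# List the kernel specs that SageMaker will discover
docker run --rm smstudio-custom:custom jupyter-kernelspec list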

Prerequisites
You must satisfy the following prerequisites to bring your own container for use with Amazon SageMaker
Studio.

• The Docker application. For information about setting up Docker, see Orientation and setup.
• Install the AWS CLI by following the steps in Getting started with the AWS CLI.
• A local copy of any Dockerfile for creating a Studio compatible image. For sample custom images, see
the SageMaker Studio custom image samples repository.
• Permissions to access the Amazon Elastic Container Registry (Amazon ECR) service. For more
information, see Amazon ECR Managed Policies.
• An AWS Identity and Access Management execution role that has the AmazonSageMakerFullAccess
policy attached. If you have onboarded to Amazon SageMaker domain, you can get the role from the
Domain Summary section of the SageMaker control panel.
• Install the Studio image build CLI by following the steps in SageMaker Docker Build. This CLI enables
you to build a Dockerfile using AWS CodeBuild.

Add a Docker image compatible with Studio to Amazon ECR


You perform the following steps to add a container image to Amazon ECR:

• Create an Amazon ECR repository.


• Authenticate to Amazon ECR.
• Build a Docker image compatible with Studio.
• Push the image to the Amazon ECR repository.

Note
The Amazon ECR repository must be in the same AWS Region as Studio.

To build and add a container image to Amazon ECR

1. Create an Amazon ECR repository using the AWS CLI. To create the repository using the Amazon ECR
console, see Creating a repository.

aws ecr create-repository \
    --repository-name smstudio-custom \
    --image-scanning-configuration scanOnPush=true

The response should look similar to the following.


{
"repository": {
"repositoryArn": "arn:aws:ecr:us-east-2:acct-id:repository/smstudio-custom",
"registryId": "acct-id",
"repositoryName": "smstudio-custom",
"repositoryUri": "acct-id.dkr.ecr.us-east-2.amazonaws.com/smstudio-custom",
...
}
}
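The overview above lists authenticating to Amazon ECR as a step. The Studio image build CLI
used in the next step handles authentication for you; if you instead build and push with plain
Docker, a typical sketch is the following, with acct-id and region replaced by your own values:

aws ecr get-login-password --region region | \
    docker login --username AWS --password-stdin acct-id.dkr.ecr.region.amazonaws.com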

2. Build the Dockerfile using the Studio image build CLI. The period (.) specifies the current
directory as the build context, which must contain the Dockerfile. This command builds the image
and uploads the built image to the ECR repo. It then outputs the image URI.

sm-docker build . --repository smstudio-custom:custom

The response should look similar to the following.

Image URI: <acct-id>.dkr.ecr.<region>.amazonaws.com/<image_name>

Create a custom SageMaker image


This topic describes how you can create a custom SageMaker image using the SageMaker console or AWS
CLI.

When you create an image from the console, SageMaker also creates an initial image version. The image
version represents a container image in Amazon Elastic Container Registry (ECR). The container image
must satisfy the requirements to be used in Amazon SageMaker Studio. For more information, see
Custom SageMaker image specifications (p. 169). For information on testing your image locally and
resolving common issues, see the SageMaker Studio Custom Image Samples repo.

After you have created your custom SageMaker image, you must attach it to your domain or shared
space to use it with Studio. For more information, see Attach a custom SageMaker image (p. 175).

Create a SageMaker image from the console


The following section demonstrates how to create a custom SageMaker image from the SageMaker
console.

To create an image

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Images.
3. On the Custom images page, choose Create image.
4. For Image source, enter the registry path to the container image in Amazon ECR. The path is in the
following format:

acct-id.dkr.ecr.region.amazonaws.com/repo-name[:tag] or [@digest]
5. Choose Next.
6. Under Image properties, enter the following:

• Image name – The name must be unique to your account in the current AWS Region.
• (Optional) Display name – The name displayed in the Studio user interface. When not provided,
Image name is displayed.
• (Optional) Description – A description of the image.


• IAM role – The role must have the AmazonSageMakerFullAccess policy attached. Use the
dropdown menu to choose one of the following options:
• Create a new role – Specify any additional Amazon Simple Storage Service (Amazon S3) buckets
that you want users of your notebooks to have access to. If you don't want to allow access to
additional buckets, choose None.

SageMaker attaches the AmazonSageMakerFullAccess policy to the role. The role allows
users of your notebooks access to the S3 buckets listed next to the checkmarks.
• Enter a custom IAM role ARN – Enter the Amazon Resource Name (ARN) of your IAM role.
• Use existing role – Choose one of your existing roles from the list.
• (Optional) Image tags – Choose Add new tag. You can add up to 50 tags. Tags are searchable
using the Studio user interface, the SageMaker console, or the SageMaker Search API.
7. Choose Submit.

The new image is displayed in the Custom images list and briefly highlighted. After the image has been
successfully created, you can choose the image name to view its properties or choose Create version to
create another version.

To create another image version

1. Choose Create version on the same row as the image.


2. For Image source, enter the registry path to the Amazon ECR container image. The container image
shouldn't be the same image as used in a previous version of the SageMaker image.

Create a SageMaker image from the AWS CLI


You perform the following steps to create a SageMaker image from the container image using the AWS
CLI.

• Create an Image.
• Create an ImageVersion.
• Create a configuration file.
• Create an AppImageConfig.

To create the SageMaker image entities

1. Create a SageMaker image.

aws sagemaker create-image \
    --image-name custom-image \
    --role-arn arn:aws:iam::<acct-id>:role/service-role/<execution-role>

The response should look similar to the following.

{
"ImageArn": "arn:aws:sagemaker:us-east-2:acct-id:image/custom-image"
}

2. Create a SageMaker image version from the container image.

aws sagemaker create-image-version \
    --image-name custom-image \
    --base-image <acct-id>.dkr.ecr.<region>.amazonaws.com/smstudio-custom:custom-image


The response should look similar to the following.

{
"ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/custom-
image/1"
}

3. Check that the image version was successfully created.

aws sagemaker describe-image-version \
    --image-name custom-image \
    --version-number 1

The response should look similar to the following.

{
"ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/custom-
image/1",
"ImageVersionStatus": "CREATED"
}

Note
If the response is "ImageVersionStatus": "CREATED_FAILED", the response also
includes the failure reason. A permissions issue is a common cause of failure. You also
can check your Amazon CloudWatch logs if you experience a failure when starting or
running the KernelGateway app for a custom image. The name of the log group is /aws/
sagemaker/studio. The name of the log stream is $domainID/$userProfileName/
KernelGateway/$appName.
4. Create a configuration file, named app-image-config-input.json. The Name value of
KernelSpecs must match the name of the kernelSpec available in the Image associated with this
AppImageConfig. This value is case sensitive. You can find the available kernelSpecs in an image
by running jupyter-kernelspec list from a shell inside the container. MountPath is the path
within the image to mount your Amazon Elastic File System (Amazon EFS) home directory. It needs
to be different from the path you use inside the container because that path will be overridden when
your Amazon EFS home directory is mounted.
Note
The following DefaultUID and DefaultGID combinations are the only accepted values:

• DefaultUID: 1000 and DefaultGID: 100


• DefaultUID: 0 and DefaultGID: 0

{
"AppImageConfigName": "custom-image-config",
"KernelGatewayImageConfig": {
"KernelSpecs": [
{
"Name": "python3",
"DisplayName": "Python 3 (ipykernel)"
}
],
"FileSystemConfig": {
"MountPath": "/home/sagemaker-user",
"DefaultUid": 1000,
"DefaultGid": 100
}
}


5. Create the AppImageConfig using the file created in the previous step.

aws sagemaker create-app-image-config \
    --cli-input-json file://app-image-config-input.json

The response should look similar to the following.

{
"AppImageConfigArn": "arn:aws:sagemaker:us-east-2:acct-id:app-image-config/custom-
image-config"
}

Attach a custom SageMaker image


To use a custom SageMaker image, you must attach a version of the image to your domain or shared
space. When you attach an image version, it appears in the SageMaker Studio Launcher and is available
in the Select image dropdown list, which users use to launch an activity or change the image used by a
notebook.

To make a custom SageMaker image available to all users within a domain, you attach the image to the
domain. To make an image available to all users within a shared space, you can attach the image to the
shared space. To make an image available to a single user, you attach the image to the user's profile.
When you attach an image, SageMaker uses the latest image version by default. You can also attach a
specific image version. After you attach the version, you can choose the version from the SageMaker
Launcher or the image selector when you launch a notebook.

There is a limit to the number of image versions that can be attached at any given time. After you reach
the limit, you must detach a version in order to attach another version of the image.

The following sections demonstrate how to attach a custom SageMaker image to your domain using
either the SageMaker console or the AWS CLI. You can only attach a custom image to a shared space
using the AWS CLI.

Attach the SageMaker image to a Domain

Attach the SageMaker image using the Console

This topic describes how you can attach an existing custom SageMaker image version to your domain
using the SageMaker control panel. You can also create a custom SageMaker image and image version,
and then attach that version to your domain. For the procedure to create an image and image version,
see Create a custom SageMaker image (p. 172).

To attach an existing image

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Domains.
3. From the Domains page, select the Domain to attach the image to.
4. From the Domain details page, select the Environment tab.
5. On the Environment tab, under Custom SageMaker Studio images attached to domain, choose
Attach image.
6. For Image source, choose Existing image.
7. Choose an existing image from the list.
8. Choose a version of the image from the list.


9. Choose Next.
10. Verify the values for Image name, Image display name, and Description.
11. Choose the IAM role. For more information, see Create a custom SageMaker image (p. 172).
12. (Optional) Add tags for the image.
13. Specify the EFS mount path. This is the path within the image to mount the user's Amazon Elastic
File System (EFS) home directory.
14. For Image type, select SageMaker Studio image.
15. For Kernel name, enter the name of an existing kernel in the image. For information on how to
get the kernel information from the image, see DEVELOPMENT in the SageMaker Studio Custom
Image Samples repository. For more information, see the Kernel discovery and User data sections
of Custom SageMaker image specifications (p. 169).
16. (Optional) For Kernel display name, enter the display name for the kernel.
17. Choose Add kernel.
18. Choose Submit.

Wait for the image version to be attached to the domain. When attached, the version is
displayed in the Custom images list and briefly highlighted.

Attach the SageMaker image using the AWS CLI

The following sections demonstrate how to attach a custom SageMaker image when creating a new
domain or updating your existing domain using the AWS CLI.

Attach the SageMaker image to a new domain

The following section demonstrates how to create a new domain with the version attached. These steps
require that you specify the Amazon Virtual Private Cloud (VPC) information and execution role required
to create the domain. You perform the following steps to create the domain and attach the custom
SageMaker image:

• Get your default VPC ID and subnet IDs.


• Create the configuration file for the domain, which specifies the image.
• Create the domain with the configuration file.

To add the custom SageMaker image to your domain

1. Get your default VPC ID.

aws ec2 describe-vpcs \
    --filters Name=isDefault,Values=true \
    --query "Vpcs[0].VpcId" --output text

The response should look similar to the following.

vpc-xxxxxxxx

2. Get your default subnet IDs using the VPC ID from the previous step.

aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=<vpc-id> \
    --query "Subnets[*].SubnetId" --output json

The response should look similar to the following.


[
"subnet-b55171dd",
"subnet-8a5f99c6",
"subnet-e88d1392"
]

3. Create a configuration file named create-domain-input.json. Insert the VPC ID, subnet IDs,
ImageName, and AppImageConfigName from the previous steps. Because ImageVersionNumber
isn't specified, the latest version of the image is used, which is the only version in this case.

{
"DomainName": "domain-with-custom-image",
"VpcId": "<vpc-id>",
"SubnetIds": [
"<subnet-ids>"
],
"DefaultUserSettings": {
"ExecutionRole": "<execution-role>",
"KernelGatewayAppSettings": {
"CustomImages": [
{
"ImageName": "custom-image",
"AppImageConfigName": "custom-image-config"
}
]
}
},
"AuthMode": "IAM"
}

4. Create the domain with the attached custom SageMaker image.

aws sagemaker create-domain \
    --cli-input-json file://create-domain-input.json

The response should look similar to the following.

{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx",
"Url": "https://d-xxxxxxxxxxxx.studio.us-east-2.sagemaker.aws/..."
}

Attach the SageMaker image to your current domain

If you have onboarded to a SageMaker domain, you can attach the custom image to your current
domain. For more information about onboarding to a SageMaker domain, see Onboard to Amazon
SageMaker Domain (p. 37). You don't need to specify the VPC information and execution role when
attaching a custom image to your current domain. After you attach the version, you must delete all the
apps in your domain and reopen Studio. For information about deleting the apps, see Delete an Amazon
SageMaker Domain (p. 116).

You perform the following steps to add the SageMaker image to your current domain.

• Get your DomainID from SageMaker control panel.


• Use the DomainID to get the DefaultUserSettings for the domain.
• Add the ImageName and AppImageConfig as a CustomImage to the DefaultUserSettings.
• Update your domain to include the custom image.


To add the custom SageMaker image to your domain

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Domains.
3. From the Domains page, select the Domain to attach the image to.
4. From the Domain details page, select the Domain settings tab.
5. From the Domain settings tab, under General settings, find the DomainId. The ID is in the
following format: d-xxxxxxxxxxxx.
6. Use the domain ID to get the description of the domain.

aws sagemaker describe-domain \
    --domain-id <d-xxxxxxxxxxxx>

The response should look similar to the following.

{
"DomainId": "d-xxxxxxxxxxxx",
"DefaultUserSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
],
...
}
}
}

7. Save the default user settings section of the response to a file named default-user-
settings.json.
8. Insert the ImageName and AppImageConfigName from the previous steps as a custom image.
Because ImageVersionNumber isn't specified, the latest version of the image is used, which is the
only version in this case.

{
"DefaultUserSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
{
"ImageName": "string",
"AppImageConfigName": "string"
}
],
...
}
}
}

9. Use the domain ID and default user settings file to update your domain.

aws sagemaker update-domain \
    --domain-id <d-xxxxxxxxxxxx> \
    --cli-input-json file://default-user-settings.json

The response should look similar to the following.

{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}


Attach the SageMaker image to a shared space

You can only attach the SageMaker image to a shared space using the AWS CLI. After you attach the
version, you must delete all of the applications in your shared space and reopen Studio. For information
about deleting the apps, see Delete an Amazon SageMaker Domain (p. 116).

You perform the following steps to add the SageMaker image to a shared space.

• Get your DomainID from SageMaker control panel.


• Use the DomainID to get the DefaultSpaceSettings for the domain.
• Add the ImageName and AppImageConfig as a CustomImage to the DefaultSpaceSettings.
• Update your domain to include the custom image for the shared space.

To add the custom SageMaker image to your shared space

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Domains.
3. From the Domains page, select the Domain to attach the image to.
4. From the Domain details page, select the Domain settings tab.
5. From the Domain settings tab, under General settings, find the DomainId. The ID is in the
following format: d-xxxxxxxxxxxx.
6. Use the domain ID to get the description of the domain.

aws sagemaker describe-domain \
    --domain-id <d-xxxxxxxxxxxx>

The response should look similar to the following.

{
"DomainId": "d-xxxxxxxxxxxx",
...
"DefaultSpaceSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
],
...
}
}
}

7. Save the default space settings section of the response to a file named default-space-
settings.json.
8. Insert the ImageName and AppImageConfigName from the previous steps as a custom image.
Because ImageVersionNumber isn't specified, the latest version of the image is used, which is the
only version in this case.

{
"DefaultSpaceSettings": {
"KernelGatewayAppSettings": {
"CustomImages": [
{
"ImageName": "string",
"AppImageConfigName": "string"
}
],


...
}
}
}

9. Use the domain ID and default space settings file to update your domain.

aws sagemaker update-domain \
    --domain-id <d-xxxxxxxxxxxx> \
    --cli-input-json file://default-space-settings.json

The response should look similar to the following.

{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}

View the attached image in SageMaker

After you create the custom SageMaker image and attach it to your domain, the image appears in the
Environment tab of the Domain. Attached images for shared spaces can be viewed only with the
AWS CLI, using the following command.

aws sagemaker describe-domain \
    --domain-id <d-xxxxxxxxxxxx>
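To narrow the output to only the attached custom images, you can add a --query filter. A
minimal sketch that returns the custom images configured for shared spaces:

aws sagemaker describe-domain \
    --domain-id <d-xxxxxxxxxxxx> \
    --query "DefaultSpaceSettings.KernelGatewayAppSettings.CustomImages"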

Launch a custom SageMaker image in Amazon SageMaker Studio


After you create your custom SageMaker image and attach it to your domain or shared space, the custom
image and kernel appear in selectors in the Change environment dialog box of the Studio Launcher.

To launch and select your custom image and kernel

1. In Amazon SageMaker Studio, open the Launcher. To open the Launcher, choose Amazon
SageMaker Studio at the top left of the Studio interface or use the keyboard shortcut Ctrl +
Shift + L.

To learn about all the available ways to open the Launcher, see Use the Amazon SageMaker Studio
Launcher (p. 141)


2. In the Launcher, in the Notebooks and compute resources section, choose Change environment.
3. In the Change environment dialog, use the dropdown menus to select your Image from the Custom
Image section, and your Kernel, then choose Select.
4. In the Launcher, choose Create notebook or Open image terminal. Your notebook or terminal
launches in the selected custom image and kernel.

To change your image or kernel in an open notebook, see Change an Image or a Kernel (p. 159).
Note
If you encounter an error when launching the image, check your Amazon CloudWatch logs.
The name of the log group is /aws/sagemaker/studio. The name of the log stream is
$domainID/$userProfileName/KernelGateway/$appName.

Clean up resources
The following sections show how to clean up the resources you created in the previous sections from the
SageMaker console or AWS CLI. You perform the following steps to clean up the resources:

• Detach the image and image versions from your domain.


• Delete the image, image version, and app image config.
• Delete the container image and repository from Amazon ECR. For more information, see Deleting a
repository.

Clean up resources from the SageMaker console

The following section shows how to clean up resources from the SageMaker console.

When you detach an image from a domain, all versions of the image are detached. When an image
is detached, all users of the domain lose access to the image versions. A running notebook that has a
kernel session on a detached image version continues to run. When the notebook is stopped or the
kernel is shut down, the image version becomes unavailable.

To detach an image

1. In the Control Panel, under Custom SageMaker Studio images attached to domain, choose the
image and then choose Detach.
2. (Optional) To delete the image and all versions from SageMaker, select Also delete the selected
images .... This does not delete the associated container images from Amazon ECR.
3. Choose Detach.

Clean up resources from the AWS CLI

The following section shows how to clean up resources from the AWS CLI.

To clean up resources

1. Detach the image and image versions from your domain by passing an empty custom image list to
the domain. Open the default-user-settings.json file you created in Attach the SageMaker
image to your current domain (p. 177). To detach the image and image version from a shared
space, open the default-space-settings.json file.
2. Delete the custom images and then save the file.

"DefaultUserSettings": {

181
Amazon SageMaker Developer Guide
Customize Studio

"KernelGatewayAppSettings": {
"CustomImages": [
],
...
},
...
}

3. Use the domain ID and default user settings file to update your domain. To update your shared
space, use the default space settings file.

aws sagemaker update-domain \
    --domain-id <d-xxxxxxxxxxxx> \
    --cli-input-json file://default-user-settings.json

The response should look similar to the following.

{
"DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}

4. Delete the app image config.

aws sagemaker delete-app-image-config \
    --app-image-config-name custom-image-config

5. Delete the SageMaker image, which also deletes all image versions. The container images in ECR
that are represented by the image versions are not deleted.

aws sagemaker delete-image \
    --image-name custom-image

Use Lifecycle Configurations with Amazon SageMaker Studio


Lifecycle Configurations are shell scripts triggered by Amazon SageMaker Studio lifecycle events, such
as starting a new Studio notebook. You can use Lifecycle Configurations to automate customization for
your Studio environment. This customization includes installing custom packages, configuring notebook
extensions, preloading datasets, and setting up source code repositories.

Using Lifecycle Configurations gives you flexibility and control to configure Studio to meet your specific
needs. For example, you can create a minimal set of base container images with the most commonly
used packages and libraries, then use Lifecycle Configurations to install additional packages for specific
use cases across your data science and machine learning teams.

For example Lifecycle Configuration scripts, see the Studio Lifecycle Configuration examples GitHub
repository. For a blog on implementing Lifecycle Configurations, see Customize Amazon SageMaker
Studio using Lifecycle Configurations.
Note
Each script has a limit of 16384 characters.
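A quick way to check the size of a script before you create the configuration is shown below;
my-script.sh is a placeholder for your script file.

wc -c < my-script.sh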

Topics
• Creating and Associating a Lifecycle Configuration (p. 183)
• Setting Default Lifecycle Configurations (p. 187)
• Debugging Lifecycle Configurations (p. 189)
• Updating and deleting Lifecycle Configurations (p. 190)


Creating and Associating a Lifecycle Configuration


Amazon SageMaker Studio provides interactive applications that enable Studio's visual interface, code
authoring, and run experience. This series shows how to create a lifecycle configuration and associate it
with Amazon SageMaker Studio.

Application types can be either JupyterServer or KernelGateway.

• JupyterServer applications: This application type enables access to the visual interface for Studio.
Every user in Studio gets their own Jupyter Server application.
• KernelGateway applications: This application type enables access to the code run environment and
kernels for your Studio notebooks and terminals. For more information, see Jupyter Kernel Gateway.

For more information about Studio's architecture and Studio applications, see Use Amazon SageMaker
Studio Notebooks.

Topics
• Create a Lifecycle Configuration from the AWS CLI (p. 183)
• Create a Lifecycle Configuration from the SageMaker Console (p. 185)

Create a Lifecycle Configuration from the AWS CLI

The following topic shows how to create a lifecycle configuration using the AWS CLI to automate
customization for your Studio environment.

Prerequisites

Before you begin, complete the following prerequisites:

• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.
• Onboard to Amazon SageMaker Studio. For more information, see Onboard to Amazon SageMaker
Studio.

Step 1: Create a Lifecycle Configuration

The following procedure shows how to create a lifecycle configuration script that prints Hello World.

1. From your local machine, create a file named my-script.sh with the following content.

#!/bin/bash
set -eux
echo 'Hello World!'

2. Convert your my-script.sh file into base64 format. This requirement prevents errors that occur
from spacing and line break encoding.

LCC_CONTENT=`openssl base64 -A -in my-script.sh`

3. Create a Studio lifecycle configuration. The following command creates a lifecycle configuration that
runs when you launch an associated KernelGateway application.

aws sagemaker create-studio-lifecycle-config \
    --region region \
    --studio-lifecycle-config-name my-studio-lcc \
    --studio-lifecycle-config-content $LCC_CONTENT \
    --studio-lifecycle-config-app-type KernelGateway

Note the ARN of the newly created lifecycle configuration that is returned. This ARN is required to
attach the lifecycle configuration to your application.
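If you need to retrieve the ARN or the script content later, you can describe the lifecycle
configuration by name. A minimal sketch using the name from the previous step:

aws sagemaker describe-studio-lifecycle-config \
    --region region \
    --studio-lifecycle-config-name my-studio-lcc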

Step 2: Attach the Lifecycle Configuration to your Studio domain, user profile, or shared space

To attach the lifecycle configuration, you must update the UserSettings for your Studio domain or an
individual user profile, or the SpaceSettings for a shared space. Lifecycle configuration scripts that are
associated at the domain level are inherited by all users. However, scripts that are associated at the user
profile level are scoped to a specific user, while scripts that are associated at the shared space level are
scoped to the shared space.

The following example shows how to create a new user profile with the lifecycle configuration attached.
To update an existing user profile, use the update-user-profile command.

Add the lifecycle configuration ARN from the previous step to the settings for the appropriate AppType.
For example, place it in the JupyterServerAppSettings of the user. You can attach multiple lifecycle
configurations at the same time by specifying a list of lifecycle configuration ARNs.

# Create a new UserProfile
aws sagemaker create-user-profile --domain-id domain-id \
    --user-profile-name user-profile-name \
    --region region \
    --user-settings '{
        "JupyterServerAppSettings": {
            "LifecycleConfigArns": ["lifecycle-configuration-arn-list"]
        }
    }'
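To attach the lifecycle configuration to an existing user profile instead, a similar sketch uses
the update-user-profile command with the same --user-settings payload:

# Update an existing UserProfile
aws sagemaker update-user-profile --domain-id domain-id \
    --user-profile-name user-profile-name \
    --region region \
    --user-settings '{
        "JupyterServerAppSettings": {
            "LifecycleConfigArns": ["lifecycle-configuration-arn-list"]
        }
    }'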

The following example shows how to update an existing shared space to attach the lifecycle
configuration. The lifecycle configuration specified as part of DefaultResourceSpec indicates which
lifecycle configuration is automatically attached to new applications created in the shared space.

aws sagemaker update-space --domain-id domain-id \
    --space-name space-name \
    --region region \
    --space-settings '{
        "JupyterServerAppSettings": {
            "LifecycleConfigArns": ["lifecycle-configuration-arn-list"],
            "DefaultResourceSpec": {
                "LifecycleConfigArn": "default-lifecycle-configuration-arn"
            }
        }
    }'
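To make a lifecycle configuration available to every user in a domain, you can attach it at the
domain level instead. A minimal sketch using update-domain; the placeholders are the same as in
the preceding examples:

aws sagemaker update-domain --domain-id domain-id \
    --region region \
    --default-user-settings '{
        "JupyterServerAppSettings": {
            "LifecycleConfigArns": ["lifecycle-configuration-arn-list"]
        }
    }'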

Step 3: Launch application with Lifecycle Configuration

After you attach a lifecycle configuration to a user profile or space, the user can select it when launching
an application using the AWS CLI. This section describes how to launch an application with an attached
lifecycle configuration.

Launch the application and specify the lifecycle configuration ARN in the ResourceSpec argument of
the CreateApp API.


• The following example shows how to create a JupyterServer application. When creating the app-
type JupyterServer, the app-name must be default.

# Create a UserProfile application
aws sagemaker create-app --domain-id domain-id \
    --region region \
    --user-profile-name user-profile-name \
    --app-type JupyterServer \
    --resource-spec LifecycleConfigArn=lifecycle-configuration-arn \
    --app-name default

# Create a shared space application
aws sagemaker create-app --domain-id domain-id \
    --region region \
    --space-name space-name \
    --app-type JupyterServer \
    --resource-spec LifecycleConfigArn=lifecycle-configuration-arn \
    --app-name default

• The following example shows how to create a KernelGateway application.

aws sagemaker create-app --domain-id domain-id \
    --region region \
    --user-profile-name user-profile-name \
    --app-type KernelGateway \
    --resource-spec LifecycleConfigArn=lifecycle-configuration-arn,SageMakerImageArn=sagemaker-image-arn,InstanceType=instance-type \
    --app-name app-name

Create a Lifecycle Configuration from the SageMaker Console

The following topic shows how to create a lifecycle configuration from the Amazon SageMaker console
to automate customization for your Studio environment.

Prerequisites

Before you can begin this tutorial, complete the following prerequisite:

• Onboard to Amazon SageMaker Studio. For more information, see Onboard to Amazon SageMaker
Studio.

Step 1: Create a new Lifecycle Configuration

You can create a lifecycle configuration by entering a script from the Amazon SageMaker console.

The following procedure shows how to create a lifecycle configuration script that prints Hello World.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. From the navigation panel, under SageMaker dashboard, choose Lifecycle configurations.
3. Choose the Studio tab.
4. Choose Create configuration.
5. Under Select configuration type, select the type of application that the lifecycle configuration
should be attached to.
6. Choose Next.
7. In the section called Configuration settings, enter a name for your lifecycle configuration.
8. In the Scripts section, enter the following content.


#!/bin/bash
set -eux
echo 'Hello World!'

9. (Optional) Create a tag for your lifecycle configuration.


10. Choose Create Configuration.

Step 2: Attach the Lifecycle Configuration to Studio domain or user profile

Lifecycle configuration scripts associated at the domain level are inherited by all users. However, scripts
that are associated at the user profile level are scoped to a specific user.

The following sections show how to attach a lifecycle configuration to your domain and user profile.

Attach to Studio domain

The following shows how to attach a lifecycle configuration to your existing domain in Studio.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. From the left navigation panel, choose Domains.
3. From the list of Domains, select the Domain to attach the lifecycle configuration to.
4. From the Domain details, choose the Environment tab.
5. Under Lifecycle configurations for personal Studio apps, choose Attach.
6. Under Source, choose Existing configuration.
7. Under Studio lifecycle configurations, select the lifecycle configuration that you created in the
previous step.
8. Select Attach to domain.

Attach to your user profile

The following shows how to attach a lifecycle configuration to your existing user profile.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. From the left navigation panel, choose Domains.
3. From the list of Domains, select the Domain that contains the user profile to attach the lifecycle
configuration to.
4. Under User profiles, select the user profile.
5. From the User Details page, choose Edit.
6. On the left navigation, choose Studio settings.
7. Under Lifecycle configurations attached to user, choose Attach.
8. Under Source, choose Existing configuration.
9. Under Studio lifecycle configurations, select the lifecycle configuration that you created in the
previous step.
10. Choose Attach to user profile.

Step 3: Launch an application with the Lifecycle Configuration

After you attach a lifecycle configuration to a user profile, the user can select it when launching an
application using the Studio Launcher. The following procedure describes how to launch an application
with an attached lifecycle configuration.


1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Launch the Studio Domain. For more information, see Use the Amazon SageMaker Studio
Launcher (p. 141).
3. In the launcher, navigate to the Notebooks and compute resources section.
4. Click the Change environment button.
5. On the Change environment dialog, use the dropdown menus to select your Image, Kernel,
Instance type, and a Start-up script. If there is no default lifecycle configuration, the Start-up
script value defaults to No script. Otherwise, the Start-up script value is your default lifecycle
configuration. After you select a lifecycle configuration, you can view the entire script.
6. Click Select.
7. Back in the Launcher, choose Create notebook to launch a new notebook kernel with your selected
image and lifecycle configuration.

Step 4: View logs for a Lifecycle Configuration

You can view the logs for your lifecycle configuration after it has been attached to a Studio domain or
user profile.

1. First, provide access to CloudWatch for your AWS Identity and Access Management (IAM) role. Add
read permissions for the log group /aws/sagemaker/studio and for the log stream
<Domain>/<UserProfile>/<AppType>/<AppName>/LifecycleConfigOnStart; a sample policy statement
follows this procedure. For information about adding permissions, see Enabling logging from certain
AWS services.
2. From within Studio, navigate to the Running Terminals and Kernels icon to monitor your
lifecycle configuration.
3. Select an application from the list of running applications. Applications with attached lifecycle
configurations have an attached indicator icon.


4. Select the indicator icon for your application. This opens a new panel that lists the lifecycle
configuration.
5. From the new panel, select View logs. This opens a new tab that displays the logs.
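
For step 1, a minimal inline policy granting the required read access might look like the following
sketch. The role name and policy name are placeholders, and you may want to scope the Resource down
to your own account and Region:

aws iam put-role-policy --role-name <YOUR-IAM-ROLE-NAME> \
--policy-name StudioLifecycleLogsRead \
--policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["logs:DescribeLogStreams", "logs:GetLogEvents"],
        "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/studio:*"
    }]
}'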

Setting Default Lifecycle Configurations


To set a Lifecycle Configuration as the default for your Domain or UserProfile programmatically, you can
create a new resource or update an existing resource. To associate a Lifecycle Configuration as a default,
you'll first need to create a Lifecycle Configuration following the steps in Creating and Associating
a Lifecycle Configuration (p. 183). Default Lifecycle Configurations set up at the domain level are
inherited by all users, while those set up at the user level are scoped to a specific user.
Note
User level defaults override defaults set up at the domain level.

To set up a default Lifecycle Configuration, it must be added to the DefaultResourceSpec of the
appropriate app type. The behavior of your Lifecycle Configuration depends on whether it is added to
the DefaultResourceSpec of a JupyterServer or KernelGateway app.

• JupyterServer apps: When added to the DefaultResourceSpec of a JupyterServer app, the default
Lifecycle Configuration script runs automatically when the user logs into Studio for the first time
or restarts Studio. This can be used to automate one-time set-up actions for the Studio developer
environment, such as installing notebook extensions or setting up a GitHub repo. For an example of
this, see Customize Amazon SageMaker Studio using Lifecycle Configurations.


• KernelGateway apps: When added to the DefaultResourceSpec of a KernelGateway app, Studio


defaults to selecting the Lifecycle Configuration script from the Studio launcher. Users can launch a
notebook or terminal with the default script selected or they can select a different one from the list of
Lifecycle Configurations.

Note
A default KernelGateway Lifecycle Configuration specified in DefaultResourceSpec applies
to all KernelGateway images in the Studio Domain unless the user selects a different script from
the list presented in the Studio launcher. The default script also runs if No Script is selected
by the user. For more information on selecting a script, see Step 3: Launch an application with
the Lifecycle Configuration (p. 186).

Associate a default Lifecycle Configuration when creating a new Domain or UserProfile

To associate a Lifecycle Configuration when creating a new Studio Domain or UserProfile, you need the
ARN of the Lifecycle Configuration that you created. This ARN is passed to one of the following API calls:

• create-user-profile
• create-domain

For example, the following API call creates a new UserProfile with an associated Lifecycle Configuration.

aws sagemaker create-user-profile --domain-id <DOMAIN-ID> \
--user-profile-name <USER-PROFILE-NAME> \
--region <REGION> \
--user-settings '{
"KernelGatewayAppSettings": {
"DefaultResourceSpec": {
"InstanceType": "ml.t3.medium",
"LifecycleConfigArn": "<LIFECYCLE-CONFIGURATION-ARN>"
}
}
}'
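
Similarly, you can set a default Lifecycle Configuration for every user at Domain creation time by
placing the ARN in the default user settings of a create-domain call. The following is a sketch; the
domain name, execution role, VPC, and subnet values are placeholders that depend on your environment:

aws sagemaker create-domain --domain-name <DOMAIN-NAME> \
--region <REGION> \
--auth-mode IAM \
--vpc-id <VPC-ID> \
--subnet-ids <SUBNET-ID> \
--default-user-settings '{
    "ExecutionRole": "<EXECUTION-ROLE-ARN>",
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
            "LifecycleConfigArn": "<LIFECYCLE-CONFIGURATION-ARN>"
        }
    }
}'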

Associate a default Lifecycle Configuration when updating a Domain or UserProfile

To associate a Lifecycle Configuration when updating an existing Studio Domain or UserProfile, you need
the ARN of the Lifecycle Configuration that you created. This ARN is passed to one of the following API
calls:

• update-user-profile
• update-domain

The Lifecycle Configuration ARN should be placed in two places: the DefaultResourceSpec and the
LifecycleConfigArns list in KernelGatewayAppSettings. For example, the following API call
updates a UserProfile with an associated Lifecycle Configuration in both places.

aws sagemaker update-user-profile --domain-id <DOMAIN-ID> \
--user-profile-name <USER-PROFILE-NAME> \
--region <REGION> \
--user-settings '{
    "KernelGatewayAppSettings": {
        "DefaultResourceSpec": {
            "InstanceType": "ml.t3.medium",
            "LifecycleConfigArn": "<LIFECYCLE-CONFIGURATION-ARN>"
        },
        "LifecycleConfigArns": ["<LIFECYCLE-CONFIGURATION-ARN>"]
    }
}'

Debugging Lifecycle Configurations


The following topics show how to get information about and debug your Lifecycle Configurations.

Topics
• Verify Lifecycle Configuration Process from Amazon CloudWatch Logs (p. 189)
• JupyterServer App failure (p. 189)
• KernelGateway App failure (p. 190)
• Lifecycle Config timeout (p. 190)

Verify Lifecycle Configuration Process from Amazon CloudWatch Logs

Lifecycle Configurations only log STDOUT and STDERR. STDOUT is the default output for bash scripts,
while STDERR can be written to by appending >&2 to the end of a bash command. For example, echo
'hello'>&2. Logs for your Lifecycle Configurations are published to your AWS Account via CloudWatch.
These logs can be found in the /aws/sagemaker/studio log group in the AWS CloudWatch console.

1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.


2. Select Logs from the left side. From the dropdown menu, select Log Groups.
3. On the Log Groups screen, search for aws/sagemaker/studio. Select the log group.
4. On the aws/sagemaker/studio Log Group screen, navigate to the Log Streams tab.
5. To find the logs for a specific app, search Log Streams using the following format:

<DomainId>/<UserProfileName>/<AppType>/<AppName>

For example, to find the Lifecycle Configuration logs for Domain d-m85lcu8vbqmz, UserProfile
i-sonic-js, AppType JupyterServer, and AppName test-lcc-echo, use the following search
string:

d-m85lcu8vbqmz/i-sonic-js/JupyterServer/test-lcc-echo

6. Select the log stream appended with LifecycleConfigOnStart to view the script execution logs.

JupyterServer App failure

If your JupyterServer App crashes because of an issue with the attached Lifecycle Configuration, Studio
displays the following error message on the Studio startup screen.

Failed to create SageMaker Studio due to start-up script failure

Click the View script logs link to view the CloudWatch logs for your JupyterServer app.

In the case where the faulty Lifecycle Configuration is specified in the DefaultResourceSpec of your
Studio Domain or UserProfile, Studio continues to use the Lifecycle Configuration even after restarting
Studio.

To resolve this error, follow the steps in Setting Default Lifecycle Configurations (p. 187) to remove the
Lifecycle Configuration script from the DefaultResourceSpec or select another script using the AWS
CLI. Then launch a new JupyterServer app.
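
For example, the following update-user-profile call is a sketch of removing the default script from a
user profile by submitting JupyterServer settings that carry no Lifecycle Configuration. Whether the
empty LifecycleConfigArns list is also needed depends on how the default was originally associated:

aws sagemaker update-user-profile --domain-id <DOMAIN-ID> \
--user-profile-name <USER-PROFILE-NAME> \
--region <REGION> \
--user-settings '{
    "JupyterServerAppSettings": {
        "DefaultResourceSpec": {"InstanceType": "system"},
        "LifecycleConfigArns": []
    }
}'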


KernelGateway App failure

If your KernelGateway App crashes because of an issue with the attached Lifecycle Configuration, Studio
displays an error message in your Studio notebook.

Click the View script logs link to view the CloudWatch logs for your KernelGateway app.

In this case, your Lifecycle Configuration is specified in the Studio Launcher when launching a new Studio
Notebook.

To resolve this error, use the Studio launcher to select a different Lifecycle Configuration or select No
script.
Note
A default KernelGateway Lifecycle Configuration specified in DefaultResourceSpec applies
to all KernelGateway images in the Studio Domain unless the user selects a different script from
the list presented in the Studio launcher. The default script also runs if No Script is selected
by the user. For more information on selecting a script, see Step 3: Launch an application with
the Lifecycle Configuration (p. 186).

Lifecycle Config timeout

There is a Lifecycle Configuration timeout limitation of 5 minutes. If a Lifecycle Configuration script takes
longer than 5 minutes to run, Studio throws an error.

To resolve this error, ensure that your Lifecycle Configuration script completes in less than 5 minutes.

To help decrease the run time of scripts, try the following:

• Reduce the number of steps in the script. For example, limit which conda environments large packages
are installed in.
• Run tasks in parallel processes.
• Use the nohup command in your script to detach long-running commands, as in the sketch that
follows this list.
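
For example, the following is a minimal sketch of a lifecycle configuration script that moves a slow
installation into a background process so the script itself returns within the limit. The package name
is a placeholder, and the installation continues after the script exits:

#!/bin/bash
set -eux
# Run the long installation in the background, detached from the
# lifecycle configuration process, so the 5-minute limit is not exceeded.
nohup pip install <LARGE-PACKAGE> > /tmp/install-output.log 2>&1 &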

Updating and deleting Lifecycle Configurations


A Lifecycle Configuration script cannot be changed after it has been created. To update your script,
you must create a new Lifecycle Configuration script and use the update-domain and
update-user-profile APIs to attach the Lifecycle Configuration script to the respective Domain or
UserProfile. For more information, see Creating and Associating a Lifecycle Configuration (p. 183).

To delete an existing Lifecycle Configuration, use the DeleteStudioLifecycleConfig API. To successfully
delete a Lifecycle Configuration, no running Apps can be using it.
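
For example, the following AWS CLI command is a sketch of deleting a Lifecycle Configuration by name,
assuming no running apps still reference it:

aws sagemaker delete-studio-lifecycle-config \
--studio-lifecycle-config-name <LIFECYCLE-CONFIGURATION-NAME> \
--region <REGION>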

Attach Suggested Git Repos to Studio


Amazon SageMaker Studio offers a Git extension for you to enter the URL of a Git repository (repo),
clone it into your environment, push changes, and view commit history. In addition to this Git
extension, you can also attach suggested Git repository URLs at the Amazon SageMaker Domain or user
profile level. Then, you can select the repo URL from the list of suggestions and clone that into your
environment using the Git extension in Studio.

The following topics show how to attach Git repo URLs to a Domain or user profile from the AWS CLI and
SageMaker console. You'll also learn how to detach these repository URLs.

Topics


• Attach a Git Repository from the AWS CLI (p. 191)


• Attach a Git Repository from the SageMaker Console (p. 191)
• Detach Git Repos (p. 192)

Attach a Git Repository from the AWS CLI


The following topic shows how to attach a Git repository URL using the AWS CLI, so that Amazon
SageMaker Studio automatically suggests it for cloning. After you attach the Git repository URL, you can
clone it by following the steps in Clone a Git Repository in SageMaker Studio (p. 194).

Prerequisites

Before you begin, complete the following prerequisites:

• Update the AWS CLI by following the steps in Installing the current AWS CLI Version.
• From your local machine, run aws configure and provide your AWS credentials. For information
about AWS credentials, see Understanding and getting your AWS credentials.
• Onboard to Amazon SageMaker Domain. For more information, see Onboard to Amazon SageMaker
Domain (p. 37).

Attach the Git repo to a Domain or user profile

Git repo URLs associated at the Domain level are inherited by all users. However, Git repo URLs that are
associated at the user profile level are scoped to a specific user. You can attach multiple Git repo URLs to
a Domain or user profile by passing a list of repository URLs, as shown in the sketch after the following
examples.

The following sections show how to attach a Git repo URL to your Domain and user profile.

Attach to a Domain

The following command attaches a Git repo URL to an existing Domain.

aws sagemaker update-domain --region region --domain-id domain-id \
--default-user-settings JupyterServerAppSettings={CodeRepositories=[{RepositoryUrl="repository"}]}

Attach to a user profile

The following shows how to attach a Git repo URL to an existing user profile.

aws sagemaker update-user-profile --domain-id domain-id --user-profile-name user-name \
--user-settings JupyterServerAppSettings={CodeRepositories=[{RepositoryUrl="repository"}]}
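
To attach more than one Git repo URL, pass multiple entries in the CodeRepositories list. The following
is a sketch with placeholder repository URLs:

aws sagemaker update-user-profile --domain-id domain-id \
--user-profile-name user-name \
--user-settings '{
    "JupyterServerAppSettings": {
        "CodeRepositories": [
            {"RepositoryUrl": "https://github.com/<ORG>/<FIRST-REPO>.git"},
            {"RepositoryUrl": "https://github.com/<ORG>/<SECOND-REPO>.git"}
        ]
    }
}'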

Attach a Git Repository from the SageMaker Console


The following topic shows how to associate a Git repository URL from the Amazon SageMaker console
so that you can clone it in your Studio environment. After you associate the Git repository URL, you can
clone it by following the steps in Clone a Git Repository in SageMaker Studio (p. 194).

Prerequisites

Before you can begin this tutorial, you must onboard to Amazon SageMaker Domain. For more
information, see Onboard to Amazon SageMaker Domain (p. 37).


Attach the Git repo to a Domain or user profile

Git repo URLs associated at the Domain level are inherited by all users. However, Git repo URLs that are
associated at the user profile level are scoped to a specific user.

The following sections show how to attach a Git repo URL to a Domain and user profile.

Attach to a Domain

To attach a Git repo URL to an existing Domain

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation panel, choose Domains.
3. Select the Domain to attach the Git repo to.
4. On the Domain details page, choose the Environment tab.
5. On the Suggested code repositories for the domain tab, choose Attach.
6. Under Source, enter the Git repository URL.
7. Select Attach to domain.

Attach to a user profile

The following shows how to attach a Git repository URL to an existing user profile.

To attach a Git repository URL to a user profile

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation panel, choose Domains.
3. Select the Domain that includes the user profile to attach the Git repo to.
4. On the Domain details page, choose the User profiles tab.
5. Select the user profile to attach the Git repo URL to.
6. On the User details page, choose Edit.
7. On the Studio settings page, choose Attach from the Suggested code repositories for the user
section.
8. Under Source, enter the Git repository URL.
9. Choose Attach to user.

Detach Git Repos


This guide shows how to detach Git repository URLs from an Amazon SageMaker Domain or user profile
using the AWS CLI or Amazon SageMaker console.

Topics
• Detach a Git repo using the AWS CLI (p. 192)
• Detach the Git repo using the SageMaker console (p. 193)

Detach a Git repo using the AWS CLI

To detach all Git repo URLs from a Domain or user profile, you must pass an empty list of code
repositories. This list is passed as part of the JupyterServerAppSettings parameter in an update-
domain or update-user-profile command. To detach only one Git repo URL, pass the code
repositories list without the Git repo URL that you want to remove, as shown in the sketch after the
following examples. This section shows how to detach all Git repo URLs
from your Domain or user profile using the AWS Command Line Interface (AWS CLI).

Detach from a Domain

The following command detaches all Git repo URLs from a Domain.

aws sagemaker update-domain --region region --domain-id domain-id \
--default-user-settings JupyterServerAppSettings={CodeRepositories=[]}

Detach from a user profile

The following command detaches all Git repo URLs from a user profile.

aws sagemaker update-user-profile --domain-id domain-id --user-profile-name user-name \
--user-settings JupyterServerAppSettings={CodeRepositories=[]}
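
For example, if a user profile has two attached repositories and you want to detach only one of them,
pass the list that still contains the repository to keep. The following is a sketch with a placeholder URL:

aws sagemaker update-user-profile --domain-id domain-id \
--user-profile-name user-name \
--user-settings '{
    "JupyterServerAppSettings": {
        "CodeRepositories": [
            {"RepositoryUrl": "https://github.com/<ORG>/<REPO-TO-KEEP>.git"}
        ]
    }
}'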

Detach the Git repo using the SageMaker console

The following sections show how to detach a Git repo URL from a Domain or user profile using the
SageMaker console.

Detach from a Domain

Use the following steps to detach a Git repo URL from an existing Domain.

To detach a Git repo URL from an existing Domain

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation panel, choose Domains.
3. Select the Domain with the Git repo URL that you want to detach.
4. On the Domain details page, choose the Environment tab.
5. On the Suggested code repositories for the domain tab, select the Git repository URL to detach.
6. Choose Detach.
7. From the new window, choose Detach.

Detach from a user profile

Use the following steps to detach a Git repo URL from a user profile.

To detach a Git repo URL from a user profile

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation panel, choose Domains.
3. Select the Domain that includes the user profile with the Git repo URL that you want to detach.
4. On the Domain details page, choose the User profiles tab.
5. Select the user profile with the Git repo URL that you want to detach.
6. On the User details page, choose Edit.
7. On the Studio settings page, select the Git repo URL to detach from the Suggested code
repositories for the user tab.
8. Choose Detach.


9. From the new window, choose Detach.

Perform Common Tasks in Amazon SageMaker Studio
The following sections describe how to perform common tasks in Amazon SageMaker Studio. For an
overview of the Studio interface, see Amazon SageMaker Studio UI Overview (p. 129).

Topics
• Upload Files to SageMaker Studio (p. 194)
• Clone a Git Repository in SageMaker Studio (p. 194)
• Stop a Training Job in SageMaker Studio (p. 195)
• Use TensorBoard in Amazon SageMaker Studio (p. 195)
• Using CodeWhisperer and CodeGuru extensions with SageMaker (p. 197)
• Manage Your Amazon EFS Storage Volume in SageMaker Studio (p. 198)
• Provide Feedback on SageMaker Studio (p. 198)
• Shut Down and Update SageMaker Studio and Studio Apps (p. 198)

Upload Files to SageMaker Studio


When you onboard to Amazon SageMaker Studio, a home directory is created for you in the Amazon
Elastic File System (Amazon EFS) volume that was created for your team. Studio can only open files that
have been uploaded to your directory. The Studio file browser maps to your home directory.
Note
Studio does not support uploading folders. Only individual files can be uploaded.

To upload files to your home directory

1. In the left sidebar, choose the File Browser icon.
2. In the file browser, choose the Upload Files icon.
3. Select the files you want to upload and then choose Open.
4. Double-click a file to open the file in a new tab in Studio.

Clone a Git Repository in SageMaker Studio


Amazon SageMaker Studio can connect only to a local repository. In this example, you clone the aws/
amazon-sagemaker-examples GitHub repository (repo).

To clone the repo

1. In the left sidebar, choose the File Browser icon.
2. Choose the root folder or the folder you want to clone the repo into.
3. In the left sidebar, choose the Git icon.


4. Choose Clone a Repo.


5. In the Clone a Repository window, enter the URI for the SageMaker examples repo
https://github.com/aws/amazon-sagemaker-examples.git, or select a repository from the list of
Suggested repositories.
6. Choose Clone.
7. If the repo requires credentials, you are prompted to enter your username and personal access
token.
8. Wait for the download to finish. After the repo has been cloned, the File Browser opens to display
the cloned repo.
9. Double-click the repo to open it.
10. Choose the Git icon to view the Git user interface, which now tracks the examples repo.
11. To track a different repo, open the repo in the file browser and then choose the Git icon.

Stop a Training Job in SageMaker Studio


You can stop a training job with the Amazon SageMaker Studio UI. When you stop a training job, its
status changes to Stopping, at which time billing ceases. An algorithm can delay termination in order
to save model artifacts, after which the job status changes to Stopped. For more information, see the
stop_training_job method in the AWS SDK for Python (Boto3).
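
You can also stop a training job outside of the Studio UI. For example, the equivalent AWS CLI call is
a one-liner, assuming you know the training job name:

aws sagemaker stop-training-job --training-job-name <TRAINING-JOB-NAME>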

To stop a training job

1. Follow the View, search, and compare experiment runs (p. 1592) procedure on this page until you
open the Describe Trial Component tab.
2. At the upper-right side of the tab, choose Stop training job. The Status at the top left of the tab
changes to Stopped.
3. To view the training time and billing time, choose AWS Settings.

Use TensorBoard in Amazon SageMaker Studio


The following topic outlines how to install and run TensorBoard in Amazon SageMaker Studio.
Note
This guide shows how to open the TensorBoard application through a SageMaker Studio
notebook server of an individual SageMaker Domain user profile. For a more comprehensive
TensorBoard experience integrated with SageMaker Training and the access control
functionalities of SageMaker Domain, see Use TensorBoard to Debug and Analyze Training Jobs
in Amazon SageMaker (p. 2146).

Prerequisites
This tutorial requires an Amazon SageMaker Studio Domain. For more information, see Onboard to
Amazon SageMaker Domain (p. 37).

Set Up TensorBoardCallback
1. Launch Studio, and open the Launcher. For more information, see Use the Amazon SageMaker
Studio Launcher (p. 141).
2. In the Amazon SageMaker Studio Launcher, under Notebooks and compute resources, choose
the Change environment button.
3. On the Change environment dialog, use the dropdown menus to select the TensorFlow 2.3
Python 3.7 (optimized for CPU) Studio image.


4. Back in the Launcher, choose the Create notebook tile. Your notebook launches and opens in a new
Studio tab.
5. Run the following code in your notebook cells.
6. Import the required packages.

import os
import datetime
import tensorflow as tf

7. Create a Keras model.

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()


x_train, x_test = x_train / 255.0, x_test / 255.0

def create_model():
return tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation='softmax')
])

8. Create a directory for your TensorBoard logs

LOG_DIR = os.path.join(os.getcwd(), "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

9. Run training with TensorBoard.

model = create_model()
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR,
histogram_freq=1)

model.fit(x=x_train,
y=y_train,
epochs=5,
validation_data=(x_test, y_test),
callbacks=[tensorboard_callback])

10. Generate the EFS path for the TensorBoard logs. You use this path to set up your logs from the
terminal.

EFS_PATH_LOG_DIR = "/".join(LOG_DIR.strip("/").split('/')[1:-1])
print (EFS_PATH_LOG_DIR)

Retrieve the EFS_PATH_LOG_DIR. You will need it in the TensorBoard installation section.

Install TensorBoard
1. Click on the Amazon SageMaker Studio button on the top left corner of Studio to open
the Amazon SageMaker Studio Launcher. This launcher must be opened from your root directory.
For more information, see Use the Amazon SageMaker Studio Launcher (p. 141).


2. In the Launcher, under Utilities and files, click System terminal.


3. From the terminal, run the following commands. Copy EFS_PATH_LOG_DIR from the Jupyter
notebook. You must run this from the /home/sagemaker-user root directory.

pip install tensorboard


tensorboard --logdir <EFS_PATH_LOG_DIR>

Launch TensorBoard
1. To launch TensorBoard, copy your Studio URL and replace lab? with proxy/6006/ as follows. You
must include the trailing / character.

https://<YOUR_URL>.studio.region.sagemaker.aws/jupyter/default/proxy/6006/

2. Navigate to the URL to examine your results.

Using CodeWhisperer and CodeGuru extensions with SageMaker


Amazon SageMaker Studio is an integrated machine learning environment where you can build, train,
deploy, and analyze your models all in the same application. This topic shows how to generate code
recommendations and suggest improvements related to code issues by using Amazon CodeWhisperer
and Amazon CodeGuru with Amazon SageMaker.

The following extensions support writing code by generating code recommendations and suggesting
improvements related to code issues:

• Amazon CodeWhisperer
• Amazon CodeGuru

What is Amazon CodeWhisperer?


Amazon CodeWhisperer is a service powered by machine learning that helps improve developer
productivity. CodeWhisperer achieves this by generating code recommendations based on developers’
comments in natural language and their code in the IDE. During preview, Amazon CodeWhisperer is
available for the Java, JavaScript, Python, C# and TypeScript programming languages. The service
integrates with JupyterLab, Amazon SageMaker Studio, Amazon SageMaker notebook instances, and
other integrated development environments (IDEs).

For more information, see Setting up CodeWhisperer with Amazon SageMaker Studio.

What is Amazon CodeGuru?


Amazon CodeGuru Security uses automated reasoning and machine learning informed by AWS security
best practices. CodeGuru Security automatically creates comprehensive security policies, detects security
vulnerabilities in your code, and suggests quality improvements. Together, these recommendations can
help you create and deploy secure applications.

CodeGuru Security improves the security of your code in the following ways:

• Proactively detects security policy violations and vulnerabilities.


• Provides recommendations for addressing security risks.
• Suggests improvements to inefficient methods.


From SageMaker, you can call CodeGuru Security by using the open-source Jupyter plugin. You can use
CodeGuru Security to scan notebooks for a variety of issues that can affect the security, correctness,
reproducibility, maintainability, and performance of your code. For more information, see Get started
with the Amazon CodeGuru Extension for JupyterLab and SageMaker Studio.

Manage Your Amazon EFS Storage Volume in SageMaker Studio


The first time a user on your team onboards to Amazon SageMaker Studio, Amazon SageMaker creates
an Amazon Elastic File System (Amazon EFS) volume for the team. A home directory is created in the
volume for each user who onboards to Studio as part of your team. Notebook files and data files are
stored in these directories. Users don't have access to other team members' home directories. Amazon
SageMaker Domain does not support mounting custom or additional Amazon EFS volumes.
Important
Don't delete the Amazon EFS volume. If you delete it, the domain will no longer function and all
of your users will lose their work.

To find your Amazon EFS volume

1. Open the SageMaker console.


2. Choose Control Panel at the top left of the page.
3. From the Control Panel, under Domain, find the Domain ID. The ID will be in the following format:
d-xxxxxxxxxxxx.
4. Pass the Domain ID, as DomainId, to the describe_domain method.
5. In the response from describe_domain, note the value for the HomeEfsFileSystemId key. This
is the Amazon EFS file system ID.
6. Open the Amazon EFS console. Make sure the AWS Region is the same Region that's used by Studio.
7. Under File systems, choose the file system ID from the previous step.
8. To verify that you've chosen the correct file system, select the Tags heading. The value
corresponding to the ManagedByAmazonSageMakerResource key should match the Studio ID.

For information on how to access the Amazon EFS volume, see Using file systems in Amazon EFS.

To delete the Amazon EFS volume, see Deleting an Amazon EFS file system.
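
Steps 3 through 5 of this procedure can also be performed with a single AWS CLI call. The following is a
sketch; replace the domain ID with your own:

aws sagemaker describe-domain --domain-id <DOMAIN-ID> \
--query HomeEfsFileSystemId --output text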

Provide Feedback on SageMaker Studio


Amazon SageMaker takes your feedback seriously. We encourage you to provide feedback.

To provide feedback

1. At the right of SageMaker Studio, find the Feedback icon.


2. Choose a smiley emoji to let us know how satisfied you are with SageMaker Studio and add any
feedback you'd care to share with us.
3. Decide whether to share your identity with us, then choose Submit.

Shut Down and Update SageMaker Studio and Studio Apps


The following topics show how to shut down and update SageMaker Studio and Studio Apps.


Amazon SageMaker does not update Amazon SageMaker Studio apps while they are in service.

Studio provides a notification icon in the upper-right corner of the Studio UI. This notification icon
displays the number of unread notices. To read the notices, select the icon.

Studio provides two types of notifications:

• Upgrade – Displayed when Studio or one of the Studio apps has released a new version. To update
Studio, see Shut down and Update SageMaker Studio (p. 199). To update Studio apps, see Shut down
and Update Studio Apps (p. 200).
• Information – Displayed for new features and other information.

To reset the notification icon, you must select the link in each notice. Notifications that you have
already read may still display in the icon; this does not mean that updates are still needed after you
have updated Studio and Studio apps.

To learn how to update Amazon SageMaker Data Wrangler, see Shut down and Update Studio
Apps (p. 200).

To ensure that you have the most recent software updates, update Amazon SageMaker Studio and your
Studio apps using the methods outlined in the following topics.

Topics
• Shut down and Update SageMaker Studio (p. 199)
• Shut down and Update Studio Apps (p. 200)

Shut down and Update SageMaker Studio


To update Amazon SageMaker Studio to the latest release, you must shut down the JupyterServer app.
You can shut down the JupyterServer app from the SageMaker console or from within Studio. After the
JupyterServer app is shut down, you must reopen Studio through the SageMaker console, which creates
a new version of the JupyterServer app.

Any unsaved notebook information is lost in the process. The user data in the Amazon EFS volume isn't
impacted.

Some of the services within Studio, like Data Wrangler, run on their own app. To update these
services, you must delete the app for that service. To learn more, see Shut down and Update Studio
Apps (p. 200).
Note
A JupyterServer app is associated with a single Studio user. When you update the app for one
user it doesn't affect other users.

The following topic shows how to update the JupyterServer App from the SageMaker console or from
inside Studio.

To update the JupyterServer app from the SageMaker console

1. Navigate to https://console.aws.amazon.com/sagemaker/.
2. Choose Domains.
3. Select the Domain that includes the Studio application that you want to update.
4. Under User profiles, select your user name.
5. Under Apps, in the row displaying JupyterServer, choose Action, then choose Delete.


6. Choose Yes, delete app.


7. Type delete in the confirmation box.
8. Choose Delete.
9. After the app has been deleted, launch a new Studio app to get the latest version.

To update the JupyterServer app from inside Studio

1. Launch Studio.
2. On the top menu, choose File then Shut Down.
3. Choose one of the following options:

• Shutdown Server – Shuts down the JupyterServer app. Terminal sessions, kernel sessions,
SageMaker images, and instances aren't shut down. These resources continue to accrue charges.
• Shutdown All – Shuts down all apps, terminal sessions, kernel sessions, SageMaker images, and
instances. These resources no longer accrue charges.
4. Close the window.
5. After the app has been deleted, launch a new Studio app to use the latest version.

Shut down and Update Studio Apps


To update an Amazon SageMaker Studio app to the latest release, you must first shut down the
corresponding KernelGateway app from the SageMaker console. After the KernelGateway app is shut
down, you must reopen it through SageMaker Studio by running a new kernel. The kernel automatically
updates. Any unsaved notebook information is lost in the process. The user data in the Amazon EFS
volume isn't impacted.
Note
A KernelGateway app is associated with a single Studio user. When you update the app for one
user, it doesn't affect other users.

To update the KernelGateway app

1. Navigate to https://console.aws.amazon.com/sagemaker/.
2. Choose Domains.
3. Select the Domain that includes the application that you want to update.
4. Under User profiles, select your user name.
5. Under Apps, in the row displaying the App name, choose Action, then choose Delete.

To update Data Wrangler, delete the app that starts with sagemaker-data-wrang.
6. Choose Yes, delete app.
7. Type delete in the confirmation box.
8. Choose Delete.
9. After the app has been deleted, launch a new kernel from within Studio to use the latest version.

Amazon SageMaker Studio Pricing


When the first member of your team onboards to Amazon SageMaker Studio, Amazon SageMaker
creates an Amazon Elastic File System (Amazon EFS) volume for the team. In the SageMaker Control
Panel, when the Studio Status displays as Ready, the Amazon EFS volume has been created.


When this member, or any member of the team, opens Studio, a home directory is created in the
volume for the member. A storage charge is incurred for this directory. Subsequently, additional storage
charges are incurred for the notebooks and data files stored in the member's home directory. For pricing
information on Amazon EFS, see Amazon EFS Pricing.

Additional costs are incurred when other operations are run inside Studio, for example, running a
notebook, running training jobs, and hosting a model.

For information on the costs associated with using Studio notebooks, see Usage Metering (p. 161).

For information about billing along with pricing examples, see Amazon SageMaker Pricing.

Troubleshooting Amazon SageMaker Studio


This topic describes how to troubleshoot common Amazon SageMaker Studio issues that might occur
during setup and use. Each error is followed by its solution.

Studio application issues


The following issues occur when launching and using the Studio application.

• Screen not loading: Clearing workspace and waiting doesn't help

When launching the Studio application, a pop-up displays the following message. No matter which
option is selected, Studio does not load.

Loading...
The loading screen is taking a long time. Would you like to clear the workspace or keep
waiting?

The Studio application can have a launch delay if multiple tabs are open in the Studio workspace
or a large number of files are stored on Amazon EFS. This pop-up should disappear a few seconds
after the Studio workspace is ready.

If you continue to see a loading screen with a spinner after selecting either of the options, there could
be connectivity issues with the Amazon Virtual Private Cloud used by Studio.

To resolve connectivity issues with the Amazon Virtual Private Cloud (Amazon VPC) used by Studio,
verify the following networking configurations:
• If your domain is set up in VpcOnly mode: Verify that there is an Amazon VPC endpoint for AWS
STS, or a NAT Gateway for outbound traffic, including traffic over the internet. To do this, follow the
steps in Connect SageMaker Studio Notebooks in a VPC to External Resources (p. 3209).
• If your Amazon VPC is set up with a custom DNS instead of the DNS provided by Amazon: Verify that
the routes are configured using Dynamic Host Configuration Protocol (DHCP) for each Amazon VPC
endpoint added to the Amazon VPC used by Studio. For more information about setting default and
custom DHCP option sets, see DHCP option sets in Amazon VPC.
• Internal Failure when launching Studio

When launching Studio, you are unable to view the Studio UI. You also see an error similar to the
following, with Internal Failure as the error detail.

Amazon SageMaker Studio


The JupyterServer app default encountered a problem and was stopped.

This error can be caused by multiple factors. If completing these steps does not resolve your issue,
contact AWS Support at https://aws.amazon.com/premiumsupport/.


• Missing Amazon EFS mount target: Studio uses Amazon EFS for storage. The Amazon EFS volume
needs a mount target for each subnet that the Amazon SageMaker domain is created in. If this
Amazon EFS mount target is deleted accidentally, the Studio application cannot load because it
cannot mount the user’s file directory. To resolve this issue, complete the following steps.

To verify or create mount targets.

1. Find the Amazon EFS volume that is associated with the domain by using the DescribeDomain
API call.
2. Sign in to the AWS Management Console and open the Amazon EFS console at
https://console.aws.amazon.com/efs/.
3. From the list of Amazon EFS volumes, select the Amazon EFS volume that is associated with the
domain.
4. On the Amazon EFS details page, select the Network tab. Verify that there are mount targets
for all of the subnets that the domain is set up in.
5. If mount targets are missing, add the missing Amazon EFS mount targets. For instructions, see
Creating and managing mount targets and security groups.
6. After the missing mount targets are created, launch the Studio application.
• Conflicting files in the user’s .local folder: If you're using JupyterLab version 1 on Studio,
conflicting libraries in your .local folder can cause issues when launching the Studio application.
To resolve this, update your user profile's default JupyterLab version to JupyterLab 3.0.
For more information about viewing and updating the JupyterLab version, see JupyterLab
Versioning (p. 135).
• ConfigurationError: LifecycleConfig when launching Studio

You can't view the Studio UI when launching Studio. This is caused by issues with the default lifecycle
configuration script attached to the domain.

To resolve lifecycle configuration issues

1. View the Amazon CloudWatch Logs for the lifecycle configuration to trace the command that
caused the failure. To view the log, follow the steps in Verify Lifecycle Configuration Process from
Amazon CloudWatch Logs (p. 189).
2. Detach the default script from the user profile or domain. For more information, see Updating and
deleting Lifecycle Configurations (p. 190).
3. Launch the Studio application.
4. Debug your lifecycle configuration script. You can run the lifecycle configuration script from the
system terminal to troubleshoot. When the script runs successfully from the terminal, you can
attach the script to the user profile or the domain.
• SageMaker Studio core functionalities are not available.

If you get this error message when opening Studio, it may be due to Python package version conflicts.
This occurs if you used the following commands in a notebook or terminal to install Python packages
that have version conflicts with SageMaker package dependencies.

!pip install

pip install --user

To resolve this issue, complete the following steps:


1. Uninstall recently installed Python packages. If you’re not sure which package to uninstall, contact
AWS Support at https://aws.amazon.com/premiumsupport/.
2. Restart Studio:


a. Shut down Studio from the File menu.


b. Wait for one minute.
c. Reopen Studio by refreshing the page or opening it from the AWS Management Console.

The problem should be resolved if you have uninstalled the package which caused the conflict. To
install packages without causing this issue again, use %pip install without the --user flag.

If the issue persists, create a new user profile and set up your environment with that user profile.

If these solutions don't fix the issue, contact AWS Support at https://aws.amazon.com/premiumsupport/.


• Unable to open Studio from the AWS Management Console.

If you are unable to open Studio and cannot create a new running instance with all default settings,
contact AWS Support at https://aws.amazon.com/premiumsupport/.

KernelGateway application issues


The following issues are specific to KernelGateway applications that are launched in Studio.

• Cannot access the Kernel session

When the user launches a new notebook, they are unable to connect to the notebook session. If the
KernelGateway application's status is In Service, you can verify the following to resolve the issue.
• Check Security Group configurations

If the domain is set up in VPCOnly mode, the security group associated with the domain must allow
traffic between the ports in the range 8192-65535 for connectivity between the JupyterServer and
KernelGateway apps.

To verify the security group rules

1. Get the security groups associated with the domain using the DescribeDomain API call.
2. Sign in to the AWS Management Console and open the Amazon VPC console at
https://console.aws.amazon.com/vpc/.
3. From the left navigation, under Security, choose Security Groups.
4. Filter by the IDs of the security groups that are associated with the domain.
5. For each security group:

a. Select the security group.


b. From the security group details page, view the Inbound rules. Verify that traffic is allowed
between ports in the range 8192-65535.

For more information about security group rules, see Control traffic to resources using security
groups. For more information about requirements to use Studio in VPCOnly mode, see Connect
SageMaker Studio Notebooks in a VPC to External Resources (p. 3209).
• Verify firewall and WebSocket connections

If the KernelGateway apps have an InService status and the user is unable to connect to the
Studio notebook session, verify the firewall and WebSocket settings.

1. Launch the Studio application. For more information, see Launch Amazon SageMaker
Studio (p. 133).
2. Open your web browser’s developer tools.
3. Choose the Network tab.


4. Search for an entry that matches the following format.

wss://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/api/kernels/
<unique-code>/channels?session_id=<unique-code>

If the status or response code for the entry is anything other than 101, then your network
settings are preventing the connection between the Studio application and the KernelGateway
apps.

To resolve this issue, contact the team that manages your networking settings to allow list the
Studio URL and enable WebSocket connections.
• Unable to launch an app caused by exceeded resource quotas

When a user tries to launch a new notebook, the notebook creation fails with either of the following
errors. This is caused by exceeding resource quotas.

Unable to start more Apps of AppType [KernelGateway] and ResourceSpec(instanceType=[])
for UserProfile []. Please delete an App with a matching AppType and ResourceSpec,
then try again

Studio supports up to four running KernelGateway apps on the same instance. To resolve this issue,
you can do either of the following:
• Delete an existing KernelGateway application running on the instance, then restart the new
notebook. A CLI sketch for finding and deleting an app follows at the end of this topic.
• Start the new notebook on a different instance type.

For more information, see Change an Instance Type (p. 158).



An error occurred (ResourceLimitExceeded) when calling the CreateApp operation

In this case, the account does not have sufficient limits to create a Studio application on the
specified instance type. To resolve this, navigate to the Service Quotas console at
https://console.aws.amazon.com/servicequotas/. In that console, request an increase for the
Studio KernelGateway Apps running on instance-type instance limit. For more information,
see AWS service quotas.
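
To find and then delete a running KernelGateway app from the AWS CLI rather than the console, the
following is a sketch with placeholder names:

# List the running apps for the user to find the one to delete.
aws sagemaker list-apps --domain-id-equals <DOMAIN-ID> \
--user-profile-name-equals <USER-PROFILE-NAME>

# Delete the KernelGateway app that is occupying the instance.
aws sagemaker delete-app --domain-id <DOMAIN-ID> \
--user-profile-name <USER-PROFILE-NAME> \
--app-type KernelGateway --app-name <APP-NAME>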

Amazon SageMaker Notebook Instances


An Amazon SageMaker notebook instance is a machine learning (ML) compute instance running the
Jupyter Notebook App. SageMaker manages creating the instance and related resources. Use Jupyter
notebooks in your notebook instance to prepare and process data, write code to train models, deploy
models to SageMaker hosting, and test or validate your models.

SageMaker also provides sample notebooks that contain complete code walkthroughs. These
walkthroughs show how to use SageMaker to perform common machine learning tasks. For more
information, see Example Notebooks (p. 220).

Topics
• Amazon Linux 2 vs Amazon Linux notebook instances (p. 205)
• JupyterLab versioning (p. 208)
• Create a Notebook Instance (p. 209)
• Access Notebook Instances (p. 212)
• Update a Notebook Instance (p. 212)


• Customize a Notebook Instance Using a Lifecycle Configuration Script (p. 213)


• Example Notebooks (p. 220)
• Set the Notebook Kernel (p. 221)
• Associate Git Repositories with SageMaker Notebook Instances (p. 222)
• Notebook Instance Metadata (p. 229)
• Monitor Jupyter Logs in Amazon CloudWatch Logs (p. 229)

Amazon Linux 2 vs Amazon Linux notebook instances


Amazon SageMaker notebook instances currently support Amazon Linux 2 (AL2) and Amazon Linux (AL1)
operating systems. You can select the operating system that your notebook instance is based on when
you create the notebook instance. Notebook instances created before 08/18/2021 automatically run on
AL1.

Notebook instances based on AL1 will enter a maintenance phase as of 12/01/2022. To replace AL1, you
now have the option to create Amazon SageMaker notebook instances with AL2. The AL1 maintenance
phase also coincides with the deprecation of Python 2 and Chainer. Notebooks based on AL2 do not have
managed Python 2 and Chainer kernels.

AL1 Maintenance Phase Plan


The following timeline describes when AL1 enters its extended maintenance phase.

• 08/18/2021 – Notebook instances based on AL2 are launched. Newly launched notebook instances still
default to AL1. AL1 is supported with security patches and updates, but no new features. You can choose
between the two operating systems when launching a new notebook instance.
• 10/31/2022 – The default platform identifier for SageMaker notebook instances changes from Amazon
Linux (al1-v1) to Amazon Linux 2 (al2-v2). You can choose between the two operating systems when
launching a new notebook instance.
• 12/01/2022 – AL1 is no longer supported with non-critical security patches and updates. AL1 still
receives fixes for critical security-related issues. You can still launch instances on AL1, but you
assume the risks associated with using an unsupported operating system.
• 02/01/2023 – AL1 is no longer an available option for new notebook instance creation. After this
date, customers can create notebook instances with the AL2 platform identifiers. Existing al1-v1
notebook instances are not affected.


Supported instances
Amazon Linux 2 supports instances listed under Notebook Instances in Amazon SageMaker Pricing with
the exception that Amazon Linux 2 does not support ml.p2 instances.

Available Kernels
notebook-al1-v1: The following kernels are available in notebook instances based on the Amazon
Linux platform. These notebook instances support JupyterLab version 1. For information about
JupyterLab versions, see JupyterLab versioning (p. 208).

Kernel Name

Sparkmagic (PySpark)

Sparkmagic (Spark)

Sparkmagic (SparkR)

conda_amazonei_mxnet_p27

conda_amazonei_mxnet_p36

conda_amazonei_pytorch_latest_p36

conda_amazonei_tensorflow2_p27

conda_amazonei_tensorflow2_p36

conda_amazonei_tensorflow_p27

conda_amazonei_tensorflow_p36

conda_chainer_p27

conda_chainer_p36

conda_mxnet_latest_p37

conda_mxnet_p27

conda_mxnet_p36

conda_python2

conda_python3

conda_pytorch_latest_p36

conda_pytorch_p27

conda_pytorch_p36

conda_tensorflow2_p36

conda_tensorflow_p27

conda_tensorflow_p36


notebook-al2-v1: The following kernels are available in notebook instances based on the Amazon
Linux 2 platform. These notebook instances support JupyterLab version 1. For information about
JupyterLab versions, see JupyterLab versioning (p. 208).

Kernel Name

Sparkmagic (PySpark)

Sparkmagic (Spark)

Sparkmagic (SparkR)

conda_amazonei_mxnet_p36

conda_amazonei_pytorch_latest_p37

conda_amazonei_tensorflow2_p36

conda_mxnet_p38

conda_python3

conda_pytorch_p39

conda_tensorflow2_p310

notebook-al2-v2: The following kernels are available in notebook instances based on the Amazon
Linux 2 platform. These notebook instances support JupyterLab version 3. For information about
JupyterLab versions, see JupyterLab versioning (p. 208).

Kernel Name

Sparkmagic (PySpark)

Sparkmagic (Spark)

Sparkmagic (SparkR)

conda_amazonei_pytorch_latest_p37

conda_mxnet_p38

conda_python3

conda_pytorch_p39

conda_tensorflow2_p310

Migrating to Amazon Linux 2


Your existing notebook instance is not automatically migrated to Amazon Linux 2. To upgrade your
notebook instance to Amazon Linux 2, you must create a new notebook instance, replicate your code
and environment, and delete your old notebook instance. For more information, see the Amazon Linux 2
migration blog.


JupyterLab versioning
The Amazon SageMaker notebook instance interface is based on JupyterLab, which is a web-based
interactive development environment for notebooks, code, and data. Notebooks now support using
either JupyterLab 1 or JupyterLab 3. A single notebook instance can run a single instance of JupyterLab
(at most). You can have multiple notebook instances with different JupyterLab versions.

You can configure your notebook to run your preferred JupyterLab version by selecting the appropriate
platform identifier. Use either the AWS CLI or the SageMaker console when creating your notebook
instance. For more information about platform identifiers, see Amazon Linux 2 vs Amazon Linux
notebook instances. If you don’t explicitly configure a platform identifier, your notebook instance
defaults to running JupyterLab 1.

Topics
• JupyterLab 3 (p. 208)
• Creating a notebook with your JupyterLab version (p. 209)
• View the JupyterLab version of a notebook from the console (p. 209)

JupyterLab 3
JupyterLab 3 support is available only on the Amazon Linux 2 operating system platform. JupyterLab 3
includes the following features that are not available in JupyterLab 1. For more information about these
features, see JupyterLab 3.0 is released!.

• Visual debugger when using the following kernels:


• conda_pytorch_p38
• conda_tensorflow2_p38
• conda_amazonei_pytorch_latest_p37
• File browser filter
• Table of Contents (TOC)
• Multi-language support
• Simple mode
• Single interface mode
• Live editing SVG files with updated rendering
• User interface for notebook cell tags

Important changes to JupyterLab 3


For information about important changes when using JupyterLab 3, see the following JupyterLab
change logs:

• v2.0.0
• v3.0.0

Package version changes

JupyterLab 3 has the following package version changes from JupyterLab 1:

• JupyterLab has been upgraded from 1.x to 3.x.


• Jupyter notebook has been upgraded from 5.x to 6.x.
• jupyterlab-git has been updated to version 0.37.1.


• nbserverproxy 0.x (0.3.2) has been replaced with jupyter-server-proxy 3.x (3.2.1).

Creating a notebook with your JupyterLab version


You can select the JupyterLab version when creating your notebook instance from the console following
the steps in Create a Notebook Instance (p. 209).

You can also select the JupyterLab version by passing the platform-identifier parameter when
creating your notebook instance using the AWS CLI as follows:

aws sagemaker create-notebook-instance --notebook-instance-name <NEW_NOTEBOOK_NAME> \
--instance-type <INSTANCE_TYPE> \
--role-arn <YOUR_ROLE_ARN> \
--platform-identifier <PLATFORM_TO_USE>

View the JupyterLab version of a notebook from the console


You can view the JupyterLab version of a notebook using the following procedure:

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. From the left navigation, select Notebook.
3. From the dropdown menu, select Notebook instances to navigate to the Notebook instances page.
4. From the list of notebook instances, select your notebook instance name.
5. On the Notebook instance settings page, view the Platform Identifier to see the JupyterLab
version of the notebook.

Create a Notebook Instance


An Amazon SageMaker notebook instance is a ML compute instance running the Jupyter Notebook
App. SageMaker manages creating the instance and related resources. Use Jupyter notebooks in your
notebook instance to prepare and process data, write code to train models, deploy models to SageMaker
hosting, and test or validate your models.

To create a notebook instance, use either the SageMaker console or the CreateNotebookInstance API.

The notebook instance type you choose depends on how you use your notebook instance. You want to
ensure that your notebook instance is not bound by memory, CPU, or IO. If you plan to load a dataset
into memory on the notebook instance for exploration or preprocessing, we recommend that you
choose an instance type with enough RAM for your dataset. This would require an instance
with at least 16 GB of memory (.xlarge or larger). If you plan to use the notebook for compute-intensive
preprocessing, we recommend you choose a compute-optimized instance such as a c4 or c5.

A best practice when using a SageMaker notebook is to use the notebook instance to orchestrate other
AWS services. For example, you can use the notebook instance to manage large dataset processing by
making calls to AWS Glue for ETL (extract, transform, and load) services or Amazon EMR for mapping
and data reduction using Hadoop. You can use AWS services as temporary forms of computation or
storage for your data.

You can store and retrieve your training and test data using an Amazon S3 bucket. You can then use
SageMaker to train and build your model, so the instance type of your notebook would have no bearing
on the speed of your model training and testing.

After receiving the request, SageMaker does the following:


• Creates a network interface—If you choose the optional VPC configuration, SageMaker creates the
network interface in your VPC. It uses the subnet ID that you provide in the request to determine
which Availability Zone to create the subnet in. SageMaker associates the security group that you
provide in the request with the subnet. For more information, see Connect a Notebook Instance in a
VPC to External Resources (p. 3211).
• Launches an ML compute instance—SageMaker launches an ML compute instance in a SageMaker
VPC. SageMaker performs the configuration tasks that allow it to manage your notebook instance, and
if you specified your VPC, it enables traffic between your VPC and the notebook instance.
• Installs Anaconda packages and libraries for common deep learning platforms—SageMaker installs
all of the Anaconda packages that are included in the installer. For more information, see Anaconda
package list. In addition, SageMaker installs the TensorFlow and Apache MXNet deep learning libraries.
• Attaches an ML storage volume—SageMaker attaches an ML storage volume to the ML compute
instance. You can use the volume as a working area to clean up the training dataset or to temporarily
store validation, test, or other data. Choose any size between 5 GB and 16,384 GB, in 1 GB increments,
for the volume. The default is 5 GB. ML storage volumes are encrypted, so SageMaker can't determine
the amount of available free space on the volume. Because of this, you can increase the volume size
when you update a notebook instance, but you can't decrease the volume size. If you want to decrease
the size of the ML storage volume in use, create a new notebook instance with the desired size.

Only files and data saved within the /home/ec2-user/SageMaker folder persist between notebook
instance sessions. Files and data that are saved outside this directory are overwritten when the
notebook instance stops and restarts. Each notebook instance's /tmp directory provides a minimum
of 10 GB of storage in an instance store. An instance store is temporary, block-level storage that isn't
persistent. When the instance is stopped or restarted, SageMaker deletes the directory's contents. This
temporary storage is part of the root volume of the notebook instance.
• Copies example Jupyter notebooks—These Python code examples illustrate model training and
hosting exercises using various algorithms and training datasets.

To create a SageMaker notebook instance:

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Notebook instances, then choose Create notebook instance.
3. On the Create notebook instance page, provide the following information:

a. For Notebook instance name, type a name for your notebook instance.
b. For Notebook instance type, choose an instance size suitable for your use case. For a list of
supported instance types and quotas, see Amazon SageMaker Service Quotas.
c. For Elastic Inference, choose an inference accelerator type to associate with the notebook
instance if you plan to conduct inferences from the notebook instance, or choose none. For
information about elastic inference, see Use Amazon SageMaker Elastic Inference (EI) (p. 2628).
d. For Platform Identifier, choose a platform type to create the notebook instance on. This
platform type dictates the Operating System and the JupyterLab version that your notebook
instance is created with. For information about platform identifier type, see Amazon Linux 2 vs
Amazon Linux notebook instances (p. 205). For information about JupyterLab versions, see
JupyterLab versioning (p. 208).
e. (Optional) Additional configuration lets advanced users create a shell script that can run when
you create or start the instance. This script, called a lifecycle configuration script, can be used
to set the environment for the notebook or to perform other functions. For information, see
Customize a Notebook Instance Using a Lifecycle Configuration Script (p. 213).
f. (Optional) Additional configuration also lets you specify the size, in GB, of the ML storage
volume that is attached to the notebook instance. You can choose a size between 5 GB and
16,384 GB, in 1 GB increments. You can use the volume to clean up the training dataset or to
temporarily store validation or other data.


g. (Optional) For Minimum IMDS Version, select a version from the dropdown list. If this value
is set to v1, both versions can be used with the notebook instance. If v2 is selected, then only
IMDSv2 can be used with the notebook instance. For information about IMDSv2, see Use
IMDSv2.
Note
Starting October 31, 2022, the default minimum IMDS Version for SageMaker
notebook instances changes from IMDSv1 to IMDSv2.
Starting February 1, 2023, IMDSv1 is no longer available for new notebook instance
creation. After this date, new notebook instances can be created only with a minimum
IMDS version of 2.
h. For IAM role, choose either an existing IAM role in your account that has the
necessary permissions to access SageMaker resources or choose Create a new
role. If you choose Create a new role, SageMaker creates an IAM role named
AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS. The AWS managed policy
AmazonSageMakerFullAccess is attached to the role. The role provides permissions that
allow the notebook instance to call SageMaker and Amazon S3.
i. For Root access, to enable root access for all notebook instance users, choose Enable. To disable
root access for users, choose Disable. If you enable root access, all notebook instance users have
administrator privileges and can access and edit all files on the instance.
j. (Optional) Encryption key lets you encrypt data on the ML storage volume attached to the
notebook instance using an AWS Key Management Service (AWS KMS) key. If you plan to store
sensitive information on the ML storage volume, consider encrypting the information.
k. (Optional) Network lets you put your notebook instance inside a Virtual Private Cloud (VPC).
A VPC provides additional security and restricts access to resources in the VPC from sources
outside the VPC. For more information on VPCs, see Amazon VPC User Guide.

To add your notebook instance to a VPC:

i. Choose the VPC and a SubnetId.


ii. For Security Group, choose your VPC's default security group.
iii. If you need your notebook instance to have internet access, enable direct internet access.
For Direct internet access, choose Enable. Internet access can make your notebook
instance less secure. For more information, see Connect a Notebook Instance in a VPC to
External Resources (p. 3211).
l. (Optional) To associate Git repositories with the notebook instance, choose a default repository
and up to three additional repositories. For more information, see Associate Git Repositories
with SageMaker Notebook Instances (p. 222).
m. Choose Create notebook instance.

In a few minutes, Amazon SageMaker launches an ML compute instance—in this case, a notebook instance—and attaches an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries. For more information, see the CreateNotebookInstance API.
4. When the status of the notebook instance is InService, in the console, the notebook instance
is ready to use. Choose Open Jupyter next to the notebook name to open the classic Jupyter
dashboard.

You can choose Open JupyterLab to open the JupyterLab dashboard. The dashboard provides
access to your notebook instance and sample SageMaker notebooks that contain complete code
walkthroughs. These walkthroughs show how to use SageMaker to perform common machine
learning tasks. For more information, see Example Notebooks (p. 220). For more information, see
Control root access to a SageMaker notebook instance (p. 3042).

For more information about Jupyter notebooks, see The Jupyter notebook.

Access Notebook Instances


To access your Amazon SageMaker notebook instances, choose one of the following options:

• Use the console.

Choose Notebook instances. The console displays a list of notebook instances in your account. To
open a notebook instance with a standard Jupyter interface, choose Open Jupyter for that instance.
To open a notebook instance with a JupyterLab interface, choose Open JupyterLab for that instance.

The console uses your sign-in credentials to send a CreatePresignedNotebookInstanceUrl API request to SageMaker. SageMaker returns the URL for your notebook instance, and the console opens the URL in another browser tab and displays the Jupyter notebook dashboard.
Note
The URL that you get from a call to
CreatePresignedNotebookInstanceUrl is valid only for 5 minutes. If you try to use the
URL after the 5-minute limit expires, you are directed to the AWS Management Console sign-
in page.
• Use the API.

To get the URL for the notebook instance, call the CreatePresignedNotebookInstanceUrl API and use the URL that the API returns to open the notebook instance.
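
For example, the following AWS CLI sketch requests a presigned URL for a hypothetical notebook instance and bounds the resulting Jupyter session to 30 minutes:

# The instance name is a placeholder; the returned AuthorizedUrl must be
# opened within 5 minutes.
aws sagemaker create-presigned-notebook-instance-url \
    --notebook-instance-name MyNotebookInstance \
    --session-expiration-duration-in-seconds 1800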

Use the Jupyter notebook dashboard to create and manage notebooks and to write code. For more
information about Jupyter notebooks, see http://jupyter.org/documentation.html.

Update a Notebook Instance


After you create a notebook instance, you can update it using the SageMaker console or the UpdateNotebookInstance API operation.

You can update the tags of a notebook instance that is InService. To update any other attribute of a
notebook instance, its status must be Stopped.
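
For example, the following AWS CLI sketch (with hypothetical names and values) changes the instance type and grows the ML storage volume of a stopped notebook instance:

# The notebook instance must be in the Stopped state for these changes.
aws sagemaker update-notebook-instance \
    --notebook-instance-name MyNotebookInstance \
    --instance-type ml.t3.xlarge \
    --volume-size-in-gb 50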

To update a notebook instance in the SageMaker console:

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Notebook instances.
3. Choose the notebook instance that you want to update by selecting the notebook instance Name
from the list.
4. If your notebook Status is not Stopped, select the Stop button to stop the notebook instance.

When you do this, the notebook instance status changes to Stopping. Wait until the status changes
to Stopped to complete the following steps.


5. Select the Edit button to open the Edit notebook instance page. For information about the
notebook properties you can update, see Create a Notebook Instance (p. 209).
6. Update your notebook instance and select the Update notebook instance button at the bottom
of the page when you are done to return to the notebook instances page. Your notebook instance
status changes to Updating.

When the notebook instance update is complete, the status changes to Stopped.

Customize a Notebook Instance Using a Lifecycle Configuration Script

To install packages or sample notebooks on your notebook instance, configure networking and security
for it, or otherwise use a shell script to customize it, use a lifecycle configuration. A lifecycle configuration
provides shell scripts that run only when you create the notebook instance or whenever you start one.
When you create a notebook instance, you can create a new lifecycle configuration and the scripts it uses
or apply one that you already have.

You can also use a lifecycle configuration script to access AWS services from your notebook. For example,
you can create a script that lets you use your notebook to control other AWS resources, such as an
Amazon EMR instance.

We maintain a public repository of notebook lifecycle configuration scripts that address common use cases for customizing notebook instances at https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-configuration-samples.
Note
Each script has a limit of 16384 characters.
The value of the $PATH environment variable that is available to both scripts is /usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin. The working directory, which is the value of the $PWD environment variable, is /.
View CloudWatch Logs for notebook instance lifecycle configurations in log group /aws/sagemaker/NotebookInstances in log stream [notebook-instance-name]/[LifecycleConfigHook].
Scripts cannot run for longer than 5 minutes. If a script runs for longer than 5 minutes, it fails
and the notebook instance is not created or started. To help decrease the run time of scripts, try
the following:

• Cut down on unnecessary steps. For example, limit the conda environments in which you install
large packages.
• Run tasks in parallel processes.
• Use the nohup command in your script.
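
You can retrieve those lifecycle configuration logs from the AWS CLI as well. The following is a minimal sketch, assuming a hypothetical notebook instance named MyNotebookInstance:

# Prints recent lifecycle configuration log events for the notebook instance.
aws logs get-log-events \
    --log-group-name /aws/sagemaker/NotebookInstances \
    --log-stream-name MyNotebookInstance/LifecycleConfigHook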

You can see a list of notebook instance lifecycle configurations you previously created by choosing
Lifecycle configuration in the SageMaker console. You can attach a notebook instance lifecycle
configuration when you create a new notebook instance. For more information about creating a
notebook instance, see Create a Notebook Instance (p. 209).

To create a lifecycle configuration

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left side under SageMaker dashboard, choose Lifecycle configurations.
3. From the Lifecycle configurations page, choose the Notebook Instance tab.
4. Choose Create configuration.


5. For Name, type a name using alphanumeric characters and "-", but no spaces. The name can have a
maximum of 63 characters.
6. (Optional) To create a script that runs when you create the notebook and every time you start it,
choose Start notebook.
7. In the Start notebook editor, type the script.
8. (Optional) To create a script that runs only once, when you create the notebook, choose Create
notebook.
9. In the Create notebook editor, type the script.
10. Choose Create configuration.
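
You can also create a lifecycle configuration from the AWS CLI. The following is a minimal sketch; the configuration name and script file names are hypothetical, and the script contents must be base64-encoded:

# Assumes on-create.sh and on-start.sh exist in the working directory.
aws sagemaker create-notebook-instance-lifecycle-config \
    --notebook-instance-lifecycle-config-name MyLifecycleConfig \
    --on-create Content=$(base64 -w0 on-create.sh) \
    --on-start Content=$(base64 -w0 on-start.sh)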

Lifecycle Configuration Best Practices


The following are best practices for using lifecycle configurations:
Important
We do not recommend storing sensitive information in your lifecycle configuration script.

• Lifecycle configurations run as the root user. If your script makes any changes within the /home/ec2-
user/SageMaker directory, (for example, installing a package with pip), use the command sudo -u
ec2-user to run as the ec2-user user. This is the same user that Amazon SageMaker runs as.
• SageMaker notebook instances use conda environments to implement different kernels for Jupyter
notebooks. If you want to install packages that are available to one or more notebook kernels, enclose
the commands to install the packages with conda environment commands that activate the conda
environment that contains the kernel where you want to install the packages.

For example, if you want to install a package only for the python3 environment, use the following
code:

#!/bin/bash
sudo -u ec2-user -i <<EOF

# This will affect only the Jupyter kernel called "conda_python3".
source activate python3

# Replace myPackage with the name of the package you want to install.
pip install myPackage
# You can also perform "conda install" here as well.

source deactivate

EOF

If you want to install a package in all conda environments in the notebook instance, use the following
code:

#!/bin/bash
sudo -u ec2-user -i <<EOF

# Note that "base" is a special environment name; include it here as well.
# The dollar signs are escaped so that the variables and command substitutions
# are evaluated inside the ec2-user shell, not by the outer shell.
for env in base /home/ec2-user/anaconda3/envs/*; do
    source /home/ec2-user/anaconda3/bin/activate \$(basename "\$env")

    # Installing packages in the Jupyter system environment can affect the
    # stability of your SageMaker notebook instance. You can remove this check
    # if you'd like to install Jupyter extensions, etc.
    if [ \$(basename "\$env") = 'JupyterSystemEnv' ]; then
        continue
    fi

    # Replace myPackage with the name of the package you want to install.
    pip install --upgrade --quiet myPackage
    # You can also perform "conda install" here as well.

    source /home/ec2-user/anaconda3/bin/deactivate
done

EOF

• You must store all conda environments in the default environments folder (/home/ec2-user/anaconda3/envs).

Important
When you create or change a script, we recommend that you use a text editor that provides
Unix-style line breaks, such as the text editor available in the console when you create a
notebook. Copying text from a non-Linux operating system might introduce incompatible line
breaks and result in an unexpected error.

Install External Libraries and Kernels in Notebook Instances


Amazon SageMaker notebook instances come with multiple environments already installed. These
environments contain Jupyter kernels and Python packages including: scikit, Pandas, NumPy,
TensorFlow, and MXNet. These environments, along with all files in the sample-notebooks folder, are
refreshed when you stop and start a notebook instance. You can also install your own environments that
contain your choice of packages and kernels.

The different Jupyter kernels in Amazon SageMaker notebook instances are separate conda
environments. For information about conda environments, see Managing environments in the Conda
documentation.

Install custom environments and kernels on the notebook instance's Amazon EBS volume. This ensures
that they persist when you stop and restart the notebook instance, and that any external libraries you
install are not updated by SageMaker. To do that, use a lifecycle configuration that includes both a script
that runs when you create the notebook instance (on-create) and a script that runs each time you
restart the notebook instance (on-start). For more information about using notebook instance lifecycle
configurations, see Customize a Notebook Instance Using a Lifecycle Configuration Script (p. 213).
There is a GitHub repository that contains sample lifecycle configuration scripts at SageMaker Notebook
Instance Lifecycle Config Samples.

The examples at https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-create.sh and https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-start.sh show the best practice for installing environments and kernels on a
notebook instance. The on-create script installs the ipykernel library to create custom environments
as Jupyter kernels, then uses pip install and conda install to install libraries. You can adapt the
script to create custom environments and install libraries that you want. SageMaker does not update
these libraries when you stop and restart the notebook instance, so you can ensure that your custom
environment has specific versions of libraries that you want. The on-start script installs any custom
environments that you create as Jupyter kernels, so that they appear in the dropdown list in the Jupyter
New menu.

Package installation tools


SageMaker notebooks support the following package installation tools:

• conda install
• pip install


You can install packages using the following methods:

• Lifecycle configuration scripts.

For example scripts, see SageMaker Notebook Instance Lifecycle Config Samples. For more information
on lifecycle configuration, see Customize a Notebook Instance Using a Lifecycle Configuration Script.
• Notebooks – The following commands are supported.
• %conda install
• %pip install
• The Jupyter terminal – You can install packages using pip and conda directly.

From within a notebook you can use the system command syntax (lines starting with !) to install
packages, for example, !pip install and !conda install. More recently, new commands have been
added to IPython: %pip and %conda. These commands are the recommended way to install packages
from a notebook as they correctly take into account the active environment or interpreter being used.
For more information, see Add %pip and %conda magic functions.
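
For example, the following notebook cell commands (with a hypothetical package name) install into the environment of the currently active kernel:

# Run these in a notebook cell; myPackage is a placeholder.
%conda install --yes numpy
%pip install --quiet myPackage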

Conda

Conda is an open source package management system and environment management system, which can
install packages and their dependencies. SageMaker supports using Conda with either of the two main channels: the default channel and the conda-forge channel. For more information, see Conda channels.
The conda-forge channel is a community channel where contributors can upload packages.
Note
Due to how Conda resolves the dependency graph, installing packages from conda-forge can
take significantly longer (in the worst cases, upwards of 10 minutes).

The Deep Learning AMI comes with many conda environments and many packages preinstalled. Due to
the number of packages preinstalled, finding a set of packages that are guaranteed to be compatible
is difficult. You may see a warning "The environment is inconsistent, please check the package plan
carefully". Despite this warning, SageMaker ensures that all the SageMaker provided environments are
correct. SageMaker cannot guarantee that any user installed packages will function correctly.
Note
Users of SageMaker, AWS Deep Learning AMI and Amazon EMR can access the commercial
Anaconda repository without taking a commercial license through February 1, 2024 when
using Anaconda in those services. For any usage outside of these three services, customers are
responsible for determining their own Anaconda license requirements.

Conda has two methods for activating environments: conda activate/deactivate, and source activate/
deactivate. For more information, see Should I use 'conda activate' or 'source activate' in Linux.

SageMaker supports moving Conda environments onto the Amazon EBS volume, which is persisted when
the instance is stopped. The environments aren't persisted when the environments are installed to the
root volume, which is the default behavior. For an example lifecycle script, see persistent-conda-ebs.

Supported conda operations (see note at the bottom of this topic)

• conda install of a package in a single environment


• conda install of a package in all environments
• conda install of a R package in the R environment
• Installing a package from the main conda repository
• Installing a package from conda-forge
• Changing the Conda install location to use EBS
• Supporting both conda activate and source activate


Pip
Pip is the de facto tool for installing and managing Python packages. Pip searches for packages on the Python Package Index (PyPI) by default. Unlike Conda, pip doesn't have built-in environment support, and is not as thorough as Conda when it comes to packages with native/system library dependencies. Pip can be used to install packages in Conda environments.

You can use alternative package repositories with pip instead of PyPI. For an example lifecycle script, see on-start.sh.

Supported pip operations (see note at the bottom of this topic)

• Using pip to install a package without an active conda environment (install packages system wide)
• Using pip to install a package in a conda environment
• Using pip to install a package in all conda environments
• Changing the pip install location to use EBS
• Using an alternative repository to install packages with pip

Unsupported
SageMaker aims to support as many package installation operations as possible. However, if the
packages were installed by SageMaker or DLAMI, and you use the following operations on these
packages, it might make your notebook instance unstable:

• Uninstalling
• Downgrading
• Upgrading

We do not provide support for installing packages via yum install or installing R packages from CRAN.

Due to potential issues with network conditions or configurations, or the availability of Conda or PyPI, we cannot guarantee that packages will install in a fixed or deterministic amount of time.
Note
We cannot guarantee that a package installation will be successful. Attempting to install a
package in an environment with incompatible dependencies can result in a failure. In such a
case, you should contact the library maintainer to see if it is possible to update the package
dependencies. Alternatively, you can attempt to modify the environment in such a way as to
allow the installation. This modification, however, will likely mean removing or updating existing
packages, which means we can no longer guarantee the stability of the environment.

Notebook Instance Software Updates


Amazon SageMaker periodically tests and releases software that is installed on notebook instances. This
includes:

• Kernel updates
• Security patches
• AWS SDK updates
• Amazon SageMaker Python SDK updates
• Open source software updates

To ensure that you have the most recent software updates, stop and restart your notebook instance, either in the SageMaker console or by calling StopNotebookInstance.
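
For example, the following AWS CLI sketch (with a hypothetical instance name) performs the stop-and-restart cycle:

aws sagemaker stop-notebook-instance --notebook-instance-name MyNotebookInstance
aws sagemaker wait notebook-instance-stopped --notebook-instance-name MyNotebookInstance
aws sagemaker start-notebook-instance --notebook-instance-name MyNotebookInstance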


You can also manually update software installed on your notebook instance while it is running by using
update commands in a terminal or in a notebook.
Note
Updating kernels and some packages might depend on whether root access is enabled for the
notebook instance. For more information, see Control root access to a SageMaker notebook
instance (p. 3042).

You can check the Personal Health Dashboard or the security bulletin at Security Bulletins for updates.

Control an Amazon EMR Spark Instance Using a Notebook


You can use a notebook instance created with a custom lifecycle configuration script to access AWS
services from your notebook. For example, you can create a script that lets you use your notebook with
Sparkmagic to control other AWS resources, such as an Amazon EMR instance. You can then use the
Amazon EMR instance to process your data instead of running the data analysis on your notebook. This
allows you to create a smaller notebook instance because you won't use the instance to process data.
This is helpful when you have large datasets that would require a large notebook instance to process the
data.

The process requires three procedures using the Amazon SageMaker console:

• Create the Amazon EMR Spark instance


• Create the Jupyter Notebook
• Test the notebook-to-Amazon EMR connection

To create an Amazon EMR Spark instance that can be controlled from a notebook using
Sparkmagic

1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.


2. In the navigation pane, choose Create cluster.
3. On the Create Cluster - Quick Options page, under Software configuration, choose Spark: Spark
2.4.4 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.2.
4. Set additional parameters on the page and then choose Create cluster.
5. On the Cluster page, choose the cluster name that you created. Note the Master Public DNS, the
EMR master's security group, and the VPC name and subnet ID where the EMR cluster was created.
You will use these values when you create a notebook.

To create a notebook that uses Sparkmagic to control an Amazon EMR Spark instance

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation pane, under Notebook instances, choose Create notebook.
3. Enter the notebook instance name and choose the instance type.
4. Choose Additional configuration, then, under Lifecycle configuration, choose Create a new
lifecycle configuration.
5. Add the following code to the lifecycle configuration script:

# OVERVIEW
# This script connects an Amazon EMR cluster to an Amazon SageMaker notebook
# instance that uses Sparkmagic.
#
# Note that this script will fail if the Amazon EMR cluster's master node IP
# address is not reachable.
#
# 1. Ensure that the EMR master node IP is resolvable from the notebook instance.
#    One way to accomplish this is to have the notebook instance and the
#    Amazon EMR cluster in the same subnet.
# 2. Ensure the EMR master node security group provides inbound access from the
#    notebook instance security group.
#    Type - Protocol - Port - Source
#    Custom TCP - TCP - 8998 - $NOTEBOOK_SECURITY_GROUP
# 3. Ensure the notebook instance has internet connectivity to fetch the
#    Sparkmagic example config.
#
# https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/

# PARAMETERS
EMR_MASTER_IP=your.emr.master.ip

cd /home/ec2-user/.sparkmagic

echo "Fetching Sparkmagic example config from GitHub..."
wget https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json

echo "Replacing EMR master node IP in Sparkmagic config..."
sed -i -- "s/localhost/$EMR_MASTER_IP/g" example_config.json
mv example_config.json config.json

echo "Sending a sample request to Livy..."
curl "$EMR_MASTER_IP:8998/sessions"

6. In the PARAMETERS section of the script, replace your.emr.master.ip with the Master Public DNS
name for the Amazon EMR instance.
7. Choose Create configuration.
8. On the Create notebook page, choose Network - optional.
9. Choose the VPC and subnet where the Amazon EMR instance is located.
10. Choose the security group used by the Amazon EMR master node.
11. Choose Create notebook instance.

While the notebook instance is being created, the status is Pending. After the instance has been created
and the lifecycle configuration script has successfully run, the status is InService.
Note
If the notebook instance can't connect to the Amazon EMR instance, SageMaker can't create
the notebook instance. The connection can fail if the Amazon EMR instance and notebook are
not in the same VPC and subnet, if the Amazon EMR master security group is not used by the
notebook, or if the Master Public DNS name in the script is incorrect.

To test the connection between the Amazon EMR instance and the notebook

1. When the status of the notebook is InService, choose Open Jupyter to open the notebook.
2. Choose New, then choose Sparkmagic (PySpark).
3. In the code cell, enter %%info and then run the cell.

The output should be similar to the following:

Current session configs: {'driverMemory': '1000M', 'executorCores': 2, 'kind': 'pyspark'}
No active sessions.


Example Notebooks
Your notebook instance contains example notebooks provided by Amazon SageMaker. The example
notebooks contain code that shows how to apply machine learning solutions by using SageMaker.
Notebook instances use the nbexamples Jupyter extension, which enables you to view a read-
only version of an example notebook or create a copy of it that you can modify and run. For more
information about the nbexamples extension, see https://github.com/danielballan/nbexamples.
For information about example notebooks for SageMaker Studio, see Use Amazon SageMaker Studio
Notebooks (p. 144).
Note
Example notebooks typically download datasets from the internet. If you disable SageMaker-
provided internet access when you create your notebook instance, example notebooks might
not work. For more information, see Connect a Notebook Instance in a VPC to External
Resources (p. 3211).

Use or View Example Notebooks in Jupyter Classic


To view or use the example notebooks in the classic Jupyter view, choose the SageMaker Examples tab.

To view a read-only version of an example notebook in the Jupyter classic view, on the SageMaker
Examples tab, choose Preview for that notebook. To create a copy of an example notebook in the home
directory of your notebook instance, choose Use. In the dialog box, you can change the notebook's name
before saving it.

Use or View Example Notebooks in JupyterLab


To view or use the example notebooks in the JupyterLab view, choose the examples icon in the left navigation panel.


To view a read-only version of an example notebook, choose the name of the notebook. This opens the
notebook as a tab in the main area. To create a copy of an example notebook in the home directory of
your notebook instance, choose Create a Copy in the top banner. In the dialog box, type a name for the
notebook and then choose CREATE COPY.

For more information about the example notebooks, see the SageMaker examples GitHub repository.

Set the Notebook Kernel


Amazon SageMaker provides several kernels for Jupyter that provide support for Python 2 and 3, Apache
MXNet, TensorFlow, and PySpark. To set a kernel for a new notebook in the Jupyter notebook dashboard,
choose New, and then choose the kernel from the list. For more information about the available kernels,
see Available Kernels (p. 206).


You can also create a custom kernel that you can use in your notebook instance. For information, see
Install External Libraries and Kernels in Notebook Instances (p. 215).

Associate Git Repositories with SageMaker Notebook Instances

Associate Git repositories with your notebook instance to save your notebooks in a source control
environment that persists even if you stop or delete your notebook instance. You can associate one
default repository and up to three additional repositories with a notebook instance. The repositories can
be hosted in AWS CodeCommit, GitHub, or on any other Git server. Associating Git repositories with your
notebook instance can be useful for:

• Persistence - Notebooks in a notebook instance are stored on durable Amazon EBS volumes, but they
do not persist beyond the life of your notebook instance. Storing notebooks in a Git repository enables
you to store and use notebooks even if you stop or delete your notebook instance.
• Collaboration - Peers on a team often work on machine learning projects together. Storing your
notebooks in Git repositories allows peers working in different notebook instances to share notebooks
and collaborate on them in a source-control environment.
• Learning - Many Jupyter notebooks that demonstrate machine learning techniques are available in
publicly hosted Git repositories, such as on GitHub. You can associate your notebook instance with a
repository to easily load Jupyter notebooks contained in that repository.

There are two ways to associate a Git repository with a notebook instance:

• Add a Git repository as a resource in your Amazon SageMaker account. Then, to access the repository,
you can specify an AWS Secrets Manager secret that contains credentials. That way, you can access
repositories that require authentication.
• Associate a public Git repository that is not a resource in your account. If you do this, you cannot
specify credentials to access the repository.

Topics
• Add a Git Repository to Your Amazon SageMaker Account (p. 222)
• Create a Notebook Instance with an Associated Git Repository (p. 225)
• Associate a CodeCommit Repository in a Different AWS Account with a Notebook Instance (p. 226)
• Use Git Repositories in a Notebook Instance (p. 227)

Add a Git Repository to Your Amazon SageMaker Account


To manage your GitHub repositories, easily associate them with your notebook instances, and associate
credentials for repositories that require authentication, add the repositories as resources in your Amazon
SageMaker account. You can view a list of repositories that are stored in your account and details about
each repository in the SageMaker console and by using the API.


You can add Git repositories to your SageMaker account in the SageMaker console or by using the AWS
CLI.
Note
You can use the SageMaker API CreateCodeRepository to add Git repositories to your SageMaker account, but step-by-step instructions are not provided here.

Add a Git Repository to Your SageMaker Account (Console)


To add a Git repository as a resource in your SageMaker account

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Under Notebook, choose Git repositories, then choose Add repository.
3. To add an CodeCommit repository, choose AWS CodeCommit. To add a GitHub or other Git-based
repository, choose GitHub/Other Git-based repo.

To add an existing CodeCommit repository

1. Choose Use existing repository.


2. For Repository, choose a repository from the list.
3. Enter a name to use for the repository in SageMaker. The name must be 1 to 63 characters. Valid
characters are a-z, A-Z, 0-9, and - (hyphen).
4. Choose Add repository.

To create a new CodeCommit repository

1. Choose Create new repository.


2. Enter a name for the repository that you can use in both CodeCommit and SageMaker. The name
must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and - (hyphen).
3. Choose Create repository.

To add a Git repository hosted somewhere other than CodeCommit

1. Choose GitHub/Other Git-based repo.


2. Enter a name for the repository. The name must be 1 to 63 characters. Valid characters are a-z, A-Z, 0-9, and - (hyphen).
3. Enter the URL for the repository. Do not provide a username in the URL. Add the sign-in credentials
in AWS Secrets Manager as described in the next step.
4. For Git credentials, choose the credentials to use to authenticate to the repository. This is necessary
only if the Git repository is private.
Note
If you have two-factor authentication enabled for your Git repository, enter a personal
access token generated by your Git service provider in the password field.

a. To use an existing AWS Secrets Manager secret, choose Use existing secret, and then choose
a secret from the list. For information about creating and storing a secret, see Creating a Basic
Secret in the AWS Secrets Manager User Guide. The name of the secret you use must contain the
string sagemaker.
Note
The secret must have a staging label of AWSCURRENT and must be in the following
format:


{"username": UserName, "password": Password}


For GitHub repositories, we recommend using a personal access token in the password
field. For information, see https://help.github.com/articles/creating-a-personal-access-
token-for-the-command-line/.
b. To create a new AWS Secrets Manager secret, choose Create secret, enter a name for the secret,
and then enter the sign-in credentials to use to authenticate to the repository. The name for the
secret must contain the string sagemaker.
Note
The IAM role you use to create the secret must have the
secretsmanager:GetSecretValue permission in its IAM policy.
The secret must have a staging label of AWSCURRENT and must be in the following
format:
{"username": UserName, "password": Password}
For GitHub repositories, we recommend using a personal access token.
c. To not use any credentials, choose No secret.
5. Choose Add repository.
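
If you prefer to create the secret from the AWS CLI first, the following is a minimal sketch. The secret name is hypothetical but must contain the string sagemaker, and the value must use the username/password format described above:

# For GitHub, use a personal access token as the password value.
aws secretsmanager create-secret \
    --name sagemaker-github-credentials \
    --secret-string '{"username":"myprofile","password":"my-personal-access-token"}'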

Add a Git Repository to Your Amazon SageMaker Account (CLI)


Use the create-code-repository AWS CLI command. Specify a name for the repository as the value
of the code-repository-name argument. The name must be 1 to 63 characters. Valid characters are a-
z, A-Z, 0-9, and - (hyphen). Also specify the following:

• The default branch


• The URL of the Git repository
Note
Do not provide a username in the URL. Add the sign-in credentials in AWS Secrets Manager as
described in the next step.
• The Amazon Resource Name (ARN) of an AWS Secrets Manager secret that contains the credentials to
use to authenticate the repository as the value of the git-config argument

For information about creating and storing a secret, see Creating a Basic Secret in the AWS Secrets
Manager User Guide. The following command creates a new repository named MyRepository in your Amazon SageMaker account that points to a Git repository hosted at https://github.com/myprofile/my-repo.

For Linux, OS X, or Unix:

aws sagemaker create-code-repository \
    --code-repository-name "MyRepository" \
    --git-config Branch=branch,RepositoryUrl=https://github.com/myprofile/my-repo,SecretArn=arn:aws:secretsmanager:us-east-2:012345678901:secret:my-secret-ABc0DE

For Windows:

aws sagemaker create-code-repository ^
    --code-repository-name "MyRepository" ^
    --git-config "{\"Branch\":\"master\", \"RepositoryUrl\" : \"https://github.com/myprofile/my-repo\", \"SecretArn\" : \"arn:aws:secretsmanager:us-east-2:012345678901:secret:my-secret-ABc0DE\"}"

Note
The secret must have a staging label of AWSCURRENT and must be in the following format:


{"username": UserName, "password": Password}


For GitHub repositories, we recommend using a personal access token.

Create a Notebook Instance with an Associated Git Repository


You can associate Git repositories with a notebook instance when you create the notebook instance by
using the AWS Management Console, or the AWS CLI. If you want to use a CodeCommit repository that
is in a different AWS account than the notebook instance, set up cross-account access for the repository.
For information, see Associate a CodeCommit Repository in a Different AWS Account with a Notebook
Instance (p. 226).

Topics
• Create a Notebook Instance with an Associated Git Repository (Console) (p. 225)
• Create a Notebook Instance with an Associated Git Repository (CLI) (p. 225)

Create a Notebook Instance with an Associated Git Repository (Console)


To create a notebook instance and associate Git repositories in the Amazon SageMaker
console

1. Follow the instructions at Step 1: Create an Amazon SageMaker Notebook Instance (p. 88).
2. For Git repositories, choose Git repositories to associate with the notebook instance.

a. For Default repository, choose a repository that you want to use as your default repository.
SageMaker clones this repository as a subdirectory in the Jupyter startup directory at /home/
ec2-user/SageMaker. When you open your notebook instance, it opens in this repository. To
choose a repository that is stored as a resource in your account, choose its name from the list.
To add a new repository as a resource in your account, choose Add a repository to SageMaker
(opens the Add repository flow in a new window) and then follow the instructions at Create
a Notebook Instance with an Associated Git Repository (Console) (p. 225). To clone a public
repository that is not stored in your account, choose Clone a public Git repository to this
notebook instance only, and then specify the URL for that repository.
b. For Additional repository 1, choose a repository that you want to add as an additional
directory. SageMaker clones this repository as a subdirectory in the Jupyter startup directory
at /home/ec2-user/SageMaker. To choose a repository that is stored as a resource in your
account, choose its name from the list. To add a new repository as a resource in your account,
choose Add a repository to SageMaker (opens the Add repository flow in a new window) and
then follow the instructions at Create a Notebook Instance with an Associated Git Repository
(Console) (p. 225). To clone a repository that is not stored in your account, choose Clone
a public Git repository to this notebook instance only, and then specify the URL for that
repository.

Repeat this step up to three times to add up to three additional repositories to your notebook
instance.

Create a Notebook Instance with an Associated Git Repository (CLI)


To create a notebook instance and associate Git repositories by using the AWS CLI, use the create-
notebook-instance command as follows:

• Specify the repository that you want to use as your default repository as the value of the default-
code-repository argument. Amazon SageMaker clones this repository as a subdirectory in the
Jupyter startup directory at /home/ec2-user/SageMaker. When you open your notebook instance,
it opens in this repository. To use a repository that is stored as a resource in your SageMaker account,
specify the name of the repository as the value of the default-code-repository argument. To use


a repository that is not stored in your account, specify the URL of the repository as the value of the
default-code-repository argument.
• Specify up to three additional repositories as the value of the additional-code-repositories argument. SageMaker clones these repositories as subdirectories in the Jupyter startup directory at /home/ec2-user/SageMaker, and excludes them from the default repository by adding them to the .git/info/exclude file of the default repository. To use repositories that are stored as resources in your SageMaker account, specify the names of the repositories as the value of the additional-code-repositories argument. To use repositories that are not stored in your account, specify the URLs of the repositories as the value of the additional-code-repositories argument.

For example, the following command creates a notebook instance that has a repository named
MyGitRepo, that is stored as a resource in your SageMaker account, as a default repository, and an
additional repository that is hosted on GitHub:

aws sagemaker create-notebook-instance \
    --notebook-instance-name "MyNotebookInstance" \
    --instance-type "ml.t2.medium" \
    --role-arn "arn:aws:iam::012345678901:role/service-role/AmazonSageMaker-ExecutionRole-20181129T121390" \
    --default-code-repository "MyGitRepo" \
    --additional-code-repositories "https://github.com/myprofile/my-other-repo"

Note
If you use an AWS CodeCommit repository that does not contain "SageMaker" in its name, add
the codecommit:GitPull and codecommit:GitPush permissions to the role that you pass
as the role-arn argument to the create-notebook-instance command. For information
about how to add permissions to a role, see Adding and Removing IAM Policies in the AWS
Identity and Access Management User Guide.
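
A minimal sketch for adding those permissions with the AWS CLI follows; the role name, policy name, and repository ARN are hypothetical:

aws iam put-role-policy \
    --role-name AmazonSageMaker-ExecutionRole-20181129T121390 \
    --policy-name CodeCommitGitAccess \
    --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["codecommit:GitPull","codecommit:GitPush"],"Resource":"arn:aws:codecommit:us-east-2:012345678901:MyGitRepo"}]}'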

Associate a CodeCommit Repository in a Different AWS Account with a Notebook Instance

To associate a CodeCommit repository in a different AWS account with your notebook instance, set up
cross-account access for the CodeCommit repository.

To set up cross-account access for a CodeCommit repository and associate it with a notebook
instance:

1. In the AWS account that contains the CodeCommit repository, create an IAM policy that allows
access to the repository from users in the account that contains your notebook instance. For
information, see Step 1: Create a Policy for Repository Access in AccountA in the CodeCommit User
Guide.
2. In the AWS account that contains the CodeCommit repository, create an IAM role, and attach the
policy that you created in the previous step to that role. For information, see Step 2: Create a Role
for Repository Access in AccountA in the CodeCommit User Guide.
3. Create a profile in the notebook instance that uses the role that you created in the previous step:

a. Open the notebook instance.


b. Open a terminal in the notebook instance.
c. Edit a new profile by typing the following in the terminal:

vi /home/ec2-user/.aws/config


d. Edit the file with the following profile information:

[profile CrossAccountAccessProfile]
region = us-west-2
role_arn = arn:aws:iam::CodeCommitAccount:role/CrossAccountRepositoryContributorRole
credential_source=Ec2InstanceMetadata
output = json

Where CodeCommitAccount is the account that contains the CodeCommit repository, CrossAccountAccessProfile is the name of the new profile, and CrossAccountRepositoryContributorRole is the name of the role you created in the previous step.
4. On the notebook instance, configure git to use the profile you created in the previous step:

a. Open the notebook instance.


b. Open a terminal in the notebook instance.
c. Edit the Git configuration file by typing the following in the terminal:

vi /home/ec2-user/.gitconfig

d. Edit the file with the following profile information:

[credential]
    helper = !aws codecommit credential-helper --profile CrossAccountAccessProfile $@
    UseHttpPath = true

Where CrossAccountAccessProfile is the name of the profile that you created in the previous step.

Use Git Repositories in a Notebook Instance


When you open a notebook instance that has Git repositories associated with it, it opens in the default
repository, which is installed in your notebook instance directly under /home/ec2-user/SageMaker.
You can open and create notebooks, and you can manually run Git commands in a notebook cell. For
example:

!git pull origin master

To open any of the additional repositories, navigate up one folder. The additional repositories are also
installed as directories under /home/ec2-user/SageMaker.

If you open the notebook instance with a JupyterLab interface, the jupyter-git extension is installed and
available to use. For information about the jupyter-git extension for JupyterLab, see https://github.com/
jupyterlab/jupyterlab-git.

When you open a notebook instance in JupyterLab, you see the Git repositories associated with it in the left menu.


You can use the jupyter-git extension to manage Git visually, instead of using the command line.


Notebook Instance Metadata


When you create a notebook instance, Amazon SageMaker creates a JSON file on the instance at
the location /opt/ml/metadata/resource-metadata.json that contains the ResourceName
and ResourceArn of the notebook instance. You can access this metadata from anywhere within
the notebook instance, including in lifecycle configurations. For information about notebook
instance lifecycle configurations, see Customize a Notebook Instance Using a Lifecycle Configuration
Script (p. 213).

The resource-metadata.json file has the following structure:

{
"ResourceArn": "NotebookInstanceArn",
"ResourceName": "NotebookInstanceName"
}

You can use this metadata from within the notebook instance to get other information about the
notebook instance. For example, the following commands get the tags associated with the notebook
instance:

NOTEBOOK_ARN=$(jq '.ResourceArn' /opt/ml/metadata/resource-metadata.json --raw-output)
aws sagemaker list-tags --resource-arn $NOTEBOOK_ARN

The output looks like the following:

{
"Tags": [
{
"Key": "test",
"Value": "true"
}
]
}

Monitor Jupyter Logs in Amazon CloudWatch Logs


Jupyter logs include important information such as events, metrics, and health information that provide
actionable insights when running Amazon SageMaker notebooks. By importing Jupyter logs into
CloudWatch Logs, customers can use CloudWatch Logs to detect anomalous behaviors, set alarms,
and discover insights to keep the SageMaker notebooks running more smoothly. You can access the
logs even when the Amazon EC2 instance that hosts the notebook is unresponsive, and use the logs to
troubleshoot the unresponsive notebook. Sensitive information such as AWS account IDs, secret keys,
and authentication tokens in presigned URLs are removed so that customers can share logs without
leaking private information.

To view Jupyter logs for a notebook instance:

1. Sign in to the AWS Management Console and open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
2. Choose Notebook instances.
3. In the list of notebook instances, choose the notebook instance for which you want to view Jupyter
logs by selecting the Notebook instance Name.

This will bring you to the details page for that notebook instance.


4. Under Monitor on the notebook instance details page, choose View logs.
5. In the CloudWatch console, choose the log stream for your notebook instance. Its name is in the
form NotebookInstanceName/jupyter.log.

For more information about monitoring CloudWatch logs for SageMaker, see Log Amazon SageMaker
Events with Amazon CloudWatch (p. 3284).
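
You can also retrieve the same stream from the AWS CLI. The following is a minimal sketch, assuming a hypothetical notebook instance named MyNotebookInstance and the log group used for notebook instances:

aws logs get-log-events \
    --log-group-name /aws/sagemaker/NotebookInstances \
    --log-stream-name MyNotebookInstance/jupyter.log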

Amazon SageMaker Studio Lab


Amazon SageMaker Studio Lab is a free service that gives customers access to AWS compute resources,
in an environment based on open-source JupyterLab. It is based on the same architecture and user
interface as Amazon SageMaker Studio, but with a subset of Studio capabilities.

With Studio Lab, you can use AWS compute resources to create and run your Jupyter notebooks without
signing up for an AWS account. Because Studio Lab is based on open-source JupyterLab, you can take
advantage of open-source Jupyter extensions to run your Jupyter notebooks.

Studio Lab compared to Amazon SageMaker Studio

While Studio Lab provides free access to AWS compute resources, Amazon SageMaker Studio provides
the following advanced machine learning capabilities that Studio Lab does not support.

• Continuous integration and continuous delivery (SageMaker Pipelines)


• Real-time predictions
• Large-scale distributed training
• Data preparation (Amazon SageMaker Data Wrangler)
• Data labeling (Amazon SageMaker Ground Truth)
• Feature Store
• Bias analysis (Clarify)
• Model deployment
• Model monitoring

Studio also supports fine-grained access control and security by using AWS Identity and Access
Management (IAM), Amazon Virtual Private Cloud (Amazon VPC), and AWS Key Management Service
(AWS KMS). Studio Lab does not support these Studio features, nor does it support the use of estimators
and built-in SageMaker algorithms.

To export your Studio Lab projects for use with Studio, see Export an Amazon SageMaker Studio Lab
environment to Amazon SageMaker Studio (p. 251).

The following topics give information about Studio Lab and how to use it.

Topics
• Amazon SageMaker Studio Lab components overview (p. 231)
• Onboard to Amazon SageMaker Studio Lab (p. 234)
• Manage your account (p. 235)
• Launch your Amazon SageMaker Studio Lab project runtime (p. 236)
• Use Amazon SageMaker Studio Lab starter assets (p. 237)
• Use the Amazon SageMaker Studio Lab project runtime (p. 239)
• Troubleshooting (p. 256)


Amazon SageMaker Studio Lab components overview


Amazon SageMaker Studio Lab consists of the following components. The following topics give more
details about these components.

Topics
• Landing page (p. 231)
• Studio Lab account (p. 231)
• Project overview page (p. 231)
• Preview page (p. 232)
• Project (p. 232)
• Compute instance type (p. 233)
• Project runtime (p. 234)
• Session (p. 234)

Landing page
You can request an account and sign in to an existing account on your landing page. To navigate to the
landing page, see the Amazon SageMaker Studio Lab website. For more information about creating a
Studio Lab account, see Onboard to Amazon SageMaker Studio Lab (p. 234).

The following screenshot shows the Studio Lab landing page interface for requesting a user account and
signing in.

Studio Lab account


Your Studio Lab account gives you access to Studio Lab. For more information about creating a user
account, see Onboard to Amazon SageMaker Studio Lab (p. 234).

Project overview page


You can launch a compute instance and view information about your project on this page. To navigate
to this page, you must sign in from the Amazon SageMaker Studio Lab website. The URL takes the
following format.


https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

The following screenshot shows a project overview in the Studio Lab user interface.

Preview page
On this page, you can access a read-only preview of a Jupyter notebook. You cannot run the
notebook from the preview, but you can copy it into your project. For many customers, this
may be the first Studio Lab page they see, because they may be opening a notebook directly from
GitHub. For more information on how to use GitHub resources, see Use GitHub resources (p. 248).

To copy the notebook preview to your Studio Lab project:

1. Sign in to your Studio Lab account. For more information about creating a Studio Lab account, see
Onboard to Amazon SageMaker Studio Lab (p. 234).
2. Under Notebook compute instance, choose a compute instance type. For more information about
compute instance types, see Compute instance type (p. 233).
3. Choose Start runtime. You might be asked to solve a CAPTCHA puzzle. For more information on
CAPTCHA, see What is a CAPTCHA puzzle?
4. One-time setup, the first time you start the runtime using your Studio Lab account:

a. Enter a mobile phone number to associate with your Amazon SageMaker Studio Lab account
and choose Continue.

For information on supported countries and regions, see Supported countries and regions (SMS
channel).
b. Enter the 6-digit code sent to the associated mobile phone number and choose Verify.
5. Choose Copy to project.

Project
Your project contains all of your files and folders, including your Jupyter notebooks. You have full control
over the files in your project. Your project also includes the JupyterLab-based user interface. From this
interface, you can interact with your Jupyter notebooks, edit your source code files, integrate with
GitHub, and connect to Amazon S3. For more information, see Use the Amazon SageMaker Studio Lab
project runtime (p. 239).

The following screenshot shows a Studio Lab project with the file browser open and the Studio Lab
Launcher displayed.


Compute instance type


Your Amazon SageMaker Studio Lab project runtime is based on an EC2 instance. You are allotted 15
GB of storage and 16 GB of RAM. Availability of compute instances is not guaranteed and is subject
to demand. If you require additional storage or compute resources, consider switching to Amazon
SageMaker Studio.

Amazon SageMaker Studio Lab offers the choice of a CPU (Central Processing Unit) or a GPU (Graphics
Processing Unit). The following sections give information about these two options, including selection
guidance.

CPU

A central processing unit (CPU) is designed to handle a wide range of tasks efficiently, but is limited
in how many tasks it can run concurrently. For machine learning, a CPU is recommended for compute-
intensive workloads, such as time series forecasting and working with tabular data.

The CPU compute type has 12 hours of compute time.

GPU

A graphics processing unit (GPU) is designed to render high-resolution images and video concurrently. A
GPU is recommended for deep learning tasks, especially for transformers and computer vision.

The GPU compute type has 4 hours of compute time.
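
If you are unsure which compute type your runtime is using, you can check whether a GPU is visible
from a notebook cell. The following is a minimal sketch that assumes you have installed PyTorch in your
active environment (Studio Lab does not preinstall it).

import torch

# True when the runtime is a GPU instance and the CUDA device is visible
print(torch.cuda.is_available())

# Name of the GPU, or a fallback message on CPU instances
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")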

Compute time

When compute time for Studio Lab reaches its time limit, the instance stops all running computations.
Studio Lab does not support time limit increases.

Studio Lab automatically saves your environment when you update your environment and every time
you create a new file. Custom-installed extensions and packages persist even after your runtime has
ended.

File edits are periodically saved, but are not saved when your runtime ends. To ensure that you do not
lose your progress, save your work manually. If you have content in your Studio Lab project that you
don’t want to lose, we recommend that you back up your content elsewhere. For more information about
exporting your environment and files, see Export an Amazon SageMaker Studio Lab environment to
Amazon SageMaker Studio (p. 251).

During long computations, you do not need to keep your project open. For example, you can start training
a model, then close your browser. The instance keeps running for up to 12 hours on CPU instances and 4
hours on GPU instances. You can then sign in later to continue your work.

We recommend that you use checkpointing in your deep learning jobs. You can use saved checkpoints to
restart a job from the previously saved checkpoint. For more information, see File I/O.
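
As an illustration, the following minimal PyTorch sketch saves and restores a training checkpoint. The
model, optimizer, epoch number, and file name are stand-ins for your own, and PyTorch must be
installed in your environment.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stand-in optimizer

# Save a checkpoint, for example at the end of an epoch
torch.save({"epoch": 5,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()},
           "checkpoint.pt")

# Resume later, for example in a fresh runtime session
state = torch.load("checkpoint.pt")
model.load_state_dict(state["model_state_dict"])
optimizer.load_state_dict(state["optimizer_state_dict"])
start_epoch = state["epoch"] + 1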

Project runtime
The project runtime is the period of time when your compute instance is running.

Session
A user session begins every time you launch your project.

Onboard to Amazon SageMaker Studio Lab


To onboard to Amazon SageMaker Studio Lab, follow the steps in this guide. In the following sections,
you learn how to request a Studio Lab account, create your account, and sign in.

Topics
• Request a Studio Lab account (p. 234)
• Create a Studio Lab account (p. 235)
• Sign in to Studio Lab (p. 235)

Request a Studio Lab account


To use Studio Lab, you must first request approval to create a Studio Lab account. An AWS account
cannot be used for onboarding to Studio Lab.

The following steps show how to request a Studio Lab account.

1. Navigate to the Studio Lab landing page.


2. Select Request account.
3. Enter the required information into the form.
4. Select Submit request.
5. If you receive an email to verify your email address, follow the instructions in the email to complete
this step.

Your account request must be approved before you can register for a Studio Lab account. Your request
will be reviewed within five business days. When your account request is approved, you receive an email
with a link to the Studio Lab account registration page. This link expires seven days after your request is
approved. If the link expires, you must submit a new account request.

Note: Your account request is denied if your email has been associated with activity that violates our
Terms of Service or other agreements.

Referral codes
Studio Lab referral codes enable new account requests to be automatically approved to support machine
learning events like workshops, hackathons, and classes. With a referral code, a trusted host can get their
participants immediate access to Studio Lab. After an account has been created using a referral code, the
account continues to exist after the expiration of the code.

To get a referral code, contact Sales Support. To use a referral code, enter the code as part of the account
request form.

Create a Studio Lab account


After your request is approved, complete the following steps to create your Studio Lab account.

1. Select Create account in the account request approval email to open a new page.
2. From the new page, enter your Email, a Password, and a Username.
3. Select Create account.

You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see What is a
CAPTCHA puzzle?

Sign in to Studio Lab


After you register for your account, you can sign in to Studio Lab.

1. Navigate to the Studio Lab landing page.


2. Select Sign in to open a new page.
3. Enter your Email or Username and Password.
4. Select Sign in to open a new page to your project.

You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see What is a
CAPTCHA puzzle?

Manage your account


The following topics give information about managing your account, including changing your password,
deleting your account, and getting information that we have collected. These topics require that you
sign in to your Amazon SageMaker Studio Lab account. For more information, see Sign in to Studio
Lab (p. 235).

Change your password


Follow these steps to change your Amazon SageMaker Studio Lab password.

1. Navigate to the Studio Lab project overview page. The URL takes the following format.

https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

2. From the top-right corner, select your user name to open a dropdown menu.
3. From the dropdown menu, select Change password to open a new page.
4. Enter your current password into the Enter your current password field.
5. Enter your new password into the Create a new password and Confirm your new password fields.
6. Select Submit.

Delete your account


Follow these steps to delete your Studio Lab account.

1. Navigate to the Studio Lab project overview page. The URL takes the following format.

https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

2. From the top-right corner, select your user name to open a dropdown menu.
3. From the dropdown menu, select Delete account to open a new page.
4. Enter your password to confirm the deletion of your Studio Lab account.
5. Select Delete.

Customer information
Studio Lab collects your email address, user name, encrypted password, project files, and metadata.
When requesting an account, you can optionally choose to provide your first and last name, country,
organization name, occupation, and the reason for your interest in this product. We protect all customer
personal data with encryption. For more information about how your personal information is handled,
see the Privacy Notice.

When you delete your account, all of your information is deleted immediately. If you have an inquiry
about this, submit the Amazon SageMaker Studio Lab Form. For information and support related to AWS
compliance, see Compliance support.

Launch your Amazon SageMaker Studio Lab project runtime
The Amazon SageMaker Studio Lab project runtime lets you write and run code directly from your
browser. It is based on JupyterLab and has an integrated terminal and console. For more information
about JupyterLab, see the JupyterLab Documentation.

The following topics give information about how to manage your project runtime. These topics require
that you sign in to your Amazon SageMaker Studio Lab account. For more information about signing in,
see Sign in to Studio Lab (p. 235). For more information about your project, see Amazon SageMaker
Studio Lab components overview (p. 231).

Topics
• Start your project runtime (p. 236)
• Stop your project runtime (p. 237)
• View remaining compute time (p. 237)
• Change your compute type (p. 237)

Start your project runtime


To use Studio Lab, you must start your project runtime. This runtime gives you access to the JupyterLab
environment.

1. Navigate to the Studio Lab project overview page. The URL takes the following format.

https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

2. Under My Project, select a compute type. For more information about compute types, see Compute
instance type (p. 233).

3. Select Start runtime.

You might be asked to solve a CAPTCHA puzzle. For more information on CAPTCHA, see What is a
CAPTCHA puzzle?
4. One-time setup, the first time you start the runtime with your Studio Lab account:

a. Enter a mobile phone number to associate with your Amazon SageMaker Studio Lab account
and choose Continue.

For information on supported countries and regions, see Supported countries and regions (SMS
channel).
b. Enter the 6-digit code sent to the associated mobile phone number and choose Verify.
5. After the runtime is running, select Open project to open the project runtime environment in a new
browser tab.

Stop your project runtime


When you stop your project runtime, your files are not automatically saved. To ensure that you don't lose
your work, save all of your changes before stopping your project runtime.

• Under My Project, select Stop runtime.

View remaining compute time


Your project runtime has limited compute time based on the compute type that you select. For more
information about compute time in Studio Lab, see Compute instance type (p. 233).

• Under My Project, view Time remaining.

Change your compute type


You can switch your compute type based on your workflow. For more information about compute types,
see Compute instance type (p. 233).

1. Save any project files before changing the compute type.


2. Navigate to the Studio Lab project overview page. The URL takes the following format.

https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

3. Under My Project, select the desired compute type (CPU or GPU).


4. Confirm your choice by selecting Restart in the Restart project runtime? dialog box. Studio Lab
stops your current project runtime, then starts a new project runtime with your updated compute
type.
5. After your project runtime has started, select Open project. This opens your project runtime
environment in a new browser tab. For information about using your project runtime environment,
see Use the Amazon SageMaker Studio Lab project runtime (p. 239).

Use Amazon SageMaker Studio Lab starter assets


Amazon SageMaker Studio Lab supports the following assets to help machine learning (ML) practitioners
get started. This guide shows you how to clone notebooks for your project.

Getting started notebook

Studio Lab comes with a starter notebook that gives general information and guides you through key
workflows. When you launch your project runtime for the first time, this notebook automatically opens.

Dive into Deep Learning

Dive into Deep Learning (D2L) is an interactive, open-source book that teaches the ideas, mathematical
theory, and code that power machine learning. With over 150 Jupyter notebooks, D2L provides a
comprehensive overview of deep learning principles. For more information about D2L, see the D2L
website.

The following procedure shows how to clone the D2L Jupyter notebooks to your instance.

1. Start and open the Studio Lab project runtime environment by following Start your project
runtime (p. 236).
2. Once Studio Lab is open, choose the Git tab on the left sidebar.
3. Choose Clone a Repository. Under Git repository URL (.git), paste the D2L Git repository URL
by following the steps below. If you do not see the Clone a Repository option because you are
currently in a Git repository, return to the user directory to clone a new repository. You return to the
user directory by choosing the Folder tab on the left sidebar. In the Folder tab, beneath the file
search bar, choose the folder icon to the left of the currently open repository. Once you are in the
user directory, choose the Git tab on the left sidebar and choose Clone a Repository.
4. Navigate to the Studio Lab project overview page. The URL takes the following format.

https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

5. Under New to machine learning?, choose Dive into Deep Learning.


6. From the new Dive into Deep Learning browser tab, choose GitHub to open a new page with the
example notebooks.
7. Choose Code and copy the GitHub repository's URL in the HTTPS tab.
8. Return to the Studio Lab open project browser tab, paste the D2L repository URL, and clone the
repository.

AWS Machine Learning University

The AWS Machine Learning University (MLU) provides access to the machine learning courses used to
train Amazon’s own developers. With AWS MLU, any developer can learn how to use machine learning
with the learn-at-your-own-pace MLU Accelerator learning series. The MLU Accelerator series is designed
to help developers begin their ML journey. It offers three-day foundational courses on these three
subjects: Natural Language Processing, Tabular Data, and Computer Vision. For more information,
see Machine Learning University.

The following procedure shows how to clone the AWS MLU Jupyter notebooks to your instance.

1. Start and open the Studio Lab project runtime environment by following Start your project
runtime (p. 236).
2. Once Studio Lab is open, choose the Git tab on the left sidebar.
3. Choose Clone a Repository. Under Git repository URL (.git), paste the MLU Git repository URL
by following the steps below. If you do not see the Clone a Repository option because you are
currently in a Git repository, return to the user directory to clone a new repository. You return to the
user directory by choosing the Folder tab on the left sidebar. In the Folder tab, beneath the file
search bar, choose the folder icon to the left of the currently open repository. Once you are in the
user directory, choose the Git tab on the left sidebar and choose Clone a Repository.
4. Navigate to the Studio Lab project overview page. The URL takes the following format.

https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

5. Under New to machine learning?, choose AWS Machine Learning University.


6. From the new AWS Machine Learning University browser tab, find a course that interests you by
reading the Course Summary for each course.
7. Choose the corresponding GitHub repository of interest under Course Content, to open a new page
with the example notebooks.
8. Choose Code and copy the GitHub repository's URL in the HTTPS tab.
9. Return to the Studio Lab open project browser tab, paste the MLU repository URL, and choose Clone
to clone the repository.

Roboflow

Roboflow gives you the tools to train, fine-tune, and label objects for computer vision applications. For
more information, see https://roboflow.com/.

The following procedure shows how to clone the Roboflow Jupyter notebooks to your instance.

1. Navigate to the Studio Lab project overview page. The URL takes the following format.

https://studiolab.sagemaker.aws/users/<YOUR_USER_NAME>

2. Under Resources and community, find Try Computer Vision.


3. Under Try Computer Vision, choose a Roboflow model. For more information, see
https://roboflow.com/.
4. Follow the tutorial under the Notebook preview.

Use the Amazon SageMaker Studio Lab project runtime
The following topics give information about using the Amazon SageMaker Studio Lab project
runtime. Before you can use the Studio Lab project runtime, you must onboard to Studio Lab by
following the steps in Onboard to Amazon SageMaker Studio Lab (p. 234).

Topics
• Amazon SageMaker Studio Lab UI overview (p. 240)
• Create or open an Amazon SageMaker Studio Lab notebook (p. 241)
• Use the Amazon SageMaker Studio Lab notebook toolbar (p. 242)
• Manage your environment (p. 244)
• Use external resources in Amazon SageMaker Studio Lab (p. 248)
• Get notebook differences (p. 251)
• Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio (p. 251)
• Shut down resources (p. 255)

Amazon SageMaker Studio Lab UI overview


Amazon SageMaker Studio Lab extends the JupyterLab interface. Previous users of JupyterLab will
notice similarities between the JupyterLab and Studio Lab UI, including the workspace. For an overview
of the basic JupyterLab interface, see The JupyterLab Interface.

The following image shows Studio Lab with the file browser open and the Studio Lab Launcher
displayed.

You will find the menu bar at the top of the screen. The left sidebar contains icons to open file browsers,
resource browsers, and tools. The status bar is located at the bottom-left corner of Studio Lab.

The main work area is divided horizontally into two panes. The left pane is the file and resource browser.
The right pane contains one or more tabs for resources, such as notebooks and terminals.

Topics
• Left sidebar (p. 240)
• File and resource browser (p. 241)
• Main work area (p. 241)

Left sidebar
The left sidebar includes the following icons. When you hover over an icon, a tooltip displays the icon
name. When you choose an icon, the file and resource browser displays the described functionality.
For hierarchical entries, a selectable breadcrumb at the top of the browser shows your location in the
hierarchy.

Icon Description

File Browser

Choose the Upload Files icon to add files to Studio Lab.

Double-click a file to open the file in a new tab.

To have adjacent files open, choose a tab that contains a notebook, Python,
or text file, and then choose New View for File.

Choose the plus (+) sign on the menu at the top of the file browser to open
the Studio Lab Launcher.

Running Terminals and Kernels

You can see a list of all of the running terminals and kernels in your project.
For more information, see Shut down resources (p. 255).

Git

You can connect to a Git repository and then access a full range of Git tools
and operations. For more information, see Use external resources in Amazon
SageMaker Studio Lab (p. 248).

Table of Contents

You can access the Table of Contents for your current Jupyter notebook.

Extension Manager

You can enable and manage third-party JupyterLab extensions.

File and resource browser


The file and resource browser shows lists of your notebooks and files. On the menu at the top of the file
browser, choose the plus (+) sign to open the Studio Lab Launcher. The Launcher allows you to create a
notebook or open a terminal.

Main work area


The main work area has multiple tabs that contain your open notebooks and terminals.

Create or open an Amazon SageMaker Studio Lab notebook


When you create a notebook in Amazon SageMaker Studio Lab or open a notebook in Studio Lab, you
must select a kernel for the notebook. The following topics describe how to create and open notebooks
in Studio Lab.

For information about shutting down the notebook, see Shut down resources (p. 255).

Topics
• Open a Studio Lab notebook (p. 241)
• Create a notebook from the file menu (p. 242)
• Create a notebook from the Launcher (p. 242)

Open a Studio Lab notebook


Studio Lab can only open notebooks listed in the Studio Lab file browser. To clone a notebook into
your file browser from an external repository, see Use external resources in Amazon SageMaker Studio
Lab (p. 248).

To open a notebook

1. In the left sidebar, choose the File Browser icon to display the file browser.
2. Browse to a notebook file and double-click it to open the notebook in a new tab.

Create a notebook from the file menu


To create a notebook from the File menu

1. From the Studio Lab menu, choose File, choose New, and then choose Notebook.
2. To use the default kernel, in the Select Kernel dialog box, choose Select. Otherwise, to select a
different kernel, use the dropdown menu.

Create a notebook from the Launcher


To create a notebook from the Launcher

1. Open the Launcher by using the keyboard shortcut Ctrl + Shift + L.

Alternatively, you can open Launcher from the left sidebar: Choose the File Browser icon, and then
choose the plus (+) icon.
2. To use the default kernel from the Launcher, under Notebook, choose default:Python. Otherwise,
select a different kernel.

After you choose the kernel, your notebook launches and opens in a new Studio Lab tab.

To view the notebook's kernel session, in the left sidebar, choose the Running Terminals and Kernels
icon. You can stop the notebook's kernel session from this view.

Use the Amazon SageMaker Studio Lab notebook toolbar


Amazon SageMaker Studio Lab notebooks extend the JupyterLab interface. For an overview of the basic
JupyterLab interface, see The JupyterLab Interface.

The following image shows the toolbar and an empty cell from a Studio Lab notebook.

When you hover over a toolbar icon, a tooltip displays the icon function. You can find additional
notebook commands in the Studio Lab main menu. The toolbar includes the following icons:

Icon Description

Save and checkpoint

Saves the notebook and updates the checkpoint file.

Insert cell

Inserts a code cell below the current cell. The current cell is noted by the
blue vertical marker in the left margin.

Cut, copy, and paste cells

Cuts, copies, and pastes the selected cells.

Run cells

Runs the selected cells. The cell that follows the last-selected cell becomes
the newly selected cell.

Interrupt kernel

Interrupts the kernel, which cancels the currently-running operation. The
kernel remains active.

Restart kernel

Restarts the kernel. Variables are reset. Unsaved information is not affected.

Restart kernel and re-run notebook

Restarts the kernel. Variables are reset. Unsaved information is not affected.
Then re-runs the entire notebook.

Cell type

Displays or changes the current cell type. The cell types are:

• Code – Code that the kernel runs.


• Markdown – Text rendered as markdown.
• Raw – Content, including Markdown markup, that's displayed as text.

Checkpoint diff

Opens a new tab that displays the difference between the notebook
and the checkpoint file. For more information, see Get notebook
differences (p. 251).

Git diff

Only enabled if the notebook is opened from a Git repository. Opens a
new tab that displays the difference between the notebook and the last Git
commit. For more information, see Get notebook differences (p. 251).

default Kernel

Displays or changes the kernel that processes the cells in the notebook.

No Kernel indicates that the notebook was opened without specifying a
kernel. You can edit the notebook, but you can't run any cells.

Kernel busy status

Displays a kernel's busy status by showing the circle's edge and its interior
as the same color. The kernel is busy when it is starting and when it is
processing cells. Additional kernel states are displayed in the status bar at
the bottom-left corner of Studio Lab.

Manage your environment


Amazon SageMaker Studio Lab provides environments for your Studio Lab notebook instances.
Environments allow you to start up a Studio Lab notebook instance with the packages you want to use.
This is done by activating an environment, installing packages in the environment, and then selecting the
environment as a Kernel. For more information on activating the environment, see Create, activate, and
use new conda environments (p. 246).

Your Studio Lab environment comes with a base image installed that includes key packages and
resources. You can customize your environment by adding new packages and libraries to it. You can also
create new environments from Studio Lab, import compatible environments, reset your environment to
create space, and more.

The commands on this page are intended to be run in a Studio Lab terminal. If you want to run these
commands in a Studio Lab Jupyter notebook, prefix the command with % before running the cell. For
example, the code snippet pip list in a terminal is the same as %pip list in a Jupyter notebook.

Topics
• Base image (p. 244)
• Managing conda environments (p. 245)

Base image
The default Amazon SageMaker Studio Lab base image includes the following packages.

• Python 3.9
• bzip2
• build-essential
• curl
• git
• libgl1-mesa-glx
• nano
• rsync
• unzip
• wget
• ca-certificates
• pip
• ipykernel-6.4

Supported ML frameworks and libraries

Machine learning frameworks simplify machine learning by abstracting complex algorithms and
processes. This abstraction helps you get started with machine learning. Libraries are collections of
files, programs, and other resources that you can use in your code. Studio Lab supports the following
frameworks and libraries, which you must install manually; a sketch of example install commands
follows the list.

• PyTorch 1.9
• TensorFlow 1.15 and 2.6
• MxNet 1.8
• Hugging Face
• AutoGluon 0.3.1
• Scikit-learn 0.24
• PyTorch ecosystem
• OpenCV
• scipy
• numpy
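
For example, the following terminal sketch installs a few of these frameworks with pip. The package
names are the common PyPI names and the pinned version is illustrative only; adjust them to your
needs.

pip install torch torchvision          # PyTorch
pip install "tensorflow==2.6.*"        # TensorFlow
pip install scikit-learn               # Scikit-learn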

For a list of all of the packages currently installed in your environment, run the following command from
your Jupyter notebook.

pip list

Managing conda environments


The following sections give information about your default conda environment, how to customize it,
and how to add and remove conda environments. For more information about conda environments, see
conda environments. For a list of sample environments that you can install into Studio Lab, see Creating
Custom conda Environments. To use these sample environment YAML files with Studio Lab, see Step 4:
Install your Studio Lab conda environments in Studio (p. 255).

Your default environment


Studio Lab uses conda environments to encapsulate the software packages that are needed to run
notebooks. Your project contains a default conda environment, named default, with the IPython
kernel. This environment serves as the default kernel for your Jupyter notebooks.

View environments
To view the environments in Studio Lab you can use a terminal or Jupyter notebook. The following
command will be for a Studio Lab terminal. If you wish to run the corresponding commands in a Jupyter
notebook, see Manage your environment (p. 244).

Open the Studio Lab terminal by opening the File Browser panel, choosing the plus (+) sign on the
menu at the top of the file browser to open the Launcher, and then choosing Terminal. From the Studio
Lab terminal, list the conda environments by running the following command.

conda env list

This command outputs a list of the conda environments and their locations in the file system. When you
onboard to Studio Lab, you automatically activate the studiolab conda environment. The following is
an example of listed environments after you onboard.

# conda environments:
#
default                  /home/studio-lab-user/.conda/envs/default
studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
studiolab-safemode       /opt/amazon/sagemaker/safemode-home/.conda/envs/studiolab-safemode
base                     /opt/conda

The * marks the activated environment.

Create, activate, and use new conda environments

If you would like to maintain multiple environments for different use cases, you can create new conda
environments in your project. The following sections show how to create and activate new conda
environments. For a Jupyter notebook that shows how to create a custom environment, see Setting up a
Custom Environment in SageMaker Studio Lab.
Note
Maintaining multiple environments counts against your available Studio Lab memory.

Create conda environment

To create a conda environment, run the following conda command from your terminal. This example
creates a new environment with Python 3.9.

conda create --name <ENVIRONMENT_NAME> python=3.9

Once the conda environment is created, you can view the environment in your environment list. For more
information on how to view your environment list, see View environments (p. 245).

Activate a conda environment

To activate any conda environment, run the following command in the terminal.

conda activate <ENVIRONMENT_NAME>

After you run this command, any packages that you install using conda or pip are installed in this
environment. For more information on installing packages, see Customize your environment (p. 247).

Use a conda environment

To use your new conda environments with notebooks, make sure the ipykernel package is installed in
the environment.

conda install ipykernel

Once the ipykernel package is installed in the environment, you can select the environment as the
kernel for your notebook.
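
Putting the preceding steps together, the following terminal sketch creates an environment, installs the
kernel package, and adds an example library. The environment name my-env and the pandas package
are illustrative only.

conda create --name my-env python=3.9 -y
conda activate my-env
conda install ipykernel -y    # required to select the environment as a notebook kernel
pip install pandas            # install whatever packages your notebooks need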

You may need to restart JupyterLab to see the environment available as a kernel. This can be done
by choosing Amazon SageMaker Studio Lab in the top menu of Studio Lab and choosing Restart
JupyterLab....

When you create a new notebook from the Studio Lab Launcher, you will have the option to choose the
kernel under Notebook. For an overview of the Studio Lab UI, see Amazon SageMaker Studio Lab UI
overview (p. 240).

When a Jupyter notebook is open, you can choose the kernel by choosing Kernel from the top menu and
choose Change Kernel....

Using sample Studio Lab environments

Studio Lab provides sample custom environments through the SageMaker Studio Lab Examples
repository. The following shows how to clone and build these environments.

1. Clone the SageMaker Studio Lab Examples GitHub repository by following the instructions in Use
GitHub resources (p. 248).
2. In Studio Lab, choose the File Browser icon on the left menu, so that the File Browser panel
shows on the left.
3. Navigate to the studio-lab-examples/custom-environments directory in the File Browser.
4. Open the directory for the environment that you want to build.
5. Right click the .yml file in the folder, then select Build conda Environment.
6. You can now use the environment as a kernel after your conda environment has finished building.
For instructions on how to use an existing environment as a kernel, see Create, activate, and use new
conda environments (p. 246)

Customize your environment

You can customize your environment by installing and removing extensions and packages, as needed.
Any installed extensions and packages persist in your project, so you do not need to install your packages
every time you work on your project.
Note
Installed packages count against your available Studio Lab memory.

To view your environments, see View environments (p. 245).

To activate your environment, see Create, activate, and use new conda environments (p. 246).

To view the packages in an environment, run conda list.

Install packages

To install additional packages in the currently activated environment, run one of the following
commands in a Studio Lab terminal. Any packages that you install are saved in your persistent project
directory.

• conda install <PACKAGE>
• pip install <PACKAGE>

We don't recommend using the !pip or !conda commands because they can behave in unexpected
ways when you have multiple environments.

After you install new packages to your environment, restart the kernel to ensure that the packages work
in your notebook. This can be done by choosing Amazon SageMaker Studio Lab in the top menu of
Studio Lab and choosing Restart JupyterLab....

Remove packages

To remove a package, run the following command.

conda remove <PACKAGE_NAME>

This command will also remove any package that depends on <PACKAGE_NAME>, unless a replacement
can be found without that dependency.

To remove all of the packages in an environment, run the following command.

conda deactivate && conda env remove --name <ENVIRONMENT_NAME>

Refresh Studio Lab

To refresh Studio Lab, remove all of your environments and files.

1. List all conda environments.

conda env list

2. Activate the base environment.

conda activate base

3. Remove each environment in the list of conda environments, besides base.

conda remove --name <ENVIRONMENT_NAME> --all

4. Delete all of the files on your Studio Lab.

rm -rf *.*
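
If you maintain many environments, the removal in step 3 can be scripted. The following is a hedged
bash sketch that assumes the conda env list output format shown earlier; it removes every
environment except base.

conda activate base
for env in $(conda env list | awk '$1 !~ /^#/ && $1 != "base" {print $1}'); do
    conda remove --name "$env" --all --yes
done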

Use external resources in Amazon SageMaker Studio Lab


With Amazon SageMaker Studio Lab, you can integrate external resources, such as Jupyter notebooks
and data, from Git repositories and Amazon S3. You can also add an Open in Studio Lab button to your
GitHub repo and notebooks. This button lets you clone your notebooks directly from Studio Lab.

The following topics show how to integrate external resources.

Topics
• Use GitHub resources (p. 248)
• Add an Open in Studio Lab button to your notebook (p. 250)
• Import files from your computer (p. 250)
• Connect to Amazon S3 (p. 250)

Use GitHub resources


Studio Lab offers integration with GitHub. With this integration, you can clone notebooks and
repositories directly to your Studio Lab project.

The following topics give information about how to use GitHub resources with Studio Lab.

Studio Lab sample notebooks

To get started with a repository of sample notebooks tailored for Studio Lab, see Studio Lab Sample
Notebooks.

This repository provides notebooks for the following use cases and others.

• Computer vision
• Connecting to AWS
• Creating custom environments
• Geospatial data analysis
• Natural language processing
• Using R

Clone a GitHub repo

To clone a GitHub repo to your Studio Lab project, follow these steps.

1. Start your Studio Lab project runtime. For more information on launching Studio Lab project
runtime, see Start your project runtime (p. 236).
2. In Studio Lab, choose the File Browser icon on the left menu, so that the File Browser panel
shows on the left.
3. Navigate to your user directory by choosing the file icon beneath the file search bar.
4. Select the Git icon from the left menu to open a new dropdown menu.
5. Choose Clone a Repository.
6. Paste the repository's URL under Git repository URL (.git).
7. Select Clone.

Clone individual notebooks from GitHub

To open a notebook in Studio Lab, you must have access to the repo that the notebook is in. The
following examples describe Studio Lab permission-related behavior in various situations.

• If a repo is public, you can automatically clone the notebook into your project from the Studio Lab
preview page.
• If a repo is private, you are prompted to sign in to GitHub from the Studio Lab preview page. If you
have access to a private repo, you can clone the notebook into your project.
• If you don't have access to a private repo, you cannot clone the notebook from the Studio Lab preview
page.

The following sections show two options for copying a GitHub notebook into your Studio Lab project.
These options depend on whether the notebook has an Open in Studio Lab button.

Option 1: Copy notebook with an Open in Studio Lab button

The following procedure shows how to copy a notebook that has an Open in Studio Lab button.
If you want to add this button to your notebook, see Add an Open in Studio Lab button to your
notebook (p. 250).

1. Sign in to Studio Lab following the steps in Sign in to Studio Lab (p. 235).
2. In a new browser tab, navigate to the GitHub notebook that you want to clone.
3. In the notebook, select the Open in Studio Lab button to open a new page in Studio Lab with a
preview of the notebook.
4. If your project runtime is not already running, start it by choosing the Start runtime button at the
top of the preview page. Wait for the runtime to start before proceeding to the next step.
5. After your project runtime has started, select Copy to project to open your project runtime in a new
browser tab.

6. In the Copy from GitHub? dialog box, select Copy notebook only. This copies the notebook file to
your project.

Option 2: Clone any GitHub notebook

The following procedure shows how to copy any notebook from GitHub.

1. Navigate to the notebook in GitHub.


2. In the browser’s address bar, modify the notebook URL, as follows.

# Original URL
https://github.com/<PATH_TO_NOTEBOOK>

# Modified URL
https://studiolab.sagemaker.aws/import/github/<PATH_TO_NOTEBOOK>

3. Navigate to the modified URL. This opens a preview of the notebook in Studio Lab.
4. If your project runtime is not already running, start it by choosing the Start runtime button at the
top of the preview page. Wait for the runtime to start before proceeding to the next step.
5. After your project runtime has started, select Copy to project to open your project runtime in a new
browser tab.
6. In the Copy from GitHub? dialog box, select Copy notebook only to copy the notebook file to your
project.

Add an Open in Studio Lab button to your notebook


When you add the Open in Studio Lab button to your notebooks, others can clone your notebooks or
repositories directly to their Studio Lab projects. If you are sharing your notebook within a public GitHub
repository, your content will be publicly readable. Do not share private content, such as AWS access keys
or AWS Identity and Access Management credentials, in your notebook.

To add the functional Open in Studio Lab button to your Jupyter notebook or repository, add the
following markdown to the top of your notebook or repository.

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/<PATH_TO_YOUR_NOTEBOOK_ON_GITHUB>)
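
For example, for a hypothetical notebook stored at github.com/example-user/example-repo/blob/main/
demo.ipynb, the markdown would be the following.

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/example-user/example-repo/blob/main/demo.ipynb)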

Import files from your computer


The following steps show how to import files from your computer to your Studio Lab project.

1. Open the Studio Lab project runtime.


2. Open the File Browser panel.
3. In the actions bar of the File Browser panel, select the Upload Files button.
4. Select the files that you want to upload from your local machine.
5. Select Open.

Alternatively, you can drag and drop files from your computer into the File Browser panel.

Connect to Amazon S3
The AWS CLI enables AWS integration in your Studio Lab project. With this integration, you can pull
resources from Amazon S3 to use with your Jupyter notebooks.

To use the AWS CLI with Studio Lab, complete the following steps. For a notebook that outlines this
integration, see Using Studio Lab with AWS Resources.

1. Install the AWS CLI following the steps in Installing or updating the latest version of the AWS CLI.
2. Configure your AWS credentials by following the steps in Quick setup. The role for your AWS
account must have permissions to access the Amazon S3 bucket that you are copying data from.
3. From your Jupyter notebook, clone resources from the Amazon S3 bucket, as needed. The following
command shows how to clone all resources from an Amazon S3 path to your project. For more
information, see the AWS CLI Command Reference.

!aws s3 cp s3://<BUCKET_NAME>/<PATH_TO_RESOURCES>/ <PROJECT_DESTINATION_PATH>/ --recursive
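
Before copying data, you can confirm that your credentials are configured and that the bucket path is
reachable. The following is a minimal sketch, where <BUCKET_NAME> and <PATH_TO_RESOURCES> are
the same placeholders as above.

# Show which credentials and Region the AWS CLI is currently using
!aws configure list

# List the objects under the path you plan to copy
!aws s3 ls s3://<BUCKET_NAME>/<PATH_TO_RESOURCES>/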

Get notebook differences


You can display the difference between the current notebook and the last checkpoint, or the last Git
commit, using the Amazon SageMaker Studio Lab project UI.

Topics
• Get the difference between the last checkpoint (p. 251)
• Get the difference between the last commit (p. 251)

Get the difference between the last checkpoint


When you create a notebook, a hidden checkpoint file that matches the notebook is created. You can
view changes between the notebook and the checkpoint file, or revert the notebook to match the
checkpoint file.

To save the Studio Lab notebook and update the checkpoint file to match: Choose the Save notebook
and create checkpoint icon, located on the Studio Lab menu's left side. The keyboard shortcut for Save
notebook and create checkpoint is Ctrl + s.

To view changes between the Studio Lab notebook and the checkpoint file: Choose the Checkpoint diff
icon, located in the center of the Studio Lab menu.

To revert the Studio Lab notebook to the checkpoint file: On the main Studio Lab menu, choose File, and
then Revert Notebook to Checkpoint.

Get the difference between the last commit


If a notebook is opened from a Git repository, you can view the difference between the notebook and the
last Git commit.

To view the changes in the notebook from the last Git commit: Choose the Git diff icon in the
center of the notebook menu.

Export an Amazon SageMaker Studio Lab environment to Amazon SageMaker Studio
Amazon SageMaker Studio offers many features for machine learning and deep learning workflows
that are unavailable in Amazon SageMaker Studio Lab. This page shows how to migrate a Studio Lab
environment to Studio to take advantage of more compute capacity, storage, and features. However, you
may want to familiarize yourself with Studio's prebuilt containers, which are optimized for the full MLOps
pipeline. For more information, see Amazon SageMaker Studio Lab (p. 230).

To migrate your Studio Lab environment to Studio, you must first onboard to Studio following the steps
in Onboard to Amazon SageMaker Domain (p. 37).

Topics
• Step 1: Export your Studio Lab conda environment (p. 252)
• Step 2: Save your Studio Lab artifacts (p. 253)
• Step 3: Import your Studio Lab artifacts to Studio (p. 254)
• Step 4: Install your Studio Lab conda environments in Studio (p. 255)

Step 1: Export your Studio Lab conda environment


You can export a conda environment and add libraries or packages to the environment by following the
steps in Manage your environment (p. 244). The following example demonstrates exporting the default
environment to Studio.

1. Open the Studio Lab terminal by opening the File Browser panel, choosing the plus (+) sign
on the menu at the top of the file browser to open the Launcher, and then choosing Terminal. From the
Studio Lab terminal, list the conda environments by running the following command.

conda env list

This command outputs a list of the conda environments and their locations in the file system. When
you onboard to Studio Lab, you automatically activate the studiolab conda environment.

# conda environments:
#
default                  /home/studio-lab-user/.conda/envs/default
studiolab             *  /home/studio-lab-user/.conda/envs/studiolab
studiolab-safemode       /opt/amazon/sagemaker/safemode-home/.conda/envs/studiolab-safemode
base                     /opt/conda

We recommend that you do not export the studiolab, studiolab-safemode, and base
environments. These environments are not usable in Studio for the following reasons:

• studiolab: This sets up the JupyterLab environment for Studio Lab. Studio Lab runs a different
major version of JupyterLab than Studio, so it is not usable in Studio.
• studiolab-safemode: This also sets up the JupyterLab environment for Studio Lab. Studio Lab
runs a different major version of JupyterLab than Studio, so it is not usable in Studio.
• base: This environment comes with conda by default. The base environment in Studio Lab and
the base environment in Studio have incompatible versions of many packages.
2. For the conda environment that you want to migrate to Studio, first activate the conda
environment. The default environment changes whenever new libraries are installed in or removed
from it, so to capture the exact state of the environment, export it into a YAML file using the
command line. The following commands export the default environment into a YAML file,
creating a file called myenv.yml.

conda activate default
conda env export > ~/myenv.yml
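
The exported YAML file lists the environment name, channels, and pinned dependencies. The following
is a hypothetical, abbreviated example of what myenv.yml might contain; your actual package list and
versions will differ.

name: default
channels:
  - conda-forge
dependencies:
  - python=3.9
  - ipykernel
  - pip
  - pip:
      - sagemaker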


Step 2: Save your Studio Lab artifacts


Now that you have saved your environment to a YAML file, you can move the environment file to any
platform.

Save to a local machine using Studio Lab GUI


Note
Downloading a directory from the Studio Lab GUI by right-clicking on the directory is
currently unavailable. If you wish to export a directory, please follow the steps using the
Save to Git repository tab.

One option is to save the environment onto your local machine. To do this, use the following
procedure.

1. In Studio Lab, choose the File Browser icon on the left menu, so that the File Browser
panel shows on the left.
2. Navigate to your user directory by choosing the file icon beneath the file search bar.
3. Choose (right-click) the myenv.yml file and then choose Download. You can repeat this process
for other files you want to import to Studio.

Save to a Git repository

Another option is to save your environment to a Git repository. This option uses GitHub as an
example. These steps require a GitHub account and repository. For more information, visit GitHub.
The following procedure shows how to synchronize your content with GitHub using the Studio Lab
terminal.

1. From the Studio Lab terminal, navigate to your user directory and make a new directory to
contain the files you want to export.

cd ~
mkdir <NEW_DIRECTORY_NAME>

2. After you create a new directory, copy any file or directory you want to export to
<NEW_DIRECTORY_NAME>.

Copy a file using the following code format:

cp <FILE_NAME> <NEW_DIRECTORY_NAME>

For example, replace <FILE_NAME> with myenv.yml.

Copy any directory using the following code format:

cp -r <DIRECTORY_NAME> <NEW_DIRECTORY_NAME>

For example, replace <DIRECTORY_NAME> with any directory name in your user directory.
3. Navigate to the new directory and initialize the directory as a Git repository using the following
command. For more information, see the git-init documentation.

cd <NEW_DIRECTORY_NAME>
git init

4. Using Git, add all relevant files and then commit your changes.

git add .
git commit -m "<COMMIT_MESSAGE>"

For example, replace <COMMIT_MESSAGE> with Add Amazon SageMaker Studio Lab
artifacts to GitHub repository to migrate to Amazon SageMaker Studio.
5. Push the commit to your remote repository. This repository has the format
https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git, where
<GITHUB_USERNAME> is your GitHub user name and <REPOSITORY_NAME> is your remote
repository name. Create a branch <BRANCH_NAME> to push the content to the GitHub
repository.

git branch -M <BRANCH_NAME>
git remote add origin https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git
git push -u origin <BRANCH_NAME>

Step 3: Import your Studio Lab artifacts to Studio


The following procedure shows how to import artifacts to Studio. You first have to open Amazon
SageMaker Studio. For more information, see Launch Amazon SageMaker Studio (p. 133).

From Studio, you can import files from your local machine or from a Git repository. You can do this using
the Studio GUI or terminal. The following procedure uses the examples from Step 2: Save your Studio
Lab artifacts (p. 253).

Import using the Studio GUI

If you saved the files to your local machine, you can import the files to Studio using the following
steps.

1. Open the File Browser panel at the top left of Studio.
2. Choose the Upload Files icon on the menu at the top of the File Browser panel.
3. Navigate to the file you want to import, then choose Open.

Note
If you wish to import a directory into Studio, first compress the directory on your
local machine to a file. On a Mac, right-click the directory and choose Compress
"<DIRECTORY_NAME>". In Windows, right-click the directory and choose Send to, and
then choose Compressed (zipped) folder. After the directory is compressed, import the
compressed file using the preceding steps. Unzip the compressed file by navigating to the
Studio terminal and running the command unzip <DIRECTORY_NAME>.zip.
Import using a Git repository

This example provides two options for how to clone a GitHub repository into Studio. You can use
the Studio GUI by choosing the Git tab on the left side of Studio. Choose Clone a Repository,
then paste your GitHub repository URL from Step 2: Save your Studio Lab artifacts (p. 253).
Another option is to use the Studio terminal by using the following procedure.

1. In Studio, open the Launcher. To open the Launcher, choose Amazon SageMaker Studio at the
top-left corner of Studio. To learn about all the available ways to open the Launcher, see Use the
Amazon SageMaker Studio Launcher (p. 141).
2. In the Launcher, in the Notebooks and compute resources section, choose Change
environment.
3. In the Change environment dialog, use the Image dropdown list to select the Data Science
image and choose Select. This image comes with conda pre-installed.
4. In the Studio Launcher, choose Open image terminal.
5. From the image terminal, run the following command to clone your repository. This command
creates a directory named after <REPOSITORY_NAME> in your Studio instance and clones your
artifacts into that repository.

git clone https://github.com/<GITHUB_USERNAME>/<REPOSITORY_NAME>.git

Step 4: Install your Studio Lab conda environments in Studio


You can now recreate your conda environment by using your YAML file in your Studio instance. Open
the Studio Launcher. For more information on opening the Launcher, see Amazon SageMaker Studio
Launcher. From the Launcher, choose Open image terminal. In the terminal navigate to the directory
that contains the YAML file, then run the following commands.

conda env create --file <ENVIRONMENT_NAME>.yml
conda activate <ENVIRONMENT_NAME>

After these commands are complete, you can select your environment as the kernel for your Studio
notebook instances. To view the available environments, run conda env list. To activate an
environment, run conda activate <ENVIRONMENT_NAME>.

Shut down resources


In this guide, you will learn how to shut down individual resources, including notebooks, terminals, and
kernels. You can also shut down all resources in one of these categories at the same time.

Topics
• Shut down an open notebook (p. 255)
• Shut down resources (p. 256)

Shut down an open notebook


You can shut down an open notebook from the Amazon SageMaker Studio Lab File menu or from the
Running Terminals and Kernels pane.
Note
When you shut down a notebook, any unsaved information in the notebook is lost. The
notebook is not deleted.

To shut down an open notebook from the File menu

1. Save the notebook contents by choosing the Save and checkpoint icon on the notebook toolbar.
2. Choose File, then Close and Shutdown Notebook.


3. Choose OK.

Shut down resources

On the left sidebar of Studio Lab, you will find the Running Terminals and Kernels pane and icon.
The Running Terminals and Kernels pane has three sections. Each section lists all of the resources
of that type. You can shut down each resource individually, or shut down all resources in a section
simultaneously.

When you shut down all resources in a section, the following occurs:

• KERNELS – All kernels, notebooks, and consoles are shut down.


• TERMINALS – All terminals are shut down.

To shut down resources

1. In the left sidebar, choose the Running Terminals and Kernels icon.
2. Do either of the following:

• To shut down a specific resource: Choose the SHUT DOWN icon on the same row as the resource.
• To shut down all resources in a section: Choose Shut Down All, which is located to the right of the
section label. After a confirmation dialog box appears, choose Shut down all to proceed.

Troubleshooting
This guide shows common errors that might occur when you use Amazon SageMaker Studio Lab. Each
error includes a description, as well as a solution.
Note
You cannot share your password with multiple users or use Studio Lab to mine cryptocurrency.
We don’t recommend using Studio Lab for production tasks because of runtime limits.

Can’t access account

If you can’t access your account, verify that you are using the correct email and password. If you have
forgotten your password, use the following steps to reset your password. If you still cannot access your
account, you must request and register for a new account using the instructions in Onboard to Amazon
SageMaker Studio Lab (p. 234).

Forgot password

If you forget your password, you must reset it using the following steps.

1. Navigate to the Studio Lab landing page.


2. Select Sign in.
3. Select Forgot password? to open a new page.
4. Enter the email address that you used to sign up for an account.
5. Select Send reset link to send an email with a password reset link.
6. From the password reset email, select Reset your password.
7. Enter your new password.
8. Select Submit.

Can't launch project runtime

If the Studio Lab project runtime does not launch, try launching it again. If this doesn't work, switch
the instance type from CPU to GPU (or the reverse). For more information, see Change your compute
type (p. 237).

Runtime stopped running unexpectedly

If there is an issue with the environment used to run JupyterLab, then Studio Lab will automatically
recreate the environment. Studio Lab does not support manual activation of this process.

Conflicting versions

Because you can add packages and modify your environment as needed, you may run into conflicts
between packages in your environment. If a conflict occurs, you must remove the conflicting
package.

Environment build fails

When you build an environment from a YAML file, a package-version conflict or file issue might cause a
build to fail. To resolve this, remove the environment by running the following command. Do this before
attempting to build it again.

conda remove --name <YOUR_ENVIRONMENT> --all

Error message about allowing a script download from the domain *.awswaf.com

Studio Lab uses the web application firewall service AWS WAF to protect your resources; AWS WAF uses
JavaScript. If you are using a browser security plugin that prevents JavaScript from downloading, this
error may appear. To use Studio Lab, allow the JavaScript download from *.awswaf.com as a trusted
domain. For more information on AWS WAF, see AWS WAF in the AWS WAF, AWS Firewall Manager, and
AWS Shield Advanced Developer Guide.

Disk space is full

If you run into a notification saying that your disk space is full, or a File Load Error
for <FILE_NAME> while attempting to open a file, you can remove files, directories, libraries, or
environments to increase space. For more information on managing your libraries and environments, see
Manage your environment (p. 244).

Project runtime is in safe mode notification

If you run into a notification that Project runtime is in safe mode, you must free up some disk space to
resume using the Studio Lab project runtime. Follow the instructions in the preceding troubleshooting
item, Disk space is full. Once at least 500 MB of space has been cleared, you may restart the project
runtime to use Studio Lab. This can be done by choosing Amazon SageMaker Studio Lab in the top
menu of Studio Lab and choosing Restart JupyterLab....

Cannot import cv2

If you run into an error when importing cv2 after installing opencv-python, you must uninstall
opencv-python and install opencv-python-headless as follows.

%pip uninstall opencv-python --yes
%pip install opencv-python-headless

You can then import cv2 as expected.
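
To confirm the fix, run a quick check in a notebook cell. It should print the installed OpenCV version
without raising an error.

import cv2

print(cv2.__version__)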

Studio Lab becomes unresponsive when opening large files

The Studio Lab IDE may fail to render when large files are opened, resulting in blocked access to Studio
Lab resources. To resolve this, reset the Studio Lab workspace using the following procedure.

1. After you open the IDE, copy the URL in your browser's address bar. This URL should be in the
https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab format. Close the tab.
2. In a new tab, paste the URL and remove anything after
https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab.
3. Add ?reset to the end of the URL, so it is in the
https://xxxxxx.studio.us-east-2.sagemaker.aws/studiolab/default/jupyter/lab?reset format.
4. Navigate to the updated URL. This resets the saved UI state and makes the Studio Lab IDE
responsive.

Amazon SageMaker Canvas


Amazon SageMaker Canvas gives you the ability to use machine learning to generate predictions without
needing to write any code. The following are some use cases where you can use SageMaker Canvas:

• Predict customer churn
• Plan inventory efficiently
• Optimize price and revenue
• Improve on-time deliveries
• Classify text or images based on custom categories
• Identify objects and text in images
• Extract information from documents

With Canvas, you can access Ready-to-use models or build a custom model trained on your data.

The Ready-to-use models (p. 289) in Canvas can extract insights from your data for a variety of use
cases. You don’t have to build a model to use Ready-to-use models because they are powered by Amazon
AI services, including Amazon Rekognition, Amazon Textract, and Amazon Comprehend. You only have to
import your data and start using a solution to generate predictions.

If you want a model that is customized to your use case and trained with your data, you can build a
model (p. 297). You can get predictions customized to your data by doing the following:

1. Import your data from one or more data sources.
2. Build a predictive model.
3. Evaluate the model's performance.
4. Generate predictions with the model.

Canvas supports the following types of custom models:

• Numeric prediction (also known as regression)
• Categorical prediction for 2 and 3+ categories (also known as binary and multi-class classification)
• Time series forecasting
• Single-label image prediction (also known as image classification)
• Multi-category text prediction (also known as multi-class text classification)

You can also bring your own models into Canvas from Amazon SageMaker Studio.


To learn more about pricing, see the SageMaker Canvas pricing page. You can also see Manage billing
and cost in SageMaker Canvas (p. 400) for more information.

SageMaker Canvas is currently available in the following Regions:

• US East (Ohio)
• US East (N. Virginia)
• US West (Oregon)
• Asia Pacific (Mumbai)
• Asia Pacific (Seoul)
• Asia Pacific (Singapore)
• Asia Pacific (Sydney)
• Asia Pacific (Tokyo)
• Europe (Frankfurt)
• Europe (Ireland)

Topics
• Are you a first-time SageMaker Canvas user? (p. 259)
• Getting started with using Amazon SageMaker Canvas (p. 259)
• Setting Up and Managing Amazon SageMaker Canvas (for IT Administrators) (p. 264)
• Use Ready-to-use models (p. 289)
• Use custom models (p. 297)
• Logging out of Amazon SageMaker Canvas (p. 392)
• Limitations and troubleshooting (p. 393)
• Manage billing and cost in SageMaker Canvas (p. 400)

Are you a first-time SageMaker Canvas user?


If you are a first-time user of SageMaker Canvas, we recommend that you begin by reading the following
sections:

• For IT administrators – Setting Up and Managing Amazon SageMaker Canvas (for IT Administrators) (p. 264)
• For analysts and individual users – Getting started with using Amazon SageMaker Canvas (p. 259)

Getting started with using Amazon SageMaker Canvas
This guide tells you how to get started with using SageMaker Canvas. If you're an IT administrator,
see Setting Up and Managing Amazon SageMaker Canvas (for IT Administrators) (p. 264) to set up
SageMaker Canvas for your users.

If you're a business user or analyst, read the following sections.

Topics
• Prerequisites for setting up Amazon SageMaker Canvas (p. 260)
• Step 1: Log in to Amazon SageMaker Canvas as a business user (p. 262)
• Step 2: Use SageMaker Canvas to get predictions (p. 264)


Prerequisites for setting up Amazon SageMaker Canvas


To set up Amazon SageMaker Canvas, you can either contact your administrator or do the following:

• Set up an Amazon SageMaker Domain
• Give yourself permissions to use specific features in Canvas
• Optional: Give yourself permissions to upload local files
• Optional: Give yourself permissions to build custom image and text prediction models
• Optional: Give yourself permissions to use Ready-to-use models
• Optional: Give yourself permissions to do time series forecasts
• Optional: Give yourself permissions to send batch predictions to Amazon QuickSight
• Optional: Give yourself permissions to register models to the model registry
• Optional: Give yourself permissions to collaborate with Amazon SageMaker Studio
• Optional: Give yourself permissions to import Amazon Redshift data

The following sections describe how to set up an Amazon SageMaker Domain and give yourself
SageMaker Canvas permissions.
Important
For you to set up Amazon SageMaker Canvas, your version of Amazon SageMaker Studio must
be 3.19.0 or later. For information about updating Amazon SageMaker Studio, see Shut down
and Update SageMaker Studio (p. 199).

Onboard to Domain using IAM Identity Center


To onboard to a Domain, use the following procedure:

1. Open the SageMaker console.


2. Choose Domains in the navigation pane.
3. On the Domains page, choose Create domain.
4. Choose Standard setup.
5. Choose Configure.

Use the following procedure to configure the general settings for the Domain:

1. Under Permission, for IAM role, choose an option from the role selector.

If you choose Enter a custom IAM role ARN, the role must have, at a minimum, an attached trust
policy that grants SageMaker permission to assume the role; a minimal example is sketched after this
procedure. For more information, see SageMaker Roles (p. 3086).

If you choose Create a new role, the Create an IAM role dialog opens:

• Choose Create role. SageMaker creates a new AmazonSageMaker-ExecutionPolicy IAM role
with the AmazonSageMakerFullAccess policy attached.
2. Under Network and storage, specify the following:

• Your VPC information – For more information, see Choose an Amazon VPC (p. 46) and Configure
Amazon SageMaker Canvas in a VPC without internet access (p. 285).
• (Optional) Encryption key – SageMaker uses an AWS KMS key to encrypt your Amazon Elastic File
System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) file systems. By default, it
uses an AWS managed key. To use a customer managed key, enter its key ID or Amazon Resource
Name (ARN). For more information, see Protect Data at Rest Using Encryption (p. 3043).


Note
Encryption in transit is only available for Amazon SageMaker Studio.
3. Select Next.
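The trust policy mentioned in step 1 must, at a minimum, let SageMaker assume the role. The following is a minimal sketch of creating such a role with the AWS CLI; MyCanvasExecutionRole and trust-policy.json are placeholder names for illustration.

# trust-policy.json grants the SageMaker service permission to assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
aws iam create-role --role-name MyCanvasExecutionRole \
    --assume-role-policy-document file://trust-policy.json

You would still need to attach permissions policies (for example, AmazonSageMakerFullAccess) to the role before using it as an execution role.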

Use the following procedure to configure the SageMaker Canvas settings for the Domain:

1. For the Canvas base permissions configuration, leave the Enable Canvas base permissions
option turned on (it is turned on by default). This attaches the AmazonSageMakerCanvasFullAccess
policy to your user's execution role and establishes the minimum required permissions to use the
SageMaker Canvas app.
2. (Optional) For the Canvas Ready-to-use models configuration, leave the Enable Canvas Ready-to-
use models option turned on to give your users permissions to generate predictions with Ready-to-
use models in Canvas (it is turned on by default).
3. (Optional) For the Time series forecasting configuration, leave the Enable time series forecasting
option turned on to give your users permissions to do time series forecasting in SageMaker Canvas
(it is turned on by default).

4. (Optional) If you left Enable time series forecasting turned on, select Create and use a new
execution role, or select Use an existing execution role if you already have an IAM role with the
required Amazon Forecast permissions attached (for more information, see the IAM role setup
method (p. 278)).
5. (Optional) For the ML Ops permissions configuration section, leave the Enable Model Registry
registration permissions for all users option turned on to give your users permissions to register
their model version to the SageMaker model registry (it is turned on by default). For more
information, see Register a model version in the SageMaker model registry (p. 373).
6. (Optional) Add Tags to track your cost and usage trends in AWS Billing and Cost Management.
SageMaker adds the tags you specify in the Domain to all of the SageMaker Canvas apps you
create in the Domain. For more information about billing and tags, see Manage billing and cost in
SageMaker Canvas (p. 400).
7. Finish making any other changes to your Domain setup, and then choose Submit.

Note
If you encounter any issues with granting permissions through the console, such as permissions
for Ready-to-use models, see the topic Troubleshooting issues with granting permissions
through the SageMaker console (p. 393).

When you set up the Domain, SageMaker Canvas creates an Amazon S3 bucket with a name that uses the
following pattern: sagemaker-<Region>-<your-account-id>. Your Canvas application data, such as
imported datasets and batch predictions, are stored in the Canvas/ folder in the bucket.
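To verify where this data lands, you can list the Canvas/ folder with a CLI sketch like the following; the Region and account ID shown are placeholders for your own values.

# List Canvas application data in the default SageMaker bucket
aws s3 ls s3://sagemaker-us-east-1-111122223333/Canvas/ --recursive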

Give yourself permissions to use specific features in Canvas


The following information outlines the permissions that you can grant to a Canvas user to allow the use
of various features and functionalities within Canvas:

• Local file upload. The permissions for local file upload are turned on by default in the Canvas base
permissions when setting up your Domain. If you don’t have the ability to upload local files from your
machine to SageMaker Canvas, you can attach a CORS policy to the default bucket that SageMaker
created for your Domain (sagemaker-<Region>-<your-account-id>). For more information, see
Grant Your Users Permissions to Upload Local Files.


• Custom image and text prediction models. The permissions for building custom image and
text prediction models are turned on by default in the Canvas base permissions when setting
up your Domain. However, if you have a custom IAM configuration and don't want to attach the
AmazonSageMakerCanvasFullAccess policy to your user's IAM execution role, then you must explicitly
grant your user the necessary permissions. For more information, see Grant Your Users Permissions to
Build Custom Image and Text Prediction Models (p. 275).
• Ready-to-use models. You might want to have the ability to use the Canvas Ready-to-use models
to make predictions for your data. The permissions are turned on by default when setting up your
Domain, or you can edit the permissions for a Domain that you’ve already created. The Canvas Ready-
to-use models permissions option adds the AmazonSageMakerCanvasAIServicesAccess policy to
your execution role. For more information, see the Get started (p. 290) section of the Ready-to-use
models documentation.
• Time series forecasting. If you’d like to have the ability to perform forecasts on time series data,
you can add time series forecasting permissions when setting up your Domain, or you can edit the
permissions for a Domain or user profile after creating your Domain. The required permissions are the
AmazonSageMakerCanvasForecastAccess managed policy and a trust relationship with Amazon
Forecast to the AWS IAM role you chose when setting up the user profile. For instructions on how
to add these permissions to your IAM role, see Grant Your Users Permissions to Perform Time Series
Forecasting.
• Send batch predictions to Amazon QuickSight. You might want to have the ability to send batch
predictions, or datasets of predictions you generate from a custom model, to Amazon QuickSight for
analysis. In QuickSight, you can build and publish predictive dashboards with your prediction results.
For instructions on how to add these permissions to your Canvas user's IAM role, see Grant Your Users
Permissions to Send Predictions to Amazon QuickSight.
• Register model versions to the model registry. You might want to register versions of your model
to the SageMaker model registry, which is a repository for tracking the status of updated versions of
your model. A data scientist or MLOps team working in the SageMaker model registry can view the
versions of your model that you’ve built and approve or reject them. Then, they can deploy your model
version to production or kick off an automated workflow. Model registration permissions are turned on
by default for your Domain. You can manage permissions at the user profile level and grant or remove
permissions to specific users. For more information, see Register a model version in the SageMaker
model registry (p. 373).
• Collaboration with data scientists. If you want to collaborate with Studio users and share models,
you must add additional permissions to the AWS IAM role you chose when setting up the user profile.
For instructions on how to add the policy to the role, see Grant Users Permissions to Collaborate with
Studio.
• Import data from Amazon Redshift. If you want to import data from Amazon Redshift, you must give
yourself additional permissions. You must add the AmazonRedshiftFullAccess managed policy to
the AWS IAM role you chose when setting up the user profile. For instructions on how to add the policy
to the role, see Grant Users Permissions to Import Amazon Redshift Data.

Note
The necessary permissions to import through other data sources, such as Amazon
Athena and SaaS platforms, are included in the AmazonSageMakerFullAccess and
AmazonSageMakerCanvasFullAccess policies. If you followed the standard setup instructions,
these policies should already be attached to your execution role. For more information about
these data sources and their permissions, see Connect to data sources (p. 310).

Step 1: Log in to Amazon SageMaker Canvas as a business user


Contact your administrator to guide you through the process of setting up Amazon SageMaker Canvas.

When the initial setup is complete, you can access SageMaker Canvas by doing the following.

1. Navigate to the SageMaker console.


2. In the navigation pane, choose Canvas.


3. In the Get Started box, select your user profile from the dropdown.
4. Choose Open Canvas to open the application.

When you log into SageMaker Canvas for the first time, there is a welcome message with quick getting
started tutorials that you can follow for a walkthrough of the SageMaker Canvas application.

You can follow the Get started with Canvas tutorial for a high-level overview of the SageMaker Canvas
application. There are also shorter tutorials that guide you through the individual steps of using
SageMaker Canvas. These tutorials show you how to import a dataset, build a model, analyze the results
of a built model, and generate predictions with your model. You can revisit the tutorials at any time by
choosing the Help button and then choosing one of the tutorials.


Step 2: Use SageMaker Canvas to get predictions


After you’ve logged in to Canvas, you can start building models and generating predictions for your data.

You can either use Canvas Ready-to-use models to make predictions without building a model, or you
can build a custom model for your specific business problem. Review the following information to decide
whether Ready-to-use models or custom models are best for your use case.

• Ready-to-use models. With Ready-to-use models, you can use pre-built models to extract insights
from your data. The Ready-to-use models cover a variety of use cases, such as language detection and
document analysis. To get started making predictions with Ready-to-use models, see Use Ready-to-use
models (p. 289).
• Custom models. With custom models, you can build a variety of model types that are customized to
make predictions for your data. Use custom models if you’d like to build a model that is trained on
your business-specific data and if you’d like to use features such as collaborating with data scientists
and evaluating your model’s performance. To get started with building a custom model, see Use
custom models (p. 297).

You can also bring your own model (BYOM) from other features in SageMaker. An Amazon SageMaker
Studio user can share their model with a Canvas user, and the Canvas user can generate predictions with
the model. To learn more, see Bring your own model to SageMaker Canvas.

Setting Up and Managing Amazon SageMaker Canvas (for IT Administrators)
You can use the information in this section to help your users do the following:

• Optional: Grant your users permissions to upload their files locally.
• Set up Okta SSO for your users.
• Update SageMaker Canvas.


• Clean up or delete the installation of SageMaker Canvas.
• Optional: Set up Amazon Forecast so users can do time series forecasting.
• Optional: Set up an Amazon Virtual Private Cloud.
• Optional: Encrypt data using AWS Key Management Service.
• Optional: Grant your users permissions to import Amazon Redshift data.

You can also set up SageMaker Canvas for your users with AWS CloudFormation. For more information,
see AWS::SageMaker::App in the AWS CloudFormation User Guide.

Topics
• Grant Your Users Permissions to Upload Local Files (p. 265)
• Set Up SageMaker Canvas for Your Users (p. 267)
• Encrypt Your SageMaker Canvas Data with AWS KMS (p. 271)
• Grant Your Users Permissions to Build Custom Image and Text Prediction Models (p. 275)
• Grant Your Users Permissions to Perform Time Series Forecasting (p. 275)
• Update SageMaker Canvas for Your Users (p. 279)
• Request a Quota Increase (p. 280)
• Grant Users Permissions to Import Amazon Redshift Data (p. 281)
• Grant Users Permissions to Collaborate with Studio (p. 282)
• Grant Your Users Permissions to Send Predictions to Amazon QuickSight (p. 283)
• Manage apps (p. 284)
• Configure Amazon SageMaker Canvas in a VPC without internet access (p. 285)

Grant Your Users Permissions to Upload Local Files


If your users are uploading files from their local machines to SageMaker Canvas, you must attach a
CORS configuration to the Amazon S3 bucket that they're using. When one of your users first accesses
SageMaker Canvas, SageMaker creates an Amazon S3 bucket with a name that uses the following
pattern: sagemaker-{Region}-{account-ID}. SageMaker Canvas adds your users' data to the bucket
whenever they upload a file.

To grant users permissions to upload local files to the bucket, you can attach a CORS configuration to
it using either of the following procedures. You can use the first method when setting up your Domain
or editing the existing Domain settings, where you opt in to allow SageMaker to attach the CORS
configuration to the default bucket for you. The second method is the manual method, where you can
attach the CORS configuration to the bucket yourself.

Domain setup method


To grant your users permissions to upload local files, you can choose Enable Canvas permissions when
setting up your Domain. This attaches a Cross-Origin Resource Sharing (CORS) configuration to the
SageMaker Amazon S3 bucket created for your account and grants all users in the Domain permission to
upload local files into SageMaker Canvas. By default, the permissions option is turned on when you set
up a Domain, but you can turn off this option if you don’t want to grant your users permission to upload
files.
Note
If you have an existing CORS configuration on the SageMaker Amazon S3 bucket, turning on
Enable Canvas permissions overwrites the existing configuration with the new configuration.


The following procedure shows how you can turn on this option when doing a Quick setup for your
Domain in the console.

1. In the User profile section, enter a Name for the user.


2. Select an Execution role for the user.
3. Turn on Enable SageMaker Canvas permissions. (By default, this option is turned on.)
4. Finish setting up the Domain.

If you are doing a Standard setup for your Domain, then use the following procedure for the Canvas
settings section to turn on local file upload.

1. For Enable and configure Canvas permissions, select Local file upload. (It's already checked by
default.)
2. Choose Next.
3. Finish setting up the Domain.

Your users can now upload local files into their SageMaker Canvas application.

You can also turn on or turn off local upload permissions for an existing Domain by using the following
procedure.

1. Go to the Amazon SageMaker console.


2. Choose Domains in the navigation pane.
3. From the list of Domains, choose your Domain.
4. On the Domain settings page, choose the Domain settings tab.
5. Choose Edit.
6. In the navigation pane, choose Canvas settings.
7. Select or deselect Enable local file upload.
8. Finish any other modifications you want to make to the Domain, and then choose Submit to submit
your changes.

Amazon S3 bucket method


If you want to manually attach the CORS configuration to the SageMaker Amazon S3 bucket, use the
following procedure.

1. Sign in to https://console.aws.amazon.com/s3/.
2. Choose the bucket with the name that uses the following pattern:
sagemaker-{region}-{account-ID}.
3. Choose Permissions.
4. Navigate to Cross-origin resource sharing (CORS).
5. Choose Edit.
6. Add the following CORS policy:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "POST"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": []
    }
]

7. Choose Save changes.

In the preceding procedure, the CORS policy must have "POST" listed under AllowedMethods.

After you've gone through the procedure, you should have:

• An IAM role assigned to each of your users.


• Amazon SageMaker Studio runtime permissions for each of your users. SageMaker Canvas uses Studio
to run the commands from your users.
• If the users are uploading files from their local machines, a CORS policy attached to their Amazon S3
bucket.

If your users still can't upload the local files after you update the CORS policy, the browser might be
caching the CORS settings from a previous upload attempt. If they're running into issues, instruct them
to clear their browser cache and try again.
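If you prefer to apply the CORS configuration with the AWS CLI instead of the console, the following sketch is equivalent; the bucket name is a placeholder, and note that the s3api command wraps the rules shown above in a CORSRules object.

# Write the CORS rules to a file, then attach them to the default bucket
cat > cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["POST"],
      "AllowedOrigins": ["*"],
      "ExposeHeaders": []
    }
  ]
}
EOF
aws s3api put-bucket-cors --bucket sagemaker-us-east-1-111122223333 \
    --cors-configuration file://cors.json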

Set Up SageMaker Canvas for Your Users


To set up Amazon SageMaker Canvas, do the following:

• Create an Amazon SageMaker Domain.
• Create user profiles for the Domain.
• Set up Okta Single Sign On (Okta SSO) for your users.
• Activate link sharing for models.

Use Okta Single Sign-On (Okta SSO) to grant your users access to Amazon SageMaker Canvas.
SageMaker Canvas supports SAML 2.0 SSO methods. The following sections guide you through
procedures to set up Okta SSO.

To set up a Domain, see Onboard to Amazon SageMaker Studio Using IAM. You can use the
following information to help you complete the procedure in that section:

• You can ignore the step about creating projects.
• You don't need to provide access to additional Amazon S3 buckets. Your users can use the default
bucket that we provide when we create a role.
• To grant your users access to share their notebooks with data scientists, turn on Notebook Sharing
Configuration.
• Use Amazon SageMaker Studio version 3.19.0 or later. For information about updating Amazon
SageMaker Studio, see Shut down and Update SageMaker Studio (p. 199).

Use the following procedure to set up Okta. For all of the following procedures, you specify the same
IAM role for IAM-role.


Add the SageMaker Canvas application to Okta


Set up the sign-on method for Okta.

1. Sign in to the Okta Admin dashboard.


2. Choose Add application. Search for AWS Account Federation.
3. Choose Add.
4. Optional: Change the name to Amazon SageMaker Canvas.
5. Choose Next.
6. Choose SAML 2.0 as the Sign-On method.
7. Choose Identity Provider Metadata to open the metadata XML file. Save the file locally.
8. Choose Done.

Set up ID federation in IAM


AWS Identity and Access Management (IAM) is the AWS service that you use to manage access to your
AWS account.

1. Sign in to the AWS console.


2. Choose AWS Identity and Access Management (IAM).
3. Choose Identity Providers.
4. Choose Create Provider.
5. For Configure Provider, specify the following:

• Provider Type – From the dropdown list, choose SAML.
• Provider Name – Specify Okta.
• Metadata Document – Upload the XML document that you've saved locally from step 7 of Add the
SageMaker Canvas application to Okta (p. 268).
6. Find your identity provider under Identity Providers. Copy its Provider ARN value.
7. For Roles, choose the IAM role that you're using for Okta SSO access.
8. Under Trust Relationship for the IAM role, choose Edit Trust Relationship.
9. Modify the IAM trust relationship policy by specifying the Provider ARN value that you've copied
and adding the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::123456789012:saml-provider/Okta"
            },
            "Action": [
                "sts:AssumeRoleWithSAML",
                "sts:SetSourceIdentity",
                "sts:TagSession"
            ],
            "Condition": {
                "StringEquals": {
                    "SAML:aud": "https://signin.aws.amazon.com/saml"
                }
            }
        }
    ]
}

10. For Permissions, add the following policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AmazonSageMakerPresignedUrlPolicy",
"Effect": "Allow",
"Action": [
"sagemaker:CreatePresignedDomainUrl",
"sagemaker:CreatePresignedDomainUrlWithPrincipalTag"
],
"Resource": "*"
}
]
}
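Steps 3 through 5 of the preceding procedure can also be sketched with the AWS CLI, assuming you saved the Okta metadata document locally as metadata.xml (a placeholder name).

# Create the SAML identity provider from the Okta metadata document
aws iam create-saml-provider --name Okta \
    --saml-metadata-document file://metadata.xml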

Configure SageMaker Canvas in Okta


Configure Amazon SageMaker Canvas in Okta using the following procedure.

To configure Amazon SageMaker Canvas to use Okta, follow the steps in this section. You must specify
unique user names for each SageMakerStudioProfileName field. For example, you can use user.login
as a value. If the username is different from the SageMaker Canvas profile name, choose a different
uniquely identifying attribute. For example, you can use an employee's ID number for the profile name.

For an example of values that you can set for Attributes, see the code following the procedure.

1. Under Directory, choose Groups.


2. Add a group with the following pattern: sagemaker#canvas#IAM-role#AWS-account-id.
3. In Okta, open the AWS Account Federation application integration configuration.
4. Select Sign On for the AWS Account Federation application.
5. Choose Edit and specify the following:

• SAML 2.0
• Default Relay State – https://Region.console.aws.amazon.com/sagemaker/home?
region=Region#/studio/canvas/open/StudioId. You can find the Studio ID in the console:
https://console.aws.amazon.com/sagemaker/
6. Choose Attributes.
7. In the SageMakerStudioProfileName fields, specify unique values for each username. The
usernames must match the usernames that you've created in the AWS console.

Attribute 1:
Name: https://aws.amazon.com/SAML/Attributes/PrincipalTag:SageMakerStudioUserProfileName
Value: ${user.login}

Attribute 2:
Name: https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys
Value: {"SageMakerStudioUserProfileName"}


8. Select Environment Type. Choose Regular AWS.


• If your environment type isn't listed, you can set your ACS URL in the ACS URL field. If your
environment type is listed, you don't need to enter your ACS URL.
9. For Identity Provider ARN, specify the ARN you used in step 6 of the preceding procedure.
10. Specify a Session Duration.
11. Choose Join all roles.
12. Turn on Use Group Mapping by specifying the following fields:

• App Filter – okta
• Group Filter – ^aws\#\S+\#(?<IAM-role>[\w\-]+)\#(?<accountid>\d+)$
• Role Value Pattern – arn:aws:iam::$accountid:saml-provider/Okta,arn:aws:iam::
$accountid:role/IAM-role
13. Choose Save/Next.
14. Under Assignments, assign the application to the group that you've created.

Add optional policies on access control in IAM


In IAM, you can apply the following policy to the administrator user who creates the user profiles.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CreateSageMakerStudioUserProfilePolicy",
"Effect": "Allow",
"Action": "sagemaker:CreateUserProfile",
"Resource": "*",
"Condition": {
"ForAnyValue:StringEquals": {
"aws:TagKeys": [
"studiouserid"
]
}
}
}
]
}

If you choose to add the preceding policy to the admin user, you must use the following permissions
from Set up ID federation in IAM (p. 268).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonSageMakerPresignedUrlPolicy",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:CreatePresignedDomainUrlWithPrincipalTag"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/SageMakerStudioUserProfileName}"
                }
            }
        }
    ]
}

Encrypt Your SageMaker Canvas Data with AWS KMS


You might have data that you want to encrypt while using Amazon SageMaker Canvas, such as your
private company information or customer data. SageMaker Canvas uses AWS Key Management Service
to protect your data. AWS KMS is a service that you can use to create and manage cryptographic keys for
encrypting your data. For more information about AWS KMS, see AWS Key Management Service in the
AWS KMS Developer Guide.

Amazon SageMaker Canvas provides you with several options for encrypting your data. SageMaker
Canvas provides default encryption within the application for tasks such as building your model and
generating insights. You can also choose to encrypt data stored in Amazon S3 to protect your data
at rest. SageMaker Canvas supports importing encrypted datasets, so you can establish an encrypted
workflow. The following sections describe how you can use AWS KMS encryption to protect your data
while building models with SageMaker Canvas.

Encrypt your data in SageMaker Canvas


With SageMaker Canvas, you can use two different AWS KMS encryption keys to encrypt your data in
SageMaker Canvas, which you can specify when setting up your Domain. These two keys can be the
same or different. SageMaker Canvas uses one key for temporary application storage, visualizations, or
compute purposes (such as building models). You can use either the default AWS managed key or specify
your own. You can also specify an optional key that SageMaker Canvas uses for long-term storage of
model objects and datasets, which are stored in the Region’s default SageMaker S3 bucket for your
account.

Prerequisites
To use your own KMS key for either of the previously described purposes, you must first grant your user's
IAM role permission to use the key. Then, you can specify the KMS key when setting up your Domain.

The simplest way to grant your role permission to use the key is to modify the key policy. Use the
following procedure to grant your role the necessary permissions.

1. Open the AWS KMS console.


2. In the Key Policy section, choose Switch to policy view.
3. Modify the key's policy to grant permissions for the kms:GenerateDataKey and kms:Decrypt
actions to the IAM role. You can add a statement that's similar to the following:

{
"Sid": "ExampleStmt",
"Action": [
"kms:Decrypt",
"kms:GenerateDataKey"
],
"Effect": "Allow",
"Principal": {
"AWS": "<arn:aws:iam::111122223333:role/Jane>"
},
"Resource": "*"
}


4. Choose Save changes.

The less preferred method is to modify the user’s IAM role to grant the user permissions to use or
manage the KMS key. If you use this method, the KMS key policy must also allow access management
through IAM. To learn how to grant permission to a KMS key through the user’s IAM role, see Specifying
KMS keys in IAM policy statements in the AWS KMS Developer Guide.

Prerequisites for time series forecasting

To use your AWS KMS key to encrypt time series forecasting models in SageMaker Canvas, you must
modify the key policy for the KMS key used to store objects to Amazon S3. Your key policy must
grant permissions to the AmazonSageMakerCanvasForecastRole, which SageMaker creates
when you grant time series forecasting permissions for your users. Amazon Forecast uses the
AmazonSageMakerCanvasForecastRole to perform time series forecasting operations in SageMaker
Canvas. Your KMS key must grant permissions to this role in order to ensure data is encrypted for time
series forecasting.

To modify the permissions of your KMS key policy to allow encrypted time series forecasting, do the
following.

1. Open the AWS KMS console.


2. In the Key Policy section, choose Switch to policy view.
3. Modify the key's policy to have the permissions specified in the following example:

{
"Sid": "Enable IAM Permissions for Amazon Forecast KMS access",
"Effect": "Allow",
"Principal": {
"AWS": "<arn:aws:iam::111122223333:role/service-role/
AmazonSagemakerCanvasForecastRole-444455556666>"
},
"Action": [
"kms:DescribeKey",
"kms:CreateGrant",
"kms:RetireGrant",
"kms:GenerateDataKey",
"kms:GenerateDataKeyWithoutPlainText",
"kms:Decrypt"
],
"Resource": "*"
}

4. Choose Save changes.

You can now use your KMS key to encrypt time series forecasting operations in SageMaker Canvas.
Note
The following permissions are only required if you are using the IAM role setup method to
configure time series forecasting. Add the following permissions policy to your user's IAM role.
You must also update the key policy with the permissions required for Amazon Forecast. For
more information about the permissions required for time series forecasting, see Grant Your
Users Permissions to Perform Time Series Forecasting (p. 275).

{
    "Sid": "Enable IAM Permissions for Amazon Forecast KMS access",
    "Effect": "Allow",
    "Principal": {
        "AWS": "<arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-111122223333444>"
    },
    "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:CreateGrant",
        "kms:RetireGrant",
        "kms:GenerateDataKey",
        "kms:GenerateDataKeyWithoutPlainText"
    ],
    "Resource": "*"
}

Encrypt your data in the SageMaker Canvas application

The first KMS key you can use in SageMaker Canvas is used for encrypting application data stored on
Amazon Elastic Block Store (EBS) volumes and in the Amazon Elastic File System that SageMaker creates
in your Domain. SageMaker Canvas encrypts your data with this key in the underlying application and
temporary storage systems created when using compute instances for building models and generating
insights. SageMaker Canvas passes the key to other AWS services, such as Autopilot, whenever
SageMaker Canvas initiates jobs with them to process your data.

You can specify this key by setting KmsKeyId in the CreateDomain API call or while doing the
Standard Domain setup in the console. If you don’t specify your own KMS key, SageMaker uses a default
AWS managed KMS key to encrypt your data in the SageMaker Canvas application.
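As a sketch of the API route, the following hypothetical CLI call sets this key when creating the Domain; the Domain name, role ARN, VPC and subnet IDs, and key ARN are all placeholders.

# Create a Domain with a customer managed KMS key for application storage
aws sagemaker create-domain \
    --domain-name example-domain \
    --auth-mode IAM \
    --default-user-settings ExecutionRole=arn:aws:iam::111122223333:role/MyCanvasExecutionRole \
    --vpc-id vpc-0123456789abcdef0 \
    --subnet-ids subnet-0123456789abcdef0 \
    --kms-key-id arn:aws:kms:example-region-1:123456789098:key/111aa2bb-333c-4d44-5555-a111bb2c33dd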

To specify your own KMS key for use in the SageMaker Canvas application through the console, first set
up your Amazon SageMaker Domain using the Standard setup. Use the following procedure to complete
the Network and Storage Section for the Domain.

1. Fill out your desired Amazon VPC settings.


2. For Encryption key, choose Enter a KMS key ARN.
3. For KMS ARN, enter the ARN for your KMS key, which should have a format similar to the following:
arn:aws:kms:example-region-1:123456789098:key/111aa2bb-333c-4d44-5555-
a111bb2c33dd

Encrypt your SageMaker Canvas data saved in Amazon S3

The second KMS key you can specify is used for data that SageMaker Canvas stores to Amazon S3.
SageMaker Canvas saves duplicates of your input datasets, application and model data, and output data
to the Region’s default SageMaker S3 bucket for your account. The naming pattern for this bucket is
sagemaker-{region}-{account-ID}, and SageMaker Canvas stores data in the Canvas/ folder.

To specify this key during the Standard Domain setup in the console, use the following procedure.
1. Turn on Enable notebook resource sharing.


2. For S3 location for shareable notebook resources, leave the default Amazon S3 path. Note that
SageMaker Canvas does not use this S3 path; this S3 path is used for Studio notebooks.
3. For Encryption key, choose Enter a KMS key ARN.
4. For KMS ARN, enter the ARN for your KMS key, which should have a format similar to the following:
arn:aws:kms:example-region-1:123456789098:key/111aa2bb-333c-4d44-5555-
a111bb2c33dd

Import encrypted datasets from Amazon S3


Your users might have datasets that have been encrypted with a KMS key. While the preceding section
shows you how to encrypt data in SageMaker Canvas and data stored to Amazon S3, you must grant


your user's IAM role additional permissions if you want to import data from Amazon S3 that is already
encrypted with AWS KMS.

To grant your user permissions to import encrypted datasets from Amazon S3 into SageMaker Canvas,
add the following permissions to the IAM execution role that you've used for the user profile.

"kms:Decrypt",
"kms:GenerateDataKey"

To learn how to edit the IAM permissions for a role, see Adding and removing IAM identity permissions
in the IAM User Guide. For more information about KMS keys, see Key policies in AWS Key Management
Service in the AWS KMS Developer Guide.
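As a sketch, you could grant these two actions with an inline policy scoped to the dataset's KMS key; the role name, policy name, and key ARN below are placeholders.

# Attach an inline policy that allows decrypting KMS-encrypted imports
aws iam put-role-policy \
    --role-name AmazonSageMaker-ExecutionRole-111122223333444 \
    --policy-name CanvasEncryptedImportAccess \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": "arn:aws:kms:example-region-1:123456789098:key/111aa2bb-333c-4d44-5555-a111bb2c33dd"
            }
        ]
    }'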

FAQs
Refer to the following FAQ items for answers to commonly asked questions about SageMaker Canvas
AWS KMS support.

Q: Does SageMaker Canvas retain my KMS key?

A: No. SageMaker Canvas may temporarily cache your key or pass it on to other AWS services (such as
Autopilot), but SageMaker Canvas does not retain your KMS key.

Q: I specified a KMS key when setting up my Domain. Why did my dataset fail to import in
SageMaker Canvas?

A: Your user’s IAM role may not have permissions to use that KMS key. To grant your user permissions,
see the Prerequisites (p. 271). Another possible error is that you have a bucket policy on your Amazon
S3 bucket that requires the use of a specific KMS key that doesn’t match the KMS key you specified
in your Domain. Make sure that you specify the same KMS key for your Amazon S3 bucket and your
Domain.

Q: How do I find the Region’s default SageMaker S3 bucket for my account?

A: The default S3 bucket follows the naming pattern sagemaker-{region}-{account-ID}. The
Canvas/ folder in this bucket stores your SageMaker Canvas application data.

Q: Can I change the default SageMaker S3 bucket used to store SageMaker Canvas data?

A: No, SageMaker creates this bucket for you.

Q: What does SageMaker Canvas store in the default SageMaker S3 bucket?

A: SageMaker Canvas uses the default SageMaker S3 bucket to store duplicates of your input datasets,
model artifacts, and model outputs.

Q: What use cases are supported for using KMS keys with SageMaker Canvas?

A: With SageMaker Canvas, you can use your own encryption keys with AWS KMS for building regression,
binary and multi-class classification, and time series forecasting models, as well as for batch inference
with your model.

Q: Can I encrypt time series forecasting models in SageMaker Canvas?

A: Yes. You must give your KMS key additional permissions in order to perform encrypted time series
forecasting. For more information about how to modify your key’s policy in order to grant time series
forecasting permissions, see Prerequisites for time series forecasting (p. 272).


Grant Your Users Permissions to Build Custom Image and Text Prediction Models
In Amazon SageMaker Canvas, you can build custom models to meet your specific business need. Two
of these custom model types are single-label image prediction and multi-category text prediction. The
permissions to build these model types are included in the AWS Identity and Access Management (IAM)
policy called AmazonSageMakerCanvasFullAccess, which SageMaker attaches by default to your user's
IAM execution role if you leave the Canvas base permissions turned on.

However, if you are using a custom IAM configuration, then you must explicitly add permissions to your
user's IAM execution role so that they can build custom image and text prediction model types. To grant
the necessary permissions to build image and text prediction models, read the following section to learn
how to attach a least-permissions policy to your role.

To add the permissions to the user's IAM role, do the following:

1. Go to the IAM console.


2. Choose Roles.
3. In the search box, search for the user's IAM role by name and select it.
4. On the page for the user's role, under Permissions, choose Add permissions.
5. Choose Create inline policy.
6. Select the JSON tab, and then paste the following least-permissions policy into the editor.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateAutoMLJobV2",
"sagemaker:DescribeAutoMLJobV2"
],
"Resource": "*"
}
]
}

7. Choose Review policy.


8. Enter a Name for the policy.
9. Choose Create policy.

For more information about AWS managed policies, see Managed policies and inline policies in the IAM
User Guide.

Grant Your Users Permissions to Perform Time Series Forecasting
In order to perform time series forecasts in Amazon SageMaker Canvas, your users must have the
necessary permissions. The preferred method to give your users these permissions is to turn on the
time series forecasting option when setting up the Amazon SageMaker Domain, or when editing the
settings for a specific user profile. You can also use the manual method of attaching a policy and trust
relationship for Amazon Forecast to the AWS Identity and Access Management (IAM) role.

If you want to encrypt your time series forecasts with your own key, you must use an AWS KMS key
and modify your KMS key's policy to grant permissions to the role used by Amazon Forecast. For more


information about setting up your KMS key and modifying the policy for time series forecasting, see
Prerequisites for time series forecasting (p. 272).

Domain setup method


SageMaker provides you with the option to grant time series forecasting permissions to users through
the Domain settings. You can toggle the permissions for all of the users in your Domain, and SageMaker
manages attaching the required IAM policy and trust relationship for you.

If you are setting up your Amazon SageMaker Domain for the first time and want to turn on time series
forecasting permissions for all users in the Domain, then use the following procedures.

Quick setup

Use the following procedure to turn on SageMaker Canvas time series forecasting permissions when
doing a Quick setup for your Domain.

1. In the Amazon SageMaker Domain Quick setup, fill out the Name and Default execution role
fields in the User profile section.
2. Leave the Enable SageMaker Canvas permissions option turned on. It is turned on by default.
3. Choose Submit to finish setting up your Domain.

Standard setup

Use the following procedure to turn on SageMaker Canvas time series forecasting permissions when
doing a Standard setup for your Domain.

1. In the Amazon SageMaker Domain Standard setup, fill out the General settings, Studio
settings, and RStudio settings pages.
2. Choose the Canvas settings page.
3. For the Canvas base permissions configuration, leave the Enable Canvas base permissions
option turned on. It is turned on by default. These permissions are required in order to turn on
time series forecasting permissions.
4. For the Time series forecasting configuration, leave the Enable time series forecasting option
turned on. It is turned on by default.
5. Select Create and use a new execution role, or select Use an existing execution role if you
already have an IAM role with the required Amazon Forecast permissions attached. For more
information, see the IAM role setup method (p. 278).
6. Finish making any other changes to your Domain setup, and then choose Submit.

Your users should now have the necessary permissions to perform time series forecasting in SageMaker
Canvas.

User setup method


You can configure time series forecasting permissions for individual users in an existing Domain. The
user profile settings override the general Domain settings, so you can grant permissions to specific users
without giving permissions to all of your users. To grant time series forecasting permissions to a specific
user that doesn't already have permissions, use the following procedure.

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Domains.
3. On the Domains page, choose your Domain.


4. In the User profiles tab, select the name of the user whose permissions you want to edit.
5. On the User Details page, choose Edit.
6. Choose the Canvas settings page.
7. Turn on Enable Canvas base permissions. These permissions are required in order to turn on time
series forecasting permissions.
8. Turn on the Enable time series forecasting option.
9. If you want to use a different execution role for the user than the role specified in the Domain, select
Create and use a new execution role, or Use an existing execution role if you already have an IAM
role ready to use.
Note
If you want to use an existing IAM role, make sure that it has the IAM policy
AmazonSageMakerCanvasForecastAccess attached and has a trust relationship that
establishes Amazon Forecast as a service principal. For more information, see the section
IAM role setup method (p. 278).
10. Finish making any other changes to your user profile, and then choose Submit to save your
changes.

Your user should now have permission to do time series forecasting in SageMaker Canvas.


You can also remove your user's permissions by using the preceding procedure and turning off the
Enable time series forecasting option.

IAM role setup method


You can manually grant your users permissions to perform time series forecasting in Amazon SageMaker
Canvas by adding additional permissions to the AWS Identity and Access Management (IAM) role
specified for the user’s profile. The IAM role must have a trust relationship with Amazon Forecast and an
attached policy that gives permissions to Forecast.

The following section shows you how to create the trust relationship and attach the
AmazonSageMakerCanvasForecastAccess managed policy to your IAM role, which grants the
minimum permissions necessary for time series forecasting to work in SageMaker Canvas.

To configure an IAM role with the manual method, use the following procedure.

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Domains.
3. On the Domains page, choose your Domain.
4. From the list of User profiles, select the profile of the user to whom you want to grant time series
forecasting permissions.
5. Under Details, copy or make a note of the name of the user's Execution role. The name of the IAM
role should be similar to the following: AmazonSageMaker-ExecutionRole-111122223333444.

6. Once you have the name of the user's IAM role, go to the IAM console.
7. Choose Roles.
8. Search for the user's IAM role by name from the list of roles and select it.
9. Under Permissions, choose Add permissions.
10. Choose Attach policies.
11. Search for the AmazonSageMakerCanvasForecastAccess managed policy and select it. Choose
Attach policies to attach the policy to the role.

After attaching the policy, the role's Permissions section should now include
AmazonSageMakerCanvasForecastAccess.


12. Return to the IAM role's page, and under Trust relationships, choose Edit trust policy.
13. In the Edit trust policy editor, update the trust policy to add Forecast as a service principal. The
policy should look like the following example.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"sagemaker.amazonaws.com",
"forecast.amazonaws.com"
]
},
"Action": "sts:AssumeRole"
}
]
}

14. After editing the trust policy, choose Update policy.

You should now have an IAM role that has the policy AmazonSageMakerCanvasForecastAccess
attached to it and a trust relationship established with Amazon Forecast, giving users permission to
perform time series forecasting in SageMaker Canvas. For information about AWS managed policies, see
Managed policies and inline policies.
Note
If you use this method to set up time series forecasting and want to use AWS KMS encryption
for your forecasts, then you must configure your KMS key’s policy to grant additional
permissions. For more information, see Prerequisites for time series forecasting (p. 272).
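The attach-and-trust steps in the preceding procedure can also be sketched with the AWS CLI; the role name is a placeholder, trust-policy.json is assumed to contain the trust policy shown in step 13, and you should verify the exact managed policy ARN in the IAM console.

# Attach the Forecast access policy and update the role's trust policy
aws iam attach-role-policy \
    --role-name AmazonSageMaker-ExecutionRole-111122223333444 \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSageMakerCanvasForecastAccess
aws iam update-assume-role-policy \
    --role-name AmazonSageMaker-ExecutionRole-111122223333444 \
    --policy-document file://trust-policy.json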

Update SageMaker Canvas for Your Users


You can update to the latest version of Amazon SageMaker Canvas as either a user or an IT administrator.
You can update Amazon SageMaker Canvas for a single user at a time.

To update the Amazon SageMaker Canvas application, you must delete the previous version.
Important
Deleting the previous version of Amazon SageMaker Canvas doesn't delete the data or models
that the users have created.

Use the following procedure to log in to AWS, open Amazon SageMaker Domain, and update Amazon
SageMaker Canvas. The users can start using the SageMaker Canvas application when they log back in.

1. Sign in to the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation pane, choose Domains.
3. On the Domains page, choose your Domain.
4. From the list of User profiles, choose a user profile.
5. For the list of Apps, find the Canvas application (the App type says Canvas) and choose Delete app.
6. Complete the dialog box and choose Confirm action.
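The same deletion can be sketched with the AWS CLI; the Domain ID and user profile name are placeholders, and the Canvas application created through the console is typically named default.

# Delete the Canvas app so a new version is created at the next login
aws sagemaker delete-app \
    --domain-id d-xxxxxxxxxxxx \
    --user-profile-name example-user \
    --app-type Canvas \
    --app-name default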



Request a Quota Increase


Your users might use AWS resources in amounts that exceed those specified by their quotas. If your users
are resource constrained, you can request a quota increase for them.

Amazon SageMaker Canvas uses the following services to process the requests of your users:

• Amazon SageMaker Autopilot


• Amazon SageMaker Studio Domain
• Amazon Forecast

For information about increasing quotas for SageMaker Canvas operations that aren't used to forecast
time series data, see Amazon SageMaker endpoints and quotas.

For information about increasing quotas for SageMaker Canvas operations that are used to forecast time
series data, see Amazon Forecast endpoints and quotas.

Request an increase for instances to build custom models


When building a custom model, if you encounter an error during post-building analysis that tells you to
increase your quota for ml.m5.2xlarge instances, use the following information to resolve the issue.

To allow SageMaker Canvas to complete post-building analysis of models, you must increase the
SageMaker Hosting endpoint limit for the ml.m5.2xlarge instance type to a non-zero value in your
AWS account. After building a model, SageMaker Canvas hosts the model on a SageMaker Hosting
endpoint and uses the endpoint to generate the post-building analysis. If you don't increase the default
account limit of 0 for ml.m5.2xlarge instances, SageMaker Canvas cannot complete this step and
generates an error during post-building analysis.
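Before filing the request, you can inspect your current SageMaker quotas with a sketch like the following; quota names vary, so the filter below is only an illustration.

# List SageMaker quotas that mention the ml.m5.2xlarge instance type
aws service-quotas list-service-quotas --service-code sagemaker \
    --query "Quotas[?contains(QuotaName, 'ml.m5.2xlarge')]"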

Use the following procedure to request a limit increase for your account.

1. Open the AWS Support Center console.


2. On the AWS Support Center page, choose Create Case and then choose Service limit increase.
3. In the Case classification panel under Limit type, search for SageMaker.
4. In the Request panel, choose the Region that you are working in. For Resource Type, choose
SageMaker Hosting.


5. For Limit, choose ml.m5.2xlarge instances.


6. For New Limit Value, verify that the value is at least 1.
7. In Case description, provide a brief explanation of why you need the Service limit increase. For
example, "SageMaker Canvas uses this instance type for model analysis."
8. In Contact options, provide some details about how you would like to be contacted by the AWS
service support team on the status of your Service limit increase request.
9. Choose Submit.

Grant Users Permissions to Import Amazon Redshift Data


Your users might have datasets stored in Amazon Redshift. Before users can import data from Amazon
Redshift into SageMaker Canvas, you must add the AmazonRedshiftFullAccess managed policy
to the IAM execution role that you've used for the user profile and add Amazon Redshift as a service
principal to the role's trust policy. You must also associate the IAM execution role with your Amazon
Redshift cluster. Complete the procedures in the following sections to give your users the required
permissions to import Amazon Redshift data.

Add Amazon Redshift permissions to your IAM role


You must grant Amazon Redshift permissions to the IAM role specified in your user profile.

To add the AmazonRedshiftFullAccess policy to the user's IAM role, do the following.

1. Sign in to the IAM console at https://console.aws.amazon.com/iam/.


2. Choose Roles.
3. In the search box, search for the user's IAM role by name and select it.
4. On the page for the user's role, under Permissions, choose Add permissions.
5. Choose Attach policies.
6. Search for the AmazonRedshiftFullAccess managed policy and select it.
7. Choose Attach policies to attach the policy to the role.

After attaching the policy, the role’s Permissions section should now include
AmazonRedshiftFullAccess.

To add Amazon Redshift as a service principal to the IAM role, do the following.

1. On the same page for the IAM role, under Trust relationships, choose Edit trust policy.
2. In the Edit trust policy editor, update the trust policy to add Amazon Redshift as a service principal.
An IAM role that allows Amazon Redshift to access other AWS services on your behalf has a trust
relationship as follows:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "redshift.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}


3. After editing the trust policy, choose Update policy.

You should now have an IAM role that has the policy AmazonRedshiftFullAccess attached to it and a
trust relationship established with Amazon Redshift, giving users permission to import Amazon Redshift
data into SageMaker Canvas. For more information about AWS managed policies, see Managed policies
and inline policies in the IAM User Guide.
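
If you manage roles with the AWS SDK instead of the console, the following boto3 sketch makes both
changes. The role name is a placeholder, and the trust policy shown assumes the role should also keep a
SageMaker service principal; adjust the principal list to match your role before using it.

import json

import boto3

iam = boto3.client("iam")
role_name = "AmazonSageMaker-ExecutionRole-example"  # placeholder: your execution role

# Attach the AmazonRedshiftFullAccess managed policy to the role.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonRedshiftFullAccess",
)

# update_assume_role_policy overwrites the existing trust policy, so include
# every service principal that the role needs, not just Amazon Redshift.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["sagemaker.amazonaws.com", "redshift.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}
iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=json.dumps(trust_policy))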

Associate the IAM role with your Amazon Redshift cluster


In the settings for your Amazon Redshift cluster, you must associate the IAM role that you granted
permissions to in the preceding section.

To associate an IAM role with your cluster, do the following.

1. Sign in to the Amazon Redshift console at https://console.aws.amazon.com/redshift/.


2. On the navigation menu, choose Clusters, and then choose the name of the cluster that you want to
update.
3. In the Actions dropdown menu, choose Manage IAM roles. The Cluster permissions page appears.
4. For Available IAM roles, enter either the ARN or the name of the IAM role, or choose the IAM role
from the list.
5. Choose Associate IAM role to add it to the list of Associated IAM roles.
6. Choose Save changes to associate the IAM role with the cluster.

Amazon Redshift modifies the cluster to complete the change, and the IAM role to which you previously
granted Amazon Redshift permissions is now associated with your Amazon Redshift cluster. Your users
now have the required permissions to import Amazon Redshift data into SageMaker Canvas.
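
The same association can be scripted. A minimal boto3 sketch, with placeholder cluster and role
identifiers:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")  # your Region

# Associate the execution role with the cluster; Amazon Redshift applies the
# change asynchronously while the cluster is being modified.
redshift.modify_cluster_iam_roles(
    ClusterIdentifier="my-redshift-cluster",  # placeholder: your cluster
    AddIamRoles=[
        "arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-example"
    ],
)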

Grant Users Permissions to Collaborate with Studio


Your Amazon SageMaker Canvas users might want to share their models with users in Amazon
SageMaker Studio to receive feedback and model updates, and Studio users might want to share models
with Canvas users so that they can generate predictions in Canvas. The following permissions grant
Canvas users and Studio users access to share models with each other.

For more information about how Canvas users can share models with Studio users, see Collaborate with
data scientists (p. 377). For more information about how Canvas users can bring a model shared from
Studio, see Bring your own model to SageMaker Canvas (p. 384).

Before Canvas and Studio users can collaborate, the users must be in the same Amazon SageMaker
Domain. Add the following IAM permissions to the same IAM execution role that you've used for
their profiles.

To add the permissions to the users’ IAM role, do the following:

1. Go to the IAM console.


2. Choose Roles.
3. In the search box, search for the user's IAM role by name and select it.
4. On the page for the user's role, under Permissions, choose Add permissions.
5. Choose Create inline policy.
6. Choose the JSON tab, and then enter the following IAM policy:

"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateSharedModel",
"sagemaker:DescribeSharedModel",
"sagemaker:ListSharedModelEvents",
"sagemaker:ListSharedModels",
"sagemaker:ListSharedModelVersions",
"sagemaker:SendSharedModelEvent",
"sagemaker:UpdateSharedModel",
],
"Resource": "*"
}
]
}

7. Choose Review policy, enter a Name for the policy, and then choose Create policy to attach it to the role.

For more information about AWS managed policies, see Managed policies and inline policies in the IAM
User Guide.
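
If you prefer the AWS SDK to the console, the same permissions can be added as an inline policy. A
minimal boto3 sketch, with placeholder role and policy names:

import json

import boto3

shared_model_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateSharedModel",
                "sagemaker:DescribeSharedModel",
                "sagemaker:ListSharedModelEvents",
                "sagemaker:ListSharedModels",
                "sagemaker:ListSharedModelVersions",
                "sagemaker:SendSharedModelEvent",
                "sagemaker:UpdateSharedModel",
            ],
            "Resource": "*",
        }
    ],
}

# Add the permissions as an inline policy on the user's execution role.
boto3.client("iam").put_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-example",  # placeholder: your role
    PolicyName="CanvasStudioModelSharing",  # placeholder: any policy name
    PolicyDocument=json.dumps(shared_model_policy),
)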

Grant Your Users Permissions to Send Predictions to Amazon QuickSight
You must grant your SageMaker Canvas users permissions to send batch predictions to Amazon
QuickSight. In Amazon QuickSight, users can create analyses and reports with a dataset and prepare
dashboards to share their results. For more information about sending predictions to QuickSight for
analysis, see Send predictions to Amazon QuickSight (p. 364).

To grant the necessary permissions to share batch predictions with users in QuickSight, you must add a
permissions policy to the AWS Identity and Access Management (IAM) execution role that you’ve used for
the user profile. The following section shows you how to attach a least-permissions policy to your role.

Add the permissions policy to your IAM role

To add the permissions policy, use the following procedure:

1. Sign in to the IAM console at https://console.aws.amazon.com/iam/.


2. Choose Roles.
3. In the search box, search for the user's IAM role by name and select it.
4. On the page for the user's role, under Permissions, choose Add permissions.
5. Choose Create inline policy.
6. Select the JSON tab, and then paste the following least-permissions policy into the editor. Replace
the <your-account-number> placeholders with your own AWS account number.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "quicksight:CreateDataSet",
        "quicksight:ListUsers",
        "quicksight:ListNamespaces",
        "quicksight:CreateDataSource",
        "quicksight:PassDataSet",
        "quicksight:PassDataSource"
      ],
      "Resource": [
        "arn:aws:quicksight:*:<your-account-number>:datasource/*",
        "arn:aws:quicksight:*:<your-account-number>:user/*",
        "arn:aws:quicksight:*:<your-account-number>:namespace/*",
        "arn:aws:quicksight:*:<your-account-number>:dataset/*"
      ]
    }
  ]
}

7. Choose Review policy.


8. Enter a Name for the policy.
9. Choose Create policy.

You should now have a customer-managed IAM policy attached to your execution role that grants your
Canvas users the necessary permissions to send batch predictions to users in QuickSight.
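
You can also create this inline policy programmatically. The following boto3 sketch fills in your account
number automatically; the role and policy names are placeholders.

import json

import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]

quicksight_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "quicksight:CreateDataSet",
                "quicksight:ListUsers",
                "quicksight:ListNamespaces",
                "quicksight:CreateDataSource",
                "quicksight:PassDataSet",
                "quicksight:PassDataSource",
            ],
            "Resource": [
                f"arn:aws:quicksight:*:{account_id}:datasource/*",
                f"arn:aws:quicksight:*:{account_id}:user/*",
                f"arn:aws:quicksight:*:{account_id}:namespace/*",
                f"arn:aws:quicksight:*:{account_id}:dataset/*",
            ],
        }
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-example",  # placeholder: your role
    PolicyName="CanvasQuickSightPredictions",  # placeholder: any policy name
    PolicyDocument=json.dumps(quicksight_policy),
)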

Manage apps
The following sections describe how you can manage your SageMaker Canvas applications. You can view,
delete, or relaunch your apps from the Domains section of the SageMaker console.

Check for active apps


To check if you have any actively running SageMaker Canvas apps, use the following procedure.

1. Open the SageMaker console.


2. In the navigation pane, select Domains.
3. On the Domains page, choose your Domain.
4. On the Domain details page, under User profiles, select the user profile name for the Canvas
application that you want to view.
5. Under Apps, find the app that says Canvas in the App type column.

The Status column displays the status of the app, such as Ready, Pending, or Deleted. If the app is
Ready, then your SageMaker Canvas workspace instance is active. You can delete the app from the
console or log out from the SageMaker Canvas interface.

Delete app
If you want to end your SageMaker Canvas workspace instance, you can either log out from the
SageMaker Canvas application or delete your application from the SageMaker console. A workspace
instance is dedicated for your use from when you start using SageMaker Canvas to the point when you
stop using it. Deleting the application only ends the workspace instance. Models and datasets aren’t
affected, but Quick build tasks automatically restart when you log in again. The billing for the workspace
instance also stops.

To delete your Canvas app through the AWS console, first close the browser tab in which your Canvas
app was open. Then, use the following procedure to delete your SageMaker Canvas application.

1. Open the SageMaker console.


2. In the navigation pane, select Domains.
3. On the Domains page, choose your Domain.

4. On the Domain details page, under User profiles, select the user profile name for the Canvas
application you want to view.
5. Under Apps, find the application that says Canvas in the App type column.
6. In the Action column, choose Delete app.
7. In the Delete app dialog box, select the Yes, delete app prompt, confirm the deletion by typing
delete in the text field, and then choose Delete.

After you've successfully deleted the application, the Status column says Deleted. Otherwise, your
application is still active.

You can also end the workspace instance by logging out (p. 392) from within the SageMaker Canvas
application.
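
If you manage many user profiles, you can find and delete Canvas apps with the SageMaker API instead
of the console. A minimal boto3 sketch, with placeholder Domain and user profile values:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")  # your Region

domain_id = "d-xxxxxxxxxxxx"  # placeholder: your Domain ID
user_profile = "my-user-profile"  # placeholder: the user profile to check

# List the user's apps and delete any active Canvas app.
apps = sm.list_apps(
    DomainIdEquals=domain_id,
    UserProfileNameEquals=user_profile,
)["Apps"]

for app in apps:
    if app["AppType"] == "Canvas" and app["Status"] != "Deleted":
        # Deleting the app ends the workspace instance and its billing.
        sm.delete_app(
            DomainId=domain_id,
            UserProfileName=user_profile,
            AppType="Canvas",
            AppName=app["AppName"],
        )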

Relaunch app
If you delete or log out of your SageMaker Canvas application and want to relaunch the application, use
the following procedure.

1. Navigate to the SageMaker console.


2. In the navigation pane, choose Canvas.
3. On the SageMaker Canvas landing page, in the Get Started box, select your user profile from the
dropdown.
4. Choose Open Canvas to open the application.

SageMaker Canvas begins launching the app.

You can also use the following secondary procedure if you encounter any issues with the previous
procedure.

1. Open the SageMaker console.


2. In the navigation pane, select Domains.
3. On the Domains page, choose your Domain.
4. On the Domain details page, under User profiles, select the user profile name for the SageMaker
Canvas application you want to view.
5. Choose Launch and select Canvas from the dropdown list.

SageMaker Canvas begins launching the app.

Configure Amazon SageMaker Canvas in a VPC without internet access
The Amazon SageMaker Canvas application runs in a container in an AWS managed Amazon Virtual
Private Cloud (VPC). If you want to further control access to your resources or run SageMaker Canvas
without public internet access, you can configure your Amazon SageMaker Domain and VPC settings.
Within your own VPC, you can configure settings such as security groups (virtual firewalls that control
inbound and outbound traffic from Amazon EC2 instances) and subnets (ranges of IP addresses in your
VPC). To learn more about VPCs, see How Amazon VPC works.

When the SageMaker Canvas application is running in the AWS managed VPC, it can interact with other
AWS services using either an internet connection or through VPC endpoints created in a customer-
managed VPC (without public internet access). SageMaker Canvas applications can access these VPC
endpoints through a Studio-created network interface that provides connectivity to the customer-
managed VPC. The default behavior of the SageMaker Canvas application is to have internet access.
When using an internet connection, the containers for the preceding jobs access AWS resources over the
internet, such as the Amazon S3 buckets where you store training data and model artifacts.

However, if you have security requirements to control access to your data and job containers, we
recommend that you configure SageMaker Canvas and your VPC so that your data and containers aren’t
accessible over the internet. SageMaker uses the VPC configuration settings you specify when setting up
your Domain for SageMaker Canvas.

If you want to configure your SageMaker Canvas application without internet access, you must configure
your VPC settings when you onboard to Amazon SageMaker Domain (p. 37), set up VPC endpoints,
and grant the necessary AWS Identity and Access Management permissions. For information about
configuring a VPC in Amazon SageMaker, see Choose an Amazon VPC (p. 46). The following sections
describe how to run SageMaker Canvas in a VPC without public internet access.

Configure Amazon SageMaker Canvas in a VPC without internet access


You can send traffic from SageMaker Canvas to other AWS services through your own VPC. If your
own VPC doesn't have public internet access and you've set up your Domain in VPC only mode, then
SageMaker Canvas won't have public internet access either. This includes all requests, such as accessing
datasets in Amazon S3 or training jobs for standard builds, and the requests go through VPC endpoints
in your VPC instead of the public internet. When you onboard to Domain and Choose an Amazon
VPC (p. 46), you can specify your own VPC as the default VPC for the Domain, along with your desired
security group and subnet settings. Then, SageMaker creates a network interface in your VPC that
SageMaker Canvas uses to access VPC endpoints in your VPC. Note that the security group and subnet
settings are set after you finish onboarding to Domain.

When onboarding to Domain, if you choose Public internet only as the network access type, the VPC is
SageMaker managed and allows internet access.

You can change this behavior by choosing VPC only so that SageMaker sends all traffic to a network
interface that SageMaker creates in your specified VPC. When you choose this option, you must provide
the subnets, security groups, and VPC endpoints that are necessary to communicate with the SageMaker
API and SageMaker Runtime, and various AWS services, such as Amazon S3 and Amazon CloudWatch,
that are used by SageMaker Canvas. Note that you can only import data from Amazon S3 buckets
located in the same Region as your VPC.

The following procedures show how you can configure these settings to use SageMaker Canvas without
the internet.

Step 1: Onboard to Amazon SageMaker Domain

To send SageMaker Canvas traffic to a network interface in your own VPC instead of over the internet,
specify the VPC you want to use when onboarding to Amazon SageMaker Domain (p. 37). You must also
specify at least two subnets in your VPC that SageMaker can use. Choose Standard setup and do the
following procedure when configuring the Network and Storage Section for the Domain.

1. Select your desired VPC.


2. Choose two or more Subnets. If you don’t specify the subnets, SageMaker uses all of the subnets in
the VPC.
3. Choose one or more Security group(s).
4. Choose VPC Only to turn off direct internet access in the AWS managed VPC where SageMaker
Canvas is hosted.

After disabling internet access, finish the onboarding process to set up your Domain. For more
information about the VPC settings for Amazon SageMaker Domain, see Choose an Amazon VPC (p. 46).
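
If you script Domain creation, the same choices map to the CreateDomain API. The following boto3
sketch is a minimal example with placeholder IDs for your VPC, subnets, security group, and execution
role:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")  # your Region

# Create a Domain in VPC only mode so that traffic stays in your VPC.
sm.create_domain(
    DomainName="canvas-vpc-only",
    AuthMode="IAM",
    AppNetworkAccessType="VpcOnly",  # no direct internet access
    VpcId="vpc-0123456789abcdef0",  # placeholder: your VPC
    SubnetIds=[  # two or more subnets
        "subnet-0123456789abcdef0",
        "subnet-0fedcba9876543210",
    ],
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-example",
        "SecurityGroups": ["sg-0123456789abcdef0"],  # placeholder: your security group
    },
)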

Step 2: Configure VPC endpoints and access


Note
In order to configure Canvas in your own VPC, you must enable private DNS hostnames for
your VPC endpoints. For more information, see Connect to SageMaker Through a VPC Interface
Endpoint.

SageMaker Canvas only accesses other AWS services to manage and store data for its functionality.
For example, it connects to Amazon Redshift if your users access an Amazon Redshift database. It can
connect to an AWS service such as Amazon Redshift using an internet connection or a VPC endpoint.
Use VPC endpoints if you want connections from your VPC to AWS services that don't traverse the
public internet.

A VPC endpoint creates a private connection to an AWS service that uses a networking path that is
isolated from the public internet. For example, if you set up access to Amazon S3 using a VPC endpoint
from your own VPC, then the SageMaker Canvas application can access Amazon S3 by going through
the network interface in your VPC and then through the VPC endpoint that connects to Amazon S3. The
communication between SageMaker Canvas and Amazon S3 is private.

For more information about configuring VPC endpoints for your VPC, see AWS PrivateLink.
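
If you provision endpoints with the Amazon EC2 API rather than the console, the following boto3 sketch
shows the pattern for two of the interface endpoints from the table that follows. The IDs are
placeholders, and note that the Amazon S3 endpoint is a Gateway endpoint, which takes route table IDs
instead of subnets.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # your Region

common = {
    "VpcId": "vpc-0123456789abcdef0",  # placeholder: your VPC
    "SubnetIds": ["subnet-0123456789abcdef0"],  # placeholder: your subnets
    "SecurityGroupIds": ["sg-0123456789abcdef0"],  # placeholder: your security group
    "VpcEndpointType": "Interface",
    "PrivateDnsEnabled": True,  # required, as described in the preceding Note
}

# Interface endpoints for the SageMaker API and runtime; repeat for the other
# services in the table that your users need.
for service in (
    "com.amazonaws.us-east-1.sagemaker.api",
    "com.amazonaws.us-east-1.sagemaker.runtime",
):
    ec2.create_vpc_endpoint(ServiceName=service, **common)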

The following are the VPC endpoints for each service you can use with SageMaker Canvas:

Service                                          Endpoint                                  Endpoint type

Amazon Athena                                    com.amazonaws.Region.athena               Interface

Amazon SageMaker                                 com.amazonaws.Region.sagemaker.api        Interface
                                                 com.amazonaws.Region.sagemaker.runtime
                                                 com.amazonaws.Region.notebook

AWS Security Token Service                       com.amazonaws.Region.sts                  Interface

Amazon Elastic Container Registry (Amazon ECR)   com.amazonaws.Region.ecr.api              Interface
                                                 com.amazonaws.Region.ecr.dkr

Amazon Elastic Compute Cloud (Amazon EC2)        com.amazonaws.Region.ec2                  Interface

Amazon Simple Storage Service (Amazon S3)        com.amazonaws.Region.s3                   Gateway

Amazon Redshift                                  com.amazonaws.Region.redshift-data        Interface

AWS Secrets Manager                              com.amazonaws.Region.secretsmanager       Interface

AWS Systems Manager                              com.amazonaws.Region.ssm                  Interface

Amazon CloudWatch                                com.amazonaws.Region.monitoring           Interface

Amazon CloudWatch Logs                           com.amazonaws.Region.logs                 Interface

Amazon Forecast                                  com.amazonaws.Region.forecast             Interface
                                                 com.amazonaws.Region.forecastquery

Amazon Textract                                  com.amazonaws.Region.textract             Interface

Amazon Comprehend                                com.amazonaws.Region.comprehend           Interface

Amazon Rekognition                               com.amazonaws.Region.rekognition          Interface

AWS Glue                                         com.amazonaws.Region.glue                 Interface

You must also add the following endpoint policy for Amazon S3 to control AWS principal access to your
VPC endpoint. For information about how to update your VPC endpoint policy, see Control access to VPC
endpoints using endpoint policies.

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:CreateBucket",
    "s3:GetBucketCors",
    "s3:GetBucketLocation"
  ],
  "Resource": [
    "arn:aws:s3:::*SageMaker*",
    "arn:aws:s3:::*Sagemaker*",
    "arn:aws:s3:::*sagemaker*"
  ]
},
{
  "Effect": "Allow",
  "Action": [
    "s3:ListBucket",
    "s3:ListAllMyBuckets"
  ],
  "Resource": "*"
}

Step 3: Grant IAM permissions

The SageMaker Canvas user must have the necessary AWS Identity and Access Management permissions
to allow connection to the VPC endpoints. The IAM role to which you give permissions must be the
same one you used when onboarding to Amazon SageMaker Domain. You can attach the SageMaker
managed AmazonSageMakerFullAccess policy to the IAM role for the user to give the user the
required permissions. If you require more restrictive IAM permissions and use custom policies instead,
then give the user’s role the ec2:DescribeVpcEndpointServices permission. SageMaker Canvas
requires this permission to verify the existence of the required VPC endpoints for standard build jobs.
If it detects these VPC endpoints, then standard build jobs run by default in your VPC. Otherwise, they
will run in the default AWS managed VPC.

For instructions on how to attach the AmazonSageMakerFullAccess IAM policy to your user’s IAM
role, see Adding and removing IAM identity permissions.

To grant your user’s IAM role the granular ec2:DescribeVpcEndpointServices permission, use the
following procedure.

1. Sign in to the AWS Management Console and open the IAM console.
2. In the navigation pane, choose Roles.
3. In the list, choose the name of the role to which you want to grant permissions.

4. Choose the Permissions tab.


5. Choose Add permissions and then choose Create inline policy.
6. Choose the JSON tab and enter the following policy, which grants the
ec2:DescribeVpcEndpointServices permission:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "ec2:DescribeVpcEndpointServices",
      "Resource": "*"
    }
  ]
}

7. Choose Review policy, and then enter a Name for the policy (for example,
VPCEndpointPermissions).
8. Choose Create policy.

The user’s IAM role should now have permissions to access the VPC endpoints configured in your VPC.
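
To confirm which endpoints already exist in your VPC, you can list them with the Amazon EC2 API. A
minimal boto3 sketch, with a placeholder VPC ID:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # your Region

# List the VPC endpoints in your VPC so that you can check them against the
# table of required services in Step 2.
endpoints = ec2.describe_vpc_endpoints(
    Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]  # your VPC
)["VpcEndpoints"]

for endpoint in endpoints:
    print(endpoint["ServiceName"], endpoint["State"])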

(Optional) Step 4: Override security group settings for specific users

If you are an administrator, you might want to give specific users their own VPC settings. When you
override the default VPC’s security group settings for a specific user, these settings are passed on to
the SageMaker Canvas application for that user.

You can override the security groups that a specific user has access to in your VPC when you set up a new
user profile in Studio. You can use the CreateUserProfile SageMaker API call (or create_user_profile with
the AWS CLI), and then in the UserSettings, you can specify the SecurityGroups for the user.
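
A minimal boto3 sketch of that call, with placeholder Domain, profile, role, and security group values:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")  # your Region

# Override the Domain's default security groups for a single user; the Canvas
# app for this profile inherits these settings.
sm.create_user_profile(
    DomainId="d-xxxxxxxxxxxx",  # placeholder: your Domain ID
    UserProfileName="analyst-restricted",  # placeholder: the new profile name
    UserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-example",
        "SecurityGroups": ["sg-0aaaabbbbccccdddd"],  # user-specific security groups
    },
)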

Use Ready-to-use models


With Amazon SageMaker Canvas Ready-to-use models, you can make predictions on your data without
writing a single line of code or having to build a model. All you have to bring is your data. Ready-to-use
models are pre-built solutions that generate predictions without requiring the time, expertise, or cost
of building a model yourself, and you can choose from a variety of use cases ranging from language
detection to expense analysis.

Canvas integrates with existing AWS services, such as Amazon Textract, Amazon Rekognition, and
Amazon Comprehend, to analyze your data and make predictions or extract insights. You can use the
predictive power of these services from within the Canvas application to get high quality predictions for
your data.
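
Canvas makes these service calls on your behalf, so no code is required. Purely as an illustration of the
kind of request that backs a Ready-to-use prediction, the following boto3 sketch calls the Amazon
Comprehend language detection API directly:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # your Region

# Detect the dominant language of a short passage, similar to what the
# language detection Ready-to-use model reports.
result = comprehend.detect_dominant_language(Text="Bonjour, comment allez-vous ?")
for language in result["Languages"]:
    print(language["LanguageCode"], language["Score"])  # e.g. "fr" with a high score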

Canvas supports the following Ready-to-use model types:

Ready-to-use model               Description                                          Supported data type

Sentiment analysis               Detect sentiment in lines of text, which can be      Text (CSV or plain text)
                                 positive, negative, neutral, or mixed. Currently,
                                 you can only do sentiment analysis for English
                                 language text.

Entities extraction              Extract entities, which are real-world objects       Text (CSV or plain text)
                                 such as people, places, and commercial items, or
                                 units such as dates and quantities, from text.

Language detection               Determine the dominant language in text such as      Text (CSV or plain text)
                                 English, French, or German.

Personal information detection   Detect personal information that could be used to    Text (CSV or plain text)
                                 identify an individual, such as addresses, bank
                                 account numbers, and phone numbers, from text.

Object detection in images       Detect objects, concepts, scenes, and actions in     Image (JPG, PNG)
                                 your images.

Text detection in images         Detect text in your images.                          Image (JPG, PNG)

Expense analysis                 Extract information from invoices and receipts,      Document (PDF, JPG, PNG, TIFF)
                                 such as date, number, item prices, total amount,
                                 and payment terms.

Identity document analysis       Extract information from passports, driver           Document (PDF, JPG, PNG, TIFF)
                                 licenses, and other identity documentation issued
                                 by the US Government.

Document analysis                Analyze documents and forms for relationships        Document (PDF, JPG, PNG, TIFF)
                                 among detected text.

Get started
To get started with Ready-to-use models, review the following information.

Prerequisites

To use Ready-to-use models in Canvas, you must turn on the Canvas Ready-to-use models
configuration permissions when setting up your Amazon SageMaker Domain. The Canvas Ready-
to-use models configuration attaches the AmazonSageMakerCanvasAIServicesAccess policy to your
Canvas user's AWS Identity and Access Management (IAM) execution role. If you encounter any issues
with granting permissions, see the topic Troubleshooting issues with granting permissions through the
SageMaker console (p. 393).

If you’ve already set up your Domain, you can edit your Domain settings and turn on the permissions. For
instructions on how to edit your Domain settings, see View and Edit Domains. When editing the settings
for your Domain, go to the Canvas settings and turn on the Enable Canvas Ready-to-use models
option.

How to use Ready-to-use models

To get started with Ready-to-use models, do the following:

1. (Optional) Import your data. You can import a tabular, image, or document dataset and use
Ready-to-use models to generate batch predictions (a dataset of predictions). To get started with
importing a dataset, see Import data for Ready-to-use models (p. 291).
2. Generate predictions. You can generate single or batch predictions with your chosen Ready-
to-use model. To get started with making predictions, see Make predictions with Ready-to-use
models (p. 292).

Import data for Ready-to-use models


Note
The following topic assumes that you want to use Ready-to-use models to get predictions for an
entire dataset. If you only want to get a single prediction with a Ready-to-use model, you can
skip this section and go directly to the Make predictions with Ready-to-use models (p. 292)
topic.

You can use Canvas Ready-to-use models to get predictions for an entire dataset. You only have to
import your data into Canvas.

Ready-to-use models support predictions for the following dataset types:

• Text data. Text data consists of text in a standard CSV format. The data should consist of at least one
column of plain text data.
• Image data. Image datasets consist of image files in JPG or PNG format.
• Document data. Document data consists of files in PDF, JPG, PNG, or TIFF format.

When you import your data into Canvas, you must make sure that it meets the input requirements. For
a table of requirements by data type, you can refer to the limits table on the Create a dataset (p. 301)
page for custom models.

You can import data into Canvas from the following data sources:

• Local files on your computer


• Amazon S3 buckets
• Amazon Redshift
• AWS Glue Data Catalog through Amazon Athena
• Snowflake
• Over 40 external SaaS platforms, such as SAP OData

For a table of all of the supported data sources and what data types you can import from them, see the
data sources table on the Import data into Canvas page in the custom models documentation.

Use the following procedures to import datasets into Canvas that you can use with Ready-to-use models.

Import text or image data


With tabular datasets of text data, you can generate predictions for the sentiment analysis, entities
extraction, language detection, and personal information detection Ready-to-use models. With image
datasets, you can generate predictions for the object detection in images and text detection in images
Ready-to-use models.

The procedures for importing text and image data are the same for Ready-to-use models and custom
models. You can refer to the custom model procedures for instructions on how to import these types of
datasets:

• To learn how to import text data, use the Import tabular data (p. 303) procedure within the custom
model documentation.
• To learn how to import image data, use the Import image data (p. 305) procedure within the custom
model documentation.

Note
You can only import image datasets from local file upload or an Amazon S3 bucket.

Import document data


The Ready-to-use models for expense analysis, identity document analysis, and document analysis
support document data. You can’t build a custom model with document data.

With document datasets, you can generate predictions for expense analysis, identity document
analysis, and document analysis Ready-to-use models. Review the limitations table in the Create a
dataset (p. 301) section to ensure that your document dataset meets the requirements for document
data.
Note
You can only import document datasets from local file upload or an Amazon S3 bucket.

Use the following procedure to import a document dataset into Canvas:

1. Open your SageMaker Canvas application.


2. In the left navigation pane, choose Datasets.
3. Choose Create.
4. From the dropdown menu, choose Document.
5. In the popup dialog box, in the Dataset name field, enter a name for the dataset and choose Create.
6. On the Import page, open the Data Source dropdown menu.
7. Choose your data source. To upload files from your computer, choose Local upload. To import files
from Amazon S3, choose Amazon S3.
8. From your computer or Amazon S3 bucket, select the document files that you want to upload.
9. When you’re ready to import your data, choose Import data.

While your dataset is importing into Canvas, you can see your datasets listed on the Datasets page. From
this page, you can View your dataset details (p. 306).

When the Status of your dataset shows as Ready, Canvas has successfully imported your data.

On the Datasets page, you can choose your dataset to preview it, which shows you up to the first 100
documents of your dataset.

Make predictions with Ready-to-use models


Ready-to-use models are available for text, image, and document data. Each data type has Ready-to-use
models that are designed to work best for each use case. Use the following guide to determine which
Ready-to-use models you can use with your input data:

• Text data: Sentiment analysis, entities extraction, language detection, personal information detection
• Image data: Object detection in images, text detection in images
• Document data: Expense analysis, identity document analysis, document analysis

The following screenshot shows you the landing page for Ready-to-use models, which showcases all of
the different solutions.

Each Ready-to-use model supports both Single predictions and Batch predictions for your dataset. A
Single prediction is when you only need to make one prediction. For example, you have one image from
which you want to extract text, or one paragraph of text for which you want to detect the dominant
language. A Batch prediction is when you’d like to make predictions for an entire dataset. For example,
you might have a CSV file of customer reviews for which you’d like to analyze the customer sentiment, or
you might have image files in which you’d like to detect objects.

When you have your data and have identified your use case, choose one of the following workflows to
make predictions for your data.

Make predictions for text data


The following procedures describe how to make both single and batch predictions for text datasets.
You can use the procedures for the following Ready-to-use model types: sentiment analysis, entities
extraction, language detection, and personal information detection.
Note
For sentiment analysis, you can only use English language text.

Single predictions

To make a single prediction for Ready-to-use models that accept text data, do the following:

1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For text data,
it should be one of the following: Sentiment analysis, Entities extraction, Language detection, or
Personal information detection.
3. On the Run predictions page for your chosen Ready-to-use model, choose Single prediction.
4. For Text field, enter the text for which you’d like to get a prediction.
5. Choose Generate prediction results to get your prediction.

In the Prediction results pane on the right, you receive an analysis of your text along with a Confidence
score for each result or label. For example, if you chose language detection and entered a passage of text
in French, you might get French with a 95% confidence score and traces of other languages, like English,
with a 5% confidence score.

The following screenshot shows the results for a single prediction using language detection where the
model is 100% confident that the passage is English.

Batch predictions

To make batch predictions for Ready-to-use models that accept text data, do the following:

1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For text data,
it should be one of the following: Sentiment analysis, Entities extraction, Language detection, or
Personal information detection.
3. On the Run predictions page for your chosen Ready-to-use model, choose Batch prediction.
4. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you are directed through the import data workflow.
5. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.

After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can Preview the output data. Then, you can choose Download CSV to download the results.

Make predictions for image data


The following procedures describe how to make both single and batch predictions for image datasets.
You can use the procedures for the following Ready-to-use model types: object detection in images and
text detection in images.

Single predictions

To make a single prediction for Ready-to-use models that accept image data, do the following:

1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For image
data, it should be one of the following: Object detection in images or Text detection in images.
3. On the Run predictions page for your chosen Ready-to-use model, choose Single prediction.
4. Choose Upload image.
5. You are prompted to select an image to upload from your local computer. Select the image from
your local files, and then Canvas generates the prediction results.

In the Prediction results pane on the right, you receive an analysis of your image along with a Confidence
score for each object or text detected. For example, if you chose object detection in images, you receive
a list of objects in the image along with a confidence score of how certain the model is that each object
was accurately detected, such as 93%.

The following screenshot shows the results for a single prediction using the object detection in images
solution, where the model predicts objects such as a clock tower and bus with 100% confidence.

Batch predictions
To make batch predictions for Ready-to-use models that accept image data, do the following:

1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For image
data, it should be one of the following: Object detection in images or Text detection in images.
3. On the Run predictions page for your chosen Ready-to-use model, choose Batch prediction.
4. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you are directed through the import data workflow.
5. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.

After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon,
you can choose View prediction results to preview the output data. Then, you can choose Download
prediction and download the results as a CSV or a ZIP file.

Make predictions for document data


The following procedures describe how to make both single and batch predictions for document
datasets. You can use the procedures for the following Ready-to-use model types: expense analysis,
identity document analysis, and document analysis.

Single predictions
To make a single prediction for Ready-to-use models that accept document data, do the following:

1. In the left navigation pane of the Canvas application, choose Ready-to-use models.
2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For document
data, it should be one of the following: Expense analysis, Identity document analysis, or Document
analysis.

3. On the Run predictions page for your chosen Ready-to-use model, choose Single prediction.
4. If your Ready-to-use model is identity document analysis or document analysis, do the following (if
you’re doing expense analysis, skip this step and go to Step 5):

a. Choose Upload document.


b. You are prompted to upload a PDF, JPG, or PNG file from your local computer. Select the
document from your local files, and then Canvas generates the prediction results.
5. If your Ready-to-use model is expense analysis, do the following:

a. Choose Upload invoice or receipt.


b. You are prompted to upload a PDF, JPG, PNG, or TIFF file from your local computer. Select the
document from your local files, and then Canvas generates the prediction results.

In the Prediction results pane on the right, you receive an analysis of your document.

The following information describes the results for each type of solution:

• For expense analysis, the results are categorized into Summary fields, which include fields such as the
total on a receipt, and Line item fields, which include fields such as individual items on a receipt. The
identified fields are highlighted on the document image in the output.
• For identity document analysis, the output shows you the fields that the Ready-to-use model
identified, such as first and last name, address, or date of birth. The identified fields are highlighted on
the document image in the output.
• For document analysis, the results are categorized into Raw text, Forms, Tables, and Signatures. Raw
text includes all of the extracted text, while Forms, Tables, and Signatures only include information
on the form that falls into those categories. For example, Tables only includes information extracted
from tables in the document. The identified fields are highlighted on the document image in the
output.

The following screenshot shows the results for a single prediction using the document analysis solution.

Batch predictions

To make batch predictions for Ready-to-use models that accept document data, do the following:

1. In the left navigation pane of the Canvas application, choose Ready-to-use models.

2. On the Ready-to-use models page, choose the Ready-to-use model for your use case. For document
data, it should be one of the following: Expense analysis, Identity document analysis, or Document
analysis.
3. On the Run predictions page for your chosen Ready-to-use model, choose Batch prediction.
4. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you are directed through the import data workflow.
5. From the list of available datasets, select your dataset and choose Generate predictions. If your use
case is document analysis, continue to Step 6.
6. (Optional) If your use case is Document analysis, another dialog box called Select features to
include in batch prediction appears. You can select Forms, Tables, and Signatures to group the
results by those features. Then, choose Generate predictions.

After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can choose View prediction results to preview the analysis of your document data.

The following information describes the results for each type of solution:

• For expense analysis, the results are categorized into Summary fields, which include fields such as the
total on a receipt, and Line item fields, which include fields such as individual items on a receipt. The
identified fields are highlighted on the document image in the output.
• For identity document analysis, the output shows you the fields that the Ready-to-use model
identified, such as first and last name, address, or date of birth. The identified fields are highlighted on
the document image in the output.
• For document analysis, the results are categorized into Raw text, Forms, Tables, and Signatures. Raw
text includes all of the extracted text, while Forms, Tables, and Signatures only include information
on the form that falls into those categories. For example, Tables only includes information extracted
from tables in the document. The identified fields are highlighted on the document image in the
output.

After previewing your results, you can choose Download prediction and download the results as a ZIP
file.

Use custom models


With Amazon SageMaker Canvas, you can build a custom model that is trained on your data. Training
a custom model on your data captures the characteristics and trends that are most representative of
your data. For example, you might create a custom time series forecasting model, trained on inventory
data from your warehouse, that helps you manage your logistics operations.

You can train a Canvas custom model on the following types of datasets:

• Tabular (including Numeric, Categorical, Timeseries, and Text data)


• Image

The following table shows the types of custom models that you can build in Canvas, along with their
supported data types and data sources:

Model type                       Example use case                      Supported data types    Supported data sources

Numeric prediction               Predicting house prices based on      Numeric                 Local upload, Amazon S3,
                                 features like square footage                                  SaaS connectors

2 category prediction            Predicting whether or not a           Binary or Categorical   Local upload, Amazon S3,
                                 customer is likely to churn                                   SaaS connectors

3+ category prediction           Predicting patient outcomes after     Categorical             Local upload, Amazon S3,
                                 being discharged from the hospital                            SaaS connectors

Time series forecasting          Predicting your inventory for the     Timeseries              Local upload, Amazon S3,
                                 next quarter                                                  SaaS connectors

Single-label image prediction    Predicting types of manufacturing     Image (JPG, PNG)        Local upload, Amazon S3
                                 defects in images

Multi-category text prediction   Predicting categories of products,    Source column: Text     Local upload, Amazon S3
                                 such as clothing, electronics, or
                                 household goods, based on product     Target column: Binary
                                 descriptions                          or Categorical

Get started

To get started with building and generating predictions from a custom model, do the following:

• Determine your use case and type of model that you want to build. For more information about the
custom model types, see Build a custom model (p. 321). For more information about the data types
and sources supported for custom models, see Import data into Canvas (p. 299).
• Import your data into Canvas. You can build a custom model with any tabular or image dataset that
meets the input requirements. For more information about the input requirements, see Create a
dataset (p. 301).

To learn more about SageMaker-provided sample datasets you can experiment with, see Use sample
datasets.
• Build your custom model. You can do a Quick build to get your model and start making predictions
more quickly, or you can do a Standard build for greater accuracy.

For numeric, categorical, and time series forecasting model types, you can clean and prepare your data
with features such as advanced transforms and joins. For image prediction models, you can Edit an
image dataset (p. 328) to update your labels or add and delete images. Note that you can't use these
features for multi-category text prediction models.
• Evaluate your model's performance and determine how well it might perform on real-world data.
• (Optional) For certain model types, you can collaborate with data scientists in Amazon SageMaker
Studio who can help review and improve your model.
• Make single or batch predictions with your model.

Note
If you already have a trained model in Amazon SageMaker Studio that you’d like to share with
Canvas, you can bring your own model to SageMaker Canvas. Review the BYOM prerequisites to
determine whether your model is eligible for sharing.

Import data into Canvas


Amazon SageMaker Canvas supports importing tabular, image, and document data. You can import data
from both local and external data sources into Canvas. Use the datasets that you import to build models
and make predictions for other datasets.

Each use case for which you can build a custom model accepts different types of input. For example,
if you want to build a single-label image classification model, then you should import image data.
For more information about the different model types and the data they accept, see Build a custom
model (p. 321). You can import data and build custom models in SageMaker Canvas for the following
data types:

• Tabular (CSV or tables)


• Categorical – Use categorical data to build custom categorical prediction models for 2 and 3+
category prediction.
• Numeric – Use numeric data to build custom numeric prediction models.
• Text – Use text data to build custom multi-category text prediction models.
• Timeseries – Use timeseries data to build custom time series forecasting models.
• Image (JPG or PNG) – Use image data to build custom single-label image prediction models.
• Document (PDF, JPG, PNG, TIFF) – Document data is only supported for SageMaker Canvas Ready-to-
use models. To learn more about Ready-to-use models that can make predictions for document data,
see Use Ready-to-use models (p. 289).

You can import data into Canvas from the following data sources:

• Local files on your computer


• Amazon S3 buckets
• Amazon Redshift
• AWS Glue Data Catalog through Amazon Athena
• Snowflake
• Over 40 external SaaS platforms, such as SAP OData

For a full list of data sources you can import from, see the following table:

Source Type Supported data types

Local file upload Local Tabular, Image, Document

Amazon S3 bucket Amazon internal Tabular, Image, Document

Amazon Redshift Amazon internal Tabular

AWS Glue Data Catalog (through Amazon Athena) Amazon internal Tabular

Snowflake External Tabular

Amplitude External SaaS platform Tabular

CircleCI External SaaS platform Tabular

DocuSign Monitor External SaaS platform Tabular

Domo External SaaS platform Tabular

Datadog External SaaS platform Tabular

Dynatrace External SaaS platform Tabular

Facebook Ads External SaaS platform Tabular

Facebook Page Insights External SaaS platform Tabular

Google Ads External SaaS platform Tabular

Google Analytics 4 External SaaS platform Tabular

Google Search Console External SaaS platform Tabular

GitHub External SaaS platform Tabular

GitLab External SaaS platform Tabular

Infor Nexus External SaaS platform Tabular

Instagram Ads External SaaS platform Tabular

Jira Cloud External SaaS platform Tabular

LinkedIn Ads External SaaS platform Tabular

Mailchimp External SaaS platform Tabular

Marketo External SaaS platform Tabular

Microsoft Teams External SaaS platform Tabular

Mixpanel External SaaS platform Tabular

Okta External SaaS platform Tabular

Salesforce External SaaS platform Tabular

Salesforce Marketing Cloud External SaaS platform Tabular

Salesforce Pardot External SaaS platform Tabular

SAP OData External SaaS platform Tabular

SendGrid External SaaS platform Tabular

ServiceNow External SaaS platform Tabular

Singular External SaaS platform Tabular

Slack External SaaS platform Tabular

Stripe External SaaS platform Tabular

Trend Micro External SaaS platform Tabular

Typeform External SaaS platform Tabular

Veeva External SaaS platform Tabular

Zendesk External SaaS platform Tabular

Zendesk Chat External SaaS platform Tabular

Zendesk Sell External SaaS platform Tabular

Zendesk Sunshine External SaaS platform Tabular

Zoom Meetings External SaaS platform Tabular

For instructions on how to import data and information regarding input data requirements, such as the
maximum file size for images, see Create a dataset (p. 301).

Canvas also provides several sample datasets in your application to help you get started. To learn more
about the SageMaker-provided sample datasets you can experiment with, see Use sample datasets.

After you import a dataset into Canvas, you can update the dataset at any time. You can do a manual
update or you can set up a schedule for automatic dataset updates. For more information, see Update a
dataset (p. 308).

For more information specific to each dataset type, see the following sections:

Tabular

To import data from an external data source (such as a Snowflake database or a SaaS platform), you
must authenticate and connect to the data source in the Canvas application. For more information, see
Connect to data sources (p. 310).

After creating datasets in Canvas, you can join multiple datasets into a single dataset. Joining datasets
is only supported for tabular datasets. As long as your data is arranged into tables, you can join datasets
from various sources, such as Amazon Redshift, Amazon Athena, or Snowflake. For information about
joining datasets, see Join data that you've imported into SageMaker Canvas (p. 317).

Image

For information about how to edit an image dataset and perform tasks such as assigning or reassigning
labels, adding images, or deleting images, see Edit an image dataset (p. 328).

Create a dataset
The following sections describe how to create a dataset in Amazon SageMaker Canvas. For custom
models, you can create datasets for tabular and image data. Choose your workflow based on the
following information:

• For categorical, numeric, text, and timeseries data, see Import tabular data (p. 303).
• For image data, see Import image data (p. 305).

Note
For information about how to import a document dataset for Ready-to-use models that accept
document data, see the Import document data (p. 292) workflow in the Ready-to-use models
documentation.

A dataset can consist of multiple files. For example, you might have multiple files of inventory data in
CSV format. You can upload these files together as a dataset as long as the schema (or column names
and data types) of the files match.

Canvas also supports managing multiple versions of your dataset. When you create a dataset, the first
version is labeled as V1. You can create a new version of your dataset by updating your dataset. You can
do a manual update, or you can set up an automated schedule for updating your dataset with new data.
For more information, see Update a dataset (p. 308).

When you import your data into Canvas, make sure that it meets the requirements in the following table.
The limitations are specific to the type of model you’re building.

Limit                                  2 category, 3+ category,    Text prediction      Image prediction   *Document data for
                                       numeric, and time           models               models             Ready-to-use models
                                       series models

Supported file types                   CSV (local upload,          CSV (local upload,   JPG, PNG           PDF, JPG, PNG, TIFF
                                       Amazon S3, or databases)    Amazon S3, or
                                                                   databases)
                                       JSON (databases)
                                                                   JSON (databases)

Maximum file size                      5 GB (for all files in      5 MB (for all files  30 MB per image    5 MB per document
                                       the dataset)                in the dataset)

Maximum number of files in             50                          50                   N/A                N/A
tabular datasets

Maximum number of files in             20                          20                   N/A                N/A
tabular datasets for a single
manual upload

Maximum number of columns              1000                        1000                 N/A                N/A

Maximum number of entries (rows,       50,000                      7500                 5000               N/A
images, or documents) for Quick
builds

Maximum number of entries (rows,       N/A                         150,000              180,000            N/A
images, or documents) for
Standard builds

Minimum number of entries (rows)       2 category: 500             N/A                  N/A                N/A
for Quick builds
                                       3+ category, numeric,
                                       time series: N/A

Minimum number of entries (rows,       250                         50                   50                 N/A
images, or documents) for
Standard builds

Minimum number of entries (rows        N/A                         25                   25                 N/A
or images) per label

Minimum number of labels               2 category: 2               2                    2                  N/A

                                       3+ category: 3

                                       Numeric, time series: N/A

Minimum sample size for random         500                         N/A                  N/A                N/A
sampling

Maximum sample size for random         40,000                      N/A                  N/A                N/A
sampling

Maximum number of labels               2 category: 2               1000                 1000               N/A

                                       3+ category, numeric,
                                       time series: N/A

*Document data is currently only supported for Ready-to-use models (p. 289) that accept document
data. You can't build a custom model with document data.

Also note the following restrictions:

• For tabular data, CSV files must be comma delimited and not have newline characters except when
denoting a new row.
• For image data, if you have any unlabeled images, you must label them before building your model.
For information about how to assign labels to images within the Canvas application, see Edit an image
dataset (p. 328).
• If you set up automatic dataset updates or automatic batch prediction configurations, you can only
create a total of 20 configurations in your Canvas application. For more information, see Manage
automations (p. 375).

After you import a dataset, you can view your datasets on the Datasets page at any time.

Import tabular data


With tabular datasets, you can build categorical, numeric, time series forecasting, and text prediction
models. Review the limitations table in the preceding Create a dataset (p. 301) section to ensure that
your data meets the requirements for tabular data (note that the sample size limits only apply when
previewing your data before building your model).

Use the following procedure to import a tabular dataset into Canvas:

1. Open your SageMaker Canvas application.


2. In the left navigation pane, choose Datasets.
3. Choose Import.
4. In the popup dialog box, in the Dataset name field, enter a name for the dataset and choose Create.
5. On the Import page, open the Data Source dropdown menu.
6. Choose your data source:

• To upload files from your computer, choose Local upload.


• To import data from another source, such as an Amazon S3 bucket or a Snowflake database,
search for your data source in the Search data source bar. Then, choose the tile for your desired
data source.
Note
You can only import data from the tiles that have an active connection. If you want to
connect to a data source that is unavailable to you, contact your administrator. If you’re
an administrator, see Connect to data sources (p. 310).

The following screenshot shows the Data Source dropdown menu.

7. (Optional) If you’re connecting to an Amazon Redshift or Snowflake database for the first time, a
dialog box appears to create a connection. Fill out the dialog box with your credentials and choose
Create connection. If you already have a connection, choose your connection.
8. From your data source, select your files to import. For local upload and importing from Amazon
S3, you can select files. For database sources, you can drag-and-drop data tables from the left
navigation pane.
9. (Optional) For tabular data sources that support SQL querying (such as Amazon Redshift, Amazon
Athena, or Snowflake), you can choose Edit in SQL to make SQL queries and join tables before
importing them. For more information, see Join data that you've imported into SageMaker
Canvas (p. 317).

The following screenshot shows the Edit SQL view for an Amazon Athena data source.

10. (Optional) You can choose Preview to preview your dataset before importing. For tabular datasets,
this shows you up to the first 100 rows of your dataset. The following screenshot shows you the
Import preview screen.
11. When you’re ready to import your data, choose Import data.

While your dataset is importing into Canvas, you can see your datasets listed on the Datasets page. From
this page, you can View your dataset details (p. 306).

When the Status of your dataset shows as Ready, Canvas successfully imported your data and you can
proceed with building a model.

If you have a connection to a data source, such as an Amazon Redshift database or a SaaS connector, you
can return to that connection. For Amazon Redshift and Snowflake, you can add another connection by
creating another dataset, returning to the Import data page, and choosing the Data Source tile for that
connection. From the dropdown menu, you can open the previous connection or choose Add connection.
Note
For SaaS platforms, you can only have one connection per data source.

Import image data

With image datasets, you can build single-label image prediction custom models, which predict a label
for an image. Review the limitations table in the preceding Create a dataset (p. 301) section to ensure
that your image dataset meets the requirements for image data.
Note
You can only import image datasets from local file upload or an Amazon S3 bucket. Also, for
image datasets, you must have at least 25 images per label.

Use the following procedure to import an image dataset into Canvas:

1. Open your SageMaker Canvas application.


2. In the left navigation pane, choose Datasets.
3. Choose Create.
4. From the dropdown menu, choose Image.
5. In the popup dialog box, in the Dataset name field, enter a name for the dataset and choose Create.
6. On the Import page, open the Data Source dropdown menu.
7. Choose your data source. To upload files from your computer, choose Local upload. To import files
from Amazon S3, choose Amazon S3.

8. From your computer or Amazon S3 bucket, select the images or folders of images that you want to
upload.
9. When you’re ready to import your data, choose Import data.

While your dataset is importing into Canvas, you can see your datasets listed on the Datasets page. From
this page, you can View your dataset details (p. 306).

When the Status of your dataset shows as Ready, Canvas successfully imported your data and you can
proceed with building a model.

When you are building your model, you can edit your image dataset, and you can assign or re-assign
labels, add images, or delete images from your dataset. For more information about how to edit your
image dataset, see Edit an image dataset (p. 328).

View your dataset details

For each of your datasets, you can view all of the files in a dataset, the dataset’s version history, and any
auto update configurations for the dataset. From the Datasets page, you can also initiate actions such as
Update a dataset (p. 308) or Build a custom model (p. 321).

To view the details for a dataset, do the following:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose Datasets.
3. From the list of datasets, choose your dataset.

On the Data tab, you can see a preview of your data. If you choose Dataset details, you can see all of
the files that are part of your dataset. Choose a file to see only the data from that file in the preview. For
image datasets, the preview only shows you the first 100 images of your dataset.

On the Version history tab, you can see a list of all of the versions of your dataset. A new version
is made whenever you update a dataset. To learn more about updating a dataset, see Update a
dataset (p. 308). The following screenshot shows the Version history tab in the Canvas application.

On the Auto updates tab, you can enable auto updates for the dataset and set up a configuration to
update your dataset on a regular schedule. To learn more about setting up auto updates for a dataset,
see Configure automatic updates for a dataset (p. 309). The following screenshot shows the Auto
updates tab with auto updates turned on and a list of auto update jobs that have been performed on the
dataset.

Update a dataset
After importing your initial dataset into Amazon SageMaker Canvas, you might have additional data that
you want to add to your dataset. For example, you might get inventory data at the end of every week
that you want to add to your dataset. Instead of importing your data multiple times, you can update
your existing dataset and add or remove files from it.
Note
You can only update datasets that you have imported through local upload or Amazon S3.

You can update your dataset either manually or automatically. With automatic updates, you specify
a location where Canvas checks for files at a frequency you specify. If you import new files during the
update, the schema of the files must match the existing dataset exactly.
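
Because a mismatched schema causes the update to fail, you might want to compare a new file against your existing data before importing it. The following is a minimal sketch using pandas; the file names are hypothetical.

import pandas as pd

# Hypothetical file names; replace with your own paths.
existing = pd.read_csv("inventory_week_1.csv")
new_file = pd.read_csv("inventory_week_2.csv")

# Column names and data types must match the existing dataset exactly.
same_columns = list(existing.columns) == list(new_file.columns)
same_dtypes = existing.dtypes.equals(new_file.dtypes)
print("Schema matches:", same_columns and same_dtypes)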

Every time you update your dataset, Canvas creates a new version of your dataset. You can only use
the latest version of your dataset to build a model or generate predictions. For more information about
viewing the version history of your dataset, see View your dataset details (p. 306).

You can also use dataset updates with automated batch predictions, which starts a batch prediction job
whenever you update your dataset. For more information, see Make batch predictions (p. 360).

The following sections describe how to do manual and automatic updates to your dataset.

Manually update a dataset

To do a manual update, do the following:

1. Open the SageMaker Canvas application.

2. In the left navigation pane, choose Datasets.


3. From the list of datasets, choose the dataset you want to update.
4. Choose the Update dataset dropdown menu and choose Manual update. You are taken to the import
data workflow.
5. From the Data source dropdown menu, choose either Local upload or Amazon S3.
6. The page shows you a preview of your data. From here, you can add or remove files from the dataset.
If you’re importing tabular data, the schema of the new files (column names and data types) must
match the schema of the existing files. Additionally, your new files must not exceed the maximum
dataset size or file size. For more information about these limitations, see Import a dataset.
Note
If you add a file with the same name as an existing file in your dataset, the new file overwrites
the old version of the file.
7. When you’re ready to save your changes, choose Update dataset.

You should now have a new version of your dataset.

On the Datasets page, you can choose the Version history tab to see all of the versions of your dataset
and the history of both manual and automatic updates you’ve made.

Configure automatic updates for a dataset

An automatic update is when you set up a configuration for Canvas to update your dataset at a given
frequency. We recommend that you use this option if you regularly receive new files of data that you
want to add to your dataset.

When you set up the auto update configuration, you specify an Amazon S3 location where you upload
your files and a frequency at which Canvas checks the location and imports files. Each instance of Canvas
updating your dataset is referred to as a job. For each job, Canvas imports all of the files in the Amazon
S3 location. If you have new files with the same names as existing files in your dataset, Canvas overwrites
the old files with the new files.
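
One common pattern is to copy each new file into the configured Amazon S3 location and let the next scheduled job pick it up, since every job imports all of the files at that location. The following is a minimal sketch using boto3; the bucket name, prefix, and file name are hypothetical.

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="inventory_week_2.csv",                    # local file to add
    Bucket="example-bucket",                            # hypothetical bucket
    Key="canvas-auto-update/inventory_week_2.csv",      # watched folder (prefix)
)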

For automatic dataset updates, Canvas doesn’t perform schema validation. If the schema of the files
imported during an automatic update doesn’t match the schema of the existing files, or if the files
exceed the size limitations (see Import a dataset for a table of file size limitations), then you get errors
when your jobs run.
Note
You can only set up a maximum of 20 automatic configurations in your Canvas application.
Additionally, Canvas only does automatic updates while you’re logged in to your Canvas
application. If you log out of your Canvas application, automatic updates pause until you log
back in.

To configure automatic updates for your dataset, do the following:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose Datasets.
3. From the list of datasets, choose the dataset you want to update.
4. Choose the Update dataset dropdown menu and choose Automatic update. You are taken to the
Auto updates tab for the dataset.
5. Turn on the Auto update enabled toggle.
6. For Specify a data source, enter the Amazon S3 path to a folder where you plan to regularly upload
files.
7. For Choose a frequency, select Hourly, Weekly, or Daily.
8. For Specify a starting time, use the calendar and time picker to select when you want the first auto
update job to start.

9. When you’re ready to create the auto update configuration, choose Save.

Canvas begins the first job of your auto update cadence at the specified starting time.

For more information about viewing your auto update job history or making changes to your
auto update configuration through the Automations page in the Canvas application, see Manage
automations (p. 375).

The following sections describe how to view, update, and delete your automatic update configuration
through the Datasets page in the Canvas application.

View your automatic dataset update jobs

To view the job history for your automatic dataset updates, on your dataset details page, choose the
Auto updates tab.

Each automatic update to a dataset shows as a job in the Auto updates tab under the Job history
section. For each job, you can see the following:

• Job created – The timestamp for when Canvas started updating the dataset.
• Files – The number of files in the dataset.
• Cells (Columns x Rows) – The number of columns and rows in the dataset.
• Status – The status of the dataset after the update. If the job was successful, the status is Ready. If the
job failed for any reason, the status is Failed, and you can hover over the status for more details.

Edit your automatic dataset update configuration

You might want to make changes to your auto update configuration for a dataset, such as changing the
frequency of the updates. You might also want to turn off your automatic update configuration to pause
the updates to your dataset.

To make changes to your auto update configuration for a dataset, go to the Auto updates tab of your
dataset and choose Edit to make changes to the configuration.

To pause your dataset updates, turn off your automatic configuration. You can turn off auto updates by
going to the Auto updates tab of your dataset and turning the Enable auto updates toggle off. You can
turn this toggle back on at any time to resume the update schedule.

Delete your automatic dataset update configuration

To learn how to delete your configuration, see Delete an automatic configuration (p. 377).

Connect to data sources


In Amazon SageMaker Canvas, you can import data from a location outside of your local file system
through an AWS service or a SaaS platform. For example, you might want to import tables from a data
warehouse in Amazon Redshift, or you might want to import Google Analytics data.

When you go through the Import workflow to import data in the Canvas application, you can choose
your data source and then select the data that you want to import. For certain data sources, like
Snowflake and Amazon Redshift, you must specify your credentials and add a connection to the data
source.

The following screenshot shows the data sources toolbar in the Import workflow, with all of the
available data sources highlighted. You can only import data from the data sources that are available to
you. Contact your administrator if your desired data source isn’t available.

The following sections provide information about importing data from AWS services (like Amazon
Redshift) and from SaaS platforms (such as Snowflake or Facebook Ads). Review the following section
first to determine what permissions you need to import data from your data source.

Permissions
Review the following information to ensure that you have the necessary permissions to import data from
your data source:

• Amazon S3: You can import data from any Amazon S3 bucket as long as your user has permissions to
access the bucket. For more information about using AWS IAM to control access to Amazon S3 buckets,
see Identity and access management in Amazon S3 in the Amazon S3 User Guide.
• Amazon Athena: If you have the AmazonSageMakerFullAccess policy and the
AmazonSageMakerCanvasFullAccess policy attached to your user’s execution role, then you’ll
be able to query your AWS Glue Data Catalog with Amazon Athena. If you’re part of an Athena
workgroup, make sure that the Canvas user has permissions to run Athena queries on the data. For
more information, see Using workgroups for running queries in the Amazon Athena User Guide.
• Amazon Redshift: To give yourself the necessary permissions to import data from Amazon Redshift,
see Grant Users Permissions to Import Amazon Redshift Data.
• SaaS platforms: If you have the AmazonSageMakerFullAccess policy and the
AmazonSageMakerCanvasFullAccess policy attached to your user’s execution role, then you’ll have
the necessary permissions to import data from SaaS platforms. See Use SaaS connectors with
Canvas (p. 316) for more information about connecting to a specific SaaS connector.
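
For example, an administrator could attach the two managed policies mentioned above to a user's execution role with boto3. The role name below is hypothetical; substitute the execution role used by your Canvas user.

import boto3

iam = boto3.client("iam")

# Hypothetical role name; use the execution role for your Canvas user.
role_name = "MySageMakerExecutionRole"
for policy_arn in (
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    "arn:aws:iam::aws:policy/AmazonSageMakerCanvasFullAccess",
):
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)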

Connect to a database stored in AWS


You might want to import data that you’ve stored in AWS. You can import data from Amazon S3, use
Amazon Athena to query a database in the AWS Glue Data Catalog, or make a connection to an Amazon
Redshift database.

You can create multiple connections to Amazon Redshift. For Amazon Athena, you can access any
databases that you have in your AWS Glue Data Catalog. For Amazon S3, you can import data from a
bucket as long as you have the necessary permissions.

Review the following sections for more detailed information.

Connect to data in Amazon S3 or with Amazon Athena


For Amazon S3, you can import data from an Amazon S3 bucket as long as you have permissions to
access the bucket.

For Amazon Athena, you can access databases in your AWS Glue Data Catalog as long as you have
permissions through your Amazon Athena workgroup.

To import data from an Amazon S3 bucket, or to run queries and import data tables with Amazon
Athena, see Create a dataset (p. 301). You can only import tabular data from Amazon Athena, and you
can import tabular and image data from Amazon S3.

Connect to an Amazon Redshift database

You can import data from Amazon Redshift, a data warehouse where your organization keeps its
data. Before you can import data from Amazon Redshift, the AWS IAM role you use must have the
AmazonRedshiftFullAccess managed policy attached. For instructions on how to attach this policy,
see Grant Users Permissions to Import Amazon Redshift Data (p. 281).

To import data from Amazon Redshift, you do the following:

1. Create a connection to an Amazon Redshift database.


2. Choose the data that you're importing.
3. Import the data.

You can use the Amazon Redshift editor to drag datasets onto the import pane and import them into
SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:

• SQL queries
• Joins

SQL queries give you the ability to customize how you import the values in the dataset. For example, you
can specify the columns returned in the dataset or the range of values for a column.
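
For example, the following query (shown here as a Python string, with hypothetical table and column names) is the kind of statement you might enter in the Edit in SQL view to limit both the columns returned and the range of values for a column.

# Hypothetical table and column names; a statement like this in the
# Edit in SQL view controls what gets imported.
query = """
SELECT customer_id, loan_amount, loan_status
FROM public.loans
WHERE loan_amount BETWEEN 1000 AND 50000
"""
print(query)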

You can use joins to combine multiple datasets from Amazon Redshift into a single dataset. Drag
your datasets from Amazon Redshift into the panel to join them.

You can use the SQL editor to edit the joined dataset and convert it into a single node, and you can
join additional datasets to that node. When you're done, you can import the data that you've selected
into SageMaker Canvas.

Use the following procedure to import data from Amazon Redshift.

1. In the SageMaker Canvas application, go to the Datasets page.


2. Choose Import.
3. For Data Source, open the dropdown menu and choose Redshift.
4. Choose Add connection.
5. In the dialog box, specify your Amazon Redshift credentials.
6. From the tab that has the name of your connection, drag the table that you're importing to the
Drag and drop table to import pane.
7. Optional: Drag additional tables to the import pane. You can use the GUI to join the tables. For more
specificity in your joins, choose Edit in SQL.
8. Optional: If you're using SQL to query the data, you can choose Context to add context to the
connection by specifying values for the following:

• Warehouse
• Database
• Schema
9. Choose Import data.

The following image shows an example of fields specified for an Amazon Redshift connection.

The following image shows the page used to join datasets in Amazon Redshift.

The following image shows an SQL query being used to edit a join in Amazon Redshift.

Connect to a SaaS platform


You can import data from Snowflake and over 40 other external SaaS platforms. For a full list of the
connectors, see the table on Import data into Canvas (p. 299).

Note
You can only import tabular data, such as data tables, from SaaS platforms.

Use Snowflake with Amazon SageMaker Canvas

Snowflake is a data storage and analytics service, and you can import your data from Snowflake into
SageMaker Canvas. For more information about Snowflake, see the Snowflake documentation.

You can import data from your Snowflake account by doing the following:

1. Create a connection to the Snowflake database.


2. Choose the data that you're importing by dragging and dropping the table from the left navigation
menu into the editor.
3. Import the data.

You can use the Snowflake editor to drag datasets onto the import pane and import them into
SageMaker Canvas. For more control over the values returned in the dataset, you can use the following:

• SQL queries
• Joins

SQL queries give you the ability to customize how you import the values in the dataset. For example, you
can specify the columns returned in the dataset or the range of values for a column.

You can join multiple Snowflake datasets into a single dataset before you import into Canvas, using
SQL or the Canvas interface. Drag your datasets from Snowflake into the panel to join them, or edit
the joins in SQL and convert the SQL into a single node. You can then join other nodes, including other
Snowflake datasets, to the node that you've converted. Finally, you can import the data that you've
selected into Canvas.

Use the following procedure to import data from Snowflake to Amazon SageMaker Canvas.

1. In the SageMaker Canvas application, go to the Datasets page.


2. Choose Import.
3. For Data Source, open the dropdown menu and choose Snowflake.
4. Choose Add connection.
5. In the Add a new Snowflake connection dialog box, specify your Snowflake credentials.
6. Choose Add connection.
7. From the tab that has the name of your connection, drag the table that you're importing to the
Drag and drop table to import pane.
8. Optional: Drag additional tables to the import pane. You can use the user interface to join the
tables. For more specificity in your joins, choose Edit in SQL.
9. Optional: If you're using SQL to query the data, you can choose Context to add context to the
connection by specifying values for the following:

• Warehouse
• Database
• Schema

Adding context to a connection makes it easier to specify future queries.


10. Choose Import data.

The following image shows an example of fields specified for a Snowflake connection.

The following image shows the page used to add context to a connection.

The following image shows the page used to join datasets in Snowflake.

The following image shows a SQL query being used to edit a join in Snowflake.

Use SaaS connectors with Canvas


Note
For SaaS platforms besides Snowflake, you can only have one connection per data source.

Before you can import data from a SaaS platform, your administrator must authenticate and create a
connection to the data source. For more information about how administrators can create a connection
with a SaaS platform, see Managing Amazon AppFlow connections in the Amazon AppFlow User Guide.

If you’re an administrator getting started with Amazon AppFlow for the first time, see Getting started in
the Amazon AppFlow User Guide.

To import data from a SaaS platform, you can follow the standard Import tabular data (p. 303)
procedure, which shows you how to import tabular datasets into Canvas.

Join data that you've imported into SageMaker Canvas


Note
You can only make joins for tabular datasets in SageMaker Canvas.

You can use Amazon SageMaker Canvas to join multiple datasets into a single dataset. By default,
SageMaker Canvas automatically joins the datasets on their matching column names. Combining
multiple datasets can give you the ability to get more insight from the models that you build.

You can make the following joins for your datasets:

• Inner – Returns a dataset with matching values in both datasets.


• Left – Returns a dataset that has:
• All the rows from the dataset to the left of the join.
• All the rows from the dataset to the right of the join that have matching values with the columns to
the left of the join.
• Right – Returns a dataset that has:
• All the rows from the dataset to the right of the join.
• All the rows from the dataset to the left of the join that have matching values with the columns to
the right of the join.
• Outer – Returns all the rows when there is a match in either the left or the right dataset. The dataset
from an outer join might have null values that SageMaker Canvas might impute when you build a
model.
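
These join types follow standard relational semantics. If you want to reason about what each one returns, the following pandas sketch (with hypothetical toy data) mirrors the behavior; Canvas's own implementation may differ in details such as null handling.

import pandas as pd

# Toy frames with a shared column name; Canvas joins on matching column
# names by default, which pandas mirrors with the "on" argument.
left = pd.DataFrame({"id": [1, 2, 3], "sales": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "region": ["east", "west", "north"]})

inner = left.merge(right, on="id", how="inner")       # rows matching in both
left_join = left.merge(right, on="id", how="left")    # all rows from the left
right_join = left.merge(right, on="id", how="right")  # all rows from the right
outer = left.merge(right, on="id", how="outer")       # match in either; may contain nulls
print(outer)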

Use the following procedure to join your datasets.

To join datasets, do the following.

1. Navigate to the Datasets page.


2. Choose Join data.
3. Drag and drop the datasets that you're joining into the Drag and drop datasets to join box.
4. Configure the join. Amazon SageMaker Canvas shows you a preview of the joined data after you
configure it.
5. Choose Save joined data to save the output of the join.

The following images show the workflow of the preceding procedure.

Use sample datasets


SageMaker Canvas provides sample datasets addressing unique use cases so you can start building,
training, and validating models quickly without writing any code. The use cases associated with these
datasets highlight the capabilities of SageMaker Canvas, and you can leverage these datasets to get
started with building models. You can find the sample datasets in the Datasets page of your SageMaker
Canvas application.

Sample datasets

The following datasets are the samples that SageMaker Canvas provides by default. These datasets
cover use cases such as predicting house prices, loan defaults, and readmission for diabetic patients;
forecasting sales; predicting machine failures to streamline predictive maintenance in manufacturing
units; and generating supply chain predictions for transportation and logistics. The datasets are stored in
the sample_dataset folder in the default Amazon S3 bucket that SageMaker creates for your account
in a Region.

• canvas-sample-diabetic-readmission.csv: This dataset contains historical data including over fifteen


features with patient and hospital outcomes. You can use this dataset to predict whether high-
risk diabetic patients are likely to get readmitted to the hospital within 30 days of discharge, after
30 days, or not at all. Use the readmitted column as the target column, and use the 3+ category
prediction model type with this dataset. To learn more about how to build a model with this dataset,
see the SageMaker Canvas workshop page. This dataset was obtained from the UCI Machine Learning
Repository.
• canvas-sample-housing.csv: This dataset contains data on the characteristics tied to a given housing
price. You can use this dataset to predict housing prices. Use the median_house_value column as the
target column, and use the Numeric prediction model type with this dataset. To learn more about
building a model with this dataset, see the SageMaker Canvas workshop page. This is the California
housing dataset obtained from the StatLib repository.
• canvas-sample-loans.csv: This dataset contains complete loan data for all loans issued from 2007–
2011, including the current loan status and latest payment information. You can use this dataset
to predict whether a customer will repay a loan. Use the loan_status column as the target column,
and use the 3+ category prediction model type with this dataset. To learn more about how to build a
model with this dataset, see the SageMaker Canvas workshop page. This data uses the LendingClub
data obtained from Kaggle.
• canvas-sample-maintenance.csv: This dataset contains data on the characteristics tied to a given
maintenance failure type. You can use this dataset to predict which failure will occur in the future. Use
the Failure Type column as the target column, and use the 3+ category prediction model type with
this dataset. To learn more about how to build a model with this dataset, see the SageMaker Canvas
workshop page. This dataset was obtained from the UCI Machine Learning Repository.
• canvas-sample-shipping-logs.csv: This dataset contains complete shipping data for all products
delivered, including estimated time, shipping priority, carrier, and origin. You can use this dataset to
predict the estimated time of arrival of the shipment in number of days. Use the ActualShippingDays
column as the target column, and use the Numeric prediction model type with this dataset. To learn
more about how to build a model with this data, see the SageMaker Canvas workshop page. This is a
synthetic dataset created by Amazon.
• canvas-sample-sales-forecasting.csv: This dataset contains historical time series sales data for retail
stores. You can use this dataset to forecast sales for a particular retail store. Use the sales column as
the target column, and use the Time series forecasting model type with this dataset. To learn more
about how to build a model with this dataset, see the SageMaker Canvas workshop page. This is a
synthetic dataset created by Amazon.

Re-import a deleted sample dataset

If you no longer wish to use the sample datasets, you can delete them from the Datasets page of your
SageMaker Canvas application. However, these datasets are still stored in the default SageMaker-created
Amazon S3 bucket for your account, so you can always access them later.

The default Amazon S3 bucket name where the datasets are stored follows the pattern
sagemaker-{region}-{account ID}. You can find the sample datasets in the directory path
Canvas/sample_dataset.
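
For example, you can list the sample dataset files with boto3. This sketch assumes your credentials resolve to the same account and Region as your Canvas application.

import boto3

session = boto3.session.Session()
account_id = session.client("sts").get_caller_identity()["Account"]

# The default bucket follows the pattern sagemaker-{region}-{account ID}.
bucket = f"sagemaker-{session.region_name}-{account_id}"

s3 = session.client("s3")
response = s3.list_objects_v2(Bucket=bucket, Prefix="Canvas/sample_dataset/")
for obj in response.get("Contents", []):
    print(obj["Key"])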

If you delete a sample dataset from your SageMaker Canvas application and want to access the sample
dataset again, use the following procedure.

1. Navigate to the Datasets page in your SageMaker Canvas application.


2. Choose Import data.
3. From the list of S3 buckets, select the default SageMaker S3 bucket for your account, which follows
the naming pattern sagemaker-{region}-{account ID}.
4. Select the Canvas folder.
5. Select the sample_dataset folder, which contains all of the sample datasets for SageMaker Canvas.
6. Select the dataset you want to import, and then choose Import data.

Build a custom model


Use Amazon SageMaker Canvas to build a custom model on the dataset that you've imported. Use the
model that you've built to make predictions on new data. SageMaker Canvas uses the information in the
dataset to build up to 250 models and choose the one that performs the best.

When you begin building a model, Canvas automatically recommends one or more model types. Model
types fall into one of the following categories:

• Numeric prediction – This is known as regression in machine learning. Use the numeric prediction
model type when you want to make predictions for numeric data. For example, you might want to
predict the price of houses based on features such as the house’s square footage.
• Categorical prediction – This is known as classification in machine learning. When you want to
categorize data into groups, use the categorical prediction model types:
• 2 category prediction – Use the 2 category prediction model type (also known as binary
classification in machine learning) when you have two categories that you want to predict for your
data. For example, you might want to determine whether a customer is likely to churn.
• 3+ category prediction – Use the 3+ category prediction model type (also known as multi-class
classification in machine learning) when you have three or more categories that you want to predict
for your data. For example, you might want to predict a customer's loan status based on features
such as previous payments.
• Time series forecasting – Use time series forecasts when you want to make predictions over a period
of time. For example, you might want to predict the number of items you’ll sell in the next quarter. For
information about time series forecasts, see Time Series Forecasts in Amazon SageMaker Canvas.
• Image prediction – Use the single-label image prediction model type (also known as single-label
image classification in machine learning) when you want to assign labels to images. For example, you
might want to classify different types of manufacturing defects in images of your product.
• Text prediction – Use the multi-category text prediction model type (also known as multi-class text
classification in machine learning) when you want to assign labels to passages of text. For example,
you might have a dataset of customer reviews for a product, and you want to determine whether
customers liked or disliked the product. You might have your model predict whether a given passage
of text is Positive, Negative, or Neutral.

For a table of the supported input data types for each model type, see Use custom models (p. 297).

For each tabular data model that you build (which includes numeric, categorical, time series forecasting,
and text prediction models), you choose the Target column. The Target column is the column that
contains the information that you want to predict. For example, if you're building a model to predict
whether people have cancelled their subscriptions, the Target column contains data points that are
either a yes or a no about someone's cancellation status.

For image prediction models, you build the model with a dataset of images that have been assigned
labels. For the unlabeled images that you provide, the model predicts a label. For example, if you’re
building a model to predict whether an image is a cat or a dog, you provide images labeled as cats or
dogs when building the model. Then, the model can accept unlabeled images and predict them as either
cats or dogs.

What happens when you build a model

To build your model, you can choose either a Quick build or a Standard build. The Quick build has a
shorter build time, but the Standard build generally has a higher accuracy. The following table outlines
the average build times for each model and build type, along with the minimum and maximum number
of data points you should have for each build type.

Limit | Numeric and categorical prediction | Time series forecasting | Image prediction | Text prediction
Quick build time | 2–20 minutes | 2–20 minutes | 15–30 minutes | 15–30 minutes
Standard build time | 2–4 hours | 2–4 hours | 2–5 hours | 2–5 hours
Maximum number of entries (rows or images) for Quick builds | 50,000 | 50,000 | 5,000 | 7,500

If you log out while running a Quick build, your build might be interrupted until you log in again. When
you log in again, Canvas resumes the Quick build.

Canvas predicts values by using the information in the rest of the dataset, depending on the model type:

• For categorical prediction, Canvas puts each row into one of the categories listed in the Target
column.
• For numeric prediction, Canvas uses the information in the dataset to predict the numeric values in the
Target column.
• For time series forecasting, Canvas uses historical data to predict values for the Target column in the
future.
• For image prediction, Canvas uses images that have been assigned labels to predict labels for
unlabeled images.
• For text prediction, Canvas analyzes text data that has been assigned labels to predict labels for
passages of unlabeled text.

Additional features to help you build your model


Note
The following features are available for numeric and categorical prediction and time series
forecasting models.

Before building your model, you can filter your data or prepare it using advanced transforms. For
more information about preparing your data for model building, see Prepare data with advanced
transformations (p. 338).

You can also use visualization and analytics to explore your data and determine which features are best
to include in your model. For more information, see Explore and analyze your data.

To learn more about additional features such as previewing your model, validating your dataset, and
changing the size of the random sample used to build your model, see Preview your model (p. 326).

For tabular datasets with multiple columns (such as datasets for building categorical, numeric, or time
series forecasting model types), you might have rows with missing data points. While Canvas builds
the model, it automatically adds missing values. Canvas uses the values in your dataset to perform a
mathematical approximation for the missing values. For the highest model accuracy, we recommend
adding in the missing data if you can find it. Note that the missing data feature is not supported for text
prediction or image prediction models.
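
Canvas doesn't document the exact approximation it uses, but if you want to fill in missing values yourself before importing, a common convention is mean imputation for numeric columns and mode imputation for categorical ones. The following is a minimal pandas sketch with hypothetical data.

import pandas as pd

df = pd.DataFrame({"price": [100.0, None, 120.0], "city": ["a", "b", None]})

# One simple convention: mean for numeric columns, mode for categorical ones.
df["price"] = df["price"].fillna(df["price"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)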

Get started

To get started with building a custom model, see Build a model (p. 323) and follow the procedure for
the type of model that you want to build.

Build a model
The following sections show you how to build a model for each of the main types of custom models.

• To build numeric prediction, 2 category prediction, or 3+ category prediction models, see Build a
custom numeric or categorical prediction model (p. 323).
• To build single-label image prediction models, see Build a custom image prediction model (p. 324).
• To build multi-category text prediction models, see Build a custom text prediction model (p. 325).
• To get started with time-series forecasting models, see Time Series Forecasts in Amazon SageMaker
Canvas.

Note
If you encounter an error during post-building analysis that tells you to increase your quota for
ml.m5.2xlarge instances, see Request a Quota Increase.

Build a custom numeric or categorical prediction model

Numeric and categorical prediction models support both Quick builds and Standard builds.

To build a numeric or categorical prediction model, use the following procedure:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose My models.
3. Choose New model.
4. In the Create new model dialog box, do the following:

a. Enter a name in the Model name field.


b. Select the Predictive analysis problem type.
c. Choose Create.
5. For Select dataset, select your dataset from the list of datasets. If you haven’t already imported your
data, choose Import to be directed through the import data workflow.
6. When you’re ready to begin building your model, choose Select dataset.
7. On the Build tab, for the Target column dropdown list, select the target for your model that you
would like to predict.
8. For Model type, Canvas automatically chooses the problem type for you. If you want to change the
type, choose Change model type and select your desired model type.
9. Select or deselect columns in your data to include or drop them from your build.
Note
If you make batch predictions with your model after building, Canvas adds dropped
columns to your prediction results. However, Canvas does not add the dropped columns to
your batch predictions for time series models.

10. (Optional) Use the visualization and analytics tools that Canvas provides to visualize your data and
determine which features you might want to include in your model. For more information, see
Explore and analyze your data.
11. (Optional) Use data transformations to clean, transform, and prepare your data for model building.
For more information, see Prepare your data with advanced transformations. You can view and
remove your transforms by choosing Model recipe to open the Model recipe side panel.
12. (Optional) For additional features such as previewing the accuracy of your model, validating your
dataset, and changing the size of the random sample that Canvas takes from your dataset, see
Preview your model (p. 326).
13. After reviewing your data and making any changes to your dataset, choose Quick build or Standard
build to begin a build for your model. The following screenshot shows the Build page and the Quick
build and Standard build options.

After your model begins building, you can leave the page. When the model shows as Ready on the My
models page, it’s ready for analysis and predictions.

Build a custom image prediction model

Single-label image prediction models support both Quick builds and Standard builds.

To build a single-label image prediction model, use the following procedure:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose My models.
3. Choose New model.
4. In the Create new model dialog box, do the following:

a. Enter a name in the Model name field.


b. Select the Image analysis problem type.
c. Choose Create.
5. For Select dataset, select your dataset from the list of datasets. If you haven’t already imported your
data, choose Import to be directed through the import data workflow.
6. When you’re ready to begin building your model, choose Select dataset.
7. On the Build tab, you see the Label distribution for the images in your dataset. The Model type is
set to Single-label image prediction.
8. On this page, you can preview your images and edit the dataset. If you have any unlabeled images,
choose Edit dataset and Assign labels to unlabeled images (p. 329). You can also perform other
tasks when you Edit an image dataset (p. 328), such as renaming labels and adding images to the
dataset.
9. After reviewing your data and making any changes to your dataset, choose Quick build or Standard
build to begin a build for your model. The following screenshot shows the Build page of an image
prediction model that is ready to be built.

After your model begins building, you can leave the page. When the model shows as Ready on the My
models page, it’s ready for analysis and predictions.

Build a custom text prediction model

Multi-category text prediction models support both Quick builds and Standard builds.

To build a text prediction model, use the following procedure:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose My models.
3. Choose New model.
4. In the Create new model dialog box, do the following:

a. Enter a name in the Model name field.


b. Select the Text analysis problem type.
c. Choose Create.
5. For Select dataset, select your dataset from the list of datasets. If you haven’t already imported your
data, choose Import to be directed through the import data workflow.
6. When you’re ready to begin building your model, choose Select dataset.
7. On the Build tab, for the Target column dropdown list, select the target for your model that you
would like to predict. The target column must have a binary or categorical data type, and there must
be at least 25 entries (or rows of data) for each unique label in the target column.
8. For Model type, confirm that the model type is automatically set to Multi-category text prediction.
9. For the training column, select your source column of text data. This should be the column
containing the text that you want to analyze.
10. Choose Quick build or Standard build to begin building your model. The following screenshot
shows the Build page of a text prediction model that is ready to be built.

After your model begins building, you can leave the page. When the model shows as Ready on the My
models page, it’s ready for analysis and predictions.
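
Before you start the build, you can check that every label in your target column meets the 25-row minimum from step 7. The following is a minimal pandas sketch; the file and column names are hypothetical.

import pandas as pd

df = pd.read_csv("reviews.csv")               # hypothetical file name
counts = df["sentiment"].value_counts()       # hypothetical target column

# Labels that fall short of the 25-row minimum per unique label.
print(counts[counts < 25])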

Preview your model


Note
The following functionalities are only available for custom models built with tabular datasets.
Multi-category text prediction models are also excluded.

SageMaker Canvas provides you with tools to preview your model and validate data before you begin
building. The following functionalities include previewing the accuracy of your model, validating your
dataset to prevent issues while building the model, and changing the size of the random sample for your
model.

Preview a model

With Amazon SageMaker Canvas, you can get insights from your data before you build a model by
choosing Preview model. For example, you can see how the data in each column is distributed. For
models built using categorical data, you can also choose Preview model to generate an Estimated
accuracy prediction of how well the model might analyze your data. The accuracy of a Quick build or a
Standard build represents how well the model can perform on real data and is generally higher than the
Estimated accuracy.

Amazon SageMaker Canvas automatically handles missing values in your dataset while it builds the
model. It infers the missing values by using adjacent values that are present in the dataset.

Validate data

Before you build your model, SageMaker Canvas checks your dataset for issues that will cause your build
to fail. If SageMaker Canvas finds any issues, then it warns you on the Build page before you attempt to
build a model.

You can choose Validate data to see a list of the issues with your dataset. You can then use the
SageMaker Canvas data preparation features (p. 338), or your own tools, to fix your dataset before
starting a build. If you don’t fix the issues with your dataset, then your build fails.

If you make changes to your dataset to fix the issues, you have the option to re-validate your dataset
before attempting a build. We recommend that you re-validate your dataset before building.

The following list shows the issues that SageMaker Canvas checks for in your dataset and how to
resolve them.

• Wrong model type for your data – Try another model type or use a different dataset.
• Missing values in your target column – Replace the missing values, drop rows with missing values, or use a different dataset.
• Too many unique labels in your target column – Verify that you've used the correct column for your target column, or use a different dataset.
• Too many non-numeric values in your target column – Choose a different target column, select another model type, or use a different dataset.
• One or more column names contain double underscores – Rename the columns to remove any double underscores, and try again.
• None of the rows in your dataset are complete – Replace the missing values, or use a different dataset.
• Too many unique labels for the number of rows in your data – Check that you're using the right target column, increase the number of rows in your dataset, consolidate similar labels, or use a different dataset.
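
If you prefer to fix your data outside of Canvas, the following pandas sketch addresses two of the issues above: double underscores in column names and missing values in the target column. The file and column names are hypothetical.

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

# Rename columns to remove any double underscores.
df.columns = [c.replace("__", "_") for c in df.columns]

# Drop rows with missing values in the target column.
df = df.dropna(subset=["target"])  # hypothetical target column name

df.to_csv("dataset_fixed.csv", index=False)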

Random sample

SageMaker Canvas uses the random sampling method to sample your dataset. The random sample
method means that each row has an equal chance of being picked for the sample. You can choose a
column in the preview to get summary statistics for the random sample, such as the mean and the mode.

By default, SageMaker Canvas uses a random sample size of 20,000 rows from your dataset for datasets
with more than 20,000 rows. For datasets smaller than 20,000 rows, the default sample size is the
number of rows in your dataset. You can increase or decrease the sample size by choosing Random
sample in the Build tab of the SageMaker Canvas application. You can use the slider to select your
desired sample size, and then choose Update to change the sample size. The maximum sample size you
can choose for a dataset is 40,000 rows, and the minimum sample size is 500 rows. If you choose a large
sample size, the dataset preview and summary statistics might take a few moments to reload.
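
Canvas's sampling behaves like a uniform random sample. If you want to reproduce a comparable sample outside of Canvas, the following pandas sketch draws up to 20,000 rows, matching the default; the file name is hypothetical.

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

# Each row has an equal chance of selection; cap at 20,000 rows to mirror
# the Canvas default for larger datasets.
sample = df.sample(n=min(len(df), 20_000), random_state=0)
print(sample.describe())  # summary statistics such as the mean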

The Build page shows a preview of 100 rows from your dataset. If the sample size is the same size as
your dataset, then the preview uses the first 100 rows of your dataset. Otherwise, the preview uses the
first 100 rows of the random sample.

Edit an image dataset


In Amazon SageMaker Canvas, you can edit your image datasets and review your labels before building
a model. You might want to perform tasks such as assigning labels to unlabeled images or adding more
images to the dataset. These tasks can all be done in the Canvas application, providing you with one
place to modify your dataset and build a model.
Note
Before building a model, you must assign labels to all images in your dataset. Also, you must
have at least 25 images per label and a minimum of two labels. For more information about
assigning labels, see the section on this page called Assign labels to unlabeled images. If
you can’t determine a label for an image, you should delete it from your dataset. For more
information about deleting images, see the section on this page Add or delete images from the
dataset (p. 329).

To begin editing your image dataset, you should be on the Build tab while building your single-label
image prediction model.

A new page opens that shows the images in your dataset along with their labels. This page categorizes
your image dataset into Total images, Labeled images, and Unlabeled images. You can also review the
Dataset preparation guide for best practices on building a more accurate image prediction model.

The following screenshot shows the page for editing your image dataset.

From this page, you can do the following actions.

View the properties for each image (label, size, dimensions)

To view an individual image, you can search for it by file name in the search bar. Then, choose the image
to open the full view. You can view the image properties and reassign the image’s label. Choose Save
when you’re done viewing the image.

Add, rename, or delete labels in the dataset

Canvas lists the labels for your dataset in the left navigation pane. You can add new labels to the dataset
by entering a label in the Add label text field.

To rename or delete a label from your dataset, choose the More options icon next to the label and
select either Rename or Delete. If you rename the label, you can enter the new label name and choose
Confirm. If you delete the label, the label is removed from all images in your dataset that have that
label. Any images with that label will be unlabeled.

Assign labels to unlabeled images

To view the unlabeled images in your dataset, choose Unlabeled in the left navigation pane. Select an
image, open the dropdown titled Unlabeled, and select a label to assign to the image. You can also
select multiple images at once, and all selected images are assigned the label you chose.

Reassign labels to images

You can reassign labels to images by selecting the image (or multiple images at a time) and opening the
dropdown titled with the current label. Select your desired label, and the image or images are updated
with the new label.

Sort your images by label

You can view all the images for a given label by choosing the label in the left navigation pane.

Add or delete images from the dataset

You can add more images to your dataset by choosing Add images in the top navigation pane. You’ll be
taken through the workflow to import more images. The images you import are added to your existing
dataset.

You can delete images from your dataset by selecting them and then choosing Delete in the top
navigation pane.
Note
After making any changes to your dataset, choose Save dataset to make sure that you don’t lose
your changes.

Explore and analyze your data


Note
You can only use SageMaker Canvas visualizations and analytics for models built on tabular
datasets. Multi-category text prediction models are also excluded.

In Amazon SageMaker Canvas, you can explore the variables in your dataset using visualizations and
analytics. SageMaker Canvas provides you with the ability to create in-application visualizations and
analytics. You can use these explorations to uncover relationships between your variables before building
your model.

For more information about visualization techniques in Canvas, see Explore your data using visualization
techniques (p. 329).

For more information about analytics in Canvas, see Explore your data using analytics (p. 336).

Explore your data using visualization techniques


Note
You can only use SageMaker Canvas visualizations for models built on tabular datasets. Multi-
category text prediction models are also excluded.

With Amazon SageMaker Canvas, you can explore and visualize your data to gain advanced insights into
your data before building your ML models. You can visualize using scatter plots, bar charts, and box
plots, which can help you understand your data and discover the relationships between features that
could affect the model accuracy.

In the Build tab of the SageMaker Canvas application, choose Data visualizer to begin creating your
visualizations.

You can change the visualization sample size to adjust the size of the random sample taken from your
dataset. A sample size that is too large might affect the performance of your data visualizations, so we
recommend that you choose an appropriate sample size. To change the sample size, use the following
procedure.

1. Choose Visualization sample.


2. Use the slider to select your desired sample size.
3. Choose Update to confirm the change to your sample size.

Note
Certain visualization techniques require columns of a specific data type. For example, you can
only use numeric columns for the x and y-axes of scatter plots.

Scatter plot

To create a scatter plot with your dataset, choose Scatter plot in the Visualization panel. Then, you can
choose the features you want to plot on the x and y-axes from the Columns section. You can drag and
drop the columns onto the axes, or once an axis has been dropped, you can choose a column from the
list of supported columns.

You can use Color by to color the data points on the plot with a third feature. You can also use Group by
to group the data into separate plots based on a fourth feature.

The following image shows a scatter plot that uses Color by and Group by. In this example, each data
point is colored by the MaritalStatus feature, and grouping by the Department feature results in a
scatter plot for the data points of each department.

Bar chart

To create a bar chart with your dataset, choose Bar chart in the Visualization panel. Then, you can
choose the features you want to plot on the x and y-axes from the Columns section. You can drag and
drop the columns onto the axes, or once an axis has been dropped, you can choose a column from the
list of supported columns.

You can use Group by to group the bar chart by a third feature. You can use Stack by to vertically shade
each bar based on the unique values of a fourth feature.

The following image shows a bar chart that uses Group by and Stack by. In this example, the bar chart
is grouped by the MaritalStatus feature and stacked by the JobLevel feature. For each JobRole on
the x-axis, there is a separate bar for the unique categories in the MaritalStatus feature, and every bar
is vertically stacked by the JobLevel feature.

Box plot

To create a box plot with your dataset, choose Box plot in the Visualization panel. Then, you can choose
the features you want to plot on the x and y-axes from the Columns section. You can drag and drop
the columns onto the axes, or once an axis has been dropped, you can choose a column from the list of
supported columns.

You can use Group by to group the box plots by a third feature.

The following image shows a box plot that uses Group by. In this example, the x and y-axes show
JobLevel and JobSatisfaction, respectively, and the colored box plots are grouped by the
Department feature.

Explore your data using analytics


Note
You can only use SageMaker Canvas analytics for models built on tabular datasets. Multi-
category text prediction models are also excluded.

With analytics in Amazon SageMaker Canvas, you can explore your dataset and gain insight on all of
your variables before building a model. You can determine the relationships between features in your
dataset using correlation matrices. You can use this technique to summarize your dataset into a matrix
that shows the correlations between two or more values. This helps you identify and visualize patterns in
a given dataset for advanced data analysis.

The matrix shows the correlation between each feature as positive, negative, or neutral. You might want
to include features that have a high correlation with each other when building your model. Features that
have little to no correlation might be irrelevant to your model, and you can drop those features when
building your model.

To get started with correlation matrices in SageMaker Canvas, see the following section.

Create a correlation matrix

You can create a correlation matrix when you are preparing to build a model in the Build tab of the
SageMaker Canvas application.

For instructions on how to begin creating a model, see Build a model (p. 323).

After you’ve started preparing a model in the SageMaker Canvas application, do the following:

1. In the Build tab, choose Data visualizer.


2. Choose Analytics.
3. Choose Correlation matrix.

You should see a visualization similar to the following screenshot, which shows up to 15 columns of the
dataset organized into a correlation matrix.

After you’ve created the correlation matrix, you can customize it by doing the following:

1. Choose your columns

For Columns, you can select the columns that you want to include in the matrix. You can compare up to
15 columns from your dataset.

Note
You can use numeric, categorical, or binary column types for a correlation matrix. The
correlation matrix doesn’t support datetime or text data column types.

To add or remove columns from the correlation matrix, select and deselect columns from the Columns
panel. You can also drag and drop columns from the panel directly onto the matrix. If your dataset has a
lot of columns, you can search for the columns you want in the Search columns bar.

To filter the columns by data type, choose the dropdown menu and select All, Numeric, or Categorical.
Selecting All shows you all of the columns from your dataset, whereas the Numeric and Categorical
filters only show you the numeric or categorical columns in your dataset. Note that binary column types
are included in the numeric or categorical filters.

For the best data insights, include your target column in the correlation matrix. When you include your
target column in the correlation matrix, it appears as the last feature on the matrix with a target symbol.

2. Choose your correlation type


SageMaker Canvas supports different correlation types, or methods for calculating the correlation
between your columns.

To change the correlation type, use the Columns filter mentioned in the preceding section to filter
for your desired column type and columns. You should see the Correlation type in the side panel.
For numeric comparisons, you have the option to select either Pearson or Spearman. For categorical
comparisons, the correlation type is set as MI. For mixed comparisons, the correlation
type is set as Spearman & MI.

For matrices that only compare numeric columns, the correlation type is either Pearson or Spearman.
The Pearson measure evaluates the linear relationship between two continuous variables. The Spearman
measure evaluates the monotonic relationship between two variables. For both Pearson and Spearman,
the scale of correlation ranges from -1 to 1, with either end of the scale indicating a perfect correlation
(a direct 1:1 relationship) and 0 indicating no correlation. You might want to select Pearson if your data
has more linear relationships (as revealed by a scatter plot visualization). If your data is not linear, or
contains a mixture of linear and monotonic relationships, then you might want to select Spearman.

For matrices that only compare categorical columns, the correlation type is set to Mutual Information
Classification (MI). The MI value is a measure of the mutual dependence between two random variables.
The MI measure is on a scale of 0 to 1, with 0 indicating no correlation and 1 indicating a perfect
correlation.

For matrices that compare a mix of numeric and categorical columns, the correlation type Spearman &
MI is a combination of the Spearman and MI correlation types. For correlations between two numeric
columns, the matrix shows the Spearman value. For correlations between a numeric and categorical
column or two categorical columns, the matrix shows the MI value.
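
If you want to see how these measures behave outside of Canvas, the following is a minimal Python sketch of the three correlation types using pandas, SciPy, and scikit-learn. The column names and data are hypothetical, and normalized mutual information is used here as a stand-in for the MI measure that Canvas reports.

import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import normalized_mutual_info_score

df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],          # numeric
    "weight": [52, 61, 69, 83, 90],               # numeric
    "size": ["S", "S", "M", "L", "L"],            # categorical
    "returns": ["no", "no", "no", "yes", "yes"],  # categorical
})

# Numeric x numeric: Pearson (linear) and Spearman (monotonic), both on a -1 to 1 scale.
pearson_r, _ = pearsonr(df["height"], df["weight"])
spearman_r, _ = spearmanr(df["height"], df["weight"])

# Categorical x categorical: mutual information, on a 0 to 1 scale.
mi = normalized_mutual_info_score(df["size"], df["returns"])

print(pearson_r, spearman_r, mi)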

Lastly, remember that correlation does not necessarily indicate causation. A strong correlation value only
indicates that there is a relationship between two variables, but the variables might not have a causal
relationship. Carefully review your columns of interest to avoid bias when building your model.

3. Filter your correlations


In the side panel, you can use the Filter correlations feature to filter for the range of correlation values
that you want to include in the matrix. For example, if you want to filter for features that only have
positive or neutral correlation, you can set the Min to 0 and the Max to 1 (valid values are -1 to 1).

For Spearman and Pearson comparisons, you can set the Filter correlations range anywhere from -1 to
1, with 0 meaning that there is no correlation. -1 and 1 mean that the variables have a strong negative or
positive correlation, respectively.

For MI comparisons, the correlation range only goes from 0 to 1, with 0 meaning that there is no
correlation and 1 meaning that the variables have a strong correlation, either positive or negative.


Each feature has a perfect correlation (1) with itself. Therefore, you might notice that the top row of the
correlation matrix is always 1. If you want to exclude these values, you can use the filter to set the Max
less than 1.

Keep in mind that if your matrix compares a mix of numeric and categorical columns and uses the
Spearman & MI correlation type, then the categorical x numeric and categorical x categorical correlations
(which use the MI measure) are on a scale of 0 to 1, whereas the numeric x numeric correlations (which
use the Spearman measure) are on a scale of -1 to 1. Review your correlations of interest carefully to
ensure that you know the correlation type being used to calculate each value.

4. Choose the visualization method

In the side panel, you can use Visualize by to change the visualization method of the matrix. Choose
the Numeric visualization method to show the correlation (Pearson, Spearman, or MI) value, whereas
choosing the Size visualization method visualizes the correlation with differently sized and colored dots.
If you choose Size, you can hover over a specific dot on the matrix to see the actual correlation value.

5. Choose a color palette

In the side panel, you can use Color selection to change the color palette used for the scale of negative
to positive correlation in the matrix. Select one of the alternative color palettes to change the colors
used in the matrix.

Prepare data with advanced transformations


Note
You can only use advanced transformations for models built on tabular datasets. Multi-category
text prediction models are also excluded.

Your machine learning dataset might require data preparation before you build your model. You
might want to clean your data due to various issues, which might include missing values or outliers,
and perform feature engineering to improve the accuracy of your model. Amazon SageMaker Canvas
provides ML data transforms with which you can clean, transform, and prepare your data for model
building. You can use these transforms on your datasets without any code. SageMaker Canvas adds the
transforms you use to the Model recipe, which is a record of the data preparation done on your data
before building the model. Any data transforms you use only modify the input data for model building
and do not modify your original data source.

The following transforms are available in SageMaker Canvas for you to prepare your data for building.
Note
The preview of your dataset shows the first 100 rows of the dataset. If your dataset has more
than 20,000 rows, Canvas takes a random sample of 20,000 rows and previews the first 100
rows from that sample. You can only search for and specify values from the previewed rows, and
the filter functionality only filters the previewed rows and not the entire dataset.

Functions and operators

You can use mathematical functions and operators to explore and distribute your data. You can use the
SageMaker Canvas supported functions or create your own formula with your existing data and create a
new column with the result of the formula. For example, you can add the corresponding values of two
columns and save the result to a new column.

You can nest statements to create more complex functions. The following are some examples of nested functions that you might use; a code sketch of their equivalents follows the list.

• To calculate BMI, you could use the function weight / (height ^ 2).
• To classify ages, you could use the function Case(age < 18, 'child', age < 65, 'adult',
'senior').
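
For reference, the following is a rough pandas equivalent of the two nested formulas above. Canvas applies these formulas without any code; the column names here are only examples.

import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [70.0, 85.0], "height": [1.75, 1.80], "age": [12, 40]})

# weight / (height ^ 2)
df["bmi"] = df["weight"] / (df["height"] ** 2)

# Case(age < 18, 'child', age < 65, 'adult', 'senior')
df["age_group"] = np.select(
    [df["age"] < 18, df["age"] < 65],  # conditions are checked in order
    ["child", "adult"],
    default="senior",
)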


You can specify functions in the data preparation stage before you build your model. To use a function,
do the following.

• In the Build tab of the SageMaker Canvas app, choose Functions to open up the Functions panel.
• In the Functions panel, you can choose a Formula to add to your Model Recipe. Each formula is
applied to all of the values in the columns you specify. For formulas that accept two or more columns
as arguments, use columns with matching data types; otherwise, you will get an error or null values
in the new column.
• After you’ve specified a Formula, add a column name in the New Column Name field. SageMaker
Canvas uses this name for the new column that is created.
• To add the function to your Model Recipe, choose Add.

SageMaker Canvas saves the result of your function to a new column using the name you specified in
New Column Name. You can view or remove functions from the Model Recipe panel.

SageMaker Canvas supports the following operators for functions. You can use either the text format or
the in-line format to specify your function.

Each operator is listed with its description, supported data types, text format, and in-line format (N/A when a format isn't available).

• Add – Returns the sum of the values. Data types: Numeric. Text format: Add(sales1, sales2). In-line format: sales1 + sales2
• Subtract – Returns the difference between the values. Data types: Numeric. Text format: Subtract(sales1, sales2). In-line format: sales1 - sales2
• Multiply – Returns the product of the values. Data types: Numeric. Text format: Multiply(sales1, sales2). In-line format: sales1 * sales2
• Divide – Returns the quotient of the values. Data types: Numeric. Text format: Divide(sales1, sales2). In-line format: sales1 / sales2
• Mod – Returns the result of the modulo operator (the remainder after dividing the two values). Data types: Numeric. Text format: Mod(sales1, sales2). In-line format: sales1 % sales2
• Abs – Returns the absolute value of the value. Data types: Numeric. Text format: Abs(sales1). In-line format: N/A
• Negate – Returns the negative of the value. Data types: Numeric. Text format: Negate(c1). In-line format: -c1
• Exp – Returns e (Euler's number) raised to the power of the value. Data types: Numeric. Text format: Exp(sales1). In-line format: N/A
• Log – Returns the logarithm (base 10) of the value. Data types: Numeric. Text format: Log(sales1). In-line format: N/A
• Ln – Returns the natural logarithm (base e) of the value. Data types: Numeric. Text format: Ln(sales1). In-line format: N/A
• Pow – Returns the value raised to a power. Data types: Numeric. Text format: Pow(sales1, 2). In-line format: sales1 ^ 2
• If – Returns a true or false label based on a condition you specify. Data types: Boolean, Numeric, Text. Text format: If(sales1>7000, 'truelabel', 'falselabel'). In-line format: N/A
• Or – Returns a boolean value of whether one of the specified values/conditions is true or not. Data types: Boolean. Text format: Or(fullprice, discount). In-line format: fullprice || discount
• And – Returns a boolean value of whether two of the specified values/conditions are true or not. Data types: Boolean. Text format: And(sales1, sales2). In-line format: sales1 && sales2
• Not – Returns a boolean value that is the opposite of the specified value/condition. Data types: Boolean. Text format: Not(sales1). In-line format: !sales1
• Case – Returns a value based on conditional statements (returns c1 if cond1 is true, returns c2 if cond2 is true, else returns c3). Data types: Boolean, Numeric, Text. Text format: Case(cond1, c1, cond2, c2, c3). In-line format: N/A
• Equal – Returns a boolean value of whether two values are equal. Data types: Boolean, Numeric, Text. Text format: N/A. In-line format: c1 = c2 or c1 == c2
• Not equal – Returns a boolean value of whether two values are not equal. Data types: Boolean, Numeric, Text. Text format: N/A. In-line format: c1 != c2
• Less than – Returns a boolean value of whether c1 is less than c2. Data types: Boolean, Numeric, Text. Text format: N/A. In-line format: c1 < c2
• Greater than – Returns a boolean value of whether c1 is greater than c2. Data types: Boolean, Numeric, Text. Text format: N/A. In-line format: c1 > c2
• Less than or equal – Returns a boolean value of whether c1 is less than or equal to c2. Data types: Boolean, Numeric, Text. Text format: N/A. In-line format: c1 <= c2
• Greater than or equal – Returns a boolean value of whether c1 is greater than or equal to c2. Data types: Boolean, Numeric, Text. Text format: N/A. In-line format: c1 >= c2

SageMaker Canvas also supports aggregate operators, which can perform operations such as calculating
the sum of all the values or finding the minimum value in a column. You can use aggregate operators
in combination with standard operators in your functions. For example, to calculate the difference of
values from the mean, you could use the function Abs(height – avg(height)). SageMaker Canvas
supports the following aggregate operators.

• sum – Returns the sum of all the values in a column. Format: sum. Example: sum(c1)
• minimum – Returns the minimum value of a column. Format: min. Example: min(c2)
• maximum – Returns the maximum value of a column. Format: max. Example: max(c3)
• average – Returns the average value of a column. Format: avg. Example: avg(c4)
• std – Returns the sample standard deviation of a column. Format: std. Example: std(c1)
• stddev – Returns the standard deviation of the values in a column. Format: stddev. Example: stddev(c1)
• variance – Returns the unbiased variance of the values in a column. Format: variance. Example: variance(c1)
• approx_count_distinct – Returns the approximate number of distinct items in a column. Format: approx_count_distinct. Example: approx_count_distinct(c1)
• count – Returns the number of items in a column. Format: count. Example: count(c1)
• first – Returns the first value of a column. Format: first. Example: first(c1)
• last – Returns the last value of a column. Format: last. Example: last(c1)
• stddev_pop – Returns the population standard deviation of a column. Format: stddev_pop. Example: stddev_pop(c1)
• variance_pop – Returns the population variance of the values in a column. Format: variance_pop. Example: variance_pop(c1)
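
As an illustration only, the following is a rough pandas equivalent of the Abs(height – avg(height)) example above. The column name is hypothetical, and Canvas applies the same logic without code.

import pandas as pd

heights = pd.Series([150, 160, 170, 180, 190], name="height")

# Abs(height - avg(height)): absolute deviation of each value from the column mean
deviation = (heights - heights.mean()).abs()
print(deviation.tolist())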

Datetime extraction

With the datetime extraction transform, you can extract values from a datetime column to a separate
column. For example, if you have a column containing dates of purchases, you can extract the month
value to a separate column and use the new column when building your model. You can also extract
multiple values to separate columns with a single transform.

Your datetime column must use a supported timestamp format. For a list of the formats that SageMaker
Canvas supports, see Time Series Forecasts in Amazon SageMaker Canvas (p. 367). If your dataset does
not use one of the supported formats, update your dataset to use a supported timestamp format and re-
import it to Amazon SageMaker Canvas before building your model.

To perform a datetime extraction, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Extract.


2. Choose the Column from which you want to extract values.
3. For Value, select one or more values to extract from the column. The values you can extract from a
timestamp column are Year, Month, Day, Hour, Week of year, Day of year, and Quarter.
4. Choose Add to add the transform to the Model recipe.

SageMaker Canvas creates a new column in the dataset for each of the values you extract. Except for
Year values, SageMaker Canvas uses a 0-based encoding for the extracted values. For example, if you
extract the Month value, January is extracted as 0, and February is extracted as 1.
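
The following pandas sketch mirrors this behavior for illustration, including the 0-based encoding. The column name is an example, and Canvas performs the extraction in the UI without code.

import pandas as pd

df = pd.DataFrame({"purchase_date": pd.to_datetime(["2023-01-15", "2023-02-20"])})

df["year"] = df["purchase_date"].dt.year             # Year is not 0-based, e.g. 2023
df["month"] = df["purchase_date"].dt.month - 1       # January -> 0, February -> 1
df["quarter"] = df["purchase_date"].dt.quarter - 1   # Q1 -> 0, Q2 -> 1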


You can see the transform listed in the Model recipe section. If you remove the transform from the
Model recipe section, the new columns are removed from the dataset.

Drop columns

You can exclude a column from your model build by dropping it in the Build tab of the SageMaker
Canvas application. Deselect the column you want to drop, and it isn't included when building the model.
Note
If you drop columns and then make batch predictions (p. 358) with your model, SageMaker
Canvas adds the dropped columns back to the .csv file available for you to download. However,
SageMaker Canvas does not add the dropped columns back for time series models.

Rename columns

With the rename columns transform, you can rename columns in your data. When you rename a column,
SageMaker Canvas changes the column name in the model input.

You can rename a column in your dataset by double-clicking on the column name in the Build tab of the
SageMaker Canvas application and entering a new name. Pressing the Enter key submits the change, and
clicking anywhere outside the input cancels the change. You can also rename a column by clicking the
More options icon, located at the end of the row in list view or at the end of the header cell in grid
view, and choosing Rename.

Your column name can’t be longer than 32 characters or have double underscores (__), and you can’t
rename a column to the same name as another column. You also can’t rename a dropped column.

The following screenshot shows how to rename a column by double-clicking the column name.


When you rename a column, SageMaker Canvas adds the transform in the Model recipe section. If you
remove the transform from the Model recipe section, the column reverts to its original name.

Remove rows
This transform removes rows of data from the dataset where values in a specific column meet conditions
that you specify. You can remove rows that have missing values, contain outliers, or meet custom
conditions in a column you choose. These rows are not used when building your model.

Remove rows by missing values


Missing values are a common occurrence in machine learning datasets and can impact model accuracy.
Use this transform if you want to drop rows with null or empty values in certain columns.

To remove rows that contain missing values in a specified column, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Remove rows by.
2. Choose the Column you want to check for missing values.
3. For the Operation, choose Is missing.
4. Choose Add to add the transform to the Model recipe.

SageMaker Canvas drops rows that contain missing values in the Column you selected. After removing
the rows from the dataset, SageMaker Canvas adds the transform in the Model recipe section. If you
remove the transform from the Model recipe section, the rows return to your dataset.


Remove rows by outliers

Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy
and lead to longer building times. With SageMaker Canvas, you can detect and remove rows that contain
outliers in numeric columns. You can choose to define outliers with either standard deviations or a
custom range.

To remove outliers from your data, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Remove rows by.
2. Choose the Column you want to check for outliers.
3. For the Operation, choose Is outlier.
4. Set the Outlier range to either Standard deviation or Custom range.
5. If you choose Standard deviation, specify a SD (standard deviation) value from 1–3. If you choose
Custom range, select either Percentile or Number, and then specify the Min and Max values.
6. Choose Add to add the transform to the Model recipe.

The Standard deviation option detects and removes outliers in numeric columns using the mean and
standard deviation. You specify the number of standard deviations a value must vary from the mean to
be considered an outlier. For example, if you specify 3 for SD, a value must fall more than 3 standard
deviations from the mean to be considered an outlier.

The Custom range option detects and removes outliers in numeric columns using minimum and
maximum values. Use this method if you know your threshold values that delimit outliers. You can set
the Type of the range to either Percentile or Number. If you choose Percentile, the Min and Max values
should be the minimum and maximum of the percentile range (0–100) that you want to allow. If you
choose Number, the Min and Max values should be the minimum and maximum numeric values that you
want to allow in the data.

After removing the rows from the dataset, SageMaker Canvas adds the transform in the Model recipe
section. If you remove the transform from the Model recipe section, the rows return to your dataset.
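
For illustration, the following pandas sketch shows both outlier definitions. Canvas applies the same logic in the UI; the column name and thresholds here are hypothetical.

import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 500.0]})

# Standard deviation: keep rows within 3 standard deviations of the mean.
mean, sd = df["price"].mean(), df["price"].std()
kept_sd = df[(df["price"] - mean).abs() <= 3 * sd]

# Custom range (Number): keep rows with values between the Min and Max.
kept_range = df[df["price"].between(1, 100)]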

Remove rows by custom values

You can remove rows with values that meet custom conditions. For example, you might want to exclude
all of the rows with a price value greater than 100 when building your model. With this transform, you
can create a rule that removes all rows that exceed the threshold you set.

To use the custom remove transform, do the following.


1. In the Build tab of the SageMaker Canvas application, choose Remove rows by.
2. Choose the Column you want to check.
3. Select the type of Operation you want to use, and then specify the values for the selected condition.
4. Choose Add to add the transform to the Model recipe.

For the Operation, you can choose one of the following options. Note that the available operations
depend on the data type of the column you choose. For example, you cannot create an Is greater
than operation for a column containing text values.

• Is equal to – Supported column types: binary, numeric, text, categorical. Removes rows where the value in Column equals the values you specify.
• Is not equal to – Supported column types: binary, numeric, text, categorical. Removes rows where the value in Column doesn't equal the values you specify.
• Is less than – Supported column types: numeric. Removes rows where the value in Column is less than the value you specify.
• Is less than or equal to – Supported column types: numeric. Removes rows where the value in Column is less than or equal to the value you specify.
• Is greater than – Supported column types: numeric. Removes rows where the value in Column is greater than the value you specify.
• Is greater than or equal to – Supported column types: numeric. Removes rows where the value in Column is greater than or equal to the value you specify.
• Is between – Supported column types: numeric. Removes rows where the value in Column is between or equal to two values you specify.
• Contains – Supported column types: text, categorical. Removes rows where the value in Column contains a value you specify.
• Starts with – Supported column types: text, categorical. Removes rows where the value in Column begins with a value you specify.
• Ends with – Supported column types: text, categorical. Removes rows where the value in Column ends with a value you specify.

After removing the rows from the dataset, SageMaker Canvas adds the transform in the Model recipe
section. If you remove the transform from the Model recipe section, the rows return to your dataset.


Replace values

This transform replaces values in your dataset where the values in a specific column meet conditions that
you specify. You can replace missing values or outliers. SageMaker Canvas uses the replaced values when
building your model but doesn’t change your original dataset. Note that if you've dropped a column
from your dataset using the Drop columns (p. 342) transform, you can't replace values in that column.

Replace missing values

Missing values are a common occurrence in machine learning datasets and can impact model accuracy.
You can choose to drop rows that have missing values, but your model is more accurate if you choose
to replace the missing values instead. With this transform, you can replace missing values in numeric
columns with the mean or median of the data in a column, or you can also specify a custom value with
which to replace missing values. For non-numeric columns, you can replace missing values with the mode
(most common value) of the column or a custom value.

Use this transform if you want to replace the null or empty values in certain columns. To replace missing
values in a specified column, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Replace.


2. Choose the Column in which you want to replace missing values.
3. For Values to replace, choose Is missing.
4. Set Mode to either Automatic (default) or Manual. If you choose Automatic (default), SageMaker
Canvas replaces missing values with imputed values that best fit your data. If you choose Manual,
then specify the Replace with value in the next step.
5. (Optional) If you choose the Manual replacement option, set the Replace with value:

• If your column is numeric, then select Mean, Median, or Custom. Mean replaces missing values
with the mean for the column, and Median replaces missing values with the median for the
column. If you choose Custom, then you must specify a custom value that you want to use to
replace missing values.
• If your column is non-numeric, then select Mode or Custom. Mode replaces missing values with
the mode, or the most common value, for the column. For Custom, specify a custom value that
you want to use to replace missing values.
6. Choose Add to add the transform to the Model recipe.

After replacing the missing values in the dataset, SageMaker Canvas adds the transform in the Model
recipe section. If you remove the transform from the Model recipe section, the missing values return to
the dataset.
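
The following pandas sketch shows rough equivalents of the manual replacement options, for illustration only; Canvas applies these in the UI, and the column names are examples.

import pandas as pd

df = pd.DataFrame({"sales": [100.0, None, 300.0], "region": ["east", None, "west"]})

df["sales"] = df["sales"].fillna(df["sales"].mean())        # Mean (numeric columns)
# df["sales"] = df["sales"].fillna(df["sales"].median())    # Median (numeric columns)
df["region"] = df["region"].fillna(df["region"].mode()[0])  # Mode (non-numeric columns)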


Replace outliers
Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy
and lead to longer building times. SageMaker Canvas enables you to detect outliers in numeric columns
and replace the outliers with values that lie within an accepted range in your data. You can choose to
define outliers with either standard deviations or a custom range, and you can replace outliers with the
minimum and maximum values in the accepted range.

To replace outliers in your data, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Replace.


2. Choose the Column in which you want to replace outliers.
3. For Values to replace, choose Is outlier.
4. For Define outliers, choose either Standard deviation or Custom Range.
5. If you choose Standard deviation, specify a SD (standard deviation) value from 1–3. If you choose
Custom Range, select either Percentile or Number, and then specify the Min and Max values.
6. For Replace with, select Min/max range.
7. Choose Add to add the transform to the Model recipe.

The Standard deviation option detects outliers in numeric columns using the mean and standard
deviation. You specify the number of standard deviations a value must vary from the mean to be
considered an outlier. For example, if you specify 3 for SD, a value must fall more than 3 standard
deviations from the mean to be considered an outlier. SageMaker Canvas replaces outliers with the
minimum value or maximum value in the accepted range. For example, if you configure the standard
deviations to only include values from 200–300, then SageMaker Canvas changes a value of 198 to 200
(the minimum).

The Custom Range option detects outliers in numeric columns using minimum and maximum values.
Use this method if you know your threshold values that delimit outliers. You can set the Type of the
custom range to either Percentile or Number. If you choose Percentile, the Min and Max values should
be the minimum and maximum of the percentile range (0–100) that you want to allow. If you choose
Number, the Min and Max values should be the minimum and maximum numeric values that you want
to allow. SageMaker Canvas replaces any values that fall outside of the minimum and maximum to
the minimum and maximum values. For example, if your range only allows values from 1–100, then
SageMaker Canvas changes a value of 102 to 100 (the maximum).

After replacing the values in the dataset, SageMaker Canvas adds the transform in the Model recipe
section. If you remove the transform from the Model recipe section, the original values return to the
dataset.
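
For illustration, the following pandas sketch clips values to an accepted range of 200–300, matching the example above. The column name is hypothetical, and Canvas performs this replacement without code.

import pandas as pd

prices = pd.Series([198, 250, 310], name="price")

# Values outside the accepted range are replaced with the range boundaries.
clipped = prices.clip(lower=200, upper=300)  # 198 -> 200, 310 -> 300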


Filter rows

The filter functionality filters the previewed rows (the first 100 rows of your dataset) according to
conditions that you specify. Filtering rows creates a temporary preview of the data and does not impact
the model building. You can filter to preview rows that have missing values, contain outliers, or meet
custom conditions in a column you choose.

Filter rows by missing values

Missing values are a common occurrence in machine learning datasets. If you have rows with null or
empty values in certain columns, you might want to filter for and preview those rows.

To filter missing values from your previewed data, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Filter by rows.
2. Choose the Column you want to check for missing values.
3. For the Operation, choose Is missing.

SageMaker Canvas filters for rows that contain missing values in the Column you selected and provides a
preview of the filtered rows.


Filter rows by outliers

Outliers, or rare values in the distribution and range of your data, can negatively impact model accuracy
and lead to longer building times. SageMaker Canvas enables you to detect and filter rows that contain
outliers in numeric columns. You can choose to define outliers with either standard deviations or a
custom range.

To filter for outliers in your data, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Filter by rows.
2. Choose the Column you want to check for outliers.
3. For the Operation, choose Is outlier.
4. Set the Outlier range to either Standard deviation or Custom range.
5. If you choose Standard deviation, specify a SD (standard deviation) value from 1–3. If you choose
Custom range, select either Percentile or Number, and then specify the Min and Max values.

The Standard deviation option detects and filters for outliers in numeric columns using the mean and
standard deviation. You specify the number of standard deviations a value must vary from the mean to
be considered an outlier. For example, if you specify 3 for SD, a value must fall more than 3 standard
deviations from the mean to be considered an outlier.

The Custom range option detects and filters for outliers in numeric columns using minimum and
maximum values. Use this method if you know your threshold values that delimit outliers. You can set
the Type of the range to either Percentile or Number. If you choose Percentile, the Min and Max values
should be the minimum and maximum of the percentile range (0-100) that you want to allow. If you
choose Number, the Min and Max values should be the minimum and maximum numeric values that you
want to filter in the data.

Filter rows by custom values

You can filter for rows with values that meet custom conditions. For example, you might want to preview
rows that have a price value greater than 100 before removing them. With this functionality, you can
filter rows that exceed the threshold you set and preview the filtered data.

To use the custom filter functionality, do the following.

1. In the Build tab of the SageMaker Canvas application, choose Filter by rows.
2. Choose the Column you want to check.
3. Select the type of Operation you want to use, and then specify the values for the selected condition.


For the Operation, you can choose one of the following options. Note that the available operations
depend on the data type of the column you choose. For example, you cannot create an Is greater
than operation for a column containing text values.

• Is equal to – Supported column types: binary, numeric, text, categorical. Filters rows where the value in Column equals the values you specify.
• Is not equal to – Supported column types: binary, numeric, text, categorical. Filters rows where the value in Column doesn't equal the values you specify.
• Is less than – Supported column types: numeric. Filters rows where the value in Column is less than the value you specify.
• Is less than or equal to – Supported column types: numeric. Filters rows where the value in Column is less than or equal to the value you specify.
• Is greater than – Supported column types: numeric. Filters rows where the value in Column is greater than the value you specify.
• Is greater than or equal to – Supported column types: numeric. Filters rows where the value in Column is greater than or equal to the value you specify.
• Is between – Supported column types: numeric. Filters rows where the value in Column is between or equal to two values you specify.
• Contains – Supported column types: text, categorical. Filters rows where the value in Column contains a value you specify.
• Starts with – Supported column types: text, categorical. Filters rows where the value in Column begins with a value you specify.
• Ends with – Supported column types: text, categorical. Filters rows where the value in Column ends with a value you specify.

After you set the filter operation, SageMaker Canvas updates the preview of the dataset to show you the
filtered data.


Evaluate Your Model's Performance in Amazon SageMaker Canvas
After you’ve built your model, you can evaluate how well your model performed on your data before
using it to make predictions. You can use information, such as the model’s accuracy when predicting
labels and advanced metrics, to determine whether your model can make sufficiently accurate
predictions for your data.

The section Evaluate your model's performance (p. 351) describes how to view your model’s accuracy
score, broken down by model type. For each model, there is an Overview tab, which gives you a general
overview of the model’s performance, depending on the model type. There is also a Scoring tab, which
shows visualizations that you can use to get more insights into your model's performance beyond the
overall accuracy metric.

The Advanced metrics for your model contain information that you can use for a deeper understanding
of your model's performance. For information about how to view metrics you can use to quantify your
model’s accuracy, see Use advanced metrics in your analyses (p. 355).

Evaluate your model's performance


Amazon SageMaker Canvas provides overview and scoring information for the different types of models.
Your model's score can help you determine how accurate your model is when it makes predictions. The
additional scoring insights can help you quantify the differences between the actual and predicted
values.

To view the analysis of your model, do the following:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose My models.
3. Choose the model that you built.
4. In the top navigation pane, choose the Analyze tab.
5. Within the Analyze tab, you can view the overview and scoring information for your model.

The following sections describe how to interpret the scoring for each model type.

Evaluate categorical prediction models

The Overview tab shows you the column impact for each column. Column impact is a percentage score
that indicates how much weight a column has in making predictions in relation to the other columns. For
a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other
columns.

The following screenshot shows the Overview tab for a 2 category prediction model.


The Scoring tab for a categorical prediction model gives you the ability to visualize all the predictions.
Line segments extend from the left of the page, indicating all the predictions the model has made. In the
middle of the page, the line segments converge on a perpendicular segment to indicate the proportion
of each prediction to a single category. From the predicted category, the segments branch out to the
actual category. You can get a visual sense of how accurate the predictions were by following each line
segment from the predicted category to the actual category.

The following image gives you an example Scoring section for a 2 category prediction model.

The following image gives you an example Scoring section for a 3+ category prediction model.


Evaluate numeric prediction models

The Overview tab shows you the column impact for each column. Column impact is a percentage score
that indicates how much weight a column has in making predictions in relation to the other columns. For
a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for the other
columns.

The Scoring tab for numeric prediction shows a line to indicate the model's predicted value in relation
to the data used to make predictions. The width of the purple band around the line indicates the RMSE
(root mean squared error) range; the model's predicted values often fall within this +/- RMSE range.

The following image shows the Scoring section for numeric prediction.

Evaluate time series forecasting models

On the Analyze page for time series forecasting models, you can see an overview of the model’s metrics.
You can hover over each metric for more information, or you can see Use advanced metrics in your
analyses (p. 355).


In the Column impact section, you can see the score for each column. Column impact is a percentage
score that indicates how much weight a column has in making predictions in relation to the other
columns. For a column impact of 25%, Canvas weighs the prediction as 25% for the column and 75% for
the other columns.

Evaluate image prediction models

The Overview tab shows you the Per label performance, which gives you an overall accuracy score
for the images predicted for each label. You can choose a label to see more specific details, such as the
Correctly predicted and Incorrectly predicted images for the label.

You can turn on the Heatmap toggle to see a heatmap for each image. The heatmap shows you the areas
of interest that have the most impact when your model is making predictions. For more information
about heatmaps and how to use them to improve your model, choose the More info icon next to the
Heatmap toggle.

The Scoring tab for single-label image prediction models shows you a comparison of what the model
predicted as the label versus what the actual label was. You can select up to 10 labels at a time. You
can change the labels in the visualization by choosing the labels dropdown menu and selecting or
deselecting labels.

You can also view insights for individual labels or groups of labels, such as the three labels with the
highest or lowest accuracy, by choosing the View scores for dropdown menu in the Model accuracy
insights section.

The following screenshot shows the Scoring information for a single-label image prediction model.

Evaluate text prediction models

The Overview tab shows you the Per label performance, which gives you an overall accuracy score for
the passages of text predicted for each label. You can choose a label to see more specific details, such as
the Correctly predicted and Incorrectly predicted passages for the label.

The Scoring tab for multi-category text prediction models shows you a comparison of what the model
predicted as the label versus what the actual label was.

In the Model accuracy insights section, you can see the Most frequent category, which tells you the
category that the model predicted most frequently and how accurate those predictions were. If your
model predicts a label of Positive correctly 99% of the time, then you can be fairly confident that your
model is good at predicting positive sentiment in text.

The following screenshot shows the Scoring information for a multi-category text prediction model.


Use advanced metrics in your analyses


The advanced metrics that SageMaker Canvas shows you depend on whether your model performs
numeric, categorical, image, text, or time series forecasting predictions on your data.

Numeric prediction refers to the mathematical concept of regression. When your Target column has
values that can be measured, such as yearly revenue or the number of items sold by a department store,
Canvas builds a model on your data using regression.

Categorical prediction, such as 2 category prediction or 3 category prediction, refers to the mathematical
concept of classification. Categorical prediction can be performed on data that can be put into a
category, such as the following:

• The colors on a color wheel


• Instances where the data is either a 0 or 1
• Instances where the data is either a Yes or a No
• A list of responses to a survey question

Image prediction, such as single-label image prediction, refers to using computer vision to identify and
classify information in images. For example, you can use image prediction to predict whether an image is
of a dog or a cat.

Text prediction, such as multi-category text prediction, refers to using natural language processing
(NLP) to analyze language data. You can use multi-category text prediction on text data to analyze the
sentiment of text, or the overall mood of a text, such as Positive, Negative, Neutral, or Mixed.

Time series forecasting refers to making predictions that vary over time. You can perform time series
forecasts on data with timestamps that correlate to a value you want to predict. For example, you can
make a time series forecast that takes daily sales data and makes sales predictions for the next month.

SageMaker Canvas uses confusion matrices to help you visualize when a model makes predictions
correctly. In a confusion matrix, your results are arranged to compare the predicted values against the
actual values. The following example explains how a confusion matrix works for a 2 category prediction
model that predicts positive and negative labels (an illustrative code sketch follows the list):

• True positive – The model correctly predicted positive when the true label was positive.
• True negative – The model correctly predicted negative when the true label was negative.
• False positive – The model incorrectly predicted positive when the true label was negative.
• False negative – The model incorrectly predicted negative when the true label was positive.
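
As a minimal sketch of these four outcomes, assuming hypothetical labels and predictions, you can compute a confusion matrix with scikit-learn; this is for illustration and is not how Canvas reports its results.

from sklearn.metrics import confusion_matrix

actual    = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]

# Rows are actual labels and columns are predicted labels.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=["negative", "positive"]).ravel()
print(tp, tn, fp, fn)  # 2 true positives, 1 true negative, 1 false positive, 1 false negative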


The following image is an example of a confusion matrix for 2 categories.

The following image is an example of a confusion matrix for 3+ categories.

Metrics for numeric prediction

The following defines the advanced metrics for numeric prediction in Amazon SageMaker Canvas and
gives you information about how you can use them. An illustrative code sketch follows the list.

• R2 – The percentage of the difference in the target column that can be explained by the input column.
• MAE – Mean absolute error. On average, the prediction for the target column is +/- {MAE} from the
actual value.
• MAPE – Mean absolute percent error. On average, the prediction for the target column is +/- {MAPE}%
from the actual value.
• RMSE – Root Mean Square Error. The standard deviation of the errors.
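
The following is a minimal scikit-learn sketch of these four metrics on made-up data; Canvas computes and reports them for you automatically.

import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 310.0])

r2 = r2_score(actual, predicted)
mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)  # a fraction; multiply by 100 for a percentage
rmse = np.sqrt(mean_squared_error(actual, predicted))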

The following image shows a graph of the residuals or errors. The horizontal line indicates an error of 0
or a perfect prediction. The blue dots are the errors. Their distance from the horizontal line indicates the
magnitude of the errors.


The following image shows an error density plot.

Metrics for categorical, image, and text prediction


The following defines the advanced metrics for categorical, image, and text prediction in Canvas and
gives you information about how you can use them.

• Missing – A missing value contains no content or is non-existent. Missing values are automatically
inferred.
• Mismatched – A mismatched value has a different data type from the type specified for its column.
SageMaker Canvas categorizes these values as missing and infers values for them.
• Unique – The number and percentage of values that are unique.
• Target correlation – A value between -1 and 1 that represents strength of the linear relationship
between a column and the target column. 0 represents no detectable relationship. 1 represents a
strong positive relationship. -1 represents a strong negative relationship.
• Column impact – Identifies the relative impact of the column in predicting the target column.

The following is a list of available metrics for 2 category prediction.

• F1 – A balanced measure of accuracy that takes class balance into account.


• Accuracy – The percentage of correct predictions.


• Precision – Of all the times that {category-1} was predicted, the prediction was correct {precision}% of
the time.
• Recall – The model correctly predicted {recall}% to be {category-1} when {target_column} was actually
{category-1}.
• AUC – A value between 0 and 1 that indicates how well your model is able to separate the categories in
your dataset. A value of 1 indicates that it was able to separate the categories perfectly.

The following is a list of available metrics for 3+ category prediction, image prediction, and text
prediction. An illustrative code sketch follows the list.

• F1 – A balanced measure of accuracy that takes class balance into account.


• Accuracy – The percentage of correct predictions.
• Precision – Of all the times that {category-1} was predicted, the prediction was correct {precision}% of
the time.
• Recall – The model correctly predicted {recall}% to be {category-1} when {target_column} was actually
{category-1}.
• AUC – A value between 0 and 1 that indicates how well your model is able to separate the categories in
your dataset. A value of 1 indicates that it was able to separate the categories perfectly.
• Average F1 – The F1 averaged for all categories.
• Average Accuracy – The percentage of correct predictions out of all the predictions that are made.
• Average Precision – The precision averaged for all categories.
• Average Recall – The recall averaged for all categories.
• Average AUC – The AUC averaged for all categories.
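
As an illustration, assuming hypothetical labels, the following scikit-learn sketch computes the core classification metrics; Canvas reports these automatically, including the per-category and averaged variants.

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

actual    = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes"]

accuracy = accuracy_score(actual, predicted)
precision = precision_score(actual, predicted, pos_label="yes")
recall = recall_score(actual, predicted, pos_label="yes")
f1 = f1_score(actual, predicted, pos_label="yes")
# For 3+ categories, pass average="macro" to get each metric averaged over all categories.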

Metrics for time series forecasts

The following defines the advanced metrics for time series forecasts in Amazon SageMaker Canvas and
gives you information about how you can use them. An illustrative code sketch follows the list.

• Average Weighted Quantile Loss (wQL) – Evaluates the forecast by averaging the accuracy at the P10,
P50, and P90 quantiles. A lower value indicates a more accurate model.
• Weighted Absolute Percent Error (WAPE) – The sum of the absolute error normalized by the sum of
the absolute target, which measures the overall deviation of forecasted values from observed values. A
lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
• Root Mean Square Error (RMSE) – The square root of the average squared errors. A lower RMSE
indicates a more accurate model, where RMSE = 0 is a model with no errors.
• Mean Absolute Percent Error (MAPE) – The percentage error (percent difference of the mean
forecasted value versus the actual value) averaged over all time points. A lower value indicates a more
accurate model, where MAPE = 0 is a model with no errors.
• Mean Absolute Scaled Error (MASE) – The mean absolute error of the forecast normalized by the mean
absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model,
where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse
than the baseline.
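
For illustration only, the following NumPy sketch computes WAPE and MASE on made-up data, using the one-step naive forecast (the previous observed value) as the baseline; Canvas computes these metrics for you.

import numpy as np

actual = np.array([100.0, 120.0, 130.0, 125.0])
forecast = np.array([98.0, 118.0, 135.0, 120.0])

# WAPE: sum of absolute errors normalized by the sum of the absolute target.
wape = np.abs(actual - forecast).sum() / np.abs(actual).sum()

# MASE: forecast MAE normalized by the MAE of a naive baseline (previous value).
baseline_mae = np.abs(np.diff(actual)).mean()
mase = np.abs(actual - forecast).mean() / baseline_mae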

Make predictions for your data


Use the custom model that you've built in SageMaker Canvas to make predictions for your data. The
following sections show you how to make predictions for numeric and categorical prediction models,
image prediction models, and text prediction models. For information about how to make predictions
with a time series forecast model, see Make a time series forecast.


Numeric and categorical prediction, image prediction, and text prediction custom models support
making the following types of predictions for your data:

• Single predictions — A Single prediction is when you only need to make one prediction. For example,
you have one image or passage of text that you want to classify.
• Batch predictions — A Batch prediction is when you’d like to make predictions for an entire dataset.
For example, you have a CSV file of customer reviews for which you’d like to predict the customer
sentiment, or you have a folder of image files that you'd like to classify. You should make predictions
with a dataset that matches your input dataset. Canvas provides you with the ability to do manual
batch predictions, or you can configure automatic batch predictions that initiate whenever a specified
dataset is updated in Canvas.

For each prediction or set of predictions, SageMaker Canvas returns the following:

• The predicted values


• The probability of the predicted value being correct

Get started

Choose one of the following workflows to make predictions with your custom model:

• Make batch predictions (p. 360)


• Make single predictions (p. 359)

After generating predictions with your model, you can also do the following:

• Update your model by creating a new version. If you want to try to improve the prediction accuracy
of your model, you can build new versions of your model. You can update your data or change any
advanced transformations you used, and then you can review and compare the versions of your model
to choose the best one.
• Register a model version in the SageMaker model registry (p. 373). You can register versions of your
model to the SageMaker model registry, which is a feature for tracking and managing the status of
model versions and machine learning pipelines. A data scientist or MLOps team user with access to
the SageMaker model registry can review your model versions and approve or reject them before
deploying them to production (a code sketch of this review step follows the list).
• Send your batch predictions to Amazon QuickSight. In Amazon QuickSight, you can build and publish
dashboards with your batch prediction datasets. This can help you analyze and share results generated
by your custom model.
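
The following is a minimal boto3 sketch of that review step, assuming a model package group named "canvas-models" already exists and that the caller has SageMaker permissions; it lists registered versions and approves the most recent one. Canvas users themselves do this from the UI.

import boto3

sm = boto3.client("sagemaker")

# List the versions registered to a model package group (the group name is an assumption).
packages = sm.list_model_packages(ModelPackageGroupName="canvas-models")
latest = packages["ModelPackageSummaryList"][0]

# Approve the version so that it can be deployed to production.
sm.update_model_package(
    ModelPackageArn=latest["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)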

Make single predictions


Make single predictions if you want to get a prediction for a single data point. You can use this feature
to get real-time predictions or to experiment with changing individual values to see how they impact the
prediction outcome.

Choose one of the following procedures based on your model type.

Make single predictions with numeric and categorical prediction models

To make a single prediction for a numeric or categorical prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose My models.


2. On the My models page, choose your model.
3. After opening your model, choose the Predict tab.


4. On the Run predictions page, choose Single prediction.


5. For each Column field, which represents the columns of your input data, you can change the Value.
Select the dropdown menu for the Value you want to change. For numeric fields, you can enter a
new number. For fields with labels, you can select a different label.
6. When you’re ready to generate the prediction, in the right Prediction pane, choose Update.

In the right Prediction pane, you’ll see the prediction result. You can Copy the prediction result chart,
or you can also choose Download to either download the prediction result chart as an image or to
download the values and prediction as a CSV file.

Make single predictions with image prediction models

To make a single prediction for a single-label image prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose My models.


2. On the My models page, choose your model.
3. After opening your model, choose the Predict tab.
4. On the Run predictions page, choose Single prediction.
5. Choose Import image.
6. You’ll be prompted to upload an image. You can upload an image from your local computer or from
an Amazon S3 bucket.
7. Choose Import to import your image and generate the prediction.

In the right Prediction results pane, the model lists the possible labels for the image along with a
Confidence score for each label. For example, the model might predict the label Sea for an image with
a confidence score of 96%, and the label Glacier with a confidence score of only 4%. Therefore, you can
determine that your model is fairly confident in predicting images of the sea.

Make single predictions with text prediction models

To make a single prediction for a multi-category text prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose My models.


2. On the My models page, choose your model.
3. After opening your model, choose the Predict tab.
4. On the Run predictions page, choose Single prediction.
5. For the Text field, enter the text for which you’d like to get a prediction.
6. Choose Generate prediction results to get your prediction.

In the right Prediction results pane, you receive an analysis of your text in addition to a Confidence
score for each possible label. For example, if you entered a good review for a product, you might get
Positive with a confidence score of 85%, while the confidence score for Neutral might be 10% and the
confidence score for Negative only 5%.

Make batch predictions


Make batch predictions when you have an entire dataset for which you’d like to generate predictions.

There are two types of batch predictions you can make:

• Manual batch predictions are when you have a dataset for which you want to make one-time
predictions.


• Automatic batch predictions are when you set up a configuration that runs a batch prediction
whenever a specific dataset is updated. For example, if you’ve configured weekly updates to a
SageMaker Canvas dataset of inventory data, you can set up automatic batch predictions that run
whenever you update the dataset. After setting up an automated batch predictions workflow, see
Manage automations (p. 375) for more information about viewing and editing the details of your
configuration. For more information about setting up automatic dataset updates, see Configure
automatic updates for a dataset (p. 309).

Note
You can only set up automatic batch predictions for datasets imported through local upload or
Amazon S3. Additionally, automatic batch predictions can only run while you’re logged in to the
Canvas application. If you log out of Canvas, automatic batch prediction jobs resume when you
log back in.

To get started, review the following section for batch prediction dataset requirements, and then
choose one of the following manual or automatic batch prediction workflows.

Batch prediction dataset requirements

For batch predictions, make sure that your datasets meet the requirements outlined in Create a
dataset (p. 301).

You might not be able to make predictions on some datasets because they have incompatible schemas. A
schema is an organizational structure. For a tabular dataset, the schema is the names of the columns and
the data type of the data in the columns. An incompatible schema might happen for one of the following
reasons:

• The dataset that you're using to make predictions has fewer columns than the dataset that you're
using to build the model.
• The data types in the columns you used to build the dataset might be different from the data types in
the dataset that you're using to make predictions.
• The dataset that you're using to make predictions and the dataset that you've used to build the model
have column names that don't match. The column names are case sensitive. Column1 is not the same
as column1.

To ensure that you can successfully generate batch predictions, match the schema of your batch
predictions dataset to the dataset you used to train the model.
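
If you want to verify this outside of Canvas first, the following pandas sketch compares a prediction dataset against the training dataset's schema; the file names are examples.

import pandas as pd

train = pd.read_csv("training_data.csv")    # dataset used to build the model
batch = pd.read_csv("prediction_data.csv")  # dataset to use for batch predictions

# Column names are case sensitive: Column1 is not the same as column1.
missing_columns = set(train.columns) - set(batch.columns)

# Columns whose data types differ between the two datasets.
shared = train.columns.intersection(batch.columns)
mismatched_types = {
    col: (train[col].dtype, batch[col].dtype)
    for col in shared
    if train[col].dtype != batch[col].dtype
}

print(missing_columns, mismatched_types)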
Note
For batch predictions, if you dropped any columns when building your model, Canvas adds the
dropped columns back to the prediction results. However, Canvas does not add the dropped
columns to your batch predictions for time series models.

Make manual batch predictions


Choose one of the following procedures to make manual batch predictions based on your model type.

Make manual batch predictions with numeric and categorical prediction models
To make manual batch predictions for a numeric or categorical prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose My models.


2. On the My models page, choose your model.
3. After opening your model, choose the Predict tab.
4. On the Run predictions page, choose Batch prediction.
5. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you’ll be directed through the import data workflow.


6. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.

After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can choose Preview to preview the output data. You can see the input data matched to the prediction
and the probability that the prediction is correct. Then, you can choose Download CSV to download the
results as a CSV file.

Make manual batch predictions with image prediction models

To make manual batch predictions for a single-label image prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose My models.


2. On the My models page, choose your model.
3. After opening your model, choose the Predict tab.
4. On the Run predictions page, choose Batch prediction.
5. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you’ll be directed through the import data workflow.
6. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.

After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can choose View prediction results to see the output data. You can see the images along with their
predicted labels and confidence scores. Then, you can choose Download prediction to download the
results as a CSV or a ZIP file.

Make manual batch predictions with text prediction models

To make manual batch predictions for a multi-category text prediction model, do the following:

1. In the left navigation pane of the Canvas application, choose My models.


2. On the My models page, choose your model.
3. After opening your model, choose the Predict tab.
4. On the Run predictions page, choose Batch prediction.
5. Choose Select dataset if you’ve already imported your dataset. If not, choose Import new dataset,
and then you’ll be directed through the import data workflow. The dataset you choose must have
the same source column as the dataset with which you built the model.
6. From the list of available datasets, select your dataset and choose Generate predictions to get your
predictions.

After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can choose Preview to see the output data. You can see the passages of text along with their predicted
labels and confidence scores. Then, you can choose Download CSV to download the results.

Make automatic batch predictions

To set up a schedule for automatic batch predictions, do the following:

1. In the left navigation pane of Canvas, choose My models.


2. Choose your model.


3. Choose the Predict tab.


4. Choose Batch prediction.
5. For Generate predictions, choose Automatic.
6. The Automate batch predictions dialog box pops up. Choose Select dataset and choose the dataset
for which you want to automate predictions. Note that you can only select a dataset that was
imported through local upload or Amazon S3.
7. After selecting a dataset, choose Set up.

Canvas runs a batch predictions job for the dataset after you set up the configuration. Then, every time
you Update a dataset (p. 308), either manually or automatically, another batch predictions job runs.

After the prediction job finishes running, on the Run predictions page, you see an output dataset listed
under Predictions. This dataset contains your results, and if you select the More options icon, you
can choose Preview to preview the output data. You can see the input data matched to the prediction
and the probability that the prediction is correct. Then, you can choose Download to download the
results.

The following sections describe how to view, update, and delete your automatic batch prediction
configuration through the Datasets page in the Canvas application. You can only set up a maximum
of 20 automatic configurations in Canvas. For more information about viewing your automated batch
predictions job history or making changes to your automatic configuration through the Automations
page, see Manage automations (p. 375).

View your automatic batch prediction jobs

To view your job history for your automatic batch predictions, go to the Predict tab of your model.

Each automatic batch prediction job shows up in the Predict tab of your model. Under Predictions, you
can see the All jobs and Configuration tabs:

• All jobs – In this tab, you can see all of the batch prediction jobs for this model. You can filter the
jobs by configuration name. For each job, you can see fields such as the Input dataset, which includes
the version of the dataset, and the Prediction type, such as whether the predictions were automatic
or manual. If you choose the More options icon, you can choose View prediction or Download
prediction.
• Configuration – In this tab, you can see all of the automatic batch prediction configurations you’ve
created. For each configuration, you can see fields such as the timestamp for when it
was Created, the Input dataset it tracks for updates, and the Next job scheduled. If you choose the
More options icon, you can choose View all jobs to see the job history and in-progress jobs for the
configuration.

Edit your automatic batch prediction configuration

You might want to make changes to your automatic batch prediction configuration, such as changing
the dataset that it tracks for updates. You might also want to turn off the configuration to pause your
batch predictions.

When you edit a batch prediction configuration, you can change the target dataset but not the frequency
(since automatic batch predictions occur whenever the dataset is updated).

To edit your auto update configuration, do the following:

1. Go to the Predict tab of your model.


2. Under Predictions, choose the Configuration tab.


3. Find your configuration and choose the More options icon.


4. From the dropdown menu, choose Update configuration.
5. The Automate batch prediction dialog box opens. You can select another dataset and choose Set up
to save your changes.

Your automatic batch predictions configuration is now updated.

To pause your automatic batch predictions, turn off your automatic configuration by doing the following:

1. Go to the Predict tab of your model.


2. Under Predictions, choose the Configuration tab.
3. Find your configuration from the list and turn off the Auto update toggle.

Automatic batch predictions are now paused. You can turn the toggle back on at any time to resume the
update schedule.

Delete your automatic batch prediction configuration

To learn how to delete your automatic batch prediction configuration, see Delete an automatic
configuration (p. 377).

You can also delete your configuration by doing the following:

1. Go to the Predict tab of your model.


2. Under Predictions, choose the Configuration tab.
3. Find your configuration from the list and choose the More options icon.
4. From the dropdown menu, choose Delete configuration.

Your configuration should now be deleted.

Send predictions to Amazon QuickSight


Note
You can send batch predictions to Amazon QuickSight for numeric and categorical prediction
and time series forecasting models. You can also send predictions generated with BYOM models.
Single-label image prediction and multi-category text prediction models are excluded.

Once you generate batch predictions with custom tabular models in SageMaker Canvas, you can send
those predictions as CSV files to Amazon QuickSight, a business intelligence (BI) service that you can use
to build and publish predictive dashboards.

For example, if you built a 2-category prediction model to determine whether a customer will churn, you
can create a visual, predictive dashboard in QuickSight to show the percentage of customers that are
expected to churn. To learn more about Amazon QuickSight, see the Amazon QuickSight User Guide.

The following sections show you how to send your batch predictions to QuickSight for analysis.

Before you begin

Your user must have the necessary AWS Identity and Access Management (IAM) permissions to send your
predictions to QuickSight. Your administrator can set up the IAM permissions for your user. For more
information, see Grant Your Users Permissions to Send Predictions to Amazon QuickSight (p. 283).

Your QuickSight account must contain the default namespace, which is set up when you first create
your QuickSight account. Contact your administrator to help you get access to QuickSight. For more
information, see Setting up for Amazon QuickSight in the Amazon QuickSight User Guide.


Your QuickSight account must be created in the same Region as your Canvas application. If your
QuickSight account’s home Region differs from your Canvas application’s Region, you must either
close and recreate your QuickSight account, or set up a Canvas application in the same Region as your
QuickSight account. You can check your QuickSight home Region by doing the following (assuming you
already have a QuickSight account):

1. Open your QuickSight console.


2. When the page loads, your QuickSight home Region is appended to the URL in the following format:
https://<your-home-region>.quicksight.aws.amazon.com/.

You must know the usernames of the QuickSight users to whom you want to send your predictions. You
can send predictions to yourself or other users who have the right permissions. Any users to whom you
send predictions must be in the default namespace of your QuickSight account and have the Author
or Admin role in QuickSight.
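
To confirm which users qualify before you send predictions, you can list the users in the default
namespace and check their roles. The following is a minimal boto3 sketch, assuming your AWS
credentials are configured and that you call the API in the Region where your QuickSight users are
managed (the identity Region):

import boto3

quicksight = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# List users in the default namespace; only AUTHOR or ADMIN users
# can receive predictions from Canvas.
response = quicksight.list_users(AwsAccountId=account_id, Namespace="default")
for user in response["UserList"]:
    print(user["UserName"], user["Role"])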

Additionally, QuickSight must have access to the SageMaker default Amazon S3 bucket for your Domain,
which is named with the following format: sagemaker-{REGION}-{ACCOUNT_ID}. The Region should
be the same as your QuickSight account's home Region and your Canvas application’s Region. To learn
how to give QuickSight access to the batch predictions stored in your Amazon S3 bucket, see the topic I
can’t connect to Amazon S3 in the Amazon QuickSight User Guide.
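
As a quick check, you can derive the default bucket name for your account and Region with a few lines
of boto3. This is a minimal sketch that assumes your AWS credentials and default Region are already
configured:

import boto3

session = boto3.session.Session()
region = session.region_name
account_id = session.client("sts").get_caller_identity()["Account"]

# The SageMaker default bucket follows the sagemaker-{REGION}-{ACCOUNT_ID} format.
default_bucket = f"sagemaker-{region}-{account_id}"
print(default_bucket)  # for example: sagemaker-us-east-1-111122223333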

Supported data formats

Before sending your predictions, check that the data format of your batch predictions is compatible with
QuickSight.

• To learn more about the accepted data formats for time series data, see Supported date formats in the
Amazon QuickSight User Guide.
• To learn more about data values that might prevent you from sending to QuickSight, see Unsupported
values in data in the Amazon QuickSight User Guide.

Also note that Amazon QuickSight uses the character " as a text qualifier, so if your Canvas data contains
any " characters, make sure that you close all matching quotes. Mismatched quotes can cause issues
when sending your dataset to QuickSight.
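
As a rough pre-flight check, you can scan your CSV for lines with an odd number of double-quote
characters. This is only a heuristic sketch (quoted fields that legitimately span multiple lines also
produce odd counts), and the file name is hypothetical:

with open("canvas-dataset.csv", encoding="utf-8") as f:  # hypothetical file name
    for line_number, line in enumerate(f, start=1):
        # An odd number of " characters on a line suggests an unclosed quote.
        if line.count('"') % 2 != 0:
            print(f"Possible unmatched quote on line {line_number}")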

Send your batch predictions to QuickSight

Use the following procedure to send your predictions to QuickSight:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose My models.
3. On the My models page, choose your model.
4. Choose the Predict tab.
5. Under Predictions, select the dataset (or datasets) of batch predictions that you’d like to share. You
can share up to 5 datasets of batch predictions at a time.
6. After you select your dataset, choose Send to Amazon QuickSight.
Note
The Send to Amazon QuickSight button doesn’t activate unless you select one or more
datasets.

Alternatively, you can preview your predictions by choosing the More options icon and then
View prediction results. From the dataset preview, you can choose Send to Amazon QuickSight.
The following screenshot shows you the Send to Amazon QuickSight button in a dataset preview.


7. In the Send to Amazon QuickSight dialog box, do the following:

a. For QuickSight users, enter the names of the QuickSight users to whom you want to send your
predictions. If you want to send them to yourself, enter your own username. You can only send
predictions to users in the default namespace of the QuickSight account, and the user must
have the Author or Admin role in QuickSight.
b. Choose Send.

The following screenshot shows the Send to Amazon QuickSight dialog box:


After you send your batch predictions, the QuickSight field for the datasets you sent shows as Sent.
In the confirmation box that confirms your predictions were sent, you can choose Open Amazon
QuickSight to open your QuickSight application. If you’re done using Canvas, you should log out of the
Canvas application.

The QuickSight users that you’ve sent datasets to can open their QuickSight application and view the
Canvas datasets that have been shared with them. Then, they can create predictive dashboards with the
data. For more information, see Getting started with Amazon QuickSight data analysis in the Amazon
QuickSight User Guide.

By default, all of the users to whom you send predictions have owner permissions for the dataset in
QuickSight. Owners are able to create analyses, refresh, edit, delete, and re-share datasets. The changes
that owners make to a dataset change the dataset for all users with access. To change the permissions,
go to the dataset in QuickSight and manage its permissions. For more information, see Viewing and
editing the permissions of users that a dataset is shared with in the Amazon QuickSight User Guide.

Time Series Forecasts in Amazon SageMaker Canvas


Note
Time series forecasting models are only supported for tabular datasets.

Amazon SageMaker Canvas gives you the ability to make machine learning time series forecasts. Time
series forecasts let you make predictions that can vary with time.

You can use a time series forecast for examples such as the following:

• Forecasting your inventory in the coming months.
• Predicting the number of items sold in the next four months.
• Estimating the effect of reducing the price on sales during the holiday season.
• Projecting item inventory in the next 12 months.
• Predicting the number of customers entering a store in the next several hours.
• Forecasting how a 10% reduction in the price of a product affects sales over a time period.

To make a time series forecast, your dataset must have the following:

• A timestamp column with all values having the datetime type.


• A target column that has the values that you're using to forecast future values.

The datetime values in the timestamp column must use one of the following formats:

• YYYY-MM-DD HH:MM:SS
• YYYY-MM-DDTHH:MM:SSZ
• YYYY-MM-DD
• MM/DD/YY
• MM/DD/YY HH:MM
• MM/DD/YYYY
• YYYY/MM/DD HH:MM:SS
• YYYY/MM/DD
• DD/MM/YYYY
• DD/MM/YY


• DD-MM-YY
• DD-MM-YYYY
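
Before importing a dataset, you might want to verify that your timestamp values match one of these
formats. The following is a minimal Python sketch that maps the formats above to their strptime
equivalents:

from datetime import datetime

# strptime equivalents of the supported Canvas datetime formats above.
SUPPORTED_FORMATS = [
    "%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d",
    "%m/%d/%y", "%m/%d/%y %H:%M", "%m/%d/%Y",
    "%Y/%m/%d %H:%M:%S", "%Y/%m/%d",
    "%d/%m/%Y", "%d/%m/%y", "%d-%m-%y", "%d-%m-%Y",
]

def matches_supported_format(value: str) -> bool:
    """Return True if the string parses with any supported format."""
    for fmt in SUPPORTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

print(matches_supported_format("2023-01-15 09:30:00"))  # True
print(matches_supported_format("Jan 15, 2023"))         # False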

You can make forecasts for the following intervals:

• 1 min
• 5 min
• 15 min
• 30 min
• 1 hour
• 1 day
• 1 week
• 1 month
• 1 year

For higher prediction accuracy, your dataset can also have additional columns that provide data
explaining the variation in the target column. Using these additional explanatory columns might help
you forecast future values in the target column more accurately.

For example, you can forecast the amount of ice cream sold by a grocery store. To make a forecast, you
must have a timestamp column and a column that indicates how much ice cream the grocery store sold.
For a more accurate forecast, your dataset can also include the price, the ambient temperature, the flavor
of the ice cream, or a unique identifier for the ice cream.

Ice cream sales might increase when the weather is warmer. A decrease in the price of the ice cream
might result in more units sold. Having a column with ambient temperature data and a column with
pricing data can improve your ability to forecast the number of units of ice cream the grocery store sells.

You might have missing data for different reasons. The reason for your missing data might inform
how you want Amazon SageMaker Canvas to impute it. For example, your organization might use an
automatic system that only tracks when a sale happens. If you're using a dataset that comes from this
type of automatic system, you have missing values in the target column.

For missing values in the dataset, SageMaker Canvas imputes the missing values for you.
Important
If you have missing values in the target column, we recommend using a dataset that doesn't
have them. SageMaker Canvas uses the target column to forecast future values. Missing values
in the target column can greatly reduce the accuracy of the forecast.
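
One way to check for this before you build a model is to count the missing values in your target column
with pandas. This is a minimal sketch; the file name and the target column name (demand) are
hypothetical:

import pandas as pd

# Hypothetical file and target column names.
df = pd.read_csv("ice-cream-sales.csv")

missing_target = df["demand"].isna().sum()
print(f"Missing target values: {missing_target} of {len(df)} rows")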

You can make one of the following types of forecasts:

• Single item
• All items

For a forecast on all the items in your dataset, SageMaker Canvas returns a forecast for the future values
for each item in your dataset.

For a single item forecast, you specify the item and SageMaker Canvas returns a forecast for the future
values. The forecast includes a line graph that plots the predicted values over time.

Topics
• Gain additional insights from your forecast (p. 369)


• Make a time series forecast (p. 369)

Gain additional insights from your forecast


In Amazon SageMaker Canvas, you can use the following optional methods to get more insights from
your forecast:

• Group column
• Holiday schedule
• What-if scenario

You can specify a column in your dataset as a Group column. Amazon SageMaker Canvas groups the
forecast by each value in the column. For example, you can group the forecast on columns containing
price data or unique item identifiers. Grouping a forecast by a column lets you make more specific
forecasts. For example, if you group a forecast on a column containing item identifiers, you can see the
forecast for each item.

Overall sales of items might be impacted by the presence of holidays. For example, in the United States,
the number of items sold in both November and December might differ greatly from the number of
items sold in January. If you use the data from November and December to forecast the sales in January,
your results might be inaccurate. Using a holiday schedule prevents you from getting inaccurate results.
You can use a holiday schedule for 251 countries.

For a forecast on a single item in your dataset, you can use what-if scenarios. A what-if scenario gives
you the ability to change values in your data and see how the forecast changes. For example, you can use
a what-if scenario to answer the following questions: "What if I lowered prices? How would that affect
the number of items sold?"

Make a time series forecast


To make a time series forecast, you choose a target column. The target column contains the data
that you want to predict. For example, your target column might have data on the number of items
sold. After you select the target column, Amazon SageMaker Canvas selects a Model type. SageMaker
Canvas uses the time-series data to automatically choose a time series model that you can use to make
predictions on your data. After you build the model, you can evaluate its performance and use it to make
predictions on new data.

To make a time series forecast, do the following:

1. Import the data.


2. Choose a target column in your dataset.
3. SageMaker Canvas automatically chooses Time series forecasting as the model type. Choose Set
configuration to confirm that you're performing a time series forecast.
4. Specify the following fields:

• Item ID column – The column that contains unique identifiers for each item in your dataset. For
example, an SKU number uniquely identifies an item.
• Optional: Group column – Groups the time series forecast by values in the column. For example,
you can group your forecast for an item by store.
• Time stamp column – The column containing the time stamps in your dataset. For a list of the
supported datetime formats for this column, see Time Series Forecasts in Amazon SageMaker
Canvas (p. 367).


• Future timestamp – A timestamp that indicates a future forecast time. SageMaker Canvas
forecasts values up to the point in time specified by the timestamp.
• Optional: Holiday schedule – Activate the holiday schedule to use a country's holiday schedule.
Use it to make your forecasts with holiday data more accurate.
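
To illustrate how these fields map onto a dataset, the following sketch builds a small, hypothetical
table with an item ID column (sku), an optional group column (store), a timestamp column (sold_at),
and a target column (units_sold); all of the names and values are invented for illustration:

import pandas as pd

# A hypothetical dataset illustrating the fields above.
df = pd.DataFrame({
    "sku": ["A1", "A1", "A2", "A2"],            # Item ID column
    "store": ["east", "east", "west", "west"],  # optional Group column
    "sold_at": ["2023-01-01", "2023-01-02",
                "2023-01-01", "2023-01-02"],    # Time stamp column
    "units_sold": [12, 15, 7, 9],               # target column
})
print(df)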

You can have one of the following types of missing values:

• Missing future values


• Missing values

Missing future values are missing values in the target column. SageMaker Canvas uses the values in the
target column to forecast the values in the future. If you have missing values in the target column, your
forecast might be less accurate. We highly recommend updating the dataset.

Missing values are values that are missing in any column other than the target column. With missing
values that aren't in the target column, it's helpful to note the following:

• They generally don't reduce the accuracy of your forecast as much as missing future values.
• SageMaker Canvas automatically imputes the missing values.

You can evaluate the model by seeing how close the predictions are to the actual values. You can also
use the Column Impact metric to determine the direction and magnitude of the column's impact on the
model's predictions. For example, in the following image, holidays had the largest positive impact on the
forecast for demand. Price had the largest negative impact on demand.

After you've built a model, you can make the following types of forecasts:

• Single item – Make a forecast for a single item in a dataset and get a line graph of the values that
SageMaker Canvas forecasts. For example, you can see how sales of an item vary over time.
• All items – Make a forecast for all items in a dataset.
• What-if scenario – See how changing values in the dataset can affect the overall forecast for a single
item.


The following image shows a single item forecast with a what-if scenario. In a what-if scenario, you have
the ability to change values that can vary with time. You can see how changing the values affects the
forecast.

The points connected by the solid blue line are the values that the model forecasts. The points
connected by the dashed lines show the what-if scenario.

Updating a Model in Amazon SageMaker Canvas


Amazon SageMaker Canvas gives you the ability to update the models that you've built using new data.
SageMaker Canvas shows you a model history, so that you can compare the models that you've built
recently to those that you've generated in the past.

Each model that you build has a version number. The first model is Version 1, or V1. You can use
model versions to see changes in prediction accuracy when you update your data or use advanced
transformations.
Note
Text prediction and image prediction models only support one model version.

For new versions of a model, you can only choose datasets that have the same target column as the
target column in Version 1. You must build at least one version of a model to add a new version, and you
can delete versions that aren’t useful to you anymore.

You can also see Register a model version in the SageMaker model registry (p. 373) to help you track
your versions over time and collaborate with Studio users who can approve or reject your model versions.

Use the following procedure to add a new model version or to view all of the versions of your model.

To add a new model version, do the following:

1. Open your SageMaker Canvas application.


2. In the left navigation pane, choose My models.
3. On the My models page, choose your model. You can Filter by problem type to find your model
more easily.


4. After choosing your model, the Versions page opens, listing all of the versions of your model.
5. Choose Add version.

The following image shows the Versions page for a model, on which you can view your model versions
and add new versions.

On the Versions page, you can view the following information for each of your model versions:

• Status – This field tells you whether your model is currently building (In building), done building
(Ready), failed to build (Failed), or still being edited (In draft).
• Model score, F1, Precision, Recall, and AUC – If you turn on the Show advanced metrics toggle on
this page, you can see these model metrics. These metrics indicate the accuracy and performance of
your model. For more information, see Evaluate your model.
• Shared – This field tells you whether or not you’ve shared the model version with SageMaker Studio
users.
• Model registry – This field tells you whether or not you’ve registered the version to a model registry.
For more information, see Register a model version in the SageMaker model registry (p. 373).

After you choose a new version, you start the process of building another model. The process for
building a new version of a model is almost the same as the process for building a model for the first
time. For new versions of a model, you can only choose datasets that have the same target column
as the target column in Version 1. For more information about building a model, see Build a custom
model (p. 321).

Operationalize your models


After building a model in SageMaker Canvas that you feel confident about, you might want to integrate
your model with the machine learning operations (MLOps) processes in your organization. MLOps
includes common tasks such as deploying a model for use in production or setting up continuous
integration and continuous deployment (CI/CD) pipelines.

The following topics describe how you can use features within Canvas to use a Canvas-built model in
production.

Topics
• Register a model version in the SageMaker model registry (p. 373)


Register a model version in the SageMaker model registry


With SageMaker Canvas, you can build multiple iterations, or versions, of your model to improve it over
time. You might want to build a new version of your model if you acquire better training data or if you
want to attempt to improve the model’s accuracy. For more information about adding versions to your
model, see Update a model.

After you’ve built a model that you feel confident about, you might want to evaluate its performance
and have it reviewed by a data scientist or MLOps engineer in your organization before using it in
production. To do this, you can register your model versions to the SageMaker model registry. The
SageMaker model registry is a repository that data scientists or engineers can use to catalog machine
learning (ML) models and manage model versions and their associated metadata, such as training
metrics. They can also manage and log the approval status of a model.

After you register your model versions to the SageMaker model registry, a data scientist or your MLOps
team can access the SageMaker model registry through SageMaker Studio, which is a web-based
integrated development environment (IDE) for working with machine learning models. In the SageMaker
model registry interface in Studio, the data scientist or MLOps team can evaluate your model and
update its approval status. If the model doesn’t perform to their requirements, the data scientist or
MLOps team can update the status to Rejected. If the model does perform to their requirements,
then the data scientist or MLOps team can update the status to Approved. Then, they can deploy your
model to an endpoint or automate model deployment with CI/CD pipelines. You can use the SageMaker
model registry feature to seamlessly integrate models built in Canvas with the MLOps processes in your
organization.

The following diagram summarizes an example of registering a model version built in Canvas to the
SageMaker model registry for integration into an MLOps workflow.

You can register tabular, image, and text model versions to the SageMaker model registry.
Note
Currently, registration of time series forecasting or BYOM model versions built in Canvas to the
SageMaker model registry isn’t supported.

The following sections show you how to register a model version to the SageMaker model registry from
Canvas.

Permissions management

By default, you have permissions to register model versions to the SageMaker model registry.
SageMaker grants these permissions for all new and existing Canvas user profiles through the
AmazonSageMakerCanvasFullAccess policy, which is attached to the AWS IAM execution role for the
SageMaker Domain that hosts your Canvas application.

If your Canvas administrator is setting up a new Domain or user profile, when they're setting up the
Domain and following the prerequisite instructions in the Getting started guide, SageMaker turns on the
model registration permissions through the ML Ops permissions configuration option, which is enabled
by default.


The Canvas administrator can manage model registration permissions at the user profile level as well. For
example, if the administrator wants to grant model registration permissions to some user profiles but
remove permissions for others, they can edit the permissions for a specific user. The following procedure
shows how to turn off model registration permissions for a specific user profile:

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Domains.
3. From the list of Domains, select the user profile’s Domain.
4. On the Domain details page, choose the User profile whose permissions you want to edit.
5. On the User Details page, choose Edit.
6. In the left navigation pane, choose Canvas settings.
7. In the ML Ops permissions configuration section, turn off the Enable Model Registry registration
permissions toggle.
8. Choose Submit to save the changes to your Domain settings.

The user profile should no longer have model registration permissions.

Register a model version to the SageMaker model registry

SageMaker model registry tracks all of the model versions that you build to solve a particular problem in
a model group. When you build a SageMaker Canvas model and register it to SageMaker model registry, it
gets added to a model group as a new model version. For example, if you build and register four versions
of your model, then a data scientist or MLOps team working in the SageMaker model registry interface
can view the model group and review all four versions of the model in one place.

When registering a Canvas model to the SageMaker model registry, a model group is automatically
created and named after your Canvas model. Optionally, you can rename it to a name of your choice, or
use an existing model group in the SageMaker model registry. For more information about creating a
model group, see Create a Model Group.
Note
Currently, you can only register models built in Canvas to the SageMaker model registry in the
same account.

To register a model version to the SageMaker model registry from the Canvas application, use the
following procedure:

1. Open the SageMaker Canvas application.


2. In the left navigation pane, choose My models.
3. On the My models page, choose your model. You can Filter by problem type to find your model
more easily.
4. After choosing your model, the Versions page opens, listing all of the versions of your model. You
can turn on the Show advanced metrics toggle to view the advanced metrics, such as Recall and
Precision, to compare your model versions and determine which one you’d like to register.
5. From the list of model versions, for the version that you want to register, choose the More
options icon. Alternatively, you can double-click the version that you want to register, and
then on the version details page, choose the More options icon.
6. In the dropdown list, choose Add to Model Registry. The Add to Model Registry dialog box opens.
7. In the Add to Model Registry dialog box, do the following:

a. (Optional) In the SageMaker Studio model group section, for the Model group name field,
enter the name of the model group to which you want to register your version. You can specify
the name for a new model group that SageMaker creates for you, or you can specify an existing
model group. If you don’t specify this field, Canvas registers your version to a default model
group with the same name as your model.


b. Choose Add.

Your model version should now be registered to the model group in the SageMaker model registry. When
you register a model version to a model group in the SageMaker model registry, all subsequent versions
of the Canvas model are registered to the same model group (if you choose to register them). If you want
to register your versions to a different model group, you need to go to the SageMaker model registry and
delete the model group. Then, you can re-register your model versions to the new model group.

To view the status of your models, you can return to the Versions page for your model in the Canvas
application. This page shows you the Model Registry status of each version. If the status is Registered,
then the model has been successfully registered.

If you want to view the details of your registered model version, for the Model Registry status, you can
hover over the Registered field to see the Model registry details pop-up box. These details contain more
information, such as the following:

• The Model package group name is the model group that your version is registered to in the
SageMaker model registry.
• The Approval status, which can be Pending Approval, Approved, or Rejected. If a Studio user
approves or rejects your version in the SageMaker model registry, then this status is updated on your
model versions page when you refresh the page.

The following screenshot shows the Model registry details box, along with an Approval status of
Approved for this particular model version.
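
Outside of the Studio interface, a data scientist or MLOps engineer can also inspect and update the
approval status with the SageMaker API. The following is a minimal boto3 sketch; the model group name
and model package ARN are hypothetical:

import boto3

sm = boto3.client("sagemaker")

# List the versions registered to a model group (named after the Canvas model).
versions = sm.list_model_packages(ModelPackageGroupName="my-canvas-model")
for package in versions["ModelPackageSummaryList"]:
    print(package["ModelPackageArn"], package.get("ModelApprovalStatus"))

# Approve a specific version; valid statuses are Approved, Rejected,
# and PendingManualApproval.
sm.update_model_package(
    ModelPackageArn=(
        "arn:aws:sagemaker:us-east-1:111122223333:"
        "model-package/my-canvas-model/1"
    ),
    ModelApprovalStatus="Approved",
)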

Manage automations
In SageMaker Canvas, you can create automations that update your dataset or generate predictions from
your model on a schedule. For example, you might receive new shipping data on a daily basis. You can
set up an automatic update for your dataset and automatic batch predictions that run whenever the
dataset is updated. Using these features, you can set up an automated workflow and reduce the amount
of time you spend manually updating datasets and making predictions.
Note
You can only set up a maximum of 20 automatic configurations in your Canvas application.
Automations are only active while you’re logged in to the Canvas application. If you log out of
Canvas, your automatic jobs pause until you log back in.

The following sections describe how to view, edit, and delete configurations for existing automations. To
learn how to set up automations, see the following topics:

• To set up automatic dataset updates, see Update a dataset (p. 308).


• To set up automatic batch predictions, see Make batch predictions (p. 360).


View your automations


You can also view all of your auto update jobs by going to the left navigation pane of Canvas and
choosing Automations. The Automations page combines automations for both automatic dataset
updates and automatic batch predictions. From the Automations page, you can see the following tabs:

• All jobs – You can see every instance of a Dataset update or Batch prediction job that Canvas has
run. For each job, you can see fields such as the associated Input dataset, the Configuration name of
the associated auto update configuration, and the Status showing whether the job was successful or
not. You can filter the jobs by configuration name:
• For dataset update jobs, you can choose the latest version of the dataset, or the most recent job, to
preview the dataset.
• For batch prediction jobs, you can choose the More options icon to view or download the
predictions for that job.
• Configuration – You can see all of the Dataset update and Batch prediction configurations you’ve
created. For each configuration, you can see fields such as the associated Input dataset and the
Frequency of the jobs. You can also turn off or turn on the Auto update toggle to pause or resume
automatic updates. If you choose the More options icon for a specific configuration, you can
choose to View all jobs for the configuration, Update configuration, or Delete configuration.

Edit your automatic configurations


After setting up a configuration, you might want to make changes to it. For automatic dataset updates,
you can update the Amazon S3 location for Canvas to import data, the frequency of the updates, and
the starting time. For automatic batch predictions, you can change the dataset that the configuration
tracks for updates. You can also turn off the automation to temporarily pause updates until you choose
to resume them.

The following sections show you how to update each type of configuration.
Note
You can’t change the frequency for automatic batch predictions because automatic batch
predictions run every time the target dataset is updated.

Edit your automatic dataset update configuration

You might want to make changes to your auto update configuration for a dataset, such as changing the
frequency of the updates. You might also want to turn off your automatic update configuration to pause
the updates to your dataset.

To make changes to your auto update configuration for a dataset, do the following:

1. In the left navigation pane of Canvas, choose Automations.


2. Choose the Configuration tab.
3. For your auto update configuration, choose the More options icon.
4. In the dropdown menu, choose Update configuration. You are taken to the Auto updates tab of the
dataset.
5. Make your changes to the configuration. When you’re done making changes, choose Save.

To pause your dataset updates, turn off your automatic configuration. One way to turn off auto updates
is by doing the following:

1. In the left navigation pane of Canvas, choose Automations.


2. Choose the Configuration tab.
3. Find your configuration from the list and turn off the Auto update toggle.


Automatic updates for your dataset are now paused. You can turn this toggle back on at any time to
resume the update schedule.

Edit your automatic batch prediction configuration

When you edit a batch prediction configuration, you can change the target dataset but not the frequency
(since automatic batch predictions occur whenever the dataset is updated).

To make changes to your automatic batch predictions configuration, do the following:

1. In the left navigation pane of Canvas, choose Automations.


2. Choose the Configuration tab.
3. For your batch prediction configuration, choose the More options icon.
4. In the dropdown menu, choose Update configuration.
5. The Automate batch prediction dialog box opens. You can select another dataset and choose Set up
to save your changes.

Your automatic batch predictions configuration is now updated.

To pause your automatic batch predictions, turn off your automatic configuration. Use the following
procedure to turn off your configuration:

1. In the left navigation pane of Canvas, choose Automations.


2. Choose the Configuration tab.
3. Find your configuration from the list and turn off the Auto update toggle.

Automatic batch predictions for your dataset are now paused. You can turn this toggle back on at any
time to resume the update schedule.

Delete an automatic configuration


You might want to delete a configuration to stop your automated workflow in SageMaker Canvas.

To delete a configuration for automatic dataset updates or automatic batch predictions, do the
following:

1. In the left navigation pane of Canvas, choose Automations.


2. Choose the Configuration tab.
3. Find your auto update configuration, and choose the More options icon.
4. Choose Delete configuration.
5. In the dialog box that pops up, choose Delete.

Your auto update configuration is now deleted.

Collaborate with data scientists


Note
Collaboration on models with Studio users isn’t supported for single-label image prediction,
multi-category text prediction, or time series forecasting model types.

With Amazon SageMaker Canvas, business analysts using Canvas and data scientists using Amazon
SageMaker Studio can share ML models and collaborate with each other while working in their own
environments, sharing domain knowledge and providing expert input to improve models.


Using SageMaker Canvas collaboration, you can share Standard build models from Canvas with data
scientists in Studio to review, update, and share back with Canvas users. Users in Canvas can share one
version of a model with up to 23 Studio users.

The following sections describe the steps for collaboration:

• In the Canvas application, a business analyst shares their model with a Studio user.
• The Studio user receives the shared model in the Studio application. They can choose to share
feedback with the analyst, make updates to the model, or share an alternate model version.
• The business analyst receives the feedback or updated model in Canvas and can generate predictions
in view-only mode.

To collaborate, the Canvas user and Studio user must be in the same Amazon SageMaker Domain. For
more information about setting up your Domain and users, see the SageMaker Canvas Prerequisites.
Note
Model collaboration is different from Bring your own model to SageMaker Canvas (p. 384),
where you can bring a model that you’ve trained anywhere and import it into Canvas for
generating predictions.

Prerequisites
Before a Canvas user and Studio user can collaborate on models, the users' AWS Identity and Access
Management (IAM) roles must have permissions to share models. If you haven’t already set up
permissions, see Grant Users Permissions to Collaborate with Studio (p. 282).

The Canvas user must also have a Standard build model trained in Canvas and ready to share.
Note
Collaboration does not support Quick build models.

You should also have the user profile name of the Studio user with whom you want to collaborate. The
Studio user must be in the same Amazon SageMaker Domain as your Canvas user. You can find a user’s
profile name by using the following procedure:

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation panel, choose Domains.
3. From the list of Domains, choose your Domain. This opens the Domain details page, where you can
find all of the User profiles for the Domain.

Keep the user profile name ready for the first step of the following tutorial.

Canvas users: Share a model with Studio users


Within the Canvas application, share your model version with Studio users or request feedback from
them. You should use a model version that has been built; you can’t share a model version that is a draft
or currently building. You can only share one version per model.

To share your Canvas model with Studio users, use the following procedure.

1. Open the SageMaker Canvas application.


2. From the Models page, select the model that you want to share. You can only share Standard build
models.
3. In the header, choose Share.
4. In the Share Model dialog box, do the following:

a. From the Choose a model version to share dropdown list, select the model version for which
you want feedback.


b. From the SageMaker Studio users dropdown list, select Studio users by their profile names. You
can add up to 23 Studio users.
c. For the Add a note field, you can enter a quick note that accompanies your model when you
send it to the Studio users.
d. Choose Share.
e. In the Share Model confirmation box that appears, choose Share.

You have now shared your model with the Studio users, and the users receive a notification in Studio that
a model has been shared with them.

Studio users: Receive a model in Studio from Canvas users


In Studio, if a model has been shared with you, you receive a notification similar to the following when
you open the Studio application.

Choose View shared models to open the Shared models and notebooks page in Studio. If you miss the
notification, you can find the Shared models and notebooks page by doing the following:

1. Open your Amazon SageMaker Studio application.


2. In the side navigation pane, choose the Home icon.
3. In the side navigation bar that opens, choose Models.
4. In the dropdown list, choose Shared models to open the Shared models and notebooks page.

On the Shared models and notebooks page, select the filter Shared with me. You should see the
Canvas model that has been shared with you in the list of shared models. Choose View model on the
shared model, which opens the model details page in Autopilot. The opened model should have a banner
at the top that looks similar to the following screenshot.

From this page, you can see the model details, as well as any notes about the model shared with you by
the Canvas user. In the Canvas banner at the top, you can choose the following actions:

• Share feedback with the Canvas user.


• Make updates to the shared model and share the updates with the Canvas user.
• Share an alternate version of the model with the Canvas user. Canvas uses Autopilot to train multiple
versions of the model and select the best version. You can select a different version if you decide that
it’s better for your use case.

For more information on the preceding actions, see the following sections.

Share feedback

You might want to send a comment or feedback to the Canvas user without making any changes to the
model.

To share feedback on the shared model, use the following procedure:


1. On the model details page, choose Share feedback.


2. In the Share feedback dialog box, add a note in the Add feedback field.
3. Choose Share to send the feedback to the Canvas user.

After giving feedback, you can view the feedback you sent in the Canvas banner at the top of the model
details page. The Canvas user receives the feedback in the Canvas application and can make changes
based on your feedback.

Share an updated model with the Canvas user

You might want to make changes to the model that the Canvas user shared with you. For example, you
might want to use advanced data transformations such as one-hot encoding to improve the accuracy of
the model. You can update the model with Amazon SageMaker Data Wrangler and Amazon SageMaker
Autopilot in Studio, which are features that help you make data transformations and train your model.
Warning
If you exit the following workflow at any time, your model updates are not saved, and you must
restart the workflow.

To update the model and send the updated model to the Canvas user, use the following procedure:

1. On the model details page, in the Canvas banner, choose Update model.
2. In the banner’s dropdown list, choose Update data transformations.

3. The workflow opens your model in Amazon SageMaker Data Wrangler, where you can choose to edit
the data transformations used for the model. Make your data transformations in the Data Wrangler
interface. For more information about Data Wrangler and the data transformations you can use, see
the Data Wrangler documentation.
4. After you’ve finished your data transformations, choose Retrain model on the Canvas banner to
open the Export data and train a model with SageMaker Autopilot page in the Data Wrangler
interface.
5. Verify the fields on the Export data and train a model with SageMaker Autopilot page, and then
choose Export and train to export your data transformations to Amazon SageMaker Autopilot.
6. The workflow opens the Create an Autopilot experiment page in Autopilot, where you can create
an Autopilot experiment and retrain the model with the updated data transformations. Fill out the
fields for each of the Create an Autopilot experiment pages.

For more information about Autopilot and Autopilot experiments, see Create an experiment in the
Autopilot documentation.
7. After you’ve finished configuring your Autopilot experiment and reviewed the final settings, choose
Create experiment in the Autopilot interface to begin training the model. The model trains, during
which you can choose Stop training in the Autopilot interface at any time.
8. After the model has trained, the Canvas banner at the top of the page compares the metrics of the
old model with the updated model. The Best model summary lists the metrics, such as Recall and
Precision, and whether the new model improves the metrics or not. Review the metrics and decide
whether you would like to share the updated model or not. For more information about Autopilot
metrics, see Metrics and validation.
9. If you decide that you want to share the updated model with the Canvas user, choose Share in the
banner.


10. In the Share dialog box, do the following:

a. For the Select a model to share dropdown list, the best model from your Autopilot experiment
should already be selected and marked with a label Best Candidate. If the model version that
you want to share is not selected, open the dropdown and select the correct version.
b. For the Add feedback field, you can enter a note for the Canvas user.
c. Choose Share to share the updated model and note with the Canvas user.

After sharing the model, you receive a notification, similar to the following screenshot, that your model
was shared successfully.

You can choose View shared models in the banner to return to the Shared models and notebooks page.
From this page, you can see the updated model that you shared with the Canvas user under the Shared
by me label.

Share an alternate model with the Canvas user

When SageMaker Canvas builds a model, Amazon SageMaker Autopilot trains multiple versions of
the model and selects the best one. You might decide that an alternate version of the model is better
according to your needs. You can share an alternate Autopilot version of the model with the Canvas user
instead of making changes to the one they sent. For more information about Autopilot, see the Autopilot
documentation.

To share an alternate model, use the following procedure:

1. On the model details page, in the Canvas banner, choose Update model.
2. In the banner’s dropdown list, choose Recommend an alternate Auto ML candidate.
3. The page for the Autopilot job opens where you can review all of the trained model versions. When
you're ready to share an alternate version, in the Canvas banner at the top of the page, choose
Share.
4. In the Share dialog box, do the following:

a. For the Select a model to share dropdown list, the best model from the Autopilot experiment
is selected and marked with the label Best Candidate. Open the dropdown and select the
alternate model version that you want to share.
b. For the Add feedback field, you can enter a note for the Canvas user.
c. Choose Share to share the alternate model version and note with the Canvas user.

After sharing the model, you receive a notification, similar to the following screenshot, that your
alternate model was shared successfully.

You can choose View shared models in the banner to return to the Shared models and notebooks page.
From this page, you can see the updated model that you shared with the Canvas user under the Shared
by me label.


Canvas users: Receive model updates from a Studio user


When a Studio user shares an updated or alternate model with the Canvas user, the Canvas user receives
a notification.

In the Canvas app, the notification looks like the following screenshot.

You can choose View update to see the updated model, or you can go to the Models page in the Canvas
application and select the shared model to view it.
Note
Canvas users can’t edit a model that has been shared with them by a Studio user. Models
imported from Studio are view and predict only.

A model on which a Studio user has collaborated looks like the following card on the Models page.


The model import from Studio can take up to 20 minutes, during which the model shows as Importing.


After importing the model, you can view its metrics and generate predictions with it.

The following screenshot shows the Analyze tab, where you can evaluate the model accuracy
and metrics. For more information, see Evaluate Your Model's Performance in Amazon SageMaker
Canvas (p. 351).

The following screenshot shows the Predict tab, where you can generate predictions with the model. For
more information on generating predictions in Canvas, see Make predictions for your data (p. 358).

On both the Analyze and Predict tabs, you can see the Shared History panel, which shows you the
model versions and comments shared with you by Studio users.

Bring your own model to SageMaker Canvas


Note
You can share models trained with tabular, text, and image data to Canvas. You can't share time
series models. Also, Canvas bring your own model (BYOM) only supports CPU-based models (or
models that use CPU instances to make predictions).

Business analysts can benefit from ML models already built by data scientists to solve business problems
instead of creating a new model in Amazon SageMaker Canvas. However, it might be difficult to use
these models outside the environments in which they are built due to technical requirements, rigidity of
tools, and manual processes to import models. This often forces users to rebuild ML models, resulting in
the duplication of effort and additional time and resources.


SageMaker Canvas removes these limitations so you can generate predictions in Canvas with models that
you’ve trained anywhere. You can register ML models in SageMaker Model Registry, which is a metadata
store for ML models, and import them into SageMaker Canvas. Additionally, you can generate predictions
with models that data scientists have trained in Amazon SageMaker Autopilot or SageMaker JumpStart.
Canvas users can then analyze and generate predictions from any model that has been shared with them.

After you’ve satisfied the Prerequisites (p. 385), see the following sections for instructions on how to
bring your own models into Canvas and generate predictions. The workflow begins in Studio, where a
Studio user shares a model with a Canvas user. Then, the Canvas user signs in to their Canvas app to
receive the shared model and generate predictions with it.
Important
You can only share models trained with tabular data. Also, you can't share time series models.

Prerequisites
To bring your model into SageMaker Canvas, complete the following prerequisites:

• You must have an Amazon SageMaker Studio user who has onboarded to an Amazon SageMaker Domain.
The Studio user must be in the same Domain as the Canvas user. Model sharing occurs when a Studio
user shares a model with a Canvas user from within Studio. If you don’t already have a Studio user set
up, see the Studio documentation and Onboard to Amazon SageMaker Domain.
• You must have a trained model from SageMaker Autopilot, SageMaker JumpStart, or SageMaker
Model Registry. For any model that you’ve built outside of SageMaker, you must register your model
in Model Registry before importing it into Canvas. For more information, see the Model Registry
documentation.
• The Canvas user with whom you want to share your model must have permission to access the Amazon
S3 bucket in which you store your datasets and model artifacts. For instructions on how admins
can give Canvas users the permissions they need, see Grant Users Permissions to Collaborate with
Studio (p. 282).
• You should also have the user profile name of the Canvas user with whom you want to collaborate.
The Canvas user must be in the same Amazon SageMaker Domain as your Studio user. You can find a
user’s profile name by using the following procedure:

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation panel, choose Domains.
3. From the list of Domains, choose your Domain. This opens the Domain details page, where you
can find all of the User profiles for the Domain.

Keep the user profile name ready for the first step of the following tutorial.

If your SageMaker Canvas app is running in a private customer VPC, any Autopilot models shared from
Studio must use Autopilot HPO mode to support generating predictions in Canvas. For more information
about HPO mode, see Training modes and algorithm support in the Autopilot documentation.
Note
If you want feedback from data scientists on a model built inside Canvas, see Collaborate with
data scientists (p. 377), where a Canvas user shares a model with a Studio user, and the Studio
user shares feedback or model updates.

Studio users: Share a model to SageMaker Canvas


You should have a model trained with tabular data that you’re ready to share with Canvas users. See the
following sections for information on how to share your models from features within Studio.


Autopilot

You can share a model to Canvas from Amazon SageMaker Autopilot in Studio. Autopilot is a feature that
enables you to train and deploy your models in SageMaker.

You need to have a Studio user and a trained model ready to share from Autopilot. For more information
on how to set up Studio, see the Studio documentation. For more information about Autopilot, see the
Autopilot documentation.

To share a model from Autopilot to Canvas, use the following procedure.

1. Open your Amazon SageMaker Studio application.


2. In the side navigation pane, choose the Home icon.
3. In the side navigation bar of Studio, choose AutoML to open Autopilot.
4. On the Autopilot page, select the Autopilot model that you want to share with the Canvas user. You
can only share one model at a time.
5. From the Autopilot job details page, in the Models tab, select the model version that you want to
share.
6. Choose Share.
7. In the Share model dialog box, do the following:

a. For the Add Canvas users field, enter the Canvas user’s profile name. You can enter up to 23
Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it, you can't
enter the profile name.
b. For the Add a note field, add a description or note for the Canvas user when they receive the
model.
c. Choose Share to share the model.

You have now shared the model with the Canvas user.

JumpStart

You can share a model to Canvas from SageMaker JumpStart in Studio. With JumpStart, you can access
and tune pretrained models before deploying them.

You need to have a Studio user and a successfully completed training job in JumpStart. For more
information about how to set up Studio, see the Studio documentation. For more information about
JumpStart, see the JumpStart documentation.

To share a model from JumpStart to Canvas, use the following procedure.

1. Open your Amazon SageMaker Studio application.


2. In the side navigation pane, choose the Home icon.
3. In the side navigation bar that opens, choose SageMaker JumpStart.
4. Choose Launched JumpStart assets to open the page that lists your JumpStart training jobs,
models, and endpoints.
5. Choose the Training jobs tab to view the list of your model training jobs.
6. From the Training jobs list, select the training job that you want to share with the Canvas user. You
can only share one job at a time. This opens the training job details page.
7. In the header for the training job, choose Share, and select Share to Canvas.


Note
You can only share tabular models to Canvas. Trying to share a model that is not tabular
throws an Unsupported data type error.
8. In the Share to Canvas dialog box, do the following:

a. For the Add Canvas users to share field, enter the Canvas user’s profile name. You can enter up
to 23 Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it,
you can't enter the profile name.
b. For the Add a note field, add a description or note for the Canvas user when they receive the
model.
c. Choose Share to share the model.

You have now shared the model with the Canvas user.

Model Registry

You can share a model to Canvas from SageMaker Model Registry in Studio. With Model Registry, you can
register models that you bring from outside of SageMaker and integrate them with your ML pipelines.

You need to have a Studio user and a model version saved in the Model Registry. For more information
about how to set up Studio, see the Studio documentation. If you don’t have a model version in the
Model Registry, create a model group and register a version to it. For more information about Model
Registry, see the Model Registry documentation.

To share a model version from Model Registry to Canvas, use the following procedure.

1. Open your Amazon SageMaker Studio application.


2. In the side navigation pane, choose the Home icon.
3. In the side navigation bar that opens, choose Models.
4. Select Model Registry from the dropdown list to open the Model Registry page and show all of the
model groups registered in your account.
5. Choose the model group that has the model version that you want to share.
6. You can share a model version either from the model group page or the model version page.

• To share a model version from the model group page, complete the following steps:

1. Choose Versions, and check the box next to the model version you want to share with the
Canvas user. You can only share one model version at a time.
2. In the Actions dropdown menu, choose Share model artifacts.
• To share a model version from the model version page, complete the following steps:

1. Choose Versions, and select the name of the model version you want to share with the
Canvas user. You can only share one model version at a time.
2. In the Actions dropdown menu, choose Share model artifacts.
7. In the Share model dialog box, do the following:

a. For the Add Canvas users to share field, enter the Canvas user’s profile name. You can enter up
to 23 Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it,
you can't enter the profile name.
b. For Add model details, do the following:

i. For the Training dataset field, enter the Amazon S3 path for your training dataset.
ii. For the Validation dataset field, enter the Amazon S3 path for your validation dataset.


iii. For Target column, either select Use the first column if the first column in your dataset
is the target, or select Specify the target column name to set the target as a different
column in your dataset.
iv. For Column headers, select one of the following options:

A. Select Use the first row if the first row of your dataset contains the column headers.
B. Select Specify a different dataset in S3 for column headers if you have a file stored
in Amazon S3 containing headers that can be mapped to your dataset. The headers file
must have the same number of columns as your dataset.
C. Select Automatically generate if you don’t already have column headers and would
like SageMaker to generate generic column names for your dataset.
v. From the Problem type dropdown list, select your model type.
vi. If you selected the Binary classification or Multi-class problem types, the Configure model
outputs option appears.

If you already have a file stored in Amazon S3 that maps default target column class names
to your desired class names, then turn on Model output names and enter the Amazon
S3 path to the mapping file. If you don't have a mapping file, then turn off Model output
names and manually enter the Number of model outputs (the number of target column
classes in your data). Then, enter your desired class names to replace the default class
names.
c. (Optional) For the Add a note field, add a description or note for the Canvas user when they
receive the model.
d. Choose Share to share the model version.

You have now shared the model with the Canvas user.

Shared models and notebooks

On the Shared models and notebooks page in Amazon SageMaker Studio, you can view the models
that you've shared and that have been shared with you. This page gives you a central place to view and
manage all of your models in Studio.

You need to have a Studio user and a model ready to share from Autopilot, JumpStart, or Model Registry.
For more information on how to set up Studio, see the Studio documentation. For more information
about the Shared models and notebooks page, see the Shared models and notebooks documentation.

The following example walks you through sharing an Amazon SageMaker Autopilot model, but you can
use the sharing feature on the Shared models and notebooks page to share models from any of the
other features in the previous sections, such as JumpStart and Model Registry.

To share an Autopilot model from the Shared models and notebooks page, use the following procedure.

1. Open your Amazon SageMaker Studio application.


2. In the side navigation pane, choose the Home icon.
3. In the side navigation bar of Studio, choose Models.
4. In the dropdown list, choose Shared models to open the Shared models and notebooks page.
5. Choose the filter icon, and in the Shared from dropdown list, choose Autopilot.
6. Select the Autopilot model from the list that you want to share with the Canvas user. You can only
share one model at a time. Alternatively, you can select the model to open the model details page.
7. From either the Autopilot jobs page or the model details page, choose Share.
8. In the Share model dialog box, do the following:


a. For the Add Canvas users to share field, enter the Canvas user’s profile name. You can enter up
to 23 Canvas users. If a user profile you specify doesn’t have a Canvas app associated with it,
you can't enter the profile name.
b. For the Add a note field, add a description or note for the Canvas user when they receive the
model.
c. Choose Share to share the model.

You have now shared the model with the Canvas user.

After you share the model, you receive a notification popup in Studio similar to the following screenshot.

You can choose View model to open the Shared models and notebooks page in Studio. You can also
view your shared models at any time from the Shared models and notebooks page.

From this page, you can see the models that you’ve shared with the Canvas user under the Shared by me
label, as shown in the following screenshot.

Models that you’ve shared to Canvas have text on the card similar to the following example: Shared
to: 12 Canvas users.


Canvas users: Receive a shared model in SageMaker Canvas


When a Studio user shares a model with a Canvas user, you receive a notification within the Canvas
application that a Studio user has shared a model with you.

In the Canvas application, the notification is similar to the following screenshot.

You can choose View update to see the shared model, or you can go to the Models page in the Canvas
application to discover all of the models that have been shared with you.
Note
Canvas users can’t edit a model that has been shared with them by a Studio user. Models
imported from Studio are view and predict only.

A model that has been shared by a Studio user looks like the following card on the Models page. This
is different from Collaborate with data scientists (p. 377), where a Canvas user shares a model and a
Studio user shares updates or feedback with the Canvas user.


The model import from Studio can take up to 20 minutes, during which the model shows as Importing.

After importing the model, you can view its metrics and generate predictions with it. SageMaker Canvas
uses Amazon SageMaker Serverless Inference resources to generate model analysis and predictions for
shared models. You might see costs associated with Serverless Inference in your AWS account.

The following screenshot shows the Analyze tab in the Canvas application for a shared model, where
you can evaluate the model accuracy and metrics. For more information, see Evaluate Your Model's
Performance in Amazon SageMaker Canvas (p. 351).

The following screenshot shows the Predict tab, where you can generate predictions with the model. For
more information on generating predictions in Canvas, see Make predictions for your data (p. 358).

On both the Analyze and Predict tabs, you can see the Shared History panel, which shows you the
model versions and comments shared with you by Studio users.

Logging out of Amazon SageMaker Canvas


If you're not using Amazon SageMaker Canvas, you can log out. A workspace instance is dedicated for
your use as soon as you launch SageMaker Canvas from the console. Logging out ends the workspace
instance. You are only billed for the duration you are logged in.


When you log out, your models and datasets aren't affected, but SageMaker Canvas cancels any Quick
build tasks. If you log out of SageMaker Canvas while running a Quick build, your build might be
interrupted until you log back in. When you log back in, SageMaker Canvas automatically restarts the
build.

To log out, choose the Log out button on the left panel of the SageMaker Canvas app.

You can also log out from the SageMaker Canvas app by closing your browser tab and then deleting the
app (p. 284) in the console.

After you log out, SageMaker Canvas tells you to relaunch in a different tab. Logging in takes between
3 minutes and 8 minutes. If you have an administrator who set up SageMaker Canvas for you, use the
instructions they gave you to log back in. If you don't have an administrator, see the procedure for accessing
SageMaker Canvas in Prerequisites for setting up Amazon SageMaker Canvas (p. 260).

Limitations and troubleshooting


The following section outlines troubleshooting help and limitations that apply when using Amazon
SageMaker Canvas. You can use this topic to help troubleshoot any issues you encounter.

Troubleshooting issues with granting permissions through the SageMaker console
If you’re having trouble granting Canvas base permissions or Ready-to-use models permissions to your
user, your user might have an AWS IAM execution role with more than one trust relationship to other
AWS services. A trust relationship is a policy attached to your role that defines which principals (users,
roles, accounts, or services) can assume the role. For example, you might encounter an issue granting
additional Canvas permissions to your user if their execution role has a trust relationship to both Amazon
SageMaker and Amazon Forecast.

You can fix this problem by choosing one of the following options.

1. Remove all but one trusted service from the role.


This solution requires you to edit the trust relationship for your user profile’s IAM role and remove all
AWS services except SageMaker.

To edit the trust relationship for your IAM execution role, do the following:

1. Go to the IAM console at https://console.aws.amazon.com/iam/.


2. In the navigation pane of the IAM console, choose Roles. The console displays the roles for your
account.
3. Choose the name of the role that you want to modify, and select the Trust relationships tab on the
details page.
4. Choose Edit trust policy.
5. In the Edit trust policy editor, paste the following, and then choose Update Policy.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"sagemaker.amazonaws.com"
]
},
"Action": "sts:AssumeRole"


}
]
}

You can also update this policy document using the AWS CLI. For more information, see
update-assume-role-policy in the AWS CLI Command Reference.
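
If you prefer to make this change programmatically, the following is a minimal sketch using the AWS SDK
for Python (Boto3); the role name is a placeholder for your user profile's execution role.

import json

import boto3

iam = boto3.client("iam")

# Trust policy that allows only SageMaker to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.update_assume_role_policy(
    RoleName="MySageMakerExecutionRole",  # Replace with your execution role name.
    PolicyDocument=json.dumps(trust_policy),
)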

You can now retry granting the Canvas base permissions or the Ready-to-use models permissions to your
user.

2. Use a different role with at most one trusted service.


This solution requires you to specify a different IAM role for your user profile. Use this option if you
already have an IAM role that you can substitute.

To specify a different execution role for your user, do the following:

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. On the left navigation pane, choose Domains.
3. From the list of Domains, select the Domain that you want to view a list of user profiles for.
4. On the Domain details page, choose the User profiles tab.
5. Choose the user whose permissions you want to edit. On the User details page, choose Edit.
6. On the General settings page, choose the Execution role dropdown list and select the role that you
want to use.
7. Choose Submit to save your changes to the user profile.

Your user should now be using an execution role with only one trusted service (SageMaker).

You can retry granting the Canvas base permissions or the Ready-to-use models permissions to your
user.

3. Manually attach the AWS managed policy to the execution role instead of
using the toggle in the SageMaker Domain settings.
Instead of using the toggle in the Domain or user profile settings, you can manually attach the AWS
managed policies that grant a user the correct permissions.

To grant a user Canvas base permissions, attach the AmazonSageMakerCanvasFullAccess policy. To grant
a user Ready-to-use models permissions, attach the AmazonSageMakerCanvasAIServicesAccess policy.

Use the following procedure to attach an AWS managed policy to your role:

1. Go to the IAM console at https://console.aws.amazon.com/iam/.


2. Choose Roles.
3. In the search box, search for the user's IAM role by name and select it.
4. On the page for the user's role, under Permissions, choose Add permissions.
5. From the dropdown menu, choose Attach policies.
6. Search for and select the policy or policies that you want to attach to the user’s execution role:

a. To grant the Canvas base permissions, search for and select the
AmazonSageMakerCanvasFullAccess policy.
b. To grant the Ready-to-use models permissions, search for and select the
AmazonSageMakerCanvasAIServicesAccess policy.
7. Choose Add permissions to attach the policy to the role.
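
If you prefer to script these steps, the following is a minimal Boto3 sketch that attaches the same AWS
managed policies; the role name is a placeholder.

import boto3

iam = boto3.client("iam")
role_name = "MySageMakerExecutionRole"  # Replace with the user's execution role name.

# Grant the Canvas base permissions.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerCanvasFullAccess",
)

# Grant the Ready-to-use models permissions.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerCanvasAIServicesAccess",
)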


After attaching an AWS managed policy to the user’s role through the IAM console, your user should now
have the Canvas base permissions or Ready-to-use models permissions.

Limitations for collaboration


The following general limitations apply when you are collaborating with data scientists in Amazon
SageMaker Studio.

• You can only share successfully trained models from Canvas to Studio. Similarly, you can only share
models that have been successfully trained in Studio back to Canvas.
• You can’t share Quick build models from Canvas to Studio. You can only share Standard build models.
• You can only share one version of a Standard build model trained in Canvas. You can train additional
versions of your model within Canvas, but you can't share them to Studio.
• From Studio, you can only share feedback or share an updated model with Canvas. You can’t perform
both actions at the same time.
• The length limitation for comments shared from Studio to Canvas and Canvas to Studio is 1024
characters.
• You can only share your Canvas or Studio models with a different user profile. You can’t share models
between Canvas and Studio within your own user profile.
• You can't share from a Canvas user to a Canvas user, or from a Studio user to a Studio user.

There are also limitations that apply depending on the type of model you want to share. See the
following sections for limitations on time series forecasting models and numeric and categorical
prediction models.

Limitations for collaborating on time series forecasting models


The following limitations apply when you are collaborating on time series forecasting models between
Canvas and Studio.

• You can’t make predictions with time series forecasting models in Studio through an automated Share
button. However, you can create a Jupyter notebook and write your own code.
• For time series forecasting models, you can’t change the model recipe or data transformations in
Studio. You can only make the following updates to time series forecasting models in Studio:
• You can update the length of the forecast horizon.
• You can update the item's metadata field, which groups your data by a certain column.
• You can update other dimension fields, such as specifying a holiday schedule.

Limitations for collaborating on numeric and categorical prediction models


The following limitations apply when you are collaborating on numeric and categorical prediction model
types between Canvas and Studio.

• When updating or training models in Studio, if you close the tab with the collaboration banner at
the top, it ends the share model workflow and you lose your progress. In that case, you must restart
the share model workflow from the Shared With Me section on the Shared Models page. For more
information, see Collaborate with data scientists.
• When updating models in Studio, you can’t change the target column if you want to share the model
updates back to Canvas. If you want to change the target column and re-train the model, train the
model and then use the Share button to share to Canvas. For more information about sharing a new
model to Canvas, see Bring your own model to SageMaker Canvas.
• When updating models in the Amazon SageMaker Data Wrangler Recipe interface in Studio, there are
limits to which changes a Studio user can apply that Canvas supports:


• You can only share a model to Canvas that has been trained from the last node in a Data Wrangler
linear data flow.
• Only transformation nodes are supported.
• You can’t perform operations on the Target column.
• You can’t update the data type of columns.
• You can’t update the data source or add a new data source.
• When sharing an alternative candidate to Canvas from the Studio Autopilot page, you can’t select the
model from the leaderboard. You must choose the shared model from the banner and then select an
alternative from the list. For more information, see Share an alternate model with the Canvas user in
the Canvas documentation.
• Only models that are compatible with SageMaker Neo can be shared back to Canvas successfully.
Compatible models are Autopilot models that use XGBoost or MLP algorithms. Incompatible models
include Autopilot models that use the linear learner algorithm.
• For custom formula transforms using Spark SQL, Canvas only supports Unary operations, Aggregate
functions, the String concatenation operation and the Power operation. Other operations are not
supported.

Limitations for bring your own model (BYOM)


The following general limitations apply when you want to bring your own model to SageMaker Canvas.

• When a model is shared from Studio to Canvas, the Canvas user cannot update or view details on the
dataset that was used to build the model.
• When a Canvas user wants to run a single prediction on an imported model, there are no data type
restrictions when updating column values. You must manually make sure that when you update values
for single predictions, you match the data type of the existing values.
• When a Canvas user wants to run batch predictions on an imported model, Canvas assumes that you
(the Canvas user) know what the expected input dataset should look like. You should have a dataset
with columns and data types that match the dataset that was used to train the model. If not, consult
with the user who shared the model with you and import a dataset that you can use for running batch
predictions.
• The Canvas application internally uses a serverless endpoint to run predictions and generate model
metrics. The model shared to Canvas must be compatible with serverless endpoints:
• The maximum memory size is 6144 MB.
• When configuring the inference input response keys in your container, use the following
configuration:

INFERENCE_INPUT_RESPONSE_KEYS = {
"BINARY": ["predicted_label", "probability"],
"MULTI_CLASS": ["predicted_label", "probability", "probabilities", "labels"],
}

• You can choose either a SageMaker-provided inference container or bring your own inference
container image to be used for the endpoint. SageMaker provides containers for its built-in algorithms and
prebuilt Docker images for some of the most common machine learning frameworks. If you are
bringing your own container, you must modify it to work with SageMaker. For more information
about bringing your own container, see Adapting Your Own Inference Container.
• The Feature exclusions for serverless endpoints also apply.
• To share a model from Studio to Canvas successfully, Canvas accepts model inference outputs in the
following formats:

TEXT/CSV


• Regression: The model inference response should be a byte string where each of the output
predictions are separated by \n:

b'-0.0007884334772825241\n-0.015136942267417908\n0.050063662230968475\n0.02891816757619381\n'

• Classification: The model inference response should be a byte string where each of
predicted_label, predicted_probability, probabilities, and labels are separated by
\n. The following example is for binary classification:

b'no,0.9967488050460815,"[0.9967488050460815, 0.003251201706007123]","[\'no\', \'yes\']"\nno,0.9999420642852783,"[0.9999420642852783, 5.793538366560824e-05]","[\'no\', \'yes\']"\nno,0.9999846816062927,"[0.9999846816062927, 1.5326571883633733e-05]","[\'no\', \'yes\']"\nno,0.9999727606773376,"[0.9999727606773376, 2.7267418772680685e-05]","[\'no\', \'yes\']"\n'

The following example is for multi-class classification:

b'Iris-setosa,1.0,"[1.0, 0.0, 0.0]","[\'Iris-setosa\', \'Iris-versicolor\', \'Iris-virginica\']"\nIris-setosa,1.0,"[1.0, 0.0, 0.0]","[\'Iris-setosa\', \'Iris-versicolor\', \'Iris-virginica\']"\nIris-setosa,1.0,"[1.0, 0.0, 0.0]","[\'Iris-setosa\', \'Iris-versicolor\', \'Iris-virginica\']"\nIris-setosa,1.0,"[1.0, 0.0, 0.0]","[\'Iris-setosa\', \'Iris-versicolor\', \'Iris-virginica\']"\n'

APPLICATION/JSON
• Regression: The model inference response should be a JSON string that contains the predictions
key, and its value should be the list of output predictions:

let response = {
    "predictions": [
        // First instance prediction.
        1.75,
        // Second instance prediction.
        3.25
    ]
}

• Classification: The model inference response should be a JSON string which contains the
probabilities key, and its value should be the list of probabilities.

The following example is for binary classification:

let response = {
    "probabilities": [
        // First instance prediction.
        [0.9, 0.1],
        // Second instance prediction.
        [0.2, 0.8]
    ]
}

The following example is for multi-class classification:

let response = {
    "probabilities": [
        // First instance prediction.
        [0.7, 0.2, 0.1],
        // Second instance prediction.
        [0.2, 0.5, 0.3]
    ]
}
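
If you are adapting your own inference container, the following hypothetical Python helpers sketch how
to serialize predictions into the two response formats above; the function names are illustrative, not
part of any SageMaker API.

import json

def format_csv_regression(predictions):
    # One prediction per record, records separated by \n, returned as a byte string.
    return ("\n".join(str(p) for p in predictions) + "\n").encode("utf-8")

def format_json_classification(probabilities):
    # probabilities: one list of per-class probabilities per instance.
    return json.dumps({"probabilities": probabilities})

print(format_csv_regression([1.75, 3.25]))                   # b'1.75\n3.25\n'
print(format_json_classification([[0.9, 0.1], [0.2, 0.8]]))  # {"probabilities": [[0.9, 0.1], [0.2, 0.8]]}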

There are also limitations that apply depending on the type of model you want to bring:

Bring your own model from SageMaker JumpStart


Review the following information and limits when sharing a SageMaker JumpStart model with Canvas.

• The following are the supported algorithms for which you can import models into Canvas. For more
details, see the SageMaker JumpStart documentation.
• Tabular classification: LightGBM, CatBoost, XGBoost, AutoGluon-Tabular, TabTransformer, Linear
Learner
• Tabular regression: LightGBM, CatBoost, XGBoost, AutoGluon-Tabular, TabTransformer, Linear
Learner
• In SageMaker JumpStart, the Share button is only turned on if the model is ready to share to Canvas.
If your trained model does not have a Share to SageMaker Canvas button, your model is not supported
for BYOM.
• You must provide training and validation datasets when training the SageMaker JumpStart model. The
datasets should be stored in Amazon S3, and your Studio and Canvas users' execution role must have
access to the Amazon S3 location. You can use the same Amazon S3 URIs to share the training and
validation datasets with Canvas, or you can share different datasets with the same data schema.

Your training or validation data file should look like the following (in CSV format). The first column
of your files should be the target.

3 1 22 1 1 0 4 4
0 0 38 0 0 1 3 4
1 0 67 0 1 0 1 6
1 0 67 0 0 2 2 6
0 0 40 0 0 2 6 6
2 0 56 1 0 1 2 6

• By default, SageMaker JumpStart uses the first column of the training and validation datasets as the
target when training a model. The target column (or by default, the first column) of the datasets is
shared to Canvas.
• You must provide the column headers of the training and validation datasets when training the
SageMaker JumpStart model. By default, SageMaker JumpStart only accepts datasets without column
headers, so you must add the column headers as a file while training your model. The Amazon S3 URI
for the column headers file is shared to Canvas as well. Your column headers file should look like the
following example (in CSV format). The first column should be the target.

Segmentation EverMarried Age Graduated WorkExperience SpendingScore FamilySize Var1

• The training job in SageMaker JumpStart must be Complete before you can share with Canvas.
• For classification problems (or categorical prediction in Canvas), original class names need to be
provided in the Configure model output section when sharing to Canvas. The order of the class names
must match the indexing used in the model. Your mapping relation file should look like the following
example in CSV format, where index 0 (the first index) is mapped to the class name A:

A B C D

When the Canvas user views the model metrics in the Canvas application, they can only see the index
of each class (0, 1, 2). However, the user can see the class names when viewing the results for a single
prediction.
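
To illustrate the dataset layout described above, the following is a minimal pandas sketch; the file
names and the target column name are hypothetical. It moves the target to the first column, writes the
data without headers, and writes the column headers to a separate file.

import pandas as pd

df = pd.read_csv("customers.csv")  # Hypothetical dataset with a header row.

# Move the target column to the front, as JumpStart expects.
target = "Segmentation"  # Hypothetical target column name.
df = df[[target] + [c for c in df.columns if c != target]]

# Write the training data without column headers.
df.to_csv("train.csv", index=False, header=False)

# Write the column headers (target first) as a separate one-row CSV file.
pd.DataFrame([df.columns.tolist()]).to_csv("headers.csv", index=False, header=False)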


Bring your own model from Autopilot


Review the following information and limits when sharing a model from Autopilot to Canvas.

• You can only share models to Canvas that you’ve successfully trained from an AutoML job with
Ensembling, HPO, or Auto mode (for Auto mode, Autopilot chooses Ensembling or HPO mode based
on the training dataset size). The currently supported Autopilot problem types are Regression, Multi-class
classification, and Binary classification.
• For each Autopilot job, you can choose any model (the Best model or any other candidates) to share to
Canvas one at a time. You only need to choose the Share model button and then specify the Canvas
users with whom you’d like to share the model and a note.
• AutoGluon-Tabular models that use Data Wrangler transformers for inference cannot be shared to
Canvas. This is because Data Wrangler transformers cause the model to use more than one container.
• HPO models that aren’t compatible with SageMaker Neo can’t be shared to Canvas successfully.
Compatible models are Autopilot models that use XGBoost or MLP algorithms. Incompatible models
include Autopilot models that use the linear learner algorithm.

Bring your own model from Model Registry


Review the following information and limits when sharing a model from Model Registry to Canvas.

• Unlike the Share button provided by SageMaker JumpStart, Model Registry doesn’t provide model
validation, so it’s possible that a registered model shared successfully from Studio can fail while
importing to Canvas due to model incompatibility. Review the following tips before sharing to Canvas
from Model Registry:
• Use a single inference container for your model. You can register models with multiple containers
within the AdditionalInferenceSpecifications field, but Canvas is only optimized for one inference
container per model. For example, when you use an inference pipeline and register multiple
containers in the AdditionalInferenceSpecifications field with multiple data preprocessing
containers and an inference container, by default the first container is selected for model inference
in Canvas. Evaluate if this works for your use case if you're using machine learning pipelines.
• Use a SageMaker built-in tabular algorithm with compatible inference formats. Tested sample
algorithms with compatible inference outputs are Autogluon-Tabular, CatBoost, LightGBM,
TabTransformer and XGBoost. Algorithms like Factorization Machines don't accept CSV as file input,
and the inference output formats for algorithms like Linear Learner and K-NN are not supported by
Canvas.
• You can also bring your own image container and share to Canvas, or modify pre-built SageMaker
containers.
• If you are bringing your own container, you must modify it to work with SageMaker. For more
information about bringing your own container, see Adapting Your Own Inference Container.
• For detailed formatting for your inference output formats, see Limitations for bring your own
model (BYOM) (p. 396).
• When registering your model in a model package group, remember to provide the following attributes
with your inference container (a code sketch follows this list):
• Environment:

"{\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"20\", \"SAGEMAKER_PROGRAM\": \"inference.py\",


\"SAGEMAKER_REGION\": \"us-west-2\", \"SAGEMAKER_SUBMIT_DIRECTORY\": \"/opt/ml/model/
code\"}"

• Image:

"<account-id>.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1"

• ModelDataUrl:

"s3://sagemaker-us-west-2-<account-id>/model-regression-abalone-2022-10-14-23-02-45/model.tar.gz"

• You must provide training and validation datasets when sharing the model from Model Registry to
Canvas. The datasets should be stored in Amazon S3, and the Studio and Canvas users' execution
role must have access to the Amazon S3 location. You can use the same Amazon S3 URIs to share
the training and validation datasets with Canvas, or you can share different datasets with the same
data schema. The datasets must have the exact input formatting that feeds your model’s inference
container.
• You must provide the target column to Canvas, or the first column of your training/validation dataset
is used by default.
• In the Add model details section when sharing to Canvas, you can provide the first row of your training
and validation datasets as the headers, or you can specify the headers as a different file.
• For classification problems (or categorical prediction in Canvas), original class names need to be
provided when sharing to SageMaker Canvas through the Configure model outputs option. The order
of the class names must match the indexing used with the shared model. The mapping can be either a
CSV file in Amazon S3, or you can manually input the class names.
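
The following is a minimal Boto3 sketch of registering a model version with the attributes above; the
model package group name, image URI, and Amazon S3 path are placeholders to replace with your own
values.

import boto3

sm = boto3.client("sagemaker")

sm.create_model_package(
    ModelPackageGroupName="canvas-byom-models",  # Hypothetical model package group.
    ModelApprovalStatus="Approved",
    InferenceSpecification={
        # Canvas is optimized for a single inference container per model.
        "Containers": [
            {
                "Image": "<account-id>.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1",
                "ModelDataUrl": "s3://<bucket>/model.tar.gz",
                "Environment": {
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                    "SAGEMAKER_PROGRAM": "inference.py",
                    "SAGEMAKER_REGION": "us-west-2",
                    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
                },
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)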

Manage billing and cost in SageMaker Canvas


To track the costs associated with your SageMaker Canvas application, you can use the AWS Billing and
Cost Management service. Billing and Cost Management provides tools to help you gather information
related to your cost and usage, analyze your cost drivers and usage trends, and take action to budget
your spending. For more information, see What is AWS Billing and Cost Management?

Billing in SageMaker Canvas consists of the following components:

• Workspace instance charges – You are charged for the number of hours that you are logged in to or
using SageMaker Canvas.
• AWS service charges – You are charged for building and making predictions with custom models, or for
making predictions with Ready-to-use models:
• Training charges – You are charged for the resources used to build a custom model.
• Prediction charges – You are charged for the resources used to generate predictions, depending on
the type of custom model that you built or the type of Ready-to-use model you used.

The Ready-to-use models (p. 289) in Canvas leverage other AWS services to generate predictions. When
you use a Ready-to-use model, you are charged by the respective service, and their pricing conditions
apply:

• For sentiment analysis, entities extraction, language detection, and personal information detection,
you’re charged with Amazon Comprehend pricing.
• For object detection in images and text detection in images, you’re charged with Amazon Rekognition
pricing.
• For expense analysis, identity document analysis, and document analysis, you’re charged with Amazon
Textract pricing.

For more information, see SageMaker Canvas pricing.

To help you track your costs in Billing and Cost Management, you can assign custom tags to your
SageMaker Canvas app and users. You can track the costs your apps incur, and by tagging individual user
profiles, you can track costs based on the user profile. For more information about tags, see Using Cost
Allocation Tags.


You can add tags to your SageMaker Canvas app and users by doing the following:

• If you are setting up your Amazon SageMaker Domain and SageMaker Canvas for the first time,
follow the Getting Started instructions and add tags when creating your Domain or users. You can
add tags either through the General settings in the Domain console setup, or through the APIs
(CreateDomain or CreateUserProfile). SageMaker adds the tags specified in your Domain or UserProfile
to any SageMaker Canvas apps or users you create after you create the Domain.
• If you want to add tags to apps in an existing Domain, you must add tags to either the Domain or the
UserProfile. You can add tags through either the console or the AddTags API (see the sketch after
this list). If you add tags through the console, then you must delete and relaunch your SageMaker
Canvas app in order for the tags to propagate to the app. If you use the API, the tags are added
directly to the app. For more information about deleting and relaunching a SageMaker Canvas app, see
Manage apps.
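
For example, the following is a minimal Boto3 sketch of tagging an existing Domain and user profile
with the AddTags API; the ARNs are placeholders.

import boto3

sm = boto3.client("sagemaker")

# Hypothetical Domain and user profile ARNs; replace with your own.
resource_arns = [
    "arn:aws:sagemaker:us-west-2:<account-id>:domain/<domain-id>",
    "arn:aws:sagemaker:us-west-2:<account-id>:user-profile/<domain-id>/<profile-name>",
]

for arn in resource_arns:
    sm.add_tags(
        ResourceArn=arn,
        Tags=[{"Key": "CostCenter", "Value": "canvas-team"}],  # Example cost allocation tag.
    )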

After you add tags to your Domain, it might take up to 24 hours for the tags to appear in the AWS Billing
and Cost Management console for activation. After they appear in the console, it takes another 24 hours
for the tags to activate.

On the Cost explorer page, you can group and filter your costs by tags and usage types to separate your
Workspace instance (Session-Hrs) charges from your Training charges. The names of the usage types are
as follows:

• Workspace instance (Session-Hrs) charges: REGION-Canvas:Session-Hrs (Hrs)


• Training charges:
• REGION-Canvas:CreateModelRequest-Tier0 (CreateModelRequest)
• REGION-Canvas:MillionCells-Tier1 (MillionCells)

Amazon SageMaker geospatial capabilities


Amazon SageMaker geospatial capabilities make it easier for data scientists and machine learning (ML)
engineers to build, train, and deploy ML models faster using geospatial data. You have access to open-
source and third-party data, processing, and visualization tools to make it more efficient to prepare
geospatial data for ML. You can increase your productivity by using purpose-built algorithms and pre-
trained ML models to speed up model building and training, and use built-in visualization tools to
explore prediction outputs on an interactive map and then collaborate across teams on insights and
results.

Why use SageMaker geospatial capabilities?

You can use SageMaker geospatial capabilities to make predictions on geospatial data faster than
do-it-yourself solutions. SageMaker geospatial capabilities make it easier to access geospatial data
from your existing customer data lakes, open-source datasets, and other SageMaker geospatial data
providers. SageMaker geospatial capabilities minimize the need for building custom infrastructure
and data preprocessing functions by offering purpose-built algorithms for efficient data preparation,
model training, and inference. You can also create and share custom visualizations and data with your
company from Amazon SageMaker Studio. SageMaker geospatial capabilities offer pre-trained models
for common uses in agriculture, real estate, insurance, and financial services.
Note
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region.
To view Amazon SageMaker geospatial capabilities, choose the name of the currently displayed
Region in the navigation bar of the console. Then choose the US West (Oregon) Region.

How can I use SageMaker geospatial capabilities?

You can use SageMaker geospatial capabilities in two ways.


• Through the SageMaker geospatial UI, as a part of Amazon SageMaker Studio UI.
• Through SageMaker notebooks with a SageMaker geospatial image.

Geospatial data represents features or objects on the earth’s surface. The first type of geospatial data is
vector data, which uses two-dimensional geometry such as points, lines, or polygons to represent objects
such as roads and land boundaries. The second type of geospatial data is raster data, such as imagery
captured by satellite, aerial platforms, or remote sensing data. This data type uses a matrix of pixels to
define where features are located. You can use raster formats for storing data that varies continuously over an area. A third type of
geospatial data is geo-tagged location data. It includes points of interest—for example, the Eiffel Tower
—location-tagged social media posts, latitude and longitude coordinates, or different styles and formats
of street addresses. SageMaker has the following SageMaker geospatial capabilities.

• An Earth Observation job for raster data


• A Vector Enrichment job for vector data
• Built-in Visualization Using SageMaker geospatial capabilities
• A geospatial image that has commonly used open-source libraries used in the geospatial ML workflow
pre-installed

Along with this, you can access data from a catalog of geospatial data providers. Currently, the data
collections available include:

• USGS Landsat
• Sentinel-2

Topics
• Getting Started with Amazon SageMaker geospatial capabilities (p. 402)
• Earth Observation Jobs (p. 405)
• Vector Enrichment Jobs (p. 412)
• Visualization Using SageMaker geospatial capabilities (p. 413)
• Amazon SageMaker geospatial Map SDK (p. 418)
• SageMaker geospatial capabilities FAQ (p. 423)
• SageMaker geospatial Security and Permissions (p. 424)

Getting Started with Amazon SageMaker geospatial capabilities
This guide demonstrates how to complete the necessary steps to satisfy the prerequisites for using
SageMaker geospatial capabilities.

To use SageMaker geospatial capabilities you need to have an AWS account. If you already have an AWS
account, skip this step.

Sign up for an AWS account


If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.


Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.

When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.

AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view
your current account activity and manage your account by going to https://aws.amazon.com/ and
choosing My Account.

Create an administrative user


After you sign up for an AWS account, create an administrative user so that you don't use the root user
for everyday tasks.

Secure your AWS account root user

1. Sign in to the AWS Management Console as the account owner by choosing Root user and entering
your AWS account email address. On the next page, enter your password.

For help signing in by using root user, see Signing in as the root user in the AWS Sign-In User Guide.
2. Turn on multi-factor authentication (MFA) for your root user.

For instructions, see Enable a virtual MFA device for your AWS account root user (console) in the IAM
User Guide.

Create an administrative user

• For your daily administrative tasks, grant administrative access to an administrative user in AWS IAM
Identity Center (successor to AWS Single Sign-On).

For instructions, see Getting started in the AWS IAM Identity Center (successor to AWS Single Sign-On)
User Guide.

Sign in as the administrative user

• To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email
address when you created the IAM Identity Center user.

For help signing in using an IAM Identity Center user, see Signing in to the AWS access portal in the
AWS Sign-In User Guide.

Create an execution role and trust policy


As a managed service, Amazon SageMaker geospatial capabilities perform operations on your behalf on
the AWS hardware that is managed by SageMaker. It can perform only operations that the user permits.
To work with SageMaker geospatial capabilities, you need to set up a user role and an execution role. See
Amazon SageMaker geospatial capabilities roles to learn more.


Set up the SageMaker geospatial UI


See Onboard to Amazon SageMaker Domain, which provides you with steps to create a Domain, giving
you access to Amazon SageMaker Studio and Amazon SageMaker geospatial capabilities.

To use the SageMaker geospatial UI:


Note
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region
for public preview. To view Amazon SageMaker geospatial capabilities, choose the name of
the currently displayed Region in the navigation bar of the console. Then choose the US West
(Oregon) Region.

Within the Studio UI, choose Geospatial under Data from the left navigation panel on the Home menu.

Shut down a SageMaker Studio notebook with a SageMaker geospatial image
When you are finished using SageMaker geospatial capabilities, we recommend that you shut down
the instance it runs on to avoid incurring additional charges. For more information, see Shut Down
Resources.

Launch a SageMaker Studio notebook with a SageMaker geospatial image
See Onboard to Amazon SageMaker Domain, which provides you with steps to create a Domain, giving
you access to Amazon SageMaker Studio and Amazon SageMaker geospatial capabilities.

To use a SageMaker Studio notebook with a SageMaker geospatial image:

1. From the Launcher, choose Change environment under Notebooks and compute resources.
2. Next, the Change environment dialog opens.
3. Select the Image dropdown and choose Geospatial 1.0. The Instance type should be
ml.geospatial.interactive. Do not change the default values for other settings.
4. Choose Select.
5. Choose Create notebook.

Types of Compute Instances


SageMaker geospatial capabilities offer three types of compute instances. Here's how you can use them:

• ml.geospatial.interactive – Launch an interactive SageMaker notebook with a SageMaker geospatial
image to build, train, and deploy ML models.
• ml.geospatial.jobs – Run processing jobs to transform satellite image data.
• ml.geospatial.models – Make predictions using pre-trained ML models on satellite imagery.

The instance type is determined by the operation that you run. The following table shows the instance
type for each operation.

Operations | Instance
Launch a SageMaker Studio notebook with SageMaker geospatial image | ml.geospatial.interactive
Temporal Statistics | ml.geospatial.jobs
Zonal Statistics | ml.geospatial.jobs
Resampling | ml.geospatial.jobs
Geomosaic | ml.geospatial.jobs
Band Stacking | ml.geospatial.jobs
Band Math | ml.geospatial.jobs
Cloud Removal with Landsat8 | ml.geospatial.jobs
Cloud Removal with Sentinel-2 | ml.geospatial.models
Cloud Masking | ml.geospatial.models
Land Cover Segmentation | ml.geospatial.models

You are charged different rates for each type of compute instance you use. See Geospatial ML with
Amazon SageMaker for more information on pricing.

Earth Observation Jobs


Using an Earth Observation job (EOJ), you can acquire, transform, and visualize geospatial data to make
predictions. You can choose an operation based on your use case from a wide range of operations and
models. You get the flexibility of choosing your area of interest, selecting the data providers, and setting
time-range based and cloud-cover-percentage-based filters. After SageMaker creates an EOJ for you, you
can visualize the inputs and outputs of the job using the visualization functionality. An EOJ has various
use cases that include comparing deforestation over time and diagnosing plant health. You can create
an EOJ by using a SageMaker notebook with a SageMaker geospatial image. You can also access the
SageMaker geospatial UI as a part of Amazon SageMaker Studio UI to view the list of all your jobs. You
can also use the UI to pause or stop an ongoing job. You can choose a job from the list of available EOJs
to view the Job summary and the Job details, as well as visualize the job output.

Topics
• Create an Earth Observation Job Using an Amazon SageMaker Studio Notebook with a SageMaker
geospatial Image (p. 405)
• Types of Operations (p. 409)
• Data Collections (p. 411)

Create an Earth Observation Job Using an Amazon SageMaker Studio Notebook with a SageMaker geospatial Image
To use a SageMaker Studio notebook with a SageMaker geospatial image:

1. From the Launcher, choose Change environment under Notebooks and compute resources.
2. Next, the Change environment dialog opens.
3. Select the Image dropdown and choose Geospatial 1.0. The Instance type should be
ml.geospatial.interactive. Do not change the default values for other settings.
4. Choose Select.


5. Choose Create notebook.

You can initiate an EOJ from an Amazon SageMaker Studio notebook with a SageMaker geospatial image
by using the code provided below.

import boto3
import sagemaker
import sagemaker_geospatial_map

session = boto3.Session()
execution_role = sagemaker.get_execution_role()
sg_client = session.client(service_name="sagemaker-geospatial")

The following is an example showing how to create an EOJ in the US West (Oregon) Region.

#Query and Access Data


search_rdc_args = {
"Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/
public/nmqj48dcu3g7ayw8", # sentinel-2 L2A COG
"RasterDataCollectionQuery": {
"AreaOfInterest": {
"AreaOfInterestGeometry": {
"PolygonGeometry": {
"Coordinates": [
[
[-114.529, 36.142],
[-114.373, 36.142],
[-114.373, 36.411],
[-114.529, 36.411],
[-114.529, 36.142],
]
]
}
}
},
"TimeRangeFilter": {
"StartTime": "2021-01-01T00:00:00Z",
"EndTime": "2022-07-10T23:59:59Z",
},
"PropertyFilters": {
"Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound":
1}}}],
"LogicalOperator": "AND",
},
"BandFilter": ["visual"],
},
}

tci_urls = []
data_manifests = []
while search_rdc_args.get("NextToken", True):
search_result = sg_client.search_raster_data_collection(**search_rdc_args)
if search_result.get("NextToken"):
data_manifests.append(search_result)
for item in search_result["Items"]:
tci_url = item["Assets"]["visual"]["Href"]
print(tci_url)
tci_urls.append(tci_url)

search_rdc_args["NextToken"] = search_result.get("NextToken")

# Perform land cover segmentation on images returned from the sentinel dataset.
eoj_input_config = {


"RasterDataCollectionQuery": {
"RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-
west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
"AreaOfInterest": {
"AreaOfInterestGeometry": {
"PolygonGeometry": {
"Coordinates": [
[
[-114.529, 36.142],
[-114.373, 36.142],
[-114.373, 36.411],
[-114.529, 36.411],
[-114.529, 36.142],
]
]
}
}
},
"TimeRangeFilter": {
"StartTime": "2021-01-01T00:00:00Z",
"EndTime": "2022-07-10T23:59:59Z",
},
"PropertyFilters": {
"Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound":
1}}}],
"LogicalOperator": "AND",
},
}
}
eoj_config = {"LandCoverSegmentationConfig": {}}

response = sg_client.start_earth_observation_job(
Name="lake-mead-landcover",
InputConfig=eoj_input_config,
JobConfig=eoj_config,
ExecutionRoleArn=execution_role,
)

After your EOJ is created, the Arn is returned to you. You use the Arn to identify
a job and perform further operations. To get the status of a job, you can run
sg_client.get_earth_observation_job(Arn = response['Arn']).

The following example shows how to query the status of an EOJ. A polling sketch that waits until the job completes follows the example.

eoj_arn = response["Arn"]
job_details = sg_client.get_earth_observation_job(Arn=eoj_arn)
{k: v for k, v in job_details.items() if k in ["Arn", "Status", "DurationInSeconds"]}
# List all jobs in the account
sg_client.list_earth_observation_jobs()["EarthObservationJobSummaries"]
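
To wait until the job reaches a terminal status, you can use a simple polling loop like the following
sketch; the set of terminal status names is an assumption to verify against the API reference.

import time

while True:
    job_details = sg_client.get_earth_observation_job(Arn=eoj_arn)
    status = job_details["Status"]
    print(f"EOJ status: {status}")
    if status in ("COMPLETED", "FAILED", "STOPPED"):
        break
    time.sleep(30)  # Poll every 30 seconds.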

After the EOJ is completed, you can visualize the EOJ outputs directly in the notebook. The following
example shows you how an interactive map can be rendered.

map = sagemaker_geospatial_map.create_map({
'is_raster': True
})
map.set_sagemaker_geospatial_client(sg_client)
# render the map
map.render()

The following example shows how the map can be centered on an area of interest and the input and
output of the EOJ can be rendered as separate layers within the map.


# visualize the area of interest


config = {"label": "Lake Mead AOI"}
aoi_layer = map.visualize_eoj_aoi(Arn=eoj_arn, config=config)

# Visualize input.
time_range_filter = {
"start_date": "2022-07-01T00:00:00Z",
"end_date": "2022-07-10T23:59:59Z",
}
config = {"label": "Input"}

input_layer = map.visualize_eoj_input(
Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)
# Visualize output, EOJ needs to be in completed status.
time_range_filter = {
"start_date": "2022-07-01T00:00:00Z",
"end_date": "2022-07-10T23:59:59Z",
}
config = {"preset": "singleBand", "band_name": "mask"}
output_layer = map.visualize_eoj_output(
Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)

You can use the export_earth_observation_job function to export the EOJ results to your Amazon
S3 bucket. The export function makes it convenient to share results across teams. SageMaker also
simplifies dataset management. You can simply share the EOJ results using the job ARN, instead of
crawling thousands of files in the S3 bucket. Each EOJ becomes an asset in the data catalog, as results
can be grouped by the job ARN. The following example shows how you can export the results of an EOJ.

sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket() # Replace with your own bucket if needed
s3_bucket = session.resource("s3").Bucket(s3_bucket_name)
prefix = "eoj_lakemead" # Replace with the S3 prefix desired
export_bucket_and_key = f"s3://{s3_bucket_name}/{prefix}/"

eoj_output_config = {"S3Data": {"S3Uri": export_bucket_and_key}}


export_response = sg_client.export_earth_observation_job(
Arn=eoj_arn,
ExecutionRoleArn=execution_role,
OutputConfig=eoj_output_config,
ExportSourceImages=False,
)

You can monitor the status of your export job using the following snippet.

# Monitor the export job status


export_job_details = sg_client.get_earth_observation_job(Arn=export_response["Arn"])
{k: v for k, v in export_job_details.items() if k in ["Arn", "Status",
"DurationInSeconds"]}

You are not charged the storage fees after you delete the EOJ.

For an example that showcases how to run an EOJ, see this blog post.

For more example notebooks on SageMaker geospatial capabilities, see this GitHub repository.


Types of Operations
When you create an EOJ, you select an operation based on your use case. Amazon SageMaker geospatial
capabilities provide a combination of purpose-built operations and pre-trained models. You can use
these operations to understand the impact of environmental changes and human activities over time or
identify cloud and cloud-free pixels.

Cloud Masking

Identifying clouds in satellite images is an essential preprocessing step in producing high-quality geospatial
data. Ignoring cloud pixels can lead to errors in analysis, and over-detection of cloud pixels can decrease
the number of valid observations. Cloud masking identifies cloudy and cloud-free pixels
in satellite images. An accurate cloud mask helps prepare satellite images for processing and improves data
generation. The following is the class map for cloud masking.

{
0: "No_cloud",
1: "cloud"
}

Cloud Removal

Cloud removal for Sentinel-2 data uses an ML-based semantic segmentation model to identify clouds
in the image. Cloudy pixels can be replaced with pixels from other timestamps. USGS Landsat data
contains Landsat metadata that is used for cloud removal.

Temporal Statistics

Temporal statistics calculate statistics for geospatial data through time. The temporal statistics currently
supported include mean, median, and standard deviation. You can calculate these statistics by using
GROUPBY and setting it to either all or yearly. You can also specify the TargetBands.
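
As an illustration, a JobConfig for an EOJ that computes temporal statistics might look like the
following sketch; the field names and enum values are assumptions to verify against the service API
reference.

eoj_config = {
    "TemporalStatisticsConfig": {
        "GroupBy": "YEARLY",  # Or "ALL".
        "Statistics": ["MEAN", "MEDIAN", "STANDARD_DEVIATION"],
        "TargetBands": ["red", "green", "blue"],  # Optional band selection.
    }
}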

Zonal Statistics

Zonal statistics performs statistical operations over a specified area on the image.

Resampling

Resampling is used to upscale and downscale the resolution of a geospatial image. The value attribute
in resampling represents the length of a side of the pixel.

Geomosaic

Geomosaic allows you to stitch smaller images into a large image.

Band Stacking

Band stacking takes more than one image band as input and stacks them into a single GeoTIFF. The
OutputResolution attribute determines the resolution of the output image. Based on the resolutions
of the input images, you can set it to lowest, highest or average.

Band Math

Band Math, also known as Spectral Index, is a process of transforming the observations from multiple
spectral bands to a single band, indicating the relative abundance of features of interest. For instance,
Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) are helpful for
observing the presence of green vegetation features.
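
For example, a JobConfig that computes NDVI from the red and near-infrared bands might look like the
following sketch; the CustomIndices structure and band names are assumptions to verify against the
service API reference.

eoj_config = {
    "BandMathConfig": {
        "CustomIndices": {
            # NDVI = (NIR - Red) / (NIR + Red)
            "Operations": [{"Name": "ndvi", "Equation": "(nir - red) / (nir + red)"}]
        }
    }
}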


Land Cover Segmentation

Land Cover segmentation is a semantic segmentation model that can identify the
physical material, such as vegetation, water, and bare ground, at the earth's surface. Having an accurate
way to map the land cover patterns helps you understand the impact of environmental change and
human activities over time. Land Cover segmentation is often used for region planning, disaster
response, ecological management, and environmental impact assessment. The following is the class map
for Land Cover segmentation.

{
0: "No_data",
1: "Saturated_or_defective",
2: "Dark_area_pixels",
3: "Cloud_shadows",
4: "Vegetation",
5: "Not_vegetated",
6: "Water",
7: "Unclassified",
8: "Cloud_medium_probability",
9: "Cloud_high_probability",
10: "Thin_cirrus",
11: "Snow_ice"
}

Availability of EOJ Operations


The availability of operations depends on whether you are using the SageMaker geospatial UI or the
Amazon SageMaker Studio notebooks with a SageMaker geospatial image. Currently, notebooks support
all functionalities. To summarize, the following geospatial operations are supported by SageMaker:

Operations | Description | Availability
Cloud Masking | Identify cloud and cloud-free pixels to get improved and accurate satellite imagery. | UI, Notebook
Cloud Removal | Remove pixels containing parts of a cloud from satellite imagery. | Notebook
Temporal Statistics | Calculate statistics through time for a given GeoTIFF. | Notebook
Zonal Statistics | Calculate statistics on user-defined regions. | Notebook
Resampling | Scale images to different resolutions. | Notebook
Geomosaic | Combine multiple images for greater fidelity. | Notebook
Band Stacking | Combine multiple spectral bands to create a single image. | Notebook
Band Math / Spectral Index | Obtain a combination of spectral bands that indicate the abundance of features of interest. | UI, Notebook
Land Cover Segmentation | Identify land cover types such as vegetation and water in satellite imagery. | UI, Notebook

Data Collections
Amazon SageMaker geospatial provides the following data collections to create an EOJ.

• USGS Landsat
• Sentinel-2

The image band information for these data collections is provided below.

USGS Landsat

Band name | Wavelength range (nm) | Units | Valid range | Fill value | Spatial resolution
coastal | 435 - 451 | Unitless | 1 - 65455 | 0 (No Data) | 30m
blue | 452 - 512 | Unitless | 1 - 65455 | 0 (No Data) | 30m
green | 533 - 590 | Unitless | 1 - 65455 | 0 (No Data) | 30m
red | 636 - 673 | Unitless | 1 - 65455 | 0 (No Data) | 30m
nir | 851 - 879 | Unitless | 1 - 65455 | 0 (No Data) | 30m
swir16 | 1566 - 1651 | Unitless | 1 - 65455 | 0 (No Data) | 30m
swir22 | 2107 - 2294 | Unitless | 1 - 65455 | 0 (No Data) | 30m
qa_aerosol | NA | Bit Index | 0 - 255 | 1 | 30m
qa_pixel | NA | Bit Index | 1 - 65455 | 1 (bit 0) | 30m
qa_radsat | NA | Bit Index | 1 - 65455 | NA | 30m
t | 10600 - 11190 | Scaled Kelvin | 1 - 65455 | 0 (No Data) | 30m (scaled from 100m)
atran | NA | Unitless | 0 - 10000 | -9999 (No Data) | 30m
cdist | NA | Kilometers | 0 - 24000 | -9999 (No Data) | 30m
drad | NA | W/(m^2 sr µm)/DN | 0 - 28000 | -9999 (No Data) | 30m
urad | NA | W/(m^2 sr µm)/DN | 0 - 28000 | -9999 (No Data) | 30m
trad | NA | W/(m^2 sr µm)/DN | 0 - 28000 | -9999 (No Data) | 30m
emis | NA | Emissivity coefficient | 1 - 10000 | -9999 (No Data) | 30m
emsd | NA | Emissivity coefficient | 1 - 10000 | -9999 (No Data) | 30m

Sentinel-2

Band name | Wavelength range (nm) | Scale | Valid range | Fill value | Spatial resolution
coastal | 443 | 0.0001 | NA | 0 (No Data) | 60m
blue | 490 | 0.0001 | NA | 0 (No Data) | 10m
green | 560 | 0.0001 | NA | 0 (No Data) | 10m
red | 665 | 0.0001 | NA | 0 (No Data) | 10m
rededge1 | 705 | 0.0001 | NA | 0 (No Data) | 20m
rededge2 | 740 | 0.0001 | NA | 0 (No Data) | 20m
rededge3 | 783 | 0.0001 | NA | 0 (No Data) | 20m
nir | 842 | 0.0001 | NA | 0 (No Data) | 10m
nir08 | 865 | 0.0001 | NA | 0 (No Data) | 20m
nir09 | 940 | 0.0001 | NA | 0 (No Data) | 60m
swir16 | 1610 | 0.0001 | NA | 0 (No Data) | 20m
swir22 | 2190 | 0.0001 | NA | 0 (No Data) | 20m
aot | Aerosol Optical Thickness | 0.001 | NA | 0 (No Data) | 10m
wvp | Scene-average Water Vapour | 0.001 | NA | 0 (No Data) | 10m
scl | Scene classification data | NA | 1 - 11 | 0 (No Data) | 20m

Vector Enrichment Jobs


A Vector Enrichment Job (VEJ) performs operations on your vector data. Currently, you can use a VEJ to
do reverse geocoding or map matching.

Reverse Geocoding

With a reverse geocoding VEJ, you can convert geographic coordinates (latitude, longitude) to human-
readable addresses, powered by Amazon Location Service. When you upload a CSV file containing
the longitude and latitude coordinates, it returns the address number, country, label, municipality,
neighborhood, postal code, and region of that location. The output file consists of your input data along
with columns containing these values appended at the end. These jobs are optimized to accept tens
of thousands of GPS traces.

Map Matching

Map matching allows you to snap GPS coordinates to road segments. The input should be a CSV file
containing the trace ID (route), longitude, latitude, and timestamp attributes. There can be multiple
GPS coordinates per route, and the input can contain multiple routes. The output is a GeoJSON file
that contains the links of the predicted route, along with the snap points provided in the input. These
jobs are optimized to accept tens of thousands of drives in one request. Map matching is supported by
OpenStreetMap. Map matching fails if the names in the input source field don't match the ones in
MapMatchingConfig. The error message you receive contains the field names present in the input
file and the expected field name that is not found in MapMatchingConfig.

While you need to use an Amazon SageMaker Studio notebook to execute a VEJ, you can view all the
jobs you create using the UI. To use the visualization in the notebook, you first need to export your
output to your S3 bucket. The VEJ actions you can perform are as follows.

• StartVectorEnrichmentJob
• GetVectorEnrichmentJob
• ListVectorEnrichmentJobs
• StopVectorEnrichmentJob
• DeleteVectorEnrichmentJob
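
As a hedged sketch of the first action in this list, the following starts a reverse geocoding VEJ with the boto3 sagemaker-geospatial client; the bucket, role ARN, and column names are placeholders, not values from this guide.

    import boto3

    geospatial = boto3.client("sagemaker-geospatial", region_name="us-west-2")

    # Start a reverse geocoding VEJ over a CSV of longitude/latitude columns.
    response = geospatial.start_vector_enrichment_job(
        Name="reverse-geocoding-example",
        ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerGeospatialExecutionRole",
        InputConfig={
            "DocumentType": "CSV",
            "DataSourceConfig": {"S3Data": {"S3Uri": "s3://<BUCKET>/input/coordinates.csv"}},
        },
        JobConfig={
            "ReverseGeocodingConfig": {
                "XAttributeName": "longitude",
                "YAttributeName": "latitude",
            }
        },
    )
    print(response["Arn"])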

Visualization Using SageMaker geospatial capabilities


Using the visualization functionalities provided by Amazon SageMaker geospatial, you can visualize
geospatial data: the inputs to your EOJs and VEJs, as well as the outputs exported from your Amazon
S3 bucket. The visualization tool is powered by Foursquare Studio.

You can use the left navigation panel to add data, layers, filters, and columns. You can also make
modifications to how you interact with the map.

Dataset

The source of data used for visualization is called a Dataset. To add data for visualization, choose Add
Data in the left navigation panel. You can either upload the data from your Amazon S3 bucket or from
your local machine. The supported data formats are CSV, JSON, and GeoJSON. You can add multiple
datasets to your map. After you upload a dataset, you can see it loaded on the map screen.

Layers

In the layer panel, a layer is created and populated automatically when you add a dataset. If your map
consists of more than one dataset, you can select which dataset belongs to a layer. You can create
new layers and group them. SageMaker geospatial capabilities support various layer types,
including point, arc, icon, and polygon.

You can choose any data point in a layer to have an Outline. You can also further customize the data
points. For example, you can choose the layer type as Point and then Fill Color based on any column of
your dataset. You can also change the radius of the points.


Columns

You can view the columns present in your dataset by using the Columns tab in the left navigation panel.

Filters

You can use filters to limit the data points that display on the map.

Interactions

In the Interactions panel, you can customize how you interact with the map. For example, you can
choose which metrics to display in the tooltip when you hover over a data point.

Base map

Currently, SageMaker only supports the Amazon Dark base map.

Split Map Modes

You can have a Single Map, Dual Maps, or Swipe Maps. With Dual Maps, you can compare the same
map side by side using different layers. Use Swipe Maps to overlay two maps on each other and use
the sliding separator to compare them. You can choose the split map mode by choosing the Split Mode
button in the top right corner of your map.

Legends for EOJ in the SageMaker geospatial UI


The output visualization of an EOJ depends on the operation used to create it. The legend is based
on the default color scale. You can view the legend by choosing the Show legend button in the top right
corner of your map.

Spectral Index

When you visualize the output for an EOJ that uses the spectral index operation, you can map each
category to its color in the legend.

Cloud Masking

When you visualize the output for an EOJ that uses the cloud masking operation, you can map each
category to its color in the legend.

Land Cover Segmentation


When you visualize the output for an EOJ that uses the Land Cover Segmentation operation, you can
map each category to its color in the legend.

Amazon SageMaker geospatial Map SDK


You can use Amazon SageMaker geospatial capabilities to visualize maps within the SageMaker
geospatial UI as well as in SageMaker notebooks with a geospatial image. These visualizations are
supported by the map visualization library Foursquare Studio.

You can use the APIs provided by the SageMaker geospatial map SDK to visualize your geospatial data,
including the input, output, and area of interest (AoI) for an EOJ.

Topics
• add_dataset API (p. 418)
• update_dataset API (p. 419)
• add_layer API (p. 420)
• update_layer API (p. 421)
• visualize_eoj_aoi API (p. 422)
• visualize_eoj_input API (p. 422)
• visualize_eoj_output API (p. 423)

add_dataset API
Adds a raster or vector dataset object to the map.

Request syntax

Request =
  add_dataset(
      self,
      dataset: Union[Dataset, Dict, None] = None,
      *,
      auto_create_layers: bool = True,
      center_map: bool = True,
      **kwargs: Any,
  ) -> Optional[Dataset]

Request parameters

The request accepts the following parameters.

Positional arguments

Argument | Type | Description
dataset | Union[Dataset, Dict, None] | Data used to create a dataset, in CSV, JSON, or GeoJSON format (for local datasets) or a UUID string.

Keyword arguments

Argument | Type | Description
auto_create_layers | Boolean | Whether to attempt to create new layers when adding a dataset. Default value is True.
center_map | Boolean | Whether to center the map on the created dataset. Default value is True.
id | String | Unique identifier of the dataset. If you do not provide it, a random ID is generated.
label | String | Dataset label which is displayed.
color | Tuple[float, float, float] | Color label of the dataset.
metadata | Dictionary | Object containing tileset metadata (for tiled datasets).

Response

This API returns the Dataset object that was added to the map.
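
The following is a hedged usage sketch in a Studio notebook with the SageMaker geospatial image. The sagemaker_geospatial_map module name, the create_map call, and the file and label names are assumptions for illustration, not values from this guide.

    import pandas as pd
    import sagemaker_geospatial_map  # assumed module provided by the geospatial image

    # Render an embedded map in the notebook, then add a local CSV as a dataset.
    embedded_map = sagemaker_geospatial_map.create_map()
    embedded_map.render()

    df = pd.read_csv("vej_output.csv")  # placeholder file exported from your S3 bucket
    dataset = embedded_map.add_dataset(
        {"data": df, "label": "vej_output"},
        auto_create_layers=True,
        center_map=True,
    )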

update_dataset API
Updates an existing dataset's settings.

Request syntax

Request =
  update_dataset(
      self,
      dataset_id: str,
      values: Union[_DatasetUpdateProps, dict, None] = None,
      **kwargs: Any,
  ) -> Dataset

Request parameters

The request accepts the following parameters.

Positional arguments

Argument | Type | Description
dataset_id | String | The identifier of the dataset to be updated.
values | Union[_DatasetUpdateProps, dict, None] | The values to update.

Keyword arguments

Argument | Type | Description
label | String | Dataset label which is displayed.
color | RGBColor | Color label of the dataset.
Response

This API returns the updated dataset object for interactive maps, or None for non-interactive HTML
environments.

add_layer API
Adds a new layer to the map. This function requires at least one valid layer configuration.

Request syntax

Request =
  add_layer(
      self,
      layer: Union[LayerCreationProps, dict, None] = None,
      **kwargs: Any
  ) -> Layer

Request parameters

The request accepts the following parameters.

Arguments

Argument | Type | Description
layer | Union[LayerCreationProps, dict, None] | A set of properties used to create a layer.


Response

The layer object that was added to the map.

update_layer API
Update an existing layer with given values.

Request syntax

Request =
  update_layer(
      self,
      layer_id: str,
      values: Union[LayerUpdateProps, dict, None],
      **kwargs: Any
  ) -> Layer

Request parameters

The request accepts the following parameters.

Arguments

Positional argument | Type | Description
layer_id | String | The ID of the layer to be updated.
values | Union[LayerUpdateProps, dict, None] | The values to update.

Keyword arguments

Argument | Type | Description
type | LayerType | The type of layer.
data_id | String | Unique identifier of the dataset this layer visualizes.
fields | Dict[string, Optional[string]] | Dictionary that maps fields that the layer requires for visualization to appropriate dataset fields.
label | String | Canonical label of this layer.
is_visible | Boolean | Whether the layer is visible or not.
config | LayerConfig | Layer configuration specific to its type.

Response

Returns the updated layer object.

visualize_eoj_aoi API
Visualize the AoI of the given job ARN.

Request parameters

The request accepts the following parameters.

Arguments

Argument | Type | Description
Arn | String | The ARN of the job.
config | Dictionary | An option to pass layer properties. config = { label: <string> custom label of the added AoI layer, default AoI }

Response

Reference of the added AoI layer object.

visualize_eoj_input API
Visualize the input of the given EOJ ARN.

Request parameters

The request accepts the following parameters.

Arguments

Argument | Type | Description
Arn | String | The ARN of the job.
time_range_filter | Dictionary | An option to provide the start and end time. Defaults to the raster data collection search start and end date. time_range_filter = { start_date: <string> date in ISO format, end_date: <string> date in ISO format }
config | Dictionary | An option to pass layer properties. config = { label: <string> custom label of the added input layer, default Input }


Response

Reference of the added input layer object.

visualize_eoj_output API
Visualize the output of the given EOJ ARN.

Request parameters

The request accepts the following parameters.

Arguments

Argument | Type | Description
Arn | String | The ARN of the job.
time_range_filter | Dictionary | An option to provide the start and end time. Defaults to the raster data collection search start and end date. time_range_filter = { start_date: <string> date in ISO format, end_date: <string> date in ISO format }
config | Dictionary | An option to pass layer properties. config = { label: <string> custom label of the added output layer (default Output); preset: <string> singleBand or trueColor; band_name: <string>, only required for the singleBand preset (one of the allowed bands for an EOJ) }

Response

Reference of the added output Layer object.

To learn more about visualizing your geospatial data, refer to Visualization Using Amazon SageMaker
geospatial.
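
A hedged sketch of these three APIs follows, using an embedded map object such as the embedded_map created in the add_dataset example above; the EOJ ARN and band name are placeholders.

    # Visualize the AoI, input, and output of an EOJ on the embedded map.
    eoj_arn = "arn:aws:sagemaker-geospatial:us-west-2:111122223333:earth-observation-job/example"

    embedded_map.visualize_eoj_aoi(Arn=eoj_arn, config={"label": "AoI"})
    embedded_map.visualize_eoj_input(Arn=eoj_arn, config={"label": "Input"})
    embedded_map.visualize_eoj_output(
        Arn=eoj_arn,
        config={"label": "Output", "preset": "singleBand", "band_name": "ndvi"},
    )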

SageMaker geospatial capabilities FAQ


Use the following FAQ items to find answers to commonly asked questions about SageMaker geospatial
capabilities.


1. What regions are Amazon SageMaker geospatial capabilities available in?

Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon) Region. To
view SageMaker geospatial, choose the name of the currently displayed Region in the navigation bar
of the console. Then choose the US West (Oregon) Region.
2. How can I set up a user role and an execution role to get started with SageMaker geospatial
capabilities?

As a managed service, SageMaker geospatial capabilities perform operations on your behalf on the
AWS hardware managed by SageMaker. It can only perform the operations that the user permits. To
work with SageMaker geospatial capabilities, you need to set up a user role and an execution role. See
SageMaker geospatial capabilities roles to learn more.
3. Can I use SageMaker geospatial capabilities through my VPC environment?

No, currently SageMaker geospatial capabilities only support a public internet environment.
4. Why can't I see the SageMaker geospatial UI link when I navigate to Amazon SageMaker Studio?

Verify that you are launching Amazon SageMaker Studio in the US West (Oregon) Region and that you
are not in a VPC-only environment or in a shared space environment.
5. How do I create a notebook job in Studio?

See Schedule a notebook job to learn how to create and manage your notebook jobs. Make sure you
are using the latest JupyterLab version.
6. What bands are supported for the various raster data collections?

Use the GetRasterDataCollection API response and refer to the ImageSourceBands field to find
the bands supported for that particular data collection (see the sketch after this list).
7. Can I use SageMaker geospatial capabilities if my browser does not have internet connection?

You cannot access the list of EOJs and VEJs, or the map visualization, from the UI if your browser does
not have an internet connection.
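
For item 6, the following is a minimal sketch assuming the boto3 sagemaker-geospatial client; the data collection ARN is a placeholder.

    import boto3

    geospatial = boto3.client("sagemaker-geospatial", region_name="us-west-2")
    # ImageSourceBands lists the bands supported by the data collection.
    collection = geospatial.get_raster_data_collection(Arn="<DATA_COLLECTION_ARN>")
    print(collection["ImageSourceBands"])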

SageMaker geospatial Security and Permissions


Use the topics on this page to learn about SageMaker geospatial capabilities security features.
Additionally, learn how to use SageMaker geospatial capabilities in an Amazon Virtual Private Cloud
and how to protect your data at rest using encryption.

For more information about IAM users and roles, see Identities (Users, Groups, and Roles) in the IAM User
Guide.

To learn more about using IAM with SageMaker, see Identity and Access Management for Amazon
SageMaker (p. 3048).

Topics
• Configuration and Vulnerability Analysis in SageMaker geospatial (p. 425)
• Security Best Practices for SageMaker geospatial capabilities (p. 425)
• Use Amazon SageMaker geospatial capabilities in Your Amazon Virtual Private Cloud (p. 426)
• Use AWS KMS Permissions for Amazon SageMaker geospatial capabilities (p. 427)


Configuration and Vulnerability Analysis in SageMaker geospatial
Configuration and IT controls are a shared responsibility between AWS and you, our customer.
AWS handles basic security tasks like guest operating system (OS) and database patching, firewall
configuration, and disaster recovery. These procedures have been reviewed and certified by the
appropriate third parties. For more details, see the following resources:

• Shared Responsibility Model.


• Amazon Web Services: Overview of Security Processes.

Security Best Practices for SageMaker geospatial capabilities


Amazon SageMaker geospatial capabilities provide a number of security features to consider as you
develop and implement your own security policies. The following best practices are general guidelines
and don't represent a complete security solution. Because these best practices might not be appropriate
or sufficient for your environment, treat them as helpful considerations rather than prescriptions.

Apply principle of least privilege

Amazon SageMaker geospatial capabilities provide granular access policies for applications using IAM
roles. We recommend that the roles be granted only the minimum set of privileges required by the job.
We also recommend auditing the jobs' permissions on a regular basis and upon any change to your
application.

Role-based access control (RBAC) permissions

Administrators should strictly control Role-based access control (RBAC) permissions for Amazon
SageMaker geospatial capabilities.

Use temporary credentials whenever possible

Where possible, use temporary credentials instead of long-term credentials, such as access keys.
For scenarios in which you need IAM users with programmatic access and long-term credentials, we
recommend that you rotate access keys. Regularly rotating long-term credentials helps you familiarize
yourself with the process. This is useful in case you are ever in a situation where you must rotate
credentials, such as when an employee leaves your company. We recommend that you use IAM access
last used information to rotate and remove access keys safely. For more information, see Rotating access
keys and Security best practices in IAM.

Use AWS CloudTrail to view and log API calls

AWS CloudTrail tracks anyone making API calls in your AWS account. API calls are logged whenever
anyone uses the Amazon SageMaker geospatial capabilities API, the Amazon SageMaker geospatial
capabilities console, or Amazon SageMaker geospatial capabilities AWS CLI commands. Enable logging
and specify an Amazon S3 bucket to store the logs.

Your trust, privacy, and the security of your content are our highest priorities. We implement responsible
and sophisticated technical and physical controls designed to prevent unauthorized access to, or
disclosure of, your content and ensure that our use complies with our commitments to you. For more
information, see AWS Data Privacy FAQ.


Use Amazon SageMaker geospatial capabilities in Your Amazon Virtual Private Cloud

The following topic gives information on how to use SageMaker notebooks with a SageMaker geospatial
image in an Amazon SageMaker Domain in VPC-only mode. For more information on VPCs in Amazon
SageMaker Studio, see Choose an Amazon VPC.

VPC only communication with the internet

By default, a SageMaker Domain uses two Amazon VPCs. One Amazon VPC is managed by Amazon
SageMaker and provides direct internet access. You specify the other Amazon VPC, which provides
encrypted traffic between the Domain and your Amazon Elastic File System (Amazon EFS) volume.

You can change this behavior so that SageMaker sends all traffic over your specified Amazon VPC. If
VPC only has been chosen as the network access mode during SageMaker Domain creation, the
following requirements need to be considered to still allow usage of SageMaker Studio notebooks within
the created SageMaker Domain.

Requirements to use VPC only mode


Note
In order to use the visualization components of SageMaker geospatial capabilities, the browser
you use to access the SageMaker Studio UI needs to be connected to the internet.

When you choose VpcOnly, follow these steps:

1. You must use private subnets only. You cannot use public subnets in VpcOnly mode.
2. Ensure that your subnets have the required number of IP addresses. The expected number
of IP addresses needed per user can vary based on use case. We recommend between 2 and 4 IP
addresses per user. The total IP address capacity for a Studio domain is the sum of available IP
addresses for each subnet provided when the domain is created. Ensure that your estimated IP
address usage does not exceed the capacity supported by the number of subnets you provide.
Additionally, using subnets distributed across many Availability Zones can aid in IP address
availability. For more information, see VPC and subnet sizing for IPv4.
Note
You can configure only subnets with a default tenancy VPC in which your instance runs on
shared hardware. For more information on the tenancy attribute for VPCs, see Dedicated
Instances.
3. Set up one or more security groups with inbound and outbound rules that together allow the
following traffic:

• NFS traffic over TCP on port 2049 between the domain and the Amazon EFS volume.
• TCP traffic within the security group. This is required for connectivity between the JupyterServer
app and the KernelGateway apps. You must allow access to at least the ports in the range
8192-65535.
4. If you want to allow internet access, you must use a NAT gateway with access to the internet, for
example through an internet gateway.
5. If you don't want to allow internet access, create interface VPC endpoints (AWS PrivateLink) to
allow Studio to access the following services with the corresponding service names. You must also
associate the security groups for your VPC with these endpoints.
Note
Currently, SageMaker geospatial capabilities are only supported in the US West (Oregon)
Region.

• SageMaker API: com.amazonaws.us-west-2.sagemaker.api
• SageMaker runtime: com.amazonaws.us-west-2.sagemaker.runtime. This is required to run Studio notebooks with a SageMaker geospatial image.
• Amazon S3: com.amazonaws.us-west-2.s3.
• To use SageMaker Projects: com.amazonaws.us-west-2.servicecatalog.
• SageMaker geospatial capabilities: com.amazonaws.us-west-2.sagemaker-geospatial

If you use the SageMaker Python SDK to run remote training jobs, you must also create the
following Amazon VPC endpoints.

• AWS Security Token Service: com.amazonaws.region.sts
• Amazon CloudWatch: com.amazonaws.region.logs. This is required to allow the SageMaker Python SDK to get the remote training job status from Amazon CloudWatch.
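
As an illustration of step 5, the following is a hedged boto3 sketch of creating the interface VPC endpoint for SageMaker geospatial capabilities; all resource IDs are placeholders for your own VPC.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    # Create an interface endpoint (AWS PrivateLink) in your private subnets.
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",
        VpcEndpointType="Interface",
        ServiceName="com.amazonaws.us-west-2.sagemaker-geospatial",
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )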

Note
For a customer working within VPC mode, company firewalls can cause connection issues with
SageMaker Studio or between JupyterServer and the KernelGateway. Make the following checks
if you encounter one of these issues when using SageMaker Studio from behind a firewall.

• Check that the Studio URL is in your network's allowlist.


• Check that websocket connections are not blocked. Jupyter uses websockets under the
hood. If the KernelGateway application is InService, JupyterServer may not be able to connect
to the KernelGateway. You see this problem when opening the System Terminal as well.

Use AWS KMS Permissions for Amazon SageMaker geospatial capabilities
You can protect your data at rest using encryption for SageMaker geospatial capabilities. By default, it
uses server-side encryption with an Amazon SageMaker geospatial owned key. SageMaker geospatial
capabilities also support an option for server-side encryption with a customer managed KMS key.

Server-Side Encryption with Amazon SageMaker geospatial managed key (Default)
SageMaker geospatial capabilities encrypts all your data, including computational results from your
Earth Observation jobs (EOJ) and Vector Enrichment jobs (VEJ) along with all your service metadata.
There is no data that is stored within SageMaker geospatial capabilities unencrypted. It uses a default
AWS owned key to encrypt all your data.

Server-Side Encryption with customer managed KMS key (Optional)


SageMaker geospatial capabilities supports the use of a symmetric customer managed key that you
create, own, and manage to add a second layer of encryption over the existing AWS owned encryption.
Because you have full control of this layer of encryption, you can perform such tasks as:

• Establishing and maintaining key policies


• Establishing and maintaining IAM policies and grants
• Enabling and disabling key policies
• Rotating key cryptographic material
• Adding tags
• Creating key aliases
• Scheduling keys for deletion


For more information, see Customer managed keys in the AWS Key Management Service Developer Guide.

How SageMaker geospatial capabilities uses grants in AWS KMS


SageMaker geospatial capabilities require a grant to use your customer managed key. When you create
an EOJ or a VEJ encrypted with a customer managed key, SageMaker geospatial capabilities create
a grant on your behalf by sending a CreateGrant request to AWS KMS. Grants in AWS KMS are used
to give SageMaker geospatial capabilities access to a KMS key in a customer account. You can revoke
access to the grant, or remove the service's access to the customer managed key at any time. If you do,
SageMaker geospatial capabilities won't be able to access any of the data encrypted by the customer
managed key, which affects operations that are dependent on that data.

Create a customer managed key


You can create a symmetric customer managed key by using the AWS Management Console, or the AWS
KMS APIs.

To create a symmetric customer managed key

Follow the steps for Creating symmetric encryption KMS keys in the AWS Key Management Service
Developer Guide.

Key policy

Key policies control access to your customer managed key. Every customer managed key must have
exactly one key policy, which contains statements that determine who can use the key and how they can
use it. When you create your customer managed key, you can specify a key policy. For more information,
see Determining access to AWS KMS keys in the AWS Key Management Service Developer Guide.

To use your customer managed key with your SageMaker geospatial capabilities resources, the following
API operations must be permitted in the key policy. The principal for these operations should be the
Execution Role you provide in the SageMaker geospatial capabilities request. SageMaker geospatial
capabilities assumes the provided Execution Role in the request to perform these KMS operations.

• kms:CreateGrant
• kms:GenerateDataKey
• kms:Decrypt
• kms:GenerateDataKeyWithoutPlaintext

The following are policy statement examples you can add for SageMaker geospatial capabilities:

CreateGrant

"Statement" : [
{
"Sid" : "Allow access to Amazon SageMaker geospatial capabilities",
"Effect" : "Allow",
"Principal" : {
"AWS" : "<Customer provided Execution Role ARN>"
},
"Action" : [
"kms:CreateGrant",
"kms:Decrypt",
"kms:GenerateDataKey",
"kms:GenerateDataKeyWithoutPlaintext"
],
"Resource" : "*",
},
]


For more information about specifying permissions in a policy, see AWS KMS permissions in the AWS Key
Management Service Developer Guide. For more information about troubleshooting, see Troubleshooting
key access in the AWS Key Management Service Developer Guide.

If your key policy does not have your account root as key administrator, you need to add the same KMS
permissions on your execution role ARN. Here is a sample policy you can add to the execution role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "kms:CreateGrant",
                "kms:Decrypt",
                "kms:GenerateDataKey",
                "kms:GenerateDataKeyWithoutPlaintext"
            ],
            "Resource": [
                "<KMS key Arn>"
            ],
            "Effect": "Allow"
        }
    ]
}

Monitoring your encryption keys for SageMaker geospatial capabilities


When you use an AWS KMS customer managed key with your SageMaker geospatial capabilities
resources, you can use AWS CloudTrail or Amazon CloudWatch Logs to track requests that SageMaker
geospatial sends to AWS KMS.

The following are examples of AWS CloudTrail events that you can use to monitor the KMS operations
called by SageMaker geospatial capabilities to access data encrypted by your customer managed key.

CreateGrant

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAIGDTESTANDEXAMPLE:SageMaker-Geospatial-StartEOJ-KMSAccess",
        "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole/SageMaker-Geospatial-StartEOJ-KMSAccess",
        "accountId": "111122223333",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE3",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AKIAIOSFODNN7EXAMPLE3",
                "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole",
                "accountId": "111122223333",
                "userName": "SageMakerGeospatialCustomerRole"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-17T18:02:06Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "arn:aws:iam::111122223333:root"
    },
    "eventTime": "2023-03-17T18:02:06Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "CreateGrant",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "172.12.34.56",
    "userAgent": "ExampleDesktop/1.0 (V1; OS)",
    "requestParameters": {
        "retiringPrincipal": "sagemaker-geospatial.us-west-2.amazonaws.com",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE",
        "operations": [
            "Decrypt"
        ],
        "granteePrincipal": "sagemaker-geospatial.us-west-2.amazonaws.com"
    },
    "responseElements": {
        "grantId": "0ab0ac0d0b000f00ea00cc0a0e00fc00bce000c000f0000000c0bc0a0000aaafSAMPLE",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
    },
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": false,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}

GenerateDataKey

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AWSService",
        "invokedBy": "sagemaker-geospatial.amazonaws.com"
    },
    "eventTime": "2023-03-24T00:29:45Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "GenerateDataKey",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "sagemaker-geospatial.amazonaws.com",
    "userAgent": "sagemaker-geospatial.amazonaws.com",
    "requestParameters": {
        "encryptionContext": {
            "aws:s3:arn": "arn:aws:s3:::axis-earth-observation-job-378778860802/111122223333/napy9eintp64/output/consolidated/32PPR/2022-01-04T09:58:03Z/S2B_32PPR_20220104_0_L2A_msavi.tif"
        },
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE",
        "keySpec": "AES_256"
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}

Decrypt

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AWSService",
        "invokedBy": "sagemaker-geospatial.amazonaws.com"
    },
    "eventTime": "2023-03-28T22:04:24Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "sagemaker-geospatial.amazonaws.com",
    "userAgent": "sagemaker-geospatial.amazonaws.com",
    "requestParameters": {
        "encryptionAlgorithm": "SYMMETRIC_DEFAULT",
        "encryptionContext": {
            "aws:s3:arn": "arn:aws:s3:::axis-earth-observation-job-378778860802/111122223333/napy9eintp64/output/consolidated/32PPR/2022-01-04T09:58:03Z/S2B_32PPR_20220104_0_L2A_msavi.tif"
        }
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}

GenerateDataKeyWithoutPlaintext

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAIGDTESTANDEXAMPLE:SageMaker-Geospatial-StartEOJ-KMSAccess",
        "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole/SageMaker-Geospatial-StartEOJ-KMSAccess",
        "accountId": "111122223333",
        "accessKeyId": "AKIAIOSFODNN7EXAMPLE3",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AKIAIOSFODNN7EXAMPLE3",
                "arn": "arn:aws:sts::111122223333:assumed-role/SageMakerGeospatialCustomerRole",
                "accountId": "111122223333",
                "userName": "SageMakerGeospatialCustomerRole"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-17T18:02:06Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "arn:aws:iam::111122223333:root"
    },
    "eventTime": "2023-03-28T22:09:16Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "GenerateDataKeyWithoutPlaintext",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "172.12.34.56",
    "userAgent": "ExampleDesktop/1.0 (V1; OS)",
    "requestParameters": {
        "keySpec": "AES_256",
        "keyId": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
    },
    "responseElements": null,
    "requestID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "eventID": "ff000af-00eb-00ce-0e00-ea000fb0fba0SAMPLE",
    "readOnly": true,
    "resources": [
        {
            "accountId": "111122223333",
            "type": "AWS::KMS::Key",
            "ARN": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-123456SAMPLE"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management"
}

RStudio on Amazon SageMaker


RStudio is an integrated development environment for R, with a console, a syntax-highlighting editor
that supports direct code execution, and tools for plotting, history, debugging, and workspace
management. Amazon SageMaker supports RStudio as a fully managed integrated development
environment (IDE) integrated with Amazon SageMaker Domain.

RStudio allows customers to create data science insights using an R environment. With RStudio
integration, you can launch an RStudio environment in the Domain to run your RStudio workflows on
SageMaker resources. For more information about RStudio, see the RStudio website.

Topics
• Region availability (p. 433)


• RStudio components (p. 434)


• Differences from RStudio Workbench (p. 434)
• Manage RStudio on Amazon SageMaker (p. 434)
• Use RStudio on Amazon SageMaker (p. 463)

SageMaker integrates RStudio through the creation of a RStudioServerPro app.

The following are supported by RStudio on SageMaker.

• R developers use the RStudio IDE interface with popular developer tools from the R ecosystem. Users
can launch new RStudio sessions, write R code, install dependencies from RStudio Package Manager,
and publish Shiny apps using RStudio Connect.
• R developers can quickly scale underlying compute resources to run large scale data processing and
statistical analysis.
• Platform administrators can set up user identities, authorization, networking, storage, and security
for their data science teams through AWS IAM Identity Center (successor to AWS Single Sign-On) and
AWS Identity and Access Management integration. This includes connection to private Amazon Virtual
Private Cloud (Amazon VPC) resources and internet-free mode with AWS PrivateLink.
• Integration with AWS License Manager.

For information on the onboarding steps to create a Domain with RStudio enabled, see Onboard to
Amazon SageMaker Domain (p. 37).

Region availability
The following table gives information about the AWS Regions that RStudio on SageMaker is supported
in.

Region name | Region
US East (Ohio) | us-east-2
US East (N. Virginia) | us-east-1
US West (N. California) | us-west-1
US West (Oregon) | us-west-2
Asia Pacific (Mumbai) | ap-south-1
Asia Pacific (Seoul) | ap-northeast-2
Asia Pacific (Singapore) | ap-southeast-1
Asia Pacific (Sydney) | ap-southeast-2
Asia Pacific (Tokyo) | ap-northeast-1
Canada (Central) | ca-central-1
Europe (Frankfurt) | eu-central-1
Europe (Ireland) | eu-west-1
Europe (London) | eu-west-2
Europe (Paris) | eu-west-3
Europe (Stockholm) | eu-north-1
South America (São Paulo) | sa-east-1

RStudio components
• RStudioServerPro: The RStudioServerPro app is a multiuser app that is a shared resource among
all user profiles in the Domain. Once an RStudio app is created in a Domain, the admin can give
permissions to users in the Domain.
• RStudio user: RStudio users are users within the Domain that are authorized to use the RStudio license.
• RStudio admin: An RStudio on Amazon SageMaker admin can access the RStudio administrative
dashboard. RStudio on Amazon SageMaker admins differ from "stock" RStudio Workbench admins
because they do not have root access to the instance running the RStudioServerPro app and can't
modify the RStudio configuration file.
• RStudio Server: The RStudio Server instance is responsible for serving the RStudio UI to all authorized
Users. This instance is launched on an Amazon SageMaker instance.
• RSession: An RSession is a browser-based interface to the RStudio IDE running on an Amazon
SageMaker instance. Users can create and interact with their RStudio projects through the RSession.
• RSessionGateway: The RSessionGateway app is used to support an RSession.
• RStudio administrative dashboard: This dashboard gives information on the RStudio users in the
Amazon SageMaker Domain and their sessions. This dashboard can only be accessed by users that have
RStudio admin authorization.

Differences from RStudio Workbench


RStudio on Amazon SageMaker has some significant differences from RStudio Workbench.

• When using RStudio on SageMaker, users don’t have access to the RStudio configuration files. Amazon
SageMaker manages the configuration file and sets defaults. You can modify the RStudio Connect and
RStudio Package Manager URLs when creating your RStudio-enabled Amazon SageMaker Domain.
• Project sharing, real-time collaboration, and Job Launcher are not currently supported when using
RStudio on Amazon SageMaker.
• When using RStudio on SageMaker, the RStudio IDE runs on Amazon SageMaker instances for on-
demand containerized compute resources.
• RStudio on SageMaker only supports the RStudio IDE and does not support other IDEs supported by
an RStudio Workbench installation.
• RStudio on SageMaker only supports the RStudio version specified in Upgrade the RStudio Version
(p. 436).

Manage RStudio on Amazon SageMaker


The following topics give information on managing RStudio on Amazon SageMaker. This includes
information on your RStudio environment configuration, user sessions, and necessary resources. For
information on how to use RStudio on SageMaker, see Use RStudio on Amazon SageMaker (p. 463).

For information about creating an Amazon SageMaker Domain with RStudio enabled, see Onboard to
Amazon SageMaker Domain (p. 37).


For information about the AWS Regions that RStudio on SageMaker is supported in, see Supported
Regions and Quotas (p. 33).

Topics
• RStudio license (p. 435)
• Upgrade the RStudio Version (p. 436)
• Network and Storage (p. 437)
• RStudioServerPro instance type (p. 437)
• RStudio Connect URL (p. 438)
• RStudio Package Manager (p. 438)
• Create an Amazon SageMaker Domain with RStudio using the AWS CLI (p. 439)
• Add RStudio support to an existing Domain (p. 443)
• Bring your own image to RStudio on SageMaker (p. 446)
• Manage users (p. 459)
• RStudio administrative dashboard (p. 460)
• Shut down and restart RStudio (p. 461)
• Manage billing and cost (p. 462)
• Diagnose issues and get support (p. 462)

RStudio license
RStudio on Amazon SageMaker is a paid product and requires that each user is appropriately licensed.
Licenses for RStudio on Amazon SageMaker may be obtained from RStudio PBC directly, or by
purchasing a subscription to RStudio Workbench on AWS Marketplace. For existing customers of RStudio
Workbench Enterprise, licenses are issued at no additional cost.

To use an RStudio license with Amazon SageMaker, you must first have a valid RStudio license registered
with AWS License Manager. Subscriptions to RStudio Workbench on AWS Marketplace automatically
trigger license creation with AWS License Manager. For licenses purchased directly through RStudio PBC,
a license grant for your AWS Account must be created. Contact RStudio for direct license purchases or to
enable existing licenses in AWS License Manager. For more information about registering a license with
AWS License Manager, see Seller issued licenses in AWS License Manager.

The following topics show how to acquire and validate a license granted by RStudio PBC.

Get an RStudio license

1. If you don't have an RStudio license, you may purchase one from the AWS Marketplace or from
RStudio PBC directly.

• To purchase a subscription from the AWS Marketplace, complete the steps in Subscribing to an
AMI product with contract pricing public offer by searching for Posit Workbench.
• To purchase from RStudio PBC directly, navigate to RStudio Pricing or contact [email protected].
When buying or updating an RStudio license, you must provide the AWS Account that will host
your Amazon SageMaker Domain.

If you have an existing RStudio license, contact your RStudio Sales representative or
[email protected] to add RStudio on Amazon SageMaker to your existing RStudio Workbench
Enterprise license, or to convert your RStudio Workbench Standard license. The RStudio Sales
representative will send you the appropriate electronic order form.
2. RStudio grants a RStudio Workbench license to your AWS Account through AWS License Manager in
the US East (N. Virginia) Region. Although the RStudio license is granted in the US East (N. Virginia)
Region, your license can be consumed in any AWS Region that RStudio on Amazon SageMaker is
supported in. You can expect the license grant process to complete within three business days after
you share your AWS account ID with RStudio.
3. When this license is granted, you receive an email from your RStudio Sales representative with
instructions to accept your license grant.

Validate your RStudio license to be used with Amazon SageMaker

1. Log into the AWS License Manager console in the same region as your Amazon SageMaker Domain.
If you are using AWS License Manager for the first time, AWS License Manager prompts you to grant
permission to use AWS License Manager.
2. Select Start using AWS License Manager.
3. Select I grant AWS License Manager the required permissions and select Grant
Permissions.
4. Navigate to Granted Licenses on the left panel.
5. Select the license grant with RSW-SageMaker as the Product name and select View.
6. From the license detail page, select Accept & activate license.

RStudio administrative dashboard

You can use the RStudio administrative dashboard to see the number of users on the license following
the steps in RStudio administrative dashboard (p. 460).

Upgrade the RStudio Version


In this guide, you'll get information about the latest version update for RStudio on SageMaker. You must
update your version to ensure continued functionality. To upgrade to the latest RStudio version on
Amazon SageMaker, complete the steps in Shut down and restart RStudio (p. 461).

Latest version updates


The updated RStudio version is 2022.02.2-485.pro2. This version supports:

• Enhanced R help system, introduced with R 4.2.0.


• Added editor support for the R pipe-bind placeholder (_).
• Encryption between RStudioServerPro and RSession applications. This update requires upgrading
existing RStudio-enabled SageMaker domains. For more information, see Upgrade Scenarios (p. 436).

All newly created SageMaker domains with RStudio and RSession support this new version. To use this
new version with existing SageMaker domains with RStudio, you must relaunch your RStudioServerPro
application. For more information about the changes in this release, see the RStudio Release Notes.

Upgrade Scenarios
All new RStudio applications are created using the 2022.02.2-485.pro2 release.
The RStudioServerPro application is deployed when the domain is created, and it persists unless it's
deleted. If you have an existing RStudio-enabled domain, you must upgrade the RStudioServerPro
application to support end-to-end encryption with new RSessions. If you create a new domain, you don't
need to upgrade.

Your upgrade status is one of the following:

• If you create a new domain with RStudio Enabled: RStudio applications for newly created domains
are created using the 2022.02.2-485.pro2 release and support end-to-end encryption. No further
action is required.


• You have a pre-existing SageMaker domain with RStudio and update the RStudioServerPro
App: This enables end-to-end encryption and requires no further changes for new RSessions. You must
delete and re-create your existing RSession applications.
• You have a pre-existing SageMaker domain with RStudio and do not update the RStudioServerPro
App: If you don’t update your application, there is a version mismatch with all new RSessions.
There may be functionality issues because of the version mismatch. Traffic encryption between
RStudioServerPro and RSession is also not available. We recommend that you update your
RStudioServerPro application to the new version.

Changes to BYOI Images


If you are using a BYOI image with RStudio, you must upgrade your custom images to use
the 2022.02.2-485.pro2 release and redeploy your existing RSessions. If you attempt to load a non-
compatible image in an RSession of a domain using the 2022.02.2-485.pro2 version, the RSession
fails because it cannot parse parameters that it receives. Your RSession will not function properly and will
not support end-to-end encryption. You must update all of the deployed custom images in your existing
RStudioServerPro app.

Network and Storage


The following topic describes network access and data storage considerations for your RStudio instance.
For general information about network access and data storage when using Amazon SageMaker,
see Data Protection in Amazon SageMaker (p. 3042).

Encryption

RStudio on Amazon SageMaker supports encryption at rest.

Use RStudio in VPC-only mode

RStudio in Amazon SageMaker supports AWS PrivateLink integration. With this integration, you can
use RStudio on SageMaker in VPC-only mode without direct access to the internet. When you use
RStudio in VPC-only mode, your security groups are automatically managed by the service. This includes
connectivity between your RServer and your RSessions.

The following are required to use RStudio in VPC-only mode. For more information on selecting a VPC,
see Choose an Amazon VPC (p. 46).

• A private subnet that either has access to the internet to make calls to Amazon SageMaker and AWS
License Manager, or Amazon Virtual Private Cloud (Amazon VPC) endpoints for both Amazon
SageMaker and AWS License Manager.
• The Domain cannot have any more than two associated Security Groups.
• A Security Group ID for use with the Domain in Domain Settings. This must allow all outbound access.
• A Security Group ID for use with the Amazon VPC endpoint. This security group must allow inbound
traffic from the Domain Security Group ID.
• Amazon VPC Endpoint for sagemaker.api and AWS License Manager. This must be in the same
Amazon VPC as the private subnet.

RStudioServerPro instance type


When deciding which Amazon EC2 instance type to use for your RStudioServerPro app, the main factor
to consider is bandwidth. Bandwidth is important because the RStudioServerPro instance is responsible
for serving the RStudio UI to all users. This includes UI heavy workflows, such as generating figures,
animations, and displaying many data rows. Therefore, there may be some UI performance degradation
depending on the workload across all users. The following are the available instance types to use for
your RStudioServerPro. For pricing information about these instances, see Amazon SageMaker Pricing.

• ml.t3.medium: This instance type is recommended for Domains with low UI use and is free to use.
• ml.c5.4xlarge: This instance type is recommended for Domains with moderate UI use.
• ml.c5.9xlarge: This instance type is recommended for Domains with heavy UI use.

Changing RStudio instance type

To change the instance type of your RStudioServerPro, pass the new instance type as part of a call to
the update-domain CLI command. You then need to delete the existing RStudioServerPro app using
the delete-app CLI command and create a new RStudioServerPro app using the create-app CLI
command.
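
The following is a hedged boto3 sketch of this flow. It assumes the shared RStudioServerPro app runs under the reserved domain-shared profile with the app name default; the domain ID and instance type are placeholders.

    import boto3

    sm = boto3.client("sagemaker", region_name="us-west-2")

    # 1. Update the default instance type for the RStudioServerPro app.
    sm.update_domain(
        DomainId="d-xxxxxxxxxxxx",
        DomainSettingsForUpdate={
            "RStudioServerProDomainSettings": {
                "DefaultResourceSpec": {"InstanceType": "ml.c5.4xlarge"}
            }
        },
    )

    # 2. Delete the existing RStudioServerPro app.
    sm.delete_app(
        DomainId="d-xxxxxxxxxxxx",
        UserProfileName="domain-shared",  # assumed shared profile for this app
        AppType="RStudioServerPro",
        AppName="default",
    )

    # 3. Re-create the app so it launches on the new instance type.
    sm.create_app(
        DomainId="d-xxxxxxxxxxxx",
        UserProfileName="domain-shared",
        AppType="RStudioServerPro",
        AppName="default",
    )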

RStudio Connect URL


RStudio Connect is a publishing platform for Shiny applications, R Markdown reports, dashboards, plots,
and more. RStudio Connect makes it easy to surface machine learning and data science insights by
making hosting content simple and scalable. If you have an RStudio Connect server, then you can set the
server as the default place where apps are published. For more information about RStudio Connect, see
RStudio Connect.

When you onboard to RStudio on Amazon SageMaker Domain, an RStudio Connect server is not created.
You can create an RStudio Connect server on an Amazon EC2 instance to use Connect with Amazon
SageMaker Domain. For information about how to set up your RStudio Connect server, see Host RStudio
Connect and Package Manager for ML development in RStudio on Amazon SageMaker.

Add an RStudio Connect URL

If you have an RStudio Connect URL, you can update the default URL so that your RStudio Users can
publish to it.

1. Navigate to the Domains page.


2. Select the desired Domain.
3. Choose Domain Settings.
4. Under General Settings, select Edit.
5. From the new page, select RStudio Settings on the left side.
6. Under RStudio Connect URL, enter the RStudio Connect URL to add.
7. Select Submit.

CLI

You can set a default RStudio Connect URL when you create your Domain. The only way to update
your RStudio Connect URL from the AWS CLI is to delete your Domain and create a new one with the
updated RStudio Connect URL.

RStudio Package Manager


RStudio Package Manager is a repository management server used to organize and centralize packages
across your organization. For more information on RStudio Package Manager, see RStudio Package
Manager. If you don't supply your own Package Manager URL, Amazon SageMaker Domain uses the
default Package Manager repository when you onboard RStudio following the steps in Onboard to
Amazon SageMaker Domain (p. 37). For more information, see Host RStudio Connect and Package
Manager for ML development in RStudio on Amazon SageMaker.


Update Package Manager URL

You can update the Package Manager URL used for your RStudio-enabled Domain as follows.

1. Navigate to the Domains page.


2. Select the desired Domain.
3. Choose Domain Settings.
4. Under General Settings, select Edit.
5. From the new page, select RStudio Settings on the left side.
6. Under RStudio Package Manager, enter your RStudio Package Manager URL.
7. Select Submit.

CLI

The only way to update your Package Manager URL from the AWS CLI is to delete your Domain and
create a new one with the updated Package Manager URL.

Create an Amazon SageMaker Domain with RStudio using the


AWS CLI
The following topic shows how to onboard to Amazon SageMaker Domain with RStudio enabled using
the AWS CLI. To onboard using the AWS Management Console, see Onboard to Amazon SageMaker
Domain (p. 37).

Prerequisites
• Install and configure AWS CLI version 2
• Configure the AWS CLI with IAM credentials

Create DomainExecution role


To launch the RStudio App, you must provide a DomainExecution role. This role is used to determine
whether RStudio needs to be launched as part of Amazon SageMaker Domain creation. This role is also
used by Amazon SageMaker to access the RStudio License and push RStudio logs.
Note
The DomainExecution role should have at least AWS License Manager permissions to access
RStudio License, and CloudWatch permissions to push logs in your account.

The following procedure shows how to create the DomainExecution role with the AWS CLI.

1. Create a file named assume-role-policy.json with the following content.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            }
        }
    ]
}

2. Create the DomainExecution role. <REGION> should be the AWS Region to launch your Domain in.

aws iam create-role --region <REGION> --role-name DomainExecution \
    --assume-role-policy-document file://assume-role-policy.json

3. Create a file named domain-setting-policy.json with the following content. This policy
allows the RStudioServerPro app to access necessary resources and allows Amazon SageMaker
to automatically launch an RStudioServerPro app when the existing RStudioServerPro app is in a
Deleted or Failed status.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "license-manager:ExtendLicenseConsumption",
                "license-manager:ListReceivedLicenses",
                "license-manager:GetLicense",
                "license-manager:CheckoutLicense",
                "license-manager:CheckInLicense",
                "logs:CreateLogDelivery",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DeleteLogDelivery",
                "logs:Describe*",
                "logs:GetLogDelivery",
                "logs:GetLogEvents",
                "logs:ListLogDeliveries",
                "logs:PutLogEvents",
                "logs:PutResourcePolicy",
                "logs:UpdateLogDelivery",
                "sagemaker:CreateApp"
            ],
            "Resource": "*"
        }
    ]
}

4. Create the Domain setting policy that is attached to the DomainExecution role. Note the
PolicyArn in the response; you will need to enter that ARN in the following step.

aws iam create-policy --region <REGION> --policy-name domain-setting-policy \
    --policy-document file://domain-setting-policy.json

5. Attach domain-setting-policy to the DomainExecution role. Use the PolicyArn returned in
the previous step.

aws iam attach-role-policy --role-name DomainExecution --policy-arn <POLICY_ARN>

Create Amazon SageMaker Domain with RStudio App

The RStudioServerPro app is launched automatically when you create an Amazon SageMaker Domain
using the create-domain CLI command with the RStudioServerProDomainSettings parameter
specified. When launching the RStudioServerPro app, Amazon SageMaker checks for a valid RStudio
license in the account and fails Domain creation if the license is not found.


The creation of an Amazon SageMaker Domain differs based on the authentication method and the
network type. These options must be used together, with one authentication method and one network
connection type selected. For more information about the requirements to create a new Domain, see
CreateDomain.

The following authentication methods are supported:

• IAM Auth
• SSO Auth

The following network connection types are supported:

• PublicInternet
• VPCOnly

Authentication methods

IAM Auth Mode

The following shows how to create an Amazon SageMaker Domain with RStudio enabled using the IAM
authentication mode. For more information about AWS Identity and Access Management, see What is
IAM?.

• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-
separated list of subnet IDs. For information about vpc-id and subnet-ids, see VPCs and subnets.
• RStudioPackageManagerUrl and RStudioConnectUrl are optional and should be set to the URLs
of your RStudio Package Manager and RStudio Connect server, respectively.
• app-network-access-type should be either PublicInternetOnly or VPCOnly.

aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode IAM \
    --default-user-settings ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings RStudioServerProDomainSettings={RStudioPackageManagerUrl=<PACKAGE_MANAGER_URL>,RStudioConnectUrl=<CONNECT_URL>,DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --app-network-access-type <NETWORK_ACCESS_TYPE>

Authentication using IAM Identity Center

The following shows how to create an Amazon SageMaker Domain with RStudio enabled using the SSO
authentication mode. AWS IAM Identity Center (successor to AWS Single Sign-On) must be enabled in
the Region in which the domain is launched. For more information about IAM Identity Center, see What is
AWS IAM Identity Center (successor to AWS Single Sign-On)?.

• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-
separated list of subnet IDs. For information about vpc-id and subnet-ids, see VPCs and subnets.
• RStudioPackageManagerUrl and RStudioConnectUrl are optional and should be set to the URLs
of your RStudio Package Manager and RStudio Connect server, respectively.


• app-network-access-type should be either PublicInternetOnly or VPCOnly.

aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode SSO \
    --default-user-settings ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings RStudioServerProDomainSettings={RStudioPackageManagerUrl=<PACKAGE_MANAGER_URL>,RStudioConnectUrl=<CONNECT_URL>,DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --app-network-access-type <NETWORK_ACCESS_TYPE>

Connection types

PublicInternet/Direct Internet network type

The following shows how to create an Amazon SageMaker Domain with RStudio enabled and a
PublicInternet network type.

• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-
separated list of subnet IDs. For information about vpc-id and subnet-ids, see VPCs and subnets.
• RStudioPackageManagerUrl and RStudioConnectUrl are optional and should be set to the URLs
of your RStudio Package Manager and RStudio Connect server, respectively.
• auth-mode should be either SSO or IAM.

aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode <AUTH_MODE> \
    --default-user-settings ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings RStudioServerProDomainSettings={RStudioPackageManagerUrl=<PACKAGE_MANAGER_URL>,RStudioConnectUrl=<CONNECT_URL>,DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids <SUBNET_IDS> \
    --app-network-access-type PublicInternetOnly

VPCOnly mode

The following shows how to launch an Amazon SageMaker Domain with RStudio enabled and a VPCOnly
network type. For more information about using the VPCOnly network access type, see Connect
SageMaker Studio Notebooks in a VPC to External Resources (p. 3209).

• DomainExecutionRoleArn should be the ARN for the role created in the previous step.
• ExecutionRole is the ARN of the role given to users in the Amazon SageMaker Domain.
• vpc-id should be the ID of your Amazon Virtual Private Cloud. subnet-ids should be a space-
separated list of subnet IDs. Your private subnets must either be able to access the internet, to make
calls to Amazon SageMaker and AWS License Manager, or have Amazon VPC endpoints for both
Amazon SageMaker and AWS License Manager. For information about Amazon VPC endpoints, see
Interface Amazon VPC endpoints. For information about vpc-id and subnet-ids, see VPCs and
subnets.
• SecurityGroups must allow outbound access to the Amazon SageMaker and AWS License Manager
endpoints.
• auth-mode should be either SSO or IAM.


Note
When using Amazon Virtual Private Cloud endpoints, the security group attached to your
Amazon Virtual Private Cloud endpoints must allow inbound traffic from the security group you
pass as part of the domain-setting parameter of the create-domain CLI call.

With RStudio, Amazon SageMaker manages security groups for you. This means that Amazon SageMaker
manages security group rules to ensure RSessions can access RStudioServerPro Apps. Amazon
SageMaker creates one security group rule per user profile.

aws sagemaker create-domain --region <REGION> --domain-name <DOMAIN_NAME> \
    --auth-mode <AUTH_MODE> \
    --default-user-settings SecurityGroups=<USER_SECURITY_GROUP>,ExecutionRole=<DEFAULT_USER_EXECUTIONROLE> \
    --domain-settings SecurityGroupIds=<DOMAIN_SECURITY_GROUP>,RStudioServerProDomainSettings={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
    --vpc-id <VPC_ID> \
    --subnet-ids "<SUBNET_IDS>" \
    --app-network-access-type VPCOnly --app-security-group-management Service

Note: The RStudioServerPro app is launched by a special user profile named domain-shared. As a
result, this app is not returned as part of list-app API calls by any other user profiles.

You may have to increase the Amazon VPC quota in your account to increase the number of users. For
more information, see Amazon VPC quotas.

Verify Domain creation


Use the following command to verify that your Domain has been created with a Status
of InService. Your domain-id is appended to the Domain's ARN. For example,
arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:domain/<DOMAIN_ID>.

aws sagemaker describe-domain --domain-id <DOMAIN_ID> --region <REGION>
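
If you script the onboarding, you can query just the status field instead of reading the full response.
The following is a minimal sketch using the CLI's built-in --query filter.

# Print only the Domain status; returns InService when the Domain is ready.
aws sagemaker describe-domain --domain-id <DOMAIN_ID> --region <REGION> \
    --query "Status" --output text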

Add RStudio support to an existing Domain


If you have added an RStudio License through AWS License Manager, you can create a new Amazon
SageMaker Domain with support for RStudio on SageMaker. If you have an existing Domain that does
not support RStudio, you can add RStudio support to that Domain without having to delete and recreate
the Domain.

The following topic outlines how to add this support.

Prerequisites
You must complete the following steps before you update your current Domain to add support for
RStudio on SageMaker.

• Install and configure AWS CLI version 2


• Configure the AWS CLI with IAM credentials
• Create a Domain execution role following the steps in Create a SageMaker Domain with RStudio
using the AWS CLI. This Domain-level IAM role is required by the RStudioServerPro app. The role
requires access to AWS License Manager for verifying a valid RStudio Workbench license and Amazon
CloudWatch Logs for publishing server logs.
• Bring your RStudio license to AWS License Manager following the steps in RStudio license.
• (Optional) If you want to use RStudio in VPCOnly mode, complete the steps in RStudio in VPC-Only.
• Ensure that the security groups you have configured for each UserProfile in your Domain meet the
account-level quotas. When configuring the default user profile during Domain creation, you can use
the DefaultUserSettings parameter of the CreateDomain API to add SecurityGroups that are
inherited by all the user profiles created in the Domain. You can also provide additional security groups
for a specific user as part of the UserSettings parameter of the CreateUserProfile API. If you have
added security groups this way, you must ensure that the total number of security groups per user
profile doesn’t exceed the maximum quota of 2 in VPCOnly mode and 4 in PublicInternetOnly
mode. If the resulting total number of security groups for any user profile exceeds the quota, you can
combine multiple security groups’ rules into one security group.

Add RStudio support to an existing Domain


After you have completed the prerequisites, you can add RStudio support to your existing Domain. The
following steps outline how to update your existing Domain to add support for RStudio.

Step 1: Delete all apps in the Domain

To add support for RStudio in your Domain, SageMaker must update the underlying security groups for
all existing user profiles. To complete this, you must delete and recreate all existing apps in the Domain.
The following procedure shows how to delete all of the apps.

1. List all of the apps in the Domain.

aws sagemaker \
list-apps \
--domain-id-equals <DOMAIN_ID>

2. Delete each app for each user profile in the Domain. A scripted alternative that loops over all apps
follows the commands below.

// JupyterServer apps
aws sagemaker \
delete-app \
--domain-id <DOMAIN_ID> \
--user-profile-name <USER_PROFILE> \
--app-type JupyterServer \
--app-name <APP_NAME>

// KernelGateway apps
aws sagemaker \
delete-app \
--domain-id <DOMAIN_ID> \
--user-profile-name <USER_PROFILE> \
--app-type KernelGateway \
--app-name <APP_NAME>
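
If the Domain has many user profiles and apps, deleting them one at a time is tedious. The following
is a sketch that automates the loop; it assumes the jq tool is installed and that every InService app
in the Domain should be deleted.

# Delete every InService app in the Domain (sketch; requires jq).
DOMAIN_ID=<DOMAIN_ID>
aws sagemaker list-apps --domain-id-equals "$DOMAIN_ID" \
    --query "Apps[?Status=='InService']" --output json |
jq -c '.[]' | while read -r app; do
    aws sagemaker delete-app \
        --domain-id "$DOMAIN_ID" \
        --user-profile-name "$(echo "$app" | jq -r '.UserProfileName')" \
        --app-type "$(echo "$app" | jq -r '.AppType')" \
        --app-name "$(echo "$app" | jq -r '.AppName')"
done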

Step 2: Update all user profiles with the new list of security groups

This is a one-time action that you must complete for all of the existing user profiles in your Domain
when you have refactored your existing security groups. This prevents you from hitting the quota for the
maximum number of security groups. The UpdateUserProfile API call fails if the user has any apps
that are in InService status. Delete all apps, then call UpdateUserProfile API to update the security
groups.
Note
The following requirement for VPCOnly mode outlined in Connect Amazon SageMaker Studio
Notebooks in a VPC to External Resources is no longer needed when adding RStudio support
because AppSecurityGroupManagement is managed by the SageMaker service:
“TCP traffic within the security group. This is required for connectivity between the
JupyterServer app and the KernelGateway apps. You must allow access to at least ports in the
range 8192-65535.”


aws sagemaker \
update-user-profile \
--domain-id <DOMAIN_ID> \
--user-profile-name <USER_PROFILE> \
--user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\", \"<SECURITY_GROUP>\"]}"

Step 3: Activate RStudio by calling the UpdateDomain API

1. Call the UpdateDomain API to add support for RStudio on SageMaker. The default-user-settings
parameter is only needed if you have refactored the default security groups for your user profiles.

• For VPCOnly mode:

aws sagemaker \
    update-domain \
    --domain-id <DOMAIN_ID> \
    --app-security-group-management Service \
    --domain-settings-for-update RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
    --default-user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\", \"<SECURITY_GROUP>\"]}"

• For PublicInternetOnly mode:

aws sagemaker \
    update-domain \
    --domain-id <DOMAIN_ID> \
    --domain-settings-for-update RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
    --default-user-settings "{\"SecurityGroups\": [\"<SECURITY_GROUP>\", \"<SECURITY_GROUP>\"]}"

2. Verify that the Domain status is InService. After the Domain status is InService, support for
RStudio on SageMaker is added.

aws sagemaker \
describe-domain \
--domain-id <DOMAIN_ID>

3. Verify that the RStudioServerPro app’s status is InService using the following command.

aws sagemaker list-apps --user-profile-name-equals domain-shared

Step 4: Add RStudio access for existing users

As part of the update in Step 3, SageMaker marks the RStudio AccessStatus of all existing user profiles
in the Domain as DISABLED by default. This prevents exceeding the number of users allowed by your
current license. To add access for existing users, there is a one-time opt-in step. Perform the opt-in by
calling the UpdateUserProfile API with the following RStudioServerProAppSettings:

• AccessStatus = ENABLED
• Optional - UserGroup = R_STUDIO_USER or R_STUDIO_ADMIN

aws sagemaker \
    update-user-profile \
    --domain-id <DOMAIN_ID> \
    --user-profile-name <USER_PROFILE> \
    --user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"ENABLED\"}}"

Note
By default, the number of users that can have access to RStudio is 60.
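
To set the optional user group in the same call, you can extend the user settings JSON. The following
is a sketch that assumes you want to make the user an RStudio admin.

# Enable RStudio access and assign the R_STUDIO_ADMIN user group in one call.
aws sagemaker \
    update-user-profile \
    --domain-id <DOMAIN_ID> \
    --user-profile-name <USER_PROFILE> \
    --user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"ENABLED\", \"UserGroup\": \"R_STUDIO_ADMIN\"}}"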

Step 5: Deactivate RStudio access for new users

Unless otherwise specified when calling UpdateDomain, RStudio support is added by default for all
new user profiles created after you have added support for RStudio on SageMaker. To deactivate access
for a new user profile, you must explicitly set the AccessStatus parameter to DISABLED as part of
the CreateUserProfile API call. If the AccessStatus parameter is not specified as part of the
CreateUserProfile API, the default access status is ENABLED.

aws sagemaker \
create-user-profile \
--domain-id <DOMAIN_ID> \
--user-profile-name <USER_PROFILE> \
--user-settings "{\"RStudioServerProAppSettings\": {\"AccessStatus\": \"DISABLED\"}}"

Bring your own image to RStudio on SageMaker


A SageMaker image is a file that identifies language packages and other dependencies that are required
to run RStudio on Amazon SageMaker. SageMaker uses these images to create an environment where
you run RStudio. Amazon SageMaker provides a built-in RStudio image for you to use. If you need
different functionality, you can bring your own custom images.

The process to bring your own image to use with RStudio on SageMaker takes three steps:

1. Build a custom image from a Dockerfile and push it to a repository in Amazon Elastic Container
Registry (Amazon ECR).
2. Create a SageMaker image that points to a container image in Amazon ECR and attach it to your
Amazon SageMaker Domain.
3. Launch a new session in RStudio with your custom image.

You can create images and image versions, and attach image versions to your Domain, using the
SageMaker control panel, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS
CLI). You can also create images and image versions using the SageMaker console, even if you haven't
onboarded to a Domain.

The following topics show how to bring your own image to RStudio on SageMaker by creating, attaching,
and launching a custom image.

Key terminology
The following section defines key terms for bringing your own image to use with RStudio on SageMaker.

• Dockerfile: A Dockerfile is a file that identifies the language packages and other dependencies for your
Docker image.
• Docker image: The Docker image is a built Dockerfile. This image is checked into Amazon ECR and
serves as the basis of the SageMaker image.
• SageMaker image: A SageMaker image is a holder for a set of SageMaker image versions based on
Docker images.


• Image version: An image version of a SageMaker image represents a Docker image that is compatible
with RStudio and stored in an Amazon ECR repository. Each image version is immutable. These image
versions can be attached to a domain and used with RStudio on SageMaker.

Prerequisites
You must complete the following prerequisites before bringing your own image to use with RStudio on
Amazon SageMaker.

• If you have an existing Domain with RStudio that was created before April 7, 2022, you must delete
your RStudioServerPro application and recreate it. For information about how to delete an application,
see Shut down and Update SageMaker Studio (p. 199).
• Install the Docker application. For information about setting up Docker, see Orientation and setup.
• Create a local copy of an RStudio-compatible Dockerfile that works with SageMaker. For information
about creating a sample RStudio dockerfile, see Use a custom image to bring your own development
environment to RStudio on Amazon SageMaker.
• Use an AWS Identity and Access Management execution role that has the AmazonSageMakerFullAccess
policy attached. If you have onboarded to Domain, you can get the role from the Domain Summary
section of the SageMaker control panel.

Add the following permissions to access the Amazon Elastic Container Registry (Amazon ECR) service
to your execution role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:BatchGetImage",
                "ecr:CompleteLayerUpload",
                "ecr:DescribeImages",
                "ecr:DescribeRepositories",
                "ecr:UploadLayerPart",
                "ecr:ListImages",
                "ecr:InitiateLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:PutImage"
            ],
            "Resource": "*"
        }
    ]
}

• Install and configure AWS CLI with the following (or higher) version. For information about installing
the AWS CLI, see Installing or updating the latest version of the AWS CLI.

AWS CLI v1 >= 1.23.6
AWS CLI v2 >= 2.6.2

Custom RStudio image specifications


In this guide, you'll learn custom RStudio image specifications to use when you bring your own image.
There are two sets of requirements that you must satisfy with your custom RStudio image to use it with
Amazon SageMaker. These requirements are imposed by RStudio PBC and the Amazon SageMaker Studio
platform. If either of these sets of requirements isn't satisfied, then your custom image won't function
properly.

RStudio PBC requirements

RStudio PBC requirements are laid out in the Using Docker images with RStudio Workbench / RStudio
Server Pro, Launcher, and Kubernetes article. Follow the instructions in this article to create the base of
your custom RStudio image.

For instructions about how to install multiple R versions in your custom image, see Installing multiple
versions of R on Linux.

Amazon SageMaker Studio requirements

Amazon SageMaker Studio imposes the following set of installation requirements for your RStudio
image.

• You must use an RStudio base image of at least 2022.02.2-485.pro2. For more information, see
Upgrade the RStudio Version (p. 436).
• You must install the following packages:

yum install -y sudo \
    openjdk-11-jdk \
    libpng-dev \
    && yum clean all \
    && /opt/R/${R_VERSION}/bin/R -e "install.packages('reticulate', repos='https://packagemanager.rstudio.com/cran/__linux__/centos7/latest')" \
    && /opt/python/${PYTHON_VERSION}/bin/pip install --upgrade \
    'boto3>1.0<2.0' \
    'awscli>1.0<2.0' \
    'sagemaker[local]<3'

• You must provide default values for the RSTUDIO_CONNECT_URL and
RSTUDIO_PACKAGE_MANAGER_URL environment variables.

ENV RSTUDIO_CONNECT_URL "YOUR_CONNECT_URL"
ENV RSTUDIO_PACKAGE_MANAGER_URL "YOUR_PACKAGE_MANAGER_URL"
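
Taken together, these requirements suggest the following Dockerfile skeleton. This is a minimal
sketch, not a tested image definition: the base image tag, the R_VERSION and PYTHON_VERSION
values, and both URLs are placeholder assumptions that you must replace with values that are valid
for your environment.

# Sketch only: the base image tag and version arguments are assumptions.
FROM rstudio/r-session-complete:centos7-2022.02.2-485.pro2

ARG R_VERSION=4.1.3
ARG PYTHON_VERSION=3.9.5

# Packages required by Amazon SageMaker Studio (see the list above).
RUN yum install -y sudo openjdk-11-jdk libpng-dev \
    && yum clean all \
    && /opt/R/${R_VERSION}/bin/R -e "install.packages('reticulate', repos='https://packagemanager.rstudio.com/cran/__linux__/centos7/latest')" \
    && /opt/python/${PYTHON_VERSION}/bin/pip install --upgrade \
        'boto3>1.0<2.0' \
        'awscli>1.0<2.0' \
        'sagemaker[local]<3'

# Default values required by SageMaker; replace with your own URLs.
ENV RSTUDIO_CONNECT_URL "YOUR_CONNECT_URL"
ENV RSTUDIO_PACKAGE_MANAGER_URL "YOUR_PACKAGE_MANAGER_URL"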

The following general specifications apply to the image that is represented by an RStudio image version.

Running the image

ENTRYPOINT and CMD instructions are overridden so that the image is run as an RSession
application.
Stopping the image

The DeleteApp API issues the equivalent of a docker stop command. Other processes in the
container won’t get the SIGKILL/SIGTERM signals.
File system

The /opt/.sagemakerinternal and /opt/ml directories are reserved. Any data in these
directories might not be visible at runtime.
User data

Each user in a SageMaker domain gets a user directory on a shared Amazon Elastic File System
volume in the image. The location of the current user’s directory on the Amazon Elastic File System
volume is /home/sagemaker-user.


Metadata

A metadata file is located at /opt/ml/metadata/resource-metadata.json. No additional
environment variables are added to the variables defined in the image. For more information, see
Get App Metadata (p. 156).
GPU

On a GPU instance, the image is run with the --gpus option. Only the CUDA toolkit should be
included in the image, not the NVIDIA drivers. For more information, see NVIDIA User Guide.
Metrics and logging

Logs from the RSession process are sent to Amazon CloudWatch in the customer’s account. The
name of the log group is /aws/sagemaker/studio. The name of the log stream is $domainID/
$userProfileName/RSession/$appName.
Image size

Image size is limited to 25 GB. To view the size of your image, run docker image ls.

Create a custom RStudio image


This topic describes how you can create a custom RStudio image using the SageMaker console and the
AWS CLI. If you use the AWS CLI, you must run the steps from your local machine. The following steps do
not work from within Amazon SageMaker Studio.

When you create an image, SageMaker also creates an initial image version. The image version
represents a container image in Amazon Elastic Container Registry (ECR). The container image must
satisfy the requirements to be used in RStudio. For more information, see Custom RStudio image
specifications (p. 447).

For information about testing your image locally and resolving common issues, see the SageMaker
Studio Custom Image Samples repo.

Topics
• Add a SageMaker-compatible RStudio Docker container image to Amazon ECR (p. 449)
• Create a SageMaker image from the console (p. 450)
• Create an image from the AWS CLI (p. 451)

Add a SageMaker-compatible RStudio Docker container image to Amazon ECR


Use the following steps to add a Docker container image to Amazon ECR:

• Create an Amazon ECR repository.


• Authenticate to Amazon ECR.
• Build a SageMaker-compatible RStudio Docker image.
• Push the image to the Amazon ECR repository.

Note
The Amazon ECR repository must be in the same AWS Region as your domain.

To build and add a Docker image to Amazon ECR

1. Create an Amazon ECR repository using the AWS CLI. To create the repository using the Amazon ECR
console, see Creating a repository.

aws ecr create-repository \
    --repository-name rstudio-custom \
    --image-scanning-configuration scanOnPush=true

Response:

{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-east-2:acct-id:repository/rstudio-custom",
        "registryId": "acct-id",
        "repositoryName": "rstudio-custom",
        "repositoryUri": "acct-id.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom",
        ...
    }
}
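
If you script these steps, you can look up the repository URI with describe-repositories instead of
copying it from the response. A minimal sketch:

# Capture the repository URI for the docker login, tag, and push steps below.
REPO_URI=$(aws ecr describe-repositories \
    --repository-names rstudio-custom \
    --query "repositories[0].repositoryUri" --output text)
echo "$REPO_URI"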

2. Authenticate to Amazon ECR using the repository URI returned as a response from the create-
repository command. Make sure that the Docker application is running. For more information, see
Registry Authentication.

aws ecr get-login-password | \
    docker login --username AWS --password-stdin <repository-uri>

Response:

Login Succeeded

3. Build the Docker image. Run the following command from the directory that includes your
Dockerfile.

docker build .
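
As a convenience, you can tag the image at build time instead of tagging it in a separate step; the
placeholders are the same as in the steps that follow.

# Alternative: build and tag in one command.
docker build -t <repository-uri>:<tag> .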

4. Tag your built image with a unique tag.

docker tag <image-id> "<repository-uri>:<tag>"

5. Push the container image to the Amazon ECR repository. For more information, see ImagePush and
Pushing an image.

docker push <repository-uri>:<tag>

Response:

The push refers to repository [<account-id>.dkr.ecr.us-east-2.amazonaws.com/rstudio-custom]
r: digest: <digest> size: 3066

Create a SageMaker image from the console

To create an image

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Images.
3. On the Custom images page, choose Create image.
4. For Image source, enter the registry path to the container image in Amazon ECR. The path is in the
following format:


acct-id.dkr.ecr.region.amazonaws.com/repo-name[:tag] or [@digest]
5. Choose Next.
6. Under Image properties, enter the following:

• Image name – The name must be unique to your account in the current AWS Region.
• (Optional) Image display name – The name displayed in the domain user interface. When not
provided, Image name is displayed.
• (Optional) Description – A description of the image.
• IAM role – The role must have the AmazonSageMakerFullAccess policy attached. Use the
dropdown menu to choose one of the following options:
• Create a new role – Specify any additional Amazon Simple Storage Service (Amazon S3) buckets
that you want your notebook users to access. If you don't want to allow access to additional
buckets, choose None.

SageMaker attaches the AmazonSageMakerFullAccess policy to the role. The role allows
your notebook users to access the Amazon S3 buckets listed next to the check marks.
• Enter a custom IAM role ARN – Enter the Amazon Resource Name (ARN) of your IAM role.
• Use existing role – Choose one of your existing roles from the list.
• (Optional) Image tags – Choose Add new tag. You can add up to 50 tags. Tags are searchable
using the SageMaker console or the SageMaker Search API.
7. Under Image type, select RStudio image.
8. Choose Submit.

The new image is displayed in the Custom images list and briefly highlighted. After the image has been
successfully created, you can choose the image name to view its properties or choose Create version to
create another version.

To create another image version

1. Choose Create version on the same row as the image.


2. For Image source, enter the registry path to the Amazon ECR image. The image shouldn't be the
same image as used in a previous version of the SageMaker image.

To use the custom image in RStudio, you must attach it to your domain. For more information, see
Attach a custom SageMaker image (p. 453).

Create an image from the AWS CLI

This section shows how to create a custom Amazon SageMaker image using the AWS CLI.

Use the following steps to create a SageMaker image:

• Create an Image.
• Create an ImageVersion.
• Create a configuration file.
• Create an AppImageConfig.

To create the SageMaker image entities

1. Create a SageMaker image. The role ARN must have at least the
AmazonSageMakerFullAccess policy attached.


aws sagemaker create-image \
    --image-name rstudio-custom-image \
    --role-arn arn:aws:iam::<acct-id>:role/service-role/<execution-role>

Response:

{
    "ImageArn": "arn:aws:sagemaker:us-east-2:acct-id:image/rstudio-custom-image"
}

2. Create a SageMaker image version from the image. Pass the unique tag value that you chose when
you pushed the image to Amazon ECR.

aws sagemaker create-image-version \
    --image-name rstudio-custom-image \
    --base-image <repository-uri>:<tag>

Response:

{
    "ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/rstudio-custom-image/1"
}

3. Check that the image version was successfully created.

aws sagemaker describe-image-version \
    --image-name rstudio-custom-image \
    --version 1

Response:

{
    "ImageVersionArn": "arn:aws:sagemaker:us-east-2:acct-id:image-version/rstudio-custom-image/1",
    "ImageVersionStatus": "CREATED"
}

Note
If the response is "ImageVersionStatus": "CREATED_FAILED", the response
also includes the failure reason. A permissions issue is a common cause of failure. You
also can check your Amazon CloudWatch Logs. The name of the log group is /aws/
sagemaker/studio. The name of the log stream is $domainID/$userProfileName/
KernelGateway/$appName.
4. Create a configuration file named app-image-config-input.json. The app image config
specifies the configuration for running a SageMaker image as an app.

{
    "AppImageConfigName": "rstudio-custom-config"
}

5. Create the AppImageConfig using the file that you created in the previous step.

aws sagemaker create-app-image-config \
    --cli-input-json file://app-image-config-input.json


Response:

{
    "AppImageConfigArn": "arn:aws:sagemaker:us-east-2:acct-id:app-image-config/rstudio-custom-config"
}

Attach a custom SageMaker image


This guide shows how to attach a custom RStudio image to your Amazon SageMaker Domain using the
SageMaker console or the AWS Command Line Interface (AWS CLI).

To use a custom SageMaker image, you must attach a custom RStudio image to your Domain. When
you attach an image version, it appears in the RStudio Launcher and is available in the Select image
dropdown list. You use the dropdown to change the image used by RStudio.

There is a limit to the number of image versions that you can attach. After you reach the limit, you must
first detach a version so that you can attach a different version of the image.

Topics
• Attach an image version to your Domain using the console (p. 453)
• Attach an existing image version to your Domain using the AWS CLI (p. 454)

Attach an image version to your Domain using the console

You can attach a custom SageMaker image version to your Domain using the SageMaker console's
control panel. You can also create a custom SageMaker image and an image version, and then attach
that version to your Domain.

To attach an existing image

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Domains.
3. Select the desired Domain.
4. Choose Environment.
5. Under Custom SageMaker Studio images attached to domain, choose Attach image.
6. For Image source, choose Existing image or New image.

If you select Existing image, choose an image from the Amazon SageMaker image store.

If you select New image, provide the Amazon ECR registry path for your Docker image. The path
must be in the same AWS Region as the Domain. The Amazon ECR repo must be in the same account
as your Domain, or cross-account permissions for SageMaker must be enabled.
7. Choose an existing image from the list.
8. Choose a version of the image from the list.
9. Choose Next.
10. Enter values for Image name, Image display name, and Description.
11. Choose the IAM role. For more information, see Create a custom RStudio image (p. 449).
12. (Optional) Add tags for the image.
13. (Optional) Choose Add new tag, then add a configuration tag.
14. For Image type, select RStudio Image.
15. Choose Submit.


Wait for the image version to be attached to the Domain. After the version is attached, it appears in the
Custom images list and is briefly highlighted.

Attach an existing image version to your Domain using the AWS CLI

Two methods are presented to attach the image version to your Domain using the AWS CLI. In the first
method, you create a new Domain with the version attached. This method is simpler, but you must
specify the Amazon Virtual Private Cloud (Amazon VPC) information and execution role that are required to
create the Domain.

If you have already onboarded to the Domain, you can use the second method to attach the image
version to your current Domain. In this case, you don't need to specify the Amazon VPC information and
execution role. After you attach the version, delete all of the applications in your Domain and relaunch
RStudio.

Attach the SageMaker image to a new Domain

To use this method, you must specify an execution role that has the AmazonSageMakerFullAccess policy
attached.

Use the following steps to create the Domain and attach the custom SageMaker image:

• Get your default VPC ID and subnet IDs.


• Create the configuration file for the Domain, which specifies the image.
• Create the Domain with the configuration file.

To add the custom SageMaker image to your Domain

1. Get your default VPC ID.

aws ec2 describe-vpcs \
    --filters Name=isDefault,Values=true \
    --query "Vpcs[0].VpcId" --output text

Response:

vpc-xxxxxxxx

2. Get your default subnet IDs using the VPC ID from the previous step.

aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=<vpc-id> \
    --query "Subnets[*].SubnetId" --output json

Response:

[
    "subnet-b55171dd",
    "subnet-8a5f99c6",
    "subnet-e88d1392"
]

3. Create a configuration file named create-domain-input.json. Insert the VPC ID, subnet IDs,
ImageName, and AppImageConfigName from the previous steps. Because ImageVersionNumber
isn't specified, the latest version of the image is used, which is the only version in this case. Your
execution role must satisfy the requirements in Prerequisites (p. 447).


{
    "DomainName": "domain-with-custom-r-image",
    "VpcId": "<vpc-id>",
    "SubnetIds": [
        "<subnet-ids>"
    ],
    "DomainSettings": {
        "RStudioServerProDomainSettings": {
            "DomainExecutionRoleArn": "<execution-role>"
        }
    },
    "DefaultUserSettings": {
        "ExecutionRole": "<execution-role>",
        "RSessionAppSettings": {
            "CustomImages": [
                {
                    "AppImageConfigName": "rstudio-custom-config",
                    "ImageName": "rstudio-custom-image"
                }
            ]
        }
    },
    "AuthMode": "IAM"
}

4. Create the Domain with the attached custom SageMaker image.

aws sagemaker create-domain \
    --cli-input-json file://create-domain-input.json

Response:

{
    "DomainArn": "arn:aws:sagemaker:region:acct-id:domain/domain-id",
    "Url": "https://domain-id.studio.region.sagemaker.aws/..."
}

Attach the SageMaker image to an existing Domain

This method assumes that you've already onboarded to Domain. For more information, see Onboard to
Amazon SageMaker Domain (p. 37).
Note
You must delete all of the applications in your Domain to update the Domain with the new
image version. For information about deleting these applications, see Delete an Amazon
SageMaker Domain (p. 116).

Use the following steps to add the SageMaker image to your current Domain.

• Get your DomainID from the SageMaker console.


• Use the DomainID to get the DefaultUserSettings for the Domain.
• Add the ImageName and AppImageConfig as a CustomImage to the DefaultUserSettings.
• Update your Domain to include the custom image.

To add the custom SageMaker image to your Domain

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. From the left navigation pane, choose Domains.


3. Select the desired Domain.
4. Choose Domain settings.
5. Under General Settings, find the Domain ID. The ID is in the following format: d-xxxxxxxxxxxx.
6. Use the Domain ID to get the description of the Domain.

aws sagemaker describe-domain \
    --domain-id <d-xxxxxxxxxxxx>

Response:

{
    "DomainId": "d-xxxxxxxxxxxx",
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": {
            "CustomImages": [
            ],
            ...
        }
    }
}

7. Save the DefaultUserSettings section of the response to a file named update-domain-input.json.
8. Insert the ImageName and AppImageConfigName from the previous steps as a custom image.
Because ImageVersionNumber isn't specified, the latest version of the image is used, which is the
only version in this case.

{
    "DefaultUserSettings": {
        "RSessionAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "rstudio-custom-image",
                    "AppImageConfigName": "rstudio-custom-config"
                }
            ]
        }
    }
}

9. Use the Domain ID and default user settings file to update your Domain.

aws sagemaker update-domain \
    --domain-id <d-xxxxxxxxxxxx> \
    --cli-input-json file://update-domain-input.json

Response:

{
    "DomainArn": "arn:aws:sagemaker:region:acct-id:domain/domain-id"
}

10. Delete the RStudioServerPro application. You must restart the RStudioServerPro domain-
shared application for the RStudio Launcher UI to pick up the latest changes.

aws sagemaker delete-app \
    --domain-id <d-xxxxxxxxxxxx> --user-profile-name domain-shared \
    --app-type RStudioServerPro --app-name default

11. Create a new RStudioServerPro application. You must create this application using the AWS CLI.

aws sagemaker create-app \
    --domain-id <d-xxxxxxxxxxxx> --user-profile-name domain-shared \
    --app-type RStudioServerPro --app-name default
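
You can confirm that the shared app came back up before users reconnect. A minimal sketch:

# Check the recreated RStudioServerPro app; the status is InService when ready.
aws sagemaker describe-app \
    --domain-id <d-xxxxxxxxxxxx> --user-profile-name domain-shared \
    --app-type RStudioServerPro --app-name default \
    --query "Status" --output text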

Launch a custom SageMaker image in RStudio


You can use your custom image when launching an RStudio application from the console. After you
create your custom SageMaker image and attach it to your domain, the image appears in the image
selector dialog box of the RStudio Launcher. To launch a new RStudio app, follow the steps in Open
RStudio Launcher and launch RSessions (p. 464) and select your custom image.

Clean up image resources


This guide shows how to clean up RStudio image resources that you created in the previous sections. To
delete an image, complete the following steps using either the SageMaker console or the AWS CLI, as
shown in this guide.

• Detach the image and image versions from your Amazon SageMaker Domain.
• Delete the image, image version, and app image config.

After you've completed these steps, you can delete the container image and repository from Amazon
ECR. For more information about how to delete the container image and repository, see Deleting a
repository.

Clean up resources from the SageMaker console

When you detach an image from a Domain, all versions of the image are detached. When an image is
detached, all users of the Domain lose access to the image versions.

To detach an image

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. From the left navigation pane, choose Domains.
3. Select the desired Domain.


4. Choose Environment.
5. Under Custom images attached to domain, choose the image and then choose Detach.
6. (Optional) To delete the image and all versions from SageMaker, select Also delete the selected
images .... This does not delete the associated images from Amazon ECR.
7. Choose Detach.

Clean up resources from the AWS CLI

To clean up resources

1. Detach the image and image versions from your Domain by passing an empty custom image list to
the Domain. Open the update-domain-input.json file that you created in Attach the SageMaker
image to your current domain (p. 177).
2. Delete the RSessionAppSettings custom images and then save the file. Do not modify the
KernelGatewayAppSettings custom images.

{
    "DomainId": "d-xxxxxxxxxxxx",
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": {
            "CustomImages": [
            ],
            ...
        },
        "RSessionAppSettings": {
            "CustomImages": [
            ],
            "DefaultResourceSpec": {
            }
            ...
        }
    }
}

3. Use the Domain ID and default user settings file to update your Domain.

aws sagemaker update-domain \
    --domain-id <d-xxxxxxxxxxxx> \
    --cli-input-json file://update-domain-input.json

Response:

{
    "DomainArn": "arn:aws:sagemaker:us-east-2:acct-id:domain/d-xxxxxxxxxxxx"
}

4. Delete the app image config.

aws sagemaker delete-app-image-config \
    --app-image-config-name rstudio-custom-config

5. Delete the SageMaker image, which also deletes all image versions. The container images in Amazon
ECR that are represented by the image versions are not deleted.

aws sagemaker delete-image \
    --image-name rstudio-custom-image


Manage users
After your RStudio-enabled Amazon SageMaker Domain is running, you can add user profiles
(UserProfiles) to the Domain. The following topics show how to create user profiles that are authorized
to use RStudio, as well as update an existing user profile. For information on how to delete an RStudio
App, UserProfile, or Domain, follow the steps in Delete an Amazon SageMaker Domain.
Note
The limit for the total number of UserProfiles in an Amazon SageMaker Domain is 60.

There are two types of users:

• Unauthorized: This user cannot access the RStudio app.


• Authorized: This user can access the RStudio app and use one of the RStudio license seats. By default, a
new user is Authorized if the Domain is enabled for RStudio.

If a user is authorized, they can be given one of the following levels of access to RStudio.

• RStudio User: This is a standard RStudio user and can access RStudio.
• RStudio Admin: The admin of your Amazon SageMaker Domain has the ability to create users, add
existing users, and update the permissions of existing users. Admins can also access the RStudio
Administrative dashboard. However, this admin is not able to update parameters that are managed by
Amazon SageMaker.

Methods to create a user


The following topics show how to create a user in your RStudio-enabled Amazon SageMaker Domain.

Create user console

To create a user in your RStudio-enabled Amazon SageMaker Domain from the console, complete the
steps in Add user profiles (p. 119).

Create user CLI

The following command shows how to add users to an Amazon SageMaker Domain with IAM
authentication. A user can belong to either the R_STUDIO_USER or R_STUDIO_ADMIN User group.

aws sagemaker create-user-profile --region <REGION> \
    --domain-id <DOMAIN-ID> \
    --user-profile-name <USER_PROFILE_NAME-ID> \
    --user-settings RStudioServerProAppSettings={UserGroup=<USER-GROUP>}

The following command shows how to add users to an Amazon SageMaker Domain with authentication
using IAM Identity Center. A user can belong to either the R_STUDIO_USER or R_STUDIO_ADMIN User
group.

aws sagemaker create-user-profile --region <REGION> \
    --domain-id <DOMAIN-ID> \
    --user-profile-name <USER_PROFILE_NAME-ID> \
    --user-settings RStudioServerProAppSettings={UserGroup=<USER-GROUP>} \
    --single-sign-on-user-identifier UserName \
    --single-sign-on-user-value <USER-NAME>

Update existing user


You cannot update the authorization of an existing user. You must delete the existing user and create a
new one with the updated authorization.


Log in to RStudio as another user

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Navigate to the Control Panel.
3. Select a user name from the list of users. This opens a new page with details about the user profile
and the apps that are running.
4. Select Launch App.
5. From the dropdown, select RStudio to launch an RStudio instance.

Terminate sessions for another user

1. From the list of running apps, identify the app you want to delete.
2. Select the corresponding Delete app button for the app that you want to delete.

Delete another user

You cannot delete a user if the user is running any apps. Delete all apps before attempting to delete a
user.

1. From the User Profile page, select Edit. This opens a new General settings page.
2. Under Delete user, select Delete user.

RStudio administrative dashboard


This topic shows how to access and use the RStudio administrative dashboard. With the RStudio
administrative dashboard, admins can manage users and RSessions, as well as view information about
RStudio Server instance utilization and Amazon CloudWatch Logs.

Launch the RStudio administrative dashboard


The R_STUDIO_ADMIN authorization allows the user to access the RStudio administrative dashboard.
An R_STUDIO_ADMIN user can access the RStudio administrative dashboard by replacing workspaces
with admin in their RStudio URL manually. The following shows how to modify the URL to access the
RStudio administrative dashboard.

For example, the following RStudio URL:

https://<DOMAIN-ID>.studio.us-east-2.sagemaker.aws/rstudio/default/s/<SESSION-ID>/workspaces

Can be converted to:

https://<DOMAIN-ID>.studio.us-east-2.sagemaker.aws/rstudio/default/s/<SESSION-ID>/admin

Dashboard tab
This tab gives an overview of your RStudio Server instance utilization, as well as information on the
number of active RSessions.

Sessions tab
This tab gives information on the active RSessions, such as the user that launched the RSessions, the
time that the RSessions have been running, and their resource utilization.


Users tab
This tab gives information on the RStudio authorized users in the Domain, such as the time that the
last RSession was launched and their resource utilization. The following procedure shows how to get
information about the user's historical resource utilization.

1. From the list of users, select the user that you want to view information for. This opens a new page
that is specific to the user.
2. To view the user's historical resource utilization, select the Stats tab. This tab gives information
about the historical CPU and memory usage, as well as the number of active RSessions.
3. To view Amazon CloudWatch Logs specific to the user, select the Logs tab.

Stats tab
This tab gives information on the historical utilization of your RStudio Server instance.

Logs tab
This tab displays Amazon CloudWatch Logs for the RStudio Server instance. For more information about
logging events with Amazon CloudWatch Logs, see What is Amazon CloudWatch Logs?.

Shut down and restart RStudio


To shut down and restart your RStudio Workbench and the associated RStudioServerPro app, you
must first shut down all of your existing RSessions. You can shut down the RSessionGateway apps
from within RStudio. You can then shut down the RStudioServerPro app using the AWS CLI. After the
RStudioServerPro app is shut down, you must reopen RStudio through the SageMaker console.

Any unsaved notebook information is lost in the process. The user data in the Amazon EFS volume isn't
impacted.
Note
If you are using a custom image with RStudio, ensure that your Docker image is using an RStudio
version that is compatible with the version of RStudio Workbench being used by SageMaker
after you restart your RStudioServerPro app.

The following topics show how to shut down the RSessionGateway and RStudioServerPro apps and
restart them.

Suspend your RSessions


Complete the following procedure to suspend all of your RSessions.

1. From the RStudio Launcher, identify the RSession that you want to suspend.
2. Select Suspend for the session.
3. Repeat this for all RSessions.

Delete your RSessions


Complete the following procedure to shut down all of your RSessions.

1. From the RStudio Launcher, identify the RSession that you want to delete.
2. Select Quit for the session. This opens a new Quit Session window.
3. From the Quit Session window, select Force Quit, to end all child processes in the session.
4. Select Quit Session to confirm deletion of the session.
5. Repeat this for all RSessions.


Delete your RStudioServerPro app


Run the following commands from the AWS CLI to delete and restart your RStudioServerPro app.

1. Delete the RStudioServerPro application by using your current domain id.

aws sagemaker delete-app \
    --domain-id <domainId> \
    --user-profile-name domain-shared \
    --app-type RStudioServerPro \
    --app-name default

2. Re-create the RStudioServerPro application.

aws sagemaker create-app \
    --domain-id <domainId> \
    --user-profile-name domain-shared \
    --app-type RStudioServerPro \
    --app-name default

Manage billing and cost


To track the costs associated with your RStudio environment, you can use the AWS Billing and Cost
Management service. AWS Billing and Cost Management provides useful tools to help you gather
information related to your cost and usage, analyze your cost drivers and usage trends, and take action
to budget your spending. For more information, see What is AWS Billing and Cost Management?.

The following describes components required to run RStudio on Amazon SageMaker and how each
component factors into billing for your RStudio instance.

• RStudio License – You must purchase an RStudio license. There is no additional charge for using your
RStudio license with Amazon SageMaker. For more information about your RStudio license, see
RStudio license (p. 435).
• RSession – These are RStudio working sessions launched by end users. You are charged while an
RSession is running.
• RStudio Server – A multi-tenant server that manages all of the RSessions. You can choose the instance
type to run RStudio Server on, and you pay the related costs. The default instance, "system", is free, but
you can choose to pay for higher tiers. For more information about the available instance types for your
RStudio Server, see RStudioServerPro instance type (p. 437).

Tracking billing at user level

To track billing at the user level using Cost Allocation Tags, see Using Cost Allocation Tags.
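
For example, you can attach a cost allocation tag to a user profile with the add-tags CLI command.
The following is a small sketch; the tag key and value are illustrative placeholders.

# Tag a user profile so that its costs can be grouped in Cost Explorer.
aws sagemaker add-tags \
    --resource-arn arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:user-profile/<DOMAIN_ID>/<USER_PROFILE> \
    --tags Key=CostCenter,Value=data-science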

Diagnose issues and get support


The following sections describe how to diagnose issues with RStudio on Amazon SageMaker. To
get support for RStudio on Amazon SageMaker, contact Amazon SageMaker support. For help with
purchasing an RStudio license or modifying the number of license seats, contact [email protected].

Upgrade your version


If you receive a warning that there is a version mismatch between your RSession and RStudioServerPro
apps, then you must upgrade the version of your RStudioServerPro app. For more information, see
Upgrade the RStudio Version (p. 436).


View Metrics and Logs


You can monitor your workflow performance while using RStudio on Amazon SageMaker. View data logs
and information about metrics with the RStudio administrative dashboard or Amazon CloudWatch.

View your RStudio logs from the RStudio administrative dashboard

You can view metrics and logs directly from the RStudio administrative dashboard.

1. Log in to your Amazon SageMaker Domain.


2. Navigate to the RStudio administrative dashboard following the steps in RStudio administrative
dashboard (p. 460).
3. Select the Logs tab.

View your RStudio logs from Amazon CloudWatch Logs

Amazon CloudWatch monitors your AWS resources and the applications that you run on AWS in real
time. You can use Amazon CloudWatch to collect and track metrics, which are variables that you can
measure for your resources and applications. To ensure that your RStudio apps have permissions for
Amazon CloudWatch, you must include the permissions described in Onboard to Amazon SageMaker
Domain (p. 37). You don’t need to do any setup to gather Amazon CloudWatch Logs.

The following steps show how to view Amazon CloudWatch Logs for your RSession.

These logs can be found in the /aws/sagemaker/studio log group in the Amazon CloudWatch
console.

1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.


2. Select Logs from the left side. From the dropdown menu, select Log groups.
3. On the Log groups screen, search for aws/sagemaker/studio. Select the Log group.
4. On the aws/sagemaker/studio Log group screen, navigate to the Log streams tab.
5. To find the logs for your Domain, search Log streams using the following format:

<DomainId>/domain-shared/rstudioserverpro/default
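
If you prefer the AWS CLI to the console, the following sketch fetches the same logs through the
CloudWatch Logs API; the placeholder mirrors the stream format above.

# Fetch recent RStudio Server log events for your Domain.
aws logs filter-log-events \
    --log-group-name /aws/sagemaker/studio \
    --log-stream-name-prefix "<DomainId>/domain-shared/rstudioserverpro" \
    --max-items 50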

Use RStudio on Amazon SageMaker


With RStudio support in Amazon SageMaker, you can put your production workflows in place and take
advantage of SageMaker features. The following topics show how to launch an RStudio session and
complete key workflows. For information about managing RStudio on SageMaker, see Manage RStudio
on Amazon SageMaker (p. 434).

For information about the onboarding steps to create an Amazon SageMaker Domain with RStudio
enabled, see Onboard to Amazon SageMaker Domain (p. 37).

For information about the AWS Regions that RStudio on SageMaker is supported in, see Supported
Regions and Quotas (p. 33).

Topics
• Collaborate in RStudio (p. 464)
• Base R image (p. 464)
• Open RStudio Launcher and launch RSessions (p. 464)
• Publish to RStudio Connect (p. 465)


• Access Amazon SageMaker features with RStudio on Amazon SageMaker (p. 465)

Collaborate in RStudio
To share your RStudio project, you can connect RStudio to your Git repo. For information on setting this
up, see Version Control with Git and SVN.

Note: Project sharing and real-time collaboration are not currently supported when using RStudio on
Amazon SageMaker.

Base R image
When launching your RStudio instance, the Base R image serves as the basis of your instance. This image
extends the r-session-complete Docker image.

This Base R image includes the following:

• R v4.0 or higher
• awscli, sagemaker, and boto3 Python packages
• Reticulate package for R SDK integration

Open RStudio Launcher and launch RSessions


The following topics show how to use the RStudio Launcher to launch RSessions.

Open RStudio Launcher


Open RStudio Launcher from the Amazon SageMaker Console

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. From the left navigation, select RStudio.
3. Under Get Started, select the domain and user profile to launch.
4. Choose Launch RStudio.

Open RStudio Launcher from the AWS CLI

The procedure to open the RStudio Launcher using the AWS CLI differs depending on the method used
to manage your users.

IAM Identity Center

1. Use the AWS access portal to open your Amazon SageMaker Domain.
2. Modify the URL path to “/rstudio/default” as follows.

#Studio URL
https://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/lab

#modified URL
https://<domain-id>.studio.<region>.sagemaker.aws/rstudio/default

IAM

To open the RStudio Launcher from the AWS CLI in IAM mode, complete the following procedure.


1. Create a presigned URL using the following command.

aws sagemaker create-presigned-domain-url --region <REGION> \
    --domain-id <DOMAIN-ID> \
    --user-profile-name <USER-PROFILE-NAME>

2. Append &redirect=RStudioServerPro to the generated URL.


3. Navigate to the updated URL.
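
Putting steps 1 and 2 together, the following sketch captures the presigned URL and appends the
redirect parameter in one step.

# Generate a presigned URL that opens RStudio directly.
URL=$(aws sagemaker create-presigned-domain-url --region <REGION> \
    --domain-id <DOMAIN-ID> \
    --user-profile-name <USER-PROFILE-NAME> \
    --query "AuthorizedUrl" --output text)
echo "${URL}&redirect=RStudioServerPro"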

Launch RSessions
After you've opened the RStudio Launcher, you can create a new RSession.

1. Select New Session.


2. Enter a Session Name.
3. Select an instance type that your RSession runs on. This defaults to ml.t3.medium.
4. Select an Image that your RSession uses as the kernel.
5. Select Start Session.
6. After your session has been created, you can start it by selecting the name.
Note
If you receive a warning that there is a version mismatch between your RSession and
RStudioServerPro apps, then you must upgrade the version of your RStudioServerPro app.
For more information, see Upgrade the RStudio Version (p. 436).

Suspend your RSessions


1. From the RStudio Launcher, identify the RSession that you want to suspend.
2. Select Suspend for the session.

Delete your RSessions


1. From the RStudio Launcher, identify the RSession that you want to delete.
2. Select Quit for the session. This opens a new Quit Session window.
3. From the Quit Session window, select Force Quit, to end all child processes in the session.
4. Select Quit Session to confirm deletion of the session.

Publish to RStudio Connect


RStudio Connect enables data scientists to publish insights, dashboards, and web applications from
RStudio on Amazon SageMaker. For more information, see Host RStudio Connect and Package Manager
for ML development in RStudio on Amazon SageMaker.

For more information on RStudio Connect, see the RStudio Connect User Guide.

Access Amazon SageMaker features with RStudio on Amazon


SageMaker
One of the benefits of using RStudio on Amazon SageMaker is the integration of Amazon SageMaker
features. This includes integration with Amazon SageMaker Studio and Reticulate.

Use Amazon SageMaker Studio JupyterLab and RStudio on Amazon SageMaker


Your Amazon SageMaker Studio JupyterLab and RStudio instances share the same Amazon EFS file
system. This means that files that you import and create using JupyterLab can be accessed using RStudio
and vice versa. This allows you to work on the same files using both JupyterLab and RStudio without
having to move your files between the two. For more information on this workflow, see the Announcing
Fully Managed RStudio on Amazon SageMaker for Data Scientists blog.

Use Amazon SageMaker SDK with reticulate

The reticulate package is used as an R interface to the Amazon SageMaker Python SDK to make API calls
to Amazon SageMaker. The reticulate package translates between R and Python objects, and Amazon
SageMaker provides a serverless data science environment to train and deploy Machine Learning (ML)
models at scale. For general information about the reticulate package, see R Interface to Python.

For a blog that outlines how to use the reticulate package with Amazon SageMaker, see Using R with
Amazon SageMaker.

The following examples show how to use reticulate for specific use cases.

• For a notebook that describes how to use reticulate to do batch transform to make predictions, see
Batch Transform Using R with Amazon SageMaker.
• For a notebook that describes how to use reticulate to conduct hyperparameter tuning and generate
predictions, see Hyperparameter Optimization Using R with Amazon SageMaker.


Automate model development with Amazon SageMaker Autopilot

Amazon SageMaker Autopilot is a feature set that automates key tasks of an automatic machine
learning (AutoML) process. It explores your data, selects the algorithms relevant to your problem type,
and prepares the data to facilitate model training and tuning. When appropriate, Autopilot automatically
applies a cross-validation resampling procedure to all candidate algorithms to test their ability
to predict data they have not been trained on. It also produces metrics to assess the predictive quality
of its machine learning model candidates, and ranks all of the optimized models tested by their
performance. This helps you find the best performing model to deploy in a fraction of the time normally
required.

You can use Autopilot in different ways: on autopilot (hence the name) or with various degrees of
human guidance, without code through Amazon SageMaker Studio or with code using one of the AWS
SDKs. Autopilot currently supports regression and binary and multiclass classification problem types. It
supports tabular data formatted as CSV or Parquet files in which each column contains a feature with
a specific data type and each row contains an observation. The column data types accepted include
numerical, categorical, text, and time series that consist of strings of comma-separated numbers.
Autopilot supports building machine learning models on large datasets up to hundreds of GBs.

Autopilot also helps explain how models make predictions using a feature attribution approach
developed for Amazon SageMaker Clarify. Autopilot automatically generates a report that indicates
the importance of each feature for the predictions made by the best candidate. This explainability
functionality can make machine learning models more understandable to AWS customers. The model
governance report generated can be used to inform risk and compliance teams and external regulators.

You get full visibility into how the data was wrangled and how the models were selected, trained, and
tuned for each of the candidates tested. This is provided by notebooks that Autopilot generates for each
trial, which contain the code used to explore the data and find the best candidate. The notebooks also
provide educational tools to help you learn about and conduct your own ML experiments. You can learn
about the impact of various inputs and trade-offs made in experiments by examining the various data
exploration and candidate definition notebooks exposed by Autopilot. You can also conduct further
experiments on the higher performing candidates by making your own modifications to the notebooks
and rerunning them.

The following graphic outlines the principal tasks of an AutoML process managed by Autopilot.

With Amazon SageMaker, you pay only for what you use. You pay for the underlying compute and
storage resources within SageMaker or other AWS services, based on your usage. For more information
about the cost of using SageMaker, see Amazon SageMaker Pricing.

Topics
• Get started with Amazon SageMaker Autopilot (p. 468)


• Create an Amazon SageMaker Autopilot experiment (p. 470)


• Amazon SageMaker Autopilot datasets and problem types (p. 475)
• Training modes and algorithm support (p. 476)
• Metrics and validation (p. 478)
• Amazon SageMaker Autopilot model deployment and prediction (p. 483)
• Amazon SageMaker Autopilot explainability (p. 496)
• Models generated by Amazon SageMaker Autopilot (p. 497)
• Amazon SageMaker Autopilot notebooks generated to manage AutoML tasks (p. 509)
• Configure inference output in generated containers (p. 517)
• Amazon SageMaker Autopilot quotas (p. 522)
• API Reference guide for Amazon SageMaker Autopilot (p. 524)

Get started with Amazon SageMaker Autopilot


Amazon SageMaker Autopilot provides samples, videos, and tutorials to help you get started.

Topics
• Samples: Explore modeling with Amazon SageMaker Autopilot (p. 468)
• Videos: Use Autopilot to automate and explore the machine learning process (p. 469)
• Tutorials: Get started with Amazon SageMaker Autopilot (p. 470)

Samples: Explore modeling with Amazon SageMaker Autopilot

Amazon SageMaker Autopilot provides the following sample notebooks.

• Direct marketing with Amazon SageMaker Autopilot: This notebook demonstrates how to use the Bank
Marketing Data Set to predict whether a customer will enroll for a term deposit at a bank. You can
use Autopilot on this dataset to get the most accurate ML pipeline by exploring options contained
in various candidate pipelines. Autopilot generates each candidate in a two-step procedure. The first
step performs automated feature engineering on the dataset. The second step trains and tunes an
algorithm to produce a model. The notebook contains instructions on how to train the model and how
to deploy the model to perform batch inference using the best candidate.
• Customer Churn Prediction with Amazon SageMaker Autopilot: This notebook describes using
machine learning for the automated identification of unhappy customers, also known as customer
churn prediction. The sample shows how to analyze a publicly available dataset and perform feature
engineering on it. Next it shows how to tune a model by selecting the best performing pipeline along
with the optimal hyperparameters for the training algorithm. Finally, it shows how to deploy the
model to a hosted endpoint and how to evaluate its predictions against ground truth. However, ML
models rarely give perfect predictions. That's why this notebook also shows how to incorporate the
relative costs of prediction mistakes when determining the financial outcome of using ML.
• Top Candidates Customer Churn Prediction with Amazon SageMaker Autopilot and Batch Transform
(Python SDK): This notebook also describes using machine learning for the automated identification
of unhappy customers, also known as customer churn prediction. This notebook demonstrates how
to configure the model to obtain the inference probability, select the top N models, and make Batch
Transform on a hold-out test set for evaluation.
Note
This notebook works with SageMaker Python SDK >= 1.65.1 released on 6/19/2020.


• Bringing your own data processing code to Amazon SageMaker Autopilot: This notebook demonstrates
how to incorporate and deploy custom data processing code when using Amazon SageMaker
Autopilot. It adds to an Autopilot job a custom feature selection step that removes irrelevant variables. It
then shows how to deploy both the custom processing code and models generated by Autopilot on a
real-time endpoint and, alternatively, for batch processing.

Videos: Use Autopilot to automate and explore the machine learning process

The following video series provides a tour of Amazon SageMaker Autopilot capabilities using Studio.
The videos show how to start an AutoML job, how to analyze and preprocess data, how to do feature
engineering and hyperparameter optimization on candidate models, and how to visualize and compare
the resulting model metrics.

Topics
• Start an AutoML job with Amazon SageMaker Autopilot (p. 469)
• Review data exploration and feature engineering automated in Autopilot. (p. 469)
• Tune models to optimize performance (p. 469)
• Choose and deploy the best model (p. 469)
• Amazon SageMaker Autopilot tutorial (p. 469)

Start an AutoML job with Amazon SageMaker Autopilot


This video shows you how to start an AutoML job with Autopilot. (Length: 8:41)

Amazon SageMaker Studio - AutoML with Amazon SageMaker Autopilot (part 1)

Review data exploration and feature engineering automated in Autopilot.

This video shows you how to review the data exploration and candidate definition notebooks generated
by Amazon SageMaker Autopilot. (Length: 10:04)

Amazon SageMaker Studio - AutoML with Amazon SageMaker Autopilot (part 2)

Tune models to optimize performance


This video shows you how to optimize model performance during training using hyperparameter tuning.
(Length: 4:59)

SageMaker Studio - AutoML with Amazon SageMaker Autopilot (part 3)

Choose and deploy the best model


This video shows you how to use job metrics to choose the best model and then how to deploy it.
(Length: 5:20)

SageMaker Studio - AutoML with Amazon SageMaker Autopilot (part 4)

Amazon SageMaker Autopilot tutorial


This video walks you through an end-to-end demo where we first build a binary classification model
automatically with Amazon SageMaker Autopilot. We see how candidate models have been built and
optimized using auto-generated notebooks. We also look at the top candidates with Amazon SageMaker
Experiments. Finally, we deploy the top candidate (based on XGBoost), and configure data capture with
SageMaker Model Monitor.

End to end demo with AutoML on SageMaker

Tutorials: Get started with Amazon SageMaker Autopilot

Get started tutorials for Autopilot demonstrate how to create a machine learning model automatically
without writing code. They show you how Autopilot simplifies the machine learning experience by
helping you explore your data and try different algorithms. Autopilot builds the best machine learning
model for the problem type using AutoML capabilities while allowing full control and visibility.

• Create a machine learning model automatically with Autopilot: You assume the role of a developer
working at a bank in this tutorial. You have been asked to develop a machine learning model to predict
if a customer will enroll for a certificate of deposit (CD). This is a binary classification problem. The
model is trained on the marketing dataset that contains information on customer demographics,
responses to marketing events, and external factors.

Create an Amazon SageMaker Autopilot experiment

This guide shows how to create an Amazon SageMaker Autopilot experiment (that is, how to start an
Autopilot job in SageMaker), so that you can explore, pre-process, and train various model candidates on
a given dataset. This can help you get started with machine learning quickly.

You can use the Amazon SageMaker Studio UI to help you populate the input, output, target, and
parameters to run and evaluate an Autopilot experiment, or you can use the SageMaker API Reference.
The UI has descriptions, toggle switches, dropdown menus, radio buttons, and more to help you navigate
creating your model candidates. You can also view statistics while the experiment is running. After it
runs, you can compare trials and delve into the details of the pre-processing steps, algorithms, and
hyperparameter ranges of each model. You also have the option to download their explainability and
performance reports. Use the provided notebooks to see the results of the automated data exploration
or the candidate model definitions.

The following instructions show how to create an Amazon SageMaker Autopilot job as a pilot experiment
using Studio UI or SageMaker API reference. You name your experiment, provide locations for the input
and output data, and specify which target data to predict. Optionally, you can also specify the type of
machine learning problem that you want to solve, choose your modeling strategy (stacked ensembles or
hyperparameters optimization), select the list of algorithms used by the Autopilot job to train the data,
and more.

Create an Autopilot experiment using Studio


To create an Amazon SageMaker Autopilot experiment using Studio

1. Sign in at https://console.aws.amazon.com/sagemaker/, select Studio from the left navigation


pane, then choose Open Studio.
2. In Studio, choose the Home icon from the left navigation pane to view the Studio top-level
navigation menu.


3. On the Home tab, choose the AutoML card. This opens a new AutoML tab.
4. Choose Create an AutoML experiment. This opens a new Create experiment tab.
5. In the Experiment and data details section, enter the following information:

a. Experiment name – Must be unique to your account in the current AWS Region and contain a
maximum of 63 alphanumeric characters. Can include hyphens (-) but not spaces.
b. Input data – Provide the Amazon Simple Storage Service (Amazon S3) bucket location of your
input data. This S3 bucket must be in your current AWS Region. The URL must be in an s3://
format where Amazon SageMaker has write permissions. The file must be in CSV or Parquet
format and contain at least 500 rows. Select Browse to scroll through available paths and
Preview to see a sample of your input data.
c. Is your S3 input a manifest file? – A manifest file includes metadata with your input data. The
metadata specifies the location of your data in Amazon S3. It also specifies how the data is
formatted and which attributes from the dataset to use when training your model. You can use
a manifest file as an alternative to preprocessing when your labeled data is being streamed in
Pipe mode.
d. Auto split data? – Autopilot can split your data into an 80-20% split for training and validation
data. If you prefer a custom split, you can choose the Specify split ratio. To use a custom
dataset for validation, choose Provide a validation set.
e. Output data location (S3 bucket) – The name of the S3 bucket location where you want to
store the output data. The URL for this bucket must be in an Amazon S3 format where Amazon
SageMaker has write permissions. The S3 bucket must be in the current AWS Region. Autopilot
can also create this for you in the same location as your input data.
6. Choose Next: Target and features. The Target and features tab opens.
7. In the Target and features section:

• Select a column to set as a target for model predictions.


• Optionally, you can pass the name of a sample weights column in the Sample weight section to
request your dataset rows to be weighted during training and evaluation. For more information on
the available objective metrics, see Autopilot weighted metrics (p. 480).
Note
Support for sample weights is available in ensembling mode only.
• You can also select features for training and change their data type. The following data types are
available: Text, Numerical, Categorical, Datetime, Sequence, and Auto. All features are
selected by default.
8. Choose Next: Training method. The Training method tab opens.
9. In the Training method section, select your training option: Ensembling, Hyperparameter
optimization (HPO), or Auto to let Autopilot choose the training method automatically based on
the dataset size. Each training mode runs a pre-defined set of algorithms on your dataset to train
model candidates. By default, Autopilot pre-selects all the available algorithms for the given training
mode. You can run an Amazon SageMaker Autopilot training experiment with all the algorithms or
choose your own subset.

For more information on the training modes and the available algorithms, see the Autopilot
training modes section in the Training modes and algorithms page.
10. Choose Next: Deployment and advanced settings to open the Deployment and advanced settings
tab. Settings include auto display endpoint name, machine learning problem type, and additional
choices for running your experiment.

a. Deployment settings – Autopilot can automatically create an endpoint and deploy your model
for you.

To auto deploy to an automatically generated endpoint, or to provide an endpoint name for
custom deployment, set the toggle to Yes under Auto deploy? If you are importing data from
Amazon SageMaker Data Wrangler, you have additional options to auto deploy the best model
with or without the transforms from Data Wrangler.
Note
If your Data Wrangler flow contains multi-row operations such as groupby, join,
or concatenate, you won't be able to auto deploy with these transforms. For more
information, see Automatically Train Models on Your Data Flow.
b. Advanced settings (optional) – Autopilot provides additional controls to manually set
experimental parameters such as defining your problem type, time constraints on your
Autopilot job and trials, security, and encryption settings.

• Machine learning problem type – Autopilot can automatically select the machine learning
problem type. If you prefer to choose it manually, use the Select the machine learning
problem type dropdown menu.

A. Auto – Autopilot infers the problem type from the values of the attribute that you
want to predict. In some cases, SageMaker is unable to infer accurately. When that
happens, you must provide the value for the job to succeed.
B. Binary classification – Binary classification is a type of supervised learning that assigns
an individual to one of two predefined and mutually exclusive classes, based on their
attributes. For example, medical diagnosis based on results of diagnostic tests that
determine if someone has a disease.
C. Regression – Regression estimates the values of a dependent target variable based
on one or more variables or attributes that are correlated with it. For example, house
prices based on features, such as square footage and number of bathrooms.
D. Multiclass classification – Multiclass classification is a type of supervised learning that
assigns an individual to one of several classes based on their attributes. For example,
the prediction of the topic most relevant to a text document, such as politics, finance,
or philosophy.
c. Choose Next: Review and create to get a summary of your Autopilot experiment before you
create it.
11. Select Create experiment. The creation of the experiment starts an Autopilot job in SageMaker.
Autopilot provides status on the course of the experiment, information on the data exploration
process and model candidates in notebooks, a list of generated models and their reports, and the
job profile used to create them.

For information on the notebooks generated by an Autopilot job, see Amazon SageMaker Autopilot
notebooks generated to manage AutoML tasks (p. 509). For information on the details of
each model candidate and their reports, see Models generated by Amazon SageMaker Autopilot
(p. 497).

Note
To avoid incurring unnecessary charges: If you deploy a model that is no longer needed, delete
the endpoints and resources that were created during that deployment. Information about
pricing instances by Region is available at Amazon SageMaker Pricing.

Create an Autopilot experiment programmatically


You can create an AutoML job programmatically by calling the CreateAutoMLJob API action in any
language supported by Amazon SageMaker Autopilot.

For information on how this API action translates into a function in the language of your choice, see the
See Also section of CreateAutoMLJob and choose an SDK.

As an example, for Python users, see the full request syntax of create_auto_ml_job in AWS SDK for
Python.


Required parameters
When using CreateAutoMLJob to create an AutoML Job, you must provide the following four values:

• AutoMLJobName to specify the name of your job.


• AutoMLChannel in InputDataConfig to specify your data source.
• OutputDataConfig to specify the Amazon S3 output path to store the artifacts of your AutoML job.
• RoleArn to specify the ARN of the role used to access your data.

All other parameters are optional.
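
For illustration, the following is a minimal sketch of such a request using the AWS SDK for Python
(Boto3). The job name, bucket paths, target column, and role ARN are placeholder values for this
sketch, not defaults.

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

# The four required values: job name, input data, output location, and execution role.
# All names and paths below are illustrative placeholders.
sagemaker_client.create_auto_ml_job(
    AutoMLJobName="my-autopilot-job",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/autopilot/input/",
                }
            },
            # Column in the dataset that the model should learn to predict.
            "TargetAttributeName": "target_column",
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot/output/"},
    RoleArn="arn:aws:iam::111122223333:role/my-sagemaker-execution-role",
)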

Optional parameters
The following sections provide details of some additional parameters that you can pass to your AutoML
job.

How to set the training mode of an AutoML job


The set of algorithms run on your data to train your model candidates is dependent on your modeling
strategy (ENSEMBLING or HYPERPARAMETER_TUNING). You can set the training method of an AutoML
job with the AutoMLJobConfig.Mode parameter.

If you keep it blank (or null), the Mode is inferred based on the size of your dataset.

For information on Autopilot's stacked ensembles and hyperparameters optimization training methods,
see Training modes and algorithm support (p. 476).
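
For example, the following sketch (reusing the placeholder values from the earlier example) requests
ensembling mode explicitly. Omitting the Mode key lets Autopilot infer the mode from the dataset size.

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

sagemaker_client.create_auto_ml_job(
    AutoMLJobName="my-autopilot-job-ensembling",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/autopilot/input/"}},
        "TargetAttributeName": "target_column",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot/output/"},
    RoleArn="arn:aws:iam::111122223333:role/my-sagemaker-execution-role",
    # "ENSEMBLING" or "HYPERPARAMETER_TUNING"; leave out to let Autopilot decide.
    AutoMLJobConfig={"Mode": "ENSEMBLING"},
)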

How to select features and algorithms for training an AutoML job


Feature selection
Autopilot provides automatic data-preprocessing steps including feature selection and feature
extraction. However, you can manually provide the features to be used in training with the
FeatureSpecificationS3Uri attribute of AutoMLCandidateGenerationConfig within the
CreateAutoMLJob API with the following format:

{
    "AutoMLJobConfig": {
        "CandidateGenerationConfig": {
            "FeatureSpecificationS3Uri": "string"
        }
    }
}

Selected features should be contained within a JSON file in the following format:

{ "FeatureAttributeNames":["col1", "col2", ...] }

The values listed in ["col1", "col2", ...] are case sensitive. They should be a list of strings
containing unique values that are subsets of the column names in the input data.
Note
The list of columns provided as features cannot include the target column.

Algorithm selection
By default, your Autopilot job runs a pre-defined list of algorithms on your dataset to train
model candidates. The list of algorithms depends on the training mode (ENSEMBLING or
HYPERPARAMETER_TUNING) used by the job.


You can provide a subset of the default selection of algorithms by adding the AlgorithmsConfig
attribute and its nested AutoMLAlgorithms field to the AutoMLCandidateGenerationConfig within the
CreateAutoMLJob API.

For the list of available algorithms per training Mode, see AutoMLAlgorithms. For details on each
algorithm, see Training modes and algorithm support (p. 476).

The following is an example of an AlgorithmsConfig attribute listing exactly three algorithms
("xgboost", "fastai", "catboost") in its AutoMLAlgorithms field for the ensembling training mode.

{
    "AutoMLJobConfig": {
        "CandidateGenerationConfig": {
            "AlgorithmsConfig": [
                {"AutoMLAlgorithms": ["xgboost", "fastai", "catboost"]}
            ]
        },
        "Mode": "ENSEMBLING"
    }
}

How to specify the training and validation datasets of an AutoML job

You can provide your own validation dataset and custom data split ratio, or let Autopilot split the
dataset automatically. Each AutoMLChannel object (see the required parameter InputDataConfig) has
a ChannelType, which can be set to either training or validation values that specify how the data
is to be used when building a machine learning model. At least one data source must be provided and a
maximum of two data sources is allowed: one for training data and one for validation data.

How you split the data into training and validation datasets depends on whether you have one or two
data sources.

• If you only have one data source, the ChannelType is set to training by default and must have this
value.
• If the ValidationFraction value in AutoMLDataSplitConfig is not set, 0.2 (20%) of the data
from this source is used for validation by default.
• If the ValidationFraction is set to a value between 0 and 1, the dataset is split based on the
value specified, where the value specifies the fraction of the dataset used for validation.
• If you have two data sources, the ChannelType of one of the AutoMLChannel objects must be set to
training, the default value. The ChannelType of the other data source must be set to validation.
The two data sources must have the same format, either CSV or Parquet, and the same schema. You
must not set the value for the ValidationFraction in this case because all of the data from each
source is used for either training or validation. Setting this value causes an error.

For information on split and cross-validation in Autopilot, see Cross-validation in Autopilot (p. 481).
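
The following is a minimal sketch of providing separate training and validation channels in Boto3;
the bucket paths and target column are placeholders. The commented-out alternative shows a custom
split ratio with a single training channel instead.

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

def channel(s3_uri, channel_type):
    # Build an AutoMLChannel entry; paths and the target column are placeholders.
    return {
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_uri}},
        "TargetAttributeName": "target_column",
        "ChannelType": channel_type,  # "training" or "validation"
    }

sagemaker_client.create_auto_ml_job(
    AutoMLJobName="my-autopilot-job-custom-split",
    InputDataConfig=[
        channel("s3://my-bucket/autopilot/train/", "training"),
        channel("s3://my-bucket/autopilot/validation/", "validation"),
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot/output/"},
    RoleArn="arn:aws:iam::111122223333:role/my-sagemaker-execution-role",
    # Alternatively, with a single "training" channel, set a custom split:
    # AutoMLJobConfig={"DataSplitConfig": {"ValidationFraction": 0.3}},
)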

How to set the problem type of an AutoML job

You can set the type of problem on an AutoML job with the ProblemType parameter of
CreateAutoMLJob. This limits the kind of preprocessing and algorithms that Autopilot tries.
After the job is finished, if you had set the ProblemType, then the
ResolvedAttributes.ProblemType in the DescribeAutoMLJob response matches the ProblemType you
set. If you keep it blank (or null), the ProblemType is inferred on your behalf.
Note
In some cases, Autopilot is unable to infer the ProblemType with high enough confidence, in
which case you must provide the value for the job to succeed.
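
As a sketch (with placeholder names again), setting the problem type in Boto3 looks like the
following. Valid values are BinaryClassification, MulticlassClassification, and Regression.

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

sagemaker_client.create_auto_ml_job(
    AutoMLJobName="my-autopilot-job-binary",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/autopilot/input/"}},
        "TargetAttributeName": "target_column",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot/output/"},
    RoleArn="arn:aws:iam::111122223333:role/my-sagemaker-execution-role",
    # Limits preprocessing and algorithms to those relevant to binary classification.
    ProblemType="BinaryClassification",
)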


How to add sample weights to an AutoML job

You can add a sample weights column to your tabular dataset and then pass it to your AutoML job to
request dataset rows to be weighted during training and evaluation.

To set sample weights when creating an experiment (see CreateAutoMLJob), you can pass the name of
your sample weights column in the SampleWeightAttributeName parameter of the AutoMLChannel
object. This ensures that your objective metric uses the weights for the training, evaluation, and selection
of model candidates.

Support for sample weights is available in ensembling mode only. Your weights should be numeric and
non-negative. Data points with invalid or no weight value are excluded. For more information on the
available objective metrics, see Autopilot weighted metrics (p. 480).
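
The following is a minimal sketch of passing a sample weights column in Boto3. The column, bucket,
and role names are placeholders, and the job is set to ensembling mode because sample weights are
only supported there.

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

sagemaker_client.create_auto_ml_job(
    AutoMLJobName="my-autopilot-job-weighted",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/autopilot/input/"}},
        "TargetAttributeName": "target_column",
        # Numeric, non-negative weights; rows with invalid weights are excluded.
        "SampleWeightAttributeName": "weight_column",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot/output/"},
    RoleArn="arn:aws:iam::111122223333:role/my-sagemaker-execution-role",
    AutoMLJobConfig={"Mode": "ENSEMBLING"},  # sample weights require ensembling mode
)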

Amazon SageMaker Autopilot datasets and problem types

Amazon SageMaker Autopilot gives you the option in Studio or with the AutoML API of specifying a
problem type, such as binary classification or regression, or of detecting it on your behalf based on the
data you provide. Autopilot supports tabular data in which each column contains a feature with a specific
data type and each row contains an observation.

Topics
• Autopilot datasets, data types, and formats (p. 475)
• Amazon SageMaker Autopilot problem types (p. 475)

Autopilot datasets, data types, and formats


Autopilot supports tabular data formatted as CSV files or as Parquet files. For tabular data, each column
contains a feature with a specific data type and each row contains an observation. The properties of
these two file formats differ considerably.

• CSV (comma-separated values) is a row-based file format that stores data in human-readable
plaintext. It is a popular choice for data exchange because it is supported by a wide range of
applications.
• Parquet is a column-based file format in which data is stored and processed more efficiently than in
row-based file formats. This makes Parquet a better option for big data problems.

The data types accepted for columns include numerical, categorical, text, and time series that consist
of strings of comma-separated numbers. If Autopilot detects that it is dealing with time series sequences, it
processes them through specialized feature transformers provided by the tsfresh library. This library
takes the time series as an input and outputs a feature such as the highest absolute value of the time
series or descriptive statistics on autocorrelation. These output features are then used as inputs to
one of the three problem types.

Autopilot supports building machine learning models on large datasets up to hundreds of GBs.
For details on the default resource limits for input datasets and how to increase them, see Amazon
SageMaker Autopilot quotas (p. 522).

Amazon SageMaker Autopilot problem types


Your problem type options are as follows:


Regression
Regression estimates the values of a dependent target variable based on one or more other variables or
attributes that are correlated with it. An example is the prediction of house prices using features like the
number of bathrooms and bedrooms, square footage of the house and garden. Regression analysis can
create a model that takes one or more of these features as an input and predicts the price of a house.

Binary classification
Binary classification is a type of supervised learning that assigns an individual to one of two predefined
and mutually exclusive classes based on their attributes. It is supervised because the models are trained
using examples where the attributes are provided with correctly labelled objects. A medical diagnosis
for whether an individual has a disease or not based on the results of diagnostic tests is an example of
binary classification.

Multiclass classification
Multiclass classification is a type of supervised learning that assigns an individual to one of several
classes based on their attributes. It is supervised because the models are trained using examples where
the attributes are provided with correctly labelled objects. An example is the prediction of the topic most
relevant to a text document. A document may be classified as being about, say, religion or politics or
finance, or about one of several other predefined topic classes.

Training modes and algorithm support


Amazon SageMaker Autopilot supports different training modes and algorithms to address machine
learning problems, report on quality and objective metrics, and use cross-validation automatically
when needed.

Training modes
SageMaker Autopilot can automatically select the training method based on the dataset size, or you can
select it manually. The choices are as follows:

• Ensembling – Autopilot uses the AutoGluon library to train several base models. To find the best
combination for your dataset, ensemble mode runs 10 trials with different model and meta parameter
settings. Then Autopilot combines these models using a stacking ensemble method to create an
optimal predictive model. For a list of algorithms that Autopilot supports in ensembling mode, see the
following Algorithm support section.
• Hyperparameter optimization (HPO) – Autopilot finds the best version of a model by tuning
hyperparameters using Bayesian optimization or multi-fidelity optimization while running training jobs
on your dataset. HPO mode selects the algorithms that are most relevant to your dataset and selects
the best range of hyperparameters to tune your models. To tune your models, HPO mode runs up to
100 trials (default) to find the optimal hyperparameters settings within the selected range. If your
dataset size is less than 100 MB, Autopilot uses Bayesian optimization. Autopilot chooses multi-fidelity
optimization if your dataset is larger than 100 MB.

In multi-fidelity optimization, metrics are continuously emitted from the training containers. A trial
that is performing poorly against a selected objective metric is stopped early. A trial that is performing
well is allocated more resources.

For a list of algorithms that Autopilot supports in HPO mode, see the following Algorithm support
section.


• Auto – Autopilot automatically chooses either ensembling mode or HPO mode based on your dataset
size. If your dataset is larger than 100 MB, Autopilot chooses HPO. Otherwise, it chooses ensembling
mode. Autopilot can fail to read the size of your dataset in the following cases.
• If you enable Virtual Private Cloud (VPC) mode for an AutoML job, but the S3 bucket containing the
dataset only allows access from the VPC.
• The input S3DataType of your dataset is a ManifestFile.
• The input S3Uri contains more than 1000 items.

If Autopilot is unable to read your dataset size, it defaults to choosing HPO mode.

Note
For optimal runtime and performance, use ensemble training mode for datasets that are smaller
than 100 MB.

Algorithm support
In HPO mode, Autopilot supports the following types of machine learning algorithms:

• Linear learner – A supervised learning algorithm that can solve either classification or regression
problems.
• XGBoost – A supervised learning algorithm that attempts to accurately predict a target variable by
combining an ensemble of estimates from a set of simpler and weaker models.
• Deep learning algorithm – A multilayer perceptron (MLP) and feedforward artificial neural network.
This algorithm can handle data that is not linearly separable.

Note
You don't need to specify an algorithm to use for your machine learning problem. Autopilot
automatically selects the appropriate algorithm to train.

In ensembling mode, Autopilot supports the following types of machine learning algorithms:

• LightGBM – An optimized framework that uses tree-based algorithms with gradient boosting. This
algorithm uses trees that grow in breadth, rather than depth, and is highly optimized for speed.
• CatBoost – A framework that uses tree-based algorithms with gradient boosting. Optimized for
handling categorical variables.
• XGBoost – A framework that uses tree-based algorithms with gradient boosting that grows in depth,
rather than breadth.
• Random Forest – A tree-based algorithm that uses several decision trees on random sub-samples of
the data with replacement. The trees are split into optimal nodes at each level. The decisions of each
tree are averaged together to prevent overfitting and improve predictions.
• Extra Trees – A tree-based algorithm that uses several decision trees on the entire dataset. The trees
are split randomly at each level. The decisions of each tree are averaged to prevent overfitting and to
improve predictions. Extra trees add a degree of randomization in comparison to the random forest
algorithm.
• Linear Models – A framework that uses a linear equation to model the relationship between two
variables in observed data.
• Neural network torch – A neural network model that's implemented using PyTorch.
• Neural network fast.ai – A neural network model that's implemented using fast.ai.


Metrics and validation


This guide shows metrics and validation techniques that you can use to measure machine learning model
performance. Amazon SageMaker Autopilot produces metrics that measure the predictive quality of
machine learning model candidates. The metrics calculated for candidates are specified using an array of
MetricDatum types.

Autopilot metrics
The following list contains the names of the metrics that are currently available to measure model
performance within Autopilot.
Note
Autopilot supports sample weights. To learn more about sample weights and the available
objective metrics, see Autopilot weighted metrics (p. 480).

The following are the available metrics.

Accuracy

The ratio of the number of correctly classified items to the total number of (correctly and
incorrectly) classified items. It is used for both binary and multiclass classification. Accuracy
measures how close the predicted class values are to the actual values. Values for accuracy metrics
vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates perfect
inaccuracy.
AUC

The area under the curve (AUC) metric is used to compare and evaluate binary classification by
algorithms that return probabilities, such as logistic regression. To map the probabilities into
classifications, these are compared against a threshold value.

The relevant curve is the receiver operating characteristic curve (ROC curve). The ROC curve plots
the true positive rate (TPR) of predictions (or recall) against the false positive rate (FPR) as a function
of the threshold value, above which a prediction is considered positive. Increasing the threshold
results in fewer false positives, but more false negatives.

AUC is the area under this ROC curve. Therefore, AUC provides an aggregated measure of the model
performance across all possible classification thresholds. AUC scores vary between 0 and 1. A score
of 1 indicates perfect accuracy, and a score of one half (0.5) indicates that the prediction is not
better than a random classifier.
BalancedAccuracy

BalancedAccuracy is a metric that measures the ratio of accurate predictions to all predictions.
This ratio is calculated after normalizing true positives (TP) and true negatives (TN) by the total
number of positive (P) and negative (N) values. It is used in both binary and multiclass classification
and is defined as follows: 0.5*((TP/P)+(TN/N)), with values ranging from 0 to 1. BalancedAccuracy
gives a better measure of accuracy when the number of positives or negatives differ greatly from
each other in an imbalanced dataset, such as when only 1% of email is spam.
F1

The F1 score is the harmonic mean of the precision and recall, defined as follows: F1 = 2 * (precision
* recall) / (precision + recall). It is used for binary classification into classes traditionally referred to
as positive and negative. Predictions are said to be true when they match their actual (correct) class,
and false when they do not.

Precision is the ratio of the true positive predictions to all positive predictions, and it includes the
false positives in a dataset. Precision measures the quality of the prediction when it predicts the
positive class.


Recall (or sensitivity) is the ratio of the true positive predictions to all actual positive instances.
Recall measures how completely a model predicts the actual class members in a dataset.

F1 scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0
indicates the worst.
F1macro

The F1macro score applies F1 scoring to multiclass classification problems. It does this by
calculating the precision and recall, and then taking their harmonic mean to calculate the F1 score
for each class. Lastly, the F1macro averages the individual scores to obtain the F1macro score.
F1macro scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0
indicates the worst.
InferenceLatency

Inference latency is the approximate amount of time between making a request for a model
prediction and receiving it from a real-time endpoint to which the model is deployed. This metric is
measured in seconds and only available in ensembling mode.
LogLoss

Log loss, also known as cross-entropy loss, is a metric used to evaluate the quality of the probability
outputs, rather than the outputs themselves. It is used in both binary and multiclass classification
and in neural nets. It is also the cost function for logistic regression. Log loss is an important metric
to indicate when a model makes incorrect predictions with high probabilities. Values range from 0 to
infinity. A value of 0 represents a model that perfectly predicts the data.
MAE

The mean absolute error (MAE) is a measure of how different the predicted and actual values are,
when they're averaged over all values. MAE is commonly used in regression analysis to understand
model prediction error. In linear regression, MAE represents the average distance from
a predicted line to the actual value. MAE is defined as the sum of absolute errors divided by the
number of observations. Values range from 0 to infinity, with smaller numbers indicating a better
model fit to the data.
MSE

The mean squared error (MSE) is the average of the squared differences between the predicted
and actual values. It is used for regression. MSE values are always positive. The better a model is at
predicting the actual values, the smaller the MSE value is.
Precision

Precision measures how well an algorithm predicts the true positives (TP) out of all of the positives
that it identifies. It is defined as follows: Precision = TP/(TP+FP), with values ranging from zero (0) to
one (1), and is used in binary classification. Precision is an important metric when the cost of a false
positive is high. For example, the cost of a false positive is very high if an airplane safety system is
falsely deemed safe to fly. A false positive (FP) reflects a positive prediction that is actually negative
in the data.
PrecisionMacro

The precision macro computes precision for multiclass classification problems. It does this by
calculating precision for each class and averaging scores to obtain precision for several classes.
PrecisionMacro scores range from zero (0) to one (1). Higher scores reflect the model's ability
to predict true positives (TP) out of all of the positives that it identifies, averaged across multiple
classes.
R2

R2, also known as the coefficient of determination, is used in regression to quantify how much a
model can explain the variance of a dependent variable. Values range from one (1) to negative one
(-1). Higher numbers indicate a higher fraction of explained variability. R2 values close to zero (0)
indicate that very little of the dependent variable can be explained by the model. Negative values
indicate a poor fit and that the model is outperformed by a constant function. For linear regression,
this is a horizontal line.
Recall

Recall measures how well an algorithm correctly predicts all of the true positives (TP) in a dataset. A
true positive is a positive prediction that is also an actual positive value in the data. Recall is defined
as follows: Recall = TP/(TP+FN), with values ranging from 0 to 1. Higher scores reflect a better ability
of the model to predict true positives (TP) in the data. It is used in binary classification.

Recall is important when testing for cancer because it's used to find all of the true positives. A false
positive (FP) reflects a positive prediction that is actually negative in the data. It is often insufficient
to measure only recall, because predicting every output as a true positive yields a perfect recall
score.
RecallMacro

The RecallMacro computes recall for multiclass classification problems by calculating recall for
each class and averaging scores to obtain recall for several classes. RecallMacro scores range from
0 to 1. Higher scores reflect the model's ability to predict true positives (TP) in a dataset, whereas a
true positive reflects a positive prediction that is also an actual positive value in the data. It is often
insufficient to measure only recall, because predicting every output as a true positive will yield a
perfect recall score.
RMSE

Root mean squared error (RMSE) measures the square root of the average squared difference between
predicted and actual values. It is used in regression analysis to
understand model prediction error. It's an important metric to indicate the presence of large model
errors and outliers. Values range from zero (0) to infinity, with smaller numbers indicating a better
model fit to the data. RMSE is dependent on scale, and should not be used to compare datasets of
different sizes.

Metrics that are automatically calculated for a model candidate are determined by the type of problem
being addressed.

• Regression: InferenceLatency, MAE, MSE, R2, RMSE


• Binary classification: Accuracy, AUC, BalancedAccuracy, F1, InferenceLatency, LogLoss,
Precision, Recall
• Multiclass classification: Accuracy, BalancedAccuracy, F1macro, InferenceLatency, LogLoss,
PrecisionMacro, RecallMacro
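
To make the classification definitions above concrete, the following short Python sketch computes
accuracy, precision, recall, F1, and balanced accuracy from raw confusion-matrix counts. The counts
are made-up values for illustration only.

# Toy confusion-matrix counts for a binary classifier (illustrative values).
tp, fp, tn, fn = 80, 10, 95, 15

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)                   # Precision = TP / (TP + FP)
recall = tp / (tp + fn)                      # Recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)
# BalancedAccuracy = 0.5 * ((TP / P) + (TN / N)), where P = TP + FN and N = TN + FP.
balanced_accuracy = 0.5 * ((tp / (tp + fn)) + (tn / (tn + fp)))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} balanced_accuracy={balanced_accuracy:.3f}")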

Autopilot weighted metrics


Note
Autopilot supports sample weights in ensembling mode only, for all available metrics with
the exception of BalancedAccuracy and InferenceLatency. BalancedAccuracy comes
with its own weighting scheme for imbalanced datasets and does not require sample weights.
InferenceLatency does not support sample weights. When used as objective metrics, both
BalancedAccuracy and InferenceLatency ignore any existing sample weights when training
and evaluating a model.

Users can add a sample weights column to their data to ensure that each observation used to train
a machine learning model is given a weight corresponding to its perceived importance to the model.
This is especially useful in scenarios in which the observations in the dataset have varying degrees
of importance, or in which a dataset contains a disproportionate number of samples from one class
compared to others. Assigning each observation a weight based on its importance, or assigning greater
weight to a minority class, can improve a model's overall performance and help ensure that the model
is not biased toward the majority class.
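
As a sketch of the underlying arithmetic, a weighted metric replaces each count of observations with
the sum of the sample weights of the corresponding rows. The numbers below are made up for
illustration.

# Three predictions with per-row sample weights (illustrative values).
# correct[i] is 1 if prediction i was right, 0 otherwise.
weights = [1.0, 5.0, 1.0]
correct = [1, 0, 1]

unweighted_accuracy = sum(correct) / len(correct)  # 2/3, about 0.667

# The heavily weighted second row is wrong, so weighted accuracy drops sharply.
weighted_accuracy = sum(w * c for w, c in zip(weights, correct)) / sum(weights)  # 2/7, about 0.286

print(unweighted_accuracy, weighted_accuracy)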


For information about how to pass sample weights when creating an experiment in the Studio UI, see
Step 7 in Create an Autopilot experiment using Studio.

For information about how to pass sample weights programmatically when creating an Autopilot
experiment using the API, see How to add sample weights to an AutoML job in Create an Autopilot
experiment programmatically.

Cross-validation in Autopilot
Cross-validation is used to reduce overfitting and bias in model selection. It is also used to assess how
well a model can predict the values of an unseen validation dataset, if the validation dataset is drawn
from the same population. This method is especially important when training on datasets that have a
limited number of training instances.

Autopilot uses cross-validation to build models in hyperparameter optimization (HPO) and ensemble
training mode. The first step in the Autopilot cross-validation process is to split the data into k-folds.

K-fold splitting
K-fold splitting is a method that separates an input training dataset into multiple training and validation
datasets. The dataset is split into k equally sized sub-samples called folds. Models are then trained
on k-1 folds and tested against the remaining kth fold, which is the validation dataset. The process is
repeated k times, using a different fold for validation each time.

The following image depicts k-fold splitting with k = 4 folds. Each fold is represented as a row. The dark-
toned boxes represent the parts of the data used in training. The remaining light-toned boxes indicate
the validation datasets.
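
A minimal sketch of k-fold index splitting, using scikit-learn rather than Autopilot's internal
implementation, looks like the following.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy dataset with 10 rows

# Split the rows into k = 4 folds; each fold serves once as the validation set.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    print(f"fold {fold}: train on rows {train_idx}, validate on rows {val_idx}")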

Autopilot uses k-fold cross-validation for both hyperparameter optimization (HPO) mode and
ensembling mode.

You can deploy Autopilot models that are built using cross-validation like you would with any other
Autopilot or SageMaker model.

HPO mode
K-fold cross-validation uses the k-fold splitting method for cross-validation. In HPO mode, Autopilot
automatically implements k-fold cross-validation for small datasets with 50,000 or fewer training
instances. Performing cross-validation is especially important when training on small datasets because it
protects against overfitting and selection bias.

HPO mode uses a k value of 5 on each of the candidate algorithms that are used to model the dataset.
Multiple models are trained on different splits, and the models are stored separately. When training is
complete, validation metrics for each of the models are averaged to produce a single estimation metric.


Lastly, Autopilot combines the models from the trial with the best validation metric into an ensemble
model. Autopilot uses this ensemble model to make predictions.

The validation metric for the models trained by Autopilot is presented as the objective metric in
the model leaderboard. Autopilot uses the default validation metric for each problem type that it
handles, unless you specify otherwise. For the list of all metrics that Autopilot uses, see Autopilot
metrics (p. 478).

For example, the Boston Housing dataset contains only 861 samples. If you build a model to predict
house sale prices using this dataset without cross-validation, you risk training on a dataset that is not
representative of the Boston housing stock. If you split the data only once into training and validation
subsets, the training fold may only contain data mainly from the suburbs. As a result, you would train on
data that isn't representative of the rest of the city. In this example, your model would likely overfit on
this biased selection. K-fold cross-validation can reduce the risk of this kind of error by making full and
randomized use of the available data for both training and validation.

Cross-validation can increase training times by an average of 20%. Training times may also increase
significantly for complex datasets.
Note
In HPO mode, you can see the training and validation metrics from each fold in your /aws/
sagemaker/TrainingJobs CloudWatch Logs. For more information about CloudWatch Logs,
see Log Amazon SageMaker Events with Amazon CloudWatch (p. 3284).

Ensembling mode
Note
Autopilot supports sample weights in ensembling mode. For the list of available metrics
supporting sample weights, see Autopilot metrics (p. 478).

In ensembling mode, cross-validation is performed regardless of dataset size. Customers can either
provide their own validation dataset and custom data split ratio, or let Autopilot split the dataset
automatically into an 80-20% split ratio. The training data is then split into k-folds for cross-validation,
where the value of k is determined by the AutoGluon engine. An ensemble consists of multiple machine
learning models, where each model is known as the base model. A single base model is trained on (k-1)
folds and makes out-of-fold predictions on the remaining fold. This process is repeated for all k folds,
and the out-of-fold (OOF) predictions are concatenated to form a single set of predictions. All base
models in the ensemble follow this same process of generating OOF predictions.

The following image depicts k-fold validation with k = 4 folds. Each fold is represented as a row. The
dark-toned boxes represent the parts of the data used in training. The remaining light-toned boxes
indicate the validation datasets.

In the upper part of the image, in each fold, the first base model makes predictions on the validation
dataset after training on the training datasets. At each subsequent fold, the datasets change roles. A
dataset that was previously used for training is now used for validation, and this also applies in reverse.
At the end of k folds, all of the predictions are concatenated to form a single set of predictions called an
out-of-fold (OOF) prediction. This process is repeated for each n base models.


The OOF predictions for each base model are then used as features to train a stacking model. The
stacking model learns the importance weights for each base model. These weights are used to combine
the OOF predictions to form the final prediction. Performance on the validation dataset determines
which base or stacking model is the best, and this model is returned as the final model.

The validation datasets for each fold are also used for hyperparameter tuning of all the base models
and the stacking model.
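
The following is a simplified sketch of out-of-fold stacking using scikit-learn. It illustrates the
idea only; it is not Autopilot's actual AutoGluon-based implementation, and the dataset and models
are placeholders.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Two base models; each produces out-of-fold (OOF) probability predictions,
# so every row is predicted by a model that never saw that row in training.
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]
oof_features = np.column_stack([
    cross_val_predict(model, X, y, cv=4, method="predict_proba")[:, 1]
    for model in base_models
])

# The stacking model learns how much weight to give each base model's predictions.
stacker = LogisticRegression()
stacker.fit(oof_features, y)
print("learned base-model weights:", stacker.coef_)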

Amazon SageMaker Autopilot model deployment and prediction

This Amazon SageMaker Autopilot guide includes steps for model deployment, setting up real-time
inference, and running inference with batch jobs.

After you train your SageMaker Autopilot models, you can deploy them to get predictions in one of two
ways:

1. Use Real-time inferencing (p. 483) to set up an endpoint and obtain predictions interactively.
2. Use Batch inferencing (p. 490) to make predictions in parallel on batches of observations on an
entire dataset.

Note
To avoid incurring unnecessary charges: After the endpoints and resources that were created
from model deployment are no longer needed, you can delete them. For information about pricing
of instances by Region, see Amazon SageMaker Pricing.

Real-time inferencing
Real-time inference is ideal for inference workloads where you have real-time, interactive, low
latency requirements. This section shows how you can use real-time inferencing to obtain predictions
interactively from your model.

To deploy the model that produced the best validation metric in an Autopilot experiment, you have
several options. For example, when using Autopilot in SageMaker Studio, you can deploy the model
automatically or manually. You can also use SageMaker APIs to manually deploy an Autopilot model.

The following tabs show three options for deploying your model. These instructions assume that you
have already created a model in Autopilot. If you don't have a model, see Create an Amazon SageMaker
Autopilot experiment (p. 470). To see examples for each option, open each tab.

Deploy using the Autopilot User Interface (UI)


The Autopilot UI contains helpful dropdown menus, toggles, tooltips, and more to help you navigate
through model deployment. You can deploy using either one of the following procedures: Automatic or
Manual.


• Automatic Deployment: To automatically deploy the best model from an Autopilot experiment to an
endpoint
1. Create an experiment in SageMaker Studio.
2. Toggle the Auto deploy value to Yes.
Note
Automatic deployment will fail if either the default resource quota or your customer
quota for endpoint instances in a Region is too limited. In hyperparameter optimization
(HPO) mode, you are required to have at least two ml.m5.2xlarge instances. In ensembling
mode, you are required to have at least one ml.m5.12xlarge instance. If you encounter a
failure related to quotas, you can request a service limit increase for SageMaker endpoint
instances.
• Manual Deployment: To manually deploy the best model from an Autopilot experiment to an
endpoint
1. Create an experiment in SageMaker Studio.
2. Toggle the Auto deploy value to No.
3. Select the model that you want to deploy under Model name.
4. Select the orange Deployment and advanced settings button located on the right of the
leaderboard. This opens a new tab.
5. Configure the endpoint name, instance type, and other optional information.
6. Select the orange Deploy model button to deploy to an endpoint.
7. Check the progress of the endpoint creation process at https://console.aws.amazon.com/sagemaker/
by navigating to the Endpoints section. That section is located in the Inference
dropdown menu in the navigation panel.
8. After the endpoint status changes from Creating to InService, return to Studio and
invoke the endpoint.

Deploy using SageMaker APIs


You can also obtain real-time inference by deploying your model using API calls. This section shows the
steps of this process using AWS Command Line Interface (AWS CLI) code snippets.

For complete code examples for both AWS CLI commands and AWS SDK for Python (Boto3), open the
tabs directly following these steps.

1. Obtain candidate definitions

Obtain the candidate container definitions from InferenceContainers. These candidate definitions are
used to create a SageMaker model.

The following example uses the DescribeAutoMLJob API to obtain candidate definitions for the best
model candidate. See the following AWS CLI command as an example.

aws sagemaker describe-auto-ml-job --auto-ml-job-name <job-name> --region <region>

2. List candidates

The following example uses the ListCandidatesForAutoMLJob API to list all candidates. See the
following AWS CLI command as an example.


aws sagemaker list-candidates-for-auto-ml-job --auto-ml-job-name <job-name> --region <region>

3. Create a SageMaker model

Use the container definitions from the previous steps to create a SageMaker model by using the
CreateModel API. See the following AWS CLI command as an example.

aws sagemaker create-model --model-name '<your-custom-model-name>' \
    --containers '[<container-definition1>, <container-definition2>, <container-definition3>]' \
    --execution-role-arn '<execution-role-arn>' --region '<region>'

4. Create an endpoint configuration

The following example uses the CreateEndpointConfig API to create an endpoint configuration. See
the following AWS CLI command as an example.

aws sagemaker create-endpoint-config --endpoint-config-name '<your-custom-endpoint-config-name>' \
    --production-variants '<list-of-production-variants>' \
    --region '<region>'

5. Create the endpoint

The following AWS CLI example uses the CreateEndpoint API to create the endpoint.

aws sagemaker create-endpoint --endpoint-name '<your-custom-endpoint-name>' \
    --endpoint-config-name '<endpoint-config-name-you-just-created>' \
    --region '<region>'

Check the progress of your endpoint deployment by using the DescribeEndpoint API. See the
following AWS CLI command as an example.

aws sagemaker describe-endpoint --endpoint-name '<endpoint-name>' --region <region>

After the EndpointStatus changes to InService, the endpoint is ready to use for real-time
inference.
6. Invoke the endpoint

The following command structure invokes the endpoint for real-time inferencing.

aws sagemaker-runtime invoke-endpoint --endpoint-name '<endpoint-name>' \
    --region '<region>' --body '<your-data>' [--content-type '<content-type>'] <outfile>

The following tabs contain complete code examples for deploying a model with AWS SDK for Python
(Boto3) or the AWS CLI.

AWS SDK for Python (Boto3)

1. Obtain the candidate definitions by using the following code example.

import boto3

sagemaker_client = boto3.client('sagemaker', region_name='us-west-2')

job_name = 'test-auto-ml-job'

describe_response = sagemaker_client.describe_auto_ml_job(AutoMLJobName=job_name)
# extract the best candidate definition from the DescribeAutoMLJob response
best_candidate = describe_response['BestCandidate']
# extract the InferenceContainers definition from the candidate definition
inference_containers = best_candidate['InferenceContainers']

2. Create the model by using the following code example.

# Create Model
model_name = 'test-model'
sagemaker_role = 'arn:aws:iam::444455556666:role/sagemaker-execution-role'
create_model_response = sagemaker_client.create_model(
ModelName = model_name,
ExecutionRoleArn = sagemaker_role,
Containers = inference_containers
)

3. Create the endpoint configuration by using the following code example.

endpoint_config_name = 'test-endpoint-config'

instance_type = 'ml.m5.2xlarge'
# for all supported instance types, see
# https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html#sagemaker-Type-ProductionVariant-InstanceType

# Create endpoint config

endpoint_config_response = sagemaker_client.create_endpoint_config(
EndpointConfigName=endpoint_config_name,
ProductionVariants=[
{
"VariantName": "variant1",
"ModelName": model_name,
"InstanceType": instance_type,
"InitialInstanceCount": 1
}
]
)

print(f"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}")

4. Create the endpoint and deploy the model with the following code example.

# create endpoint and deploy the model


endpoint_name = 'test-endpoint'
create_endpoint_response = sagemaker_client.create_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=endpoint_config_name)
print(create_endpoint_response)

Check the status of the endpoint creation by using the following code example.

# describe endpoint creation status
status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


5. Invoke the endpoint for real-time inferencing by using the following command structure.

# once the endpoint status is InService, you can invoke the endpoint for inferencing
if status == "InService":
    sm_runtime = boto3.Session().client('sagemaker-runtime')
    inference_result = sm_runtime.invoke_endpoint(EndpointName='test-endpoint',
                                                  ContentType='text/csv', Body='1,2,3,4,class')
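
The InvokeEndpoint response returns the prediction in a streaming Body field. The following is a minimal sketch of reading it, assuming the endpoint returns CSV output.

# Body is a botocore StreamingBody; read and decode it to obtain the prediction
prediction = inference_result['Body'].read().decode('utf-8')
print(prediction)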

AWS Command Line Interface (AWS CLI)

1. Obtain the candidate definitions by using the following code example.

aws sagemaker describe-auto-ml-job --auto-ml-job-name 'test-automl-job' --region us-west-2

2. Create the model by using the following code example.

aws sagemaker create-model --model-name 'test-sagemaker-model' \
    --containers '[{
        "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:2.5-1-cpu-py3",
        "ModelDataUrl": "s3://DOC-EXAMPLE-BUCKET/output/model.tar.gz",
        "Environment": {
            "AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF": "1",
            "AUTOML_TRANSFORM_MODE": "feature-transform",
            "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "application/x-recordio-protobuf",
            "SAGEMAKER_PROGRAM": "sagemaker_serve",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code"
        }
    }, {
        "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3",
        "ModelDataUrl": "s3://DOC-EXAMPLE-BUCKET/output/model.tar.gz",
        "Environment": {
            "MAX_CONTENT_LENGTH": "20971520",
            "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
            "SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
            "SAGEMAKER_INFERENCE_SUPPORTED": "predicted_label,probability,probabilities"
        }
    }, {
        "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:2.5-1-cpu-py3",
        "ModelDataUrl": "s3://DOC-EXAMPLE-BUCKET/output/model.tar.gz",
        "Environment": {
            "AUTOML_TRANSFORM_MODE": "inverse-label-transform",
            "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
            "SAGEMAKER_INFERENCE_INPUT": "predicted_label",
            "SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
            "SAGEMAKER_INFERENCE_SUPPORTED": "predicted_label,probability,labels,probabilities",
            "SAGEMAKER_PROGRAM": "sagemaker_serve",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code"
        }
    }]' \
    --execution-role-arn 'arn:aws:iam::1234567890:role/sagemaker-execution-role' \
    --region 'us-west-2'

For additional details, see creating a model.

The create model command will return a response in the following format.


{
"ModelArn": "arn:aws:sagemaker:us-west-2:1234567890:model/test-sagemaker-model"
}

3. Create an endpoint configuration by using the following code example.

aws sagemaker create-endpoint-config --endpoint-config-name 'test-endpoint-config' \
    --production-variants '[{"VariantName": "variant1",
        "ModelName": "test-sagemaker-model",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.m5.2xlarge"
    }]' \
    --region us-west-2

The create endpoint configuration command will return a response in the following format.

{
"EndpointConfigArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint-config/
test-endpoint-config"
}

4. Create an endpoint by using the following code example.

aws sagemaker create-endpoint --endpoint-name 'test-endpoint' \
    --endpoint-config-name 'test-endpoint-config' \
    --region us-west-2

The create endpoint command will return a response in the following format.

{
"EndpointArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint/test-endpoint"
}

Check the progress of the endpoint deployment by using the following describe-endpoint CLI
code example.

aws sagemaker describe-endpoint --endpoint-name 'test-endpoint' --region us-west-2

The previous progress check will return a response in the following format.

{
"EndpointName": "test-endpoint",
"EndpointArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint/test-endpoint",
"EndpointConfigName": "test-endpoint-config",
"EndpointStatus": "Creating",
"CreationTime": 1660251167.595,
"LastModifiedTime": 1660251167.595
}

After the EndpointStatus changes to InService, the endpoint is ready for use in real-time
inference.
5. Invoke the endpoint for real-time inferencing by using the following command structure.

aws sagemaker-runtime invoke-endpoint --endpoint-name 'test-endpoint' \
    --region 'us-west-2' \
    --body '1,51,3.5,1.4,0.2' \
    --content-type 'text/csv' \
    '/tmp/inference_output'

For more options, see invoking an endpoint.

Deploy models from different accounts


You can deploy an Autopilot model from a different account than the original account that a model
was generated in. To implement cross-account model deployment, this section shows how to do the
following:

1. Grant permission to the deploying account

To assume the role in the generating account, you must grant permission to the deploying account.
This allows the deploying account to describe Autopilot jobs in the generating account.

The following example uses a generating account with a trusted sagemaker-role entity. The
example shows how to give a deploying account with the ID 111122223333 permission to assume the
role of the generating account.

"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"sagemaker.amazonaws.com"
],
"AWS": [ "111122223333"]
},
"Action": "sts:AssumeRole"
}

The new account with the ID 111122223333 can now assume the role for the generating account.

Next, call the DescribeAutoMLJob API from the deploying account to obtain a description of the job
created by the generating account.

The following code example describes the model from the deploying account.

import boto3

sts_client = boto3.client('sts')

role = 'arn:aws:iam::111122223333:role/sagemaker-role'
role_session_name = "role-session-name"
_assumed_role = sts_client.assume_role(RoleArn=role, RoleSessionName=role_session_name)

credentials = _assumed_role["Credentials"]
access_key = credentials["AccessKeyId"]
secret_key = credentials["SecretAccessKey"]
session_token = credentials["SessionToken"]

session = boto3.session.Session()

sm_client = session.client('sagemaker', region_name='us-west-2',
                           aws_access_key_id=access_key,
                           aws_secret_access_key=secret_key,
                           aws_session_token=session_token)

# now you can describe the AutoML job created in account A
job_name = "test-job"
response = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)

2. Grant access to the deploying account to the model artifacts in the generating account.

The deploying account only needs access to the model artifacts in the generating account in order to deploy the model. These are located in the S3OutputPath that was specified in the original CreateAutoMLJob API call during model generation.

To give the deploying account access to the model artifacts, choose one of the following options:
a. Give access to the ModelDataUrl from the generating account to the deploying account. (A sketch of an S3 bucket policy that grants this access follows this list.)

Next, give the deploying account permission to assume the role, then follow the real-time inferencing steps to deploy.
b. Copy the model artifacts from the generating account's original S3OutputPath to the deploying account.

To grant access to the model artifacts, you must define a best_candidate model and reassign the model containers to the new account.

The following example shows how to define a best_candidate model and reassign the
ModelDataUrl.

best_candidate = automl.describe_auto_ml_job()['BestCandidate']

# reassign the ModelDataUrl for each of the best candidate's containers
new_model_locations = ['new-container-1-ModelDataUrl', 'new-container-2-ModelDataUrl',
                       'new-container-3-ModelDataUrl']
for container, new_location in zip(best_candidate['InferenceContainers'], new_model_locations):
    container['ModelDataUrl'] = new_location

After this assignment of containers, follow the steps in Deploy using SageMaker APIs (p. 484) to
deploy.
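
For option (a), one way to grant the deploying account read access to the model artifacts is an S3 bucket policy on the generating account's artifact bucket. The following is a minimal sketch; the bucket name is a placeholder, and the resource paths should match your actual S3OutputPath.

import json
import boto3

# placeholder bucket that holds the generating account's model artifacts
artifact_bucket = 'DOC-EXAMPLE-BUCKET'

# allow the deploying account (111122223333) to read the artifacts
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{artifact_bucket}",
            f"arn:aws:s3:::{artifact_bucket}/*"
        ]
    }]
}

s3_client = boto3.client('s3')
s3_client.put_bucket_policy(Bucket=artifact_bucket, Policy=json.dumps(bucket_policy))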

To build a payload in real-time inferencing, see the notebook example to define a test payload. To create
the payload from a CSV file and invoke an endpoint, see the Predict with your model section in Create a
machine learning model automatically.

Batch inferencing
Batch inferencing, also known as offline inferencing, generates model predictions on a batch of
observations. Batch inference is a good option for large datasets or if you don't need an immediate
response to a model prediction request.

By contrast, online inference (real-time inferencing) generates predictions in real time.

You can make batch inferences from an Autopilot model using the SageMaker Python SDK, the Autopilot
user interface (UI), the AWS SDK for Python (Boto3), or the AWS Command Line Interface (AWS CLI).

The following tabs show three options for deploying your model: using APIs, using the Autopilot UI, or using APIs to deploy from a different account. These instructions assume that you have already created a model in Autopilot. If you don't have a model, see Create an Amazon SageMaker Autopilot experiment (p. 470). To see examples for each option, open each tab.

Deploy a model using Autopilot UI


The Autopilot UI contains helpful dropdown menus, toggles, tooltips, and more to help you navigate
through model deployment.

The following steps show how to deploy a model from an Autopilot experiment for batch predictions.

1. Sign in at https://console.aws.amazon.com/sagemaker/ and select Studio from the navigation pane.


2. On the left navigation pane, choose Studio.
3. Under Get started, select the Domain that you want to launch the Studio application in. If your user
profile only belongs to one Domain, you do not see the option for selecting a Domain.
4. Select the user profile that you want to launch the Studio application for. If there is no user profile in
the Domain, choose Create user profile. For more information, see Add and Remove User Profiles.
5. Choose Launch Studio. If the user profile belongs to a shared space, choose Open Spaces.
6. When the SageMaker Studio console opens, choose the Launch SageMaker Studio button.
7. Select AutoML from the left navigation pane.
8. Under Name, select the Autopilot experiment corresponding to the model that you want to deploy.
This opens a new AUTOPILOT JOB tab.
9. In the Model name section, select the model that you want to deploy.
10.Choose Deploy model. This opens a new tab.
11.Choose Make batch predictions at the top of the page.
12.For Batch transform job configuration, enter the Instance type, Instance count, and other optional information.
13.In the Input data configuration section, open the dropdown menu.
a. For S3 data type, choose ManifestFile or S3Prefix.
b. For Split type, choose Line, RecordIO, TFRecord or None.
c. For Compression, choose Gzip or None.
14.For S3 location, enter the Amazon S3 bucket location of the input data and other optional
information.
15.Under Output data configuration, enter the S3 bucket for the output data, and choose how to
assemble the output of your job.
a. For Additional configuration (optional), you can enter a MIME type and an S3 Encryption key.
16.For Input/output filtering and data joins (optional), you can enter a JSONPath expression to filter your input data, join the input source data with your output data, and filter your output data.
a. For examples for each type of filter, see the DataProcessing API.
17.To perform batch predictions on your input dataset, select Create batch transform job. A new Batch
Transform Jobs tab appears.
18.In the Batch Transform Jobs tab, locate the name of your job in the Status section, then check the progress of the job.

Deploy using SageMaker APIs


To use the SageMaker APIs for batch inferencing, there are three steps:

1. Obtain candidate definitions

Candidate definitions from InferenceContainers are used to create a SageMaker model.


The following example shows how to use the DescribeAutoMLJob API to obtain candidate definitions
for the best model candidate. See the following AWS CLI command as an example.

aws sagemaker describe-auto-ml-job --auto-ml-job-name <job-name> --region <region>

Use the ListCandidatesForAutoMLJob API to list all candidates. See the following AWS CLI command
as an example.

aws sagemaker list-candidates-for-auto-ml-job --auto-ml-job-name <job-name> --region <region>

2. Create a SageMaker model

To create a SageMaker model using the CreateModel API, use the container definitions from the
previous steps. See the following AWS CLI command as an example.

aws sagemaker create-model --model-name '<your-custom-model-name>' \
    --containers '[<container-definition1>, <container-definition2>, <container-definition3>]' \
    --execution-role-arn '<execution-role-arn>' --region '<region>'

3. Create a SageMaker transform job

The following example creates a SageMaker transform job with the CreateTransformJob API. See the
following AWS CLI command as an example.

aws sagemaker create-transform-job --transform-job-name '<your-custom-transform-job-name>' \
    --model-name '<your-custom-model-name-from-last-step>' \
    --transform-input '{
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "<your-input-data>"
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line"
    }' \
    --transform-output '{
        "S3OutputPath": "<your-output-path>",
        "AssembleWith": "Line"
    }' \
    --transform-resources '{
        "InstanceType": "<instance-type>",
        "InstanceCount": 1
    }' --region '<region>'

Check the progress of your transform job using the DescribeTransformJob API. See the following AWS
CLI command as an example.

aws sagemaker describe-transform-job --transform-job-name '<your-custom-transform-job-name>' --region <region>

After the job is finished, the predicted result will be available in <your-output-path>.

The output file name has the following format: <input_data_file_name>.out. As an example, if
your input file is text_x.csv, the output name will be text_x.csv.out.
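
After the job is complete, you can download the output file with any S3 client. The following is a minimal Boto3 sketch; the bucket and key are placeholders based on the example output path and the text_x.csv input file above.

import boto3

s3_client = boto3.client('s3')

# download the transform output that corresponds to the text_x.csv input file
s3_client.download_file(
    Bucket='DOC-EXAMPLE-BUCKET',      # placeholder output bucket
    Key='output/text_x.csv.out',      # <input_data_file_name>.out
    Filename='/tmp/text_x.csv.out'    # local destination
)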


The following tabs show code examples for SageMaker Python SDK, AWS SDK for Python (Boto3), and
the AWS CLI.

SageMaker Python SDK

The following example uses the SageMaker Python SDK to make predictions in batches.

import sagemaker
from sagemaker import AutoML

sagemaker_session = sagemaker.session.Session()

job_name = 'test-auto-ml-job'  # your Autopilot job name

automl = AutoML.attach(auto_ml_job_name=job_name, sagemaker_session=sagemaker_session)
output_path = 's3://test-auto-ml-job/output'
input_data = 's3://test-auto-ml-job/test_X.csv'

# call the DescribeAutoMLJob API to get the best candidate definition
best_candidate = automl.describe_auto_ml_job()['BestCandidate']
best_candidate_name = best_candidate['CandidateName']

# create model
model = automl.create_model(name=best_candidate_name,
candidate=best_candidate)

# create transformer
transformer = model.transformer(instance_count=1,
instance_type='ml.m5.2xlarge',
assemble_with='Line',
output_path=output_path)

# do batch transform
transformer.transform(data=input_data,
split_type='Line',
content_type='text/csv',
wait=True)

AWS SDK for Python (Boto3)

The following example uses AWS SDK for Python (Boto3) to make predictions in batches.

import boto3

sm_client = boto3.client('sagemaker', region_name='us-west-2')

role = 'arn:aws:iam::1234567890:role/sagemaker-execution-role'
output_path = 's3://test-auto-ml-job/output'
input_data = 's3://test-auto-ml-job/test_X.csv'
job_name = 'test-auto-ml-job'  # your Autopilot job name

best_candidate = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)['BestCandidate']
best_candidate_containers = best_candidate['InferenceContainers']
best_candidate_name = best_candidate['CandidateName']

# create model
response = sm_client.create_model(
    ModelName=best_candidate_name,
    ExecutionRoleArn=role,
    Containers=best_candidate_containers
)

# launch transform job
response = sm_client.create_transform_job(
    TransformJobName=f'{best_candidate_name}-transform-job',
    ModelName=best_candidate_name,
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': input_data
            }
        },
        'ContentType': 'text/csv',
        'SplitType': 'Line'
    },
    TransformOutput={
        'S3OutputPath': output_path,
        'AssembleWith': 'Line',
    },
    TransformResources={
        'InstanceType': 'ml.m5.2xlarge',
        'InstanceCount': 1,
    },
)

The batch inference job returns a response in the following format.

{'TransformJobArn': 'arn:aws:sagemaker:us-west-2:1234567890:transform-job/test-
transform-job',
'ResponseMetadata': {'RequestId': '659f97fc-28c4-440b-b957-a49733f7c2f2',
'HTTPStatusCode': 200,
'HTTPHeaders': {'x-amzn-requestid': '659f97fc-28c4-440b-b957-a49733f7c2f2',
'content-type': 'application/x-amz-json-1.1',
'content-length': '96',
'date': 'Thu, 11 Aug 2022 22:23:49 GMT'},
'RetryAttempts': 0}}

AWS Command Line Interface (AWS CLI)

1. Obtain the candidate definitions by using the following code example.

aws sagemaker describe-auto-ml-job --auto-ml-job-name 'test-automl-job' --region us-west-2

2. Create the model by using the following code example.

aws sagemaker create-model --model-name 'test-sagemaker-model' \
    --containers '[{
        "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:2.5-1-cpu-py3",
        "ModelDataUrl": "s3://test-bucket/out/test-job1/data-processor-models/test-job1-dpp0-1-e569ff7ad77f4e55a7e549a/output/model.tar.gz",
        "Environment": {
            "AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF": "1",
            "AUTOML_TRANSFORM_MODE": "feature-transform",
            "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "application/x-recordio-protobuf",
            "SAGEMAKER_PROGRAM": "sagemaker_serve",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code"
        }
    }, {
        "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3",
        "ModelDataUrl": "s3://test-bucket/out/test-job1/tuning/flicdf10v2-dpp0-xgb/test-job1E9-244-7490a1c0/output/model.tar.gz",
        "Environment": {
            "MAX_CONTENT_LENGTH": "20971520",
            "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
            "SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
            "SAGEMAKER_INFERENCE_SUPPORTED": "predicted_label,probability,probabilities"
        }
    }, {
        "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:2.5-1-cpu-py3",
        "ModelDataUrl": "s3://test-bucket/out/test-job1/data-processor-models/test-job1-dpp0-1-e569ff7ad77f4e55a7e549a/output/model.tar.gz",
        "Environment": {
            "AUTOML_TRANSFORM_MODE": "inverse-label-transform",
            "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
            "SAGEMAKER_INFERENCE_INPUT": "predicted_label",
            "SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
            "SAGEMAKER_INFERENCE_SUPPORTED": "predicted_label,probability,labels,probabilities",
            "SAGEMAKER_PROGRAM": "sagemaker_serve",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code"
        }
    }]' \
    --execution-role-arn 'arn:aws:iam::1234567890:role/sagemaker-execution-role' \
    --region 'us-west-2'

3. Create the transform job by using the following code example.

aws sagemaker create-transform-job --transform-job-name 'test-transform-job' \
    --model-name 'test-sagemaker-model' \
    --transform-input '{
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://test-bucket/data.csv"
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line"
    }' \
    --transform-output '{
        "S3OutputPath": "s3://test-bucket/output/",
        "AssembleWith": "Line"
    }' \
    --transform-resources '{
        "InstanceType": "ml.m5.2xlarge",
        "InstanceCount": 1
    }' \
    --region 'us-west-2'

4. Check the progress of the transform job by using the following code example.

aws sagemaker describe-transform-job --transform-job-name 'test-transform-job' --region us-west-2

The following is the response from the transform job.

{
"TransformJobName": "test-tranform-job",
"TransformJobArn": "arn:aws:sagemaker:us-west-2:1234567890:transform-job/test-
tranform-job",
"TransformJobStatus": "InProgress",
"ModelName": "test-model",
"TransformInput": {
"DataSource": {
"S3DataSource": {

495
Amazon SageMaker Developer Guide
Explainability

"S3DataType": "S3Prefix",
"S3Uri": "s3://test-bucket/data.csv"
}
},
"ContentType": "text/csv",
"CompressionType": "None",
"SplitType": "Line"
},
"TransformOutput": {
"S3OutputPath": "s3://test-bucket/output/",
"AssembleWith": "Line",
"KmsKeyId": ""
},
"TransformResources": {
"InstanceType": "ml.m5.2xlarge",
"InstanceCount": 1
},
"CreationTime": 1662495635.679,
"TransformStartTime": 1662495847.496,
"DataProcessing": {
"InputFilter": "$",
"OutputFilter": "$",
"JoinSource": "None"
}
}

After the TransformJobStatus changes to Completed, you can check the inference result in
the S3OutputPath.

Deploy models from different accounts


To create a batch inferencing job in a different account than the one that the model was generated in, follow the instructions in Deploy models from different accounts (p. 489). Then you can create models and transform jobs by following the steps in Deploy using SageMaker APIs (p. 491).

Amazon SageMaker Autopilot explainability


Amazon SageMaker Autopilot uses tools provided by Amazon SageMaker Clarify to help explain how machine learning (ML) models make predictions. These tools can help ML modelers, developers, and other internal stakeholders understand model characteristics before deployment. Both consumers and regulators rely on transparency in machine learning in order to accept decisions made on model predictions. You can also use these tools to debug predictions provided by a model after it's deployed. The Autopilot explainability functionality uses a model-agnostic feature attribution approach, which you can use to understand why a model made a prediction after training, and to provide per-instance explanations during inference. This approach uses a scalable and efficient implementation of SHAP, based on the concept of a Shapley value from the field of cooperative game theory, which assigns each feature an importance value for a particular prediction.

You can use explanations for auditing and meeting regulatory requirements, building trust in the model,
supporting human decision-making, and debugging and improving model performance.

For additional information on Shapley values and baselines, see Feature Attributions that Use Shapley Values (p. 2094) and SHAP Baselines for Explainability (p. 2095).

For a guide to the Amazon SageMaker Clarify documentation, see Guide to the SageMaker Clarify
Documentation (p. 10).


Models generated by Amazon SageMaker Autopilot
This procedure describes how to share a model that you created in Amazon SageMaker Autopilot with
another user in SageMaker Canvas. It also shows how to view details about jobs that you've run.

Prerequisites
Before you begin this procedure, you must have created and run an Autopilot experiment. For
instructions, see Create an Amazon SageMaker Autopilot experiment (p. 470).

Share your Autopilot model


You can share your Autopilot model with another user in SageMaker Canvas. The other user can then
import your model and use it to generate predictions.

To share the model in the Autopilot user interface using a button, see the following section View model
details. The Share Model button is discussed in Step 6.

For more information about how to share a model, see Bring Your Own Model Into Canvas.

View model details


Autopilot generates details about the candidate models that you can obtain. These details include the
following:

• A plot of the aggregated SHAP values that indicate the importance of each feature. This helps explain your model's predictions.
• The summary statistics for various training and validation metrics, including the objective metric.
• A list of the hyperparameters used to train and tune the model.

To view model details after running an Autopilot job, follow these steps:

1. Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker Studio navigation menu.
2. Select the AutoML card from the main working area. This opens a new Autopilot tab.
3. In the Name section, select the Autopilot job that has the details that you want to examine. This
opens a new Autopilot job tab.
4. The Autopilot job panel lists the metric values including the Objective metric for each model
under Model name. The Best model is listed at the top of the list under Model name and is also
highlighted in the Models tab.

• To review model details, select the model that you are interested in and select View model
details. This opens a new Model Details tab.
5. The Model Details tab is divided into four subsections.

1. The top of the Explainability tab contains a plot of aggregated SHAP values that indicate the
importance of each feature. Following that are the metrics and hyperparameter values for this
model.
2. The Performance tab contains metrics statistics and a confusion matrix.
3. The Artifacts tab contains information about model inputs, outputs, and intermediate results.


4. The Network tab summarizes your network isolation and encryption choices.

Note
Feature importance and the information in the Performance tab are only generated for the Best model.

For more information about how the SHAP values help explain predictions based on feature
importance, see the whitepaper Understanding the model explainability. Additional information
is also available in the Amazon SageMaker Clarify Model Explainability (p. 2093) topic in the
SageMaker Developer Guide.
6. To share your Autopilot model with another SageMaker Canvas user, choose Share Model. That
button is located at the top right of the Model Details tab.

• In the Add Canvas users section, use the down arrow to select a SageMaker Canvas user.

View an Autopilot Model Performance Report


An Amazon SageMaker model quality report provides insights and quality information for the best
model candidate generated by an AutoML job. This includes information about the job details, model
problem type, objective function, and other information related to the problem type. This guide shows
how to view Amazon SageMaker Autopilot performance metrics graphically, or view metrics as raw data
in a JSON file.

For example, in classification problems, the model quality report includes the following:

• Confusion matrix
• Area under the receiver operating characteristic curve (AUC)
• Information to understand false positives and false negatives
• Tradeoffs between true positives and false positives
• Tradeoffs between precision and recall

Autopilot also provides performance metrics for all of your candidate models. These metrics are
calculated using all of the training data and are used to estimate model performance. The main working
area includes these metrics by default. The type of metric is determined by the type of problem being
addressed.

The following performance metrics are associated with the corresponding problem type:

• Regression: MAE, MSE, R2, RMSE
• Binary classification: Accuracy, AUC, BalancedAccuracy, F1, LogLoss, Precision, Recall
• Multiclass classification: Accuracy, BalancedAccuracy, F1macro, LogLoss, PrecisionMacro, RecallMacro

You can sort your model candidates with the relevant metric to help you select and deploy the model
that addresses your business needs. For definitions of these metrics, see the Autopilot candidate metrics
topic.
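
For example, you can list candidates sorted by their final objective metric value with the ListCandidatesForAutoMLJob API. The following is a minimal AWS SDK for Python (Boto3) sketch, assuming a job named test-auto-ml-job.

import boto3

sm_client = boto3.client('sagemaker', region_name='us-west-2')

# list candidates in descending order of the final objective metric value
response = sm_client.list_candidates_for_auto_ml_job(
    AutoMLJobName='test-auto-ml-job',
    SortBy='FinalObjectiveMetricValue',
    SortOrder='Descending'
)

for candidate in response['Candidates']:
    metric = candidate['FinalAutoMLJobObjectiveMetric']
    print(candidate['CandidateName'], metric['MetricName'], metric['Value'])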

To view a performance report from an Autopilot job, follow these steps:

1. Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker Studio navigation menu.
2. Select the AutoML card from the main working area. This opens a new Autopilot tab.


3. In the Name section, select the Autopilot job that has the details that you want to examine. This
opens a new Autopilot job tab.
4. The Autopilot job panel lists the metric values including the Objective metric for each model under
Model name. The Best model is listed at the top of the list under Model name and it is highlighted
in the Models tab.

• To review model details, select the model that you are interested in and select View model details. This opens a new Model Details tab.
5. Choose the Performance tab, located between the Explainability and Artifacts tabs.

a. On the top right section of the tab, select the down arrow on the Download Performance
Reports button.
b. The down arrow provides two options to view Autopilot performance metrics:

i. You can download a PDF of the performance report to view the metrics graphically.
ii. You can view metrics as raw data and download it as a JSON file.

For instructions on how to create and run an AutoML job in SageMaker Studio, see Create an Amazon
SageMaker Autopilot experiment (p. 470).

The performance report contains two sections. The first contains details about the Autopilot job that
produced the model. The second section contains a model quality report.

Autopilot Job details


To understand how a model was generated, it's helpful to get the job details about the Autopilot job that
produced the model.

These job details include the following information:

• Autopilot candidate name


• Autopilot job name
• Problem type
• Objective metric
• Optimization direction

Model quality report


Model quality information is generated by Autopilot model insights. The content of the generated report depends on the problem type it addressed: regression, binary classification, or multiclass classification. The report specifies the number of rows that were included in the evaluation dataset and the time at which the evaluation occurred.

Metrics tables
The first part of the model quality report contains metrics tables. These are appropriate for the type of
problem that the model addressed.

The following image is an example of a metrics table that Autopilot generates for a regression problem.
It shows the metric name, value, and standard deviation.


The following image is an example of a metrics table generated by Autopilot for a multiclass
classification problem. It shows the metric name, value, and standard deviation.

Graphical model performance information


The second part of the model quality report contains graphical information to help you evaluate model
performance. The contents of this section depend on the problem type used in modeling.

The area under the receiver operating characteristic curve (AUC ROC curve)

The AUC ROC curve represents the trade-off between true positive and false positive rates. The AUC ROC curve is an industry-standard accuracy metric used for binary classification models. AUC measures the ability of the model to predict a higher score for positive examples, as compared to negative examples. The AUC metric provides an aggregated measure of the model performance across all possible classification thresholds.

The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate that the machine
learning model is highly accurate. Values near 0.5 indicate that the model is performing no better than
guessing at random. AUC values close to 0 indicate that the model has learned the correct patterns, but
is making predictions that are as inaccurate as possible. Values near zero can indicate a problem with the
data. For more information about the AUC metric, see the Receiver operating characteristic article on
Wikipedia.

The following is an example of an AUC ROC curve graph to evaluate predictions made by a binary
classification model. The dashed thin line represents the AUC ROC curve that a model which classifies
no-better-than-random guessing would score, with an AUC score of 0.5. The curves of more accurate
classification models lie above this random baseline, where the rate of true positives exceeds the rate of
false positives. The AUC ROC curve representing the performance of the binary classification model is
the thicker solid line.
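
As an illustration of how the metric itself is computed, independent of Autopilot, the following sketch scores a set of toy predictions against their true labels using scikit-learn.

from sklearn.metrics import roc_auc_score

# toy example: true binary labels and the model's predicted scores
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# an AUC near 1.0 means the model ranks positives above negatives
print(roc_auc_score(y_true, y_score))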


The graph's components, the false positive rate (FPR) and the true positive rate (TPR), are defined as follows.

• Correct predictions
• True positive (TP): The predicted value is 1, and the true value is 1.
• True negative (TN): The predicted value is 0, and the true value is 0.
• Erroneous predictions
• False positive (FP): The predicted value is 1, but the true value is 0.
• False negative (FN): The predicted value is 0, but the true value is 1.

The false positive rate (FPR) measures the fraction of actual negatives that were falsely predicted as positives (FP), out of the sum of FP and true negatives (TN). The range is 0 to 1. A smaller value indicates better predictive accuracy.

• FPR = FP/(FP+TN)

The true positive rate (TPR) measures the fraction of actual positives that were correctly predicted as positives (TP), out of the sum of TP and false negatives (FN). The range is 0 to 1. A larger value indicates better predictive accuracy.

• TPR = TP/(TP+FN)
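
The following is a minimal sketch of computing both rates from raw confusion matrix counts. The counts used here are taken from the binary confusion matrix example later in this section.

def fpr_tpr(tp, fp, tn, fn):
    """Compute the false positive rate and true positive rate from raw counts."""
    fpr = fp / (fp + tn)  # fraction of actual negatives predicted as positive
    tpr = tp / (tp + fn)  # fraction of actual positives predicted as positive
    return fpr, tpr

fpr, tpr = fpr_tpr(tp=353, fp=33, tn=2817, fn=130)
print(f"FPR = {fpr:.3f}, TPR = {tpr:.3f}")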

Confusion matrix

A confusion matrix provides a way to visualize the accuracy of the predictions made by a model for binary and multiclass classification problems. The confusion matrix in the model quality report contains the following.

• The number and percentage of correct and incorrect predictions for the actual labels
• The number and percentage of accurate predictions on the diagonal from the upper-left to the lower-
right corner


• The number and percentage of inaccurate predictions on the diagonal from the upper-right to the
lower-left corner

The incorrect predictions on a confusion matrix are the confusion values.

The following screenshot is an example of a confusion matrix for a binary classification problem. It
contains the following information:

• The vertical axis is divided into two rows containing true and false actual labels.
• The horizontal axis is divided into two columns containing true and false labels that were predicted by
the model.
• The color bar assigns a darker tone to a larger number of samples to visually indicate the number of
values that were classified in each category.

In this example, the model correctly predicted 2817 actual false values and 353 actual true values. It incorrectly predicted 130 actual true values to be false and 33 actual false values to be true. The difference in tone indicates that the dataset is not balanced. The imbalance is because there are many more actual false labels than actual true labels.

The following screenshot is an example of a confusion matrix for a multiclass classification problem. The confusion matrix in the model quality report contains the following.

• The vertical axis is divided into three rows containing three different actual labels.
• The horizontal axis is divided into three columns containing labels that were predicted by the model.
• The color bar assigns a darker tone to a larger number of samples to visually indicate the number of
values that were classified in each category.


In the example below, the model correctly predicted 354 actual values for label f, 1094 values for label i, and 852 values for label m. The difference in tone indicates that the dataset is not balanced because there are many more labels for the value i than for f or m.

The confusion matrix in the model quality report can accommodate a maximum of 15 labels for multiclass classification problem types. If a row corresponding to a label shows a NaN value, it means that the validation dataset used to check model predictions does not contain data with that label.

Gain curve

In binary classification, a gain curve predicts the cumulative benefit of using a percentage of the dataset
to find a positive label. The gain value is calculated during training by dividing the cumulative number
of positive observations by the total number of positive observations in the data, at each decile. If the
classification model created during training is representative of the unseen data, you can use the gain
curve to predict the percentage of data that you must target to obtain a percentage of positive labels.
The greater the percentage of the dataset used, the higher the percentage of positive labels found.

In the following example graph, the gain curve is the line with changing slope. The straight line is the percentage of positive labels found by selecting a percentage of data from the dataset at random. Upon targeting 20% of the dataset, you would expect to find more than 40% of the positive labels. As an example, you might consider using a gain curve to determine your efforts in a marketing campaign. Using this example gain curve, to reach the 83% of people in a neighborhood who would purchase cookies, you would send an advertisement to about 60% of the neighborhood.


Lift curve

In binary classification, the lift curve illustrates the uplift of using a trained model to predict the
likelihood of finding a positive label compared to a random guess. The lift value is calculated during
training using the ratio of percentage gain to the ratio of positive labels at each decile. If the model
created during training is representative of the unseen data, use the lift curve to predict the benefit of
using the model over randomly guessing.

In the following example graph, the lift curve is the line with changing slope. The straight line is the
lift curve associated with selecting the corresponding percentage randomly from the dataset. Upon
targeting 40% of the dataset with your model's classification labels, you would expect to find about 1.7
times the number of the positive labels that you would have found by randomly selecting 40% of the
unseen data.
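
The following is a minimal sketch of how per-decile gain and lift could be computed from scored validation data. The label and score arrays are synthetic placeholders, and this is an illustration rather than Autopilot's internal implementation.

import numpy as np

rng = np.random.default_rng(0)

# synthetic validation labels and model scores
y_true = rng.binomial(1, 0.2, size=1000)
y_score = np.clip(y_true * 0.6 + rng.random(1000) * 0.5, 0, 1)

# sort observations by descending score, then accumulate positives per decile
order = np.argsort(-y_score)
sorted_labels = y_true[order]
total_positives = sorted_labels.sum()

for decile in range(1, 11):
    cutoff = int(len(sorted_labels) * decile / 10)
    cumulative_positives = sorted_labels[:cutoff].sum()
    gain = cumulative_positives / total_positives  # fraction of positives captured
    lift = gain / (decile / 10)                    # gain relative to random selection
    print(f"decile {decile}: gain={gain:.2f}, lift={lift:.2f}")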


Precision-recall curve

The precision-recall curve represents the tradeoff between precision and recall for binary classification
problems.

Precision measures the fraction of actual positives (TP) out of all positive predictions (TP and false positives, FP). The range is 0 to 1. A larger value indicates better accuracy in the predicted values.

• Precision = TP/(TP+FP)

Recall measures the fraction of actual positives that are predicted as positive (TP) out of all actual positives (TP and false negatives, FN). Recall is also known as sensitivity and as the true positive rate. The range is 0 to 1. A larger value indicates better detection of positive values from the sample.

• Recall = TP/(TP+FN)
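
Using the counts from the binary confusion matrix example earlier in this section (TP = 353, FP = 33, FN = 130), a quick sketch of both metrics:

tp, fp, fn = 353, 33, 130   # counts from the binary confusion matrix example

precision = tp / (tp + fp)  # fraction of positive predictions that are correct
recall = tp / (tp + fn)     # fraction of actual positives that are detected

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# prints precision = 0.915, recall = 0.731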

The objective of a classification problem is to correctly label as many elements as possible. A system with
high recall but low precision returns a high percentage of false positives.

The following graphic depicts a spam filter that marks every email as spam. It has high recall, but low
precision, because recall doesn't measure false positives.

Give more weight to recall over precision if your problem has a low penalty for false positive values, but
a high penalty for missing a true positive result. For example, detecting an impending collision in a self-
driving vehicle.


By contrast, a system with high precision, but low recall, returns a high percentage of false negatives.
A spam filter that marks every email as desirable (not spam) has high precision but low recall because
precision doesn't measure false negatives.

If your problem has a low penalty for false negative values, but a high penalty for missing a true negative result, give more weight to precision over recall. For example, flagging a suspicious filer for a tax audit.

The following graphic depicts a spam filter that has high precision but low recall, because precision
doesn't measure false negatives.

A model that makes predictions with both high precision and high recall produces a high number of
correctly labeled results. For more information, see Precision and recall article in Wikipedia.

Area under precision-recall curve (AUPRC)

For binary classification problems, Amazon SageMaker Autopilot includes a graph of the area under
the precision-recall curve (AUPRC). The AUPRC metric provides an aggregated measure of the model
performance across all possible classification thresholds and uses both precision and recall. AUPRC
does not take the number of true negatives into account. Therefore, it can be useful to evaluate model
performance in cases where there's a large number of true negatives in the data. For example, to model a
gene containing a rare mutation.

The following graphic is an example of an AUPRC graph. At one end of the curve, precision is at its highest value (1) and recall is 0. In the lower right corner of the graph, recall is at its highest value (1) and precision is 0. Between these two points, the AUPRC curve illustrates the tradeoff between precision and recall at different thresholds.


Actual against predicted plot

The actual against predicted plot shows the difference between actual and predicted model values. In
the following example graph, the solid line is a linear line of best fit. If the model were 100% accurate,
each predicted point would equal its corresponding actual point and lie on this line of best fit. The
distance away from the line of best fit is a visual indication of model error. The larger the distance away
from the line of best fit, the higher the model error.

Standardized residual plot

A standardized residual plot incorporates the following statistical terms:


residual

A (raw) residual shows the difference between actual values and the values predicted by your model. The larger the difference, the larger the residual value.
standard deviation

The standard deviation is a measure of how values vary from an average value. A high standard
deviation indicates that many values are very different from their average value. A low standard
deviation indicates that many values are close to their average value.
standardized residual

A standardized residual divides the raw residuals by their standard deviation. Standardized residuals
have units of standard deviation and are useful in identifying outliers in data regardless of the
difference in scale of the raw residuals. If a standardized residual is much smaller or larger than the
other standardized residuals, it indicates that the model is not fitting these observations well.

The standardized residual plot measures the strength of the difference between observed and expected values. The predicted values are displayed on the x-axis. A point with an absolute value larger than 3 is commonly regarded as an outlier.

The following example graph shows that a large number of standardized residuals are clustered around
0 on the horizontal axis. The values close to zero indicate that the model is fitting these points well. The
points towards the top and bottom of the plot are not predicted well by the model.
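
The following is a minimal sketch of computing standardized residuals and flagging outliers, using synthetic actual and predicted values with one deliberately poor prediction.

import numpy as np

rng = np.random.default_rng(0)

# synthetic actual values and model predictions with one badly fit point
actual = rng.normal(5.0, 1.0, size=200)
predicted = actual + rng.normal(0.0, 0.3, size=200)
predicted[0] = actual[0] + 4.0  # inject a poorly predicted observation

residuals = actual - predicted
standardized = residuals / residuals.std()

# points with an absolute standardized residual above 3 are commonly regarded as outliers
outliers = np.flatnonzero(np.abs(standardized) > 3)
print(f"outlier indices: {outliers}")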

Residual histogram

A residual histogram incorporates the following statistical terms:

residual

A (raw) residual shows the difference between actual values and the values predicted by your model. The larger the difference, the larger the residual value.
standard deviation

The standard deviation is a measure of how much values vary from an average value. A high
standard deviation indicates that many values are very different from their average value. A low
standard deviation indicates that many values are close to their average value.


standardized residual

A standardized residual divides the raw residuals by their standard deviation. Standardized residuals
have units of standard deviation. These are useful in identifying outliers in data regardless of the
difference in scale of the raw residuals. If a standardized residual is much smaller or larger than the
other standardized residuals, it would indicate that the model is not fitting these observations well.
histogram

A histogram is a graph that shows how often a value occurred.

The residual histogram shows the distribution of standardized residual values. A histogram distributed
in a bell shape and centered at zero indicates that the model does not systematically overpredict or
underpredict any particular range of target values.

In the following graphic, the standardized residual values indicate that the model is fitting the data well.
If the graph showed values far away from the center value, it would indicate that those values don't fit
the model well.

Amazon SageMaker Autopilot notebooks generated to manage AutoML tasks
Amazon SageMaker Autopilot manages the key tasks in an automatic machine learning (AutoML) process
using an AutoML job.

The AutoML job creates three notebook-based reports that describe the plan that Autopilot follows to generate candidate models. A candidate model consists of a (pipeline, algorithm) pair. First, there's a data exploration notebook that describes what Autopilot learned about the data that you provided. Second, there's a candidate definition notebook, which uses the information about the data to generate candidates. Third, there's a model insights report that details the performance characteristics of the best model on the leaderboard of an Autopilot experiment.

Topics


• Amazon SageMaker Autopilot Data exploration report (p. 510)


• Candidate definition notebook (p. 516)

You can run these notebooks in Amazon SageMaker, or locally, if you have installed the Amazon
SageMaker Python SDK. You can share the notebooks just like any other SageMaker Studio
notebook. The notebooks are created for you to conduct experiments. For example, you could edit the
following items in the notebooks:

• Preprocessors used on the data


• Number of hyperparameter optimization (HPO) runs and their parallelism
• Algorithms to try
• Instance types used for the HPO jobs
• Hyperparameter ranges

Modifications to the candidate definition notebook are encouraged as a learning tool. With this
capability, you learn how decisions made during the machine learning process impact your results.
Note
When you run the notebooks in your default instance, you incur baseline costs. However, when
you run HPO jobs from the candidate notebook, these jobs use additional compute resources
that incur additional costs.

Amazon SageMaker Autopilot Data exploration report
Amazon SageMaker Autopilot cleans and pre-processes your dataset automatically. High-quality data
improves machine learning efficiency and produces models that make more accurate predictions.

There are issues with customer-provided datasets that cannot be fixed automatically without the benefit
of some domain knowledge. Large outlier values in the target column for regression problems, for
example, may cause suboptimal predictions for the non-outlier values. Outliers may need to be removed
depending on the modeling objective. If a target column is included by accident as one of the input
features, the final model will validate well, but be of little value for future predictions.

To help customers discover these sorts of issues, Autopilot provides a data exploration report that
contains insights into potential issues with their data. The report also suggests how to handle the issues.

A data exploration notebook containing the report is generated for every Autopilot job. The report
is stored in an Amazon S3 bucket and can be accessed from your output path. The path of the data
exploration report usually adheres to the following pattern.

[s3 output path]/[name of the automl job]/sagemaker-automl-candidates/[name of processing job used for data analysis]/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb

The location of the data exploration notebook can be obtained from the Autopilot API using the
DescribeAutoMLJob operation response, which is stored in DataExplorationNotebookLocation.
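
The following is a minimal AWS SDK for Python (Boto3) sketch of retrieving that location, assuming a job named test-auto-ml-job.

import boto3

sm_client = boto3.client('sagemaker', region_name='us-west-2')

response = sm_client.describe_auto_ml_job(AutoMLJobName='test-auto-ml-job')

# the notebook's Amazon S3 location is returned under AutoMLJobArtifacts
notebook_location = response['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
print(notebook_location)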

When running Autopilot from SageMaker Studio, you can open the data exploration report using the
following steps:

1. Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker Studio navigation menu.


2. Select the AutoML card from the main working area. This opens a new Autopilot tab.
3. In the Name section, select the Autopilot job that has the data exploration notebook that you want
to examine. This opens a new Autopilot job tab.
4. Select Open data exploration notebook from the top right section of the Autopilot job tab.

The data exploration report is generated from your data before the training process begins. This allows
you to stop Autopilot jobs that might lead to meaningless results. Likewise, you can address any issues
or improvements with your dataset before rerunning Autopilot. This way, you can use your domain
expertise to improve the data quality manually, before you train a model on a better-curated dataset.

The data report contains only static markdown and can be opened in any Jupyter environment. The notebook that contains the report can be converted to other formats, such as PDF or HTML. For more information about conversions, see Using the nbconvert script to convert Jupyter notebooks to other formats.

Topics
• Dataset Summary (p. 511)
• Target Analysis (p. 511)
• Data Sample (p. 513)
• Duplicate rows (p. 514)
• Cross column correlations (p. 514)
• Anomalous Rows (p. 515)
• Missing values, cardinality, and descriptive statistics (p. 516)

Dataset Summary
This Dataset Summary provides key statistics characterizing your dataset, including the number of rows and columns, the percentage of duplicate rows, and missing target values. It is intended to provide you with a quick alert when there are issues with your dataset that Amazon SageMaker Autopilot has detected and that are likely to require your intervention. The insights are surfaced as warnings that are classified as being of either “high” or “low” severity. The classification depends on the level of confidence that the issue will adversely impact the performance of the model.

The high and low severity insights appear in the summary as pop-ups. For most of the insights,
recommendations are offered for how to confirm that there is an issue with the dataset that requires
your attention. Proposals are also provided for how to resolve the issues.

Autopilot provides additional statistics about missing or not valid target values in your dataset to help you detect other issues that may not be captured by high severity insights. An unexpected number of columns of a particular type might indicate that some columns that you want to use are missing from the dataset. It could also indicate that there was an issue with how the data was prepared or stored. Fixing these data problems brought to your attention by Autopilot can improve the performance of the machine learning models trained on your data.

High severity insights are shown in the summary section and in other relevant sections in the report.
Examples of high and low-severity insights are usually given depending on the section of the data report.

Target Analysis
Various high and low-severity insights are shown in this section related to the distribution of values in the target column. Check that the target column contains the correct values. Incorrect values in the target column will likely result in a machine learning model that doesn't serve the intended business purpose. Several data insights of high and low severity are present in this section. Here are several examples.


• Outlier target values - Skewed or unusual target distribution for regression, such as heavy tailed targets.
• High or low target cardinality - Too few class labels or a large number of unique classes for classification.

For both regression and classification problem types, not valid values such as numeric infinity, NaN, or empty space in the target column are surfaced. Depending on the problem type, different dataset statistics are presented. A distribution of target column values for a regression problem allows you to verify whether the distribution is what you expected.

The following screenshot shows an Autopilot data report, which includes statistics such as the mean, median, minimum, maximum, and percentage of outliers in your dataset. The screenshot also includes a histogram showing the distribution of labels in the target column. The histogram shows Target Column Values on the horizontal axis and Count on the vertical axis. A box highlights the Outliers Percentage section of the screenshot to indicate where this statistic appears.

Multiple statistics are shown regarding target values and their distribution. If any of the outliers,
not valid values, or missing percentages are greater than zero, these values are surfaced so you can
investigate why your data contains unusable target values. Some unusable target values are highlighted
as a low severity insight warning.

In the following screenshot, a ` symbol was accidentally added to the target column, which prevented the numeric value of the target from being parsed. A Low severity insight warning, "Invalid target values", appears. The warning in this example states "0.14% of the labels in the
target column could not be converted to numeric values. The most common non-numeric values are:
["-3.8e-05","-9-05","-4.7e-05","-1.4999999999999999e-05","-4.3e-05"]. That usually indicates that there
are problems with data collection or processing. Amazon SageMaker Autopilot ignores all observations
with invalid target label."

Autopilot also provides a histogram showing the distribution of labels for classification.

The following screenshot shows an example of the statistics given for your target column, including the number of classes and the number of missing or not valid values. A histogram with Target Label on the horizontal axis and Frequency on the vertical axis shows the distribution of each label category.


Note
You can find definitions of all the terms presented in this and other sections in the Definitions
section at the bottom of the report notebook.

Data Sample
Autopilot presents an actual sample of your data to help you spot issues with your dataset. The sample
table scrolls horizontally. Inspect the sample data to verify that all the necessary columns are present in
the dataset.

Autopilot also calculates a measure of prediction power that can be used to identify a linear or nonlinear
relationship between a feature and the target variable. A value of 0 indicates that the feature has no
predictive value in predicting the target variable. A value of 1 indicates the highest predictive power for
the target variable. For more information on predictive power, see the Definitions section.
Note
It is not recommended that you use prediction power as a substitute for feature importance.
Only use it if you're certain that prediction power is an appropriate measure for your use case.

The following screenshot shows an example data sample. The top row contains the prediction power of
each column in your dataset. The second row contains the column data type. Subsequent rows contain
the labels. The columns contain the target column followed by each feature column. Each feature
column has an associated prediction power, highlighted in this screenshot with a box. In this example,
the column containing the feature x51 has a predictive power of 0.68 for the target variable y. The
feature x55 is slightly less predictive with a prediction power of 0.59.


Duplicate rows
If duplicate rows are present in the dataset, Amazon SageMaker Autopilot displays a sample of them.
Note
It is not recommended to balance a dataset by up-sampling before providing it to Autopilot.
This may result in inaccurate validation scores for the models trained by Autopilot, and the
models that are produced may be unusable.

Cross column correlations


Autopilot uses the Pearson's correlation coefficient, a measure of linear correlation between two
features, to populate a correlation matrix. In the correlation matrix, numeric features are plotted on both
the horizontal and vertical axes, with the Pearson's correlation coefficient plotted at their intersections.
The stronger the correlation between two features, the larger the absolute value of the coefficient, with a maximum value of |1|.

• A value of -1 indicates that the features are perfectly negatively correlated.


• A value of 1, which occurs when a feature is correlated with itself, indicates perfect positive correlation.

You can use the information in the correlation matrix to remove highly correlated features. A smaller number of features reduces the chances of overfitting a model and can reduce production costs in two ways: it lessens the Autopilot runtime needed and, for some applications, can make data collection procedures cheaper.
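Although the report computes this matrix for you, the underlying measure is easy to reproduce when exploring your data. The following sketch (the file path and threshold are illustrative, and this is not the report's internal code) computes the same Pearson coefficients with pandas and surfaces highly correlated feature pairs:

import pandas as pd

# Load the dataset and keep only the numeric feature columns (path is illustrative)
df = pd.read_csv('dataset.csv').select_dtypes('number')

# Pearson correlation between every pair of numeric features
corr_matrix = df.corr(method='pearson')

# Surface feature pairs whose absolute correlation exceeds an illustrative threshold
high = corr_matrix.abs().stack()
print(high[(high > 0.9) & (high < 1.0)])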

The following screenshot shows an example of a correlation matrix between 7 features. Each feature
is displayed in a matrix on both the horizontal and vertical axes. The Pearson's correlation coefficient is
displayed at the intersection between two features. Each feature intersection has a color tone associated
with it. The higher the correlation, the darker the tone. The darkest tones occupy the diagonal of the
matrix, where each feature is correlated with itself, representing perfect correlation.


Anomalous Rows
Amazon SageMaker Autopilot detects which rows in your dataset might be anomalous. It then assigns an
anomaly score to each row. Rows with negative anomaly scores are considered anomalous.
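Autopilot does not expose the internal detector it uses, but the sign convention matches common anomaly detectors such as the isolation forest in scikit-learn, where decision_function returns negative scores for rows deemed anomalous. The following sketch illustrates that convention only; it is not Autopilot's implementation:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative: fit a detector on the numeric columns of your dataset
df = pd.read_csv('dataset.csv').select_dtypes('number').dropna()
detector = IsolationForest(random_state=0).fit(df)

# Negative scores flag rows considered anomalous
scores = detector.decision_function(df)
print(df[scores < 0].head())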

The following screenshot shows the output from an Autopilot analysis for rows containing anomalies. A
column containing an anomalous score appears next to the dataset columns for each row.


Missing values, cardinality, and descriptive statistics


Amazon SageMaker Autopilot examines and reports on properties of the individual columns of your dataset. In each section of the data report that presents this analysis, the content is arranged so that you can check the most “suspicious” values first. Using these statistics, you can improve the contents of individual columns and the quality of the model produced by Autopilot.

Autopilot calculates several statistics on the categorical values in columns that contain them. These
include the number of unique entries and, for text, the number of unique words.

Autopilot calculates several standard statistics on the numerical values in columns that contain them.
The following image depicts these statistics, including the mean, median, minimum and maximum
values, and the percentages of numerical types and of outlier values.

Candidate definition notebook


The candidate definition notebook contains each suggested preprocessing step, algorithm, and hyperparameter range.

You can choose which candidates to train and tune in two ways: by running individual sections of the notebook, or by running the entire notebook to optimize all candidates and identify the best one. If you run the entire notebook, only the best candidate is displayed after job completion.

To run Autopilot from SageMaker Studio, open the candidate definition notebook by following these
steps:

1. Choose the Home icon from the left navigation pane to view the top-level Amazon SageMaker Studio navigation menu.
2. Select the AutoML card from the main working area. This opens a new Autopilot tab.
3. In the Name section, select the Autopilot job that has the candidate definition notebook that you
want to examine. This opens a new Autopilot job tab.
4. Choose Open candidate generation notebook from the top right section of the Autopilot job tab.
This opens a new read-only preview of the Amazon SageMaker Autopilot Candidate Definition
Notebook.

To run the candidate definition notebook, follow these steps:

1. Choose Import notebook at the top right of the Amazon SageMaker Autopilot Candidate
Definition Notebook tab. This opens a tab to set up a new notebook environment to run the
notebook.
2. Select an existing SageMaker Image or use a Custom Image.


3. Select a Kernel, an Instance type, and an optional Start-up script.

You can now run the notebook in this new environment.

Configure inference output in generated containers
Amazon SageMaker Autopilot generates an ordered ContainerDefinition list. This can be used to
build a model to deploy in a machine learning pipeline. This model can be used for online hosting and
inference.

Customers can list inference container definitions with the ListCandidatesForAutoMLJob API.
The list of inference container definitions that represent the best candidate is also available in the
DescribeAutoMLJob response.
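For example, the following sketch retrieves the container definitions for the best candidate of a completed job (the Region and job name are placeholders):

import boto3

sm_client = boto3.client('sagemaker', region_name='<Region>')

best_candidate = sm_client.describe_auto_ml_job(
    AutoMLJobName='<AutoML Job Name>')['BestCandidate']

# One entry per container, in the order they run at inference time
for container in best_candidate['InferenceContainers']:
    print(container['Image'], container['ModelDataUrl'])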

Inference container definitions for regression and classification problem types
Autopilot generates inference containers specific to the training mode and the problem type of the job.

Container definitions for hyperparameter optimization (HPO) mode
• Regression: HPO generates two containers:
1. A feature engineering container that transforms the original features into features that the
regression algorithms can train on.
2. An algorithm container that transforms features and generates a regression score for the dataset.
• Classification: HPO generates three containers:
1. A feature engineering container that transforms the original features into features that the
classification algorithms can train on.
2. An algorithm container that generates the predicted_label with the highest probability. This
container can also produce the various probabilities associated with the classification outcomes in
the inference response.
3. A feature engineering container that performs post-processing of the algorithm prediction. For
example, it can perform an inverse transform on the predicted label and change it to the original
label.

Container definitions for ensembling mode


In ensembling mode, both regression and classification problem types have only one inference container.
This inference container transforms the features and generates the predictions based on problem type.

Inference responses per problem type


Inference responses for classification models
For classification inference containers, you can select the content of the inference response by using four
predefined keys:


• predicted_label: The label with the highest probability of predicting the correct label, as
determined by Autopilot.
• probability:
• HPO models: The probability of the True class for binary classification. The probability of the
predicted_label for multiclass classification.
• Ensemble models: The probability of the predicted_label for binary and multiclass
classification.
• probabilities: The list of probabilities for all corresponding classes.
• labels: The list of all labels.

For example, for a binary classification problem, if you pass the inference response keys
['predicted_label', 'probability', 'probabilities', 'labels'] and the output
response appears as [1, 0.1, "[0.9, 0.1]", "['1', '0']"], you should interpret it as follows:

1. predicted_label equals 1 because label "1" has a higher probability (0.9 in this case).
2. For HPO models, probability equals 0.1, which is the probability of the positive_class (0 in
this case) selected by Autopilot.

For Ensemble models, probability equals 0.9, which is the probability of the predicted_label.
3. probabilities lists the probability of each label in labels.
4. labels are the unique labels in the dataset, where the second label ("0" in this case) is the
positive_class selected by Autopilot.
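To illustrate this interpretation, the following sketch parses one such response line. It assumes the four keys were requested in the order shown above and that the batch output is assembled as one CSV line per record:

import ast
import csv
import io

# One output line matching the example above (illustrative)
line = '1,0.1,"[0.9, 0.1]","[\'1\', \'0\']"'

predicted_label, probability, probabilities, labels = next(csv.reader(io.StringIO(line)))
probabilities = ast.literal_eval(probabilities)   # [0.9, 0.1]
labels = ast.literal_eval(labels)                 # ['1', '0']
print(predicted_label, float(probability), probabilities, labels)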

By default, inference containers are configured to generate only the predicted_label. To select additional inference content, you can update the inference_response_keys parameter. The containers use up to three environment variables for this configuration:

• SAGEMAKER_INFERENCE_SUPPORTED: This is set to provide hints to you about what content each
container supports.
• SAGEMAKER_INFERENCE_INPUT: This should be set to the keys that the container expects in input
payload.
• SAGEMAKER_INFERENCE_OUTPUT: This should be populated with the set of keys that the container
outputs.
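Before setting these variables, you can inspect the hints that the generated containers already publish. The following sketch (the Region and job name are placeholders) prints each container's supported keys:

import boto3

sm_client = boto3.client('sagemaker', region_name='<Region>')
containers = sm_client.describe_auto_ml_job(
    AutoMLJobName='<AutoML Job Name>')['BestCandidate']['InferenceContainers']

for container in containers:
    print(container.get('Environment', {}).get('SAGEMAKER_INFERENCE_SUPPORTED'))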

Inference responses for classification models in HPO mode


This section shows how to configure the inference response from classification models using
hyperparameter optimization (HPO) mode.

To choose the inference response content in HPO mode: Add the SAGEMAKER_INFERENCE_INPUT and
SAGEMAKER_INFERENCE_OUTPUT variables to the second and third containers that are generated in
HPO mode for classification problems.

The keys supported by the second container (algorithm) are predicted_label, probability, and
probabilities. Note that labels is deliberately not added to SAGEMAKER_INFERENCE_SUPPORTED.

The keys supported by the third classification model container are predicted_label, labels,
probability, and probabilities. Therefore, the SAGEMAKER_INFERENCE_SUPPORTED environment variable includes the names of these keys.

To update the definition of the inference containers to receive predicted_label and probability,
use the following code example.


containers[1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label, probability'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})

The following code example updates the definition of the inference containers to receive
predicted_label, probabilities, and labels. Do not pass the labels to the second container
(the algorithm container), because it is generated by the third container independently.

containers[1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,probabilities'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label,probabilities'})
containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,probabilities,labels'})

The following collapsible sections provide code examples for AWS SDK for Python (Boto3) and for
SageMaker SDK for Python. Each section shows how to select the content of the inference responses in
HPO mode for the respective code example.

AWS SDK for Python (Boto3)

import boto3

sm_client = boto3.client('sagemaker', region_name='<Region>')

role = '<IAM role>'
input_data = '<S3 input uri>'
output_path = '<S3 output uri>'

best_candidate = sm_client.describe_auto_ml_job(AutoMLJobName='<AutoML Job Name>')['BestCandidate']
best_candidate_containers = best_candidate['InferenceContainers']
best_candidate_name = best_candidate['CandidateName']

best_candidate_containers[1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})
best_candidate_containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label, probability'})
best_candidate_containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})

# Create model
response = sm_client.create_model(
    ModelName='<Model Name>',
    ExecutionRoleArn=role,
    Containers=best_candidate_containers
)

# Launch transform job
response = sm_client.create_transform_job(
    TransformJobName='<Transform Job Name>',
    ModelName='<Model Name>',
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': input_data
            }
        },
        'ContentType': 'text/csv',
        'SplitType': 'Line'
    },
    TransformOutput={
        'S3OutputPath': output_path,
        'AssembleWith': 'Line',
    },
    TransformResources={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1,
    },
)

SageMaker SDK for Python

from sagemaker import AutoML

aml = AutoML.attach(auto_ml_job_name='<AutoML Job Name>')

aml_best_model = aml.create_model(name='<Model Name>',
                                  candidate=None,
                                  inference_response_keys=['probabilities', 'labels'])

aml_transformer = aml_best_model.transformer(accept='text/csv',
                                             assemble_with='Line',
                                             instance_type='ml.m5.xlarge',
                                             instance_count=1)

aml_transformer.transform('<S3 input uri>',
                          content_type='text/csv',
                          split_type='Line',
                          job_name='<Transform Job Name>',
                          wait=True)

Inference responses for classification models in ensembling mode
This section shows how to configure the inference response from classification models using ensembling
mode.

In ensembling mode, to choose the content of the inference response, update the
SAGEMAKER_INFERENCE_OUTPUT environment variable.

The keys supported by the classification model container are predicted_label, labels, probability, and probabilities. These keys are included in the SAGEMAKER_INFERENCE_SUPPORTED environment variable.

To update the inference container definition to receive predicted_label and probability, refer to
the following code example.

containers[0]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})

The following collapsible section provides a code example for selecting the content of the inference
responses in ensembling mode. The example uses AWS SDK for Python (Boto3).

AWS SDK for Python (Boto3)

import boto3

sm_client = boto3.client('sagemaker', region_name='<Region>')

role = '<IAM role>'
input_data = '<S3 input uri>'
output_path = '<S3 output uri>'

best_candidate = sm_client.describe_auto_ml_job(AutoMLJobName='<AutoML Job Name>')['BestCandidate']
best_candidate_containers = best_candidate['InferenceContainers']
best_candidate_name = best_candidate['CandidateName']

best_candidate_containers[0]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})

# Create model
response = sm_client.create_model(
    ModelName='<Model Name>',
    ExecutionRoleArn=role,
    Containers=best_candidate_containers
)

# Launch transform job
response = sm_client.create_transform_job(
    TransformJobName='<Transform Job Name>',
    ModelName='<Model Name>',
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': input_data
            }
        },
        'ContentType': 'text/csv',
        'SplitType': 'Line'
    },
    TransformOutput={
        'S3OutputPath': output_path,
        'AssembleWith': 'Line',
    },
    TransformResources={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1,
    },
)

The following collapsible section provides a code example that is identical to the SageMaker SDK for
Python example for HPO. It is included for your convenience.

SageMaker SDK for Python

The following HPO code example uses the SageMaker SDK for Python.

from sagemaker import AutoML

aml = AutoML.attach(auto_ml_job_name='<AutoML Job Name>')

aml_best_model = aml.create_model(name='<Model Name>',
                                  candidate=None,
                                  inference_response_keys=['probabilities', 'labels'])

aml_transformer = aml_best_model.transformer(accept='text/csv',
                                             assemble_with='Line',
                                             instance_type='ml.m5.xlarge',
                                             instance_count=1)

aml_transformer.transform('<S3 input uri>',
                          content_type='text/csv',
                          split_type='Line',
                          job_name='<Transform Job Name>',
                          wait=True)

Amazon SageMaker Autopilot quotas


There are quotas that limit the resources available to you when using Amazon SageMaker Autopilot. Some of these quotas can be increased and some cannot.
Note
The resource quotas documented in the following sections are valid for versions of Amazon
SageMaker Studio 3.22.2 and higher. For information on updating your version of SageMaker
Studio, see Shut Down and Update SageMaker Studio and Studio Apps (p. 198).

Topics
• Quotas that you can increase (p. 522)
• Resource quotas (p. 523)

Quotas that you can increase


There are default limits for the size of the input datasets, the file size of a single Parquet file (*), the target dataset size for subsampling (**), and the number of concurrent jobs that you can run with Amazon SageMaker Autopilot for each AWS account per AWS Region.

Resource limits

Resource                                 Regions                                          Default limits   Can be increased up to
Size of input dataset                    All                                              100 GB           Hundreds of GBs
Size of a single Parquet file*           All                                              2 GB             Tens of GBs
Target dataset size for subsampling**    All                                              5 GB             Hundreds of GBs
Number of concurrent Autopilot jobs      us-east-1, us-east-2, us-west-2,                 4                Hundreds
                                         ap-northeast-1, eu-west-1, eu-central-1
                                         ap-northeast-2, ap-southeast-2,                  2                Hundreds
                                         eu-west-2, ap-southeast-1
                                         All other Regions                                1                Tens

Note
*This 2 GB size limit is for a single compressed Parquet file. You can provide a Parquet dataset
that includes multiple compressed Parquet files. After the files are decompressed, they may
each expand to a larger size.
**Autopilot automatically subsamples input datasets that are larger than the target dataset size
while accounting for class imbalance and preserving rare class labels.


You can increase these limits by contacting AWS Support.

To request a quota increase:

1. Open the AWS Support Center page, sign in if necessary, and then choose Create case.
2. On the Create case page, choose Service limit increase.
3. In the Case details panel, select SageMaker AutoML for the Limit Type.
4. On the Requests panel for Request 1, select the Region, the resource Limit to increase, and the New
Limit value that you are requesting. If you have additional requests for quota increases, select Add
another request.

5. Provide your preferred Contact options and choose Submit.

Resource quotas
The following table contains the runtime resource limits for an Amazon SageMaker Autopilot job in an
AWS Region.

Resource limits per Autopilot job

Resource                                 Limit per Autopilot job
Maximum runtime for an Autopilot job     30 days
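You cannot raise this 30-day ceiling, but you can enforce a shorter limit for an individual job through its completion criteria. The following sketch (all values and names are placeholders) caps a job at one hour:

import boto3

sm_client = boto3.client('sagemaker', region_name='<Region>')

sm_client.create_auto_ml_job(
    AutoMLJobName='<AutoML Job Name>',
    InputDataConfig=[{
        'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': '<S3 input uri>'}},
        'TargetAttributeName': '<target column>'
    }],
    OutputDataConfig={'S3OutputPath': '<S3 output uri>'},
    RoleArn='<IAM role>',
    AutoMLJobConfig={
        'CompletionCriteria': {'MaxAutoMLJobRuntimeInSeconds': 3600}  # illustrative 1-hour cap
    }
)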


API Reference guide for Amazon SageMaker Autopilot
This section provides a subset of the HTTP service APIs for creating and managing Amazon SageMaker
Autopilot resources (AutoML jobs) programmatically.

For information on the entire SageMaker REST APIs and the available SDKs, see API and SDK Reference.

If your language of choice is Python, you can also refer to Amazon SageMaker Python SDK directly or
AWS SDK for Python (Boto3).

Actions

This list details the operations available in the Reference API to manage AutoML jobs programmatically. A minimal usage sketch follows the list.

• CreateAutoMLJob
• DescribeAutoMLJob
• ListAutoMLJobs
• ListCandidatesForAutoMLJob
• StopAutoMLJob
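The following sketch shows a minimal monitoring flow with the AWS SDK for Python (Boto3); the Region and job name are placeholders:

import boto3

sm_client = boto3.client('sagemaker', region_name='<Region>')

# Check the job status and inspect its candidates
status = sm_client.describe_auto_ml_job(AutoMLJobName='<AutoML Job Name>')['AutoMLJobStatus']
candidates = sm_client.list_candidates_for_auto_ml_job(AutoMLJobName='<AutoML Job Name>')['Candidates']
print(status, len(candidates))

# Stop the job if it is still running and the partial results are sufficient
if status == 'InProgress':
    sm_client.stop_auto_ml_job(AutoMLJobName='<AutoML Job Name>')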

Data Types

This list details the API AutoML objects used by the actions above as inbound requests or outbound
responses.

• AutoMLAlgorithmConfig
• AutoMLCandidate
• AutoMLCandidateGenerationConfig
• AutoMLCandidateStep
• AutoMLChannel
• AutoMLContainerDefinition
• AutoMLDataSource
• AutoMLDataSplitConfig
• AutoMLJobArtifacts
• AutoMLJobCompletionCriteria
• AutoMLJobConfig
• AutoMLJobObjective
• AutoMLJobStepMetadata
• AutoMLJobSummary
• AutoMLOutputDataConfig
• AutoMLPartialFailureReason
• AutoMLS3DataSource
• AutoMLSecurityConfig
• CandidateArtifactLocations
• CandidateProperties
• FinalAutoMLJobObjectiveMetric
• MetricDatum


• ModelDeployConfig
• ModelDeployResult
• ResolvedAttributes
• TuningJobCompletionCriteria


Label Data
To train a machine learning model, you need a large, high-quality, labeled dataset. You can label your
data using Amazon SageMaker Ground Truth. Choose from one of the Ground Truth built-in task types or
create your own custom labeling workflow. To improve the accuracy of your data labels and reduce the
total cost of labeling your data, use Ground Truth enhanced data labeling features like automated data
labeling and annotation consolidation.

Topics
• Use Amazon SageMaker Ground Truth to Label Data (p. 526)
• Use Amazon SageMaker Ground Truth Plus to Label Data (p. 844)
• Use Amazon SageMaker Ground Truth Synthetic Data to Generate and Label Data (p. 855)
• Create and Manage Workforces (p. 863)
• Crowd HTML Elements Reference (p. 889)

Use Amazon SageMaker Ground Truth to Label Data
To train a machine learning model, you need a large, high-quality, labeled dataset. Ground Truth helps
you build high-quality training datasets for your machine learning models. With Ground Truth, you can
use workers from either Amazon Mechanical Turk, a vendor company that you choose, or an internal,
private workforce along with machine learning to enable you to create a labeled dataset. You can use the
labeled dataset output from Ground Truth to train your own models. You can also use the output as a
training dataset for an Amazon SageMaker model.

Depending on your ML application, you can choose from one of the Ground Truth built-in task types
to have workers generate specific types of labels for your data. You can also build a custom labeling
workflow to provide your own UI and tools to workers labeling your data. To learn more about the
Ground Truth built-in task types, see Built-in Task Types (p. 704). To learn how to create a custom
labeling workflow, see Creating Custom Labeling Workflows (p. 671).

To automate labeling of your training dataset, you can optionally use automated data labeling,
a Ground Truth process that uses machine learning to decide which data needs to be labeled by
humans. Automated data labeling may reduce the labeling time and manual effort required. For more
information, see Automate Data Labeling (p. 807). To create a custom labeling workflow, see Creating
Custom Labeling Workflows (p. 671).

Use either pre-built or custom tools to assign the labeling tasks for your training dataset. A labeling UI
template is a webpage that Ground Truth uses to present tasks and instructions to your workers. The
SageMaker console provides built-in templates for labeling data. You can use these templates to get
started, or you can build your own tasks and instructions by using our HTML 2.0 components. For more
information, see Creating Custom Labeling Workflows (p. 671).

Use the workforce of your choice to label your dataset. You can choose your workforce from:

• The Amazon Mechanical Turk workforce of over 500,000 independent contractors worldwide.


• A private workforce that you create from your employees or contractors for handling data within your
organization.
• A vendor company that you can find in the AWS Marketplace that specializes in data labeling services.

For more information, see Create and Manage Workforces (p. 863).

You store your datasets in Amazon S3 buckets. The buckets contain three things: the data to be labeled, an input manifest file that Ground Truth uses to read the data files, and an output manifest file. The output file contains the results of the labeling job. For more information, see Use Input and Output Data (p. 734).

Events from your labeling jobs appear in Amazon CloudWatch under the /aws/sagemaker/LabelingJobs group. CloudWatch uses the labeling job name as the name for the log stream.
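For example, the following sketch lists the log streams in that group with the AWS SDK for Python (Boto3); the Region is a placeholder:

import boto3

logs_client = boto3.client('logs', region_name='<Region>')

response = logs_client.describe_log_streams(logGroupName='/aws/sagemaker/LabelingJobs')
for stream in response['logStreams']:
    print(stream['logStreamName'])  # one stream per labeling job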

Are You a First-time User of Ground Truth?


If you are a first-time user of Ground Truth, we recommend that you do the following:

1. Read Getting started (p. 527)—This section walks you through setting up your first Ground Truth
labeling job.
2. Explore other topics—Depending on your needs, do the following:
• Explore built-in task types— Use built-in task types to streamline the process of creating a labeling
job. See Built-in Task Types (p. 704) to learn more about Ground Truth built-in task types.
• Manage your labeling workforce—Create new work teams and manage your existing workforce.
For more information, see Create and Manage Workforces (p. 863).
• Learn about streaming labeling jobs— Create a streaming labeling job and send new dataset
objects to workers in real time using a perpetually running labeling job. Workers continuously
receive new data objects to label as long as the labeling job is active and new objects are being sent
to it. To learn more, see Ground Truth Streaming Labeling Jobs (p. 738).
3. See the Reference—This section describes operations to automate Ground Truth operations.

Getting started
This video shows you how to set up and use Amazon SageMaker Ground Truth. (Length: 9:37)

To get started using Amazon SageMaker Ground Truth, follow the instructions in the following sections.
The sections here explain how to use the console to create a labeling job, assign a public or private
workforce, and send the labeling job to your workforce. You can also learn how to monitor the progress
of a labeling job.

If you want to create a custom labeling workflow, see Creating Custom Labeling Workflows (p. 671) for
instructions.

Before you create a labeling job, you must upload your dataset to an Amazon S3 bucket. For more
information, see Use Input and Output Data (p. 734).

Topics
• Step 1: Before You Begin (p. 528)
• Step 2: Create a Labeling Job (p. 528)
• Step 3: Select Workers (p. 529)
• Step 4: Configure the Bounding Box Tool (p. 531)


• Step 5: Monitoring Your Labeling Job (p. 532)

Step 1: Before You Begin


Before you begin using the SageMaker console to create a labeling job, you must set up the dataset for use, as follows:

1. Save two images at publicly available HTTP URLs. The images are used when creating instructions
for completing a labeling task. The images should have an aspect ratio of around 2:1. For this
exercise, the content of the images is not important.
2. Create an Amazon S3 bucket to hold the input and output files. The bucket must be in the same
Region where you are running Ground Truth. Make a note of the bucket name because you use it
during step 2.

Ground Truth requires all S3 buckets that contain labeling job input image data have a CORS policy
attached. To learn more about this change, see CORS Permission Requirement (p. 816).
3. You can create an IAM role or let SageMaker create a role with the AmazonSageMakerFullAccess IAM
policy. Refer to Creating IAM roles and assign the following permissions policy to the user that is
creating the labeling job:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "sagemakergroundtruth",
            "Effect": "Allow",
            "Action": [
                "cognito-idp:CreateGroup",
                "cognito-idp:CreateUserPool",
                "cognito-idp:CreateUserPoolDomain",
                "cognito-idp:AdminCreateUser",
                "cognito-idp:CreateUserPoolClient",
                "cognito-idp:AdminAddUserToGroup",
                "cognito-idp:DescribeUserPoolClient",
                "cognito-idp:DescribeUserPool",
                "cognito-idp:UpdateUserPool"
            ],
            "Resource": "*"
        }
    ]
}

Next
Step 2: Create a Labeling Job (p. 528)

Step 2: Create a Labeling Job


In this step you use the console to create a labeling job. You tell Amazon SageMaker Ground Truth the
Amazon S3 bucket where the manifest file is stored and configure the parameters for the job. For more
information about storing data in an Amazon S3 bucket, see Use Input and Output Data (p. 734).

To create a labeling job

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
2. From the left navigation, choose Labeling jobs.


3. Choose Create labeling job to start the job creation process.


4. In the Job overview section, provide the following information:

• Job name – Give the labeling job a name that describes the job. This name is shown in your job
list. The name must be unique in your account in an AWS Region.
• Label attribute name – Leave this unchecked as the default value is the best option for this
introductory job.
• Input data setup – Select Automated data setup. This option allows you to automatically connect
to your input data in S3.
• S3 location for input datasets – Enter the S3 location where you added the images in step 1.
• S3 location for output datasets – The location where your output data is written in S3.
• Data type – Use the drop down menu to select Image. Ground Truth will use all images found in
the S3 location for input datasets as input for your labeling job.
• IAM role – Create or choose an IAM role with the AmazonSageMakerFullAccess IAM policy
attached.
5. In the Task type section, for the Task category field, choose Image.
6. In the Task selection choose Bounding box.
7. Choose Next to move on to configuring your labeling job.

Next
Step 3: Select Workers (p. 529)

Step 3: Select Workers


In this step you choose a workforce for labeling your dataset. It is recommended that you create a private
workforce to test Amazon SageMaker Ground Truth. Use email addresses to invite the members of
your workforce. If you create a private workforce in this step, you won't be able to import your Amazon
Cognito user pool later. If you want to create a private workforce using an Amazon Cognito user pool, see
Manage a Private Workforce (Amazon Cognito) (p. 871) and use the Mechanical Turk workforce instead
in this tutorial.
Tip
To learn about the other workforce options you can use with Ground Truth, see Create and
Manage Workforces (p. 863).

To create a private workforce:

1. In the Workers section, choose Private.


2. If this is your first time using a private workforce, in the Email addresses field, enter up to 100
email addresses. The addresses must be separated by commas. You should include your own email
address so that you are part of the workforce and can see data object labeling tasks.
3. In the Organization name field, enter the name of your organization. This information is used
to customize the email sent to invite a person to your private workforce. You can change the
organization name after the user pool is created through the console.
4. In the Contact email field enter an email address that members of the workforce use to report
problems with the task.

If you add yourself to the private workforce, you will receive an email that looks similar to the following.
Amazon, Inc. is replaced by the organization you enter in step 3 of the preceding procedure. Select the
link in the email to log in using the temporary password provided. If prompted, change your password.
When you successfully log in, you see the worker portal where your labeling tasks appear.


Tip
You can find the link to your private workforce's worker portal in the Labeling workforces
section of the Ground Truth area of the SageMaker console. To see the link, select the Private
tab. The link is under the Labeling portal sign-in URL header in Private workforce summary.

If you choose to use the Amazon Mechanical Turk workforce to label the dataset, you are charged for
labeling tasks completed on the dataset.

To use the Amazon Mechanical Turk workforce:

1. In the Workers section, choose Public.


2. Set a Price per task.
3. If applicable, choose The dataset does not contain adult content to acknowledge that the sample
dataset has no adult content. This information enables Amazon SageMaker Ground Truth to warn
external workers on Mechanical Turk that they might encounter potentially offensive content in your
dataset.
4. Choose the check box next to the following statement to acknowledge that the sample dataset does
not contain any personally identifiable information (PII). This is a requirement to use Mechanical
Turk with Ground Truth. If your input data does contain PII, use the private workforce for this
tutorial.

You understand and agree that the Amazon Mechanical Turk workforce consists of independent
contractors located worldwide and that you should not share confidential information, personal
information or protected health information with this workforce.

Next
Step 4: Configure the Bounding Box Tool (p. 531)


Step 4: Configure the Bounding Box Tool


Finally you configure the bounding box tool to give instructions to your workers. You can configure a
task title that describes the task and provides high-level instructions for the workers. You can provide
both quick instructions and full instructions. Quick instructions are displayed next to the image to be
labeled. Full instructions contain detailed instructions for completing the task. In this example, you only
provide quick instructions. You can see an example of full instructions by choosing Full instructions at
the bottom of the section.

To configure the bounding box tool

1. In the Task description field type in brief instructions for the task. For example:

Draw a box around any objects in the image.

Replace objects with the name of an object that appears in your images.
2. In the Labels field, type a category name for the objects that the worker should draw a bounding
box around. For example, if you are asking the worker to draw boxes around football players, you
could use "Football Player" in this field.
3. The Short instructions section enables you to create instructions that are displayed on the page
with the image that your workers are labeling. We suggest that you include an example of a
correctly drawn bounding box and an example of an incorrectly drawn box. To create your own
instructions, use these steps:

a. Select the text between GOOD EXAMPLE and the image placeholder. Replace it with the
following text:

Draw the box around the object with a small border.


b. Select the first image placeholder and delete it.
c. Choose the image button and then enter the HTTPS URL of one of the images that you created
in step 1. It is also possible to embed images directly in the short instructions section, however
this section has a quota of 100 kilobytes (including text). If your images and text exceed 100
kilobytes, you receive an error.
d. Select the text between BAD EXAMPLE and the image placeholder. Replace it with the following
text:

Don't make the bounding box too large or cut into the object.
e. Select the second image placeholder and delete it.
f. Choose the image button and then enter the HTTPS URL of the other image that you created in
step 1.
4. Select Preview to preview the worker UI. The preview opens in a new tab, so if your browser blocks pop-ups, you may need to manually allow the tab to open. When you add one or more annotations to the preview and then select Submit, you can see a preview of the output data that your annotations would create.
5. After you have configured and verified your instructions, select Create to create the labeling job.

If you used a private workforce, you can navigate to the worker portal that you logged into in Step 3:
Select Workers (p. 529) of this tutorial to see your labeling tasks. The tasks may take a few minutes to
appear.

Next
Step 5: Monitoring Your Labeling Job (p. 532)


Step 5: Monitoring Your Labeling Job


After you create your labeling job, you see a list of all the jobs that you have created. You can use this list to monitor the status of your labeling jobs. The list has the following fields:

• Name – The name that you assigned the job when you created it.
• Status – The completion status of the job. The status can be one of Complete, Failed, In progress, or
Stopped.
• Labeled objects/total – Shows the total number of objects in the labeling job and how many of them
have been labeled.
• Creation time – The date and time that you created the job.

You can also clone, chain, or stop a job. Select a job and then select one of the following from the
Actions menu:

• Clone – Creates a new labeling job with the configuration copied from the selected job. You can clone
a job when you want to make a change to the job and run it again. For example, you can clone a job that was
sent to a private workforce so that you can send it to the Amazon Mechanical Turk workforce. Or you
can clone a job to rerun it against a new dataset stored in the same location as the original job.
• Chain – Creates a new labeling job that can build upon the data and models (if any) of a stopped,
failed, or completed job. For more information about the use cases and how to use it, see Chaining
Labeling Jobs (p. 813).
• Stop – Stops a running job. You cannot restart a stopped job. You can clone a job to start over or chain
the job to continue from where it left off. Labels for any already labeled objects are written to the
output file location. For more information, see Output Data (p. 776).
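You can also monitor and stop labeling jobs programmatically. The following sketch uses the AWS SDK for Python (Boto3); the Region and job name are placeholders:

import boto3

sm_client = boto3.client('sagemaker', region_name='<Region>')

job = sm_client.describe_labeling_job(LabelingJobName='<Labeling Job Name>')
print(job['LabelingJobStatus'], job['LabelCounters'])

# Stopping is irreversible; clone or chain the job to continue later
if job['LabelingJobStatus'] == 'InProgress':
    sm_client.stop_labeling_job(LabelingJobName='<Labeling Job Name>')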

Label Images
Use Ground Truth to label images. Select one of the following built in task types to learn more about
that task type. Each page includes instructions to help you create a labeling job using that task type.
Tip
To learn more about supported file types and input data quotas, see Input Data (p. 734).

Topics
• Bounding Box (p. 532)
• Image Semantic Segmentation (p. 538)
• Auto-Segmentation Tool (p. 541)
• Image Classification (Single Label) (p. 545)
• Image Classification (Multi-label) (p. 547)
• Image Label Verification (p. 551)

Bounding Box
The images used to train a machine learning model often contain more than one object. To classify and
localize one or more objects within images, use the Amazon SageMaker Ground Truth bounding box
labeling job task type. In this context, localization means the pixel-location of the bounding box.

You create a bounding box labeling job using the Ground Truth section of the Amazon SageMaker
console or the CreateLabelingJob operation.


Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).

Creating a Bounding Box Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a
bounding box labeling job in the SageMaker console. In Step 10, choose Image from the Task category
drop down menu, and choose Bounding box as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and up to 50
labels that workers can choose from.


Create a Bounding Box Labeling Job (API)


To create a bounding box labeling job, use the SageMaker API operation CreateLabelingJob. This
API defines this operation for all AWS SDKs. To see a list of language-specific SDKs supported for this
operation, review the See Also section of CreateLabelingJob.

Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Pre-annotation Lambda functions for this task type end with PRE-BoundingBox. To find the pre-
annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-BoundingBox. To find
the annotation-consolidation Lambda ARN for your Region, see AnnotationConsolidationLambdaArn.

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.

response = client.create_labeling_job(
    LabelingJobName='example-bounding-box-labeling-job',
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
            }
        },
        'DataAttributes': {
            'ContentClassifiers': [
                'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
            ]
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
        'KmsKeyId': 'string'
    },
    RoleArn='arn:aws:iam::*:role/*',
    LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
    StoppingConditions={
        'MaxHumanLabeledObjectCount': 123,
        'MaxPercentageOfInputDatasetLabeled': 123
    },
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
        'UiConfig': {
            'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox',
        'TaskKeywords': [
            'Bounding Box',
        ],
        'TaskTitle': 'Bounding Box task',
        'TaskDescription': 'Draw bounding boxes around objects in an image',
        'NumberOfHumanWorkersPerDataObject': 123,
        'TaskTimeLimitInSeconds': 123,
        'TaskAvailabilityLifetimeInSeconds': 123,
        'MaxConcurrentTaskCount': 123,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-BoundingBox'
        }
    },
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ]
)
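The LabelCategoryConfigS3Uri parameter in the preceding request points to a JSON file in Amazon S3 that lists your label categories. A minimal sketch of such a file, with illustrative labels, follows:

{
    "document-version": "2018-11-28",
    "labels": [
        {"label": "bird"},
        {"label": "squirrel"}
    ]
}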

Provide a Template for Bounding Box Labeling Jobs

If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri. Copy and modify the following template. Only modify the short-instructions, full-instructions, and header. Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-bounding-box
    name="boundingBox"
    src="{{ task.input.taskObject | grant_read_access }}"
    header="please draw box"
    labels="{{ task.input.labels | to_json | escape }}"
  >

    <full-instructions header="Bounding box instructions">
      <ol>
        <li><strong>Inspect</strong> the image</li>
        <li><strong>Determine</strong> if the specified label is/are visible in the picture.</li>
        <li><strong>Outline</strong> each instance of the specified label in the image using the provided “Box” tool.</li>
      </ol>
      <ul>
        <li>Boxes should fit tight around each object</li>
        <li>Do not include parts of the object that are overlapping or that cannot be seen, even though you think you can interpolate the whole shape.</li>
        <li>Avoid including shadows.</li>
        <li>If the target is off screen, draw the box up to the edge of the image.</li>
      </ul>
    </full-instructions>

    <short-instructions>
      <h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
      <p>Enter description of a correct bounding box label and add images</p>
      <h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3>
      <p>Enter description of an incorrect bounding box label and add images</p>
    </short-instructions>

  </crowd-bounding-box>
</crowd-form>

Bounding Box Output Data


Once you have created a bounding box labeling job, your output data will be located in the Amazon S3
bucket specified in the S3OutputPath parameter when using the API or in the Output dataset location
field of the Job overview section of the console.

For example, the output manifest file of a successfully completed single-class bounding box task will
contain the following:

[
    {
        "boundingBox": {
            "boundingBoxes": [
                {
                    "height": 2832,
                    "label": "bird",
                    "left": 681,
                    "top": 599,
                    "width": 1364
                }
            ],
            "inputImageProperties": {
                "height": 3726,
                "width": 2662
            }
        }
    }
]

The boundingBoxes parameter identifies the location of the bounding box drawn around an object identified as a "bird" relative to the top-left corner of the image, which is taken to be the (0,0) pixel-coordinate. In the previous example, left and top identify the location of the pixel in the top-left corner of the bounding box relative to the top-left corner of the image. The dimensions of the bounding box are identified with height and width. The inputImageProperties parameter gives the pixel-dimensions of the original input image.
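For example, the following sketch converts the box from the preceding output into corner coordinates and checks that it falls within the image bounds:

# Values taken from the single-class example output above
box = {"height": 2832, "label": "bird", "left": 681, "top": 599, "width": 1364}
image = {"height": 3726, "width": 2662}

# Corner coordinates relative to the (0,0) pixel at the top-left of the image
x_min, y_min = box["left"], box["top"]
x_max = box["left"] + box["width"]    # 2045
y_max = box["top"] + box["height"]    # 3431

assert x_max <= image["width"] and y_max <= image["height"]
print(x_min, y_min, x_max, y_max)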

When you use the bounding box task type, you can create single- and multi-class bounding box labeling
jobs. The output manifest file of a successfully completed multi-class bounding box will contain the
following:

[
    {
        "boundingBox": {
            "boundingBoxes": [
                {
                    "height": 938,
                    "label": "squirrel",
                    "left": 316,
                    "top": 218,
                    "width": 785
                },
                {
                    "height": 825,
                    "label": "rabbit",
                    "left": 1930,
                    "top": 2265,
                    "width": 540
                },
                {
                    "height": 1174,
                    "label": "bird",
                    "left": 748,
                    "top": 2113,
                    "width": 927
                },
                {
                    "height": 893,
                    "label": "bird",
                    "left": 1333,
                    "top": 847,
                    "width": 736
                }
            ],
            "inputImageProperties": {
                "height": 3726,
                "width": 2662
            }
        }
    }
]

To learn more about the output manifest file that results from a bounding box labeling job, see
Bounding Box Job Output (p. 782).

To learn more about the output manifest file generated by Ground Truth and the file structure the
Ground Truth uses to store your output data, see Output Data (p. 776).

Image Semantic Segmentation


To identify the contents of an image at the pixel level, use an Amazon SageMaker Ground Truth semantic
segmentation labeling task. When assigned a semantic segmentation labeling job, workers classify pixels
in the image into a set of predefined labels or classes. Ground Truth supports single and multi-class
semantic segmentation labeling jobs.

Images that contain large numbers of objects that need to be segmented require more time. To help
workers (from a private or vendor workforce) label these objects in less time and with greater accuracy,
Ground Truth provides an AI-assisted auto-segmentation tool. For information, see Auto-Segmentation
Tool (p. 541).

You create a semantic segmentation labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).

Creating a Semantic Segmentation Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a
semantic segmentation labeling job in the SageMaker console. In Step 10, choose Image from the Task
category drop down menu, and choose Semantic segmentation as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.


Create a Semantic Segmentation Labeling Job (API)


To create a semantic segmentation labeling job, use the SageMaker API operation CreateLabelingJob.
This API defines this operation for all AWS SDKs. To see a list of language-specific SDKs supported for
this operation, review the See Also section of CreateLabelingJob.

Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Pre-annotation Lambda functions for this task type end with PRE-SemanticSegmentation. To find
the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
SemanticSegmentation. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.

response = client.create_labeling_job(
    LabelingJobName='example-semantic-segmentation-labeling-job',
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
            }
        },
        'DataAttributes': {
            'ContentClassifiers': [
                'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
            ]
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
        'KmsKeyId': 'string'
    },
    RoleArn='arn:aws:iam::*:role/*',
    LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
    StoppingConditions={
        'MaxHumanLabeledObjectCount': 123,
        'MaxPercentageOfInputDatasetLabeled': 123
    },
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
        'UiConfig': {
            'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-SemanticSegmentation',
        'TaskKeywords': [
            'Semantic Segmentation',
        ],
        'TaskTitle': 'Semantic segmentation task',
        'TaskDescription': 'For each category provided, segment out each relevant object using the color associated with that category',
        'NumberOfHumanWorkersPerDataObject': 123,
        'TaskTimeLimitInSeconds': 123,
        'TaskAvailabilityLifetimeInSeconds': 123,
        'MaxConcurrentTaskCount': 123,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-SemanticSegmentation'
        }
    },
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ]
)

Provide a Template for Semantic Segmentation Labeling Jobs

If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri. Copy and modify the following template. Only modify the short-instructions, full-instructions, and header. Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-semantic-segmentation
    name="crowd-semantic-segmentation"
    src="{{ task.input.taskObject | grant_read_access }}"
    header="Please segment out all pedestrians."
    labels="{{ task.input.labels | to_json | escape }}"
  >
    <full-instructions header="Segmentation instructions">
      <ol>
        <li><strong>Read</strong> the task carefully and inspect the image.</li>
        <li><strong>Read</strong> the options and review the examples provided to understand more about the labels.</li>
        <li><strong>Choose</strong> the appropriate label that best suits an object and paint that object using the tools provided.</li>
      </ol>
    </full-instructions>
    <short-instructions>
      <h2><span style="color: rgb(0, 138, 0);">Good example</span></h2>
      <p>Enter description to explain a correctly done segmentation</p>
      <p><br></p>
      <h2><span style="color: rgb(230, 0, 0);">Bad example</span></h2>
      <p>Enter description of an incorrectly done segmentation</p>
    </short-instructions>
  </crowd-semantic-segmentation>
</crowd-form>

Semantic Segmentation Output Data


Once you have created a semantic segmentation labeling job, your output data will be located in the
Amazon S3 bucket specified in the S3OutputPath parameter when using the API or in the Output
dataset location field of the Job overview section of the console.

To learn more about the output manifest file generated by Ground Truth and the file structure the
Ground Truth uses to store your output data, see Output Data (p. 776).

To see an example of an output manifest file for a semantic segmentation labeling job, see 3D Point
Cloud Semantic Segmentation Output (p. 792).

Auto-Segmentation Tool
Image segmentation is the process of dividing an image into multiple segments, or sets of labeled
pixels. In Amazon SageMaker Ground Truth, the process of identifying all pixels that fall under a given


label involves applying a colored filler, or "mask", over those pixels. Some labeling job tasks contain
images with a large number of objects that need to be segmented. To help workers label these
objects in less time and with greater accuracy, Ground Truth provides an auto-segmentation tool for
segmentation tasks assigned to private and vendor workforces. This tool uses a machine learning model
to automatically segment individual objects in the image with minimal worker input. Workers can refine
the mask generated by the auto-segmentation tool using other tools found in the worker console. This
helps workers complete image segmentation tasks faster and more accurately, resulting in lower cost
and higher label quality.
Note
The auto-segmentation tool is available for segmentation tasks that are sent to a private
workforce or vendor workforce. It isn't available for tasks sent to the public workforce (Amazon
Mechanical Turk).

Tool Preview
When workers are assigned a labeling job that provides the auto-segmentation tool, they are provided
with detailed instructions on how to use the tool. For example, a worker might see the following in the
worker console:


Workers can use View full instructions to learn how to use the tool. Workers need to place a point on each of the four extreme points (top-most, bottom-most, left-most, and right-most) of the object of interest, and the tool automatically generates a mask for the object. Workers can further refine the mask using the other tools provided, or by using the auto-segment tool on smaller portions of the object that were missed.

Tool Availability
The auto-segmentation tool automatically appears in your workers' consoles if you create a semantic
segmentation labeling job using the Amazon SageMaker console. While creating a semantic
segmentation job in the SageMaker console, you will be able to preview the tool while creating worker
instructions. To learn how to create a semantic segmentation labeling job in the SageMaker console, see
Getting started (p. 527).

If you are creating a custom instance segmentation labeling job in the SageMaker console or creating
an instance- or semantic-segmentation labeling job using the Ground Truth API, you need to create a
custom task template to design your worker console and instructions. To include the auto-segmentation
tool in your worker console, ensure that the following conditions are met in your custom task template:

• For semantic segmentation labeling jobs created using the API, the <crowd-semantic-
segmentation> tag is present in the task template. For custom instance segmentation labeling jobs, the
<crowd-instance-segmentation> tag is present in the task template.
• The task is assigned to a private workforce or vendor workforce.
• The images to be labeled are Amazon Simple Storage Service (Amazon S3) objects that have been
pre-signed for workers so that they can access them. This is true if the task template includes the
grant_read_access filter. For information about the grant_read_access filter, see Adding
automation with Liquid (p. 675).

The following is an example of a custom task template for a custom instance segmentation labeling job,
which includes the <crowd-instance-segmentation/> tag and the grant_read_access Liquid
filter.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-instance-segmentation
name="crowd-instance-segmentation"
src="{{ task.input.taskObject | grant_read_access }}"
labels="['Car','Road']"
<full-instructions header="Segmentation instructions">
Segment each instance of each class of objects in the image.
</full-instructions>

<short-instructions>
<p>Segment each instance of each class of objects in the image.</p>

<h3 style="color: green">GOOD EXAMPLES</h3>


<img src="path/to/image.jpg" style="width: 100%">
<p>Good because A, B, C.</p>

<h3 style="color: red">BAD EXAMPLES</h3>


<img src="path/to/image.jpg" style="width: 100%">
<p>Bad because X, Y, Z.</p>
</short-instructions>
</crowd-instance-segmentation>
</crowd-form>


Image Classification (Single Label)


Use an Amazon SageMaker Ground Truth image classification labeling task when you need workers to
classify images using predefined labels that you specify. Workers are shown images and are asked to
choose one label for each image.

You can create an image classification labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).
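
For reference, each line in an input manifest file is a standalone JSON object. A minimal sketch for this
task type might look like the following; the bucket and object keys are placeholder values:

{"source-ref": "s3://DOC-EXAMPLE-BUCKET/images/image-1.jpg"}
{"source-ref": "s3://DOC-EXAMPLE-BUCKET/images/image-2.jpg"}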

Create an Image Classification Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create an image
classification labeling job in the SageMaker console. In Step 10, choose Image from the Task category
drop down menu, and choose Image Classification (Single Label) as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.

Create an Image Classification Labeling Job (API)


To create an image classification labeling job, use the SageMaker API operation CreateLabelingJob.
This API defines this operation for all AWS SDKs. To see a list of language-specific SDKs supported for
this operation, review the See Also section of CreateLabelingJob.


Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Pre-annotation Lambda functions for this task type end with PRE-ImageMultiClass. To find the
pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
ImageMultiClass. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.

response = client.create_labeling_job(
LabelingJobName='example-image-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
ImageMultiClass',
'TaskKeywords': [
'Image classification',
],
'TaskTitle': 'Image classification task',
'TaskDescription': 'Carefully inspect the image and classify it by selecting one
label from the categories provided.',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-ImageMultiClass'
},
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
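
After the request succeeds, you can poll the job with the DescribeLabelingJob operation. The following
is a minimal sketch that assumes client is the same SageMaker Boto3 client used above:

# Poll the status of the labeling job created above.
response = client.describe_labeling_job(
    LabelingJobName='example-image-classification-labeling-job'
)
print(response['LabelingJobStatus'])  # For example, 'InProgress' or 'Completed'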

Provide a Template for Image Classification Labeling Jobs

If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.

Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
name="crowd-image-classifier"
src="{{ task.input.taskObject | grant_read_access }}"
header="please classify"
categories="{{ task.input.labels | to_json | escape }}"
>
<full-instructions header="Image classification instructions">
<ol><li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to understand
more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits the image.</li></
ol>
</full-instructions>
<short-instructions>
<h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
<p>Enter description to explain the correct label to the workers</p>
<h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3><p>Enter description
of an incorrect label</p>
</short-instructions>
</crowd-image-classifier>
</crowd-form>
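
After you modify the template, you can upload it to Amazon S3 with any S3 client. The following is a
minimal sketch using the AWS SDK for Python (Boto3); the local file name, bucket, and key are
placeholder values:

import boto3

s3 = boto3.client('s3')

# Upload the modified worker task template to Amazon S3.
s3.upload_file(
    Filename='worker-task-template.html',
    Bucket='DOC-EXAMPLE-BUCKET',
    Key='path/worker-task-template.html',
)

# Pass this URI in the UiTemplateS3Uri parameter of CreateLabelingJob.
ui_template_s3_uri = 's3://DOC-EXAMPLE-BUCKET/path/worker-task-template.html'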

Image Classification Output Data


Once you have created an image classification labeling job, your output data will be located in the
Amazon S3 bucket specified in the S3OutputPath parameter when using the API or in the Output
dataset location field of the Job overview section of the console.

To learn more about the output manifest file generated by Ground Truth and the file structure
Ground Truth uses to store your output data, see Output Data (p. 776).

To see an example of an output manifest file from an image classification labeling job, see Classification
Job Output (p. 780).

Image Classification (Multi-label)


Use an Amazon SageMaker Ground Truth multi-label image classification labeling task when you need
workers to classify multiple objects in an image. For example, the following image features a dog and a
cat. You can use multi-label image classification to associate the labels "dog" and "cat" with this image.


When working on a multi-label image classification task, workers should choose all applicable labels,
but must choose at least one. When creating a job using this task type, you can provide up to 50 label
categories.

When creating a labeling job in the console, Ground Truth doesn't provide a "none" category for when
none of the labels applies to an image. To provide this option to workers, include a label similar to
"none" or "other" when you create a multi-label image classification job.

To restrict workers to choosing a single label for each image, use the Image Classification (Single
Label) (p. 545) task type.
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each image file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).

Create a Multi-Label Image Classification Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a multi-
label image classification labeling job in the SageMaker console. In Step 10, choose Image from the Task
category drop down menu, and choose Image Classification (Multi-label) as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create a labeling
job in the console, you specify instructions to help workers complete the job and labels that workers can
choose from.


Create a Multi-Label Image Classification Labeling Job (API)


To create a multi-label image classification labeling job, use the SageMaker API operation
CreateLabelingJob. This API defines this operation for all AWS SDKs. To see a list of language-specific
SDKs supported for this operation, review the See Also section of CreateLabelingJob.

Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Pre-annotation Lambda functions for this task type end with PRE-ImageMultiClassMultiLabel.
To find the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
ImageMultiClassMultiLabel. To find the annotation-consolidation Lambda ARN for your Region,
see AnnotationConsolidationLambdaArn.

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.

response = client.create_labeling_job(
LabelingJobName='example-multi-label-image-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
ImageMultiClassMultiLabel',
'TaskKeywords': [
'Image Classification',
],
'TaskTitle': 'Multi-label image classification task',
'TaskDescription': 'Select all labels that apply to the images shown',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-ImageMultiClassMultiLabel'
},
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)

Provide a Template for Multi-label Image Classification

If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.

Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier-multi-select
name="crowd-image-classifier-multi-select"
src="{{ task.input.taskObject | grant_read_access }}"
header="Please identify all classes in image"
categories="{{ task.input.labels | to_json | escape }}"
>
<full-instructions header="Multi Label Image classification instructions">
<ol><li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to understand
more about the labels.</li>
<li><strong>Choose</strong> the appropriate labels that best suit the image.</li></
ol>
</full-instructions>
<short-instructions>
<h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
<p>Enter description to explain the correct label to the workers</p>
<h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3>


<p>Enter description of an incorrect label</p>
</short-instructions>
</crowd-image-classifier-multi-select>
</crowd-form>

Multi-label Image Classification Output Data


Once you have created a multi-label image classification labeling job, your output data will be located in
the Amazon S3 bucket specified in the S3OutputPath parameter when using the API or in the Output
dataset location field of the Job overview section of the console.

To learn more about the output manifest file generated by Ground Truth and the file structure
Ground Truth uses to store your output data, see Output Data (p. 776).

To see an example of output manifest files for a multi-label image classification labeling job, see Multi-
label Classification Job Output (p. 781).

Image Label Verification


Building a highly accurate training dataset for your machine learning (ML) algorithm is an iterative
process. Typically, you review and continuously adjust your labels until you are satisfied that they
accurately represent the ground truth, or what is directly observable in the real world.

You can use an Amazon SageMaker Ground Truth image label verification task to direct workers
to review a dataset's labels and improve label accuracy. Workers can indicate if the existing labels
are correct or rate label quality. They can also add comments to explain their reasoning. Amazon
SageMaker Ground Truth supports label verification for Bounding Box (p. 532) and Image Semantic
Segmentation (p. 538) labels.

You create an image label verification labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.

Ground Truth provides a worker console similar to the following for labeling tasks. When you create the
labeling job with the console, you can modify the images and content that are shown. To learn how to
create a labeling job using the Ground Truth console, see Create a Labeling Job (Console) (p. 706).


You can create a label verification labeling job using the SageMaker console or API. To learn how to
create a labeling job using the Ground Truth API operation CreateLabelingJob, see Create a Labeling
Job (API) (p. 709).

Use Ground Truth to Label Text


Use Ground Truth to label text. Select one of the following built-in task types to learn more about that task
type. Each page includes instructions to help you create a labeling job using that task type.
Tip
To learn more about supported file types and input data quotas, see Input Data (p. 734).

Topics
• Named Entity Recognition (p. 552)
• Text Classification (Single Label) (p. 556)
• Text Classification (Multi-label) (p. 559)

Named Entity Recognition


To extract information from unstructured text and classify it into predefined categories, use an Amazon
SageMaker Ground Truth named entity recognition (NER) labeling task. Traditionally, NER involves sifting
through text data to locate noun phrases, called named entities, and categorizing each with a label,
such as "person," "organization," or "brand." You can broaden this task to label longer spans of text and
categorize those sequences with predefined labels that you specify.

When tasked with a named entity recognition labeling job, workers apply your labels to specific words
or phrases within a larger text block. They choose a label, then apply it by using the cursor to highlight
the part of the text to which the label applies. The Ground Truth named entity recognition tool supports
overlapping annotations, in-context label selection, and multi-label selection for a single highlight. Also,
workers can use their keyboards to quickly select labels.

You can create a named entity recognition labeling job using the Ground Truth section of the Amazon
SageMaker console or the CreateLabelingJob operation.
Important
If you manually create an input manifest file, use "source" to identify the text that you want
labeled. For more information, see Input Data (p. 734).
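
For reference, each line in an input manifest file for this task type carries the text to be labeled inline
in a standalone JSON object. A minimal sketch might look like the following; the sentences are
placeholder values:

{"source": "Amazon SageMaker is a fully managed machine learning service."}
{"source": "Workers label named entities in each line of text."}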

Create a Named Entity Recognition Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a named
entity recognition labeling job in the SageMaker console. In Step 10, choose Text from the Task category
drop down menu, and choose Named entity recognition as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.

Create a Named Entity Recognition Labeling Job (API)


To create a named entity recognition labeling job, use the SageMaker API operation
CreateLabelingJob. This API defines this operation for all AWS SDKs. To see a list of language-specific
SDKs supported for this operation, review the See Also section of CreateLabelingJob.

Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Pre-annotation Lambda functions for this task type end with PRE-NamedEntityRecognition. To
find the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .


• Annotation-consolidation Lambda functions for this task type end with ACS-
NamedEntityRecognition. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.
• You must provide the following ARN for HumanTaskUiArn:

arn:aws:sagemaker:aws-region:394669845002:human-task-ui/NamedEntityRecognition

Replace aws-region with the AWS Region you use to create the labeling job. For example, use us-
west-1 if you create a labeling job in US West (N. California).
• Provide worker instructions in the label category configuration file using the instructions
parameter. You can use a string, or HTML markup language in the shortInstruction and
fullInstruction fields. For more details, see Provide Worker Instructions in a Label Category
Configuration File (p. 555).

"instructions": {"shortInstruction":"<h1>Add header</h1><p>Add Instructions</p>",


"fullInstruction":"<p>Add additional instructions.</p>"}

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.

response = client.create_labeling_job(
LabelingJobName='example-ner-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'HumanTaskUiArn': 'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/
NamedEntityRecognition'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
NamedEntityRecognition',
'TaskKeywords': [
'Named entity Recognition',
],
'TaskTitle': 'Named entity Recognition task',
'TaskDescription': 'Apply the labels provided to specific words or phrases within
the larger text block.',
'NumberOfHumanWorkersPerDataObject': 1,
'TaskTimeLimitInSeconds': 28800,
'TaskAvailabilityLifetimeInSeconds': 864000,
'MaxConcurrentTaskCount': 1000,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-NamedEntityRecognition'
},
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)

Provide Worker Instructions in a Label Category Configuration File

You must provide worker instructions in the label category configuration file you identify with the
LabelCategoryConfigS3Uri parameter in CreateLabelingJob. You can use these instructions to
provide details about the task you want workers to perform and help them use the tool efficiently.

You provide short and long instructions using shortInstruction and fullInstruction in the
instructions parameter, respectively. To learn more about these instruction types, see Creating
Instruction Pages (p. 704).

The following is an example of a label category configuration file with instructions that can be used for a
named entity recognition labeling job.

{
"document-version": "2018-11-28",
"labels": [
{
"label": "label1",
"shortDisplayName": "L1"
},
{
"label": "label2",
"shortDisplayName": "L2"
},
{
"label": "label3",
"shortDisplayName": "L3"
},
{
"label": "label4",
"shortDisplayName": "L4"
},
{
"label": "label5",
"shortDisplayName": "L5"
}
],
"instructions": {
"shortInstruction": "<p>Enter description of the labels that workers have
to choose from</p><br><p>Add examples to help workers understand
the label</p>",
"fullInstruction": "<ol>
<li><strong>Read</strong> the text carefully.</li>
<li><strong>Highlight</strong> words, phrases, or sections of the
text.</li>
<li><strong>Choose</strong> the label that best matches what you
have highlighted.</li>

555
Amazon SageMaker Developer Guide
Label Text

<li>To <strong>change</strong> a label, choose highlighted text and


select a new label.</li>
<li>To <strong>remove</strong> a label from highlighted text,
choose the X next to the
abbreviated label name on the highlighted text.</li>
<li>You can select all of a previously highlighted text, but not a
portion of it.</li>
</ol>"
}
}

Named Entity Recognition Output Data


Once you have created a named entity recognition labeling job, your output data will be located in the
Amazon S3 bucket specified in the S3OutputPath parameter when using the API or in the Output
dataset location field of the Job overview section of the console.

To learn more about the output manifest file generated by Ground Truth and the file structure
Ground Truth uses to store your output data, see Output Data (p. 776).

Text Classification (Single Label)


To categorize articles and text into predefined categories, use text classification. For example, you can
use text classification to identify the sentiment conveyed in a review or the emotion underlying a section
of text. Use Amazon SageMaker Ground Truth text classification to have workers sort text into categories
that you define.

You create a text classification labeling job using the Ground Truth section of the Amazon SageMaker
console or the CreateLabelingJob operation.
Important
If you manually create an input manifest file, use "source" to identify the text that you want
labeled. For more information, see Input Data (p. 734).

Create a Text Classification Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a text
classification labeling job in the SageMaker console. In Step 10, choose Text from the Task category
drop down menu, and choose Text Classification (Single Label) as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.

556
Amazon SageMaker Developer Guide
Label Text

Create a Text Classification Labeling Job (API)


To create a text classification labeling job, use the SageMaker API operation CreateLabelingJob. This
API defines this operation for all AWS SDKs. To see a list of language-specific SDKs supported for this
operation, review the See Also section of CreateLabelingJob.

Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Pre-annotation Lambda functions for this task type end with PRE-TextMultiClass. To find the pre-
annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
TextMultiClass. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.

response = client.create_labeling_job(
LabelingJobName='example-text-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
TextMultiClass',
'TaskKeywords': [
'Text classification',
],
'TaskTitle': 'Text classification task',
'TaskDescription': 'Carefully read and classify this text using the categories
provided.',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-TextMultiClass'
},
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)

Provide a Template for Text Classification Labeling Jobs

If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.

Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="crowd-classifier"
categories="{{ task.input.labels | to_json | escape }}"
header="classify text"
>
<classification-target style="white-space: pre-wrap">
{{ task.input.taskObject }}
</classification-target>
<full-instructions header="Classifier instructions">
<ol><li><strong>Read</strong> the text carefully.</li>
<li><strong>Read</strong> the examples to understand more about the options.</li>
<li><strong>Choose</strong> the appropriate labels that best suit the text.</li></ol>
</full-instructions>
<short-instructions>
<p>Enter description of the labels that workers have to choose from</p>
<p><br></p><p><br></p><p>Add examples to help workers understand the label</p>
<p><br></p><p><br></p><p><br></p><p><br></p><p><br></p>
</short-instructions>
</crowd-classifier>
</crowd-form>

Text Classification Output Data


Once you have created a text classification labeling job, your output data will be located in the Amazon
S3 bucket specified in the S3OutputPath parameter when using the API or in the Output dataset
location field of the Job overview section of the console.

To learn more about the output manifest file generated by Ground Truth and the file structure
Ground Truth uses to store your output data, see Output Data (p. 776).

To see an example of an output manifest file from a text classification labeling job, see Classification
Job Output (p. 780).

Text Classification (Multi-label)


To categorize articles and text into multiple predefined categories, use the multi-label text classification
task type. For example, you can use this task type to identify more than one emotion conveyed in text.

When working on a multi-label text classification task, workers should choose all applicable labels,
but must choose at least one. When creating a job using this task type, you can provide up to 50 label
categories.

Amazon SageMaker Ground Truth doesn't provide a "none" category for when none of the labels applies.
To provide this option to workers, include a label similar to "none" or "other" when you create a multi-
label text classification job.

To restrict workers to choosing a single label for each document or text selection, use the Text
Classification (Single Label) (p. 556) task type.
Important
If you manually create an input manifest file, use "source" to identify the text that you want
labeled. For more information, see Input Data (p. 734).

Create a Multi-Label Text Classification Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a multi-
label text classification labeling job in the Amazon SageMaker console. In Step 10, choose Text from the
Task category drop down menu, and choose Text Classification (Multi-label) as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create the
labeling job with the console, you specify instructions to help workers complete the job and labels that
workers can choose from.

559
Amazon SageMaker Developer Guide
Label Text

Create a Multi-Label Text Classification Labeling Job (API)


To create a multi-label text classification labeling job, use the SageMaker API operation
CreateLabelingJob. This API defines this operation for all AWS SDKs. To see a list of language-specific
SDKs supported for this operation, review the See Also section of CreateLabelingJob.

Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Pre-annotation Lambda functions for this task type end with PRE-TextMultiClassMultiLabel. To
find the pre-annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Annotation-consolidation Lambda functions for this task type end with ACS-
TextMultiClassMultiLabel. To find the annotation-consolidation Lambda ARN for your Region,
see AnnotationConsolidationLambdaArn.

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region. All parameters in red should be replaced with your specifications and resources.

response = client.create_labeling_job(
LabelingJobName='example-multi-label-text-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/custom-worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-TextMultiClassMultiLabel',
'TaskKeywords': [
'Text Classification',
],
'TaskTitle': 'Multi-label text classification task',
'TaskDescription': 'Select all labels that apply to the text shown',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-TextMultiClassMultiLabel'
},
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)

Create a Template for Multi-label Text Classification


If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template. Only modify the short-instructions, full-
instructions, and header.

Upload this template to S3, and provide the S3 URI for this file in UiTemplateS3Uri.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier-multi-select
name="crowd-classifier-multi-select"
categories="{{ task.input.labels | to_json | escape }}"
header="Please identify all classes in the below text"
>
<classification-target style="white-space: pre-wrap">
{{ task.input.taskObject }}
</classification-target>
<full-instructions header="Classifier instructions">
<ol><li><strong>Read</strong> the text carefully.</li>
<li><strong>Read</strong> the examples to understand more about the options.</li>
<li><strong>Choose</strong> the appropriate labels that best suit the text.</li></ol>
</full-instructions>
<short-instructions>
<p>Enter description of the labels that workers have to choose from</p>
<p><br></p>
<p><br></p><p>Add examples to help workers understand the label</p>
<p><br></p><p><br></p><p><br></p><p><br></p><p><br></p>
</short-instructions>
</crowd-classifier-multi-select>
</crowd-form>


To learn how to create a custom template, see Creating Custom Labeling Workflows (p. 671).

Multi-label Text Classification Output Data


Once you have created a multi-label text classification labeling job, your output data will be located in
the Amazon S3 bucket specified in the S3OutputPath parameter when using the API or in the Output
dataset location field of the Job overview section of the console.

To learn more about the output manifest file generated by Ground Truth and the file structure
Ground Truth uses to store your output data, see Output Data (p. 776).

To see an example of output manifest files for a multi-label text classification labeling job, see Multi-label
Classification Job Output (p. 781).

Label Videos and Video Frames


You can use Ground Truth to classify videos and annotate video frames (still images extracted from
videos) using one of the three built-in video task types. These task types streamline the process of
creating video and video frame labeling jobs using the Amazon SageMaker console, API, and language-
specific SDKs.

• Video clip classification – Enable workers to classify videos into categories you specify. For example,
you can use this task type to have workers categorize videos into topics like sports, comedy, music, and
education. To learn more, see Video Classification (p. 562).
• Video frame labeling jobs – Enable workers to annotate video frames extracted from a video using
bounding boxes, polylines, polygons or keypoint annotation tools. Ground Truth offers two built-in
task types to label video frames:
• Video frame object detection: Enable workers to identify and locate objects in video frames.
• Video frame object tracking: Enable workers to track the movement of objects across video frames.
• Video frame adjustment jobs: Have workers adjust labels, label category attributes, and frame
attributes from a previous video frame object detection or object tracking labeling job.
• Video frame verification jobs: Have workers verify labels, label category attributes, and frame
attributes from a previous video frame object detection or object tracking labeling job.

If you have video files, you can use the Ground Truth automatic frame extraction tool to extract video
frames from your videos. To learn more, see Video Frame Input Data (p. 770).

Tip
To learn more about supported file types and input data quotas, see Input Data (p. 734).

Topics
• Video Classification (p. 562)
• Label Video Frames (p. 567)
• Worker Instructions (p. 579)

Video Classification
Use an Amazon SageMaker Ground Truth video classification labeling task when you need workers to
classify videos using predefined labels that you specify. Workers are shown videos and are asked to
choose one label for each video.

You create a video classification labeling job using the Ground Truth section of the Amazon SageMaker
console or the CreateLabelingJob operation.


Your video files must be encoded in a format that is supported by the browser used by the work team
that labels your data. It is recommended that you verify that all video file formats in your input manifest
file display correctly using the worker UI preview. You can communicate supported browsers to your
workers using worker instructions. To see supported file formats, see Supported Data Formats (p. 737).
Important
For this task type, if you create your own manifest file, use "source-ref" to identify the
location of each video file in Amazon S3 that you want labeled. For more information, see Input
Data (p. 734).
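
For reference, each line in an input manifest file for this task type points to a video file in Amazon S3. A
minimal sketch might look like the following; the bucket and object keys are placeholder values:

{"source-ref": "s3://DOC-EXAMPLE-BUCKET/videos/video-1.mp4"}
{"source-ref": "s3://DOC-EXAMPLE-BUCKET/videos/video-2.mp4"}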

Create a Video Classification Labeling Job (Console)


You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create
a video classification labeling job in the SageMaker console. In step 10, choose Video from the Task
category dropdown list, and choose Video Classification as the task type.

Ground Truth provides a worker UI similar to the following for labeling tasks. When you create a labeling
job in the console, you specify instructions to help workers complete the job and labels from which
workers can choose.


Create a Video Classification Labeling Job (API)


This section covers details you need to know when you create a labeling job using the SageMaker
API operation CreateLabelingJob. This API defines this operation for all AWS SDKs. To see
a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.

Follow the instructions on Create a Labeling Job (API) (p. 709) and do the following while you configure
your request:

• Use a pre-annotation Lambda function that ends with PRE-VideoClassification. To find the pre-
annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn .
• Use an annotation-consolidation Lambda function that ends with ACS-
VideoClassification. To find the annotation-consolidation Lambda ARN for your Region, see
AnnotationConsolidationLambdaArn.

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region.

response = client.create_labeling_job(
LabelingJobName='example-video-classification-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:region:*:workteam/private-crowd/*',
'UiConfig': {
'UiTemplateS3Uri': 's3://bucket/path/worker-task-template.html'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
VideoClassification',
'TaskKeywords': [
'Video Classification',
],
'TaskTitle': 'Video classification task',
'TaskDescription': 'Select a label to classify this video',
'NumberOfHumanWorkersPerDataObject': 123,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-VideoClassification'
},
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)

Provide a Template for Video Classification

If you create a labeling job using the API, you must supply a worker task template in UiTemplateS3Uri.
Copy and modify the following template by modifying the short-instructions, full-
instructions, and header. Upload this template to Amazon S3, and provide the Amazon S3 URI to
this file in UiTemplateS3Uri.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-classifier
name="crowd-classifier"
categories="{{ task.input.labels | to_json | escape }}"
header="Please classify video"
>
<classification-target>
<video width="100%" controls/>
<source src="{{ task.input.taskObject | grant_read_access }}"
type="video/mp4"/>
<source src="{{ task.input.taskObject | grant_read_access }}"
type="video/webm"/>
<source src="{{ task.input.taskObject | grant_read_access }}"
type="video/ogg"/>
Your browser does not support the video tag.
</video>
</classification-target>
<full-instructions header="Video classification instructions">
<ol><li><strong>Read</strong> the task carefully and inspect the
video.</li>
<li><strong>Read</strong> the options and review the examples
provided to understand more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits
the video.</li></ol>
</full-instructions>
<short-instructions>
<h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
<p>Enter description to explain the correct label to the workers</p>
<p><img src="https://d7evko5405gb7.cloudfront.net/fe4fed9b-660c-4477-9294-2c66a15d6bbe/src/images/quick-instructions-example-placeholder.png" style="max-width:100%"></p>
<h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3>
<p>Enter description of an incorrect label</p>
<p><img src="https://d7evko5405gb7.cloudfront.net/fe4fed9b-660c-4477-9294-2c66a15d6bbe/src/images/quick-instructions-example-placeholder.png" style="max-width:100%"></p>
</short-instructions>
</crowd-classifier>
</crowd-form>


Video Classification Output Data


Once you have created a video classification labeling job, your output data is located in the Amazon S3
bucket specified in the S3OutputPath parameter when using the API or in the Output dataset location
field of the Job overview section of the console.

To learn more about the output manifest file generated by Ground Truth and the file structure
Ground Truth uses to store your output data, see Output Data (p. 776).

To see an example of output manifest files for video classification labeling jobs, see Classification Job
Output (p. 780).

Label Video Frames


You can use Ground Truth built-in video frame task types to have workers annotate video frames using
bounding boxes, polylines, polygons, or keypoints. A video frame is a still image extracted from a video.

If you do not have video frames, you can provide video files (MP4 files) and use the Ground
Truth automated frame extraction tool to extract video frames. To learn more, see Provide Video
Files (p. 772).

You can use the following built-in video task types to create video frame labeling jobs using the Amazon
SageMaker console, API, and language-specific SDKs.

• Video frame object detection – Use this task type when you want workers to identify and locate
objects in sequences of video frames. You provide a list of categories, and workers can select one
category at a time and annotate the objects to which that category applies in all frames. For example, you
can use this task to ask workers to identify and localize various objects in a scene, such as cars, bikes,
and pedestrians.
• Video frame object tracking – Use this task type when you want workers to track the movement of
instances of objects across sequences of video frames. When a worker adds an annotation to a single
frame, that annotation is associated with a unique instance ID. The worker adds annotations associated
with the same ID in all other frames to identify the same object or person. For example, a worker
can track the movement of a vehicle across a sequence of video frames by drawing bounding boxes
associated with the same ID around the vehicle in each frame that it appears.

Use the following topics to learn more about these built-in task types and how to create a labeling
job using each task type. See Task Types (p. 576) to learn more about the annotation tools (bounding
boxes, polylines, polygons, and keypoints) available for these task types.

Before you create a labeling job, we recommend that you review Video Frame Labeling Job
Overview (p. 575).

Topics
• Video Frame Object Detection (p. 567)
• Video Frame Object Tracking (p. 571)
• Video Frame Labeling Job Overview (p. 575)

Video Frame Object Detection


You can use the video frame object detection task type to have workers identify and locate objects in
a sequence of video frames (images extracted from a video) using bounding boxes, polylines, polygons,
or keypoint annotation tools. The tool you choose defines the video frame task type you create. For
example, you can use a bounding box video frame object detection task type to ask workers to identify and
localize various objects in a series of video frames, such as cars, bikes, and pedestrians.

You can create a video frame object detection labeling job using the Amazon SageMaker Ground Truth
console, the SageMaker API, and language-specific AWS SDKs. To learn more, see Create a Video Frame
Object Detection Labeling Job (p. 568) and select your preferred method. See Task Types (p. 576) to
learn more about the annotation tools you can choose from when you create a labeling job.

Ground Truth provides a worker UI and tools to complete your labeling job tasks: Preview the Worker
UI (p. 568).

You can create a job to adjust annotations created in a video object detection labeling job using the
video object detection adjustment task type. To learn more, see Create Video Frame Object Detection
Adjustment or Verification Labeling Job (p. 571).

Preview the Worker UI

Ground Truth provides workers with a web user interface (UI) to complete your video frame object
detection annotation tasks. You can preview and interact with the worker UI when you create a labeling
job in the console. If you are a new user, we recommend that you create a labeling job through the
console using a small input dataset to preview the worker UI and ensure your video frames, labels, and
label attributes appear as expected.

The UI provides workers with the following assistive labeling tools to complete your object detection
tasks:

• For all tasks, workers can use the Copy to next and Copy to all features to copy an annotation to the
next frame or to all subsequent frames, respectively.
• For tasks that include the bounding box tools, workers can use a Predict next feature to draw a
bounding box in a single frame, and then have Ground Truth predict the location of boxes with the
same label in all other frames. Workers can then make adjustments to correct predicted box locations.

Create a Video Frame Object Detection Labeling Job

You can create a video frame object detection labeling job using the SageMaker console or the
CreateLabelingJob API operation.

This section assumes that you have reviewed the Video Frame Labeling Job Overview (p. 575) and have
chosen the type of input data and the input dataset connection you are using.

Create a Labeling Job (Console)

You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a
video frame object detection job in the SageMaker console. In step 10, choose Video - Object detection
from the Task category dropdown list. Select the task type you want by selecting one of the cards in
Task selection.


Create a Labeling Job (API)

You create an object detection labeling job using the SageMaker API operation CreateLabelingJob.
This API defines this operation for all AWS SDKs. To see a list of language-specific SDKs supported for
this operation, review the See Also section of CreateLabelingJob.

Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:

• You must enter an ARN for HumanTaskUiArn. Use


arn:aws:sagemaker:<region>:394669845002:human-task-ui/VideoObjectDetection.
Replace <region> with the AWS Region in which you are creating the labeling job.

Do not include an entry for the UiTemplateS3Uri parameter.


• Your LabelAttributeName must end in -ref. For example, video-od-labels-ref.
• Your input manifest file must be a video frame sequence manifest file. You can create this manifest file
using the SageMaker console, or create it manually and upload it to Amazon S3. For more information,
see Input Data Setup (p. 772).
• You can only use private or vendor work teams to create video frame object detection labeling jobs.
• You specify your labels, label category and frame attributes, the task type, and worker instructions
in a label category configuration file. Specify the task type (bounding boxes, polylines, polygons, or
keypoints) using annotationType in your label category configuration file. For more information, see
Create a Labeling Category Configuration File with Label Category and Frame Attributes (p. 719) to
learn how to create this file. A sketch of such a file follows this list.
• You need to provide pre-defined ARNs for the pre-annotation and post-annotation (ACS) Lambda
functions. These ARNs are specific to the AWS Region you use to create your labeling job.
• To find the pre-annotation Lambda ARN, refer to PreHumanTaskLambdaArn. Use the Region
in which you are creating your labeling job to find the correct ARN that ends with PRE-
VideoObjectDetection.
• To find the post-annotation Lambda ARN, refer to AnnotationConsolidationLambdaArn. Use
the Region in which you are creating your labeling job to find the correct ARN that ends with ACS-
VideoObjectDetection.
• The number of workers specified in NumberOfHumanWorkersPerDataObject must be 1.
• Automated data labeling is not supported for video frame labeling jobs. Do not specify values for
parameters in LabelingJobAlgorithmsConfig.
• Video frame labeling jobs can take multiple hours to complete. You can specify a
longer time limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or 604,800
seconds).
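
The following is a sketch of what a label category configuration file might look like for a bounding box
task. The labels, instructions, and attribute values are placeholder assumptions; see Create a Labeling
Category Configuration File with Label Category and Frame Attributes (p. 719) for the authoritative
format.

{
    "document-version": "2020-03-01",
    "labels": [
        {"label": "Car"},
        {"label": "Pedestrian"}
    ],
    "annotationType": "BoundingBox",
    "instructions": {
        "shortInstruction": "Draw a box around each car and pedestrian.",
        "fullInstruction": "<p>Add detailed worker instructions here.</p>"
    }
}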

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region.

response = client.create_labeling_job(
LabelingJobName='example-video-od-labeling-job',
LabelAttributeName='label',
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': 's3://DOC-EXAMPLE-BUCKET/path/video-frame-sequence-input-
manifest.json'
}
},
'DataAttributes': {
'ContentClassifiers': [
'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
]
}
},
OutputConfig={
'S3OutputPath': 's3://DOC-EXAMPLE-BUCKET/prefix/file-to-store-output-data',
'KmsKeyId': 'string'
},
RoleArn='arn:aws:iam::*:role/*',
LabelCategoryConfigS3Uri='s3://bucket/prefix/label-categories.json',
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': 'arn:aws:sagemaker:us-east-1:*:workteam/private-crowd/*',
'UiConfig': {
'HumanTaskUiArn': 'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/
VideoObjectDetection'
},
'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-
VideoObjectDetection',
'TaskKeywords': [
'Video Frame Object Detection',
],
'TaskTitle': 'Video frame object detection task',
'TaskDescription': 'Classify and identify the location of objects and people in
video frames',
'NumberOfHumanWorkersPerDataObject': 1,
'TaskTimeLimitInSeconds': 123,
'TaskAvailabilityLifetimeInSeconds': 123,
'MaxConcurrentTaskCount': 123,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-
east-1:432418664414:function:ACS-VideoObjectDetection'
},
},
Tags=[
{
'Key': 'string',
'Value': 'string'
},
]
)
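
For reference, each line in a video frame sequence manifest file points to a sequence file in Amazon S3,
and the sequence file lists the frames to label. The following is a sketch of both; the bucket, keys, and
frame names are placeholder assumptions, and Input Data Setup (p. 772) documents the authoritative
format.

{"source-ref": "s3://DOC-EXAMPLE-BUCKET/video-frames/sequence-1.json"}

{
    "seq-no": 1,
    "prefix": "s3://DOC-EXAMPLE-BUCKET/video-frames/sequence-1/",
    "number-of-frames": 3,
    "frames": [
        {"frame-no": 0, "frame": "frame-0001.jpg"},
        {"frame-no": 1, "frame": "frame-0002.jpg"},
        {"frame-no": 2, "frame": "frame-0003.jpg"}
    ]
}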

Create Video Frame Object Detection Adjustment or Verification Labeling Job

You can create an adjustment and verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how create one, see Verify and Adjust Labels (p. 664).

Output Data Format

When you create a video frame object detection labeling job, tasks are sent to workers. When these
workers complete their tasks, labels are written to the Amazon S3 output location you specified when
you created the labeling job. To learn about the video frame object detection output data format, see
Video Frame Object Detection Output (p. 788). If you are a new user of Ground Truth, see Output
Data (p. 776) to learn more about the Ground Truth output data format.

Video Frame Object Tracking


You can use the video frame object tracking task type to have workers track the movement of objects in
a sequence of video frames (images extracted from a video) using bounding boxes, polylines, polygons,
or keypoint annotation tools. The tool you choose defines the video frame task type you create. For
example, you can use a bounding box video frame object tracking task type to ask workers to track the
movement of objects, such as cars, bikes, and pedestrians, by drawing boxes around them.

You provide a list of categories, and each annotation that a worker adds to a video frame is identified as
an instance of that category using an instance ID. For example, if you provide the label category car, the
first car that a worker annotates will have the instance ID car:1. The second car the worker annotates will
have the instance ID car:2. To track an object's movement, the worker adds annotations associated with
the same instance ID around the object in all frames.

You can create a video frame object tracking labeling job using the Amazon SageMaker Ground Truth
console, the SageMaker API, and language-specific AWS SDKs. To learn more, see Create a Video Frame
Object Tracking Labeling Job (p. 572) and select your preferred method. See Task Types (p. 576) to
learn more about the annotation tools you can choose from when you create a labeling job.

Ground Truth provides a worker UI and tools to complete your labeling job tasks. For more information,
see Preview the Worker UI (p. 572).

You can create a job to adjust annotations created in a video frame object tracking labeling job using
the video frame object tracking adjustment task type. To learn more, see Create a Video Frame Object
Tracking Adjustment or Verification Labeling Job (p. 575).

Preview the Worker UI

Ground Truth provides workers with a web user interface (UI) to complete your video frame object
tracking annotation tasks. You can preview and interact with the worker UI when you create a labeling
job in the console. If you are a new user, we recommend that you create a labeling job through the
console using a small input dataset to preview the worker UI and ensure your video frames, labels, and
label attributes appear as expected.

The UI provides workers with the following assistive labeling tools to complete your object tracking
tasks:

• For all tasks, workers can use the Copy to next and Copy to all features to copy an annotation with the
same unique ID to the next frame or to all subsequent frames respectively.
• For tasks that include the bounding box tools, workers can use a Predict next feature to draw a
bounding box in a single frame, and then have Ground Truth predict the location of boxes with the
same unique ID in all other frames. Workers can then make adjustments to correct predicted box
locations.

Create a Video Frame Object Tracking Labeling Job

You can create a video frame object tracking labeling job using the SageMaker console or the
CreateLabelingJob API operation.

This section assumes that you have reviewed the Video Frame Labeling Job Overview (p. 575) and have
chosen the type of input data and the input dataset connection you are using.

Create a Labeling Job (Console)

You can follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create a
video frame object tracking job in the SageMaker console. In step 10, choose Video - Object tracking
from the Task category dropdown list. Select the task type you want by selecting one of the cards in
Task selection.


Create a Labeling Job (API)

You can create an object tracking labeling job using the SageMaker API operation CreateLabelingJob.
This operation is available in all AWS SDKs. To see a list of language-specific SDKs that support
this operation, review the See Also section of CreateLabelingJob.

Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:

• You must enter an ARN for HumanTaskUiArn. Use
arn:aws:sagemaker:<region>:394669845002:human-task-ui/VideoObjectTracking, replacing
<region> with the AWS Region in which you are creating the labeling job. Do not include an entry
for the UiTemplateS3Uri parameter.
• Your LabelAttributeName must end in -ref. For example, ot-labels-ref.
• Your input manifest file must be a video frame sequence manifest file. You can create this manifest file
using the SageMaker console, or create it manually and upload it to Amazon S3. For more information,
see Input Data Setup (p. 772). If you create a streaming labeling job, the input manifest file is
optional.
• You can only use private or vendor work teams to create video frame object tracking labeling jobs.
• You specify your labels, label category and frame attributes, the task type, and worker instructions
in a label category configuration file. Specify the task type (bounding boxes, polylines, polygons,
or keypoints) using annotationType in your label category configuration file. To learn how to create
this file, see Create a Labeling Category Configuration File with Label Category and Frame
Attributes (p. 719); a minimal sketch of one way to create and upload it appears after this list.
• You need to provide pre-defined ARNs for the pre-annotation and post-annotation (ACS) Lambda
functions. These ARNs are specific to the AWS Region you use to create your labeling job.
• To find the pre-annotation Lambda ARN, refer to PreHumanTaskLambdaArn. Use the Region
in which you are creating your labeling job to find the correct ARN that ends with PRE-
VideoObjectTracking.
• To find the post-annotation Lambda ARN, refer to AnnotationConsolidationLambdaArn. Use
the Region in which you are creating your labeling job to find the correct ARN that ends with ACS-
VideoObjectTracking.
• The number of workers specified in NumberOfHumanWorkersPerDataObject must be 1.
• Automated data labeling is not supported for video frame labeling jobs. Do not specify values for
parameters in LabelingJobAlgorithmsConfig.
• Video frame object tracking labeling jobs can take multiple hours to complete. You can specify a
longer time limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or 604,800
seconds).
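
The following is a minimal sketch, using the AWS Python SDK (Boto3), of one way to create and upload
the label category configuration file referenced by LabelCategoryConfigS3Uri. The bucket, labels, and
instructions are hypothetical, and Create a Labeling Category Configuration File with Label Category
and Frame Attributes (p. 719) remains the authoritative reference for this schema.

import json

import boto3

# A sketch of a label category configuration file for a bounding box
# video frame object tracking task; all values are hypothetical.
label_category_config = {
    'document-version': '2020-08-15',
    'labels': [
        {'label': 'Car'},
        {'label': 'Pedestrian'}
    ],
    # annotationType selects the task type: BoundingBox, Polyline,
    # Polygon, or Keypoint.
    'annotationType': 'BoundingBox',
    'instructions': {
        'shortInstruction': 'Track each object across all frames.',
        'fullInstruction': 'Draw a box around each car and pedestrian. '
                           'Reuse the same instance ID for the same object '
                           'in every frame.'
    }
}

s3 = boto3.client('s3')
s3.put_object(
    Bucket='DOC-EXAMPLE-BUCKET',  # hypothetical bucket
    Key='prefix/label-categories.json',
    Body=json.dumps(label_category_config).encode('utf-8')
)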

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job in the US
East (N. Virginia) Region.

response = client.create_labeling_job(
    LabelingJobName='example-video-ot-labeling-job',
    LabelAttributeName='ot-labels-ref',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': 's3://DOC-EXAMPLE-BUCKET/path/video-frame-sequence-input-manifest.json'
            }
        },
        'DataAttributes': {
            'ContentClassifiers': [
                'FreeOfPersonallyIdentifiableInformation'|'FreeOfAdultContent',
            ]
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://DOC-EXAMPLE-BUCKET/prefix/file-to-store-output-data',
        'KmsKeyId': 'string'
    },
    RoleArn='arn:aws:iam::*:role/*',
    LabelCategoryConfigS3Uri='s3://bucket/prefix/label-categories.json',
    StoppingConditions={
        'MaxHumanLabeledObjectCount': 123,
        'MaxPercentageOfInputDatasetLabeled': 123
    },
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-east-1:*:workteam/private-crowd/*',
        'UiConfig': {
            'HumanTaskUiArn': 'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/VideoObjectTracking'
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-VideoObjectTracking',
        'TaskKeywords': [
            'Video Frame Object Tracking',
        ],
        'TaskTitle': 'Video frame object tracking task',
        'TaskDescription': 'Tracking the location of objects and people across video frames',
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 123,
        'TaskAvailabilityLifetimeInSeconds': 123,
        'MaxConcurrentTaskCount': 123,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-VideoObjectTracking'
        }
    },
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ]
)

Create a Video Frame Object Tracking Adjustment or Verification Labeling Job

You can create an adjustment and verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).

Output Data Format

When you create a video frame object tracking labeling job, tasks are sent to workers. When these
workers complete their tasks, labels are written to the Amazon S3 output location you specified when
you created the labeling job. To learn about the video frame object tracking output data format, see
Video Frame Object Tracking Output (p. 790). If you are a new user of Ground Truth, see Output
Data (p. 776) to learn more about the Ground Truth output data format.

Video Frame Labeling Job Overview


Use this page to learn about the object detection and object tracking video frame labeling jobs. The
information on this page applies to both of these built-in task types.

The video frame labeling job is unique because of the following:


• You can either provide data objects that are ready to be annotated (video frames), or you can provide
video files and have Ground Truth automatically extract video frames.
• Workers have the ability to save work as they go.
• You cannot use the Amazon Mechanical Turk workforce to complete your labeling tasks.
• Ground Truth provides a worker UI, as well as assistive and basic labeling tools, to help workers
complete your tasks. You do not need to provide a worker task template.

Use the following topics to learn more.

Topics
• Input Data (p. 576)
• Job Completion Times (p. 576)
• Task Types (p. 576)
• Workforces (p. 577)
• Worker User Interface (UI) (p. 577)
• Video Frame Job Permission Requirements (p. 579)

Input Data

The video frame labeling job uses sequences of video frames. A single sequence is a series of images that
have been extracted from a single video. You can either provide your own sequences of video frames, or
have Ground Truth automatically extract video frame sequences from your video files. To learn more, see
Provide Video Files (p. 772).

Ground Truth uses sequence files to identify all images in a single sequence. All of the sequences that
you want to include in a single labeling job are identified in an input manifest file. Each sequence is
used to create a single worker task. You can automatically create sequence files and an input manifest
file using Ground Truth automatic data setup. To learn more, see Automated Video Frame Input Data
Setup (p. 773).

To learn how to manually create sequence files and an input manifest file, see Create a Video Frame
Input Manifest File (p. 775).
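
The following is a minimal sketch, using the AWS Python SDK (Boto3), of manually creating one sequence
file and a one-line input manifest that points to it. The key names follow the sequence file format
referenced above; the bucket, file names, and frame names are hypothetical, and Create a Video Frame
Input Manifest File (p. 775) remains the authoritative reference.

import json

import boto3

# A sketch of a sequence file describing three frames extracted from one
# video; all names and values are hypothetical.
sequence = {
    'seq-no': 1,
    'prefix': 's3://DOC-EXAMPLE-BUCKET/video1/',  # location of the frames
    'number-of-frames': 3,
    'frames': [
        {'frame-no': 0, 'frame': 'frame0000.jpg'},
        {'frame-no': 1, 'frame': 'frame0001.jpg'},
        {'frame-no': 2, 'frame': 'frame0002.jpg'}
    ]
}

s3 = boto3.client('s3')
s3.put_object(
    Bucket='DOC-EXAMPLE-BUCKET',
    Key='video1/seq1.json',
    Body=json.dumps(sequence).encode('utf-8')
)

# Each line of the input manifest identifies one sequence file; each
# sequence becomes a single worker task.
manifest_line = {'source-ref': 's3://DOC-EXAMPLE-BUCKET/video1/seq1.json'}
s3.put_object(
    Bucket='DOC-EXAMPLE-BUCKET',
    Key='video1/manifest.json',
    Body=(json.dumps(manifest_line) + '\n').encode('utf-8')
)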

Job Completion Times

Video and video frame labeling jobs can take workers hours to complete. You can set the total amount of
time that workers can work on each task when you create a labeling job. The maximum time you can set
for workers to work on tasks is 7 days. The default value is 3 days.

We strongly recommend that you create tasks that workers can complete within 12 hours. Workers must
keep the worker UI open while working on a task. They can save work as they go and Ground Truth saves
their work every 15 minutes.

When using the SageMaker CreateLabelingJob API operation, set the total time a task is available to
workers in the TaskTimeLimitInSeconds parameter of HumanTaskConfig.

When you create a labeling job in the console, you can specify this time limit when you select your
workforce type and your work team.

Task Types

When you create a video object tracking or video object detection labeling job, you specify the type of
annotation that you want workers to create while working on your labeling task. The annotation type
determines the type of output data Ground Truth returns and defines the task type for your labeling job.


If you are creating a labeling job using the API operation CreateLabelingJob, you specify the task
type using the label category configuration file parameter annotationType. To learn more, see Create
a Labeling Category Configuration File with Label Category and Frame Attributes (p. 719).

The following task types are available for both video object tracking and video object detection labeling
jobs:

• Bounding box – Workers are provided with tools to create bounding box annotations. A bounding box
is a box that a worker draws around an object to identify the pixel location and label of that object in
the frame.
• Polyline – Workers are provided with tools to create polyline annotations. A polyline is defined by a
series of ordered x, y coordinates. Each point added to the polyline is connected to the previous point
by a line. The polyline does not have to be closed (the start point and end point do not have to be the
same) and there are no restrictions on the angles formed between lines.
• Polygon – Workers are provided with tools to create polygon annotations. A polygon is a closed shape
defined by a series of ordered x, y coordinates. Each point added to the polygon is connected to the
previous point by a line and there are no restrictions on the angles formed between lines. Two lines
(sides) of the polygon cannot cross. The start and end point of a polygon must be the same.
• Keypoint – Workers are provided with tools to create keypoint annotations. A keypoint is a single point
associated with an x, y coordinate in the video frame.

Workforces

When you create a video frame labeling job, you need to specify a work team to complete your
annotation tasks. You can choose a work team from a private workforce of your own workers, or from a
vendor workforce that you select in the AWS Marketplace. You cannot use the Amazon Mechanical Turk
workforce for video frame labeling jobs.

To learn more about vendor workforces, see Managing Vendor Workforces (p. 867).

To learn how to create and manage a private workforce, see Use a Private Workforce (p. 868).

Worker User Interface (UI)

Ground Truth provides a worker user interface (UI), tools, and assistive labeling features to help workers
complete your video labeling tasks. You can preview the worker UI when you create a labeling job in the
console.

When you create a labeling job using the API operation CreateLabelingJob, you must provide an ARN
provided by Ground Truth in the parameter HumanTaskUiArn to specify the worker UI for your task
type. You can use HumanTaskUiArn with the SageMaker RenderUiTemplate API operation to preview
the worker UI.
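
For example, the following is a minimal sketch, using the AWS Python SDK (Boto3), of previewing a
worker UI with RenderUiTemplate. The role ARN and task input are hypothetical placeholders.

import boto3

client = boto3.client('sagemaker', region_name='us-east-1')

response = client.render_ui_template(
    Task={'Input': '{}'},  # hypothetical task input for the preview
    RoleArn='arn:aws:iam::111122223333:role/example-ground-truth-role',
    HumanTaskUiArn='arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/VideoObjectTracking'
)

# RenderedContent is the HTML of the worker UI; save it to inspect locally.
with open('preview.html', 'w') as f:
    f.write(response['RenderedContent'])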

You provide worker instructions, labels, and optionally, attributes that workers can use to provide more
information about labels and video frames. These attributes are referred to as label category attributes
and frame attributes respectively. They are all displayed in the worker UI.

Label Category and Frame Attributes

When you create a video object tracking or video object detection labeling job, you can add one or more
label category attributes and frame attributes:

• Label category attribute – A list of options (strings), a free form text box, or a numeric field associated
with one or more labels. It is used by workers to provide metadata about a label.
• Frame attribute – A list of options (strings), a free form text box, or a numeric field that appears on
each video frame a worker is sent to annotate. It is used by workers to provide metadata about video
frames.


Additionally, you can use label and frame attributes to have workers verify labels in a video frame label
verification job.

Use the following sections to learn more about these attributes. To learn how to add label category and
frame attributes to a labeling job, use the Create Labeling Job sections on the task type page (p. 567)
of your choice.

Label Category Attributes


Add label category attributes to labels to give workers the ability to provide more information about the
annotations they create. A label category attribute is added to an individual label, or to all labels. When a
label category attribute is applied to all labels it is referred to as a global label category attribute.

For example, if you add the label category car, you might also want to capture additional data about
your labeled cars, such as whether they are occluded, or the size of the car. You can capture this
metadata using label category attributes. In this example, if you added the attribute occluded to the car
label category, you can assign the values partial, completely, and no to the occluded attribute and
enable workers to select one of these options.

When you create a label verification job, you add label category attributes to each label you want
workers to verify.

Frame Level Attributes


Add frame attributes to give workers the ability to provide more information about individual video
frames. Each frame attribute you add appears on all frames.

For example, you can add a number-frame attribute to have workers identify the number of objects they
see in a particular frame.

In another example, you may want to provide a free-form text box to give workers the ability to provide
an answer to a question.

When you create a label verification job, you can add one or more frame attributes to ask workers to
provide feedback on all labels in a video frame.
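
The following is a minimal sketch of how a label category attribute and a frame attribute might appear
in a label category configuration file. The attribute names, descriptions, and values are hypothetical,
and Create a Labeling Category Configuration File with Label Category and Frame Attributes (p. 719)
remains the authoritative reference for this schema.

# A sketch of a configuration fragment: a car label with an "occluded"
# label category attribute, plus a numeric frame attribute. All names,
# descriptions, and values are hypothetical.
label_category_config_fragment = {
    'labels': [
        {
            'label': 'car',
            'attributes': [
                {
                    'name': 'occluded',
                    'description': 'Is this car partially obstructed?',
                    'type': 'string',
                    'enum': ['partial', 'completely', 'no']
                }
            ]
        }
    ],
    'frameAttributes': [
        {
            'name': 'number-of-objects',
            'description': 'How many objects do you see in this frame?',
            'type': 'number'
        }
    ]
}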

Worker Instructions
You can provide worker instructions to help your workers complete your video frame labeling tasks. You
might want to cover the following topics when writing your instructions:

• Best practices and things to avoid when annotating objects.


• The label category attributes provided (for object detection and object tracking tasks) and how to use
them.
• How to save time while labeling by using keyboard shortcuts.

You can add your worker instructions using the SageMaker console while creating a labeling job. If you
create a labeling job using the API operation CreateLabelingJob, you specify worker instructions in
your label category configuration file.

In addition to your instructions, Ground Truth provides a link to help workers navigate and use the
worker portal. View these instructions by selecting the task type on Worker Instructions (p. 579).

Declining Tasks
Workers are able to decline tasks.

Workers may decline a task if the instructions are not clear, the input data is not displaying correctly,
or if they encounter some other issue with the task. If the number of workers specified in
NumberOfHumanWorkersPerDataObject decline the task, the data object is marked as expired and
is not sent to additional workers.


Video Frame Job Permission Requirements

When you create a video frame labeling job, in addition to the permission requirements found in Assign
IAM Permissions to Use Ground Truth (p. 817), you must add a CORS policy to your S3 bucket that
contains your input manifest file.

Add a CORS Permission Policy to S3 Bucket

When you create a video frame labeling job, you specify buckets in S3 where your input data and
manifest file are located and where your output data will be stored. These buckets may be the same. You
must attach the following Cross-origin resource sharing (CORS) policy to your input and output buckets.
If you use the Amazon S3 console to add the policy to your bucket, you must use the JSON format.

JSON

[
{
"AllowedHeaders": [
"*"
],
"AllowedMethods": [
"GET",
"HEAD",
"PUT"
],
"AllowedOrigins": [
"*"
],
"ExposeHeaders": [
"Access-Control-Allow-Origin"
],
"MaxAgeSeconds": 3000
}
]

XML

<?xml version="1.0" encoding="UTF-8"?>


<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
<AllowedOrigin>*</AllowedOrigin>
<AllowedMethod>GET</AllowedMethod>
<AllowedMethod>HEAD</AllowedMethod>
<AllowedMethod>PUT</AllowedMethod>
<MaxAgeSeconds>3000</MaxAgeSeconds>
<ExposeHeader>Access-Control-Allow-Origin</ExposeHeader>
<AllowedHeader>*</AllowedHeader>
</CORSRule>
</CORSConfiguration>

To learn how to add a CORS policy to an S3 bucket, see How do I add cross-domain resource sharing
with CORS? in the Amazon Simple Storage Service User Guide.
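
Alternatively, the following is a minimal sketch, using the AWS Python SDK (Boto3), of attaching the
CORS policy above to a bucket; the bucket name is hypothetical.

import boto3

s3 = boto3.client('s3')
s3.put_bucket_cors(
    Bucket='DOC-EXAMPLE-BUCKET',  # hypothetical input or output bucket
    CORSConfiguration={
        'CORSRules': [
            {
                'AllowedHeaders': ['*'],
                'AllowedMethods': ['GET', 'HEAD', 'PUT'],
                'AllowedOrigins': ['*'],
                'ExposeHeaders': ['Access-Control-Allow-Origin'],
                'MaxAgeSeconds': 3000
            }
        ]
    }
)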

Worker Instructions
This topic provides an overview of the Ground Truth worker portal and the tools available to complete
your video frame labeling task. First, select the type of task you are working on from Topics.
Important
It is recommended that you complete your task using a Google Chrome or Firefox web browser.


For adjustment jobs, select the original labeling job task type that produced the labels you are adjusting.
Review and adjust the labels in your task as needed.

Topics
• Work on Video Frame Object Tracking Tasks (p. 580)
• Work on Video Frame Object Detection Tasks (p. 587)

Work on Video Frame Object Tracking Tasks


Video frame object tracking tasks require you to track the movement of objects across video frames. A
video frame is a still image from a video scene.

You can use the worker UI to navigate between video frames and use the tools provided to identify
unique objects and track their movement from one frame to the next. Use this page to learn how to
navigate your worker UI, use the tools provided, and complete your task.

It is recommended that you complete your task using a Google Chrome or Firefox web browser.
Important
If you see annotations have already been added to one or more video frames when you open
your task, adjust those annotations and add additional annotations as needed.

Topics
• Your Task (p. 580)
• Navigate the UI (p. 582)
• Bulk Edit Label and Frame Attributes (p. 582)
• Tool Guide (p. 583)
• Icons Guide (p. 585)
• Shortcuts (p. 586)
• Release, Stop and Resume, and Decline Tasks (p. 587)
• Saving Your Work and Submitting (p. 587)

Your Task

When you work on a video frame object tracking task, you need to select a category from the Label
category menu on the right side of your worker portal to start annotating. After you've chosen a
category, use the tools provided to annotate the objects that the category applies to. This annotation
will be associated with a unique label ID that should only be used for that object. Use this same label ID
to create additional annotations for the same object in all of the video frames that it appears in. Refer to
Tool Guide (p. 583) to learn more about the tools provided.

After you've added a label, you may see a downward pointing arrow next to the label in the Labels
menu. Select this arrow and then select one option for each label attribute you see to provide more
information about that label.

You may see frame attributes under the Labels menu. These attributes will appear on each frame in your
task. Use these attribute prompts to enter additional information about each frame.


After you've added a label, you can quickly add and edit a label category attribute value by using the
downward pointing arrow next to the label in the Labels menu. If you select the pencil icon next to the
label in the Labels menu, the Edit instance menu will appear. You can edit the label ID, label category,
and label category attributes using this menu.

To edit an annotation, select the label of the annotation that you want to edit in the Labels menu or
select the annotation in the frame. When you edit or delete an annotation, the action will only modify
the annotation in a single frame.

If you are working on a task that includes a bounding box tool, use the predict next icon to predict the
location of all bounding boxes that you have drawn in a frame in the next frame. If you select a single
box and then select the predict next icon, only that box will be predicted in the next frame. If you have
not added any boxes to the current frame, you will receive an error. You must add at least one box to the
frame before using this feature.


After you've used the predict next icon, review the location of each box in the next frame and make
adjustments to the box location and size if necessary.

For all other tools, you can use the Copy to next and Copy to all tools to copy your annotations to the
next or all frames respectively.

Navigate the UI
You can navigate between video frames using the navigation bar in the bottom-left corner of your UI.

Use the play button to automatically move through the entire sequence of frames.

Use the next frame and previous frame buttons to move forward or back one frame at a time. You can
also input a frame number to navigate to that frame.

You can zoom in to and out of all video frames. Once you have zoomed into a video frame, you can move
around in that frame using the move icon. When you set a new view in a single video frame by zooming
and moving within that frame, all video frames are set to the same view. You can reset all video frames
to their original view using the fit screen icon. For additional view options, see Icons Guide (p. 585).

When you are in the worker UI, you see the following menus:

• Instructions – Review these instructions before starting your task. Additionally, select More
instructions to review additional guidance.
• Shortcuts – Use this menu to view keyboard shortcuts that you can use to navigate video frames and
use the tools provided.
• Help – Use this option to refer back to this documentation.

Bulk Edit Label and Frame Attributes


You can bulk edit label attributes and frame attributes (attributes).

When you bulk edit an attribute, you specify one or more ranges of frames that you want to apply the
edit to. The attribute you select is edited in all frames in that range, including the start and end frames
you specify. When you bulk edit label attributes, the range you specify must contain the label that the
label attribute is attached to. If you specify frames that do not contain this label, you will receive an
error.

To bulk edit an attribute you must specify the desired value for the attribute first. For example, if you
want to change an attribute from Yes to No, you must select No, and then perform the bulk edit.

You can also specify a new value for an attribute that has not been filled in and then use the bulk edit
feature to fill in that value in multiple frames. To do this, select the desired value for the attribute and
complete the following procedure.

To bulk edit a label or attribute:

1. Use your mouse to right click the attribute you want to bulk edit.
2. Specify the range of frames you want to apply the bulk edit to using a dash (-) in the text box. For
example, if you want to apply the edit to frames one through ten, enter 1-10. If you want to apply
the edit to frames two to five, eight to ten, and twenty, enter 2-5,8-10,20.
3. Select Confirm.

If you get an error message, verify that you entered a valid range and that the label associated with the
label attribute you are editing (if applicable) exists in all frames specified.

You can quickly add a label to all previous or subsequent frames using the Duplicate to previous frames
and Duplicate to next frames options in the Label menu at the top of your screen.


Tool Guide
Your task will include one or more tools. The tool provided dictates the type of annotations you will
create to identify and track objects. Use the following list to learn more about each tool provided.

• Bounding box – Add a bounding box annotation. Choose this icon to add a bounding box. Each
bounding box you add is associated with the category you choose from the Label category drop down
menu. Select the bounding box or its associated label to adjust it.
• Bounding box – Predict bounding boxes in the next frame. Select a bounding box, and then choose
this icon to predict the location of that box in the next frame. You can select the icon multiple times in
a row to automatically detect the location of the box in multiple frames. For example, select this icon
5 times to predict the location of a bounding box in the next 5 frames.
• Keypoint – Add a keypoint annotation. Choose this icon to add a keypoint. Click on an object in the
image to place the keypoint at that location. Each keypoint you add is associated with the category
you choose from the Label category drop down menu. Select a keypoint or its associated label to
adjust it.
• Polyline – Add a polyline annotation. Choose this icon to add a polyline. To add a polyline,
continuously click around the object of interest to add new points. To stop drawing a polyline, select
the last point that you placed a second time (this point will be green), or press Enter on your keyboard.
Each point added to the polyline is connected to the previous point by a line. The polyline does not
have to be closed (the start point and end point do not have to be the same) and there are no
restrictions on the angles formed between lines. Each polyline you add is associated with the category
you choose from the Label category drop down menu. Select the polyline or its associated label to
adjust it.
• Polygon – Add a polygon annotation. Choose this icon to add a polygon. To add a polygon,
continuously click around the object of interest to add new points. To stop drawing the polygon, select
the start point (this point will be green). A polygon is a closed shape defined by a series of points that
you place. Each point added to the polygon is connected to the previous point by a line and there are
no restrictions on the angles formed between lines. The start and end point must be the same. Each
polygon you add is associated with the category you choose from the Label category drop down menu.
Select the polygon or its associated label to adjust it.
• Copy to Next – Copy annotations to the next frame. If one or more annotations are selected in the
current frame, those annotations are copied to the next frame. If no annotations are selected, all
annotations in the current frame are copied to the next frame.
• Copy to All – Copy annotations to all subsequent frames. If one or more annotations are selected in
the current frame, those annotations are copied to all subsequent frames. If no annotations are
selected, all annotations in the current frame are copied to all subsequent frames.

Icons Guide

Use this list to learn about the icons you see in your UI. You can automatically select some of these
icons using the keyboard shortcuts found in the Shortcuts menu.

• brightness – Choose this icon to adjust the brightness of all video frames.
• contrast – Choose this icon to adjust the contrast of all video frames.
• zoom in – Choose this icon to zoom into all of the video frames.
• zoom out – Choose this icon to zoom out of all of the video frames.
• move screen – After you've zoomed into a video frame, choose this icon to move around in that video
frame. You can move around the video frame using your mouse by clicking and dragging the frame in
the direction you want it to move. This changes the view in all video frames.
• fit screen – Reset all video frames to their original position.
• undo – Undo an action. You can use this icon to remove a bounding box that you just added, or to
undo an adjustment you made to a bounding box.
• redo – Redo an action that was undone using the undo icon.
• delete label – Delete a label. This deletes the bounding box associated with the label in a single frame.
• show or hide label – Select this icon to show a label that has been hidden. If this icon has a slash
through it, select it to hide the label.
• edit label – Select this icon to open the Edit instance menu. Use this menu to edit a label category or
ID, and to add or edit label attributes.

Shortcuts

The keyboard shortcuts listed in the Shortcuts menu can help you quickly select icons, undo and redo
annotations, and use tools to add and edit annotations. For example, once you add a bounding box, you
can use P to quickly predict the location of that box in subsequent frames.

Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands.


Release, Stop and Resume, and Decline Tasks

When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:

• Decline task: You should only decline a task if something is wrong with the task, such as unclear video
frame images or an issue with the UI. If you decline a task, you will not be able to return to the task.
• Release Task: Use this option to release a task and allow others to work on it. When you release a task,
you lose all work done on that task and other workers on your team can pick it up. If enough workers
pick up the task, you may not be able to return to it. When you select this button and then select
Confirm, you are returned to the worker portal. If the task is still available, its status will be Available.
If other workers pick it up, it will disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.

Be aware that the person who creates your labeling tasks specifies a time limit within which all tasks
must be completed. If you do not return to and complete this task within that time limit, it will expire
and your work will not be submitted. Contact your administrator for more information.

Saving Your Work and Submitting

You should periodically save your work using the Save button. Ground Truth automatically saves your
work every 15 minutes.

When you open a task, you must complete your work on it before pressing Submit.

Work on Video Frame Object Detection Tasks


Video frame object detection tasks require you to classify and identify the location of objects in video
frames using annotations. A video frame is a still image from a video scene.

You can use the worker UI to navigate between video frames and create annotations to identify objects
of interest. Use the sections on this page to learn how to navigate your worker UI, use the tools provided,
and complete your task.

It is recommended that you complete your task using a Google Chrome web browser.
Important
If you see annotations have already been added to one or more video frames when you open
your task, adjust those annotations and add additional annotations as needed.

Topics
• Your Task (p. 588)
• Navigate the UI (p. 589)
• Bulk Edit Label and Frame Attributes (p. 589)
• Tool Guide (p. 590)
• UI Icon Guide (p. 593)
• Shortcuts (p. 594)
• Release, Stop and Resume, and Decline Tasks (p. 594)
• Saving Your Work and Submitting (p. 595)


Your Task

When you work on a video frame object detection task, you need to select a category from the Label
category menu on the right side of your worker portal to start annotating. After you've chosen a
category, draw annotations around objects that this category applies to. To learn more about the tools
you see in your worker UI, refer to the Tool Guide (p. 590).

After you've added a label, you may see a downward pointing arrow next to the label in the Labels
menu. Select this arrow and then select one option for each label attribute you see to provide more
information about that label.

You may see frame attributes under the Labels menu. These attributes will appear on each frame in your
task. Use these attribute prompts to enter additional information about each frame.


To edit an annotation, select the label of the annotation that you want to edit in the Labels menu or
select the annotation in the frame. When you edit or delete an annotation, the action will only modify
the annotation in a single frame.

If you are working on a task that includes a bounding box tool, use the predict next icon to predict the
location of all bounding boxes that you have drawn in a frame in the next frame. If you select a single
box and then select the predict next icon, only that box will be predicted in the next frame. If you have
not added any boxes to the current frame, you will receive an error. You must add at least one box to the
frame before using this feature.
Note
The predict next feature will not overwrite manually created annotations. It will only add
annotations. If you use predict next and as a result have more than one bounding box around a
single object, delete all but one box. Each object should only be identified with a single box.

After you've used the predict next icon, review the location of each box in the next frame and make
adjustments to the box location and size if necessary.

For all other tools, you can use the Copy to next and Copy to all tools to copy your annotations to the
next or all frames respectively.

Navigate the UI
You can navigate between video frames using the navigation bar in the bottom-left corner of your UI.

Use the play button to automatically play through multiple frames.

Use the next frame and previous frame buttons to move forward or back one frame at a time. You can
also input a frame number to navigate to that frame.

You can zoom in to and out of all video frames. Once you have zoomed into a video frame, you can move
around in that frame using the move icon. When you navigate to a new view in a single video frame by
zooming and moving within that frame, all video frames are set to the same view. You can reset all video
frames to their original view using the fit screen icon. To learn more, see UI Icon Guide (p. 593).

When you are in the worker UI, you see the following menus:

• Instructions – Review these instructions before starting your task. Additionally, select More
instructions to review additional guidance.
• Shortcuts – Use this menu to view keyboard shortcuts that you can use to navigate video frames and
use the annotation tools provided.
• Help – Use this option to refer back to this documentation.


Bulk Edit Label and Frame Attributes


You can bulk edit label attributes and frame attributes (attributes).

When you bulk edit an attribute, you specify one or more ranges of frames that you want to apply the
edit to. The attribute you select is edited in all frames in that range, including the start and end frames
you specify. When you bulk edit label attributes, the range you specify must contain the label that the
label attribute is attached to. If you specify frames that do not contain this label, you will receive an
error.

To bulk edit an attribute you must specify the desired value for the attribute first. For example, if you
want to change an attribute from Yes to No, you must select No, and then perform the bulk edit.

You can also specify a new value for an attribute that has not been filled in and then use the bulk edit
feature to fill in that value in multiple frames. To do this, select the desired value for the attribute and
complete the following procedure.


To bulk edit a label or attribute:

1. Use your mouse to right click the attribute you want to bulk edit.
2. Specify the range of frames you want to apply the bulk edit to using a dash (-) in the text box. For
example, if you want to apply the edit to frames one through ten, enter 1-10. If you want to apply
the edit to frames two to five, eight to ten, and twenty, enter 2-5,8-10,20.
3. Select Confirm.

If you get an error message, verify that you entered a valid range and that the label associated with the
label attribute you are editing (if applicable) exists in all frames specified.

You can quickly add a label to all previous or subsequent frames using the Duplicate to previous frames
and Duplicate to next frames options in the Label menu at the top of your screen.

Tool Guide
Your task will include one or more tools. The tool provided dictates the type of annotations you will
create to identify and label objects. Use the following list to learn more about the tool or tools you
may see in your worker UI.

• Bounding box – Add a bounding box annotation. Choose this icon to add a bounding box. Each
bounding box you add is associated with the category you choose from the Label category drop down
menu. Select the bounding box or its associated label to adjust it.
• Predict next – Predict bounding boxes in the next frame. Select a bounding box, and then choose this
icon to predict the location of that box in the next frame. You can select the icon multiple times in a
row to automatically detect the location of the box in multiple frames. For example, select this icon 5
times to predict the location of a bounding box in the next 5 frames.
• Keypoint – Add a keypoint annotation. Choose this icon to add a keypoint. Click on an object in the
image to place the keypoint at that location. Each keypoint you add is associated with the category
you choose from the Label category drop down menu. Select a keypoint or its associated label to
adjust it.
• Polyline – Add a polyline annotation. Choose this icon to add a polyline. To add a polyline,
continuously click around the object of interest to add new points. To stop drawing a polyline, select
the last point that you placed a second time (this point will be green), or press Enter on your keyboard.
Each point added to the polyline is connected to the previous point by a line. The polyline does not
have to be closed (the start point and end point do not have to be the same) and there are no
restrictions on the angles formed between lines. Each polyline you add is associated with the category
you choose from the Label category drop down menu. Select the polyline or its associated label to
adjust it.
• Polygon – Add a polygon annotation. Choose this icon to add a polygon. To add a polygon,
continuously click around the object of interest to add new points. To stop drawing the polygon, select
the start point (this point will be green). A polygon is a closed shape defined by a series of points that
you place. Each point added to the polygon is connected to the previous point by a line and there are
no restrictions on the angles formed between lines. Two lines (sides) of the polygon cannot cross. A
line will become red if it violates this condition. The start and end point must be the same. Each
polygon you add is associated with the category you choose from the Label category drop down menu.
Select the polygon or its associated label to adjust it.
• Copy to Next – Copy annotations to the next frame. If one or more annotations are selected in the
current frame, those annotations are copied to the next frame. If no annotations are selected, all
annotations in the current frame are copied to the next frame.
• Copy to All – Copy annotations to all subsequent frames. If one or more annotations are selected in
the current frame, those annotations are copied to all subsequent frames. If no annotations are
selected, all annotations in the current frame are copied to all subsequent frames.

UI Icon Guide

Use this list to learn about the icons you see in your worker task portal. You can automatically select
these icons using the keyboard shortcuts found in the Shortcuts menu.

• brightness – Choose this icon to adjust the brightness of all video frames.
• contrast – Choose this icon to adjust the contrast of all video frames.
• zoom in – Choose this icon to zoom into all of the video frames.
• zoom out – Choose this icon to zoom out of all of the video frames.
• move screen – After you've zoomed into a video frame, choose this icon to move around in that video
frame. You can move around in the video frame using your mouse by clicking and dragging the frame
in the direction you want it to move. This changes the view in all video frames.
• fit screen – Reset all video frames to their original position.
• undo – Undo an action. You can use this icon to remove a bounding box that you just added, or to
undo an adjustment you made to a bounding box.
• redo – Redo an action that was undone using the undo icon.
• delete label – Delete a label. This deletes the bounding box associated with the label in a single frame.
• show or hide label – Select this icon to show a label that has been hidden. If this icon has a slash
through it, select it to hide the label.

Shortcuts

The keyboard shortcuts listed in the Shortcuts menu can help you quickly select icons, undo and redo
annotations, and use tools to add and edit annotations. For example, once you add a bounding box, you
can use P to quickly predict the location of that box in subsequent frames.

Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands.

Release, Stop and Resume, and Decline Tasks

When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:

• Decline task: You should only decline a task if something is wrong with the task, such as unclear video
frame images or an issue with the UI. If you decline a task, you will not be able to return to the task.
• Release Task: Use this option to release a task and allow others to work on it. When you release a task,
you lose all work done on that task and other workers on your team can pick it up. If enough workers
pick up the task, you may not be able to return to it. When you select this button and then select
Confirm, you are returned to the worker portal. If the task is still available, its status will be Available.
If other workers pick it up, it will disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.

Be aware that the person who creates your labeling tasks specifies a time limit within which all tasks
must be completed. If you do not return to and complete this task within that time limit, it will expire
and your work will not be submitted. Contact your administrator for more information.


Saving Your Work and Submitting

You should periodically save your work. Ground Truth automatically saves your work every 15 minutes.

When you open a task, you must complete your work before pressing Submit.

Use Ground Truth to Label 3D Point Clouds


Create a 3D point cloud labeling job to have workers label objects in 3D point clouds generated from
3D sensors like Light Detection and Ranging (LiDAR) sensors and depth cameras, or generated from 3D
reconstruction by stitching images captured by an agent like a drone.

3D Point Clouds
Point clouds are made up of three-dimensional (3D) visual data that consists of points. Each point is
described using three coordinates, typically x, y, and z. To add color or variations in point intensity to
the point cloud, points may be described with additional attributes, such as i for intensity or values for
the red (r), green (g), and blue (b) 8-bit color channels. When you create a Ground Truth 3D point cloud
labeling job, you can provide point cloud and, optionally, sensor fusion data.
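
For example, the following is a minimal sketch of reading an ASCII point cloud file that stores one point
per line in x y z i order; the file name is hypothetical, and Accepted Raw 3D Data Formats (p. 746)
describes the formats that Ground Truth accepts.

# A sketch of parsing a hypothetical "x y z i" ASCII point cloud file.
points = []
with open('frame0.txt') as f:
    for line in f:
        x, y, z, i = (float(value) for value in line.split())
        points.append((x, y, z, i))

print('Loaded', len(points), 'points')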

The following image shows a single, 3D point cloud scene rendered by Ground Truth and displayed in the
semantic segmentation worker UI.


LiDAR
A Light Detection and Ranging (LiDAR) sensor is a common type of sensor used to collect measurements
that are used to generate point cloud data. LiDAR is a remote sensing method that uses light in the form
of a pulsed laser to measure the distances of objects from the sensor. You can provide 3D point cloud
data generated from a LiDAR sensor for a Ground Truth 3D point cloud labeling job using the raw data
formats described in Accepted Raw 3D Data Formats (p. 746).

Sensor Fusion
Ground Truth 3D point cloud labeling jobs include a sensor fusion feature that supports video camera
sensor fusion for all task types. Some sensors come with multiple LiDAR devices and video cameras that
capture images and associate them with a LiDAR frame. To help annotators visually complete your tasks
with high confidence, you can use the Ground Truth sensor fusion feature to project annotations (labels)
from a 3D point cloud to 2D camera images and vice versa using 3D scanner (such as LiDAR) extrinsic
matrix and camera extrinsic and intrinsic matrices. To learn more, see Sensor Fusion (p. 763).
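
The projection that sensor fusion relies on can be summarized with the standard pinhole camera model.
The following is a minimal sketch, not Ground Truth's implementation, of how an extrinsic matrix
(rotation R and translation t) and an intrinsic matrix K map a 3D world point to 2D pixel coordinates;
all matrix values are hypothetical.

import numpy as np

K = np.array([[1000.0, 0.0, 640.0],    # hypothetical camera intrinsics
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                          # hypothetical camera rotation
t = np.array([0.0, 0.0, 2.0])          # hypothetical camera translation

p_world = np.array([1.0, 0.5, 10.0])   # a point from the 3D point cloud

p_camera = R @ p_world + t             # world -> camera (extrinsic)
u, v, w = K @ p_camera                 # camera -> image plane (intrinsic)
print(u / w, v / w)                    # pixel coordinates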

Label 3D Point Clouds


Ground Truth provides a user interface (UI) and tools that workers use to label or annotate 3D point
clouds. When you use the object detection or semantic segmentation task types, workers can annotate a
single point cloud frame. When you use object tracking, workers annotate a sequence of frames. You can
use object tracking to track object movement across all frames in a sequence.

The following demonstrates how a worker would use the Ground Truth worker portal and tools to
annotate a 3D point cloud for an object detection task. For similar visual examples of other task types,
see 3D Point Cloud Task types (p. 599).


Assistive Labeling Tools for Point Cloud Annotation


Ground Truth offers assistive labeling tools to help workers complete your point cloud annotation tasks
faster and with more accuracy. For details about assistive labeling tools that are included in the worker UI
for each task type, select a task type (p. 599) and refer to the View the Worker Task Interface section
of that page.

Next Steps
You can create six types of tasks when you use Ground Truth 3D point cloud labeling jobs. Use the topics
in 3D Point Cloud Task types (p. 599) to learn more about these task types and to learn how to create a
labeling job using the task type of your choice.

The 3D point cloud labeling job is different from other Ground Truth labeling modalities. Before
creating a labeling job, we recommend that you read 3D Point Cloud Labeling Jobs Overview (p. 630).
Additionally, review input data quotas in 3D Point Cloud and Video Frame Labeling Job Quotas (p. 744).

For an end-to-end demo using the SageMaker API and AWS Python SDK (Boto3) to create a 3D point
cloud labeling job, see create-3D-pointcloud-labeling-job.ipynb in the SageMaker Examples notebook
tab.
Important
If you use a notebook instance created before June 5th, 2020 to run this notebook, you must
stop and restart that notebook instance for the notebook to work.

Topics
• 3D Point Cloud Task types (p. 599)
• 3D Point Cloud Labeling Jobs Overview (p. 630)
• Worker Instructions (p. 634)

3D Point Cloud Task types


You can use Ground Truth 3D point cloud labeling modality for a variety of use cases. The following list
briefly describes each 3D point cloud task type. For additional details and instructions on how to create a
labeling job using a specific task type, select the task type name to see its task type page.

• 3D point cloud object detection – Use this task type when you want workers to locate and classify
objects in a 3D point cloud by adding and fitting 3D cuboids around objects.
• 3D point cloud object tracking – Use this task type when you want workers to add and fit 3D cuboids
around objects to track their movement across a sequence of 3D point cloud frames. For example, you
can use this task type to ask workers to track the movement of vehicles across multiple point cloud
frames.
• 3D point cloud semantic segmentation – Use this task type when you want workers to create a point-
level semantic segmentation mask by painting objects in a 3D point cloud using different colors where
each color is assigned to one of the classes you specify.
• 3D point cloud adjustment task types – Each of the task types above has an associated adjustment task
type that you can use to audit and adjust annotations generated from a 3D point cloud labeling job.
Refer to the task type page of the associated type to learn how to create an adjustment labeling job
for that task.

3D Point Cloud Object Detection


Use this task type when you want workers to classify objects in a 3D point cloud by drawing 3D cuboids
around objects. For example, you can use this task type to ask workers to identify different types of
objects in a point cloud, such as cars, bikes, and pedestrians.


For this task type, the data object that workers label is a single point cloud frame. Ground Truth renders
a 3D point cloud using point cloud data you provide. You can also provide camera data to give workers
more visual information about scenes in the frame, and to help workers draw 3D cuboids around objects.

Ground Truth provides workers with tools to annotate objects with 9 degrees of freedom
(x,y,z,rx,ry,rz,l,w,h) in three dimensions in both 3D scene and projected side views (top, side, and back).
If you provide sensor fusion information (like camera data), when a worker adds a cuboid to identify an
object in the 3D point cloud, the cuboid shows up and can be modified in the 2D images. After a cuboid
has been added, all edits made to that cuboid in the 2D or 3D scene are projected into the other view.
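
To make the 9 degrees of freedom concrete, the following is a minimal sketch of a cuboid as a simple
data structure; the field names mirror the list above and are not a Ground Truth API.

from dataclasses import dataclass

# A sketch of a 3D cuboid with 9 degrees of freedom.
@dataclass
class Cuboid:
    x: float   # center position along each axis
    y: float
    z: float
    rx: float  # rotation (heading) about each axis
    ry: float
    rz: float
    l: float   # length
    w: float   # width
    h: float   # height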

You can create a job to adjust annotations created in a 3D point cloud object detection labeling job using
the 3D point cloud object detection adjustment task type.

If you are a new user of the Ground Truth 3D point cloud labeling modality, we recommend you review
3D Point Cloud Labeling Jobs Overview (p. 630). This labeling modality is different from other Ground
Truth task types, and this page provides an overview of important details you should be aware of when
creating a 3D point cloud labeling job.

Topics
• View the Worker Task Interface (p. 600)
• Create a 3D Point Cloud Object Detection Labeling Job (p. 604)
• Create a 3D Point Cloud Object Detection Adjustment or Verification Labeling Job (p. 605)
• Output Data Format (p. 606)

View the Worker Task Interface

Ground Truth provides workers with a web portal and tools to complete your 3D point cloud object
detection annotation tasks. When you create the labeling job, you provide the Amazon Resource Name
(ARN) for a pre-built Ground Truth worker UI in the HumanTaskUiArn parameter. When you create a
labeling job using this task type in the console, this worker UI is automatically used. You can preview
and interact with the worker UI when you create a labeling job in the console. If you are a new user, it
is recommended that you create a labeling job using the console to ensure your label attributes, point
cloud frames, and if applicable, images, appear as expected.

The following is a GIF of the 3D point cloud object detection worker task interface. If you provide camera
data for sensor fusion in the world coordinate system, images are matched up with scenes in the point
cloud frame. These images appear in the worker portal as shown in the following GIF.


Workers can navigate in the 3D scene using their keyboard and mouse. They can:

• Double click on specific objects in the point cloud to zoom into them.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use both keyboard arrow keys and Q, E, A, and D keys to move Up, Down, Left, Right. Use keyboard
keys W and S to zoom in and out.

Once a worker places a cuboid in the 3D scene, a side-view will appear with the three projected side
views: top, side, and back. These side-views show points in and around the placed cuboid and help
workers refine cuboid boundaries in that area. Workers can zoom in and out of each of those side-views
using their mouse.

The following video demonstrates movements around the 3D point cloud and in the side-view.


Additional view options and features are available in the View menu in the worker UI. See the worker
instruction page for a comprehensive overview of the Worker UI.

Assistive Labeling Tools

Ground Truth helps workers annotate 3D point clouds faster and more accurately using machine learning
and computer vision powered assistive labeling tools for 3D point cloud object detection tasks. The
following assistive labeling tools are available for this task type:

• Snapping – Workers can add a cuboid around an object and use a keyboard shortcut or menu option
to have Ground Truth's autofit tool snap the cuboid tightly around the object.
• Set to ground – After a worker adds a cuboid to the 3D scene, the worker can automatically snap the
cuboid to the ground. For example, the worker can use this feature to snap a cuboid to the road or
sidewalk in the scene.
• Multi-view labeling – After a worker adds a 3D cuboid to the 3D scene, a side panel displays front,
side, and top perspectives to help the worker adjust the cuboid tightly around the object. In all of
these views, the cuboid includes an arrow that indicates the orientation, or heading, of the object.
When the worker adjusts the cuboid, the adjustment will appear in real time on all of the views (that is,
3D, top, side, and front).
• Sensor fusion – If you provide data for sensor fusion, workers can adjust annotations in the 3D scenes
and in 2D images, and the annotations will be projected into the other view in real time. Additionally,
workers will have the option to view the direction the camera is facing and the camera frustum.
• View options – Enables workers to easily hide or view cuboids, label text, a ground mesh, and
additional point attributes like color or intensity. Workers can also choose between perspective and
orthogonal projections.

Create a 3D Point Cloud Object Detection Labeling Job

You can create a 3D point cloud labeling job using the SageMaker console or API operation,
CreateLabelingJob. To create a labeling job for this task type you need the following:

• A single-frame input manifest file. To learn how to create this type of manifest file, see Create a Point
Cloud Frame Input Manifest File (p. 748). If you are a new user of Ground Truth 3D point cloud
labeling modalities, you may also want to review Accepted Raw 3D Data Formats (p. 746).
• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk for 3D
point cloud labeling jobs. To learn how to create workforces and work teams, see Create and Manage
Workforces (p. 863).

Additionally, make sure that you have reviewed and satisfied the permission requirements in Assign IAM
Permissions to Use Ground Truth (p. 817).

Use one of the following sections to learn how to create a labeling job using the console or an API.

Create a Labeling Job (Console)

Follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create
a 3D point cloud object detection labeling job in the SageMaker console. While you are creating your
labeling job, be aware of the following:

• Your input manifest file must be a single-frame manifest file. For more information, see Create a Point
Cloud Frame Input Manifest File (p. 748).
• Optionally, you can provide label category and frame attributes. Workers can assign one or more of
these attributes to annotations to provide more information about that object. For example, you might
want to use the attribute occluded to have workers identify when an object is partially obstructed.


• Automated data labeling and annotation consolidation are not supported for 3D point cloud labeling
tasks.
• 3D point cloud object detection labeling jobs can take multiple hours to complete. You can specify
a longer time limit for these labeling jobs when you select your work team (up to 7 days, or 604800
seconds).

Create a Labeling Job (API)

This section covers details you need to know when you create a labeling job using the SageMaker
API operation CreateLabelingJob. This API defines this operation for all AWS SDKs. To see
a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.

Create a Labeling Job (API) (p. 709), provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:

• You must enter an ARN for HumanTaskUiArn. Use arn:aws:sagemaker:<region>:394669845002:human-task-ui/PointCloudObjectDetection. Replace <region> with the AWS Region you are creating the labeling job in.

There should not be an entry for the UiTemplateS3Uri parameter.


• Your input manifest file must be a single-frame manifest file. For more information, see Create a Point
Cloud Frame Input Manifest File (p. 748).
• You specify your labels, label category and frame attributes, and worker instructions in a label
category configuration file. To learn how to create this file, see Create a Labeling Category
Configuration File with Label Category and Frame Attributes (p. 719).
• You need to provide pre-defined ARNs for the pre-annotation and post-annotation (ACS) Lambda
functions. These ARNs are specific to the AWS Region you use to create your labeling job.
• To find the pre-annotation Lambda ARN, refer to PreHumanTaskLambdaArn. Use the
Region you are creating your labeling job in to find the correct ARN. For example, if
you are creating your labeling job in us-east-1, the ARN will be arn:aws:lambda:us-
east-1:432418664414:function:PRE-3DPointCloudObjectDetection.
• To find the post-annotation Lambda ARN, refer to AnnotationConsolidationLambdaArn.
Use the Region you are creating your labeling job in to find the correct ARN. For example,
if you are creating your labeling job in us-east-1, the ARN will be arn:aws:lambda:us-
east-1:432418664414:function:ACS-3DPointCloudObjectDetection.
• The number of workers specified in NumberOfHumanWorkersPerDataObject must be 1.
• Automated data labeling is not supported for 3D point cloud labeling jobs. You should not specify
values for parameters in LabelingJobAlgorithmsConfig.
• 3D point cloud object detection labeling jobs can take multiple hours to complete. You can specify
a longer time limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or 604,800
seconds).

Create a 3D Point Cloud Object Detection Adjustment or Verification Labeling Job

You can create an adjustment or verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).

When you create an adjustment labeling job, your input data to the labeling job can include labels and
yaw, pitch, and roll measurements from a previous labeling job or external source. In the adjustment job,
pitch and roll are visualized in the worker UI but cannot be modified. Yaw is adjustable.


Ground Truth uses Tait-Bryan angles with the following intrinsic rotations to visualize yaw, pitch and roll
in the worker UI. First, rotation is applied to the vehicle according to the z-axis (yaw). Next, the rotated
vehicle is rotated according to the intrinsic y'-axis (pitch). Finally, the vehicle is rotated according to the
intrinsic x''-axis (roll).
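
As an illustration of this rotation order, the following sketch composes the three intrinsic rotations into a single rotation matrix with NumPy. The angle values are arbitrary assumptions; the matrix form follows the standard z-y'-x'' Tait-Bryan convention described above.

import numpy as np

def rotation_matrix(yaw, pitch, roll):
    # Intrinsic rotations: z-axis (yaw), then y'-axis (pitch),
    # then x''-axis (roll). Angles are in radians.
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    # The intrinsic z-y'-x'' sequence is equivalent to the product Rz @ Ry @ Rx.
    return rz @ ry @ rx

# Arbitrary example angles.
print(rotation_matrix(np.radians(30), np.radians(5), np.radians(2)))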

Output Data Format

When you create a 3D point cloud object detection labeling job, tasks are sent to workers. When these
workers complete their tasks, labels are written to the Amazon S3 bucket you specified when you created
the labeling job. The output data format determines what you see in your Amazon S3 bucket when your
labeling job status (LabelingJobStatus) is Completed.

If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D point cloud object detection output data format, see 3D Point
Cloud Object Detection Output (p. 794).

3D Point Cloud Object Tracking

Use this task type when you want workers to add and fit 3D cuboids around objects to track their
movement across 3D point cloud frames. For example, you can use this task type to ask workers to track
the movement of vehicles across multiple point cloud frames.

For this task type, the data object that workers label is a sequence of point cloud frames. A sequence
is defined as a temporal series of point cloud frames. Ground Truth renders a series of 3D point cloud
visualizations using a sequence you provide and workers can switch between these 3D point cloud
frames in the worker task interface.

Ground Truth provides workers with tools to annotate objects with 9 degrees of freedom
(x, y, z, rx, ry, rz, l, w, h) in both the 3D scene and the projected side views (top, side, and back).
When a worker draws a cuboid around an object, that cuboid is given a unique ID, for example Car:1 for
one car in the sequence and Car:2 for another. Workers use that ID to label the same object in multiple
frames.

You can also provide camera data to give workers more visual information about scenes in the frame,
and to help workers draw 3D cuboids around objects. When a worker adds a 3D cuboid to identify an
object in either the 2D image or the 3D point cloud, the cuboid shows up in the other view.

You can adjust annotations created in a 3D point cloud object tracking labeling job using the 3D point
cloud object tracking adjustment task type.

If you are a new user of the Ground Truth 3D point cloud labeling modality, we recommend you review
3D Point Cloud Labeling Jobs Overview (p. 630). This labeling modality is different from other Ground
Truth task types, and this page provides an overview of important details you should be aware of when
creating a 3D point cloud labeling job.

Topics
• View the Worker Task Interface (p. 606)
• Create a 3D Point Cloud Object Tracking Labeling Job (p. 614)
• Create a 3D Point Cloud Object Tracking Adjustment or Verification Labeling Job (p. 615)
• Output Data Format (p. 615)

View the Worker Task Interface

Ground Truth provides workers with a web portal and tools to complete your 3D point cloud object
tracking annotation tasks. When you create the labeling job, you provide the Amazon Resource Name


(ARN) for a pre-built Ground Truth UI in the HumanTaskUiArn parameter. When you create a labeling
job using this task type in the console, this UI is automatically used. You can preview and interact with
the worker UI when you create a labeling job in the console. If you are a new user, it is recommended that
you create a labeling job using the console to ensure your label attributes, point cloud frames, and if
applicable, images, appear as expected.

The following is a GIF of the 3D point cloud object tracking worker task interface and demonstrates how
the worker can navigate the point cloud frames in the sequence. The annotating tools are a part of the
worker task interface. They are not available for the preview interface.


Once workers add a single cuboid, that cuboid is replicated in all frames of the sequence with the same
ID. Once workers adjust the cuboid in another frame, Ground Truth will interpolate the movement of that
object and adjust all cuboids between the manually adjusted frames. The following GIF demonstrates
this interpolation feature. In the navigation bar on the bottom left, red areas indicate manually adjusted
frames.


If you provide camera data for sensor fusion, images are matched up with scenes in point cloud frames.
These images appear in the worker portal as shown in the following GIF.

Workers can navigate in the 3D scene using their keyboard and mouse. They can:

• Double-click on specific objects in the point cloud to zoom in to them.
• Use a mouse scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the W and S keys to zoom in and out.

Once a worker places a cuboid in the 3D scene, a side-view will appear with the three projected side
views: top, side, and back. These side-views show points in and around the placed cuboid and help
workers refine cuboid boundaries in that area. Workers can zoom in and out of each of those side-views
using their mouse.

The following video demonstrates movements around the 3D point cloud and in the side-view.


Additional view options and features are available. See the worker instruction page for a comprehensive
overview of the Worker UI.

Worker Tools

Workers can navigate through the 3D point cloud by zooming in and out, and moving in all directions
around the cloud using the mouse and keyboard shortcuts. If workers click on a point in the point cloud,
the UI will automatically zoom into that area. Workers can use various tools to draw 3D cuboids around
objects. For more information, see Assistive Labeling Tools.

After workers have placed a 3D cuboid in the point cloud, they can adjust these cuboids to fit tightly
around cars using a variety of views: directly in the 3D cuboid, in a side-view featuring three zoomed-in
perspectives of the point cloud around the box, and if you include images for sensor fusion, directly in
the 2D image.

View options enable workers to easily hide or view label text, a ground mesh, and additional point
attributes. Workers can also choose between perspective and orthogonal projections.

Assistive Labeling Tools

Ground Truth helps workers annotate 3D point clouds faster and more accurately using UX, machine
learning and computer vision powered assistive labeling tools for 3D point cloud object tracking tasks.
The following assistive labeling tools are available for this task type:

• Label autofill – When a worker adds a cuboid to a frame, a cuboid with the same dimensions and
orientation is automatically added to all frames in the sequence.
• Label interpolation – After a worker has labeled a single object in two frames, Ground Truth uses
those annotations to interpolate the movement of that object between those two frames. Label
interpolation can be turned on and off.
• Bulk label and attribute management – Workers can add, delete, and rename annotations, label
category attributes, and frame attributes in bulk.
• Workers can manually delete annotations for a given object before or after a frame. For example,
a worker can delete all labels for an object after frame 10 if that object is no longer located in the
scene after that frame.
• If a worker accidentally bulk deletes all annotations for an object, they can add them back. For
example, if a worker deletes all annotations for an object before frame 100, they can bulk add them
to those frames.
• Workers can rename a label in one frame and all 3D cuboids assigned that label are updated with
the new name across all frames.
• Workers can use bulk editing to add or edit label category attributes and frame attributes in
multiple frames.
• Snapping – Workers can add a cuboid around an object and use a keyboard shortcut or menu option
to have Ground Truth's autofit tool snap the cuboid tightly around the object's boundaries.
• Fit to ground – After a worker adds a cuboid to the 3D scene, the worker can automatically snap the
cuboid to the ground. For example, the worker can use this feature to snap a cuboid to the road or
sidewalk in the scene.
• Multi-view labeling – After a worker adds a 3D cuboid to the 3D scene, a side panel displays front
and two side perspectives to help the worker adjust the cuboid tightly around the object. Workers can
annotate in the 3D point cloud or the side panel, and the adjustments appear in the other views in real
time.
• Sensor fusion – If you provide data for sensor fusion, workers can adjust annotations in the 3D scenes
and in 2D images, and the annotations will be projected into the other view in real time.
• Auto-merge cuboids – Workers can automatically merge two cuboids across all frames if they
determine that cuboids with different labels actually represent a single object.


• View options – Enables workers to easily hide or view label text, a ground mesh, and additional
point attributes like color or intensity. Workers can also choose between perspective and orthogonal
projections.

Create a 3D Point Cloud Object Tracking Labeling Job

You can create a 3D point cloud labeling job using the SageMaker console or API operation,
CreateLabelingJob. To create a labeling job for this task type you need the following:

• A sequence input manifest file. To learn how to create this type of manifest file, see Create a Point
Cloud Sequence Input Manifest (p. 754). If you are a new user of Ground Truth 3D point cloud
labeling modalities, we recommend that you review Accepted Raw 3D Data Formats (p. 746).
• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk for 3D
point cloud labeling jobs. To learn how to create workforces and work teams, see Create and Manage
Workforces (p. 863).

Additionally, make sure that you have reviewed and satisfied the permission requirements in Assign IAM
Permissions to Use Ground Truth (p. 817).

To learn how to create a labeling job using the console or an API, see the following sections.

Create a Labeling Job (API)

This section covers details you need to know when you create a labeling job using the SageMaker
API operation CreateLabelingJob. This API defines this operation for all AWS SDKs. To see
a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.

Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:

• You must enter an ARN for HumanTaskUiArn. Use arn:aws:sagemaker:<region>:394669845002:human-task-ui/PointCloudObjectTracking. Replace <region> with the AWS Region you are creating the labeling job in.

There should not be an entry for the UiTemplateS3Uri parameter.


• Your LabelAttributeName must end in -ref. For example, ot-labels-ref.
• Your input manifest file must be a point cloud frame sequence manifest file. For more information, see
Create a Point Cloud Sequence Input Manifest (p. 754).
• You specify your labels, label category and frame attributes, and worker instructions in a label
category configuration file. To learn how to create this file, see Create a Labeling Category
Configuration File with Label Category and Frame Attributes (p. 719).
• You need to provide pre-defined ARNs for the pre-annotation and post-annotation (ACS) Lambda
functions. These ARNs are specific to the AWS Region you use to create your labeling job.
• To find the pre-annotation Lambda ARN, refer to PreHumanTaskLambdaArn. Use the
Region you are creating your labeling job in to find the correct ARN that ends with
PRE-3DPointCloudObjectTracking.
• To find the post-annotation Lambda ARN, refer to AnnotationConsolidationLambdaArn.
Use the Region you are creating your labeling job in to find the correct ARN that ends with
ACS-3DPointCloudObjectTracking.
• The number of workers specified in NumberOfHumanWorkersPerDataObject should be 1.
• Automated data labeling is not supported for 3D point cloud labeling jobs. You should not specify
values for parameters in LabelingJobAlgorithmsConfig.


• 3D point cloud object tracking labeling jobs can take multiple hours to complete. You can specify a
longer time limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or 604,800
seconds).

Create a Labeling Job (Console)


Follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create
a 3D point cloud object tracking labeling job in the SageMaker console. While you are creating your
labeling job, be aware of the following:

• Your input manifest file must be a sequence manifest file. For more information, see Create a Point
Cloud Sequence Input Manifest (p. 754).
• Optionally, you can provide label category attributes. Workers can assign one or more of these
attributes to annotations to provide more information about that object. For example, you might want
to use the attribute occluded to have workers identify when an object is partially obstructed.
• Automated data labeling and annotation consolidation are not supported for 3D point cloud labeling
tasks.
• 3D point cloud object tracking labeling jobs can take multiple hours to complete. You can specify a
longer time limit for these labeling jobs when you select your work team (up to 7 days, or 604800
seconds).

Create a 3D Point Cloud Object Tracking Adjustment or Verification Labeling Job


You can create an adjustment or verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).

When you create an adjustment labeling job, your input data to the labeling job can include labels and
yaw, pitch, and roll measurements from a previous labeling job or external source. In the adjustment job,
pitch and roll are visualized in the worker UI but cannot be modified. Yaw is adjustable.

Ground Truth uses Tait-Bryan angles with the following intrinsic rotations to visualize yaw, pitch and roll
in the worker UI. First, rotation is applied to the vehicle according to the z-axis (yaw). Next, the rotated
vehicle is rotated according to the intrinsic y'-axis (pitch). Finally, the vehicle is rotated according to the
intrinsic x''-axis (roll).

Output Data Format


When you create a 3D point cloud object tracking labeling job, tasks are sent to workers. When these
workers complete their tasks, their annotations are written to the Amazon S3 bucket you specified when
you created the labeling job. The output data format determines what you see in your Amazon S3 bucket
when your labeling job status (LabelingJobStatus) is Completed.

If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D point cloud object tracking output data format, see 3D Point
Cloud Object Tracking Output (p. 796).

3D Point Cloud Semantic Segmentation


Semantic segmentation involves classifying individual points of a 3D point cloud into pre-specified
categories. Use this task type when you want workers to create a point-level semantic segmentation
mask for 3D point clouds. For example, if you specify the classes car, pedestrian, and bike, workers
select one class at a time, and color all of the points that the class applies to with the same color in the
point cloud.

For this task type, the data object that workers label is a single point cloud frame. Ground Truth
generates a 3D point cloud visualization using point cloud data you provide. You can also provide camera


data to give workers more visual information about scenes in the frame, and to help workers paint
objects. When a worker paints an object in either the 2D image or the 3D point cloud, the paint shows up
in the other view.

You can adjust annotations created in a 3D point cloud semantic segmentation labeling job using the 3D
point cloud semantic segmentation adjustment task type.

If you are a new user of the Ground Truth 3D point cloud labeling modality, we recommend you review
3D Point Cloud Labeling Jobs Overview (p. 630). This labeling modality is different from other Ground
Truth task types, and this topic provides an overview of important details you should be aware of when
creating a 3D point cloud labeling job.

Topics
• View the Worker Task Interface (p. 616)
• Create a 3D Point Cloud Semantic Segmentation Labeling Job (p. 622)
• Create a 3D Point Cloud Semantic Segmentation Adjustment or Verification Labeling Job (p. 623)
• Output Data Format (p. 623)

View the Worker Task Interface

Ground Truth provides workers with a web portal and tools to complete your 3D point cloud semantic
segmentation annotation tasks. When you create the labeling job, you provide the Amazon Resource
Name (ARN) for a pre-built Ground Truth UI in the HumanTaskUiArn parameter. When you create
a labeling job using this task type in the console, this UI is automatically used. You can preview and
interact with the worker UI when you create a labeling job in the console. If you are a new user, it is
recommended that you create a labeling job using the console to ensure your label attributes, point
cloud frames, and if applicable, images, appear as expected.

The following is a GIF of the 3D point cloud semantic segmentation worker task interface. If you provide
camera data for sensor fusion, images are matched with scenes in the point cloud frame. Workers can
paint objects in either the 3D point cloud or the 2D image, and the paint appears in the corresponding
location in the other medium. These images appear in the worker portal as shown in the following GIF.


Workers can navigate in the 3D scene using their keyboard and mouse. They can:

• Double-click on specific objects in the point cloud to zoom in to them.
• Use a mouse scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the W and S keys to zoom in and out.

The following video demonstrates movements around the 3D point cloud. Workers can hide and re-
expand all side views and menus. In this GIF, the side-views and menus have been collapsed.


The following GIF demonstrates how a worker can label multiple objects quickly, refine painted objects
using the Unpaint option and then view only points that have been painted.


Additional view options and features are available. See the worker instruction page for a comprehensive
overview of the Worker UI.

Worker Tools

Workers can navigate through the 3D point cloud by zooming in and out, and moving in all directions
around the cloud using the mouse and keyboard shortcuts. When you create a semantic segmentation
job, workers have the following tools available to them:

• A paint brush to paint and unpaint objects. Workers paint objects by selecting a label category and
then painting in the 3D point cloud. Workers unpaint objects by selecting the Unpaint option from the
label category menu and using the paint brush to erase paint.
• A polygon tool that workers can use to select and paint an area in the point cloud.
• A background paint tool, which enables workers to paint behind objects they have already annotated
without altering the original annotations. For example, workers might use this tool to paint the road
after painting all of the cars on the road.
• View options that enable workers to easily hide or view label text, a ground mesh, and additional
point attributes like color or intensity. Workers can also choose between perspective and orthogonal
projections.

Create a 3D Point Cloud Semantic Segmentation Labeling Job

You can create a 3D point cloud labeling job using the SageMaker console or API operation,
CreateLabelingJob. To create a labeling job for this task type you need the following:

• A single-frame input manifest file. To learn how to create this type of manifest file, see Create a Point
Cloud Frame Input Manifest File (p. 748). If you are a new user of Ground Truth 3D point cloud
labeling modalities, we recommend that you review Accepted Raw 3D Data Formats (p. 746).
• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk workers
for 3D point cloud labeling jobs. To learn how to create workforces and work teams, see Create and
Manage Workforces (p. 863).
• A label category configuration file. For more information, see Create a Labeling Category
Configuration File with Label Category and Frame Attributes (p. 719).

Additionally, make sure that you have reviewed and satisfied the permission requirements in Assign IAM
Permissions to Use Ground Truth (p. 817).

Use one of the following sections to learn how to create a labeling job using the console or an API.

Create a Labeling Job (Console)

Follow the instructions in Create a Labeling Job (Console) (p. 706) to learn how to create
a 3D point cloud semantic segmentation labeling job in the SageMaker console. While you are creating
your labeling job, be aware of the following:

• Your input manifest file must be a single-frame manifest file. For more information, see Create a Point
Cloud Frame Input Manifest File (p. 748).
• Automated data labeling and annotation consolidation are not supported for 3D point cloud labeling
tasks.
• 3D point cloud semantic segmentation labeling jobs can take multiple hours to complete. You can
specify a longer time limit for these labeling jobs when you select your work team (up to 7 days, or
604800 seconds).


Create a Labeling Job (API)

This section covers details you need to know when you create a labeling job using the SageMaker
API operation CreateLabelingJob. This API defines this operation for all AWS SDKs. To see
a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.

The page, Create a Labeling Job (API) (p. 709), provides an overview of the CreateLabelingJob
operation. Follow these instructions and do the following while you configure your request:

• You must enter an ARN for HumanTaskUiArn. Use arn:aws:sagemaker:<region>:394669845002:human-task-ui/PointCloudSemanticSegmentation. Replace <region> with the AWS Region you are creating the labeling job in.

There should not be an entry for the UiTemplateS3Uri parameter.


• Your LabelAttributeName must end in -ref. For example, ss-labels-ref.
• Your input manifest file must be a single-frame manifest file. For more information, see Create a Point
Cloud Frame Input Manifest File (p. 748).
• You specify your labels and worker instructions in a label category configuration file. See Create a
Labeling Category Configuration File with Label Category and Frame Attributes (p. 719) to learn how
to create this file.
• You need to provide pre-defined ARNs for the pre-annotation and post-annotation (ACS) Lambda
functions. These ARNs are specific to the AWS Region you use to create your labeling job.
• To find the pre-annotation Lambda ARN, refer to PreHumanTaskLambdaArn. Use the
Region you are creating your labeling job in to find the correct ARN. For example, if
you are creating your labeling job in us-east-1, the ARN will be arn:aws:lambda:us-
east-1:432418664414:function:PRE-3DPointCloudSemanticSegmentation.
• To find the post-annotation Lambda ARN, refer to AnnotationConsolidationLambdaArn.
Use the Region you are creating your labeling job in to find the correct ARN. For example,
if you are creating your labeling job in us-east-1, the ARN will be arn:aws:lambda:us-
east-1:432418664414:function:ACS-3DPointCloudSemanticSegmentation.
• The number of workers specified in NumberOfHumanWorkersPerDataObject should be 1.
• Automated data labeling is not supported for 3D point cloud labeling jobs. You should not specify
values for parameters in LabelingJobAlgorithmsConfig.
• 3D point cloud semantic segmentation labeling jobs can take multiple hours to complete. You can
specify a longer time limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or
604800 seconds).
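
As an example of the label category configuration file workflow, the following sketch writes a minimal, hypothetical configuration for a semantic segmentation job to S3 so it can be referenced from your CreateLabelingJob request through the LabelCategoryConfigS3Uri parameter. The bucket, key, labels, and instructions are assumptions; see the label category configuration documentation referenced above for the full schema.

import json

import boto3

# Minimal, hypothetical label category configuration. The
# "document-version" value mirrors the example shown later in this chapter.
config = {
    "document-version": "2020-03-01",
    "labels": [{"label": "Car"}, {"label": "Pedestrian"}, {"label": "Bike"}],
    "instructions": {
        "shortIntroduction": "Paint every point that belongs to the selected category.",
        "fullIntroduction": "<p>Add more detailed worker instructions here.</p>",
    },
}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-bucket",  # placeholder bucket
    Key="config/semseg-label-categories.json",
    Body=json.dumps(config).encode("utf-8"),
)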

Create a 3D Point Cloud Semantic Segmentation Adjustment or Verification Labeling Job

You can create an adjustment or verification labeling job using the Ground Truth console or
CreateLabelingJob API. To learn more about adjustment and verification labeling jobs, and to learn
how to create one, see Verify and Adjust Labels (p. 664).

Output Data Format

When you create a 3D point cloud semantic segmentation labeling job, tasks are sent to workers. When
these workers complete their tasks, their annotations are written to the Amazon S3 bucket you specified
when you created the labeling job. The output data format determines what you see in your Amazon S3
bucket when your labeling job status (LabelingJobStatus) is Completed.

If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D point cloud semantic segmentation output data format, see 3D
Point Cloud Semantic Segmentation Output (p. 792).


3D-2D Point Cloud Object Tracking


Use this task type when you want workers to link 3D point cloud annotations with 2D image
annotations, and also to link 2D image annotations among various cameras. Currently, Ground Truth
supports cuboids for annotation in a 3D point cloud and bounding boxes for annotation in 2D videos. For
example, you can use this task type to ask workers to link the movement of a vehicle in a 3D point cloud
with its 2D video. Using 3D-2D linking, you can easily correlate point cloud data (like the distance of a
cuboid) to video data (a bounding box) for up to 8 cameras.

Ground Truth provides workers with tools to annotate cuboids in a 3D point cloud and bounding boxes
in up to 8 cameras using the same annotation UI. Workers can also link various bounding boxes for
the same object across different cameras. For example, a bounding box in camera1 can be linked to a
bounding box in camera2. This lets you correlate an object across multiple cameras using a unique ID.
Note
Currently, SageMaker does not support creating a 3D-2D linking job using the console. To create
a 3D-2D linking job using the SageMaker API, see Create a Labeling Job (API) (p. 629).

Topics
• View the Worker Task Interface (p. 624)
• Input Data Format (p. 628)
• Create a 3D-2D Point Cloud Object Tracking Labeling Job (p. 629)
• Output Data (p. 630)

View the Worker Task Interface

Ground Truth provides workers with a web portal and tools to complete your 3D-2D object tracking
annotation tasks. When you create the labeling job, you provide the Amazon Resource Name (ARN) for a
pre-built Ground Truth UI in the HumanTaskUiArn parameter. To use the UI when you create a labeling
job for this task type using the API, you need to provide the HumanTaskUiArn. You can preview and
interact with the worker UI when you create a labeling job through the API. The annotating tools are a
part of the worker task interface. They are not available for the preview interface. The following image
demonstrates the worker task interface used for the 3D-2D point cloud object tracking annotation task.


Interpolation is enabled by default. After a worker adds a single cuboid, that cuboid is replicated in
all frames of the sequence with the same ID. If the worker adjusts the cuboid in another frame, Ground
Truth interpolates the movement of that object and adjusts all cuboids between the manually adjusted
frames. Additionally, using the camera view section, a cuboid can be shown with a projection (using the B
button to toggle labels in the camera view) that provides the worker with a reference from the camera
images. The accuracy of the cuboid-to-image projection depends on the accuracy of the calibrations
captured in the extrinsic and intrinsic data.

If you provide camera data for sensor fusion, images are matched up with scenes in point cloud frames.
Note that the camera data should be time synchronized with the point cloud data to ensure an accurate
mapping between the point cloud and the imagery over each frame in the sequence, as shown in the
following image.

The manifest file holds the extrinsic and intrinsic data and the pose to allow the cuboid projection on the
camera image to be shown by using the P button.

Workers can navigate in the 3D scene using their keyboard and mouse. They can:

• Double-click on specific objects in the point cloud to zoom in to them.
• Use a mouse scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the W and S keys to zoom in and out.

Once a worker places a cuboid in the 3D scene, a side-view appears with the three projected side views:
top, side, and front. These side-views show points in and around the placed cuboid and help workers
refine cuboid boundaries in that area. Workers can zoom in and out of each of those side-views using
their mouse.

The worker should first select the cuboid to draw a corresponding bounding box on any of the camera
views. This links the cuboid and the bounding box with a common name and unique ID.

The worker can also first draw a bounding box, select it and draw the corresponding cuboid to link them.

Additional view options and features are available. See the worker instruction page for a comprehensive
overview of the Worker UI.

Worker Tools

Workers can navigate through the 3D point cloud by zooming in and out, and moving in all directions
around the cloud using the mouse and keyboard shortcuts. If workers click on a point in the point cloud,
the UI automatically zooms into that area. Workers can use various tools to draw 3D cuboids around
objects. For more information, see Assistive Labeling Tools in the following discussion.


After workers have placed a 3D cuboid in the point cloud, they can adjust these cuboids to fit tightly
around cars using a variety of views: directly in the 3D point cloud, in a side-view featuring three
zoomed-in perspectives of the point cloud around the box, and if you include images for sensor fusion,
directly in the 2D image.

Additional view options enable workers to easily hide or view label text, a ground mesh, and additional
point attributes. Workers can also choose between perspective and orthogonal projections.

Assistive Labeling Tools

Ground Truth helps workers annotate 3D point clouds faster and more accurately using UX, machine
learning and computer vision powered assistive labeling tools for 3D point cloud object tracking tasks.
The following assistive labeling tools are available for this task type:

• Label autofill – When a worker adds a cuboid to a frame, a cuboid with the same dimensions,
orientation and xyz position is automatically added to all frames in the sequence.
• Label interpolation – After a worker has labeled a single object in two frames, Ground Truth
uses those annotations to interpolate the movement of that object between all the frames. Label
interpolation can be turned on and off. It is on by default. For example, if a worker working with 5
frames adds a cuboid in frame 2, it is copied to all 5 frames. If the worker then makes adjustments
in frame 4, frames 2 and 4 now act as two points through which a line is fit. The cuboid is then
interpolated in frames 1, 3, and 5 (see the sketch following this list).
• Bulk label and attribute management – Workers can add, delete, and rename annotations, label
category attributes, and frame attributes in bulk.
• Workers can manually delete annotations for a given object before and after a frame, or in all
frames. For example, a worker can delete all labels for an object after frame 10 if that object is no
longer located in the scene after that frame.
• If a worker accidentally bulk deletes all annotations for an object, they can add them back. For
example, if a worker deletes all annotations for an object before frame 100, they can bulk add them
to those frames.
• Workers can rename a label in one frame and all 3D cuboids assigned that label are updated with
the new name across all frames.
• Workers can use bulk editing to add or edit label category attributes and frame attributes in
multiple frames.
• Snapping – Workers can add a cuboid around an object and use a keyboard shortcut or menu option
to have Ground Truth's autofit tool snap the cuboid tightly around the object's boundaries.
• Fit to ground – After a worker adds a cuboid to the 3D scene, the worker can automatically snap the
cuboid to the ground. For example, the worker can use this feature to snap a cuboid to the road or
sidewalk in the scene.
• Multi-view labeling – After a worker adds a 3D cuboid to the 3D scene, a side panel displays front
and two side perspectives to help the worker adjust the cuboid tightly around the object. Workers can
annotate in the 3D point cloud or the side panel, and the adjustments appear in the other views in real
time.
• Sensor fusion – If you provide data for sensor fusion, workers can adjust annotations in the 3D scenes
and in 2D images, and the annotations are projected into the other view in real time. To learn more
about the data for sensor fusion, see Understand Coordinate Systems and Sensor Fusion.
• Auto-merge cuboids – Workers can automatically merge two cuboids across all frames if they
determine that cuboids with different labels actually represent a single object.
• View options – Enables workers to easily hide or view label text, a ground mesh, and additional
point attributes like color or intensity. Workers can also choose between perspective and orthogonal
projections.
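
The following toy sketch illustrates the linear interpolation idea from the label interpolation bullet above: the cuboid center is keyframed in frames 2 and 4, and the remaining frames are placed on the line through those keyframes. It is a conceptual illustration under assumed coordinates, not Ground Truth's implementation.

import numpy as np

# Worker-adjusted cuboid centers (hypothetical coordinates).
keyframes = {
    2: np.array([10.0, 4.0, 0.5]),  # adjusted in frame 2
    4: np.array([14.0, 4.4, 0.5]),  # adjusted in frame 4
}

f0, f1 = sorted(keyframes)
step = (keyframes[f1] - keyframes[f0]) / (f1 - f0)

# Fit a line through the two keyframes and place frames 1-5 on it.
for frame in range(1, 6):
    center = keyframes[f0] + (frame - f0) * step
    print(f"frame {frame}: center {center}")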


Input Data Format

You can create a 3D-2D object tracking job using the SageMaker API operation, CreateLabelingJob.
To create a labeling job for this task type you need the following:

• A sequence input manifest file. To learn how to create this type of manifest file, see Create a Point
Cloud Sequence Input Manifest (p. 754). If you are a new user of Ground Truth 3D point cloud
labeling modalities, we recommend that you review Accepted Raw 3D Data Formats (p. 746).
• You specify your labels, label category and frame attributes, and worker instructions in a label
category configuration file. To learn how to create this file, see Create a Labeling Category Configuration
File with Label Category and Frame Attributes. The following is an example of a label category
configuration file for creating a 3D-2D object tracking job.

{
    "document-version": "2020-03-01",
    "categoryGlobalAttributes": [
        {
            "name": "Occlusion",
            "description": "global attribute that applies to all label categories",
            "type": "string",
            "enum": [
                "Partial",
                "Full"
            ]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "attributes": [
                {
                    "name": "Type",
                    "type": "string",
                    "enum": [
                        "SUV",
                        "Sedan"
                    ]
                }
            ]
        },
        {
            "label": "Bus",
            "attributes": [
                {
                    "name": "Size",
                    "type": "string",
                    "enum": [
                        "Large",
                        "Medium",
                        "Small"
                    ]
                }
            ]
        }
    ],
    "instructions": {
        "shortIntroduction": "Draw a tight cuboid around objects after you select a category.",
        "fullIntroduction": "<p>Use this area to add more detailed worker instructions.</p>"
    },
    "annotationType": [
        {
            "type": "BoundingBox"
        },
        {
            "type": "Cuboid"
        }
    ]
}

Note
You need to provide BoundingBox and Cuboid as annotationType in the label category
configuration file to create a 3D-2D object tracking job.

Create a 3D-2D Point Cloud Object Tracking Labeling Job


You can create a 3D-2D point cloud labeling job using the SageMaker API operation,
CreateLabelingJob. To create a labeling job for this task type you need the following:

• A work team from a private or vendor workforce. You cannot use Amazon Mechanical Turk for 3D
point cloud labeling jobs. To learn how to create workforces and work teams, see Create and Manage
Workforces (p. 863).
• A CORS policy added to the S3 bucket that contains your input data. To set the required CORS headers
on the S3 bucket that contains your input images in the Amazon S3 console, follow the directions
detailed in CORS Permission Requirement.
• Additionally, make sure that you have reviewed and satisfied the permission requirements in Assign
IAM Permissions to Use Ground Truth (p. 817).

To learn how to create a labeling job using the API, see the following sections.

Create a Labeling Job (API)


This section covers details you need to know when you create a 3D-2D object tracking labeling job using
the SageMaker API operation CreateLabelingJob. This API defines this operation for all AWS SDKs.
To see a list of language-specific SDKs supported for this operation, review the See Also section of
CreateLabelingJob.

Create a Labeling Job (API) (p. 709) provides an overview of the CreateLabelingJob operation.
Follow these instructions and do the following while you configure your request:

• You must enter an ARN for HumanTaskUiArn. Use arn:aws:sagemaker:<region>:394669845002:human-task-ui/PointCloudObjectTracking. Replace <region> with the AWS Region you are creating the labeling job in.

There should not be an entry for the UiTemplateS3Uri parameter.


• Your LabelAttributeName must end in -ref. For example, ot-labels-ref.
• Your input manifest file must be a point cloud frame sequence manifest file. For more information,
see Create a Point Cloud Sequence Input Manifest (p. 754). You also need to provide a label category
configuration file as mentioned above.
• You need to provide pre-defined ARNs for the pre-annotation and post-annotation (ACS) Lambda
functions. These ARNs are specific to the AWS Region you use to create your labeling job.
• To find the pre-annotation Lambda ARN, refer to PreHumanTaskLambdaArn. Use the
Region you are creating your labeling job in to find the correct ARN that ends with
PRE-3DPointCloudObjectTracking.
• To find the post-annotation Lambda ARN, refer to AnnotationConsolidationLambdaArn.
Use the Region you are creating your labeling job in to find the correct ARN that ends with
ACS-3DPointCloudObjectTracking.


• The number of workers specified in NumberOfHumanWorkersPerDataObject should be 1.


• Automated data labeling is not supported for 3D point cloud labeling jobs. You should not specify
values for parameters in LabelingJobAlgorithmsConfig.
• 3D-2D object tracking labeling jobs can take multiple hours to complete. You can specify a longer time
limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or 604,800 seconds).

Note
After you have successfully created a 3D-2D object tracking job, it shows up on the console
under labeling jobs. The task type for the job is displayed as Point Cloud Object Tracking.

Output Data

When you create a 3D-2D object tracking labeling job, tasks are sent to workers. When these workers
complete their tasks, their annotations are written to the Amazon S3 bucket you specified when you
created the labeling job. The output data format determines what you see in your Amazon S3 bucket
when your labeling job status (LabelingJobStatus) is Completed.

If you are a new user of Ground Truth, see Output Data (p. 776) to learn more about the Ground Truth
output data format. To learn about the 3D-2D point cloud object tracking output data format, see 3D-2D
Point Cloud Object Tracking Output (p. 799).

3D Point Cloud Labeling Jobs Overview


This topic provides an overview of the unique features of a Ground Truth 3D point cloud labeling job.
You can use 3D point cloud labeling jobs to have workers label objects in a 3D point cloud generated
from 3D sensors like LiDAR and depth cameras, or generated from 3D reconstruction by stitching
images captured by an agent like a drone.

Job Pre-processing Time


When you create a 3D point cloud labeling job, you need to provide an input manifest file (p. 746). The
input manifest file can be:

• A frame input manifest file that has a single point cloud frame on each line.
• A sequence input manifest file that has a single sequence on each line. A sequence is defined as a
temporal series of point cloud frames.

For both types of manifest files, job pre-processing time (that is, the time before Ground Truth starts
sending tasks to your workers) depends on the total number and size of point cloud frames you provide
in your input manifest file. For frame input manifest files, this is the number of lines in your manifest
file. For sequence manifest files, this is the number of frames in each sequence multiplied by the total
number of sequences, or lines, in your manifest file.

Additionally, the number of points per point cloud and the number of fused sensor data objects (like
images) factor into job pre-processing times. On average, Ground Truth can pre-process 200 point cloud
frames in approximately 5 minutes. If you create a 3D point cloud labeling job with a large number of
point cloud frames, you might experience longer job pre-processing times. For example, if you create a
sequence input manifest file with 4 point cloud sequences, and each sequence contains 200 point clouds,
Ground Truth pre-processes 800 point clouds and so your job pre-processing time might be around 20
minutes. During this time, your labeling job status is InProgress.

While your 3D point cloud labeling job is pre-processing, you receive CloudWatch
messages notifying you of the status of your job. To identify these messages, search for
3D_POINT_CLOUD_PROCESSING_STATUS in your labeling job logs.

For frame input manifest files, your CloudWatch logs will have a message similar to the following:


{
    "labeling-job-name": "example-point-cloud-labeling-job",
    "event-name": "3D_POINT_CLOUD_PROCESSING_STATUS",
    "event-log-message": "datasetObjectId from: 0 to 10, status: IN_PROGRESS"
}

The event log message, datasetObjectId from: 0 to 10, status: IN_PROGRESS identifies the
number of frames from your input manifest that have been processed. You receive a new message every
time a frame has been processed. For example, after a single frame has processed, you receive another
message that says datasetObjectId from: 1 to 10, status: IN_PROGRESS.

For sequence input manifest files, your CloudWatch logs will have a message similar to the following:

{
    "labeling-job-name": "example-point-cloud-labeling-job",
    "event-name": "3D_POINT_CLOUD_PROCESSING_STATUS",
    "event-log-message": "datasetObjectId: 0, status: IN_PROGRESS"
}

The event log message, datasetObjectId: 0, status: IN_PROGRESS, identifies the number
of sequences from your input manifest that have been processed. You receive a new message every
time a sequence has been processed. For example, after a single sequence has been processed, you
receive a message that says datasetObjectId: 1, status: IN_PROGRESS as the next sequence
begins processing.
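
If you prefer to scan for these messages programmatically, a sketch like the following can filter the labeling job's CloudWatch logs with Boto3. The log group and job names are assumptions; adjust them to match your account and job.

import boto3

logs = boto3.client("logs", region_name="us-east-1")

response = logs.filter_log_events(
    # Assumed log group for Ground Truth labeling jobs.
    logGroupName="/aws/sagemaker/LabelingJobs",
    logStreamNamePrefix="example-point-cloud-labeling-job",  # placeholder
    filterPattern="3D_POINT_CLOUD_PROCESSING_STATUS",
)
for event in response["events"]:
    print(event["message"])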

Job Completion Times


3D point cloud labeling jobs can take workers hours to complete. You can set the total amount of time
that workers can work on each task when you create a labeling job. The maximum time you can set for
workers to work on tasks is 7 days. The default value is 3 days.

It is strongly recommended that you create tasks that workers can complete within 12 hours. Workers
must keep the worker UI open while working on a task. They can save work as they go and Ground Truth
will save their work every 15 minutes.

When using the SageMaker CreateLabelingJob API operation, set the total time a task is available to
workers in the TaskTimeLimitInSeconds parameter of HumanTaskConfig.

When you create a labeling job in the console, you can specify this time limit when you select your
workforce type and your work team.

Workforces
When you create a 3D point cloud labeling job, you need to specify a work team that will complete
your point cloud annotation tasks. You can choose a work team from a private workforce of your own
workers, or from a vendor workforce that you select in the AWS Marketplace. You cannot use the Amazon
Mechanical Turk workforce for 3D point cloud labeling jobs.

To learn more about vendor workforce, see Managing Vendor Workforces (p. 867).

To learn how to create and manage a private workforce, see Use a Private Workforce (p. 868).

Worker User Interface (UI)


Ground Truth provides a worker user interface (UI), tools, and assistive labeling features to help workers
complete your 3D point cloud labeling tasks.

You can preview the worker UI when you create a labeling job in the console.

When you create a labeling job using the API operation CreateLabelingJob, you must provide an ARN
provided by Ground Truth in the parameter HumanTaskUiArn to specify the worker UI for your task


type. You can use HumanTaskUiArn with the SageMaker RenderUiTemplate API operation to preview
the worker UI.
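
For example, a sketch like the following previews the object detection worker UI with RenderUiTemplate and saves the rendered HTML locally. The role ARN is a placeholder, and the task input shown is a simplified assumption; it must be a JSON object shaped like a line of your input manifest.

import json

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

response = sagemaker.render_ui_template(
    HumanTaskUiArn=(
        "arn:aws:sagemaker:us-east-1:394669845002:"
        "human-task-ui/PointCloudObjectDetection"
    ),
    # Simplified, hypothetical task input.
    Task={"Input": json.dumps({"source-ref": "s3://example-bucket/frames/frame0.json"})},
    RoleArn="arn:aws:iam::111122223333:role/ExampleGroundTruthRole",
)

with open("preview.html", "w") as f:
    f.write(response["RenderedContent"])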

You provide worker instructions, labels, and optionally, label category attributes that are displayed in the
worker UI.

Label Category Attributes


When you create a 3D point cloud object tracking or object detection labeling job, you can add one or
more label category attributes. You can add frame attributes to all 3D point cloud task types:

• Label category attribute – A list of options (strings), a free form text box, or a numeric field associated
with one or more labels. It is used by workers to provide metadata about a label.
• Frame attribute – A list of options (strings), a free form text box, or a numeric field that appears on
each point cloud frame a worker is sent to annotate. It is used by workers to provide metadata about
frames.

Additionally, you can use label and frame attributes to have workers verify labels in a 3D point cloud
label verification job.

Use the following sections to learn more about these attributes. To learn how to add label category and
frame attributes to a labeling job, use the Create Labeling Job section on the task type page of your
choice.

Label Category Attributes


Add label category attributes to labels to give workers the ability to provide more information about the
annotations they create. A label category attribute is added to an individual label, or to all labels. When a
label category attribute is applied to all labels it is referred to as a global label category attribute.

For example, if you add the label category car, you might also want to capture additional data about
your labeled cars, such as if they are occluded or the size of the car. You can capture this metadata using
label category attributes. In this example, if you added the attribute occluded to the car label category,
you can assign the values partial, completely, and no to the occluded attribute and enable workers to
select one of these options.

When you create a label verification job, you add label category attributes to each label you want
workers to verify.

Frame Attributes
Add frame attributes to give workers the ability to provide more information about individual point
cloud frames. You can specify up to 10 frame attributes, and these attributes will appear on all frames.

For example, you can add a frame attribute that allows workers to enter a number. You may want to use
this attribute to have workers identify the number of objects they see in a particular frame.

In another example, you may want to provide a free-form text box to give workers the ability to provide
a free form answer to a question.

When you create a label verification job, you can add one or more frame attributes to ask workers to
provide feedback on all labels in a point cloud frame.

Worker Instructions
You can provide worker instructions to help your workers complete your point cloud labeling tasks. You
might want to use these instructions to describe the following:

• Best practices and things to avoid when annotating objects.


• Explanation of the label category attributes provided (for object detection and object tracking tasks),
and how to use them.


• Advice on how to save time while labeling by using keyboard shortcuts.

You can add your worker instructions using the SageMaker console while creating a labeling job. If you
create a labeling job using the API operation CreateLabelingJob, you specify worker instructions in
your label category configuration file.

In addition to your instructions, Ground Truth provides a link to help workers navigate and use the
worker portal. View these instructions by selecting the task type on Worker Instructions (p. 634).

Declining Tasks
Workers are able to decline tasks.

Workers might decline a task if the instructions are not clear, the input data is not displaying correctly,
or if they encounter some other issue with the task. If the number of workers who decline the task
reaches the number of workers per dataset object (NumberOfHumanWorkersPerDataObject), the data
object is marked as expired and is not sent to additional workers.

3D Point Cloud Labeling Job Permission Requirements


When you create a 3D point cloud labeling job, in addition to the permission requirements found in
Assign IAM Permissions to Use Ground Truth (p. 817), you must add a CORS policy to your S3 bucket
that contains your input manifest file.

Add a CORS Permission Policy to S3 Bucket


When you create a 3D point cloud labeling job, you specify buckets in S3 where your input data and
manifest file are located and where your output data will be stored. These buckets may be the same. You
must attach the following Cross-origin resource sharing (CORS) policy to your input and output buckets.
If you use the Amazon S3 console to add the policy to your bucket, you must use the JSON format.

JSON

[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "HEAD", "PUT"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["Access-Control-Allow-Origin"],
        "MaxAgeSeconds": 3000
    }
]

XML

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <CORSRule>
        <AllowedOrigin>*</AllowedOrigin>
        <AllowedMethod>GET</AllowedMethod>
        <AllowedMethod>HEAD</AllowedMethod>
        <AllowedMethod>PUT</AllowedMethod>
        <MaxAgeSeconds>3000</MaxAgeSeconds>
        <ExposeHeader>Access-Control-Allow-Origin</ExposeHeader>
        <AllowedHeader>*</AllowedHeader>
    </CORSRule>
</CORSConfiguration>

To learn how to add a CORS policy to an S3 bucket, see How do I add cross-domain resource sharing with
CORS? in the Amazon Simple Storage Service User Guide.
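
Alternatively, you can attach the policy programmatically. The following is a sketch that assumes the
JSON policy above is saved locally as cors.json, your credentials allow s3:PutBucketCORS, and the bucket
name is a placeholder you replace with your own:

# Attach the CORS policy shown above to your input/output bucket
aws s3api put-bucket-cors \
    --bucket DOC-EXAMPLE-BUCKET \
    --cors-configuration file://cors.json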

Worker Instructions
This topic provides an overview of the Ground Truth worker portal and the tools available to complete
your 3D Point Cloud labeling task. First, select the type of task you are working on from Topics.

For adjustment jobs, select the original labeling job task type that produced the labels you are adjusting.
Review and adjust the labels in your task as needed.
Important
It is recommended that you complete your task using a Google Chrome or Firefox web browser.

Topics
• 3D Point Cloud Semantic Segmentation (p. 634)
• 3D Point Cloud Object Detection (p. 643)
• 3D Point Cloud Object Tracking (p. 653)

3D Point Cloud Semantic Segmentation


Use this page to become familiar with the user interface and tools available to complete your 3D point
cloud semantic segmentation task.

Topics
• Your Task (p. 634)
• Navigate the UI (p. 639)
• Icon Guide (p. 641)
• Shortcuts (p. 642)
• Release, Stop and Resume, and Decline Tasks (p. 642)
• Saving Your Work and Submitting (p. 643)

Your Task

When you work on a 3D point cloud semantic segmentation task, you need to select a category from the
Annotations menu on the right side of your worker portal using the Label Categories drop-down menu.
After you've selected a category, use the paint brush and polygon tools to paint each object in the 3D
point cloud that this category applies to. For example, if you select the category Car, you would use
these tools to paint all of the cars in the point cloud. The following video demonstrates how to use the
paint brush tool to paint an object.

If you see one or more images in your worker portal, you can paint in the images or paint in the 3D point
cloud and the paint will show up in the other medium.

You may see frame attributes under the Labels menu. Use these attribute prompts to enter additional
information about the point cloud.


Important
If you see that objects have already been painted when you open the task, adjust those
annotations.

The following video includes an image that can be annotated. You may not see an image in your task.


After you've painted one or more objects using a label category, you can select that category from the
Label Category menu on the right to only view points painted for that category.


Navigate the UI

You can navigate in the 3D scene using your keyboard and mouse. You can:

• Double click on specific objects in the point cloud to zoom into them.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the
  W and S keys to zoom in and out.

The following video demonstrates movements around the 3D point cloud and in the side-view. You can
hide and re-expand all side views using the full screen icon. In this GIF, the side-views and menus have
been collapsed.


When you are in the worker UI, you see the following menus:

• Instructions – Review these instructions before starting your task.


• Shortcuts – Use this menu to view keyboard shortcuts that you can use to navigate the point cloud
and use the annotation tools provided.
• View – Use this menu to toggle different view options on and off. For example, you can use this menu
to add a ground mesh to the point cloud, and to choose the projection of the point cloud.
• 3D Point Cloud – Use this menu to add additional attributes to the points in the point cloud, such as
  color and pixel intensity. Note that some or all of these options may not be available.
• Paint – Use this menu to modify the functionality of the paint brush.

When you open a task, the move scene icon is on, and you can move around the point cloud using your
mouse and the navigation buttons in the point cloud area of the screen. To return to the original view
you see when you first opened the task, choose the reset scene icon.

After you select the paint icon, you can add paint to the point cloud and images (if included). You must
select the move scene icon again to move to another area in the 3D point cloud or image.

To collapse all panels on the right and make the 3D point cloud full screen, select the full screen icon.

For the camera images and side-panels, you have the following view options:

• C – View the camera angle on point cloud view.


• F – View the frustum, or field of view, of the camera used to capture that image on point cloud view.
• P – View the point cloud overlaid on the image.

Icon Guide

Use this table to learn about the icons available in your worker task portal.

Icon Name Description

brush          Choose this icon to turn on the brush tool. To use this tool, choose it and move over the
               objects that you want to paint with your mouse. Everything you paint is associated with
               the category you chose.

polygon        Choose this icon to use the polygon paint tool. Use this tool to draw polygons around
               objects that you want to paint. Everything you draw a polygon around is associated with
               the category you have chosen.

reset scene    Choose this icon to reset the view of the point cloud, side panels, and, if applicable,
               all images to their original position when the task was first opened.

move scene     Choose this icon to move the scene. By default, this icon is selected when you first
               start a task.

full screen    Choose this icon to make the 3D point cloud visualization full screen, and to collapse
               all side panels.

ruler          Use this icon to measure distances, in meters, in the point cloud. You may want to use
               this tool if your instructions ask you to annotate all objects within a given distance
               from the center of the cuboid or the object used to capture data.

               When you select this icon, you can place the starting point (first marker) anywhere in
               the point cloud by selecting it with your mouse. The tool automatically uses
               interpolation to place the marker on the closest point within a threshold distance of
               the location you select; otherwise, the marker is placed on the ground. If you place a
               starting point by mistake, you can use the Escape key to revert the marker placement.

               After you place the first marker, you see a dotted line and a dynamic label that
               indicates the distance you have moved away from the first marker. Click somewhere else
               on the point cloud to place a second marker. When you place the second marker, the
               dotted line becomes solid, and the distance is set.

               After you set a distance, you can edit it by selecting either marker. You can delete a
               ruler by selecting anywhere on the ruler and using the Delete key on your keyboard.

Shortcuts

The shortcuts listed in the Shortcuts menu can help you navigate the 3D point cloud and use the paint
tool.

Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands.

Release, Stop and Resume, and Decline Tasks

When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:

• Decline task: You should only decline a task if something is wrong with the task, such as an issue with
the 3D point cloud, images or the UI. If you decline a task, you will not be able to return to the task.
• Release Task: If you release a task, you lose all work done on that task. When the task is released,
other workers on your team can pick it up. If enough workers pick up the task, you may not be able
to return to it. When you select this button and then select Confirm, you are returned to the worker
portal. If the task is still available, its status will be Available. If other workers pick it up, it will
disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and


resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.

Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must
be completed. If you do not return to and complete this task within that time limit, it expires and your
work is not submitted. Contact your administrator for more information.

Saving Your Work and Submitting

You should periodically save your work. Ground Truth automatically saves your work every 15 minutes.

When you open a task, you must complete your work on it before pressing Submit.

3D Point Cloud Object Detection


Use this page to become familiar with the user interface and tools available to complete your 3D point
cloud object detection task.

Topics
• Your Task (p. 643)
• Navigate the UI (p. 645)
• Icon Guide (p. 651)
• Shortcuts (p. 652)
• Release, Stop and Resume, and Decline Tasks (p. 652)
• Saving Your Work and Submitting (p. 653)

Your Task

When you work on a 3D point cloud object detection task, you need to select a category from the
Annotations menu on the right side of your worker portal using the Label Categories menu. After you've
chosen a category, use the add cuboid and fit cuboid tools to fit a cuboid around objects in the 3D point
cloud that this category applies to. After you place a cuboid, you can modify its dimensions, location, and
orientation directly in the point cloud and in the three panels shown on the right.

If you see one or more images in your worker portal, you can also modify cuboids in the images or in the
3D point cloud and the edits will show up in the other medium.

If you see cuboids have already been added to the 3D point cloud when you open your task, adjust those
cuboids and add additional cuboids as needed.

To edit a cuboid, including moving, re-orienting, and changing cuboid dimensions, you must use
shortcut keys. You can see a full list of shortcut keys in the Shortcuts menu in your UI. The following are
important key-combinations that you should become familiar with before starting your labeling task.

Mac Command      Windows Command      Action
Cmd + Drag       Ctrl + Drag          Modify the dimensions of the cuboid.
Option + Drag    Alt + Drag           Move the cuboid.
Shift + Drag     Shift + Drag         Rotate the cuboid.
Option + O       Alt + O              Fit the cuboid tightly around the points it has been
                                      drawn around. Before using this option, make sure the
                                      cuboid fully surrounds the object of interest.
Option + G       Alt + G              Set the cuboid to the ground.

Individual labels may have one or more label attributes. If a label has a label attribute associated with it,
it will appear when you select the downward pointing arrow next to the label from the Label Id menu.
Fill in required values for all label attributes.

You may see frame attributes under the Labels menu. Use these attribute prompts to enter additional
information about each frame.


Navigate the UI

You can navigate in the 3D scene using your keyboard and mouse. You can:

• Double click on specific objects in the point cloud to zoom into them.
• You can use the [ and ] keys on your keyboard to zoom into and move from one label to the next. If no
  label is selected, when you select [ or ], the UI will zoom into the first label in the Label Id list.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the
  W and S keys to zoom in and out.

Once you place a cuboid in the 3D scene, a side-view appears with three projected views: top, side,
and back. These side-views show points in and around the placed cuboid and help you refine cuboid
boundaries in that area. You can zoom in and out of each of those side-views using your mouse.

The following video demonstrates movements around the 3D point cloud and in the side-view.


When you are in the worker UI, you see the following menus:

• Instructions – Review these instructions before starting your task.


• Shortcuts – Use this menu to view keyboard shortcuts that you can use to navigate the point cloud
and use the annotation tools provided.
• Label – Use this menu to modify a cuboid. First, select a cuboid, and then choose an option from this
menu. This menu includes assistive labeling tools like setting a cuboid to the ground and automatically
fitting the cuboid to the object's boundaries.
• View – Use this menu to toggle different view options on and off. For example, you can use this menu
to add a ground mesh to the point cloud, and to choose the projection of the point cloud.
• 3D Point Cloud – Use this menu to add additional attributes to the points in the point cloud, such as
  color and pixel intensity. Note that these options may not be available.

When you open a task, the move scene icon is on, and you can move around the point cloud using your
mouse and the navigation buttons in the point cloud area of the screen. To return to the original view
you see when you first opened the task, choose the reset scene icon. Resetting the view will not modify
your annotations.

After you select the add cuboid icon, you can add cuboids to the 3D point cloud visualization. Once
you've added a cuboid, you can adjust it in the three views (top, side, and front) and in the images (if
included).


You must choose the move scene icon again to move to another area in the 3D point cloud or image.

To collapse all panels on the right and make the 3D point cloud full-screen, choose the full screen icon.

If camera images are included, you may have the following view options:

• C – View the camera angle on point cloud view.


• F – View the frustum, or field of view, of the camera used to capture that image on point cloud view.
• P – View the point cloud overlaid on the image.
• B – View cuboids in the image.

The following video demonstrates how to use these view options. The F option is used to view the field
of view of the camera (the gray area), the C option shows the direction the camera is facing and the angle
of the camera (blue lines), and the B option is used to view the cuboid.


Icon Guide

Use this table to learn about the icons you see in your worker task portal.

Icon Description

add cuboid     Choose this icon to add a cuboid. Each cuboid you add is associated with the category
               you chose.

edit cuboid    Choose this icon to edit a cuboid. After you have added a cuboid, you can edit its
               dimensions, location, and orientation. After a cuboid is added, the tool automatically
               switches to edit cuboid mode.

ruler          Use this icon to measure distances, in meters, in the point cloud. You may want to use
               this tool if your instructions ask you to annotate all objects within a given distance
               from the center of the cuboid or the object used to capture data.

               When you select this icon, you can place the starting point (first marker) anywhere in
               the point cloud by selecting it with your mouse. The tool automatically uses
               interpolation to place the marker on the closest point within a threshold distance of
               the location you select; otherwise, the marker is placed on the ground. If you place a
               starting point by mistake, you can use the Escape key to revert the marker placement.

               After you place the first marker, you see a dotted line and a dynamic label that
               indicates the distance you have moved away from the first marker. Click somewhere else
               on the point cloud to place a second marker. When you place the second marker, the
               dotted line becomes solid, and the distance is set.

               After you set a distance, you can edit it by selecting either marker. You can delete a
               ruler by selecting anywhere on the ruler and using the Delete key on your keyboard.

reset scene    Choose this icon to reset the view of the point cloud, side panels, and, if applicable,
               all images to their original position when the task was first opened.

move scene     Choose this icon to move the scene. By default, this icon is chosen when you first start
               a task.

full screen    Choose this icon to make the 3D point cloud visualization full screen, and to collapse
               all side panels.

show labels    Choose this icon to show labels in the 3D point cloud visualization and, if applicable,
               in images.

hide labels    Choose this icon to hide labels in the 3D point cloud visualization and, if applicable,
               in images.

delete labels  Choose this icon to delete a label.

Shortcuts

The shortcuts listed in the Shortcuts menu can help you navigate the 3D point cloud and use tools to
add and edit cuboids.

Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands. You need to use some of the 3D cuboid controls to edit your cuboid.

Release, Stop and Resume, and Decline Tasks

When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:

• Decline task: You should only decline a task if something is wrong with the task, such as an issue with
the 3D point cloud, images or the UI. If you decline a task, you will not be able to return to the task.
• Release Task: If you release a task, you lose all work done on that task. When the task is released,
other workers on your team can pick it up. If enough workers pick up the task, you may not be able
to return to it. When you select this button and then select Confirm, you are returned to the worker
portal. If the task is still available, its status will be Available. If other workers pick it up, it will
disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.

Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must
be completed. If you do not return to and complete this task within that time limit, it expires and your
work is not submitted. Contact your administrator for more information.


Saving Your Work and Submitting

You should periodically save your work. Ground Truth automatically saves your work every 15 minutes.

When you open a task, you must complete your work on it before pressing Submit.

3D Point Cloud Object Tracking


Use this page to become familiar with the user interface and tools available to complete your 3D point
cloud object tracking task.

Topics
• Your Task (p. 653)
• Navigate the UI (p. 657)
• Bulk Edit Label Category and Frame Attributes (p. 661)
• Icon Guide (p. 662)
• Shortcuts (p. 663)
• Release, Stop and Resume, and Decline Tasks (p. 663)
• Saving Your Work and Submitting (p. 664)

Your Task

When you work on a 3D point cloud object tracking task, you need to select a category from the
Annotations menu on the right side of your worker portal using the Label Categories menu. After you've
selected a category, use the add cuboid and fit cuboid tools to fit a cuboid around objects in the 3D point
cloud that this category applies to. After you place a cuboid, you can modify its location, dimensions, and
orientation directly in the point cloud and in the three panels shown on the right. If you see one or more
images in your worker portal, you can also modify cuboids in the images or in the 3D point cloud and the
edits will show up in the other medium.
Important
If you see cuboids have already been added to the 3D point cloud frames when you open your
task, adjust those cuboids and add additional cuboids as needed.

To edit a cuboid, including moving, re-orienting, and changing cuboid dimensions, you must use
shortcut keys. You can see a full list of shortcut keys in the Shortcuts menu in your UI. The following are
important key-combinations that you should become familiar with before starting your labeling task.

Mac Command      Windows Command      Action
Cmd + Drag       Ctrl + Drag          Modify the dimensions of the cuboid.
Option + Drag    Alt + Drag           Move the cuboid.
Shift + Drag     Shift + Drag         Rotate the cuboid.
Option + O       Alt + O              Fit the cuboid tightly around the points it has been
                                      drawn around. Before using this option, make sure the
                                      cuboid fully surrounds the object of interest.
Option + G       Alt + G              Set the cuboid to the ground.


When you open your task, two frames will be loaded. If your task includes more than two frames, you
need to use the navigation bar in the lower-left corner, or the load frames icon to load additional frames.
You should annotate and adjust labels in all frames before submitting.

After you fit a cuboid tightly around the boundaries of an object, navigate to another frame using the
navigation bar in the lower-left corner of the UI. If that same object has moved to a new location, add
another cuboid and fit it tightly around the boundaries of the object. Each time you manually add a
cuboid, you see the frame sequence bar in the lower-left corner of the screen turn red where that frame
is located temporally in the sequence.

Your UI automatically infers the location of that object in all other frames after you've placed a cuboid.
This is called interpolation. You can see the movement of that object, and the inferred and manually
created cuboids using the arrows. Adjust inferred cuboids as needed. The following video demonstrates
how to navigate between frames. The following video shows how, if you add a cuboid in one frame and
then adjust it in another, the UI automatically infers the location of the cuboid in all of the frames
in between.


Tip
You can turn off the automatic cuboid interpolation across frames using the 3D Point Cloud
menu item. Select 3D Point Cloud from the top-menu, and then select Interpolate Cuboids
Across Frames. This will uncheck this option and stop cuboid interpolation. You can reselect this
item to turn cuboid interpolation back on.
Turning cuboid interpolation off will not impact cuboids that have already been interpolated
across frames.

Individual labels may have one or more label attributes. If a label has a label attribute associated with it,
it will appear when you select the downward pointing arrow next to the label from the Label Id menu.
Fill in required values for all label attributes.

You may see frame attributes under the Label Id menu. These attributes will appear on each frame in
your task. Use these attribute prompts to enter additional information about each frame.

656
Amazon SageMaker Developer Guide
Label 3D Point Clouds

Navigate the UI

You can navigate in the 3D scene using your keyboard and mouse. You can:

• Double click on specific objects in the point cloud to zoom into them.
• You can use the [ and ] keys on your keyboard to zoom into and move from one label to the next. If no
label is selected, when you select [ or ], the UI will zoom into the first label in the Label Id list.
• Use a mouse-scroller or trackpad to zoom in and out of the point cloud.
• Use the keyboard arrow keys, or the Q, E, A, and D keys, to move up, down, left, and right. Use the
  W and S keys to zoom in and out.

Once you place a cuboid in the 3D scene, a side-view appears with three projected views: top, side,
and back. These side-views show points in and around the placed cuboid and help you refine cuboid
boundaries in that area. You can zoom in and out of each of those side-views using your mouse.

The following video demonstrates movements around the 3D point cloud and in the side-view.


When you are in the worker UI, you see the following menus:

• Instructions – Review these instructions before starting your task.


• Shortcuts – Use this menu to view keyboard shortcuts that you can use to navigate the point cloud
and use the annotation tools provided.
• Label – Use this menu to modify a cuboid. First, select a cuboid, and then choose an option from this
menu. This menu includes assistive labeling tools like setting a cuboid to the ground and automatically
fitting the cuboid to the object's boundaries.
• View – Use this menu to toggle different view options on and off. For example, you can use this menu
to add a ground mesh to the point cloud, and to choose the projection of the point cloud.
• 3D Point Cloud – Use this menu to add additional attributes to the points in the point cloud, such as
  color and pixel intensity. Note that these options may not be available.

When you open a task, the move scene icon is on, and you can move around the point cloud using your
mouse and the navigation buttons in the point cloud area of the screen. To return to the original view
you see when you first opened the task, choose the reset scene icon.

After you select the add cuboid icon, you can add cuboids to the point cloud and images (if included).
You must select the move scene icon again to move to another area in the 3D point cloud or image.

To collapse all panels on the right and make the 3D point cloud full-screen, choose the full screen icon.

If camera images are included, you may have the following view options:

• C – View the camera angle on point cloud view.


• F – View the frustum, or field of view, of the camera used to capture that image on point cloud view.
• P – View the point cloud overlaid on the image.
• B – View cuboids in the image.

The following video demonstrates how to use these view options. The F option is used to view the field
of view of the camera (the gray area), the C option shows the direction the camera is facing and the angle
of the camera (blue lines), and the B option is used to view the cuboid.


Delete Cuboids
You can select a cuboid or label ID and:

• Delete an individual cuboid in the current frame you are viewing.


• Delete all cuboids with that label ID before or after the frame you are viewing.
• Delete all cuboids with that label ID in all frames.

A common use case for cuboid deletion is when the object leaves the scene.

You can use one or more of these options to delete both manually placed and interpolated cuboids with
the same label ID.

• To delete all cuboids before or after the frame you are currently on, select the cuboid, select the Label
  menu item at the top of the UI, and then select either Delete in previous frames or Delete in next
  frames. Use the Shortcuts menu to see the shortcut keys you can use for these options.
• To delete a label in all frames, select Delete in all frames from the Labels menu, or use the shortcut
Shift + Delete on your keyboard.
• To delete an individual cuboid from a single frame, select the cuboid and either select the trashcan
  icon next to that label ID in the Label ID sidebar on the right or use the Delete key on your keyboard
  to delete that cuboid.

If you have manually placed more than one cuboid with the same label in different frames, when
you delete one of the manually placed cuboids, all interpolated cuboids adjust. This adjustment
happens because the UI uses manually placed cuboids as anchor points when calculating the location of
interpolated cuboid. When you remove one of these anchor points, the UI must recalculate the position
of interpolated cuboids.

If you delete a cuboid from a frame, but later decide that you want to get it back, you can use the
Duplicate to previous frames or Duplicate to next frames options in the Label menu to copy the cuboid
into all the previous or all of the following frames, respectively.

Bulk Edit Label Category and Frame Attributes


You can bulk edit label attributes and frame attributes.

When you bulk edit an attribute, you specify one or more ranges of frames that you want to apply the
edit to. The attribute you select is edited in all frames in that range, including the start and end frames
you specify. When you bulk edit label attributes, the range you specify must contain the label that the
label attribute is attached to. If you specify frames that do not contain this label, you will receive an
error.

To bulk edit an attribute, you must first specify the desired value for the attribute. For example, if you
want to change an attribute from Yes to No, you must select No, and then perform the bulk edit.

You can also specify a new value for an attribute that has not been filled in and then use the bulk edit
feature to fill in that value in multiple frames. To do this, select the desired value for the attribute and
complete the following procedure.

To bulk edit a label or attribute:

1. Use your mouse to right click the attribute you want to bulk edit.
2. Specify the range of frames you want to apply the bulk edit to using a dash (-) in the text box. For
example, if you want to apply the edit to frames one through ten, enter 1-10. If you want to apply
the edit to frames two to five, eight to ten, and twenty, enter 2-5,8-10,20.
3. Select Confirm.


If you get an error message, verify that you entered a valid range and that the label associated with the
label attribute you are editing (if applicable) exists in all frames specified.

You can quickly add a label to all previous or subsequent frames using the Duplicate to previous frames
and Duplicate to next frames options in the Label menu at the top of your screen.

Icon Guide
Use this table to learn about the icons you see in your worker task portal.

Icon Description

add cuboid     Choose this icon to add a cuboid. Each cuboid you add is associated with the category
               you chose.

edit cuboid    Choose this icon to edit a cuboid. After you add a cuboid, you can edit its dimensions,
               location, and orientation. After a cuboid is added, the tool automatically switches to
               edit cuboid mode.

ruler          Use this icon to measure distances, in meters, in the point cloud. You may want to use
               this tool if your instructions ask you to annotate all objects within a given distance
               from the center of the cuboid or the object used to capture data.

               When you select this icon, you can place the starting point (first marker) anywhere in
               the point cloud by selecting it with your mouse. The tool automatically uses
               interpolation to place the marker on the closest point within a threshold distance of
               the location you select; otherwise, the marker is placed on the ground. If you place a
               starting point by mistake, you can use the Escape key to revert the marker placement.

               After you place the first marker, you see a dotted line and a dynamic label that
               indicates the distance you have moved away from the first marker. Click somewhere else
               on the point cloud to place a second marker. When you place the second marker, the
               dotted line becomes solid, and the distance is set.

               After you set a distance, you can edit it by selecting either marker. You can delete a
               ruler by selecting anywhere on the ruler and using the Delete key on your keyboard.

reset scene    Choose this icon to reset the view of the point cloud, side panels, and, if applicable,
               all images to their original position when the task was first opened.

move scene     Choose this icon to move the scene. By default, this icon is chosen when you first start
               a task.

full screen    Choose this icon to make the 3D point cloud visualization full screen, and to collapse
               all side panels.

load frames    Choose this icon to load additional frames.

hide labels    Choose this icon to hide labels in the 3D point cloud visualization and, if applicable,
               in images.

show labels    Choose this icon to show labels in the 3D point cloud visualization and, if applicable,
               in images.

delete labels  Choose this icon to delete a label. This option can only be used to delete labels you
               have manually created or adjusted.

Shortcuts

The shortcuts listed in the Shortcuts menu can help you navigate the 3D point cloud and use tools to
add and edit cuboids.

Before you start your task, it is recommended that you review the Shortcuts menu and become
acquainted with these commands. You need to use some of the 3D cuboid controls to edit your cuboid.

Release, Stop and Resume, and Decline Tasks

When you open the labeling task, three buttons on the top right allow you to decline the task (Decline
task), release it (Release task), and stop and resume it at a later time (Stop and resume later). The
following list describes what happens when you select one of these options:

• Decline task: You should only decline a task if something is wrong with the task, such as an issue with
the 3D point clouds, images or the UI. If you decline a task, you will not be able to return to the task.
• Release Task: Use this option to release a task and allow others to work on it. When you release a task,
you lose all work done on that task and other workers on your team can pick it up. If enough workers
pick up the task, you may not be able to return to it. When you select this button and then select
Confirm, you are returned to the worker portal. If the task is still available, its status will be Available.
If other workers pick it up, it will disappear from your portal.
• Stop and resume later: You can use the Stop and resume later button to stop working and return to
the task at a later time. You should use the Save button to save your work before you select Stop and
resume later. When you select this button and then select Confirm, you are returned to the worker
portal, and the task status is Stopped. You can select the same task to resume work on it.


Be aware that the person who creates your labeling tasks specifies a time limit by which all tasks must
be completed. If you do not return to and complete this task within that time limit, it expires and your
work is not submitted. Contact your administrator for more information.

Saving Your Work and Submitting

You should periodically save your work. Ground Truth automatically saves your work every 15 minutes.

When you open a task, you must complete your work on it before pressing Submit.

Verify and Adjust Labels


When the labels on a dataset need to be validated, Amazon SageMaker Ground Truth provides
functionality to have workers verify that labels are correct or to adjust previous labels.

These types of jobs fall into two distinct categories:

• Label verification — Workers indicate if the existing labels are correct, or rate their quality, and can add
comments to explain their reasoning. Workers will not be able to modify or adjust labels.

If you create a 3D point cloud or video frame label adjustment or verification job, you can choose to
make label category attributes (not supported for 3D point cloud semantic segmentation) and frame
attributes editable by workers.
• Label adjustment — Workers adjust prior annotations and, if applicable, label category and frame
attributes to correct them.

The following Ground Truth built-in task types support adjustment and verification labeling jobs:

• Bounding box
• Semantic segmentation
• 3D point cloud object detection, 3D point cloud object tracking, and 3D point cloud semantic
segmentation
• All video frame object detection and video frame object tracking task types — bounding box, polyline,
polygon and keypoint

Tip
For 3D point cloud and video frame labeling verification jobs, it is recommended that you add
new label category attributes or frame attributes to the labeling job. Workers can use these
attributes to verify individual labels or the entire frame. To learn more about label category and
frame attributes, see Worker User Interface (UI) (p. 631) for 3D point cloud and Worker User
Interface (UI) (p. 577) for video frame.

You can start label verification and adjustment jobs using the SageMaker console or the API.

Topics
• Requirements to Create Verification and Adjustment Labeling Jobs (p. 665)
• Create a Label Verification Job (Console) (p. 665)
• Create a Label Adjustment Job (Console) (p. 667)
• Start a Label Verification or Adjustment Job (API) (p. 668)
• Label Verification and Adjustment Data in the Output Manifest (p. 670)
• Cautions and Considerations (p. 671)


Requirements to Create Verification and Adjustment Labeling Jobs
To create a label verification or adjustment job, the following criteria must be satisfied.

• For non-streaming labeling jobs: The input manifest file you use must contain the label attribute
name (LabelAttributeName) of the labels that you want adjusted. When you chain a successfully
completed labeling job, the output manifest file is used as the input manifest file for the new, chained
job. To learn more about the format of the output manifest file Ground Truth produces for each task
type, see Output Data (p. 776).

For streaming labeling jobs: The Amazon SNS message you send to the Amazon SNS input topic of the
adjustment or verification labeling job must contain the label attribute name of the labels you want
adjusted or verified. To see an example of how you can create an adjustment or verification labeling
job with streaming labeling jobs, see this Jupyter Notebook example in GitHub.
• The task type of the verification or adjustment labeling job must be the same as the task type of the
original job unless you are using the Image Label Verification (p. 551) task type to verify bounding
box or semantic segmentation image labels. See the next bullet point for more details about the video
frame task type requirements.
• For video frame annotation verification and adjustment jobs, you must use the same annotation task
type used to create the annotations from the previous labeling job. For example, if you create a video
frame object detection job to have workers draw bounding boxes around objects, and then you create
a video object detection adjustment job, you must specify bounding boxes as the annotation task type.
To learn more video frame annotation task types, see Task Types (p. 576).
• The task type you select for the adjustment or verification labeling job must support an audit
workflow. The following Ground Truth built-in task types support adjustment and verification labeling
jobs: bounding box, semantic segmentation, 3D point cloud object detection, 3D point cloud object
tracking, and 3D point cloud semantic segmentation, and all video frame object detection and video
frame object tracking task types — bounding box, polyline, polygon and keypoint.
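
To make the first requirement concrete, the following is a sketch of one line of a chained input manifest
for a bounding box adjustment job, where bounding-box is the label attribute name of the original job. It
is shown pretty-printed for readability; in the manifest, each data object is a single JSON line, and the
bucket, file name, and values here are hypothetical.

{
    "source-ref": "s3://DOC-EXAMPLE-BUCKET/images/0001.jpg",
    "bounding-box": {
        "image_size": [{ "width": 1280, "height": 720, "depth": 3 }],
        "annotations": [{ "class_id": 0, "left": 100, "top": 50, "width": 240, "height": 120 }]
    },
    "bounding-box-metadata": {
        "class-map": { "0": "car" },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-01-01T00:00:00.000000",
        "job-name": "labeling-job/bounding-box"
    }
}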

Create a Label Verification Job (Console)


Bounding box and semantic segmentation label verification jobs are created by choosing the Label
verification task type in the console. To create a verification job for 3D point cloud and video frame task
types, you
must choose the same task type as the original labeling job and choose to display existing labels. Use
one of the following sections to create a label verification job for your task type.

Topics
• Create an Image Label Verification Job (Console) (p. 665)
• Create a Point Cloud or Video Frame Label Verification Job (Console) (p. 666)

Create an Image Label Verification Job (Console)


Use the following procedure to create a bounding box or semantic segmentation verification job
using the console. This procedure assumes that you have already created a bounding box or semantic
segmentation labeling job and its status is Complete. This is the labeling job that produces the labels you
want verified.

To create an image label verification job:

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/ and choose Labeling


jobs.
2. Start a new labeling job by chaining (p. 813) a prior job or start from scratch, specifying an input
manifest that contains labeled data objects.


3. In the Task type pane, select Label verification.


4. Choose Next.
5. In the Workers section, choose the type of workforce you would like to use. For more details about
your workforce options see Create and Manage Workforces (p. 863).
6. (Optional) After you've selected your workforce, specify the Task timeout and Task expiration time.
7. In the Existing-labels display options pane, the system shows the available label attribute names
in your manifest. Choose the label attribute name that identifies the labels that you want workers
to verify. Ground Truth tries to detect and populate these values by analyzing the manifest, but you
might need to set the correct value.
8. Use the instructions areas of the tool designer to provide context about what the previous labelers
were asked to do and what the current verifiers need to check.

You can add new labels that workers can choose from to verify labels. For example, you can ask workers
to verify the image quality and provide the labels Clear and Blurry. Workers will also have the option
to add a comment to explain their selection.
9. Choose See preview to check that the tool is displaying the prior labels correctly and presents the
label verification task clearly.
10. Select Create. This will create and start your labeling job.

Create a Point Cloud or Video Frame Label Verification Job (Console)


Use the following procedure to create a 3D point cloud or video frame verification job using the console.
This procedure assumes that you have already created a labeling job using the task type that produces
the types of labels you want to be verified and its status is Complete.

To create a 3D point cloud or video frame label verification job:

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/ and choose Labeling


jobs.
2. Start a new labeling job by chaining (p. 813) a prior job or start from scratch, specifying an input
manifest that contains labeled data objects.
3. In the Task type pane, select the same task type as the labeling job that you chained. For example,
if the original labeling job was a video frame object detection keypoint labeling job, select that task
type.
4. Choose Next.
5. In the Workers section, choose the type of workforce you would like to use. For more details about
your workforce options see Create and Manage Workforces (p. 863).
6. (Optional) After you've selected your workforce, specify the Task timeout and Task expiration time.
7. Toggle on the switch next to Display existing labels.
8. Select Verification.
9. For Label attribute name, choose the name from your manifest that corresponds to the labels that
you want to display for verification. You will only see label attribute names for labels that match
the task type you selected on the previous screen. Ground Truth tries to detect and populate these
values by analyzing the manifest, but you might need to set the correct value.
10. Use the instructions areas of the tool designer to provide context about what the previous labelers
were asked to do and what the current verifiers need to check.

You cannot modify or add new labels. You can remove, modify and add new label category
attributes or frame attributes. It is recommended that you add new label category attributes or
frame attributes to the labeling job. Workers can use these attributes to verify individual labels or the
entire frame.


By default, preexisting label category attributes and frame attributes will not be editable by workers.
If you want to make a label category or frame attribute editable, select the Allow workers to edit
this attribute check box for that attribute.

To learn more about label category and frame attributes, see Worker User Interface (UI) (p. 631)
for 3D point cloud and Worker User Interface (UI) (p. 577) for video frame.
11. Choose See preview to check that the tool is displaying the prior labels correctly and presents the
label verification task clearly.
12. Select Create. This will create and start your labeling job.

Create a Label Adjustment Job (Console)


Use one of the following sections to create a label adjustment job for your task type.

Topics
• Create an Image Label Adjustment Job (Console) (p. 667)
• Create a Point Cloud or Video Frame Label Adjustment Job (Console) (p. 668)

Create an Image Label Adjustment Job (Console)


Use the following procedure to create a bounding box or semantic segmentation adjustment labeling
job using the console. This procedure assumes that you have already created a bounding box or semantic
segmentation labeling job and its status is Complete. This is the labeling job that produces the labels you
want adjusted.

To create an image label adjustment job (console)

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/ and choose Labeling


jobs.
2. Start a new labeling job by chaining (p. 813) a prior job or start from scratch, specifying an input
manifest that contains labeled data objects.
3. Choose the same task type as the original labeling job.
4. Choose Next.
5. In the Workers section, choose the type of workforce you would like to use. For more details about
your workforce options see Create and Manage Workforces (p. 863).
6. (Optional) After you've selected your workforce, specify the Task timeout and Task expiration time.
7. Expand Existing-labels display options by selecting the arrow next to the title.
8. Check the box next to I want to display existing labels from the dataset for this job.
9. For Label attribute name, choose the name from your manifest that corresponds to the labels that
you want to display for adjustment. You will only see label attribute names for labels that match
the task type you selected on the previous screen. Ground Truth tries to detect and populate these
values by analyzing the manifest, but you might need to set the correct value.
10. Use the instructions areas of the tool designer to provide context about what the previous labelers
were tasked with doing and what the current verifiers need to check and adjust.
11. Choose See preview to check that the tool shows the prior labels correctly and presents the task
clearly.
12. Select Create. This will create and start your labeling job.


Create a Point Cloud or Video Frame Label Adjustment Job (Console)


Use the following procedure to create a 3D point cloud or video frame adjustment job using the console.
This procedure assumes that you have already created a labeling job using the task type that produces
the types of labels you want to be verified and its status is Complete.

To create a 3D point cloud or video frame label adjustment job (console)

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/ and choose Labeling


jobs.
2. Start a new labeling job by chaining (p. 813) a prior job or start from scratch, specifying an input
manifest that contains labeled data objects.
3. Choose the same task type as the original labeling job.
4. Toggle on the switch next to Display existing labels.
5. Select Adjustment.
6. For Label attribute name, choose the name from your manifest that corresponds to the labels that
you want to display for adjustment. You will only see label attribute names for labels that match
the task type you selected on the previous screen. Ground Truth tries to detect and populate these
values by analyzing the manifest, but you might need to set the correct value.
7. Use the instructions areas of the tool designer to provide context about what the previous labelers
were asked to do and what the current adjusters need to check.

You cannot remove or modify existing labels but you can add new labels. You can remove, modify
and add new label category attributes or frame attributes.

By default, preexisting label category attributes and frame attributes will be editable by workers. If
you want to make a label category or frame attribute uneditable, deselect the Allow workers to edit
this attribute check box for that attribute.

To learn more about label category and frame attributes, see Worker User Interface (UI) (p. 631)
for 3D point cloud and Worker User Interface (UI) (p. 577) for video frame.
8. Choose See preview to check that the tool shows the prior labels correctly and presents the task
clearly.
9. Select Create. This will create and start your labeling job.

Start a Label Verification or Adjustment Job (API)


Start a label verification or adjustment job by chaining a successfully completed job or starting a new
job from scratch using the CreateLabelingJob operation. The procedure is almost the same as setting
up a new labeling job with CreateLabelingJob, with a few modifications. Use the following sections
to learn what modifications are required to chain a labeling job to create an adjustment or verification
labeling job.

When you create an adjustment or verification labeling job using the Ground Truth API, you must use a
different LabelAttributeName than the original labeling job. The original labeling job is the job used
to create the labels you want adjusted or verified.
Important
The label category configuration file you identify for an adjustment or verification job in
LabelCategoryConfigS3Uri of CreateLabelingJob must contain the same labels used in
the original labeling job. You can add new labels. For 3D point cloud and video frame jobs, you
can add new label category and frame attributes to the label category configuration file.


Bounding Box and Semantic Segmentation

To create a bounding box or semantic segmentation label verification or adjustment job, use the
following guidelines to specify API attributes for the CreateLabelingJob operation.

• Use the LabelAttributeName parameter to specify the output label name that you want to use for
verified or adjusted labels. You must use a different LabelAttributeName than the one used for the
original labeling job.
• If you are chaining the job, the labels from the previous labeling job to be adjusted or verified will be
specified in the custom UI template. To learn how to create a custom template, see Create Custom
Worker Task Templates (p. 2995).

Identify the location of the UI template in the UiTemplateS3Uri parameter. SageMaker provides
widgets that you can use in your custom template to display old labels. Use the initial-value
attribute in one of the following crowd elements to extract the labels that need verification or
adjustment and include them in your task template:
• crowd-semantic-segmentation (p. 948)—Use this crowd element in your custom UI task template
to specify semantic segmentation labels that need to be verified or adjusted.
• crowd-bounding-box (p. 894)—Use this crowd element in your custom UI task template to specify
bounding box labels that need to be verified or adjusted.
• The LabelCategoryConfigS3Uri parameter must contain the same label categories as the previous
labeling job.
• Use the bounding box or semantic segmentation adjustment or verification lambda ARNs for
PreHumanTaskLambdaArn and AnnotationConsolidationLambdaArn:
• For bounding box, the adjustment labeling job lambda function ARNs end with
AdjustmentBoundingBox and the verification lambda function ARNs end with
VerificationBoundingBox.
• For semantic segmentation, the adjustment labeling job lambda function ARNs end with
AdjustmentSemanticSegmentation and the verification lambda function ARNs end with
VerificationSemanticSegmentation.
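
Putting these attributes together, the following is a sketch of a CreateLabelingJob request for a bounding
box adjustment job. All names, S3 URIs, account numbers, and Region-specific Lambda ARNs shown are
placeholders that you must replace with your own values; note that LabelAttributeName differs from the
original job's.

{
    "LabelingJobName": "adjust-bounding-boxes",
    "LabelAttributeName": "adjusted-bounding-box",
    "InputConfig": {
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://DOC-EXAMPLE-BUCKET/original-job/output.manifest"
            }
        }
    },
    "OutputConfig": {
        "S3OutputPath": "s3://DOC-EXAMPLE-BUCKET/adjustment-output/"
    },
    "RoleArn": "arn:aws:iam::111122223333:role/GroundTruthExecutionRole",
    "LabelCategoryConfigS3Uri": "s3://DOC-EXAMPLE-BUCKET/label-categories.json",
    "HumanTaskConfig": {
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/my-team",
        "UiConfig": {
            "UiTemplateS3Uri": "s3://DOC-EXAMPLE-BUCKET/templates/adjustment-template.liquid.html"
        },
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:PRE-AdjustmentBoundingBox",
        "TaskTitle": "Adjust bounding boxes",
        "TaskDescription": "Review and adjust the bounding boxes drawn in a previous job",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:ACS-AdjustmentBoundingBox"
        }
    }
}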

3D Point Cloud and Video Frame

• Use the LabelAttributeName parameter to specify the output label name that you want to use for
verified or adjusted labels. You must use a different LabelAttributeName than the one used for the
original labeling job.
• You must use the human task UI Amazon Resource Name (ARN) (HumanTaskUiArn) used for the
original labeling job. To see supported ARNs, see HumanTaskUiArn.
• In the label category configuration file, you must specify the label attribute name
(LabelAttributeName) of the previous labeling job that you use to create the adjustment or
verification labeling job in the auditLabelAttributeName parameter.
• You specify whether your labeling job is a verification or adjustment labeling job using
the editsAllowed parameter in your label category configuration file identified by the
LabelCategoryConfigS3Uri parameter.
• For verification labeling jobs, you must use the editsAllowed parameter to specify that all labels
cannot be modified. editsAllowed must be set to "none" in each entry in labels. Optionally,
you can specify whether or not label categories attributes and frame attributes can be adjusted by
workers.
• Optionally, for adjustment labeling jobs, you can use the editsAllowed parameter to specify
labels, label category attributes, and frame attributes that can or cannot be modified by workers.
If you do not use this parameter, all labels, label category attributes, and frame attributes will be
adjustable.


To learn more about the editsAllowed parameter and configuring your label category configuration
file, see Label Category Configuration File Schema (p. 719).
• Use the 3D point cloud or video frame adjustment lambda ARNs for PreHumanTaskLambdaArn and
AnnotationConsolidationLambdaArn for both adjustment and verification labeling jobs:
• For 3D point clouds, the adjustment and verification labeling job lambda
function ARNs end with Adjustment3DPointCloudSemanticSegmentation,
Adjustment3DPointCloudObjectTracking, and
Adjustment3DPointCloudObjectDetection for 3D point cloud semantic segmentation, object
detection, and object tracking respectively.
• For video frames, the adjustment and verification labeling job lambda function ARNs end with
AdjustmentVideoObjectDetection and AdjustmentVideoObjectTracking for video frame
object detection and object tracking respectively.
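
The following is an illustrative label category configuration file for a 3D point cloud
verification job, built and uploaded with Python. The field names follow the schema referenced
above, but the label names and bucket are placeholders, so verify the structure against Label
Category Configuration File Schema (p. 719) before using it.

import json
import boto3

# Illustrative verification-job configuration. "none" makes each label
# read-only, as required for verification jobs; the label attribute name
# of the previous job goes in auditLabelAttributeName.
label_category_config = {
    "document-version": "2020-03-01",
    "auditLabelAttributeName": "original-point-cloud-labels",
    "labels": [
        {"label": "Car", "editsAllowed": "none"},
        {"label": "Pedestrian", "editsAllowed": "none"},
    ],
}

boto3.client("s3").put_object(
    Bucket="DOC-EXAMPLE-BUCKET",
    Key="label-category-config.json",
    Body=json.dumps(label_category_config),
)

Pass the resulting S3 URI in the LabelCategoryConfigS3Uri parameter when you create the job.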

Ground Truth stores the output data from a label verification or adjustment job in the S3 bucket that
you specified in the S3OutputPath parameter of the CreateLabelingJob operation. For more
information about the output data from a label verification or adjustment labeling job, see Label
Verification and Adjustment Data in the Output Manifest (p. 670).

Label Verification and Adjustment Data in the Output Manifest


Amazon SageMaker Ground Truth writes label verification data to the output manifest within the
metadata for the label. It adds two properties to the metadata:

• A type property, with a value of "groundtruth/label-verification".
• A worker-feedback property, with an array of comment values. This property is added when the
worker enters comments. If there are no comments, the field doesn't appear.

The following example output manifest shows how label verification data appears:

{
"source-ref":"S3 bucket location",
"verify-bounding-box":"1",
"verify-bounding-box-metadata":
{
"class-name": "bad",
"confidence": 0.93,
"type": "groundtruth/label-verification",
"job-name": "verify-bounding-boxes",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"worker-feedback": [
{"comment": "The bounding box on the bird is too wide on the right side."},
{"comment": "The bird on the upper right is not labeled."}
]
}
}

The worker output of adjustment tasks resembles the worker output of the original task, except that
it contains the adjusted values and an adjustment-status property with the value of adjusted or
unadjusted to indicate whether an adjustment was made.

For more examples of the output of different tasks, see Output Data (p. 776).


Cautions and Considerations


To get expected behavior when creating a label verification or adjustment job, carefully verify your input
data.

• If you are using image data, verify that your manifest file contains hexadecimal RGB color information.
• To save money on processing costs, filter your data to ensure you are not including unwanted objects
in your labeling job input manifest.
• Add required Amazon S3 permissions to ensure your input data is processed correctly.

When you create an adjustment or verification labeling job using the Ground Truth API, you must use a
different LabelAttributeName than the original labeling job.

Color Information Requirements for Semantic Segmentation Jobs


To properly reproduce color information in verification or adjustment tasks, the tool requires
hexadecimal RGB color information in the manifest (for example, #FFFFFF for white). When you set up a
Semantic Segmentation verification or adjustment job, the tool examines the manifest to determine if
this information is present. If it can't find it, Amazon SageMaker Ground Truth displays an error
message and ends the job setup.

In prior iterations of the Semantic Segmentation tool, category color information wasn't output in
hexadecimal RGB format to the output manifest. That feature was introduced to the output manifest
at the same time the verification and adjustment workflows were introduced. Therefore, older output
manifests aren't compatible with this new workflow.
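
A quick way to check an existing output manifest before reusing it is to scan each line for
hexadecimal color codes. The following sketch only looks for a #RRGGBB pattern anywhere in the
line; it does not validate the full manifest schema.

import re

HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}\b")

def manifest_has_hex_colors(path):
    # Return True if every non-empty manifest line contains a #RRGGBB code.
    with open(path) as manifest:
        return all(HEX_COLOR.search(line) for line in manifest if line.strip())

print(manifest_has_hex_colors("output.manifest"))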

Filter Your Data Before Starting the Job


Amazon SageMaker Ground Truth processes all objects in your input manifest. If you have a partially
labeled data set, you might want to create a custom manifest using an Amazon S3 Select query on your
input manifest. Unlabeled objects individually fail, but they don't cause the job to fail, and they might
incur processing costs. Filtering out objects you don't want verified reduces your costs.

If you create a verification job using the console, you can use the filtering tools provided there. If you
create jobs using the API, make filtering your data part of your workflow where needed.
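
If you filter with the API, Amazon S3 Select is one way to build the custom manifest. The sketch
below keeps only the manifest lines that already contain a given label attribute; the bucket, key,
and attribute name are placeholders.

import boto3

s3 = boto3.client("s3")

# Keep only lines where the original label attribute is present.
response = s3.select_object_content(
    Bucket="DOC-EXAMPLE-BUCKET",
    Key="original-job/output.manifest",
    ExpressionType="SQL",
    Expression='SELECT * FROM S3Object s WHERE s."original-labels" IS NOT MISSING',
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# Stream the matching records into a new manifest file.
with open("filtered.manifest", "wb") as out:
    for event in response["Payload"]:
        if "Records" in event:
            out.write(event["Records"]["Payload"])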

Creating Custom Labeling Workflows


This document guides you through the process of setting up a workflow with a custom labeling
template. To learn more about starting a labeling job, see Getting started (p. 527). In that section,
when you choose the Task type, select Custom labeling task, and then follow this section's instructions
to configure it.

Topics
• Step 1: Setting up your workforce (p. 672)
• Step 2: Creating your custom worker task template (p. 672)
• Step 3: Processing with AWS Lambda (p. 678)
• Demo Template: Annotation of Images with crowd-bounding-box (p. 692)
• Demo Template: Labeling Intents with crowd-classifier (p. 696)
• Custom Workflows via the API (p. 703)

For more information about creating custom labeling workflows, see Build a custom data labeling
workflow with Amazon SageMaker Ground Truth.


Step 1: Setting up your workforce


In this step you use the console to establish which worker type to use and make the necessary
sub-selections for the worker type. It assumes you have already completed the steps up to this
point in the Getting started (p. 527) section and have chosen Custom labeling task as the Task type.

To configure your workforce:

1. First choose an option from the Worker types. There are three types currently available:

• Public uses an on-demand workforce of independent contractors, powered by Amazon Mechanical
Turk. They are paid on a per-task basis.
• Private uses your employees or contractors for handling data that needs to stay within your
organization.
• Vendor uses third party vendors that specialize in providing data labeling services, available via
the AWS Marketplace.
2. If you choose the Public option, you are asked to set the number of workers per dataset object.
Having more than one worker perform the same task on the same object can help increase the
accuracy of your results. The default is three. You can raise or lower that depending on the accuracy
you need.

You are also asked to set a price per task by using a drop-down menu. The menu recommends price
points based on how long it will take to complete the task.

The recommended method to determine this is to first run a short test of your task with a private
workforce. The test provides a realistic estimate of how long the task takes to complete. You can
then select the range your estimate falls within on the Price per task menu. If your average time is
more than 5 minutes, consider breaking your task into smaller units.

Next
Step 2: Creating your custom worker task template (p. 672)

Step 2: Creating your custom worker task template


A worker task template is a file used by Ground Truth to customize the worker user interface (UI), or
human task UI. You can create a worker task template using HTML, CSS, JavaScript, Liquid template
language, and Crowd HTML Elements. Liquid is used to automate the template, and Crowd HTML
Elements can be used to include common annotation tools and provide the logic to submit to Ground
Truth.
• Starting with a base template (p. 673)
• Developing templates locally (p. 673)
• Using External Assets (p. 673)
• Track your variables (p. 673)
• A simple sample (p. 674)
• Adding automation with Liquid (p. 675)
• End-to-end demos (p. 677)
• Next (p. 678)

Use the following topics to learn how you can create a worker task template. You can see a repository of
example Ground Truth worker task templates on GitHub.


Starting with a base template


You can use a template editor in the Ground Truth console to start creating a template. This editor
includes a number of pre-designed base templates and an HTML and Crowd HTML Element autofill
feature.

To access the Ground Truth custom template editor:

1. Follow the instructions in Create a Labeling Job (Console) (p. 706) and select Custom for the
labeling job Task type.
2. When you select Next, you will be able to access the template editor and base templates in the
Custom labeling task setup section.
3. (Optional) Select a base template from the drop-down menu under Templates. If you prefer to
create a template from scratch, choose Custom from the drop-down menu for a minimal template
skeleton.

Developing templates locally


While you need to be in the console to test how your template will process incoming data, you can test
the look and feel of your template's HTML and custom elements in your browser by adding this code to
the top of your HTML file.

Example

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

This loads the necessary code to render the custom HTML elements. Use this if you want to develop your
template's look and feel in your preferred editor rather than in the console.

Remember, though, this will not parse your variables. You may want to replace them with sample
content while developing locally.
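
One way to preview locally, assuming your template is saved as template.html: serve the directory
over HTTP with Python and open http://localhost:8000/template.html in a browser. Opening the file
directly also works in most browsers, but a local server behaves more like the hosted task page.

# Minimal local preview server; run in the directory containing template.html.
from http.server import HTTPServer, SimpleHTTPRequestHandler

HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()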

Using External Assets


Amazon SageMaker Ground Truth custom templates allow external scripts and style sheets to be
embedded. For example, the following code block demonstrates how you would add a style sheet
located at https://www.example.com/my-enhancement-styles.css to your template.

Example

<script src="https://www.example.com/my-enhancement-script.js"></script>
<link rel="stylesheet" type="text/css" href="https://www.example.com/my-enhancement-styles.css">

If you encounter errors, ensure that your originating server is sending the correct MIME type and
encoding headers with the assets.

For example, the MIME and encoding types for remote scripts are: application/javascript;CHARSET=UTF-8.

The MIME and encoding type for remote stylesheets is: text/css;CHARSET=UTF-8.

Track your variables


In the process of building the sample below, there will be a step that adds variables to it to represent
the pieces of data that may change from task to task, worker to worker. If you're starting with one of the


sample templates, you will need to make sure you're aware of the variables it already uses. When you
create your pre-annotation AWS Lambda script, its output will need to contain values for any of those
variables you choose to keep.

The values you use for the variables can come from your manifest file. All the key-value pairs in your
data object are provided to your pre-annotation Lambda. If it's a simple pass-through script, matching
keys for values in your data object to variable names in your template is the easiest way to pass those
values through to the task forms your workers see.

A simple sample
All tasks begin and end with the <crowd-form> </crowd-form> elements. Like standard HTML
<form> elements, all of your form code should go between them.

For a simple tweet-analysis task, use the <crowd-classifier> element. It requires the following
attributes:

• name - the variable name to use for the result in the form output.
• categories - a JSON formatted array of the possible answers.
• header - a title for the annotation tool

As children of the <crowd-classifier> element, you must have three regions.

• <classification-target> - the text the worker will classify based on the options specified in the
categories attribute above.
• <full-instructions> - instructions that are available from the "View full instructions" link in the tool.
This can be left blank, but it is recommended that you give good instructions to get better results.
• <short-instructions> - a brief description of the task that appears in the tool's sidebar. This can be
left blank, but it is recommended that you give good instructions to get better results.

A simple version of this tool would look like this.

Example of using crowd-classifier

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="tweetFeeling"
categories="['positive','negative','neutral', 'unclear']"
header="Which term best describes this tweet?"
>
<classification-target>
My favorite football team won today!
Bring on the division finals!
</classification-target>

<full-instructions header="Sentiment Analysis Instructions">


Try to determine the sentiment the author
of the tweet is trying to express.
If none seem to match, choose "cannot determine."
</full-instructions>

<short-instructions>
Pick the term best describing the sentiment
of the tweet.
</short-instructions>

</crowd-classifier>
</crowd-form>


You can copy and paste the code into the editor in the Ground Truth labeling job creation workflow to
preview the tool, or try out a demo of this code on CodePen.

Adding automation with Liquid


Our custom template system uses Liquid for automation. It is an open source inline markup language. In
Liquid, the text between single curly braces and percent symbols is an instruction or tag that performs
an operation like control flow or iteration. Text between double curly braces is a variable or object that
outputs its value.

The most common use of Liquid will be to parse the data coming from your pre-annotation Lambda
and pull out the relevant variables to create the task. The taskInput object returned by your Pre-
annotation Lambda (p. 679) will be available as the task.input object in your templates.

The properties in your manifest's data objects are passed into your Pre-annotation Lambda (p. 679)
as the event.dataObject. A simple pass-through script simply returns that object as the taskInput
object. You would represent values from your manifest as variables as follows.

Example Manifest data object

{
"source": "This is a sample text for classification",
"labels": [ "angry" , "sad" , "happy" , "inconclusive" ],
"header": "What emotion is the speaker feeling?"
}

Example Sample HTML using variables

<crowd-classifier
name='tweetFeeling'
categories='{{ task.input.labels | to_json }}'
header='{{ task.input.header }}' >
<classification-target>


{{ task.input.source }}
</classification-target>

Note the addition of " | to_json" to the labels property above. That's a filter to turn the array into a
JSON representation of the array. Variable filters are explained in the next section.

The following list includes two types of Liquid tags that you may find useful to automate template
input data processing. If you select one of the following tag-types, you will be redirected to the Liquid
documentation.

• Control flow: Includes programming logic operators like if/else, unless, and case/when.
• Iteration: Enables you to run blocks of code repeatedly using statements like for loops.

For an example of an HTML template that uses Liquid elements to create a for loop, see translation-
review-and-correction.liquid.html in GitHub.

For more information and documentation, visit the Liquid homepage.

Variable filters
In addition to the standard Liquid filters and actions, Ground Truth offers a few additional filters. Filters
are applied by placing a pipe (|) character after the variable name, then specifying a filter name. Filters
can be chained in the form of:

Example

{{ <content> | <filter> | <filter> }}

Autoescape and explicit escape


By default, inputs will be HTML escaped to prevent confusion between your variable text and HTML.
You can explicitly add the escape filter to make it more obvious to someone reading the source of your
template that the escaping is being done.

escape_once
escape_once ensures that if you've already escaped your code, it doesn't get re-escaped on top of that.
For example, so that &amp; doesn't become &amp;amp;.

skip_autoescape
skip_autoescape is useful when your content is meant to be used as HTML. For example, you might
have a few paragraphs of text and some images in the full instructions for a bounding box.
Use skip_autoescape sparingly
The best practice in templates is to avoid passing in functional code or markup with
skip_autoescape unless you are absolutely sure you have strict control over what's being
passed. If you're passing user input, you could be opening your workers up to a Cross Site
Scripting attack.

to_json
to_json will encode what you feed it to JSON (JavaScript Object Notation). If you feed it an object, it
will serialize it.

grant_read_access
grant_read_access takes an S3 URI and encodes it into an HTTPS URL with a short-lived access token
for that resource. This makes it possible to display to workers photo, audio, or video objects stored in S3
buckets that are not otherwise publicly accessible.


Example of the filters


Input

auto-escape: {{ "Have you read 'James & the Giant Peach'?" }}
explicit escape: {{ "Have you read 'James & the Giant Peach'?" | escape }}
explicit escape_once: {{ "Have you read 'James &amp; the Giant Peach'?" | escape_once }}
skip_autoescape: {{ "Have you read 'James & the Giant Peach'?" | skip_autoescape }}
to_json: {{ jsObject | to_json }}
grant_read_access: {{ "s3://mybucket/myphoto.png" | grant_read_access }}

Example
Output

auto-escape: Have you read &#39;James &amp; the Giant Peach&#39;?
explicit escape: Have you read &#39;James &amp; the Giant Peach&#39;?
explicit escape_once: Have you read &#39;James &amp; the Giant Peach&#39;?
skip_autoescape: Have you read 'James & the Giant Peach'?
to_json: { "point_number": 8, "coords": [ 59, 76 ] }
grant_read_access: https://s3.amazonaws.com/mybucket/myphoto.png?<access token and other params>

Example of an automated classification template.


To automate the simple text classification sample, replace the tweet text with a variable.

The text classification template is below with automation added. The changes/additions are highlighted
in bold.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="tweetFeeling"
categories="['positive', 'negative', 'neutral', 'cannot determine']"
header="Which term best describes this tweet?"
>
<classification-target>
{{ task.input.source }}
</classification-target>

<full-instructions header="Analyzing a sentiment">


Try to determine the feeling the author
of the tweet is trying to express.
If none seem to match, choose "other."
</full-instructions>

<short-instructions>
Pick the term best describing the sentiment
of the tweet.
</short-instructions>

</crowd-classifier>
</crowd-form>

The tweet text that was in the prior sample is now replaced with an object. The task.input
object uses source (or another name you specify in your pre-annotation Lambda) as the property name
for the text, and it is inserted directly in the HTML by virtue of being between double curly braces.

End-to-end demos
You can view the following end-to-end demos, which include sample Lambda functions:


• Demo Template: Annotation of Images with crowd-bounding-box (p. 692)
• Demo Template: Labeling Intents with crowd-classifier (p. 696)

Next
Step 3: Processing with AWS Lambda (p. 678)

Step 3: Processing with AWS Lambda


In this step, you learn how to create and specify the two types of AWS Lambda functions that are
required to create a custom labeling workflow:

• Pre-annotation Lambda: This function is invoked for, and pre-processes, each data object sent to
your labeling job before the object is sent to workers.
• Post-annotation Lambda: This function processes the results once workers submit a task. If you specify
multiple workers per data object, this function may include logic to consolidate annotations.

If you are a new user of Lambda and Ground Truth, we recommend that you use the pages in this section
as follows:

1. First, review Pre-annotation and Post-annotation Lambda Function Requirements (p. 678).
2. Then, use the page Required Permissions To Use AWS Lambda With Ground Truth (p. 685) to learn
about security and permission requirements to use your pre-annotation and post-annotation Lambda
functions in a Ground Truth custom labeling job.
3. Next, you need to visit the Lambda console or use Lambda's APIs to create your functions. Use the
section Create Lambda Functions for a Custom Labeling Workflow (p. 689) to learn how to create
Lambda functions.
4. To learn how to test your Lambda functions, see Test Pre-Annotation and Post-Annotation Lambda
Functions (p. 689).
5. After you create pre-processing and post-processing Lambda functions, select them from the Lambda
functions section that comes after the code editor for your custom HTML in the Ground Truth console.
To learn how to use these functions in a CreateLabelingJob API request, see Create a Labeling Job
(API) (p. 709).

For a custom labeling workflow tutorial that includes example pre-annotation and post-annotation
Lambda functions, see Demo Template: Annotation of Images with crowd-bounding-box (p. 692).

Topics
• Pre-annotation and Post-annotation Lambda Function Requirements (p. 678)
• Required Permissions To Use AWS Lambda With Ground Truth (p. 685)
• Create Lambda Functions for a Custom Labeling Workflow (p. 689)
• Test Pre-Annotation and Post-Annotation Lambda Functions (p. 689)

Pre-annotation and Post-annotation Lambda Function Requirements


Use this section to learn about the syntax of the requests sent to pre-annotation and post-annotation
Lambda functions, and the response syntax that Ground Truth requires to run a custom labeling
workflow.

Topics


• Pre-annotation Lambda (p. 679)
• Post-annotation Lambda (p. 681)

Pre-annotation Lambda

Before a labeling task is sent to the worker, your pre-annotation Lambda function is invoked.

Ground Truth sends your Lambda function a JSON-formatted request to provide details about the
labeling job and the data object. The following code blocks contain the pre-annotation request
schemas. Each parameter is described below.

Data object identified with "source-ref"

{
"version": "2018-10-16",
"labelingJobArn": <labelingJobArn>
"dataObject" : {
"source-ref": <s3Uri>
}
}

Data object identified with "source"

{
"version": "2018-10-16",
"labelingJobArn": <labelingJobArn>
"dataObject" : {
"source": <string>
}
}

• version (string): This is a version number used internally by Ground Truth.
• labelingJobArn (string): This is the Amazon Resource Name, or ARN, of your labeling job. This
ARN can be used to reference the labeling job when using Ground Truth API operations such as
DescribeLabelingJob.
• dataObject (JSON object): This key contains a single JSON line, either from your input manifest
file or sent from Amazon SNS. The JSON line objects in your manifest can be up to 100 kilobytes in
size and contain a variety of data. For a very basic image annotation job, the dataObject JSON may
just contain a source-ref key, identifying the image to be annotated. If the data object (for example,
a line of text) is included directly in the input manifest file, the data object is identified with source.
If you create a verification or adjustment job, this line may contain label data and metadata from the
previous labeling job.

The following code blocks contain examples of a pre-annotation request. Each parameter in these
example requests is explained in the preceding list.

Data object identified with "source-ref"

{
"version": "2018-10-16",
"labelingJobArn": "arn:aws:sagemaker:<aws_region>:<aws_account_number>:labeling-
job/<labeling_job_name>"
"dataObject" : {
"source-ref": "s3://<input-data-bucket>/<data-object-file-name>"
}
}


Data object identified with "source"

{
"version": "2018-10-16",
"labelingJobArn": "arn:aws:sagemaker:<aws_region>:<aws_account_number>:labeling-
job/<labeling_job_name>"
"dataObject" : {
"source": "Sue purchased 10 shares of the stock on April 10th, 2020"
}
}

In return, Ground Truth requires a response formatted like the following:

Example of expected return data

{
"taskInput": <json object>,
"isHumanAnnotationRequired": <boolean> # Optional
}

In the previous example, the <json object> needs to contain all the data your custom worker task
template needs. If you're doing a bounding box task where the instructions stay the same all the time, it
may just be the HTTP(S) or Amazon S3 resource for your image file. If it's a sentiment analysis task and
different objects may have different choices, it is the object reference as a string and the choices as an
array of strings.
Implications of isHumanAnnotationRequired
This value is optional because it defaults to true. The primary use case for explicitly setting it is
when you want to exclude this data object from being labeled by human workers.

If you have a mix of objects in your manifest, with some requiring human annotation and some not
needing it, you can include a isHumanAnnotationRequired value in each data object. You can add
logic to your pre-annotation Lambda to dynamically determine if an object requires annotation, and set
this boolean value accordingly.
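
The following is a minimal sketch of that pattern. The auto-label-confidence key is a hypothetical
field in the manifest's data objects; the string values for isHumanAnnotationRequired follow the
examples in this section.

def lambda_handler(event, context):
    data_object = event["dataObject"]

    # Hypothetical convention: skip human annotation for data objects that
    # already carry a high-confidence machine label in the manifest.
    confidence = float(data_object.get("auto-label-confidence", 0))

    return {
        "taskInput": data_object,
        "isHumanAnnotationRequired": "false" if confidence >= 0.95 else "true",
    }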

Examples of Pre-annotation Lambda Functions

The following basic pre-annotation Lambda function accesses the JSON object in dataObject from the
initial request, and returns it in the taskInput parameter.

import json

def lambda_handler(event, context):
    return {
        "taskInput": event['dataObject']
    }

Assuming the input manifest file uses "source-ref" to identify data objects, the worker task template
used in the same labeling job as this pre-annotation Lambda must include a Liquid element like the
following to ingest dataObject:

{{ task.input.source-ref | grant_read_access }}

If the input manifest file used source to identify the data object, the worker task template can ingest
dataObject with the following:


{{ task.input.source }}

The following pre-annotation Lambda example includes logic to identify the key used in dataObject,
and to point to that data object using taskObject in the Lambda's return statement.

import json

def lambda_handler(event, context):

    # Event received
    print("Received event: " + json.dumps(event, indent=2))

    # Get source if specified
    source = event['dataObject']['source'] if "source" in event['dataObject'] else None

    # Get source-ref if specified
    source_ref = event['dataObject']['source-ref'] if "source-ref" in event['dataObject'] else None

    # If the source field is present, take that; otherwise take source-ref
    task_object = source if source is not None else source_ref

    # Build response object
    output = {
        "taskInput": {
            "taskObject": task_object
        },
        "isHumanAnnotationRequired": "true"
    }

    print(output)

    # If neither source nor source-ref is specified, mark the annotation failed
    if task_object is None:
        print(" Failed to pre-process {} !".format(event["labelingJobArn"]))
        output["isHumanAnnotationRequired"] = "false"

    return output

Post-annotation Lambda

When all workers have annotated the data object or when TaskAvailabilityLifetimeInSeconds
has been reached, whichever comes first, Ground Truth sends those annotations to your post-annotation
Lambda. This Lambda is generally used for Consolidate Annotations (p. 806).
Tip
To see an example of a post-consolidation Lambda function, see
annotation_consolidation_lambda.py in the aws-sagemaker-ground-truth-recipe GitHub
repository.

The following code block contains the post-annotation request schema. Each parameter is described in
the following bulleted list.

{
"version": "2018-10-16",
"labelingJobArn": <string>,
"labelCategories": [<string>],
"labelAttributeName": <string>,
"roleArn" : <string>,
"payload": {
"s3Uri": <string>
}
}


• version (string): A version number used internally by Ground Truth.
• labelingJobArn (string): The Amazon Resource Name, or ARN, of your labeling job. This
ARN can be used to reference the labeling job when using Ground Truth API operations such as
DescribeLabelingJob.
• labelCategories (list of strings): Includes the label categories and other attributes you either
specified in the console, or that you include in the label category configuration file.
• labelAttributeName (string): Either the name of your labeling job, or the label attribute name you
specify when you create the labeling job.
• roleArn (string): The Amazon Resource Name (ARN) of the IAM execution role you specify when you
create the labeling job.
• payload (JSON object): A JSON that includes an s3Uri key, which identifies the location of the
annotation data for that data object in Amazon S3. The second code block below shows an example of
this annotation file.

The following code block contains an example of a post-annotation request. Each parameter in this
example request is explained below the code block.

Example of a post-annotation Lambda request

{
"version": "2018-10-16",
"labelingJobArn": "arn:aws:sagemaker:us-west-2:111122223333:labeling-job/labeling-job-
name",
"labelCategories": ["Ex Category1","Ex Category2", "Ex Category3"],
"labelAttributeName": "labeling-job-attribute-name",
"roleArn" : "arn:aws:iam::111122223333:role/role-name",
"payload": {
"s3Uri": "s3://DOC-EXAMPLE-BUCKET/annotations.json"
}
}

Note
If no worker works on the data object and TaskAvailabilityLifetimeInSeconds has been
reached, the data object is marked as failed and not included as part of post-annotation Lambda
invocation.

The following code block contains the payload schema. This is the file that is indicated by the
s3Uri parameter in the post-annotation Lambda request payload JSON object. For example, if the
previous code block is the post-annotation Lambda request, the following annotation file is located at
s3://DOC-EXAMPLE-BUCKET/annotations.json.

Each parameter is described in the following bulleted list.

Example of an annotation file

[
{
"datasetObjectId": <string>,
"dataObject": {
"s3Uri": <string>,
"content": <string>
},
"annotations": [{
"workerId": <string>,
"annotationData": {
"content": <string>,


"s3Uri": <string>
}
}]
}
]

• datasetObjectId (string): Identifies a unique ID that Ground Truth assigns to each data object you
send to the labeling job.
• dataObject (JSON object): The data object that was labeled. If the data object is included in the
input manifest file and identified using the source key (for example, a string), dataObject includes a
content key, which identifies the data object. Otherwise, the location of the data object (for example,
a link or S3 URI) is identified with s3Uri.
• annotations (list of JSON objects): This list contains a single JSON object for each annotation
submitted by workers for that dataObject. A single JSON object contains a unique workerId
that can be used to identify the worker that submitted that annotation. The annotationData key
contains one of the following:
• content (string): Contains the annotation data.
• s3Uri (string): Contains an S3 URI that identifies the location of the annotation data.

The following examples show the content that you may find in payload for different types of
annotation.

Named Entity Recognition Payload

[
{
"datasetObjectId": "1",
"dataObject": {
"content": "Sift 3 cups of flour into the bowl."
},
"annotations": [
{
"workerId": "private.us-west-2.ef7294f850a3d9d1",
"annotationData": {
"content": "{\"crowd-entity-annotation\":{\"entities\":[{\"endOffset
\":4,\"label\":\"verb\",\"startOffset\":0},{\"endOffset\":6,\"label\":\"number
\",\"startOffset\":5},{\"endOffset\":20,\"label\":\"object\",\"startOffset\":15},
{\"endOffset\":34,\"label\":\"object\",\"startOffset\":30}]}}"
}
}
]
}
]

Semantic Segmentation Payload

[
{
"datasetObjectId": "2",
"dataObject": {
"s3Uri": "s3://DOC-EXAMPLE-BUCKET/gt-input-data/images/bird3.jpg"
},
"annotations": [
{
"workerId": "private.us-west-2.ab1234c5678a919d0",
"annotationData": {
"content": "{\"crowd-semantic-segmentation\":{\"inputImageProperties\":
{\"height\":2000,\"width\":3020},\"labelMappings\":{\"Bird\":{\"color\":\"#2ca02c\"}},
\"labeledImage\":{\"pngImageData\":\"iVBOR...\"}}}"


}
}
]
}
]

Bounding Box Payload

[
{
"datasetObjectId": "0",
"dataObject": {
"s3Uri": "s3://DOC-EXAMPLE-BUCKET/gt-input-data/images/bird1.jpg"
},
"annotations": [
{
"workerId": "private.us-west-2.ab1234c5678a919d0",
"annotationData": {
"content": "{\"boundingBox\":{\"boundingBoxes\":[{\"height\":2052,\"label
\":\"Bird\",\"left\":583,\"top\":302,\"width\":1375}],\"inputImageProperties\":
{\"height\":2497,\"width\":3745}}}"
}
}
]
}
]

Your post-annotation Lambda function may contain logic similar to the following to
loop through and access all annotations contained in the request. For a full example, see
annotation_consolidation_lambda.py in the aws-sagemaker-ground-truth-recipe GitHub repository. In
this GitHub example, you must add your own annotation consolidation logic.

for i in range(len(annotations)):
    worker_id = annotations[i]["workerId"]
    annotation_content = annotations[i]['annotationData'].get('content')
    annotation_s3_uri = annotations[i]['annotationData'].get('s3Uri')
    annotation = annotation_content if annotation_s3_uri is None else s3_client.get_object_from_s3(annotation_s3_uri)
    annotation_from_single_worker = json.loads(annotation)

    print("{} Received Annotations from worker [{}] is [{}]"
          .format(log_prefix, worker_id, annotation_from_single_worker))

Tip
When you run consolidation algorithms on the data, you can use an AWS database service to
store results, or you can pass the processed results back to Ground Truth. The data you return
to Ground Truth is stored in consolidated annotation manifests in the S3 bucket specified for
output during the configuration of the labeling job.

In return, Ground Truth requires a response formatted like the following:

Example of expected return data

[
{
"datasetObjectId": <string>,
"consolidatedAnnotation": {
"content": {
"<labelattributename>": {

# ... label content
}
}
}
},
{
"datasetObjectId": <string>,
"consolidatedAnnotation": {
"content": {
"<labelattributename>": {
# ... label content
}
}
}
}
.
.
.
]

At this point, all the data you're sending to your S3 bucket, other than the datasetObjectId, is in the
content object.

When you return annotations in content, this results in an entry in your job's output manifest like the
following:

Example of label format in output manifest

{ "source-ref"/"source" : "<s3uri or content>",


"<labelAttributeName>": {
# ... label content from you
},
"<labelAttributeName>-metadata": { # This will be added by Ground Truth
"job_name": <labelingJobName>,
"type": "groundTruth/custom",
"human-annotated": "yes",
"creation_date": <date> # Timestamp of when received from Post-labeling Lambda
}
}

Because of the potentially complex nature of a custom template and the data it collects, Ground Truth
does not offer further processing of the data.
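
Putting the pieces above together, the following is a minimal sketch of a pass-through
post-annotation Lambda. It keeps the first worker's annotation for each data object instead of
running a real consolidation algorithm, and it assumes the Lambda's own execution role can read the
payload file directly (production code typically assumes the roleArn from the request instead, as
described in the permissions section that follows).

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Fetch the annotation payload file that Ground Truth wrote to S3.
    payload_uri = event["payload"]["s3Uri"]
    bucket, key = payload_uri.replace("s3://", "").split("/", 1)
    payload = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    label_attribute_name = event["labelAttributeName"]
    consolidated = []
    for dataset_object in payload:
        # Naive consolidation: keep the first worker's raw annotation.
        first = dataset_object["annotations"][0]["annotationData"]["content"]
        consolidated.append({
            "datasetObjectId": dataset_object["datasetObjectId"],
            "consolidatedAnnotation": {
                "content": {
                    label_attribute_name: json.loads(first)
                }
            }
        })
    return consolidated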

Required Permissions To Use AWS Lambda With Ground Truth


You may need to configure some or all of the following to create and use AWS Lambda with Ground Truth.

• You need to grant an IAM role or user (collectively, an IAM entity) permission to create the pre-
annotation and post-annotation Lambda functions using AWS Lambda, and to choose them when
creating the labeling job.
• The IAM execution role specified when the labeling job is configured needs permission to invoke the
pre-annotation and post-annotation Lambda functions.
• The post-annotation Lambda functions may need permission to access Amazon S3.

Use the following sections to learn how to create the IAM entities and grant permissions described
above.

Topics
• Grant Permission to Create and Select an AWS Lambda Function (p. 686)


• Grant IAM Execution Role Permission to Invoke AWS Lambda Functions (p. 686)
• Grant Post-Annotation Lambda Permissions to Access Annotation (p. 687)

Grant Permission to Create and Select an AWS Lambda Function

If you do not require granular permissions to develop pre-annotation and post-annotation Lambda
functions, you can attach the AWS managed policy AWSLambda_FullAccess to a user or role. This
policy grants broad permissions to use all Lambda features, as well as permission to perform actions in
other AWS services with which Lambda interacts.

To create a more granular policy for security-sensitive use cases, refer to Identity-based IAM
policies for Lambda in the AWS Lambda Developer Guide to learn how to create an IAM policy that
fits your use case.

Policies to Use the Lambda Console

If you want to grant an IAM entity permission to use the Lambda console, see Using the Lambda console
in the AWS Lambda Developer Guide.

Additionally, if you want the user to be able to access and deploy the Ground Truth starter pre-
annotation and post-annotation functions using the AWS Serverless Application Repository in the
Lambda console, you must specify the <aws-region> where you want to deploy the functions (this
should be the same AWS Region used to create the labeling job), and add the following policy to the IAM
role.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"serverlessrepo:ListApplicationVersions",
"serverlessrepo:GetApplication",
"serverlessrepo:CreateCloudFormationTemplate"
],
"Resource": "arn:aws:serverlessrepo:<aws-region>:838997950401:applications/aws-
sagemaker-ground-truth-recipe"
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "serverlessrepo:SearchApplications",
"Resource": "*"
}
]
}

Policies to See Lambda Functions in the Ground Truth Console

To grant an IAM entity permission to view Lambda functions in the Ground Truth console when the user
is creating a custom labeling job, the entity must have the permissions described in Grant IAM Permission
to Use the Amazon SageMaker Ground Truth Console (p. 818), including the permissions described in
the section Custom Labeling Workflow Permissions (p. 821).

Grant IAM Execution Role Permission to Invoke AWS Lambda Functions

If you add the IAM managed policy AmazonSageMakerGroundTruthExecution to the IAM execution role
used to create the labeling job, this role has permission to list and invoke Lambda functions with
one of the following strings in the function name: GtRecipe, SageMaker, Sagemaker, sagemaker, or
LabelingFunction.

If the pre-annotation or post-annotation Lambda function names do not include one of the
terms in the preceding paragraph, or if you require more granular permission than those in the
AmazonSageMakerGroundTruthExecution managed policy, you can add a policy similar to the
following to give the execution role permission to invoke pre-annotation and post-annotation functions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": [
                "arn:aws:lambda:<region>:<account-id>:function:<pre-annotation-lambda-name>",
                "arn:aws:lambda:<region>:<account-id>:function:<post-annotation-lambda-name>"
            ]
        }
    ]
}

Grant Post-Annotation Lambda Permissions to Access Annotation

As described in Post-annotation Lambda (p. 681), the post-annotation Lambda request includes the
location of the annotation data in Amazon S3. This location is identified by the s3Uri string in the
payload object. To process the annotations as they come in, even for a simple pass through function,
you need to assign the necessary permissions to the post-annotation Lambda execution role to read files
from Amazon S3.

There are many ways that you can configure your Lambda to access annotation data in Amazon S3. Two
common ways are:

• Allow the Lambda execution role to assume the SageMaker execution role identified in roleArn in the
post-annotation Lambda request. This SageMaker execution role is the one used to create the labeling
job, and has access to the Amazon S3 output bucket where the annotation data is stored.
• Grant the Lambda execution role permission to access the Amazon S3 output bucket directly.

Use the following sections to learn how to configure these options.

Grant Lambda Permission to Assume SageMaker Execution Role

To allow a Lambda function to assume a SageMaker execution role, you must attach a policy to the
Lambda function's execution role, and modify the trust relationship of the SageMaker execution role to
allow Lambda to assume it.

1. Attach the following IAM policy to your Lambda function's execution role to assume the SageMaker
execution role identified in Resource. Replace 222222222222 with an AWS account ID. Replace sm-
execution-role with the name of the assumed role.

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": "sts:AssumeRole",


"Resource": "arn:aws:iam::222222222222:role/sm-execution-role"
}
}

2. Modify the trust policy of the SageMaker execution role to include the following Statement. Replace
222222222222 with an AWS account ID. Replace my-lambda-execution-role with the name of
the assumed role.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::222222222222:role/my-lambda-execution-role"
},
"Action": "sts:AssumeRole"
}
]
}
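
With the policy and trust relationship in place, the Lambda can exchange its own credentials for
the SageMaker execution role at run time. The following is a minimal Boto3 sketch; role_arn is the
value of the roleArn field in the post-annotation request.

import boto3

def s3_client_with_sagemaker_role(role_arn):
    # Assume the SageMaker execution role, then build an S3 client
    # from the temporary credentials it returns.
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="ground-truth-post-annotation",
    )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )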

Grant Lambda Execution Role Permission to Access S3

You can add a policy similar to the following to the post-annotation Lambda function execution role to
give it S3 read permissions. Replace DOC-EXAMPLE-BUCKET with the name of the output bucket you
specify when you create a labeling job.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
}
]
}

To add S3 read permissions to a Lambda execution role in the Lambda console, use the following
procedure.

Add S3 read permissions to post-annotation Lambda:

1. Open the Functions page in the Lambda console.
2. Choose the name of the post-annotation function.
3. Choose Configuration and then choose Permissions.
4. Select the Role name and the summary page for that role opens in the IAM console in a new tab.
5. Select Attach policies.
6. Do one of the following:

• Search for and select AmazonS3ReadOnlyAccess to give the function permission to read all
buckets and objects in the account.
• If you require more granular permissions, select Create policy and use the policy example in
the preceding section to create a policy. Note that you must navigate back to the execution role
summary page after you create the policy.


7. If you used the AmazonS3ReadOnlyAccess managed policy, select Attach policy.

If you created a new policy, navigate back to the Lambda execution role summary page and attach
the policy you just created.

Create Lambda Functions for a Custom Labeling Workflow


You can create a Lambda function using the Lambda console, the AWS CLI, or an AWS SDK in a
supported programming language of your choice. Use the AWS Lambda Developer Guide to learn more
about each of these options:

• To learn how to create a Lambda function using the console, see Create a Lambda function with the
console.
• To learn how to create a Lambda function using the AWS CLI, see Using AWS Lambda with the AWS
Command Line Interface.
• Select the relevant section in the table of contents to learn more about working with Lambda in the
language of your choice. For example, select Working with Python to learn more about using Lambda
with the AWS SDK for Python (Boto3).

Ground Truth provides pre-annotation and post-annotation templates through an AWS Serverless
Application Repository (SAR) recipe. Use the following procedure to select the Ground Truth recipe in the
Lambda console.

Use the Ground Truth SAR recipe to create pre-annotation and post-annotation Lambda
functions:

1. Open the Functions page on the Lambda console.
2. Select Create function.
3. Select Browse serverless app repository.
4. In the search text box, enter aws-sagemaker-ground-truth-recipe and select that app.
5. Select Deploy. The app may take a couple of minutes to deploy.

Once the app deploys, two functions appear in the Functions section of the Lambda console:
serverlessrepo-aws-sagema-GtRecipePreHumanTaskFunc-<id> and serverlessrepo-
aws-sagema-GtRecipeAnnotationConsol-<id>.
6. Select one of these functions and add your custom logic in the Code section.
7. When you are finished making changes, select Deploy to deploy them.

Test Pre-Annotation and Post-Annotation Lambda Functions


You can test your pre-annotation and post-annotation Lambda functions in the Lambda console. If
you are a new user of Lambda, you can learn how to test, or invoke, your Lambda functions in the
console using the Create a Lambda function with the console tutorial in the AWS Lambda Developer
Guide.

You can use the sections on this page to learn how to test the Ground Truth pre-annotation and post-
annotation templates provided through an AWS Serverless Application Repository (SAR).
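
If you prefer to invoke a function programmatically rather than in the console, the following is a
minimal Boto3 sketch. The function name suffix is a placeholder; copy the exact name from your
Functions list. The request body reuses the pre-annotation example from earlier in this section.

import json
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    # Placeholder name; copy the exact function name from the Lambda console.
    FunctionName="serverlessrepo-aws-sagema-GtRecipePreHumanTaskFunc-EXAMPLE",
    Payload=json.dumps({
        "version": "2018-10-16",
        "labelingJobArn": "arn:aws:sagemaker:us-east-2:123456789012:labeling-job/example-job",
        "dataObject": {"source-ref": "s3://sagemakerexample/object_to_annotate.jpg"},
    }).encode("utf-8"),
)
print(json.loads(response["Payload"].read()))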

Topics
• Prerequisites (p. 690)
• Test the Pre-annotation Lambda Function (p. 690)
• Test the Post-Annotation Lambda Function (p. 691)


Prerequisites
You must do the following to use the tests described on this page.

• You need access to the Lambda console, and you need permission to create and invoke Lambda
functions. To learn how to set up these permissions, see Grant Permission to Create and Select an AWS
Lambda Function (p. 686).
• If you have not deployed the Ground Truth SAR recipe, use the procedure in Create Lambda Functions
for a Custom Labeling Workflow (p. 689) to do so.
• To test the post-annotation Lambda function, you must have a data file in Amazon S3 with sample
annotation data. For a simple test, you can copy and paste the following code into a file, save it
as sample-annotations.json, and upload the file to Amazon S3 (a Boto3 upload sketch follows this
list). Note the S3 URI of this file—you need this information to configure the post-annotation
Lambda test.

[{"datasetObjectId":"0","dataObject":{"content":"To train a machine learning model,


you need a large, high-quality, labeled dataset. Ground Truth helps you build
high-quality training datasets for your machine learning models."},"annotations":
[{"workerId":"private.us-west-2.0123456789","annotationData":{"content":"{\"crowd-
entity-annotation\":{\"entities\":[{\"endOffset\":8,\"label\":\"verb\",\"startOffset
\":3},{\"endOffset\":27,\"label\":\"adjective\",\"startOffset\":11},{\"endOffset\":33,
\"label\":\"object\",\"startOffset\":28},{\"endOffset\":51,\"label\":\"adjective\",
\"startOffset\":46},{\"endOffset\":65,\"label\":\"adjective\",\"startOffset\":53},
{\"endOffset\":74,\"label\":\"adjective\",\"startOffset\":67},{\"endOffset\":82,
\"label\":\"adjective\",\"startOffset\":75},{\"endOffset\":102,\"label\":\"verb
\",\"startOffset\":97},{\"endOffset\":112,\"label\":\"verb\",\"startOffset\":107},
{\"endOffset\":125,\"label\":\"adjective\",\"startOffset\":113},{\"endOffset\":134,
\"label\":\"adjective\",\"startOffset\":126},{\"endOffset\":143,\"label\":\"object
\",\"startOffset\":135},{\"endOffset\":169,\"label\":\"adjective\",\"startOffset
\":153},{\"endOffset\":176,\"label\":\"object\",\"startOffset\":170}]}}"}}]},
{"datasetObjectId":"1","dataObject":{"content":"Sift 3 cups of flour into the
bowl."},"annotations":[{"workerId":"private.us-west-2.0123456789","annotationData":
{"content":"{\"crowd-entity-annotation\":{\"entities\":[{\"endOffset\":4,\"label
\":\"verb\",\"startOffset\":0},{\"endOffset\":6,\"label\":\"number\",\"startOffset
\":5},{\"endOffset\":20,\"label\":\"object\",\"startOffset\":15},{\"endOffset\":34,
\"label\":\"object\",\"startOffset\":30}]}}"}}]},{"datasetObjectId":"2","dataObject":
{"content":"Jen purchased 10 shares of the stock on Janurary 1st, 2020."},"annotations":
[{"workerId":"private.us-west-2.0123456789","annotationData":{"content":"{\"crowd-
entity-annotation\":{\"entities\":[{\"endOffset\":3,\"label\":\"person\",\"startOffset
\":0},{\"endOffset\":13,\"label\":\"verb\",\"startOffset\":4},{\"endOffset\":16,
\"label\":\"number\",\"startOffset\":14},{\"endOffset\":58,\"label\":\"date\",
\"startOffset\":40}]}}"}}]},{"datasetObjectId":"3","dataObject":{"content":"The
narrative was interesting, however the character development was weak."},"annotations":
[{"workerId":"private.us-west-2.0123456789","annotationData":{"content":"{\"crowd-entity-
annotation\":{\"entities\":[{\"endOffset\":29,\"label\":\"adjective\",\"startOffset
\":18},{\"endOffset\":73,\"label\":\"adjective\",\"startOffset\":69}]}}"}}]}]

• You must use the directions in Grant Post-Annotation Lambda Permissions to Access
Annotation (p. 687) to give your post-annotation Lambda function's execution role permission
to assume the SageMaker execution role you use to create the labeling job. The post-annotation
Lambda function uses the SageMaker execution role to access the annotation data file, sample-
annotations.json, in S3.
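
A minimal upload sketch, assuming the file is in the current directory and DOC-EXAMPLE-BUCKET is
your bucket:

import boto3

# The file's S3 URI is then s3://DOC-EXAMPLE-BUCKET/sample-annotations.json
boto3.client("s3").upload_file(
    "sample-annotations.json", "DOC-EXAMPLE-BUCKET", "sample-annotations.json"
)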

Test the Pre-annotation Lambda Function


Use the following procedure to test the pre-annotation Lambda function created when you deployed the
Ground Truth AWS Serverless Application Repository (SAR) recipe.

Test the Ground Truth SAR recipe pre-annotation Lambda function

1. Open the Functions page in the Lambda console.


2. Select the pre-annotation function that was deployed from the Ground Truth SAR
recipe. The name of this function is similar to serverlessrepo-aws-sagema-
GtRecipePreHumanTaskFunc-<id>.
3. In the Code source section, select the arrow next to Test.
4. Select Configure test event.
5. Keep the Create new test event option selected.
6. Under Event template, select SageMaker Ground Truth PreHumanTask.
7. Give your test an Event name.
8. Select Create.
9. Select the arrow next to Test again and you should see that the test you created is selected, which is
indicated with a dot by the event name. If it is not selected, select it.
10. Select Test to run the test.

After you run the test, you can see the Execution results. In the Function logs, you should see a
response similar to the following:

START RequestId: cd117d38-8365-4e1a-bffb-0dcd631a878f Version: $LATEST
Received event: {
"version": "2018-10-16",
"labelingJobArn": "arn:aws:sagemaker:us-east-2:123456789012:labeling-job/example-job",
"dataObject": {
"source-ref": "s3://sagemakerexample/object_to_annotate.jpg"
}
}
{'taskInput': {'taskObject': 's3://sagemakerexample/object_to_annotate.jpg'},
'isHumanAnnotationRequired': 'true'}
END RequestId: cd117d38-8365-4e1a-bffb-0dcd631a878f
REPORT RequestId: cd117d38-8365-4e1a-bffb-0dcd631a878f Duration: 0.42 ms Billed Duration: 1
ms Memory Size: 128 MB Max Memory Used: 43 MB

In this response, we can see the Lambda function's output matches the required pre-annotation response
syntax:

{'taskInput': {'taskObject': 's3://sagemakerexample/object_to_annotate.jpg'},
'isHumanAnnotationRequired': 'true'}

Test the Post-Annotation Lambda Function

Use the following procedure to test the post-annotation Lambda function created when you deployed
the Ground Truth AWS Serverless Application Repository (SAR) recipe.

Test the Ground Truth SAR recipe post-annotation Lambda

1. Open the Functions page in the Lambda console.
2. Select the post-annotation function that was deployed from the Ground Truth SAR
recipe. The name of this function is similar to serverlessrepo-aws-sagema-
GtRecipeAnnotationConsol-<id>.
3. In the Code source section, select the arrow next to Test.
4. Select Configure test event.
5. Keep the Create new test event option selected.
6. Under Event template, select SageMaker Ground Truth AnnotationConsolidation.
7. Give your test an Event name.
8. Modify the template code provided as follows:


• Replace the Amazon Resource Name (ARN) in roleArn with the ARN of the SageMaker execution
role you used to create the labeling job.
• Replace the S3 URI in s3Uri with the URI of the sample-annotations.json file you added to
Amazon S3.

After you make these modifications, your test should look similar to the following:

{
"version": "2018-10-16",
"labelingJobArn": "arn:aws:sagemaker:us-east-2:123456789012:labeling-job/example-
job",
"labelAttributeName": "example-attribute",
"roleArn": "arn:aws:iam::222222222222:role/sm-execution-role",
"payload": {
"s3Uri": "s3://your-bucket/sample-annotations.json"
}
}

9. Select Create.
10. Select the arrow next to Test again and you should see that the test you created is selected, which is
indicated with a dot by the event name. If it is not selected, select it.
11. Select Test to run the test.

After you run the test, you should see a -- Consolidated Output -- section in the Function Logs,
which contains a list of all annotations included in sample-annotations.json.

Demo Template: Annotation of Images with crowd-bounding-box
When you chose to use a custom template as your task type in the Amazon SageMaker Ground Truth
console, you reach the Custom labeling task panel. There you can choose from multiple base templates.
The templates represent some of the most common tasks and provide a sample to work from as you
create your customized labeling task's template. If you are not using the console, or as an additional
recourse, see Amazon SageMaker Ground Truth Sample Task UIs for a repository of demo templates for
a variety of labeling job task types.

This demonstration works with the BoundingBox template. The demonstration also works with the AWS
Lambda functions needed for processing your data before and after the task. In the GitHub repository
above, to find templates that work with AWS Lambda functions, look for {{ task.input.<property
name> }} in the template.

Topics
• Starter Bounding Box custom template (p. 692)
• Your own Bounding Box custom template (p. 693)
• Your manifest file (p. 694)
• Your pre-annotation Lambda function (p. 695)
• Your post-annotation Lambda function (p. 695)
• The output of your labeling job (p. 696)

Starter Bounding Box custom template


This is the starter bounding box template that is provided.


<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-bounding-box
name="boundingBox"
src="{{ task.input.taskObject | grant_read_access }}"
header="{{ task.input.header }}"
labels="{{ task.input.labels | to_json | escape }}"
>

<!-- The <full-instructions> tag is where you will define the full instructions of your
task. -->
<full-instructions header="Bounding Box Instructions" >
<p>Use the bounding box tool to draw boxes around the requested target of interest:</p>
<ol>
<li>Draw a rectangle using your mouse over each instance of the target.</li>
<li>Make sure the box does not cut into the target, leave a 2 - 3 pixel margin</li>
<li>
When targets are overlapping, draw a box around each object,
include all contiguous parts of the target in the box.
Do not include parts that are completely overlapped by another object.
</li>
<li>
Do not include parts of the target that cannot be seen,
even though you think you can interpolate the whole shape of the target.
</li>
<li>Avoid shadows, they're not considered as a part of the target.</li>
<li>If the target goes off the screen, label up to the edge of the image.</li>
</ol>
</full-instructions>

<!-- The <short-instructions> tag allows you to specify instructions that are displayed
in the left hand side of the task interface.
It is a best practice to provide good and bad examples in this section for quick
reference. -->
<short-instructions>
Use the bounding box tool to draw boxes around the requested target of interest.
</short-instructions>
</crowd-bounding-box>
</crowd-form>

The custom templates use the Liquid template language, and each of the items between double
curly braces is a variable. The pre-annotation AWS Lambda function should provide an object named
taskInput and that object's properties can be accessed as {{ task.input.<property name> }} in
your template.
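
For example, if your pre-annotation Lambda returned the following response (a minimal sketch; the
animal property is a hypothetical example), the template could reference that value as
{{ task.input.animal }}:

{
    "taskInput": {
        "animal": "horse"
    }
}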

Your own Bounding Box custom template


As an example, assume you have a large collection of animal photos in which you know the kind of
animal in an image from a prior image-classification job. Now you want to have a bounding box drawn
around it.

In the starter sample, there are three variables: taskObject, header, and labels.

Each of these is used in a different part of the bounding box task interface.

• taskObject is an HTTP(S) URL or S3 URI for the photo to be annotated. The added |
grant_read_access is a filter that will convert an S3 URI to an HTTPS URL with short-lived access to
that resource. If you're using an HTTP(S) URL, it's not needed.
• header is the text above the photo to be labeled, something like "Draw a box around the bird in the
photo."


• labels is an array, represented as ['item1', 'item2', ...]. These are labels that can be
assigned by the worker to the different boxes they draw. You can have one or many.

Each of the variable names comes from the JSON object in the response from your pre-annotation
Lambda. The names above are merely suggestions. Use whatever variable names make sense to you and
will promote code readability among your team.
Only use variables when necessary
If a field will not change, you can remove that variable from the template and replace it with
that text. Otherwise, you have to repeat that text as a value in each object in your manifest or
code it into your pre-annotation Lambda function.

Example: Final Customized Bounding Box Template


To keep things simple, this template will have one variable, one label, and very basic instructions.
Assuming your manifest has an "animal" property in each data object, that value can be re-used in two
parts of the template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-bounding-box
name="boundingBox"
labels="[ '{{ task.input.animal }}' ]"
src="{{ task.input.source-ref | grant_read_access }}"
header="Draw a box around the {{ task.input.animal }}."
>
<full-instructions header="Bounding Box Instructions" >
<p>Draw a bounding box around the {{ task.input.animal }} in the image. If
there is more than one {{ task.input.animal }} per image, draw a bounding
box around the largest one.</p>
<p>The box should be tight around the {{ task.input.animal }} with
no more than a couple of pixels of buffer around the
edges.</p>
<p>If the image does not contain a {{ task.input.animal }}, check the <strong>
Nothing to label</strong> box.</p>
</full-instructions>
<short-instructions>
<p>Draw a bounding box around the {{ task.input.animal }} in each image. If
there is more than one {{ task.input.animal }} per image, draw a bounding
box around the largest one.</p>
</short-instructions>
</crowd-bounding-box>
</crowd-form>

Note the re-use of {{ task.input.animal }} throughout the template. If your manifest had
all of the animal names beginning with a capital letter, you could use {{ task.input.animal |
downcase }}, one of Liquid's built-in filters, in sentences where the name needs to be presented in
lowercase.

Your manifest file


Your manifest file should provide the variable values you're using in your template. You can do some
transformation of your manifest data in your pre-annotation Lambda, but if you don't need to, you
maintain a lower risk of errors and your Lambda will run faster. Here's a sample manifest file for the
template.

{"source-ref": "<S3 image URI>", "animal": "horse"}


{"source-ref": "<S3 image URI>", "animal" : "bird"}
{"source-ref": "<S3 image URI>", "animal" : "dog"}
{"source-ref": "<S3 image URI>", "animal" : "cat"}


Your pre-annotation Lambda function


As part of the job set-up, provide the ARN of an AWS Lambda function that can be called to process your
manifest entries and pass them to the template engine.
Naming your Lambda function
The best practice in naming your function is to use one of the following four strings as part of
the function name: SageMaker, Sagemaker, sagemaker, or LabelingFunction. This applies
to both your pre-annotation and post-annotation functions.

When you're using the console, if you have AWS Lambda functions that are owned by your account, a
drop-down list of functions meeting the naming requirements will be provided to choose one.

In this very basic example, you're just passing through the information from the manifest without doing
any additional processing on it. This sample pre-annotation function is written for Python 3.7.

import json

def lambda_handler(event, context):
    # Pass the manifest data object straight through to the
    # template engine as the taskInput object.
    return {
        "taskInput": event['dataObject']
    }

The JSON object from your manifest will be provided as a child of the event object. The properties
inside the taskInput object will be available as variables to your template, so simply setting the value
of taskInput to event['dataObject'] will pass all the values from your manifest object to your
template without having to copy them individually. If you wish to send more values to the template, you
can add them to the taskInput object.
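
For example, a sketch of a pre-annotation function that adds a derived value alongside the manifest
properties might look like the following. The header property is a hypothetical addition, not part of
the Ground Truth contract; only the taskInput wrapper is required.

import json

def lambda_handler(event, context):
    # Start with the data object from the manifest.
    task_input = event['dataObject']

    # Hypothetical example: add a derived value that the template
    # can reference as {{ task.input.header }}.
    animal = task_input.get('animal', 'target')
    task_input['header'] = "Draw a box around the " + animal + "."

    return {
        "taskInput": task_input
    }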

Your post-annotation Lambda function


As part of the job set-up, provide the ARN of an AWS Lambda function that can be called to process the
form data when a worker completes a task. This can be as simple or complex as you want. If you want
to do answer consolidation and scoring as it comes in, you can apply the scoring and/or consolidation
algorithms of your choice. If you want to store the raw data for offline processing, that is an option.
Provide permissions to your post-annotation Lambda
The annotation data will be in a file designated by the s3Uri string in the payload object. To
process the annotations as they come in, even for a simple pass through function, you need to
assign S3ReadOnly access to your Lambda so it can read the annotation files.
In the Console page for creating your Lambda, scroll to the Execution role panel. Select Create
a new role from one or more templates. Give the role a name. From the Policy templates
drop-down, choose Amazon S3 object read-only permissions. Save the Lambda and the role
will be saved and selected.

The following sample is written for Python 3.7.

import json
import boto3
from urllib.parse import urlparse

def lambda_handler(event, context):
    consolidated_labels = []

    # Fetch the annotation file that Ground Truth wrote to S3.
    parsed_url = urlparse(event['payload']['s3Uri'])
    s3 = boto3.client('s3')
    textFile = s3.get_object(Bucket=parsed_url.netloc, Key=parsed_url.path[1:])
    filecont = textFile['Body'].read()
    annotations = json.loads(filecont)

    for dataset in annotations:
        for annotation in dataset['annotations']:
            new_annotation = json.loads(annotation['annotationData']['content'])
            label = {
                'datasetObjectId': dataset['datasetObjectId'],
                'consolidatedAnnotation': {
                    'content': {
                        event['labelAttributeName']: {
                            'workerId': annotation['workerId'],
                            'boxesInfo': new_annotation,
                            'imageSource': dataset['dataObject']
                        }
                    }
                }
            }
            consolidated_labels.append(label)

    return consolidated_labels

The post-annotation Lambda will often receive batches of task results in the event object. That batch
will be the payload object the Lambda should iterate through. What you send back will be an object
meeting the API contract (p. 678).

The output of your labeling job


You'll find the output of the job in a folder named after your labeling job in the target S3 bucket you
specified. It will be in a subfolder named manifests.

For a bounding box task, the output you find in the output manifest will look similar to the following
example. The example has been cleaned up for printing. The actual output will be a single line per record.

Example: JSON in your output manifest

{
"source-ref":"<URL>",
"<label attribute name>":
{
"workerId":"<URL>",
"imageSource":"<image URL>",
"boxesInfo":"{\"boundingBox\":{\"boundingBoxes\":[{\"height\":878, \"label\":\"bird
\", \"left\":208, \"top\":6, \"width\":809}], \"inputImageProperties\":{\"height\":924,
\"width\":1280}}}"},
"<label attribute name>-metadata":
{
"type":"groundTruth/custom",
"job_name":"<Labeling job name>",
"human-annotated":"yes"
},
"animal" : "bird"
}

Note how the additional animal attribute from your original manifest is passed to the output manifest
on the same level as the source-ref and labeling data. Any properties from your input manifest,
whether they were used in your template or not, will be passed to the output manifest.
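
As a quick way to inspect these records, the following sketch parses each line of a locally
downloaded copy of the output manifest and prints the consolidated boxes. The file name
output.manifest and the label attribute name example-attribute are assumptions for illustration.

import json

# Assumptions: output.manifest is a local copy of the output manifest,
# and "example-attribute" is the label attribute name used for the job.
with open("output.manifest") as f:
    for line in f:
        record = json.loads(line)
        # boxesInfo is stored as a JSON-encoded string, so decode it again.
        boxes = json.loads(record["example-attribute"]["boxesInfo"])
        print(record["source-ref"], boxes["boundingBox"]["boundingBoxes"])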

Demo Template: Labeling Intents with crowd-classifier


If you choose a custom template, you'll reach the Custom labeling task panel. There you can select from
multiple starter templates that represent some of the more common tasks. The templates provide a
starting point to work from in building your customized labeling task's template.

In this demonstration, you work with the Intent Detection template, which uses the crowd-
classifier (p. 903) element, and the AWS Lambda functions needed for processing your data
before and after the task.


Topics
• Starter Intent Detection custom template (p. 697)
• Your Intent Detection custom template (p. 697)
• Your pre-annotation Lambda function (p. 701)
• Your post-annotation Lambda function (p. 701)
• Your labeling job output (p. 702)

Starter Intent Detection custom template


This is the intent detection template that is provided as a starting point.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-classifier
name="intent"
categories="{{ task.input.labels | to_json | escape }}"
header="Pick the most relevant intention expressed by the below text"
>
<classification-target>
{{ task.input.utterance }}
</classification-target>

<full-instructions header="Intent Detection Instructions">


<p>Select the most relevant intention expressed by the text.</p>
<div>
<p><strong>Example: </strong>I would like to return a pair of shoes</p>
<p><strong>Intent: </strong>Return</p>
</div>
</full-instructions>

<short-instructions>
Pick the most relevant intention expressed by the text
</short-instructions>
</crowd-classifier>
</crowd-form>

The custom templates use the Liquid template language, and each of the items between double
curly braces is a variable. The pre-annotation AWS Lambda function should provide an object named
taskInput and that object's properties can be accessed as {{ task.input.<property name> }} in
your template.

Your Intent Detection custom template


In the starter template, there are two variables: the task.input.labels property in the crowd-
classifier element opening tag and the task.input.utterance in the classification-target
region's content.

Unless you need to offer different sets of labels with different utterances, avoiding a variable and
just using text saves processing time and reduces the possibility of error. The template used in this
demonstration removes that variable, but variables and filters like to_json are explained in more
detail in the crowd-bounding-box demonstration article.

Styling Your Elements

Two parts of these custom elements that sometimes get overlooked are the <full-instructions>
and <short-instructions> regions. Good instructions generate good results.


In the elements that include these regions, the <short-instructions> appear automatically in the
"Instructions" pane on the left of the worker's screen. The <full-instructions> are linked from the
"View full instructions" link near the top of that pane. Clicking the link opens a modal pane with more
detailed instructions.

Not only can you use HTML, CSS, and JavaScript in these sections; you are encouraged to do so if you
believe you can provide a strong set of instructions and examples that will help workers complete your
tasks with better speed and accuracy.

Example: Try out a sample with JSFiddle

Try out an example <crowd-classifier> task. The example is rendered by JSFiddle, so all of the
template variables are replaced with hard-coded values. Click the "View full instructions" link to see a set
of examples with extended CSS styling. You can fork the project to experiment with your own changes to
the CSS, adding sample images, or adding extended JavaScript functionality.

Example: Final Customized Intent Detection Template

This uses the example <crowd-classifier> task, but with a variable for the <classification-
target>. If you are trying to keep a consistent CSS design among a series of different labeling jobs, you
can include an external stylesheet using a <link rel...> element the same way you'd do in any other
HTML document.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-classifier
name="intent"
categories="['buy', 'eat', 'watch', 'browse', 'leave']"
header="Pick the most relevant intent expressed by the text below"
>
<classification-target>

{{ task.input.source }}
</classification-target>

<full-instructions header="Emotion Classification Instructions">


<p>In the statements and questions provided in this exercise, what category of action
is the speaker interested in doing?</p>
<table>
<tr>
<th>Example Utterance</th>
<th>Good Choice</th>
</tr>
<tr>
<td>When is the Seahawks game on?</td>
<td>
eat<br>
<greenbg>watch</greenbg>
<botchoice>browse</botchoice>
</td>
</tr>
<tr>
<th>Example Utterance</th>
<th>Bad Choice</th>
</tr>
<tr>
<td>When is the Seahawks game on?</td>
<td>
buy<br>
<greenbg>eat</greenbg>
<botchoice>watch</botchoice>
</td>
</tr>
</table>
</full-instructions>

<short-instructions>
What is the speaker expressing they would like to do next?
</short-instructions>
</crowd-classifier>
</crowd-form>
<style>
greenbg {
background: #feee23;
display: block;
}

table {
*border-collapse: collapse; /* IE7 and lower */
border-spacing: 0;
}

th, tfoot, .fakehead {


background-color: #8888ee;
color: #f3f3f3;
font-weight: 700;
}

th, td, tfoot {


border: 1px solid blue;
}

th:first-child {
border-radius: 6px 0 0 0;
}

th:last-child {
border-radius: 0 6px 0 0;
}

th:only-child {
border-radius: 6px 6px 0 0;
}

tfoot:first-child {
border-radius: 0 0 6px 0;
}

tfoot:last-child {
border-radius: 0 0 0 6px;
}

tfoot:only-child{
border-radius: 6px 6px;
}

td {
padding-left: 15px ;
padding-right: 15px ;
}

botchoice {
display: block;
height: 17px;
width: 490px;
overflow: hidden;
position: relative;
background: #fff;
padding-bottom: 20px;
}

botchoice:after {
position: absolute;
bottom: 0;
left: 0;
height: 100%;
width: 100%;
content: "";
background: linear-gradient(to top,
rgba(255,255,255, 1) 55%,
rgba(255,255,255, 0) 100%
);
pointer-events: none; /* so the text is still selectable */
}
</style>

Example: Your manifest file

If you are preparing your manifest file manually for a text-classification task like this, have your data
formatted in the following manner.

{"source": "Roses are red"}


{"source": "Violets are Blue"}
{"source": "Ground Truth is the best"}
{"source": "And so are you"}

This differs from the manifest file used for the "Demo Template: Annotation of Images with crowd-
bounding-box (p. 692)" demonstration in that source-ref was used as the property name instead
of source. Use source-ref to designate S3 URIs for images or other files that must be converted
to HTTPS URLs. Otherwise, use source, as with the text strings above.
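
If you generate a manifest like this programmatically, a minimal sketch (the file name and sample
utterances are hypothetical) writes one JSON object per line:

import json

utterances = ["Roses are red", "Violets are Blue"]  # hypothetical sample data

with open("input.manifest", "w") as f:
    for text in utterances:
        # One JSON object per line, using the "source" property for raw text.
        f.write(json.dumps({"source": text}) + "\n")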


Your pre-annotation Lambda function


As part of the job set-up, provide the ARN of an AWS Lambda function that can be called to process your
manifest entries and pass them to the template engine.

This Lambda function is required to have one of the following four strings as part of the function name:
SageMaker, Sagemaker, sagemaker, or LabelingFunction.

This applies to both your pre-annotation and post-annotation Lambdas.

When you're using the console, if you have Lambdas that are owned by your account, a drop-down list of
functions meeting the naming requirements will be provided to choose one.

In this very basic sample, where you have only one variable, it's primarily a pass-through function. Here's
a sample pre-labeling Lambda using Python 3.7.

import json

def lambda_handler(event, context):
    # Pass the manifest data object straight through to the
    # template engine as the taskInput object.
    return {
        "taskInput": event['dataObject']
    }

The dataObject property of the event contains the properties from a data object in your manifest.

In this demonstration, which is a simple pass through, you just pass that straight through as
the taskInput value. If you add properties with those values to the event['dataObject']
object, they will be available to your HTML template as Liquid variables with the format
{{ task.input.<property name> }}.

Your post-annotation Lambda function


As part of the job set-up, provide the ARN of a Lambda function that can be called to process the
form data when a worker completes a task. This can be as simple or complex as you want. If you want
to do answer-consolidation and scoring as data comes in, you can apply the scoring or consolidation
algorithms of your choice. If you want to store the raw data for offline processing, that is an option.
Set permissions for your post-annotation Lambda function
The annotation data will be in a file designated by the s3Uri string in the payload object. To
process the annotations as they come in, even for a simple pass through function, you need to
assign S3ReadOnly access to your Lambda so it can read the annotation files.
In the Console page for creating your Lambda, scroll to the Execution role panel. Select Create
a new role from one or more templates. Give the role a name. From the Policy templates
drop-down, choose Amazon S3 object read-only permissions. Save the Lambda and the role
will be saved and selected.

The following sample is for Python 3.7.

import json
import boto3
from urllib.parse import urlparse

def lambda_handler(event, context):
    consolidated_labels = []

    # Fetch the annotation file that Ground Truth wrote to S3.
    parsed_url = urlparse(event['payload']['s3Uri'])
    s3 = boto3.client('s3')
    textFile = s3.get_object(Bucket=parsed_url.netloc, Key=parsed_url.path[1:])
    filecont = textFile['Body'].read()
    annotations = json.loads(filecont)

    for dataset in annotations:
        for annotation in dataset['annotations']:
            new_annotation = json.loads(annotation['annotationData']['content'])
            label = {
                'datasetObjectId': dataset['datasetObjectId'],
                'consolidatedAnnotation': {
                    'content': {
                        event['labelAttributeName']: {
                            'workerId': annotation['workerId'],
                            'result': new_annotation,
                            'labeledContent': dataset['dataObject']
                        }
                    }
                }
            }
            consolidated_labels.append(label)

    return consolidated_labels

Your labeling job output


The post-annotation Lambda will often receive batches of task results in the event object. That batch will
be the payload object the Lambda should iterate through.

You'll find the output of the job in a folder named after your labeling job in the target S3 bucket you
specified. It will be in a subfolder named manifests.

For an intent detection task, the output in the output manifest will look similar to the following
example. The example has been cleaned up and spaced out to be easier for humans to read. The actual
output will be more compressed for machine reading.

Example: JSON in your output manifest

[
{
"datasetObjectId":"<Number representing item's place in the manifest>",
"consolidatedAnnotation":
{
"content":
{
"<name of labeling job>":
{
"workerId":"private.us-east-1.XXXXXXXXXXXXXXXXXXXXXX",
"result":
{
"intent":
{
"label":"<label chosen by worker>"
}
},
"labeledContent":
{
"content":"<text content that was labeled>"
}
}
}
}
},
"datasetObjectId":"<Number representing item's place in the manifest>",
"consolidatedAnnotation":
{
"content":
{
"<name of labeling job>":
{
"workerId":"private.us-east-1.6UDLPKQZHYWJQSCA4MBJBB7FWE",
"result":
{
"intent":
{
"label": "<label chosen by worker>"
}
},
"labeledContent":
{
"content": "<text content that was labeled>"
}
}
}
}
},
...
...
...
]

This should help you create and use your own custom template.

Custom Workflows via the API


When you have created your custom UI template (Step 2) and processing Lambda functions
(Step 3), you should place the template in an Amazon S3 bucket with a file name format of:
<FileName>.liquid.html.

Use the CreateLabelingJob action to configure your task. You'll use the location of a custom template
(Step 2: Creating your custom worker task template (p. 672)) stored in a <filename>.liquid.html
file on S3 as the value for the UiTemplateS3Uri field in the UiConfig object within the
HumanTaskConfig object.

For the AWS Lambda functions described in Step 3: Processing with AWS Lambda (p. 678), the post-
annotation function's ARN is used as the value for the AnnotationConsolidationLambdaArn field,
and the pre-annotation function's ARN is used as the value for PreHumanTaskLambdaArn.
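
Sketched with the AWS SDK for Python (Boto3), the custom-workflow-specific fields fit into a
CreateLabelingJob request as follows. The bucket, template path, and Lambda ARNs are placeholders,
and the remaining required fields are omitted for brevity.

# Placeholders: substitute your own template location and Lambda ARNs.
human_task_config = {
    "UiConfig": {
        "UiTemplateS3Uri": "s3://your-bucket/templates/my-template.liquid.html"
    },
    "PreHumanTaskLambdaArn": "arn:aws:lambda:region:account-id:function:my-pre-labeling-function",
    "AnnotationConsolidationConfig": {
        "AnnotationConsolidationLambdaArn": "arn:aws:lambda:region:account-id:function:my-post-labeling-function"
    }
    # ...plus the other required HumanTaskConfig fields shown in the examples below.
}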

Create a Labeling Job


You can create a labeling job in the Amazon SageMaker console and by using an AWS SDK in your
preferred language to run CreateLabelingJob. After a labeling job has been created, you can track
worker metrics (for private workforces) and your labeling job status using CloudWatch.

Before you create a labeling job, we recommend that you review the following pages, as applicable:

• You can specify your input data using automated data setup in the console, or with an input manifest
file in either the console or the CreateLabelingJob API. For automated data setup, see
Automated Data Setup (p. 736). To learn how to create an input manifest file, see Use an Input
Manifest File (p. 735).
• Review labeling job input data quotas: Input Data Quotas (p. 742).

After you have chosen your task type, use the topics on this page to learn how to create a labeling job.

If you are a new Ground Truth user, we recommend that you start by walking through the demo in
Getting started (p. 527).
Important
Ground Truth requires all S3 buckets that contain labeling job input image data to have a CORS
policy attached. To learn more, see CORS Permission Requirement (p. 816).
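
For reference, a minimal CORS configuration of the kind that page describes might look like the
following sketch; confirm the exact requirements for your task type on that page before applying it
to your bucket.

[
    {
        "AllowedHeaders": [],
        "AllowedMethods": ["GET"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": []
    }
]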


Topics
• Built-in Task Types (p. 704)
• Creating Instruction Pages (p. 704)
• Create a Labeling Job (Console) (p. 706)
• Create a Labeling Job (API) (p. 709)
• Create a Streaming Labeling Job (p. 714)
• Create a Labeling Category Configuration File with Label Category and Frame Attributes (p. 719)

Built-in Task Types


Amazon SageMaker Ground Truth has several built-in task types. Ground Truth provides a worker
task template for built-in task types. Additionally, some built-in task types support Automate Data
Labeling (p. 807). The following topics describe each built-in task type and demonstrate the worker
task templates that Ground Truth provides in the console. To learn how to create a labeling job in the
console using one of these task types, select the task type page.

Label Images
• Bounding Box (p. 532)
• Image Classification (Single Label) (p. 545)
• Image Classification (Multi-label) (p. 547)
• Image Semantic Segmentation (p. 538)
• Verify and Adjust Labels (p. 664)

Label Text
• Named Entity Recognition (p. 552)
• Text Classification (Single Label) (p. 556)
• Text Classification (Multi-label) (p. 559)

Label Videos and Video Frames
• Video Classification (p. 562)
• Video Frame Object Detection (p. 567)
• Video Frame Object Tracking (p. 571)

Label 3D Point Clouds
• 3D Point Cloud Object Detection (p. 599)
• 3D Point Cloud Object Tracking (p. 606)
• 3D Point Cloud Semantic Segmentation (p. 615)

Note
Each of the video frame and 3D point cloud task types has an adjustment task type that you use
to verify and adjust labels from a previous labeling job. Select a video frame or 3D point cloud
task type page above to learn how to adjust labels created using that task type.

Creating Instruction Pages


Create custom instructions for labeling jobs to improve your workers' accuracy in completing their task.
You can modify the default instructions that are provided in the console or you can create your own. The
instructions are shown to the worker on the page where they complete their labeling task.

There are two kinds of instructions:

• Short instructions—instructions that are shown on the same webpage where the worker completes
their task. These instructions should provide an easy reference to show the worker the correct way to
label an object.
• Full instructions—instructions that are shown on a dialog box that overlays the page where the worker
completes their task. We recommend that you provide detailed instructions for completing the task
with multiple examples showing edge cases and other difficult situations for labeling objects.


Create instructions in the console when you are creating your labeling job. Start with the existing
instructions for the task and use the editor to modify them to suit your labeling job.
Note
Once you create your labeling job, it will automatically start and you will not be able to modify
your worker instructions. If you need to change your worker instructions, stop the labeling job
that you created, clone it, and modify your worker instructions before creating a new job.
You can clone a labeling job in the console by selecting the labeling job and then selecting
Clone in the Actions menu.
To clone a labeling job using the Amazon SageMaker API or your preferred Amazon SageMaker
SDK, make a new request to the CreateLabelingJob operation with the same specifications
as your original job after modifying your worker instructions.

Short Instructions
Short instructions appear on the same web page that workers use to label your data object. For example,
the following is the editing page for a bounding box task. The short instructions panel is on the left.

Keep in mind that a worker will only spend seconds looking at the short instructions. Workers must be
able to scan and understand your information quickly. In all cases it should take less time to understand
the instructions than it takes to complete the task. Keep these points in mind:

• Your instructions should be clear and simple.


• Pictures are better than words. Create a simple illustration of your task that your workers can
immediately understand.
• If you must use words, use short, concise examples.
• Your short instructions are more important than your full instructions.

The Amazon SageMaker Ground Truth console provides an editor so that you can create your short
instructions. Replace the placeholder text and images with instructions for your task. Preview the
worker's task page by choosing Preview. The preview opens in a new window; be sure to turn off pop-
up blocking so that the window is displayed.

Full Instructions
You can provide additional instructions for your workers in a dialog box that overlays the page where
workers label your data objects. Use full instructions to explain more complex tasks and to show workers
the proper way to label edge cases or other difficult objects.

You can create full instructions using an editor in the Ground Truth console. As with short instructions,
keep the following in mind:

• Workers will want detailed instructions the first few times that they complete your task. Any information
that they must have should be in the short instructions.
• Pictures are more important than words.
• Text should be concise.
• Full instructions should supplement the short instructions. Don't repeat information that appears in
the short instructions.

The Ground Truth console provides an editor so that you can create your full instructions. Replace the
placeholder text and images with instructions for your task. Preview the full instruction page by choosing
Preview. The preview opens in a new window; be sure to turn off pop-up blocking so that the window
is displayed.

Add example images to your instructions


Images provide useful examples for your workers. To add a publicly accessible image to your instructions:

• Place the cursor where the image should go in the instructions editor.
• Click the image icon in the editor toolbar.
• Enter the URL of your image.

If your instruction image in Amazon S3 is not publicly accessible:

• As the image URL, enter: {{ 'https://s3.amazonaws.com/your-bucket-name/image-file-
name' | grant_read_access }}.
• This renders the image URL with a short-lived, one-time access code appended so the worker's browser
can display it. A broken image icon is displayed in the instructions editor, but previewing the tool
displays the image in the rendered preview.

Create a Labeling Job (Console)


You can use the Amazon SageMaker console to create a labeling job for all of the Ground Truth built-in
task types and custom labeling workflows. For built-in task types, we recommend that you use this page
alongside the page for your task type. Each task type page includes specific details on creating a labeling
job using that task type.


You need to provide the following to create a labeling job in the SageMaker console:

• An input manifest file in Amazon S3. You can place your input dataset in Amazon S3 and automatically
generate a manifest file using the Ground Truth console (not supported for 3D point cloud labeling
jobs).

Alternatively, you can manually create an input manifest file. To learn how, see Input Data (p. 734).
• An Amazon S3 bucket to store your output data.
• An IAM role with permission to access your resources in Amazon S3 and with a SageMaker
execution policy attached. For a general solution, you can attach the managed policy,
AmazonSageMakerFullAccess, to an IAM role and include sagemaker in your bucket name.

For more granular policies, see the section called “IAM Permissions” (p. 817).

3D point cloud task types have additional security considerations.
• A work team. You create a work team from a workforce made up of Amazon Mechanical Turk workers,
vendors, or your own private workers. To learn more, see Create and Manage Workforces (p. 863).

You cannot use the Mechanical Turk workforce for 3D point cloud or video frame labeling jobs.
• If you are using a custom labeling workflow, you must save a worker task template in Amazon S3 and
provide an Amazon S3 URI for that template. For more information, see Step 2: Creating your custom
worker task template (p. 672).
• (Optional) An AWS KMS key ARN if you want SageMaker to encrypt the output of your labeling job
using your own AWS KMS encryption key instead of the default Amazon S3 service key.
• (Optional) Existing labels for the dataset you use for your labeling job. Use this option if you want
workers to adjust, or approve and reject labels.
• If you want to create an adjustment or verification labeling job, you must have an output manifest file
in Amazon S3 that contains the labels you want adjusted or verified. This option is only supported for
bounding box and semantic segmentation image labeling jobs and 3D point cloud and video frame
labeling jobs. It is recommended that you use the instructions on Verify and Adjust Labels (p. 664) to
create a verification or adjustment labeling job.

Important
Your work team, input manifest file, output bucket, and other resources in Amazon S3 must be
in the same AWS Region you use to create your labeling job.

When you create a labeling job using the SageMaker console, you add worker instructions and labels to
the worker UI that Ground Truth provides. You can preview and interact with the worker UI while creating
your labeling job in the console. You can also see a preview of the worker UI on your built-in task type
page.

To create a labeling job (console)

1. Sign in to the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the left navigation pane, choose Labeling jobs.
3. On the Labeling jobs page, choose Create labeling job.
4. For Job name, enter a name for your labeling job.
5. (Optional) If you want to identify your labels with a key, select I want to specify a label attribute
name different from the labeling job name. If you do not select this option, the labeling job name
you specified in the previous step will be used to identify your labels in your output manifest file.
6. Choose a data setup option to set up a connection between your input dataset and Ground Truth.

• For Automated data setup:


• Follow the instructions in Automated Data Setup (p. 736) for image, text, and video clip
labeling jobs.


• Follow the instructions in Automated Video Frame Input Data Setup (p. 773) for video frame
labeling jobs.
• For Manual data setup:
• For Input dataset location, provide the location in Amazon S3 in which your input manifest file
is located. For example, if your input manifest file, manifest.json, is located in example-bucket,
enter s3://example-bucket/manifest.json.
• For Output dataset location, provide the location in Amazon S3 where you want Ground Truth
to store the output data from your labeling job.
7. For IAM Role, choose an existing IAM role or create an IAM role with permission to access your
resources in Amazon S3, to write to the output Amazon S3 bucket specified above, and with a
SageMaker execution policy attached.
8. (Optional) For Additional configuration, you can specify how much of your dataset you want
workers to label, and if you want SageMaker to encrypt the output data for your labeling job using
an AWS KMS encryption key. To encrypt your output data, you must have the required AWS KMS
permissions attached to the IAM role you provided in the previous step. For more details, see the
section called “IAM Permissions” (p. 817).
9. In the Task type section, under Task category, use the dropdown list to select your task category.
10. In Task selection, choose your task type.
11. (Optional) Provide tags for your labeling job to make it easier to find in the console later.
12. Choose Next.
13. In the Workers section, choose the type of workforce you would like to use. For more details about
your workforce options see Create and Manage Workforces (p. 863).
14. (Optional) After you've selected your workforce, specify the Task timeout. This is the maximum
amount of time a worker has to work on a task.

For 3D point cloud annotation tasks, the default task timeout is 3 days. The default timeout for text
and image classification and label verification labeling jobs is 5 minutes. The default timeout for all
other labeling jobs is 60 minutes.
15. (Optional) For bounding box, semantic segmentation, video frame, and 3D point cloud task types,
you can select Display existing labels if you want to display labels for your input data set for
workers to verify or adjust.

For bounding box and semantic segmentation labeling jobs, this will create an adjustment labeling
job.

For 3D point cloud and video frame labeling jobs:

• Select Adjustment to create an adjustment labeling job. When you select this option, you can add
new labels but you cannot remove or edit existing labels from the previous job. Optionally, you
can choose label category attributes and frame attributes that you want workers to edit. To make
an attribute editable, select the check box Allow workers to edit this attribute for that attribute.

Optionally, you can add new label category and frame attributes.
• Select Verification to create a verification labeling job. When you select this option, you cannot
add, modify, or remove existing labels from the previous job. Optionally, you can choose label
category attributes and frame attributes that you want workers to edit. To make an attribute
editable, select the check box Allow workers to edit this attribute for that attribute.

We recommend that you add new label category attributes to the labels that you want
workers to verify, or add one or more frame attributes to have workers provide information about
the entire frame.

For more information, see Verify and Adjust Labels (p. 664).


16. Configure your workers' UI:

• If you are using a built-in task type, specify worker instructions and labels.
• For image classification and text classification (single and multi-label) you must specify at
least two label categories. For all other built-in task types, you must specify at least one label
category.
• (Optional) If you are creating a 3D point cloud or video frame labeling job, you can specify
label category attributes (not supported for 3D point cloud semantic segmentation) and frame
attributes. Label category attributes can be assigned to one or more labels. Frame attributes
will appear on each point cloud or video frame workers label. To learn more, see Worker User
Interface (UI) (p. 631) for 3D point cloud and Worker User Interface (UI) (p. 577) for video
frame.
• (Optional) Add Additional instructions to help your worker complete your task.
• If you are creating a custom labeling workflow, you must:
• Enter a custom template in the code box. Custom templates can be created using a combination
of HTML, the Liquid templating language, and our pre-built web components. Optionally, you
can choose a base template from the drop-down menu to get started.
• Specify pre-annotation and post-annotation Lambda functions. To learn how to create these
functions, see Step 3: Processing with AWS Lambda (p. 678).
17. (Optional) You can select See preview to preview your worker instructions, labels, and interact
with the worker UI. Make sure the pop-up blocker of the browser is disabled before generating the
preview.
18. Choose Create.

After you've successfully created your labeling job, you are redirected to the Labeling jobs page. The
status of the labeling job you just created is In progress. This status progressively updates as workers
complete your tasks. When all tasks are successfully completed, the status changes to Completed.

If an issue occurs while creating the labeling job, its status changes to Failed.

To view more details about the job, choose the labeling job name.

Next Steps
After your labeling job status changes to Completed, you can view your output data in the Amazon S3
bucket that you specified while creating that labeling job. For details about the format of your output
data, see Output Data (p. 776).

Create a Labeling Job (API)


To create a labeling job using the Amazon SageMaker API, you use the CreateLabelingJob operation.
For specific instructions on creating a labeling job for a built-in task type, see that task type page. To
learn how to create a streaming labeling job, which is a labeling job that runs perpetually, see Create a
Streaming Labeling Job (p. 714).

To use the CreateLabelingJob operation, you need the following:

• A worker task template (UiTemplateS3Uri) or human task UI ARN (HumanTaskUiArn) in Amazon S3.
• For 3D point cloud jobs, video object detection and tracking jobs, and NER jobs, use the ARN listed in
HumanTaskUiArn for your task type.
• If you are using a built-in task type other than 3D point cloud tasks, you can add your worker
instructions to one of the pre-built templates and save the template (using a .html or .liquid
extension) in your S3 bucket. Find the pre-built templates on your task type page.


• If you are using a custom labeling workflow, you can create a custom template and save the
template in your S3 bucket. To learn how to build a custom worker template, see Step 2: Creating
your custom worker task template (p. 672). For custom HTML elements that you can use to
customize your template, see Crowd HTML Elements Reference (p. 889). For a repository of demo
templates for a variety of labeling tasks, see Amazon SageMaker Ground Truth Sample Task UIs .
• An input manifest file that specifies your input data in Amazon S3. Specify the location of your
input manifest file in ManifestS3Uri. For information about creating an input manifest, see Input
Data (p. 734). If you create a streaming labeling job, this is optional. To learn how to create a
streaming labeling job, see Create a Streaming Labeling Job (p. 714).
• An Amazon S3 bucket to store your output data. You specify this bucket, and optionally, a prefix in
S3OutputPath.
• A label category configuration file. Each label category name must be unique. Specify the location
of this file in Amazon S3 using the LabelCategoryConfigS3Uri parameter. The format and label
categories for this file depend on the task type you use:
• For image classification and text classification (single and multi-label) you must specify at least two
label categories. For all other task types, the minimum number of label categories required is one.
• For named entity recognition tasks, you must provide worker instructions in this file. See Provide
Worker Instructions in a Label Category Configuration File (p. 555) for details and an example.
• For 3D point cloud and video frame task types, use the format in Create a Labeling Category
Configuration File with Label Category and Frame Attributes (p. 719).
• For all other built-in task types and custom tasks, your label category configuration file must be
a JSON file in the following format. Identify the labels you want to use by replacing label_1,
label_2,...,label_n with your label categories.

{
"document-version": "2018-11-28",
"labels": [
{"label": "label_1"},
{"label": "label_2"},
...
{"label": "label_n"}
]
}

• An AWS Identity and Access Management (IAM) role with the
AmazonSageMakerGroundTruthExecution managed IAM policy attached and with permissions to
access your S3 buckets. Specify this role in RoleArn. To learn more about this policy, see Use IAM
Managed Policies with Ground Truth (p. 817). If you require more granular permissions, see the
section called “IAM Permissions” (p. 817).

If your input or output bucket name does not contain sagemaker, you can attach a policy similar to
the following to the role that is passed to the CreateLabelingJob operation.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::my_input_bucket/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::my_output_bucket/*"
]
}
]
}

• A pre-annotation and post-annotation (or annotation-consolidation) AWS Lambda function Amazon
Resource Name (ARN) to process your input and output data.
• Lambda functions are predefined in each AWS Region for built-in task types. To find the pre-
annotation Lambda ARN for your Region, see PreHumanTaskLambdaArn. To find the annotation-
consolidation Lambda ARN for your Region, see AnnotationConsolidationLambdaArn.
• For custom labeling workflows, you must provide a custom pre- and post-annotation Lambda ARN.
To learn how to create these Lambda functions, see Step 3: Processing with AWS Lambda (p. 678).
• A work team ARN that you specify in WorkteamArn. You receive a work team ARN when you subscribe
to a vendor workforce or create a private workteam. If you are creating a labeling job for a video frame
or point cloud task type, you cannot use the Amazon Mechanical Turk workforce. For all other task
types, to use the Mechanical Turk workforce, use the following ARN. Replace region with the AWS
Region you are using to create the labeling job.

arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default

If you use the Amazon Mechanical Turk workforce, use the ContentClassifiers parameter in
DataAttributes of InputConfig to declare that your content is free of personally identifiable
information and adult content.

Ground Truth requires that your input data is free of personally identifiable information (PII) if you
use the Mechanical Turk workforce. If you use Mechanical Turk and do not specify that your input
data is free of PII using the FreeOfPersonallyIdentifiableInformation flag, your labeling
job will fail. Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Amazon Mechanical Turk workers that can view your task if it
contains adult content.

To learn more about work teams and workforces, see Create and Manage Workforces (p. 863).
• If you use the Mechanical Turk workforce, you must specify the price you'll pay workers for performing
a single task in PublicWorkforceTaskPrice.
• To configure the task, you must provide a task description and title using TaskDescription and
TaskTitle respectively. Optionally, you can provide time limits that control how long the workers
have to work on an individual task (TaskTimeLimitInSeconds) and how long tasks remain in the
worker portal, available to workers (TaskAvailabilityLifetimeInSeconds).
• (Optional) For some task types, you can have multiple workers label a single data object by inputting
a number greater than one for the NumberOfHumanWorkersPerDataObject parameter. For more
information about annotation consolidation, see Consolidate Annotations (p. 806).
• (Optional) To create an automated data labeling job, specify one of the ARNs listed in
LabelingJobAlgorithmSpecificationArn in LabelingJobAlgorithmsConfig. This ARN identifies
the algorithm used in the automated data labeling job. The task type associated with this ARN must
match the task type of the PreHumanTaskLambdaArn and AnnotationConsolidationLambdaArn
you specify. Automated data labeling is supported for the following task types: image classification,
bounding box, semantic segmentation, and text classification. The minimum number of objects
allowed for automated data labeling is 1,250, and we strongly suggest providing a minimum of 5,000
objects. To learn more about automated data labeling jobs, see Automate Data Labeling (p. 807).
• (Optional) You can provide StoppingConditions that cause the labeling job to stop if one the
conditions is met. You can use stopping conditions to control the cost of the labeling job.


Examples
The following code examples demonstrate how to create a labeling job using CreateLabelingJob.
For additional examples, we recommend you use one of the Ground Truth Labeling Jobs Jupyter
notebooks in the SageMaker Examples section of a SageMaker notebook instance. To learn how to use
a notebook example from the SageMaker Examples, see Example Notebooks (p. 220). You can also see
these example notebooks on GitHub in the SageMaker Examples repository.

AWS SDK for Python (Boto3)

The following is an example of an AWS Python SDK (Boto3) request to create a labeling job for a
built-in task type in the US East (N. Virginia) Region using a private workforce. Replace all red-
italicized text with your labeling job resources and specifications.

response = client.create_labeling_job(
LabelingJobName="example-labeling-job",
LabelAttributeName="label",
InputConfig={
'DataSource': {
'S3DataSource': {
'ManifestS3Uri': "s3://bucket/path/manifest-with-input-data.json"
}
},
'DataAttributes': {
'ContentClassifiers': [
"FreeOfPersonallyIdentifiableInformation"|"FreeOfAdultContent",
]
}
},
OutputConfig={
'S3OutputPath': "s3://bucket/path/file-to-store-output-data",
'KmsKeyId': "string"
},
RoleArn="arn:aws:iam::*:role/*",
LabelCategoryConfigS3Uri="s3://bucket/path/label-categories.json",
StoppingConditions={
'MaxHumanLabeledObjectCount': 123,
'MaxPercentageOfInputDatasetLabeled': 123
},
HumanTaskConfig={
'WorkteamArn': "arn:aws:sagemaker:region:*:workteam/private-crowd/*",
'UiConfig': {
'UiTemplateS3Uri': "s3://bucket/path/custom-worker-task-template.html"
},
'PreHumanTaskLambdaArn': "arn:aws:lambda:us-
east-1:432418664414:function:PRE-tasktype",
'TaskKeywords': [
"Images",
"Classification",
"Multi-label"
],
'TaskTitle': "Multi-label image classification task",
'TaskDescription': "Select all labels that apply to the images shown",
'NumberOfHumanWorkersPerDataObject': 1,
'TaskTimeLimitInSeconds': 3600,
'TaskAvailabilityLifetimeInSeconds': 21600,
'MaxConcurrentTaskCount': 1000,
'AnnotationConsolidationConfig': {
'AnnotationConsolidationLambdaArn': "arn:aws:lambda:us-
east-1:432418664414:function:ACS-"
}
},
Tags=[
{
'Key': "string",
'Value': "string"
},
]
)

AWS CLI

The following is an example of an AWS CLI request to create a labeling job for a built-in task type in
the US East (N. Virginia) Region using the Amazon Mechanical Turk workforce. For more information,
see create-labeling-job in the AWS CLI Command Reference. Replace all red-italicized text with
your labeling job resources and specifications.

$ aws --region us-east-1 sagemaker create-labeling-job \
--labeling-job-name "example-labeling-job" \
--label-attribute-name "label" \
--role-arn "arn:aws:iam::account-id:role/role-name" \
--input-config '{
"DataAttributes": {
"ContentClassifiers": [
"FreeOfPersonallyIdentifiableInformation",
"FreeOfAdultContent"
]
},
"DataSource": {
"S3DataSource": {
"ManifestS3Uri": "s3://bucket/path/manifest-with-input-data.json"
}
}
}' \
--output-config '{
"KmsKeyId": "",
"S3OutputPath": "s3://bucket/path/file-to-store-output-data"
}' \
--human-task-config '{
"AnnotationConsolidationConfig": {
"AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-
east-1:432418664414:function:ACS-"
},
"TaskAvailabilityLifetimeInSeconds": 21600,
"TaskTimeLimitInSeconds": 3600,
"NumberOfHumanWorkersPerDataObject": 1,
"PreHumanTaskLambdaArn": "arn:aws:lambda:us-
east-1:432418664414:function:PRE-tasktype",
"WorkteamArn": "arn:aws:sagemaker:us-east-1:394669845002:workteam/public-crowd/
default",
"PublicWorkforceTaskPrice": {
"AmountInUsd": {
"Dollars": 0,
"TenthFractionsOfACent": 6,
"Cents": 3
}
},
"TaskDescription": "Select all labels that apply to the images shown",
"MaxConcurrentTaskCount": 1000,
"TaskTitle": "Multi-label image classification task",,
"TaskKeywords": [
"Images",
"Classification",
"Multi-label"
],
"UiConfig": {
"UiTemplateS3Uri": "s3://bucket/path/custom-worker-task-template.html"
}
}'


For more information about this operation, see CreateLabelingJob. For information about how to use
other language-specific SDKs, see the See Also section of the CreateLabelingJob topic.

Create a Streaming Labeling Job


Streaming labeling jobs enable you to send individual data objects in real time to a perpetually running,
streaming labeling job. To create a streaming labeling job, you must create an Amazon SNS input
topic and specify this topic in the CreateLabelingJob parameter SnsDataSource of InputConfig.
Optionally, you can also create an Amazon SNS output topic and specify it in OutputConfig if you want
to receive label data in real time.
Important
If you are a new user of Ground Truth streaming labeling jobs, it is recommended that you
review Ground Truth Streaming Labeling Jobs (p. 738) before creating a streaming labeling
job.

Use the following sections to create the resources that you need to create a streaming
labeling job:

• Learn how to create SNS topics with the permissions required for Ground Truth streaming labeling jobs
by following the steps in Create Amazon SNS Input and Output Topics (p. 714). Your SNS topics must
be created in the same AWS Region as your labeling job.
• See Subscribe an Endpoint to Your Amazon SNS Output Topic (p. 716) to learn how to set up an
endpoint to receive labeling task output data at a specified endpoint each time a labeling task is
completed.
• To learn how to configure your Amazon S3 bucket to send notifications to your Amazon SNS input
topic, see Set up Amazon S3 Bucket Event Notifications (p. 717).
• Optionally, add data objects that you want to have labeled as soon as the labeling job starts to your
input manifest. For more information, see Create a Manifest File (Optional) (p. 717).
• There are other resources required to create a labeling job, such as an IAM role, an Amazon S3 bucket,
a worker task template, and label categories. These are described in the Ground Truth documentation
on creating a labeling job. For more information, see Create a Labeling Job (p. 703).
Important
When you create a labeling job you must provide an IAM execution role. Attach the AWS
managed policy AmazonSageMakerGroundTruthExecution to this role to ensure it has
required permissions to execute your labeling job.

When you submit a request to create a streaming labeling job, the state of your labeling job is
Initializing. Once the labeling job is active, the state changes to InProgress. Do not send new
data objects to your labeling job or attempt to stop your labeling job while it is in the Initializing
state. Once the state changes to InProgress, you can start sending new data objects using Amazon
SNS and the Amazon S3 configuration.
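
For example, once the job is InProgress, a minimal Boto3 sketch for sending one new data object to
the input topic looks like the following; the topic ARN and S3 URI are placeholders.

import json
import boto3

sns = boto3.client("sns")

# Placeholders: your input topic ARN and the S3 URI of a new data object.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:my-gt-input-topic",
    Message=json.dumps({"source-ref": "s3://your-bucket/images/new-image.jpg"})
)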

Topics
• Create Amazon SNS Input and Output Topics (p. 714)
• Set up Amazon S3 Bucket Event Notifications (p. 717)
• Create a Manifest File (Optional) (p. 717)
• Example: Use SageMaker API To Create Streaming Labeling Job (p. 717)
• Stop a Streaming Labeling Job (p. 718)

Create Amazon SNS Input and Output Topics


You need to create an Amazon SNS input topic to create a streaming labeling job. Optionally, you may
provide an Amazon SNS output topic.


When you create an Amazon SNS topic to use in your streaming labeling job, note down the topic
Amazon Resource Name (ARN). The ARN is the input value for the parameter SnsTopicArn in
InputConfig and OutputConfig when you create a labeling job.

Create an Input Topic

Your input topic is used to send new data objects to Ground Truth. To create an input topic, follow the
instructions in Creating an Amazon SNS topic in the Amazon Simple Notification Service Developer
Guide.

Note down your input topic ARN and use it as input for the CreateLabelingJob parameter
SnsTopicArn in InputConfig.
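
If you prefer to create the topic programmatically, a minimal Boto3 sketch (the topic name is a
placeholder) is:

import boto3

sns = boto3.client("sns")

# Placeholder topic name; note down the returned ARN for InputConfig.
response = sns.create_topic(Name="my-gt-input-topic")
print(response["TopicArn"])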

Create an Output Topic

If you provide an output topic, it is used to send notifications when a data object is labeled. When
you create a topic, you have the option to add an encryption key. Use this option to add an AWS Key
Management Service customer managed key to your topic to encrypt the output data of your labeling
job before it is published to your output topic.

To create an output topic, follow the instructions in Creating an Amazon SNS topic in the Amazon Simple
Notification Service Developer Guide.

If you add encryption, you must attach additional permission to the topic. See Add Encryption to Your
Output Topic (Optional) (p. 715) for more information.
Important
To add a customer managed key to your output topic while creating a topic in the console, do
not use the (Default) alias/aws/sns option. Select a customer managed key that you created.

Note down your output topic ARN and use it in your CreateLabelingJob request in the parameter
SnsTopicArn in OutputConfig.

Add Encryption to Your Output Topic (Optional)

To encrypt messages published to your output topic, you need to provide an AWS KMS customer
managed key to your topic. Modify the following policy and add it to your customer managed key to give
Ground Truth permission to encrypt output data before publishing it to your output topic.

Replace <account_id> with the ID of the account that you are using to create your topic. To learn how
to find your AWS account ID, see Finding Your AWS Account ID.

{
"Id": "key-console-policy",
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Enable IAM User Permissions",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<account_id>:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow access for Key Administrators",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<account_id>:role/Admin"
},
"Action": [
"kms:Create*",
"kms:Describe*",
"kms:Enable*",
"kms:List*",
"kms:Put*",
"kms:Update*",
"kms:Revoke*",
"kms:Disable*",
"kms:Get*",
"kms:Delete*",
"kms:TagResource",
"kms:UntagResource",
"kms:ScheduleKeyDeletion",
"kms:CancelKeyDeletion"
],
"Resource": "*"
}
]
}

Additionally, you must modify and add the following policy to the execution role that you use to create
your labeling job (the input value for RoleArn).

Replace <account_id> with the ID of the account that you are using to create your topic. Replace
<region> with the AWS Region you are using to create your labeling job. Replace <key_id> with your
customer managed key ID.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "sid1",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:<region>:<account_id>:key/<key_id>"
        }
    ]
}

For more information on creating and securing keys, see Creating Keys and Using Key Policies in the AWS
Key Management Service Developer Guide.
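
If you manage key policies programmatically, a minimal Boto3 sketch for attaching the modified policy might look like the following; the key ID is a hypothetical placeholder, and the policy file stands in for the policy document above with <account_id> filled in.

import boto3

kms = boto3.client("kms", region_name="us-east-1")

# Load the policy document above, with <account_id> replaced, from a local file.
with open("key-console-policy.json") as f:
    key_policy = f.read()

kms.put_key_policy(
    KeyId="1234abcd-12ab-34cd-56ef-1234567890ab",  # hypothetical key ID
    PolicyName="default",  # KMS key policies are named "default"
    Policy=key_policy,
)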

Subscribe an Endpoint to Your Amazon SNS Output Topic

When a worker completes a labeling job task from a Ground Truth streaming labeling job, Ground Truth
uses your output topic to publish output data to one or more endpoints that you specify. To receive
notifications when a worker finishes a labeling task, you must subscribe an endpoint to your Amazon
SNS output topic.

To learn how to add endpoints to your output topic, see Subscribing to an Amazon SNS topic in the
Amazon Simple Notification Service Developer Guide.

To learn more about the output data format that is published to these endpoints, see Output
Data (p. 776).
Important
If you do not subscribe an endpoint to your Amazon SNS output topic, you will not receive
notifications when new data objects are labeled.
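
For example, the following is a minimal Boto3 sketch that subscribes a hypothetical Amazon SQS queue to the output topic (both ARNs are placeholders):

import boto3

sns = boto3.client("sns", region_name="us-east-1")

# Subscribe an endpoint (here, an SQS queue) to the output topic so it
# receives a notification each time a data object is labeled.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:your-sns-output-topic",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:your-output-queue",
)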


Set up Amazon S3 Bucket Event Notifications


You can add an event notification to your Amazon S3 bucket using the Amazon S3 console, the Amazon S3
API, a language-specific AWS SDK, or the AWS Command Line Interface. Set up this event to send notifications
to the same Amazon SNS input topic that you specify using SnsTopicArn in InputConfig when you
create a labeling job. Do not set up event notifications using the same Amazon S3 location that you
specified for S3OutputPath in OutputConfig – doing so may result in unwanted data objects being
processed by Ground Truth for labeling.

You decide the types of events that you want to send to your Amazon SNS topic. Ground Truth creates a
labeling task when you send object creation events.

The event structure sent to your Amazon SNS input topic must be a JSON message formatted using the
same structure found in Event message structure.

To see examples of how you can set up an event notification for your Amazon S3 bucket using the
Amazon S3 console, AWS SDK for .NET, and AWS SDK for Java, follow this walkthrough, Walkthrough:
Configure a bucket for notifications (SNS topic or SQS queue) in the Amazon Simple Storage Service User
Guide.
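
For a programmatic setup, the following Boto3 sketch routes object-creation events to the input topic; the bucket name is a hypothetical placeholder, and the topic's access policy must also allow Amazon S3 to publish to it.

import boto3

s3 = boto3.client("s3")

# Notify the SNS input topic whenever an object is created in the bucket.
s3.put_bucket_notification_configuration(
    Bucket="example-input-bucket",  # hypothetical input bucket
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:your-sns-input-topic",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)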

Create a Manifest File (Optional)


When you create a streaming labeling job, you have the one-time option to add objects (such as images
or text) to an input manifest file that you specify in ManifestS3Uri of CreateLabelingJob. When
the streaming labeling job starts, these objects are sent to workers, or added to the Amazon SQS queue
if the total number of objects exceeds MaxConcurrentTaskCount. As workers complete labeling tasks,
the results are periodically added to the Amazon S3 path that you specify when creating the labeling job.
Output data is sent to any endpoint that you subscribe to your output topic.

If you want to provide initial objects to be labeled, create a manifest file that identifies these objects and
place it in Amazon S3. Specify the S3 URI of this manifest file in ManifestS3Uri within InputConfig.

To learn how to format your manifest file, see Input Data (p. 734). To use the SageMaker console to
automatically generate a manifest file (not supported for 3D point cloud task types), see Automated
Data Setup (p. 736).
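
As an illustration, the following sketch writes a two-line manifest and uploads it to Amazon S3 with Boto3; the bucket, key, and image paths are hypothetical placeholders.

import boto3

# Hypothetical initial data objects to label when the streaming job starts.
lines = [
    '{"source-ref": "s3://example-bucket/images/0001.jpg"}',
    '{"source-ref": "s3://example-bucket/images/0002.jpg"}',
]

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-bucket",
    Key="manifests/initial.manifest",
    Body=("\n".join(lines) + "\n").encode("utf-8"),
)
# Pass s3://example-bucket/manifests/initial.manifest as ManifestS3Uri in InputConfig.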

Example: Use SageMaker API To Create Streaming Labeling Job


The following is an example of an AWS Python SDK (Boto3) request that you can use to start a streaming
labeling job for a built-in task type in the US East (N. Virginia) Region. For more details about each
parameter below, see CreateLabelingJob. To learn how you can create a labeling job using this API
and associated language-specific SDKs, see Create a Labeling Job (API).

In this example, note the following parameters:

• SnsDataSource – This parameter appears in InputConfig and OutputConfig and is used to
  identify your input and output Amazon SNS topics, respectively. To create a streaming labeling job, you
  are required to provide an Amazon SNS input topic. Optionally, you can also provide an Amazon SNS
  output topic.
• S3DataSource – This parameter is optional. Use this parameter if you want to include an input
  manifest file of data objects that you want labeled as soon as the labeling job starts.
• StoppingConditions – This parameter is ignored when you create a streaming labeling job. To learn
  more about stopping a streaming labeling job, see Stop a Streaming Labeling Job (p. 718).
• Streaming labeling jobs do not support automated data labeling. Do not include the
  LabelingJobAlgorithmsConfig parameter.

import boto3

client = boto3.client("sagemaker", region_name="us-east-1")

response = client.create_labeling_job(
    LabelingJobName='example-labeling-job',
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': 's3://bucket/path/manifest-with-input-data.json'
            },
            'SnsDataSource': {
                'SnsTopicArn': 'arn:aws:sns:us-east-1:123456789012:your-sns-input-topic'
            }
        },
        'DataAttributes': {
            'ContentClassifiers': [
                # Use one or both of the following values.
                'FreeOfPersonallyIdentifiableInformation',
                'FreeOfAdultContent',
            ]
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://bucket/path/file-to-store-output-data',
        'KmsKeyId': 'string',
        'SnsTopicArn': 'arn:aws:sns:us-east-1:123456789012:your-sns-output-topic'
    },
    RoleArn='arn:aws:iam::*:role/*',
    LabelCategoryConfigS3Uri='s3://bucket/path/label-categories.json',
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-east-1:*:workteam/private-crowd/*',
        'UiConfig': {
            'UiTemplateS3Uri': 's3://bucket/path/custom-worker-task-template.html'
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-tasktype',
        'TaskKeywords': [
            'Example key word',
        ],
        'TaskTitle': 'Multi-label image classification task',
        'TaskDescription': 'Select all labels that apply to the images shown',
        'NumberOfHumanWorkersPerDataObject': 123,
        'TaskTimeLimitInSeconds': 123,
        'TaskAvailabilityLifetimeInSeconds': 123,
        'MaxConcurrentTaskCount': 123,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-tasktype'
        }
    },
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ]
)

Stop a Streaming Labeling Job


You can manually stop your streaming labeling job using the operation StopLabelingJob.
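For example, with Boto3 (the job name is a placeholder):

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")
sagemaker.stop_labeling_job(LabelingJobName="example-labeling-job")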

If your labeling job remains idle for over 10 days, it is automatically stopped by Ground Truth. In this
context, a labeling job is considered idle if no objects are sent to the Amazon SNS input topic and no
objects remain in your Amazon SQS queue, waiting to be labeled. For example, if no data objects are fed
to the Amazon SNS input topic and all the objects fed to the labeling job are already labeled, Ground
Truth starts a timer. After the timer starts, if no items are received within a 10 day period, the labeling
job is stopped.


When a labeling job is stopped, its status is STOPPING while Ground Truth cleans up labeling job
resources and unsubscribes your Amazon SNS topic from your Amazon SQS queue. The Amazon SQS queue
is not deleted by Ground Truth because this queue may contain unprocessed data objects. You should
manually delete the queue if you want to avoid incurring additional charges from Amazon SQS. To learn
more, see Amazon SQS pricing.

Create a Labeling Category Configuration File with Label Category and Frame Attributes

When you create a 3D point cloud or video frame labeling job using the Amazon SageMaker API
operation CreateLabelingJob, you use a label category configuration file to specify your labels and
worker instructions. Optionally, you can also provide the following in your label category configuration file:

• You can provide label category attributes for video frame and 3D point cloud object tracking and
object detection task types. Workers can use one or more attributes to give more information about
an object. For example, you may want to use the attribute occluded to have workers identify when an
object is partially obstructed. You can either specify a label category attribute for a single label using
the categoryAttributes parameter, or for all labels using the categoryGlobalAttributes
parameter.
• You can provide frame attributes for video frame and 3D point cloud object tracking and object
detection task types using frameAttributes. When you create a frame attribute, it appears on each
frame or point cloud in the worker task. In video frame labeling jobs, these are attributes that workers
assign to an entire video frame. For 3D point cloud labeling jobs, these attributes are applied to a
single point cloud. Use frame attributes to have workers provide more information about the scene in
a specific frame or point cloud.
• For video frame labeling jobs, you use the label category configuration file to specify the task type
(bounding box, polyline, polygon, or keypoint) sent to workers.

For workers, specifying values for label category attributes and frame attributes is optional unless you make an attribute required with isRequired.
Important
You should only provide a label attribute name in auditLabelAttributeName if
you are running an audit job to verify or adjust labels. Use this parameter to input the
LabelAttributeName used in the labeling job that generated the annotations you want your
worker to adjust. When you create a labeling job in the console, if you did not specify a label
attribute name, the Name of your job is used as the LabelAttributeName.

Topics
• Label Category Configuration File Schema (p. 719)
• Example: Label Category Configuration Files for 3D Point Cloud Labeling Jobs (p. 726)
• Example: Label Category Configuration Files for Video Frame Labeling Jobs (p. 730)
• Creating Worker Instructions (p. 733)

Label Category Configuration File Schema


The following table lists elements you can and must include in your label category configuration file.
Note
The parameter annotationType is only supported for video frame labeling jobs.

Parameter: frameAttributes
Required: No
Accepted Values: A list of JSON objects. Required parameters in each JSON object: name, type, description. minimum and maximum are required if type is "number". Optional parameters in each JSON object: enum, editsAllowed, isRequired.
Description: Use this parameter to create a frame attribute that is applied to all frames or 3D point clouds in your labeling job. See the third table in this section for more information.

Parameter: categoryGlobalAttributes
Required: No
Accepted Values: A list of JSON objects. Required parameters in each JSON object: name, type. minimum and maximum are required if type is "number". Optional parameters in each JSON object: description, enum, editsAllowed, isRequired.
Description: Use this parameter to create label category attributes that are applied to all labels you specify in labels. See the third table in this section for more information.

Parameter: labels
Required: Yes
Accepted Values: A list of up to 30 JSON objects. Required parameters in each JSON object: label. Optional parameters in each JSON object: categoryAttributes, editsAllowed.
Description: Use this parameter to specify your labels, or classes. Add one label for each class. To add a label category attribute to a label, add categoryAttributes to that label. Use editsAllowed to specify whether or not a label can be edited in an adjustment labeling job. Set editsAllowed to "none" for verification labeling jobs. See the following table for more information.

Parameter: annotationType (only supported for video frame labeling jobs)
Required: No
Accepted Values: String. Accepted parameters: BoundingBox, Polyline, Polygon, Keypoint. Default: BoundingBox.
Description: Use this to specify the task type for your video frame labeling jobs. For example, for a polygon video frame object detection task, choose Polygon. If you do not specify an annotationType when you create a video frame labeling job, Ground Truth uses BoundingBox by default.

Parameter: instructions
Required: No
Accepted Values: A JSON object. Required parameters in the JSON object: "shortInstruction", "fullInstruction".
Description: Use this parameter to add worker instructions to help your workers complete their tasks. For more information about worker instructions, see Worker Instructions (p. 632). Short instructions must be under 255 characters and full instructions must be under 2,048 characters. For more information, see Creating Worker Instructions (p. 733).

Parameter: auditLabelAttributeName
Required: Required for adjustment and verification task types
Accepted Values: String
Description: Enter the LabelAttributeName used in the labeling job whose annotations you want your workers to adjust. Only use this parameter if you are creating an adjustment job for video frame or 3D point cloud object detection, object tracking, or 3D point cloud semantic segmentation.

The following table describes the parameters that you can and must use to create the list of labels. Each
parameter should be included in a JSON object.

Parameter: label
Required: Yes
Accepted Values: String
Description: The name of the label category that is displayed to workers. Each label category name must be unique.

Parameter: categoryAttributes
Required: No
Accepted Values: A list of JSON objects. Required parameters in each JSON object: name, type. minimum and maximum are required if type is "number". Optional parameters in each JSON object: description, enum, editsAllowed, isRequired.
Description: Use this parameter to add label category attributes to specific labels you specify in labels. To add one or more label category attributes to a label, include the categoryAttributes JSON object in the same labels JSON object as that label. See the following table for more information.

Parameter: editsAllowed
Required: No
Accepted Values: String. Supported values: "none" (no modifications are allowed) and "any" (default; all modifications are allowed).
Description: Specifies whether or not a label can be edited by workers. For video frame or 3D point cloud adjustment labeling jobs, add this parameter to one or more JSON objects in the labels list to specify whether or not a worker can edit a label. For 3D point cloud and video frame verification labeling jobs, add this parameter with the value "none" to each JSON object in the labels list. This makes all labels uneditable.

The following table describes the parameters that you can and must use to create frame attributes
using frameAttributes and label category attributes using the categoryGlobalAttributes and
categoryAttributes parameters.

Parameter: name
Required: Yes
Accepted Values: String
Description: Use this parameter to assign a name to your label category or frame attribute. This is the attribute name that workers see. Each label category attribute name in your label category configuration file must be unique. Global label category attributes and label-specific label category attributes cannot have the same name.

Parameter: type
Required: Yes
Accepted Values: String. Required values: "string" or "number".
Description: Use this parameter to define the label category or frame attribute type. If you specify "string" for type and provide an enum value for this attribute, workers can choose from one of the choices you provide. If you specify "string" for type and do not provide an enum value, workers can enter free-form text. If you specify "number" for type, workers can enter a number between the minimum and maximum numbers you specify.

Parameter: enum
Required: No
Accepted Values: List of strings
Description: Use this parameter to define options that workers can choose from for this label category or frame attribute. Workers can choose one value specified in enum. For example, if you specify ["foo", "buzz", "bar"] for enum, workers can choose one of foo, buzz, or bar. You must specify "string" for type to use an enum list.

Parameter: description
Required: frameAttributes: Yes. categoryAttributes or categoryGlobalAttributes: No.
Accepted Values: String
Description: Use this parameter to add a description of the label category or frame attribute. You can use this field to give workers more information about the attribute. This field is only required for frame attributes.

Parameter: minimum and maximum
Required: Required if attribute type is "number"
Accepted Values: Integers
Description: Use these parameters to specify the minimum and maximum (inclusive) values workers can enter for numeric label category or frame attributes. You must specify "number" for type to use minimum and maximum.

Parameter: editsAllowed
Required: No
Accepted Values: String. Supported values: "none" (no modifications are allowed) and "any" (default; all modifications are allowed).
Description: Specifies whether or not a label category or frame attribute can be edited by workers. For video frame or 3D point cloud adjustment and verification labeling jobs, add this parameter to label category and frame attribute JSON objects to specify whether or not a worker can edit an attribute.

Parameter: isRequired
Required: No
Accepted Values: Boolean
Description: Specifies whether workers are required to annotate an attribute. Workers cannot submit the job until all required attributes are annotated.

Label and label category attribute quotas

You can specify up to 10 label category attributes per class. This 10-attribute quota includes global
label category attributes. For example, if you create four global label category attributes, and then
assign three label category attributes to label X, that label has 4+3=7 label category attributes in
total. For all label category and label category attribute quotas, refer to the following table.

Type | Min | Max
Labels (labels) | 1 | 30
Label name character quota | 1 | 16
Label category attributes per label (sum of categoryAttributes and categoryGlobalAttributes) | 0 | 10
Free form text entry label category attributes per label (sum of categoryAttributes and categoryGlobalAttributes) | 0 | 5
Frame attributes | 0 | 10
Free form text entry attributes in frameAttributes | 0 | 5
Attribute name character quota (name) | 1 | 16
Attribute description character quota (description) | 0 | 128
Attribute type character quota (type) | 1 | 16
Allowed values in the enum list for a string attribute | 1 | 10
Character quota for a value in an enum list | 1 | 16
Maximum characters in free form text response for free form text frameAttributes | 0 | 1000
Maximum characters in free form text response for free form text categoryAttributes and categoryGlobalAttributes | 0 | 80

Example: Label Category Configuration Files for 3D Point Cloud Labeling Jobs
The following examples show 3D point cloud label category configuration files for object detection,
object tracking, semantic segmentation, adjustment, and verification labeling jobs.

3D Point Cloud Object Tracking and Object Detection

The following is an example of a label category configuration file that includes label category
attributes for a 3D point cloud object detection or object tracking labeling job. This example
includes two frame attributes, which are added to all point clouds submitted to the labeling
job. The Car label includes four label category attributes: X, Y, Z, and the global attribute W.

{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"],
            "isRequired": true
        }
    ],
    "categoryGlobalAttributes": [
        {
            "name": "W",
            "description": "label-attributes-for-all-labels",
            "type": "string",
            "enum": ["foo", "buzz", "biz"]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "categoryAttributes": [
                {
                    "name": "X",
                    "description": "enter a number",
                    "type": "number"
                },
                {
                    "name": "Y",
                    "description": "select an option",
                    "type": "string",
                    "enum": ["y1", "y2"]
                },
                {
                    "name": "Z",
                    "description": "submit a free-form response",
                    "type": "string"
                }
            ]
        },
        {
            "label": "Pedestrian",
            "categoryAttributes": [...]
        }
    ],
    "instructions": {"shortInstruction": "Draw a tight Cuboid", "fullInstruction": "<html markup>"}
}

3D Point Cloud Semantic Segmentation

The following is an example of a label category configuration file for a 3D point cloud semantic
segmentation labeling job.

Label category attributes are not supported for 3D point cloud semantic segmentation task types.
Frame attributes are supported. If you provide label category attributes for a semantic segmentation
labeling job, they will be ignored.

{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"]
        }
    ],
    "labels": [
        {
            "label": "Car"
        },
        {
            "label": "Pedestrian"
        },
        {
            "label": "Cyclist"
        }
    ],
    "instructions": {"shortInstruction": "Select the appropriate label and paint all objects in the point cloud that it applies to the same color", "fullInstruction": "<html markup>"}
}

The following examples show label category configuration files for 3D point cloud adjustment and
verification labeling jobs.

3D Point Cloud Adjustment

The following is an example of a label category configuration file for a 3D point cloud object
detection or object tracking adjustment labeling job. For 3D point cloud semantic segmentation
adjustment labeling jobs, categoryGlobalAttributes and categoryAttributes are not
supported.


You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the adjustment labeling job. Optionally, you can use the
editsAllowed parameter to specify whether or not a label or frame attribute can be edited.

{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "editsAllowed": "none",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"]
        }
    ],
    "categoryGlobalAttributes": [
        {
            "name": "W",
            "editsAllowed": "any",
            "description": "label-attributes-for-all-labels",
            "type": "string",
            "enum": ["foo", "buzz", "biz"]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "editsAllowed": "any",
            "categoryAttributes": [
                {
                    "name": "X",
                    "description": "enter a number",
                    "type": "number"
                },
                {
                    "name": "Y",
                    "description": "select an option",
                    "type": "string",
                    "enum": ["y1", "y2"],
                    "editsAllowed": "any"
                },
                {
                    "name": "Z",
                    "description": "submit a free-form response",
                    "type": "string",
                    "editsAllowed": "none"
                }
            ]
        },
        {
            "label": "Pedestrian",
            "categoryAttributes": [...]
        }
    ],
    "instructions": {"shortInstruction": "Draw a tight Cuboid", "fullInstruction": "<html markup>"},
    // include auditLabelAttributeName for label adjustment jobs
    "auditLabelAttributeName": "myPrevJobLabelAttributeName"
}


3D Point Cloud Verification

The following is an example of a label category configuration file you may use for a 3D point
cloud object detection or object tracking verification labeling job. For a 3D point cloud semantic
segmentation verification labeling job, categoryGlobalAttributes and categoryAttributes
are not supported.

You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the verification labeling job. Additionally, you must use the
editsAllowed parameter to specify that no labels can be edited.

{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "editsAllowed": "any",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "editsAllowed": "any",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"]
        }
    ],
    "categoryGlobalAttributes": [
        {
            "name": "W",
            "editsAllowed": "none",
            "description": "label-attributes-for-all-labels",
            "type": "string",
            "enum": ["foo", "buzz", "biz"]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "editsAllowed": "none",
            "categoryAttributes": [
                {
                    "name": "X",
                    "description": "enter a number",
                    "type": "number",
                    "editsAllowed": "none"
                },
                {
                    "name": "Y",
                    "description": "select an option",
                    "type": "string",
                    "enum": ["y1", "y2"],
                    "editsAllowed": "any"
                },
                {
                    "name": "Z",
                    "description": "submit a free-form response",
                    "type": "string",
                    "editsAllowed": "none"
                }
            ]
        },
        {
            "label": "Pedestrian",
            "editsAllowed": "none",
            "categoryAttributes": [...]
        }
    ],
    "instructions": {"shortInstruction": "Draw a tight Cuboid", "fullInstruction": "<html markup>"},
    // include auditLabelAttributeName for label verification jobs
    "auditLabelAttributeName": "myPrevJobLabelAttributeName"
}

Example: Label Category Configuration Files for Video Frame Labeling Jobs
The annotation tools available to your workers and the task type used depend on the value you specify for
annotationType. For example, if you want workers to use keypoints to track changes in the pose of
specific objects across multiple frames, specify Keypoint for the annotationType. If you do
not specify an annotation type, BoundingBox is used by default.

The following is an example of a video frame keypoint label category configuration file with label
category attributes. This example includes two frame attributes, which are added to all frames
submitted to the labeling job. The Car label includes four label category attributes: X, Y, Z, and the
global attribute W.

{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"]
        }
    ],
    "categoryGlobalAttributes": [
        {
            "name": "W",
            "description": "label-attributes-for-all-labels",
            "type": "string",
            "enum": ["foo", "buz", "buz2"]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "categoryAttributes": [
                {
                    "name": "X",
                    "description": "enter a number",
                    "type": "number"
                },
                {
                    "name": "Y",
                    "description": "select an option",
                    "type": "string",
                    "enum": ["y1", "y2"]
                },
                {
                    "name": "Z",
                    "description": "submit a free-form response",
                    "type": "string"
                }
            ]
        },
        {
            "label": "Pedestrian",
            "categoryAttributes": [...]
        }
    ],
    "annotationType": "Keypoint",
    "instructions": {"shortInstruction": "add example short instructions here", "fullInstruction": "<html markup>"}
}

The following examples show label category configuration files for video frame adjustment and
verification labeling jobs.

Video Frame Adjustment

The following is an example of a label category configuration file you may use for a video frame
adjustment labeling job.

You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the adjustment labeling job. Optionally, you can use the
editsAllowed parameter to specify whether or not labels, label category attributes, or frame
attributes can be edited.

{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "editsAllowed": "none",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"]
        }
    ],
    "categoryGlobalAttributes": [
        {
            "name": "W",
            "editsAllowed": "any",
            "description": "label-attributes-for-all-labels",
            "type": "string",
            "enum": ["foo", "buz", "buz2"]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "editsAllowed": "any",
            "categoryAttributes": [
                {
                    "name": "X",
                    "description": "enter a number",
                    "type": "number",
                    "editsAllowed": "any"
                },
                {
                    "name": "Y",
                    "description": "select an option",
                    "type": "string",
                    "enum": ["y1", "y2"],
                    "editsAllowed": "any"
                },
                {
                    "name": "Z",
                    "description": "submit a free-form response",
                    "type": "string",
                    "editsAllowed": "none"
                }
            ]
        },
        {
            "label": "Pedestrian",
            "editsAllowed": "none",
            "categoryAttributes": [...]
        }
    ],
    "annotationType": "Keypoint",
    "instructions": {"shortInstruction": "add example short instructions here", "fullInstruction": "<html markup>"},
    // include auditLabelAttributeName for label adjustment jobs
    "auditLabelAttributeName": "myPrevJobLabelAttributeName"
}

Video Frame Verification

The following is an example of a label category configuration file for a video frame verification labeling job.

You must include auditLabelAttributeName to specify the label attribute name of the previous
labeling job that you use to create the verification labeling job. Additionally, you must use the
editsAllowed parameter to specify that no labels can be edited.

{
    "documentVersion": "2020-03-01",
    "frameAttributes": [
        {
            "name": "count players",
            "editsAllowed": "none",
            "description": "How many players do you see in the scene?",
            "type": "number"
        },
        {
            "name": "select one",
            "editsAllowed": "any",
            "description": "describe the scene",
            "type": "string",
            "enum": ["clear", "blurry"]
        }
    ],
    "categoryGlobalAttributes": [
        {
            "name": "W",
            "editsAllowed": "none",
            "description": "label-attributes-for-all-labels",
            "type": "string",
            "enum": ["foo", "buz", "buz2"]
        }
    ],
    "labels": [
        {
            "label": "Car",
            "editsAllowed": "none",
            "categoryAttributes": [
                {
                    "name": "X",
                    "description": "enter a number",
                    "type": "number",
                    "editsAllowed": "any"
                },
                {
                    "name": "Y",
                    "description": "select an option",
                    "type": "string",
                    "enum": ["y1", "y2"],
                    "editsAllowed": "any"
                },
                {
                    "name": "Z",
                    "description": "submit a free-form response",
                    "type": "string",
                    "editsAllowed": "none"
                }
            ]
        },
        {
            "label": "Pedestrian",
            "editsAllowed": "none",
            "categoryAttributes": [...]
        }
    ],
    "annotationType": "Keypoint",
    "instructions": {"shortInstruction": "add example short instructions here", "fullInstruction": "<html markup>"},
    // include auditLabelAttributeName for label verification jobs
    "auditLabelAttributeName": "myPrevJobLabelAttributeName"
}

Creating Worker Instructions


Create custom instructions for labeling jobs to improve your workers' accuracy in completing their tasks.
Your instructions are accessible when workers select the Instructions menu option in the worker UI.
Short instructions must be under 255 characters and full instructions must be under 2,048 characters.

There are two kinds of instructions:

• Short instructions – These instructions are shown to workers when they select Instructions in the
worker UI menu. They should provide an easy reference to show the worker the correct way to label an
object.
• Full instructions – These instructions are shown in a pop-up window when workers select More
Instructions. We recommend that you provide detailed instructions for completing the task
with multiple examples showing edge cases and other difficult situations for labeling objects.

For 3D point cloud and video frame labeling jobs, you can add worker instructions to your label category
configuration file. You can use a single string to create instructions, or you can add HTML markup to
customize the appearance of your instructions and add images. Make sure that any images you include in
your instructions are publicly available, or if your instructions are in Amazon S3, that your workers have
read access so that they can view them.
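
For instance, an instructions object with HTML markup might look like the following sketch; the image URL is a hypothetical placeholder for an image your workers can access.

"instructions": {
    "shortInstruction": "Draw a tight cuboid around each car.",
    "fullInstruction": "<p>Draw a tight cuboid around every fully visible car.</p><img src=\"https://example.com/instructions/car-example.png\"/>"
}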


Use Input and Output Data


The input data that you provide to Amazon SageMaker Ground Truth is sent to your workers for labeling.
You choose the data to send to your workers by creating a single manifest file that defines all of the
data that requires labeling or by sending input data objects to an ongoing, streaming labeling job to be
labeled in real time.

The output data is the result of your labeling job. The output data file, or augmented manifest file,
contains label data for each object you send to the labeling job and metadata about the label assigned
to data objects.

When you use the image classification (single and multi-label), text classification (single and multi-label),
object detection, and semantic segmentation built-in task types to create a labeling job, you can use the
resulting augmented manifest file to launch a SageMaker training job. For a demonstration of how to use
an augmented manifest to train an object detection machine learning model with Amazon SageMaker,
see object_detection_augmented_manifest_training.ipynb. For more information, see Provide Dataset
Metadata to Training Jobs with an Augmented Manifest File (p. 2138).

Topics
• Input Data (p. 734)
• 3D Point Cloud Input Data (p. 746)
• Video Frame Input Data (p. 770)
• Output Data (p. 776)

Input Data
The input data are the data objects that you send to your workforce to be labeled. There are two ways to
send data objects to Ground Truth for labeling:

• Send a list of data objects that require labeling using an input manifest file.
• Send individual data objects in real time to a perpetually running, streaming labeling job.

If you have a dataset that needs to be labeled one time, and you do not require an ongoing labeling job,
create a standard labeling job using an input manifest file.

If you want to regularly send new data objects to your labeling job after it has started, create a
streaming labeling job. When you create a streaming labeling job, you can optionally use an input
manifest file to specify a group of data that you want labeled immediately when the job starts. You can
continuously send new data objects to a streaming labeling job as long as it is active.
Note
Streaming labeling jobs are only supported through the SageMaker API. You cannot create a
streaming labeling job using the SageMaker console.

The following task types have special input data requirements and options:

• For 3D point cloud labeling job input data requirements, see 3D Point Cloud Input Data (p. 746).
• For video frame labeling job input data requirements, see Video Frame Input Data (p. 770).

Topics
• Use an Input Manifest File (p. 735)
• Automated Data Setup (p. 736)
• Supported Data Formats (p. 737)
• Ground Truth Streaming Labeling Jobs (p. 738)
• Input Data Quotas (p. 742)
• Filter and Select Data for Labeling (p. 745)

Use an Input Manifest File


Each line in an input manifest file is an entry containing an object, or a reference to an object, to label.
An entry can also contain labels from previous jobs and for some task types, additional information.

Input data and the manifest file must be stored in Amazon Simple Storage Service (Amazon S3). Each
has specific storage and access requirements, as follows:

• The Amazon S3 bucket that contains the input data must be in the same AWS Region in which you
are running Amazon SageMaker Ground Truth. You must give Amazon SageMaker access to the data
stored in the Amazon S3 bucket so that it can read it. For more information about Amazon S3 buckets,
see Working with Amazon S3 buckets.
• The manifest file must be in the same AWS Region as the data files, but it doesn't need to be in the
same location as the data files. It can be stored in any Amazon S3 bucket that is accessible to the AWS
Identity and Access Management (IAM) role that you assigned to Ground Truth when you created the
labeling job.

Note
3D point cloud and video frame task types have different input manifest requirements and
attributes.
For 3D point cloud task types, refer to Create an Input Manifest File for a 3D Point Cloud
Labeling Job (p. 748).
For video frame task types, refer to Create a Video Frame Input Manifest File (p. 775).

The manifest is a UTF-8 encoded file in which each line is a complete and valid JSON object. Each line is
delimited by a standard line break, \n or \r\n. Because each line must be a valid JSON object, you can't
have unescaped line break characters. For more information about data format, see JSON Lines.

Each JSON object in the manifest file can be no larger than 100,000 characters. No single attribute
within an object can be larger than 20,000 characters. Attribute names can't begin with $ (dollar sign).

Each JSON object in the manifest file must contain one of the following keys: source-ref or source.
The value of the keys are interpreted as follows:

• source-ref – The source of the object is the Amazon S3 object specified in the value. Use this value
when the object is a binary object, such as an image.
• source – The source of the object is the value. Use this value when the object is a text value.

The following is an example of a manifest file for files stored in an Amazon S3 bucket:

{"source-ref": "S3 bucket location 1"}


{"source-ref": "S3 bucket location 2"}
...
{"source-ref": "S3 bucket location n"}

Use the source-ref key for image files for bounding box, image classification (single and multi-label),
semantic segmentation, and video clips for video classification labeling jobs. 3D point cloud and video
frame labeling jobs also use the source-ref key but these labeling jobs require additional information
in the input manifest file. For more information see 3D Point Cloud Input Data (p. 746) and Video
Frame Input Data (p. 770).

The following is an example of a manifest file with the input data stored in the manifest:


{"source": "Lorem ipsum dolor sit amet"}


{"source": "consectetur adipiscing elit"}
...
{"source": "mollit anim id est laborum"}

Use the source key for single and multi-label text classification and named entity recognition labeling
jobs.

You can include other key-value pairs in the manifest file. These pairs are passed to the output file
unchanged. This is useful when you want to pass information between your applications. For more
information, see Output Data (p. 776).

Automated Data Setup


You can use the automated data setup to create manifest files for your labeling jobs in the Ground Truth
console using images, videos, video frames, text (.txt) files, and comma-separated value (.csv) files stored
in Amazon S3. When you use automated data setup, you specify an Amazon S3 location where your
input data is stored and the input data type, and Ground Truth looks for the files that match that type in
the location you specify.
Note
Ground Truth does not use an AWS KMS key to access your input data or write the input
manifest file in the Amazon S3 location that you specify. The user or role that creates the
labeling job must have permissions to access your input data objects in Amazon S3.

Before using the following procedure, ensure that your input images or files are correctly formatted:

• Image files – Image files must comply with the size and resolution limits listed in the tables found in
Input File Size Quota (p. 742).
• Text files – Text data can be stored in one or more .txt files. Each item that you want labeled must be
separated by a standard line break.
• CSV files – Text data can be stored in one or more .csv files. Each item that you want labeled must be
in a separate row.
• Videos – Video files can be any of the following formats: .mp4, .ogg, and .webm. If you want to
extract video frames from your video files for object detection or object tracking, see Provide Video
Files (p. 772).
• Video frames – Video frames are images extracted from a video. All images extracted from a single
video are referred to as a sequence of video frames. Each sequence of video frames must have a unique
prefix key in Amazon S3. See Provide Video Frames (p. 771). For this data type, see Automated
Video Frame Input Data Setup (p. 773).

Important
For video frame object detection and video frame object tracking labeling jobs, see Automated
Video Frame Input Data Setup (p. 773) to learn how to use the automated data setup.

Use these instructions to automatically set up your input dataset connection with Ground Truth.

Automatically connect your data in Amazon S3 with Ground Truth

1. Navigate to the Create labeling job page in the Amazon SageMaker console at https://
console.aws.amazon.com/sagemaker/.

This link puts you in the North Virginia (us-east-1) AWS Region. If your input data is in an Amazon S3
bucket in another Region, switch to that Region. To change your AWS Region, on the navigation bar,
choose the name of the currently displayed Region.
2. Select Create labeling job.
3. Enter a Job name.


4. In the section Input data setup, select Automated data setup.


5. Enter an Amazon S3 URI for S3 location for input datasets.
6. Specify your S3 location for output datasets. This is where your output data is stored.
7. Choose your Data type using the dropdown list.
8. Use the dropdown menu under IAM Role to select an execution role. If you select Create a new role,
specify the Amazon S3 buckets that you want to grant this role permission to access. This role must
have permission to access the S3 buckets you specified in steps 5 and 6.
9. Select Complete data setup.

When you use the automated data setup for image data, Ground Truth creates a file named
dataset-YYMMDDTHHmmSS.manifest in the Amazon S3 bucket you specify (for example, example-groundtruth-
images), where YYMMDDTHHmmSS indicates the year (YY), month (MM), and day (DD), and the time in hours (HH),
minutes (mm), and seconds (ss), that the input manifest file was created.

Supported Data Formats


When you create an input manifest file for a built-in task type manually, your input data must be in one
of the following supported file formats for the respective input data type. To learn about automated data
setup, see Automated Data Setup (p. 736).
Tip
When you use the automated data setup, additional data formats can be used to generate an
input manifest file for video frame and text based task types.

Task Types: Bounding Box, Semantic Segmentation, Image Classification (Single Label and Multi-label), Verify and Adjust Labels
Input Data Type: Image
Supported Formats: .jpg, .jpeg, .png
Example Input Manifest Line: {"source-ref": "s3://DOC-EXAMPLE-BUCKET1/example-image.png"}

Task Types: Named Entity Recognition, Text Classification (Single and Multi-Label)
Input Data Type: Text
Supported Formats: Raw text
Example Input Manifest Line: {"source": "Lorem ipsum dolor sit amet"}

Task Types: Video Classification
Input Data Type: Video clips
Supported Formats: .mp4, .ogg, and .webm
Example Input Manifest Line: {"source-ref": "s3:///example-video.mp4"}

Task Types: Video Frame Object Detection, Video Frame Object Tracking (bounding boxes, polylines, polygons, or keypoints)
Input Data Type: Video frames and video frame sequence files (for Object Tracking)
Supported Formats: Video frames: .jpg, .jpeg, .png. Sequence files: .json
Example Input Manifest Line: Refer to Create a Video Frame Input Manifest File (p. 775).

Task Types: 3D Point Cloud Semantic Segmentation, 3D Point Cloud Object Detection, 3D Point Cloud Object Tracking
Input Data Type: Point clouds and point cloud sequence files (for Object Tracking)
Supported Formats: Point clouds: Binary pack format and ASCII (for more information, see Accepted Raw 3D Data Formats (p. 746)). Sequence files: .json
Example Input Manifest Line: Refer to Create an Input Manifest File for a 3D Point Cloud Labeling Job (p. 748).

Ground Truth Streaming Labeling Jobs


If you want to perpetually send new data objects to Amazon SageMaker Ground Truth to be labeled, use
a streaming labeling job. Streaming labeling jobs allow you to:

• Send new dataset objects to workers in real time using a perpetually running labeling job. Workers
continuously receive new data objects to label as long as the labeling job is active and new objects are
being sent to it.
• Gain visibility into the number of objects that have been queued and are waiting to be labeled. Use
this information to control the flow of data objects sent to your labeling job.
• Receive label data for individual data objects in real time as workers finish labeling them.

Ground Truth streaming labeling jobs remain active until they are manually stopped or have been idle
for more than 10 days. You can intermittently send new data objects to workers while the labeling job is
active.

If you are a new user of Ground Truth streaming labeling jobs, it is recommended that you review How It
Works (p. 738).

Use Create a Streaming Labeling Job (p. 714) to learn how to create a streaming labeling job.
Note
Ground Truth streaming labeling jobs are only supported through the SageMaker API.

Topics
• How It Works (p. 738)
• Send Data to a Streaming Labeling Job (p. 738)
• Manage Labeling Requests with an Amazon SQS Queue (p. 740)
• Receive Output Data from a Streaming Labeling Job (p. 740)
• Duplicate Message Handling (p. 740)

How It Works

When you create a Ground Truth streaming labeling job, the job remains active until it is manually
stopped, remains idle for more than 10 days, or is unable to access input data sources. You can
intermittently send new data objects to workers while it is active. A worker can continue to receive
new data objects in real time as long as the total number of tasks currently available to the worker
is less than the value in MaxConcurrentTaskCount. Otherwise, the data object is sent to a queue
that Ground Truth creates on your behalf in Amazon Simple Queue Service (Amazon SQS) for later
processing. These tasks are sent to workers as soon as the total number of tasks currently available to
a worker falls below MaxConcurrentTaskCount. If a data object is not sent to a worker after 14 days,
it expires. You can view the number of tasks pending in the queue and adjust the number of objects
you send to the labeling job. For example, you may decrease the speed at which you send objects to the
labeling job if the backlog of pending objects moves above a threshold.

Send Data to a Streaming Labeling Job

You can optionally submit input data to a streaming labeling job one time when you create the labeling
job using an input manifest file. Once the labeling job has started and the state is InProgress, you
can submit new data objects to your labeling job in real time using your Amazon SNS input topic and
Amazon S3 event notifications.

Submit Data Objects When you Start the Labeling Job (One Time):

• Use an Input Manifest File – You can optionally specify an input manifest file Amazon S3 URI in
ManifestS3Uri when you create the streaming labeling job. Ground Truth sends each data object in
the manifest file to workers for labeling as soon as the labeling job starts. To learn more, see Create a
Manifest File (Optional) (p. 717).

After you submit a request to create the streaming labeling job, its status will be Initializing.
Once the labeling job is active, the state changes to InProgress and you can start using the real-time
options to submit additional data objects for labeling.

Submit Data Objects in Real Time:

• Send data objects using Amazon SNS messages – You can send Ground Truth new data objects to
label by sending an Amazon SNS message. You will send this message to an Amazon SNS input topic
that you create and specify when you create your streaming labeling job. For more information, see
Send Data Objects Using Amazon SNS (p. 739).
• Send data objects by placing them in an Amazon S3 bucket – Each time you add a new data object
to an Amazon S3 bucket, you can prompt Ground Truth to process that object for labeling. To do this,
you add an event notification to the bucket so that it notifies your Amazon SNS input topic each time
a new object is added to (or created in) that bucket. For more information, see Send Data Objects
using Amazon S3 (p. 740). This option is not available for text-based labeling jobs such as text
classification and named entity recognition.
Important
If you use the Amazon S3 configuration, do not use the same Amazon S3 location for your
input data configuration and your output data. You specify the S3 prefix for your output data
when you create a labeling job.

Send Data Objects Using Amazon SNS

You can send data objects to your streaming labeling job using Amazon Simple Notification Service
(Amazon SNS). Amazon SNS is a web service that coordinates and manages the delivery of messages
to and from endpoints (for example, an email address or AWS Lambda function). An Amazon SNS topic
acts as a communication channel between two or more endpoints. You use Amazon SNS to send, or
publish, new data objects to the topic specified in the CreateLabelingJob parameter SnsTopicArn in
InputConfig. The format of these messages is the same as a single line from an input manifest file.

For example, you may send a piece of text to an active text classification labeling job by publishing it to
your input topic. The message that you publish may look similar to the following:

{"source": "Lorem ipsum dolor sit amet"}

To send a new image object to an image classification labeling job, your message may look similar to the
following:

{"source-ref": "s3://awsexamplebucket/example-image.jpg"}

Note
You can also include custom deduplication IDs and deduplication keys in your Amazon SNS
messages. To learn more, see Duplicate Message Handling (p. 740).

When Ground Truth creates your streaming labeling job, it subscribes to your Amazon SNS input topic.


Send Data Objects using Amazon S3

You can send one or more new data objects to a streaming labeling job by placing them in an Amazon
S3 bucket that is configured with an Amazon SNS event notification. You can set up an event to notify
your Amazon SNS input topic anytime a new object is created in your bucket. You must specify this same
Amazon SNS input topic in the CreateLabelingJob parameter SnsTopicArn in InputConfig.

Anytime you configure an Amazon S3 bucket to send notifications to Amazon SNS, Amazon S3
publishes a test event, "s3:TestEvent", to ensure that the topic exists and that the owner of the
Amazon S3 bucket specified has permission to publish to the specified topic. It is recommended that you
set up your Amazon S3 connection with Amazon SNS before starting a streaming labeling job. If you do
not, this test event may register as a data object and be sent to Ground Truth for labeling.
Important
If you use the Amazon S3 configuration, do not use the same Amazon S3 location for your input
data configuration and your output data. You specify the S3 prefix for your output data when
you create a labeling job.
For image-based labeling jobs, Ground Truth requires all S3 buckets to have a CORS policy
attached. To learn more, see CORS Permission Requirement (p. 816).

Once you have configured your Amazon S3 bucket and created your labeling job, you can add objects
to your bucket and Ground Truth either sends that object to workers or places it on your Amazon SQS
queue.

To learn more, see Set up Amazon S3 Bucket Event Notifications (p. 717).
Important
This option is not available for text-based labeling jobs such as text classification and named
entity recognition.

Manage Labeling Requests with an Amazon SQS Queue

When Ground Truth creates your streaming labeling job, it creates an Amazon SQS queue in the AWS
account used to create the labeling job. The queue name is GroundTruth-labeling_job_name, where
labeling_job_name is the name of your labeling job in lowercase letters. When you send data objects
to your labeling job, Ground Truth either sends the data objects directly to workers or places the task in
your queue to be processed at a later time. If a data object is not sent to a worker after 14 days, it expires
and is removed from the queue. You can set up an alarm in Amazon SQS to detect when objects expire
and use this mechanism to control the volume of objects you send to your labeling job.
Important
Modifying, deleting, or sending objects directly to the Amazon SQS queue associated with your
streaming labeling job may lead to job failures.
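
To monitor the backlog, you can read the queue's approximate depth, as in the following Boto3 sketch (the labeling job name is a placeholder); reading attributes does not modify the queue.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Ground Truth names the queue GroundTruth-<labeling_job_name>, with the
# labeling job name in lowercase letters.
queue_url = sqs.get_queue_url(QueueName="GroundTruth-example-labeling-job")["QueueUrl"]
attributes = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages"],
)
print(attributes["Attributes"]["ApproximateNumberOfMessages"])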

Receive Output Data from a Streaming Labeling Job

Your Amazon S3 output bucket is periodically updated with new output data from your streaming
labeling job.

Optionally, you can specify an Amazon SNS output topic. Each time a worker submits a labeled object, a
notification with the output data is sent to that topic. You can subscribe an endpoint to your SNS output
topic to receive notifications or trigger events when you receive output data from a labeling task. Use an
Amazon SNS output topic if you want to do real-time chaining to another streaming job and receive an
Amazon SNS notification each time a data object is submitted by a worker.

To learn more, see Subscribe an Endpoint to Your Amazon SNS Output Topic (p. 716).

Duplicate Message Handling

For data objects sent in real time, Ground Truth guarantees idempotency by ensuring each unique object
is only sent for labeling once, even if the input message referring to that object is received multiple
times (duplicate messages). To do this, each data object sent to a streaming labeling job is assigned a
deduplication ID, which is identified with a deduplication key.

If you send your requests to label data objects directly through your Amazon SNS input topic using
Amazon SNS messages, you can optionally choose a custom deduplication key and deduplication IDs
for your objects. For more information, see Specify A Deduplication Key and ID in an Amazon SNS
Message (p. 741).

If you do not provide your own deduplication key, or if you use the Amazon S3 configuration to send
data objects to your labeling job, Ground Truth uses one of the following for the deduplication ID:

• For messages sent directly to your Amazon SNS input topic, Ground Truth uses the SNS message ID.
• For messages that come from an Amazon S3 configuration, Ground Truth creates a deduplication ID by
combining the Amazon S3 URI of the object with the sequencer token in the message.

Specify A Deduplication Key and ID in an Amazon SNS Message

When you send a data object to your streaming labeling job using an Amazon SNS message, you have
the option to specify your deduplication key and deduplication ID in one of the following ways. In all of
these scenarios, identify your deduplication key with dataset-objectid-attribute-name.

Bring Your Own Deduplication Key and ID

Create your own deduplication key and deduplication ID by configuring your Amazon SNS message as
follows. Replace byo-key with your key and UniqueId with the deduplication ID for that data object.

{
    "source-ref": "s3://bucket/prefix/object1",
    "dataset-objectid-attribute-name": "byo-key",
    "byo-key": "UniqueId"
}

Your deduplication key can be up to 140 characters. Supported patterns include: "^[$a-zA-Z0-9](-*[a-zA-Z0-9])*".

Your deduplication ID can be up to 1,024 characters. Supported patterns include: "^(https|s3)://([^/]+)/?(.*)$".

Use an Existing Key for your Deduplication Key

You can use an existing key in your message as the deduplication key. When you do this, the value
associated with that key is used for the deduplication ID.

For example, you can use the source-ref key as your deduplication key by formatting your
message as follows:

{
    "source-ref": "s3://bucket/prefix/object1",
    "dataset-objectid-attribute-name": "source-ref"
}

In this example, Ground Truth uses "s3://bucket/prefix/object1" for the deduplication ID.

Find Deduplication Key and ID in Your Output Data

You can see the deduplication key and ID in your output data. The deduplication key is identified by
dataset-objectid-attribute-name.


When you use your own custom deduplication key, your output contains something similar to the
following:

"dataset-objectid-attribute-name": "byo-key",
"byo-key": "UniqueId",

When you do not specify a key, you can find the deduplication ID that Ground Truth assigned to
your data object as follows. The $label-attribute-name-object-id parameter identifies your
deduplication ID.

{
"source-ref":"s3://bucket/prefix/object1",
"dataset-objectid-attribute-name":"$label-attribute-name-object-id",
"label-attribute-name" :0,
"label-attribute-name-metadata": {...},
"$label-attribute-name-object-id":"<service-generated-key>"
}

For <service-generated-key>, if the data object came through an Amazon S3 configuration, Ground
Truth adds a unique value used by the service and emits a new field keyed by $sequencer, which shows
the Amazon S3 sequencer used. If the object was sent directly to Amazon SNS, Ground Truth uses the SNS message ID.
Note
Do not use the $ character in your label attribute name.

Input Data Quotas


Input datasets used in semantic segmentation labeling jobs have a quota of 20,000 items. For all other
labeling job types, the dataset size quota is 100,000 items. To request an increase to the quota for
labeling jobs other than semantic segmentation jobs, review the procedures in AWS Service Quotas to
request a quota increase.

Input image data for active and non-active learning labeling jobs must not exceed size and resolution
quotas. Active learning refers to labeling jobs that use automated data labeling. Non-active learning refers
to labeling jobs that don't use automated data labeling.

Additional quotas apply for label categories for all task types, and for input data and labeling category
attributes for 3D point cloud and video frame task types.

Input File Size Quota

Input files can't exceed the following size quotas for both active and non-active learning labeling jobs.
There is no input file size quota for videos used in video classification labeling jobs.

Labeling Job Task Type Input File Size Quota

Image classification 40 MB

Bounding box (Object detection) 40 MB

Semantic segmentation 40 MB

Bounding box (Object detection) label adjustment 40 MB

Semantic segmentation label adjustment 40 MB

Bounding box (Object detection) label verification 40 MB

Semantic segmentation label verification 40 MB


Input Image Resolution Quotas


Image file resolution refers to the number of pixels in an image, and determines the amount of detail
an image holds. Image resolution quotas differ depending on the labeling job type and the SageMaker
built-in algorithm used. The following table lists the resolution quotas for images used in active and
non-active learning labeling jobs.

Labeling Job Task Type                      Resolution Quota - Non-Active Learning   Resolution Quota - Active Learning
Image classification                        100 million pixels                       3840 x 2160 pixels (4K)
Bounding box (Object detection)             100 million pixels                       3840 x 2160 pixels (4K)
Semantic segmentation                       100 million pixels                       1920 x 1080 pixels (1080p)
Object detection label adjustment           100 million pixels                       3840 x 2160 pixels (4K)
Semantic segmentation label adjustment      100 million pixels                       1920 x 1080 pixels (1080p)
Object detection label verification         100 million pixels                       Not available
Semantic segmentation label verification    100 million pixels                       Not available

Label Category Quotas


Each labeling job task type has a quota for the number of label categories you can specify. Workers
select label categories to create annotations. For example, you may specify label categories car,
pedestrian, and biker when creating a bounding box labeling job and workers will select the car category
before drawing bounding boxes around cars.
Important
Label category names cannot exceed 256 characters.
All label categories must be unique. You cannot specify duplicate label categories.

The following label category limits apply to labeling jobs. Quotas for label categories depend on
whether you use the SageMaker API operation CreateLabelingJob or the console to create a labeling
job.

Labeling Job Task Type                             Label Category Quota - API   Label Category Quota - Console
Image classification (Multi-label)                 50                           50
Image classification (Single label)                Unlimited                    30
Bounding box (Object detection)                    50                           50
Label verification                                 Unlimited                    30
Semantic segmentation (with active learning)       20                           10
Semantic segmentation (without active learning)    Unlimited                    10
Named entity recognition                           Unlimited                    30
Text classification (Multi-label)                  50                           50
Text classification (Single label)                 Unlimited                    30
Video classification                               30                           30
Video frame object detection                       30                           30
Video frame object tracking                        30                           30
3D point cloud object detection                    30                           30
3D point cloud object tracking                     30                           30
3D point cloud semantic segmentation               30                           30

3D Point Cloud and Video Frame Labeling Job Quotas

The following quotas apply for 3D point cloud and video frame labeling job input data.

Labeling Job Task Type             Input Data Quota
Video frame object detection       2,000 video frames (images) per sequence
Video frame object detection       10 video frame sequences per manifest file
Video frame object tracking        2,000 video frames (images) per sequence
Video frame object tracking        10 video frame sequences per manifest file
3D point cloud object detection    100,000 point cloud frames per labeling job
3D point cloud object tracking     100,000 point cloud frame sequences per labeling job
3D point cloud object tracking     500 point cloud frames in each sequence file

When you create a video frame or 3D point cloud labeling job, you can add one or more label category
attributes to each label category that you specify to have workers provide more information about an
annotation.

Each label category attribute has a single label category attribute name, and a list of one or more
options (values) to choose from. To learn more, see Worker User Interface (UI) (p. 631) for 3D point
cloud labeling jobs and Worker User Interface (UI) (p. 577) for video frame labeling jobs.

The following quotas apply to the number of label category attributes names and values you can specify
for labeling jobs.

Labeling Job Task Type                  Label Category Attribute (name) Quota   Label Category Attribute Values Quota
Video frame object detection            10                                      10
Video frame object tracking             10                                      10
3D point cloud object detection         10                                      10
3D point cloud object tracking          10                                      10
3D point cloud semantic segmentation    10                                      10

Filter and Select Data for Labeling


You can use the Amazon SageMaker console to select a portion of your dataset for labeling. The data
must be stored in an Amazon S3 bucket. You have three options:

• Use the full dataset.
• Choose a randomly selected sample of the dataset.
• Specify a subset of the dataset using a query.

The following options are available in the Labeling jobs section of the SageMaker console after selecting
Create labeling job. To learn how to create a labeling job in the console, see Getting started (p. 527).
To configure the dataset that you use for labeling, in the Job overview section, choose Additional
configuration.

Use the Full Dataset


When you choose to use the Full dataset, you must provide a manifest file for your data objects. You
can provide the path of the Amazon S3 bucket that contains the manifest file or use the SageMaker
console to create the file. To learn how to create a manifest file using the console, see Automated Data
Setup (p. 736).

Choose a Random Sample


When you want to label a random subset of your data, select Random sample. The dataset is stored in
the Amazon S3 bucket specified in the Input dataset location field.

After you have specified the percentage of data objects that you want to include in the sample, choose
Create subset. SageMaker randomly picks the data objects for your labeling job. After the objects are
selected, choose Use this subset.

SageMaker creates a manifest file for the selected data objects. It also modifies the value in the Input
dataset location field to point to the new manifest file.

Specify a Subset
You can specify a subset of your data objects using an Amazon S3 SELECT query on the object file
names.

The SELECT statement of the SQL query is defined for you. You provide the WHERE clause to specify
which data objects should be returned.

For more information about the Amazon S3 SELECT statement, see Selecting Content from Objects.

Choose Create subset to start the selection, and then choose Use this subset to use the selected data.

SageMaker creates a manifest file for the selected data objects. It also updates the value in the Input
dataset location field to point to the new manifest file.


3D Point Cloud Input Data


To create a 3D point cloud labeling job, you must create an input manifest file. Use this topic to learn the
formatting requirements of the input manifest file for each task type. To learn about the raw input data
formats Ground Truth accepts for 3D point cloud labeling jobs, see the section Accepted Raw 3D Data
Formats (p. 746).

Use your labeling job task type to choose a topic in Create an Input Manifest File for a 3D Point Cloud
Labeling Job (p. 748) to learn about the formatting requirements for each line of your input manifest
file.

Topics
• Accepted Raw 3D Data Formats (p. 746)
• Create an Input Manifest File for a 3D Point Cloud Labeling Job (p. 748)
• Understand Coordinate Systems and Sensor Fusion (p. 761)

Accepted Raw 3D Data Formats


Ground Truth uses your 3D point cloud data to render 3D scenes that workers annotate. This section
describes the raw data formats that are accepted for point cloud data and sensor fusion data for a point
cloud frame. To learn how to create an input manifest file to connect your raw input data files with
Ground Truth, see Create an Input Manifest File for a 3D Point Cloud Labeling Job (p. 748).

For each frame, Ground Truth supports Compact Binary Pack Format (.bin) and ASCII (.txt) files. These
files contain information about the location (x, y, and z coordinates) of all points that make up that
frame, and, optionally, information about the pixel color of each point for colored point clouds. When
you create a 3D point cloud labeling job input manifest file, you can specify the format of your raw data
in the format parameter.

The following table lists elements that Ground Truth supports in point cloud frame files to describe
individual points.

Symbol   Value
x        The x coordinate of the point.
y        The y coordinate of the point.
z        The z coordinate of the point.
i        The intensity of the point.
r        The red color channel component. An 8-bit value (0-255).
g        The green color channel component. An 8-bit value (0-255).
b        The blue color channel component. An 8-bit value (0-255).

Ground Truth assumes the following about your input data:

• All of the positional coordinates (x, y, z) are in meters.
• All of the pose headings (qx, qy, qz, qw) are measured in spatial quaternions.


Compact Binary Pack Format

The Compact Binary Pack Format represents a point cloud as an ordered stream of points.
Each point in the stream is an ordered binary pack of 4-byte float values in some variant of the form
xyzirgb. The x, y, and z elements are required and additional information about that pixel can be
included in a variety of ways using i, r, g, and b.

To use a binary file to input point cloud frame data to a Ground Truth 3D point cloud labeling job, enter
binary/ in the format parameter for your input manifest file, followed by the order of elements
in each binary pack. For example, you may enter one of the following for the format parameter.

• binary/xyzi – When you use this format, your point element stream would be in the following
order: x1y1z1i1x2y2z2i2...
• binary/xyzrgb – When you use this format, your point element stream would be in the following
order: x1y1z1r1g1b1x2y2z2r2g2b2...
• binary/xyzirgb – When you use this format, your point element stream would be in the following
order: x1y1z1i1r1g1b1x2y2z2i2r2g2b2...

When you use a binary file for your point cloud frame data, if you do not enter a value for format, the
default pack format binary/xyzi is used.
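As a sketch of what this layout means in practice, the following Python snippet writes an N x 4 array of points to a binary/xyzi file as a contiguous stream of 4-byte floats. The file name and point values are illustrative.

import numpy as np

# Each row is one point: x, y, z, intensity.
points = np.array([
    [1.0, 2.0, 3.0, 0.5],
    [4.0, 5.0, 6.0, 0.7],
], dtype=np.float32)

# Writing row-major float32 data produces the stream x1y1z1i1x2y2z2i2...
points.tofile("frame1.bin")

# Reading it back: a flat float32 stream reshaped to N x 4.
restored = np.fromfile("frame1.bin", dtype=np.float32).reshape(-1, 4)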

ASCII Format

The ASCII format uses a text file to represent a point cloud, where each line in the ASCII point cloud file
represents a single point. Each point is a line in the text file containing whitespace-separated values,
each of which is the ASCII representation of a 4-byte float. The x, y, and z elements are required for each
point, and additional information about that point can be included in a variety of ways using i, r, g, and b.

To use a text file to input point cloud frame data to a Ground Truth 3D point cloud labeling job, enter
text/ in the format parameter for your input manifest file, followed by the order of point
elements on each line.

For example, if you enter text/xyzi for format, your text file for each point cloud frame should look
similar to the following:

x1 y1 z1 i1
x2 y2 z2 i2
...
...

If you enter text/xyzrgb, your text file should look similar to the following:

x1 y1 z1 r1 g1 b1
x2 y2 z2 r2 g2 b2
...
...

When you use a text file for your point cloud frame data, if you do not enter a value for format, the
default format text/xyzi will be used.
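Similarly, a text/xyzi frame file can be produced by writing one whitespace-separated line per point. The following is a minimal sketch using NumPy; the file name and point values are illustrative.

import numpy as np

points = np.array([
    [1.0, 2.0, 3.0, 0.5],
    [4.0, 5.0, 6.0, 0.7],
], dtype=np.float32)

# One point per line, values separated by spaces: "x y z i"
np.savetxt("frame1.txt", points, fmt="%f", delimiter=" ")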

Point Cloud Resolution Limits

Ground Truth does not have a resolution limit for 3D point cloud frames. However, we recommend that
you limit each point cloud frame to 500K points for optimal performance. When Ground Truth renders
the 3D point cloud visualization, it must be viewable on your workers' computers, which depends on
workers' computer hardware. Point cloud frames that are larger than 1 million points may not render on
standard machines, or may take too long to load.
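If your frames exceed this recommendation, one way to reduce them is to randomly subsample the point array before writing the frame file. The following is a minimal sketch, assuming points is an N x 4 NumPy array; the function name and threshold are illustrative.

import numpy as np

def downsample(points, max_points=500_000):
    """Randomly subsample a point cloud to at most max_points points."""
    if len(points) <= max_points:
        return points
    idx = np.random.choice(len(points), size=max_points, replace=False)
    return points[idx]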


Create an Input Manifest File for a 3D Point Cloud Labeling Job


When you create a labeling job, you provide an input manifest file where each line of the manifest
describes a unit of task to be completed by annotators. The format of your input manifest file depends
on your task type.

• If you are creating a 3D point cloud object detection or semantic segmentation labeling job, each
line in your input manifest file contains information about a single 3D point cloud frame. This is called
a point cloud frame input manifest. To learn more, see Create a Point Cloud Frame Input Manifest
File (p. 748).
• If you are creating a 3D point cloud object tracking labeling job, each line of your input manifest file
contains a sequence of 3D point cloud frames and associated data. This is called a point cloud sequence
input manifest. To learn more, see Create a Point Cloud Sequence Input Manifest (p. 754).

Create a Point Cloud Frame Input Manifest File


The manifest is a UTF-8 encoded file in which each line is a complete and valid JSON object. Each line is
delimited by a standard line break, \n or \r\n. Because each line must be a valid JSON object, you can't
have unescaped line break characters. In the single-frame input manifest file, each line in the manifest
contains data for a single point cloud frame. The point cloud frame data can either be stored in binary or
ASCII format (see Accepted Raw 3D Data Formats (p. 746)). This is the manifest file formatting required
for 3D point cloud object detection and semantic segmentation. Optionally, you can also provide camera
sensor fusion data for each point cloud frame.

Ground Truth supports point cloud and video camera sensor fusion in the world coordinate
system (p. 761) for all modalities. If you can obtain your 3D sensor extrinsic (like a LiDAR extrinsic),
we recommend that you transform 3D point cloud frames into the world coordinate system using the
extrinsic. For more information, see Sensor Fusion (p. 763).

However, if you cannot obtain a point cloud in world coordinate system, you can provide coordinates
in the original coordinate system that the data was captured in. If you are providing camera data for
sensor fusion, it is recommended that you provide LiDAR sensor and camera pose in the world coordinate
system.

To create a single-frame input manifest file, you will identify the location of each point cloud frame that
you want workers to label using the source-ref key. Additionally, you must use the
source-ref-metadata key to identify the format of your dataset, a timestamp for that frame, and, optionally, sensor
fusion data and video camera images.

The following example demonstrates the syntax used for an input manifest file for a single-frame point
cloud labeling job. The example includes two point cloud frames. For details about each parameter, see
the table following this example.
Important
Each line in your input manifest file must be in JSON Lines format. The following code block
shows an input manifest file with two JSON objects. Each JSON object is used to point to and
provide details about a single point cloud frame. The JSON objects have been expanded for
readability, but you must minimize each JSON object to fit on a single line when creating an
input manifest file. An example is provided under this code block.

{
"source-ref": "s3://awsexamplebucket/examplefolder/frame1.bin",
"source-ref-metadata":{
"format": "binary/xyzi",
"unix-timestamp": 1566861644.759115,
"ego-vehicle-pose":{
"position": {
"x": -2.7161461413869947,
"y": 116.25822288149078,

748
Amazon SageMaker Developer Guide
Use Input and Output Data

"z": 1.8348751887989483
},
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
}
},
"prefix": "s3://awsexamplebucket/lidar_singleframe_dataset/someprefix/",
"images": [
{
"image-path": "images/frame300.bin_camera0.jpg",
"unix-timestamp": 1566861644.759115,
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,
"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera-model": "pinhole"
}]
}
}
{
"source-ref": "s3://awsexamplebucket/examplefolder/frame2.bin",
"source-ref-metadata":{
"format": "binary/xyzi",
"unix-timestamp": 1566861632.759133,
"ego-vehicle-pose":{
"position": {
"x": -2.7161461413869947,
"y": 116.25822288149078,
"z": 1.8348751887989483
},
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
}
},
"prefix": "s3://awsexamplebucket/lidar_singleframe_dataset/someprefix/",
"images": [
{
"image-path": "images/frame300.bin_camera0.jpg",
"unix-timestamp": 1566861644.759115,
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,

749
Amazon SageMaker Developer Guide
Use Input and Output Data

"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera-model": "pinhole"
}]
}
}

When you create an input manifest file, you must collapse your JSON objects to fit on a single line. For
example, the code block above would appear as follows in an input manifest file:

{"source-ref":"s3://awsexamplebucket/examplefolder/frame1.bin","source-ref-metadata":
{"format":"binary/xyzi","unix-timestamp":1566861644.759115,"ego-vehicle-pose":{"position":
{"x":-2.7161461413869947,"y":116.25822288149078,"z":1.8348751887989483},"heading":
{"qx":-0.02111296123795955,"qy":-0.006495469416730261,"qz":-0.008024565904865688,"qw":0.999718119229808
awsexamplebucket/lidar_singleframe_dataset/someprefix/","images":
[{"image-path":"images/frame300.bin_camera0.jpg","unix-
timestamp":1566861644.759115,"fx":847.7962624528487,"fy":850.0340893791985,"cx":576.2129134707038,"cy":
{"x":-2.2722515189268138,"y":116.86003310568965,"z":1.454614668542299},"heading":
{"qx":0.7594754093069037,"qy":0.02181790885672969,"qz":-0.02461725233103356,"qw":-0.6496916273040025},"
model":"pinhole"}]}}
{"source-ref":"s3://awsexamplebucket/examplefolder/frame2.bin","source-ref-metadata":
{"format":"binary/xyzi","unix-timestamp":1566861632.759133,"ego-vehicle-pose":{"position":
{"x":-2.7161461413869947,"y":116.25822288149078,"z":1.8348751887989483},"heading":
{"qx":-0.02111296123795955,"qy":-0.006495469416730261,"qz":-0.008024565904865688,"qw":0.999718119229808
awsexamplebucket/lidar_singleframe_dataset/someprefix/","images":
[{"image-path":"images/frame300.bin_camera0.jpg","unix-
timestamp":1566861644.759115,"fx":847.7962624528487,"fy":850.0340893791985,"cx":576.2129134707038,"cy":
{"x":-2.2722515189268138,"y":116.86003310568965,"z":1.454614668542299},"heading":
{"qx":0.7594754093069037,"qy":0.02181790885672969,"qz":-0.02461725233103356,"qw":-0.6496916273040025},"
model":"pinhole"}]}}

The following table shows the parameters you can include in your input manifest file:

source-ref
  Required: Yes
  Accepted Values: String. Accepted string value format: s3://<bucket-name>/<folder-name>/point-cloud-frame-file
  Description: The Amazon S3 location of a single point cloud frame.

source-ref-metadata
  Required: Yes
  Accepted Values: JSON object. Accepted parameters: format, unix-timestamp, ego-vehicle-pose, position, prefix, images
  Description: Use this parameter to include additional information about the point cloud in source-ref, and to provide camera data for sensor fusion.

format
  Required: No
  Accepted Values: String. Accepted string values: "binary/xyz", "binary/xyzi", "binary/xyzrgb", "binary/xyzirgb", "text/xyz", "text/xyzi", "text/xyzrgb", "text/xyzirgb"
  Default Values: When the file identified in source-ref has a .bin extension, binary/xyzi. When the file identified in source-ref has a .txt extension, text/xyzi.
  Description: Use this parameter to specify the format of your point cloud data. For more information, see Accepted Raw 3D Data Formats (p. 746).

unix-timestamp
  Required: Yes
  Accepted Values: Number. A unix timestamp.
  Description: The unix timestamp is the number of seconds since January 1st, 1970 until the UTC time that the data was collected by a sensor.

ego-vehicle-pose
  Required: No
  Accepted Values: JSON object
  Description: The pose of the device used to collect the point cloud data. For more information about this parameter, see Include Vehicle Pose Information in Your Input Manifest (p. 752).

prefix
  Required: No
  Accepted Values: String. Accepted string value format: s3://<bucket-name>/<folder-name>/ (the prefix must end with a forward slash: /)
  Description: The location in Amazon S3 where your metadata, such as camera images, is stored for this frame.

images
  Required: No
  Accepted Values: List
  Description: A list of parameters describing color camera images used for sensor fusion. You can include up to 8 images in this list. For more information about the parameters required for each image, see Include Camera Data in Your Input Manifest (p. 753).

Include Vehicle Pose Information in Your Input Manifest

Use the ego-vehicle location to provide information about the location of the vehicle used to capture
point cloud data. Ground Truth uses this information to compute the LiDAR extrinsic matrix.

Ground Truth uses extrinsic matrices to project labels to and from the 3D scene and 2D images. For more
information, see Sensor Fusion (p. 763).

The following table provides more information about the position and orientation (heading)
parameters that are required when you provide ego-vehicle information.

position
  Required: Yes
  Accepted Values: JSON object. Required parameters: x, y, and z. Enter numbers for these parameters.
  Description: The translation vector of the ego vehicle in the world coordinate system.

heading
  Required: Yes
  Accepted Values: JSON object. Required parameters: qx, qy, qz, and qw. Enter numbers for these parameters.
  Description: The orientation of the frame of reference of the device or sensor mounted on the vehicle sensing the surroundings, measured in quaternions (qx, qy, qz, qw), in the world coordinate system.


Include Camera Data in Your Input Manifest

If you want to include video camera data with a frame, use the following parameters to provide
information about each image. The Required column below applies when the images parameter is
included in the input manifest file under source-ref-metadata. You are not required to include
images in your input manifest file.

If you include camera images, you must include information about the camera position and heading
used to capture the images, in the world coordinate system.

If your images are distorted, Ground Truth can automatically undistort them using information you
provide about the image in your input manifest file, including distortion coefficients (k1, k2, k3, k4, p1,
p2), the camera model, and the camera intrinsic matrix. The intrinsic matrix is made up of the focal length
(fx, fy) and the principal point (cx, cy). See Intrinsic Matrix (p. 765) to learn how Ground Truth uses
the camera intrinsic matrix. If distortion coefficients are not included, Ground Truth does not undistort an image.

image-path
  Required: Yes
  Accepted Values: String. Example of format: <folder-name>/<imagefile.png>
  Description: The relative location, in Amazon S3, of your image file. This relative path is appended to the path you specify in prefix.

unix-timestamp
  Required: Yes
  Accepted Values: Number
  Description: The unix timestamp is the number of seconds since January 1st, 1970 until the UTC time that the data was collected by a camera.

camera-model
  Required: No
  Accepted Values: String. Accepted values: "pinhole", "fisheye". Default: "pinhole"
  Description: The model of the camera used to capture the image. This information is used to undistort camera images.

fx, fy
  Required: Yes
  Accepted Values: Numbers
  Description: The focal length of the camera, in the x (fx) and y (fy) directions.

cx, cy
  Required: Yes
  Accepted Values: Numbers
  Description: The x (cx) and y (cy) coordinates of the principal point.

k1, k2, k3, k4
  Required: No
  Accepted Values: Number
  Description: Radial distortion coefficients. Supported for both fisheye and pinhole camera models.

p1, p2
  Required: No
  Accepted Values: Number
  Description: Tangential distortion coefficients. Supported for pinhole camera models.

skew
  Required: No
  Accepted Values: Number
  Description: A parameter to measure the skew of an image.

position
  Required: Yes
  Accepted Values: JSON object. Required parameters: x, y, and z. Enter numbers for these parameters.
  Description: The location or origin of the frame of reference of the camera mounted on the vehicle capturing images.

heading
  Required: Yes
  Accepted Values: JSON object. Required parameters: qx, qy, qz, and qw. Enter numbers for these parameters.
  Description: The orientation of the frame of reference of the camera mounted on the vehicle capturing images, measured using quaternions (qx, qy, qz, qw), in the world coordinate system.
coordinate system.

Point Cloud Frame Limits

You can include up to 100,000 point cloud frames in your input manifest file. 3D point cloud labeling jobs
have longer pre-processing times than other Ground Truth task types. For more information, see Job
Pre-processing Time (p. 630).

Create a Point Cloud Sequence Input Manifest

The manifest is a UTF-8 encoded file in which each line is a complete and valid JSON object. Each line
is delimited by a standard line break, \n or \r\n. Because each line must be a valid JSON object, you
can't have unescaped line break characters. In the point cloud sequence input manifest file, each line
in the manifest contains a sequence of point cloud frames. The point cloud data for each frame in the
sequence can either be stored in binary or ASCII format. For more information, see Accepted Raw 3D
Data Formats (p. 746). This is the manifest file formatting required for 3D point cloud object tracking.
Optionally, you can also provide point attribute and camera sensor fusion data for each point cloud
frame. When you create a sequence input manifest file, you must provide LiDAR and video camera sensor
fusion data in a world coordinate system (p. 761).

The following example demonstrates the syntax used for an input manifest file when each line in the
manifest is a sequence file. Each line in your input manifest file must be in JSON Lines format.

{"source-ref": "s3://awsexamplebucket/example-folder/seq1.json"}
{"source-ref": "s3://awsexamplebucket/example-folder/seq2.json"}

The data for each sequence of point cloud frames needs to be stored in a JSON data object. The
following is an example of the format you use for a sequence file. Information about each frame is
included as a JSON object and is listed in the frames list. This is an example of a sequence file with two
point cloud frame files, frame300.bin and frame303.bin. The ... is used to indicate where you
should include information for additional frames. Add a JSON object for each frame in the sequence.

The following code block includes a JSON object for a single sequence file. The JSON object has been
expanded for readability.


{
"seq-no": 1,
"prefix": "s3://awsexamplebucket/example_lidar_sequence_dataset/seq1/",
"number-of-frames": 100,
"frames":[
{
"frame-no": 300,
"unix-timestamp": 1566861644.759115,
"frame": "example_lidar_frames/frame300.bin",
"format": "binary/xyzi",
"ego-vehicle-pose":{
"position": {
"x": -2.7161461413869947,
"y": 116.25822288149078,
"z": 1.8348751887989483
},
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
}
},
"images": [
{
"image-path": "example_images/frame300.bin_camera0.jpg",
"unix-timestamp": 1566861644.759115,
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,
"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera-model": "pinhole"
}]
},
{
"frame-no": 303,
"unix-timestamp": 1566861644.759115,
"frame": "example_lidar_frames/frame303.bin",
"format": "text/xyzi",
"ego-vehicle-pose":{...},
"images":[{...}]
},
...
]
}
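As a sketch, you could also assemble a sequence file programmatically. The helper below builds the top-level object from a list of per-frame dictionaries so that number-of-frames always matches the frames list; the function and file names are illustrative.

import json

def write_sequence_file(path, seq_no, prefix, frames):
    """Write a point cloud sequence file, keeping number-of-frames consistent."""
    seq = {
        "seq-no": seq_no,
        "prefix": prefix,
        "number-of-frames": len(frames),
        "frames": frames,
    }
    with open(path, "w") as f:
        json.dump(seq, f)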


The following table provides details about the top-level parameters of a sequence file. For detailed
information about the parameters required for individual frames in the sequence file, see Parameters for
Individual Point Cloud Frames (p. 756).

seq-no
  Required: Yes
  Accepted Values: Integer
  Description: The ordered number of the sequence.

prefix
  Required: Yes
  Accepted Values: String. Accepted values: s3://<bucket-name>/<prefix>/ (the prefix must end with a forward slash: /)
  Description: The Amazon S3 location where the sequence files are located.

number-of-frames
  Required: Yes
  Accepted Values: Integer
  Description: The total number of frames included in the sequence file. This number must match the total number of frames listed in the frames parameter.

frames
  Required: Yes
  Accepted Values: List of JSON objects
  Description: A list of frame data. The length of the list must equal number-of-frames. In the worker UI, the order of frames in a sequence is the same as the order of frames in this array. For details about the format of each frame, see Parameters for Individual Point Cloud Frames (p. 756).

Parameters for Individual Point Cloud Frames

The following table shows the parameters you can include in your input manifest file.

frame-no
  Required: No
  Accepted Values: Integer
  Description: A frame number. This is an optional identifier specified by the customer to identify the frame within a sequence. It is not used by Ground Truth.

unix-timestamp
  Required: Yes
  Accepted Values: Number
  Description: The unix timestamp is the number of seconds since January 1st, 1970 until the UTC time that the data was collected by a sensor. The timestamp for each frame must be different, and timestamps must be sequential because they are used for cuboid interpolation. Ideally, this should be the real timestamp when the data was collected. If this is not available, you must use an incremental sequence of timestamps, where the first frame in your sequence file corresponds to the first timestamp in the sequence.

frame
  Required: Yes
  Accepted Values: String. Example of format: <folder-name>/<frame-file>
  Description: The relative location, in Amazon S3, of your point cloud frame file. This relative path is appended to the path you specify in prefix.

format
  Required: No
  Accepted Values: String. Accepted string values: "binary/xyz", "binary/xyzi", "binary/xyzrgb", "binary/xyzirgb", "text/xyz", "text/xyzi", "text/xyzrgb", "text/xyzirgb"
  Default Values: When the file identified in frame has a .bin extension, binary/xyzi. When the file identified in frame has a .txt extension, text/xyzi.
  Description: Use this parameter to specify the format of your point cloud data. For more information, see Accepted Raw 3D Data Formats (p. 746).

ego-vehicle-pose
  Required: No
  Accepted Values: JSON object
  Description: The pose of the device used to collect the point cloud data. For more information about this parameter, see Include Vehicle Pose Information in Your Input Manifest (p. 759).

prefix
  Required: No
  Accepted Values: String. Accepted string value format: s3://<bucket-name>/<folder-name>/ (the prefix must end with a forward slash: /)
  Description: The location in Amazon S3 where your metadata, such as camera images, is stored for this frame.

images
  Required: No
  Accepted Values: List
  Description: A list of parameters describing color camera images used for sensor fusion. You can include up to 8 images in this list. For more information about the parameters required for each image, see Include Camera Data in Your Input Manifest (p. 759).


Include Vehicle Pose Information in Your Input Manifest

Use the ego-vehicle location to provide information about the pose of the vehicle used to capture point
cloud data. Ground Truth uses this information to compute LiDAR extrinsic matrices.

Ground Truth uses extrinsic matrices to project labels to and from the 3D scene and 2D images. For more
information, see Sensor Fusion (p. 763).

The following table provides more information about the position and orientation (heading)
parameters that are required when you provide ego-vehicle information.

position
  Required: Yes
  Accepted Values: JSON object. Required parameters: x, y, and z. Enter numbers for these parameters.
  Description: The translation vector of the ego vehicle in the world coordinate system.

heading
  Required: Yes
  Accepted Values: JSON object. Required parameters: qx, qy, qz, and qw. Enter numbers for these parameters.
  Description: The orientation of the frame of reference of the device or sensor mounted on the vehicle sensing the surroundings, measured in quaternions (qx, qy, qz, qw), in the world coordinate system.

Include Camera Data in Your Input Manifest

If you want to include color camera data with a frame, use the following parameters to provide
information about each image. The Required column in the following table applies when the images
parameter is included in the input manifest file. You are not required to include images in your input
manifest file.

If you include camera images, you must include information about the position and orientation
(heading) of the camera used to capture the images.

If your images are distorted, Ground Truth can automatically undistort them using information you
provide about the image in your input manifest file, including distortion coefficients (k1, k2, k3, k4, p1,
p2), the camera model, the focal length (fx, fy), and the principal point (cx, cy). To learn more about these
coefficients and undistorting images, see Camera calibration With OpenCV. If distortion coefficients are
not included, Ground Truth does not undistort an image.

image-path
  Required: Yes
  Accepted Values: String. Example of format: <folder-name>/<imagefile.png>
  Description: The relative location, in Amazon S3, of your image file. This relative path is appended to the path you specify in prefix.

unix-timestamp
  Required: Yes
  Accepted Values: Number
  Description: The timestamp of the image.

camera-model
  Required: No
  Accepted Values: String. Accepted values: "pinhole", "fisheye". Default: "pinhole"
  Description: The model of the camera used to capture the image. This information is used to undistort camera images.

fx, fy
  Required: Yes
  Accepted Values: Numbers
  Description: The focal length of the camera, in the x (fx) and y (fy) directions.

cx, cy
  Required: Yes
  Accepted Values: Numbers
  Description: The x (cx) and y (cy) coordinates of the principal point.

k1, k2, k3, k4
  Required: No
  Accepted Values: Number
  Description: Radial distortion coefficients. Supported for both fisheye and pinhole camera models.

p1, p2
  Required: No
  Accepted Values: Number
  Description: Tangential distortion coefficients. Supported for pinhole camera models.

skew
  Required: No
  Accepted Values: Number
  Description: A parameter to measure any known skew in the image.

position
  Required: Yes
  Accepted Values: JSON object. Required parameters: x, y, and z. Enter numbers for these parameters.
  Description: The location or origin of the frame of reference of the camera mounted on the vehicle capturing images.

heading
  Required: Yes
  Accepted Values: JSON object. Required parameters: qx, qy, qz, and qw. Enter numbers for these parameters.
  Description: The orientation of the frame of reference of the camera mounted on the vehicle capturing images, measured using quaternions (qx, qy, qz, qw).

Sequence File and Point Cloud Frame Limits

You can include up to 100,000 point cloud frame sequences in your input manifest file. You can include
up to 500 point cloud frames in each sequence file.


Keep in mind that 3D point cloud labeling jobs have longer pre-processing times than other Ground Truth
task types. For more information, see Job Pre-processing Time (p. 630).

Understand Coordinate Systems and Sensor Fusion


Point cloud data is always located in a coordinate system. This coordinate system may be local to the
vehicle or the device sensing the surroundings, or it may be a world coordinate system. When you use
Ground Truth 3D point cloud labeling jobs, all the annotations are generated using the coordinate
system of your input data. For some labeling job task types and features, you must provide data in a
world coordinate system.

In this topic, you'll learn the following:

• When you are required to provide input data in a world coordinate system or global frame of reference.
• What a world coordinate is and how you can convert point cloud data to a world coordinate system.
• How you can use your sensor and camera extrinsic matrices to provide pose data when using sensor
fusion.

Coordinate System Requirements for Labeling Jobs

If your point cloud data was collected in a local coordinate system, you can use an extrinsic matrix of the
sensor used to collect the data to convert it to a world coordinate system or a global frame of reference.
If you cannot obtain an extrinsic for your point cloud data and, as a result, cannot obtain point clouds in
a world coordinate system, you can provide point cloud data in a local coordinate system for 3D point
cloud object detection and semantic segmentation task types.

For object tracking, you must provide point cloud data in a world coordinate system. This is because
when you are tracking objects across multiple frames, the ego vehicle itself is moving in the world and so
all of the frames need a point of reference.

If you include camera data for sensor fusion, it is recommended that you provide camera poses in the
same world coordinate system as the 3D sensor (such as a LiDAR sensor).

Using Point Cloud Data in a World Coordinate System

This section explains what a world coordinate system (WCS), also referred to as a global frame of
reference, is and explains how you can provide point cloud data in a world coordinate system.

What is a World Coordinate System?

A WCS or global frame of reference is a fixed universal coordinate system in which vehicle and sensor
coordinate systems are placed. For example, if multiple point cloud frames are located in different
coordinate systems because they were collected from two sensors, a WCS can be used to translate
all of the coordinates in these point cloud frames into a single coordinate system, where all frames
have the same origin, (0,0,0). This transformation is done by translating the origin of each frame to
the origin of the WCS using a translation vector, and rotating the three axes (typically x, y, and z) to
the right orientation using a rotation matrix. This rigid body transformation is called a homogeneous
transformation.
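As a minimal sketch of such a homogeneous transformation, the following applies a 4x4 rigid transform, assembled from a rotation matrix R and translation vector T, to a 3D point. The rotation and translation values are illustrative.

import numpy as np

R = np.eye(3)                  # example rotation (identity)
T = np.array([1.0, 2.0, 0.5])  # example translation, in meters

# Assemble the 4x4 rigid transformation [R T; 0 0 0 1].
M = np.eye(4)
M[:3, :3] = R
M[:3, 3] = T

# Transform a point using homogeneous coordinates.
p_local = np.array([10.0, 0.0, 1.0, 1.0])
p_world = M @ p_local
print(p_world[:3])  # [11.   2.   1.5]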

A world coordinate system is important in global path planning, localization, mapping, and driving
scenario simulations. Ground Truth uses the right-handed Cartesian world coordinate system such as the
one defined in ISO 8855, where the x axis is forward toward the car’s movement, y axis is left, and the z
axis points up from the ground.

The global frame of reference depends on the data. Some datasets use the LiDAR position in the first
frame as the origin. In this scenario, all the frames use the first frame as a reference and device heading
and position will be near the origin in the first frame. For example, KITTI datasets have the first frame as
a reference for world coordinates. Other datasets use a device position that is different from the origin.


Note that this is not the GPS/IMU coordinate system, which is typically rotated by 90 degrees along
the z-axis. If your point cloud data is in a GPS/IMU coordinate system (such as OxTS in the open source
AV KITTI dataset), then you need to transform the origin to a world coordinate system (typically the
vehicle's reference coordinate system). You apply this transformation by multiplying your data with
transformation matrices (the rotation matrix and translation vector). This transforms the data
from its original coordinate system to a global reference coordinate system. Learn more about this
transformation in the next section.

Convert 3D Point Cloud Data to a WCS

Ground Truth assumes that your point cloud data has already been transformed into a reference
coordinate system of your choice. For example, you can choose the reference coordinate system of the
sensor (such as LiDAR) as your global reference coordinate system. You can also take point clouds from
various sensors and transform them from the sensor's view to the vehicle's reference coordinate system
view. You use a sensor's extrinsic matrix, made up of a rotation matrix and translation vector, to
convert your point cloud data to a WCS or global frame of reference.

Collectively, the translation vector and rotation matrix can be used to make up an extrinsic matrix, which
can be used to convert data from a local coordinate system to a WCS. For example, your LiDAR extrinsic
matrix may be composed as follows, where R is the rotation matrix and T is the translation vector:

LiDAR_extrinsic = [R T;0 0 0 1]

For example, the autonomous driving KITTI dataset includes a rotation matrix and translation vector
for the LiDAR extrinsic transformation matrix for each frame. The pykitti python module can be used
for loading the KITTI data, and in the dataset, dataset.oxts[i].T_w_imu gives the LiDAR extrinsic
transform for the i-th frame, which can be multiplied with points in that frame to convert them to a world
frame - np.matmul(lidar_transform_matrix, points). Multiplying a point in the LiDAR frame with
a LiDAR extrinsic matrix transforms it into world coordinates. Multiplying a point in the world frame with
the camera extrinsic matrix gives the point coordinates in the camera's frame of reference.

The following code example demonstrates how you can convert point cloud frames from the KITTI
dataset into a WCS.

import pykitti
import numpy as np

basedir = '/Users/nameofuser/kitti-data'
date = '2011_09_26'
drive = '0079'

# The 'frames' argument is optional - default: None, which loads the whole dataset.
# Calibration, timestamps, and IMU data are read automatically.
# Camera and velodyne data are available via properties that create generators
# when accessed, or through getter methods that provide random access.
data = pykitti.raw(basedir, date, drive, frames=range(0, 50, 5))

# i is the frame number
i = 0

# lidar extrinsic for the ith frame
lidar_extrinsic_matrix = data.oxts[i].T_w_imu

# velodyne raw point cloud in the lidar scanner's own coordinate system
points = data.get_velo(i)

# transform points from lidar to global frame using lidar_extrinsic_matrix
def generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix):
    tps = []
    for point in points:
        # convert the point to homogeneous coordinates and apply the extrinsic
        transformed_points = np.matmul(
            lidar_extrinsic_matrix,
            np.array([point[0], point[1], point[2], 1], dtype=np.float32).reshape(4, 1)
        ).tolist()
        if len(point) > 3 and point[3] is not None:
            # keep the intensity value when it is present
            tps.append([transformed_points[0][0], transformed_points[1][0],
                        transformed_points[2][0], point[3]])
        else:
            tps.append([transformed_points[0][0], transformed_points[1][0],
                        transformed_points[2][0]])
    return tps

# transform points from lidar to global frame using lidar_extrinsic_matrix
transformed_pcl = generate_transformed_pcd_from_point_cloud(points, lidar_extrinsic_matrix)
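The per-point loop above is straightforward but slow for large frames. As a sketch, the same transformation can be vectorized with NumPy; the function below is an equivalent, illustrative alternative for an N x 3 or N x 4 point array.

import numpy as np

def transform_points_vectorized(points, lidar_extrinsic_matrix):
    """Transform an N x 3 or N x 4 (xyz[, intensity]) array into the world frame."""
    xyz = points[:, :3]
    ones = np.ones((xyz.shape[0], 1), dtype=xyz.dtype)
    homogeneous = np.hstack([xyz, ones])                   # N x 4
    transformed = homogeneous @ lidar_extrinsic_matrix.T   # apply [R T; 0 0 0 1] per row
    if points.shape[1] > 3:
        # carry the intensity column through unchanged
        return np.hstack([transformed[:, :3], points[:, 3:4]])
    return transformed[:, :3]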

Sensor Fusion

Ground Truth supports sensor fusion of point cloud data with up to 8 video camera inputs. This feature
allows human labelers to view the 3D point cloud frame side-by-side with the synchronized video
frame. In addition to providing more visual context for labeling, sensor fusion allows workers to adjust
annotations in the 3D scene and in 2D images, and the adjustments are projected into the other view. The
following video demonstrates a 3D point cloud labeling job with LiDAR and camera sensor fusion.


For best results, when using sensor fusion, your point cloud should be in a WCS. Ground Truth uses your
sensor (such as LiDAR), camera, and ego vehicle pose information to compute extrinsic and intrinsic
matrices for sensor fusion.

Extrinsic Matrix

Ground Truth uses sensor (such as LiDAR) extrinsic matrices and camera extrinsic and intrinsic matrices to project
objects between the point cloud data's frame of reference and the camera's frame of reference.

For example, in order to project a label from the 3D point cloud to camera image plane, Ground Truth
transforms 3D points from LiDAR’s own coordinate system to the camera's coordinate system. This is
typically done by first transforming 3D points from LiDAR’s own coordinate system to a world coordinate
system (or a global reference frame) using the LiDAR extrinsic matrix. Ground Truth then uses the
camera inverse extrinsic (which converts points from a global frame of reference to the camera's frame
of reference) to transform the 3D points from world coordinate system obtained in previous step into
the camera image plane. The LiDAR extrinsic matrix can also be used to transform 3D data into a world
coordinate system. If your 3D data is already transformed into world coordinate system then the first
transformation doesn’t have any impact on label translation, and label translation only depends on the
camera inverse extrinsic. A view matrix is used to visualize projected labels. To learn more about these
transformations and the view matrix, see Ground Truth Sensor Fusion Transformations (p. 769).

Ground Truth computes these extrinsic matrices by using the LiDAR and camera pose data that you provide:
heading (in quaternions: qx, qy, qz, and qw) and position (x, y, z). For the vehicle, typically the
heading and position are described in the vehicle's reference frame in a world coordinate system and are
called an ego vehicle pose. For each camera extrinsic, you can add pose information for that camera. For
more information, see Pose (p. 766).

Intrinsic Matrix

Ground Truth uses the camera extrinsic and intrinsic matrices to compute view matrices to transform
labels to and from the 3D scene to camera images. Ground Truth computes the camera intrinsic matrix
using the camera focal length (fx, fy) and optical center coordinates (cx, cy) that you provide. For more
information, see Intrinsic and Distortion (p. 769).

Image Distortion

Image distortion can occur for a variety of reasons. For example, images may be distorted due to barrel
or fish-eye effects. Ground Truth uses intrinsic parameters along with distortion coefficients to undistort
images you provide when creating 3D point cloud labeling jobs. If a camera image has already been
undistorted, all distortion coefficients should be set to 0.

For more information about the transformations Ground Truth performs to undistort images, see
Camera Calibrations: Extrinsic, Intrinsic and Distortion (p. 769).

Ego Vehicle

To collect data for autonomous driving applications, the measurements used to generate point cloud
data are taken from sensors mounted on a vehicle, or the ego vehicle. To project label adjustments to
and from the 3D scene and 2D images, Ground Truth needs your ego vehicle pose in a world coordinate
system. The ego vehicle pose is comprised of position coordinates and an orientation quaternion.

Ground Truth uses your ego vehicle pose to compute rotation and transformation matrices. Rotations
in 3 dimensions can be represented by a sequence of 3 rotations around a sequence of axes. In theory,
any three axes spanning the 3D Euclidean space are enough. In practice, the axes of rotation are chosen
to be the basis vectors. The three rotations are expected to be in a global frame of reference (extrinsic).
Ground Truth does not support a body-centered frame of reference (intrinsic), which is attached to,
and moves with, the object under rotation. To track objects, Ground Truth needs to measure from a
global reference where all vehicles are moving. When using Ground Truth 3D point cloud labeling jobs, z
specifies the axis of rotation (extrinsic rotation) and yaw Euler angles are in radians (rotation angle).


Pose

Ground Truth uses pose information for 3D visualizations and sensor fusion. Pose information you input
through your manifest file is used to compute extrinsic matrices. If you already have an extrinsic matrix,
you can use it to extract sensor and camera pose data.

For example, in the autonomous driving KITTI dataset, the pykitti python module can be used for
loading the KITTI data. In the dataset, dataset.oxts[i].T_w_imu gives the LiDAR extrinsic
transform for the i-th frame, and it can be multiplied with the points to get them in a world
frame - matmul(lidar_transform_matrix, points). This transform can be converted
into the position (translation vector) and heading (in quaternion) of the LiDAR for the input manifest
file JSON format. The camera extrinsic transform for cam0 in the i-th frame can be calculated by
inv(matmul(dataset.calib.T_cam0_velo, inv(dataset.oxts[i].T_w_imu))), and this can
be converted into heading and position for cam0.

import numpy as np
from scipy.spatial.transform import Rotation as R

rotation = [[ 9.96714314e-01, -8.09890350e-02, 1.16333982e-03],
            [ 8.09967396e-02, 9.96661051e-01, -1.03090934e-02],
            [-3.24531964e-04, 1.03694477e-02, 9.99946183e-01]]

origin = [1.71104606e+00,
          5.80000039e-01,
          9.43144935e-01]

# position is the origin
position = origin
r = R.from_matrix(np.asarray(rotation))

# heading in WCS using scipy
heading = r.as_quat()
print(f"position: {position}\nheading: {heading}")

Position

In the input manifest file, position refers to the position of the sensor with respect to a world frame. If
you are unable to put the device position in a world coordinate system, you can use LiDAR data with local
coordinates. Similarly, for mounted video cameras, you can specify the position and heading in a world
coordinate system. For a camera, if you do not have position information, use (0, 0, 0).

The following are the fields in the position object:

1. x (float) – x coordinate of the ego vehicle, sensor, or camera position, in meters.
2. y (float) – y coordinate of the ego vehicle, sensor, or camera position, in meters.
3. z (float) – z coordinate of the ego vehicle, sensor, or camera position, in meters.

The following is an example of a position JSON object:

{
"position": {
"y": -152.77584902657554,
"x": 311.21505956090624,
"z": -10.854137529636024
}
}


Heading

In the input manifest file, heading is an object that represents the orientation of a device with respect
to the world frame. Heading values should be in quaternions. A quaternion is a representation of the
orientation consistent with geodesic spherical properties. If you are unable to put the sensor heading
in world coordinates, use the identity quaternion (qx = 0, qy = 0, qz = 0, qw = 1).
Similarly, for cameras, specify the heading in quaternions. If you are unable to obtain extrinsic camera
calibration parameters, also use the identity quaternion.

The fields in the heading object are as follows:

1. qx (float) - x component of the ego vehicle, sensor, or camera orientation.
2. qy (float) - y component of the ego vehicle, sensor, or camera orientation.
3. qz (float) - z component of the ego vehicle, sensor, or camera orientation.
4. qw (float) - w component of the ego vehicle, sensor, or camera orientation.

The following is an example of a heading JSON object:

{
"heading": {
"qy": -0.7046155108831117,
"qx": 0.034278837280808494,
"qz": 0.7070617895701465,
"qw": -0.04904659893885366
}
}

To learn more, see Compute Orientation Quaternions and Position (p. 767).

Compute Orientation Quaternions and Position

Ground Truth requires that all orientation, or heading, data be given in quaternions. A quaternion is
a representation of the orientation consistent with geodesic spherical properties that can be used to
approximate rotation. Compared to Euler angles, quaternions are simpler to compose and avoid the
problem of gimbal lock. Compared to rotation matrices, they are more compact, more numerically stable,
and more efficient.

You can compute quaternions from a rotation matrix or a transformation matrix.

If you have a rotation matrix (made up of the axis rotations) and translation vector (or origin) in the world
coordinate system instead of a single 4x4 rigid transformation matrix, then you can directly use the
rotation matrix and translation vector to compute quaternions. Libraries like scipy and pyquaternion can
help. The following code block shows an example using these libraries to compute a quaternion from a
rotation matrix.

import numpy as np
from scipy.spatial.transform import Rotation as R

rotation = [[ 9.96714314e-01, -8.09890350e-02, 1.16333982e-03],
            [ 8.09967396e-02, 9.96661051e-01, -1.03090934e-02],
            [-3.24531964e-04, 1.03694477e-02, 9.99946183e-01]]

origin = [1.71104606e+00,
          5.80000039e-01,
          9.43144935e-01]

# position is the origin
position = origin
r = R.from_matrix(np.asarray(rotation))
# heading in WCS using scipy
heading = r.as_quat()
print(f"position: {position}\nheading: {heading}")

A UI tool like 3D Rotation Converter can also be useful.

If you have a 4x4 extrinsic transformation matrix, note that the transformation matrix is in the form [R
T; 0 0 0 1] where R is the rotation matrix and T is the origin translation vector. That means you can
extract the rotation matrix and translation vector from the transformation matrix as follows.

import numpy as np
from scipy.spatial.transform import Rotation as R

transformation = np.array([
    [ 9.96714314e-01, -8.09890350e-02, 1.16333982e-03, 1.71104606e+00],
    [ 8.09967396e-02, 9.96661051e-01, -1.03090934e-02, 5.80000039e-01],
    [-3.24531964e-04, 1.03694477e-02, 9.99946183e-01, 9.43144935e-01],
    [ 0, 0, 0, 1]])

# extract the 3x3 rotation matrix and the translation vector
rotation = transformation[0:3, 0:3]
translation = transformation[0:3, 3]

# position is the origin translation
position = translation
r = R.from_matrix(rotation)
# heading in WCS using scipy
heading = r.as_quat()
print(f"position: {position}\nheading: {heading}")

With your own setup, you can compute an extrinsic transformation matrix using the GPS/IMU
position and orientation (latitude, longitude, altitude and roll, pitch, yaw) with respect to the LiDAR
sensor on the ego vehicle. For example, you can compute pose from KITTI raw data using pose =
convertOxtsToPose(oxts) to transform the oxts data into local Euclidean poses, specified by
4x4 rigid transformation matrices. You can then transform this pose transformation matrix to a global
reference frame using the reference frame's transformation matrix in the world coordinate system.

#include <cmath>

struct Quaternion
{
    double w, x, y, z;
};

// yaw (Z), pitch (Y), roll (X)
Quaternion ToQuaternion(double yaw, double pitch, double roll)
{
    // Abbreviations for the various angular functions
    double cy = std::cos(yaw * 0.5);
    double sy = std::sin(yaw * 0.5);
    double cp = std::cos(pitch * 0.5);
    double sp = std::sin(pitch * 0.5);
    double cr = std::cos(roll * 0.5);
    double sr = std::sin(roll * 0.5);

    Quaternion q;
    q.w = cr * cp * cy + sr * sp * sy;
    q.x = sr * cp * cy - cr * sp * sy;
    q.y = cr * sp * cy + sr * cp * sy;
    q.z = cr * cp * sy - sr * sp * cy;

    return q;
}


Ground Truth Sensor Fusion Transformations


The following sections go into greater detail about the Ground Truth sensor fusion transformations that
are performed using the pose data you provide.

LiDAR Extrinsic
In order to project between a 3D LiDAR scene and a 2D camera image, Ground Truth computes the
rigid transformation projection matrices using the ego vehicle pose and heading. Ground Truth computes
the rotation and translation of world coordinates into the 3D plane by doing a simple sequence of
rotations and translations.

Ground Truth computes the rotation matrix, R, from the heading quaternion as follows:
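
For a unit quaternion [x, y, z, w], the standard rotation matrix is:

R = | 1 - 2(y² + z²)    2(xy - zw)        2(xz + yw)     |
    | 2(xy + zw)        1 - 2(x² + z²)    2(yz - xw)     |
    | 2(xz - yw)        2(yz + xw)        1 - 2(x² + y²) |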

Here, [x, y, z, w] corresponds to parameters in the heading JSON object, [qx, qy, qz, qw].
Ground Truth computes the translation column vector as T = [poseX, poseY, poseZ]. The
extrinsic matrix is then simply as follows:

LiDAR_extrinsic = [R T;0 0 0 1]

Camera Calibrations: Extrinsic, Intrinsic and Distortion


Geometric camera calibration, also referred to as camera resectioning, estimates the parameters of a
lens and image sensor of an image or video camera. You can use these parameters to correct for lens
distortion, measure the size of an object in world units, or determine the location of the camera in the
scene. Camera parameters include intrinsics and distortion coefficients.

Camera Extrinsic
If the camera pose is given, then Ground Truth computes the camera extrinsic based on a rigid
transformation from the 3D plane into the camera plane. The calculation is the same as the one used for
the LiDAR Extrinsic (p. 769), except that Ground Truth uses camera pose (position and heading) and
computes the inverse extrinsic.

camera_inverse_extrinsic = inv([Rc Tc; 0 0 0 1])  # where Rc and Tc are camera pose components

Intrinsic and Distortion


Some cameras, such as pinhole or fisheye cameras, may introduce significant distortion in photos. This
distortion can be corrected using distortion coefficients and the camera focal length. To learn more, see
Camera calibration With OpenCV in the OpenCV documentation.

There are two types of distortion Ground Truth can correct for: radial distortion and tangential
distortion.

Radial distortion occurs when light rays bend more near the edges of a lens than they do at its optical
center. The smaller the lens, the greater the distortion. Radial distortion manifests as a barrel or
fish-eye effect, and Ground Truth uses Formula 1 to undistort it.

Formula 1:
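
Assuming the standard OpenCV radial distortion model referenced above, where r² = x² + y² for a
normalized image point (x, y):

x_corrected = x (1 + k1 r² + k2 r⁴ + k3 r⁶)
y_corrected = y (1 + k1 r² + k2 r⁴ + k3 r⁶)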


Tangential distortion occurs because the lenses used to take the images are not perfectly parallel to the
imaging plane. This can be corrected with Formula 2.

Formula 2:
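
Assuming the standard OpenCV tangential distortion model:

x_corrected = x + (2 p1 x y + p2 (r² + 2x²))
y_corrected = y + (p1 (r² + 2y²) + 2 p2 x y)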

In the input manifest file, you can provide distortion coefficients and Ground Truth will undistort your
images. All distortion coefficients are floats.

• k1, k2, k3, k4 – Radial distortion coefficients. Supported for both fisheye and pinhole camera models.
• p1, p2 – Tangential distortion coefficients. Supported for pinhole camera models.

If images are already undistorted, all distortion coefficients should be 0 in your input manifest.

In order to correctly reconstruct the corrected image, Ground Truth does a unit conversion of the images
based on their focal lengths. If a common focal length is used with a given aspect ratio for both axes
(such as 1), the formula has a single focal length. The matrix containing the four parameters listed
below is referred to as the camera intrinsic calibration matrix; a sketch of its standard form follows
the list.

While the distortion coefficients are the same regardless of the camera resolution used, the intrinsic
parameters should be scaled from the calibrated resolution to the current resolution.

The following are float values.

• fx - focal length in x direction.
• fy - focal length in y direction.
• cx - x coordinate of principal point.
• cy - y coordinate of principal point.
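
The following is a minimal sketch of the standard 3x3 pinhole intrinsic calibration matrix built from
these four parameters; the example calibration values are illustrative only, and the variable name
intrinsic_matrix is chosen to match the view matrix example that follows.

import numpy as np

# example calibration values; substitute your camera's fx, fy, cx, cy
fx, fy = 800.0, 800.0
cx, cy = 320.0, 240.0

# standard 3x3 pinhole intrinsic calibration matrix
intrinsic_matrix = np.array([[fx, 0., cx],
                             [0., fy, cy],
                             [0., 0., 1.]])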

Ground Truth uses the camera extrinsic and camera intrinsic to compute the view matrix, as shown in the
following code block, to transform labels between the 3D scene and 2D images.

import numpy as np

def generate_view_matrix(intrinsic_matrix, extrinsic_matrix):
    intrinsic_matrix = np.c_[intrinsic_matrix, np.zeros(3)]
    view_matrix = np.matmul(intrinsic_matrix, extrinsic_matrix)
    view_matrix = np.insert(view_matrix, 2, np.array((0, 0, 0, 1)), 0)
    return view_matrix
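
The following is a hypothetical usage sketch of generate_view_matrix that projects one 3D world point
into the image plane; the intrinsic values and identity extrinsic pose are assumptions for illustration,
not values Ground Truth produces.

import numpy as np

intrinsic_matrix = np.array([[800., 0., 320.],
                             [0., 800., 240.],
                             [0., 0., 1.]])
extrinsic_matrix = np.eye(4)  # identity pose, for illustration only

view_matrix = generate_view_matrix(intrinsic_matrix, extrinsic_matrix)

point_3d = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous world point [x, y, z, 1]
projected = view_matrix @ point_3d
u = projected[0] / projected[3]  # pixel x
v = projected[1] / projected[3]  # pixel y
print(u, v)                      # 520.0 340.0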

Video Frame Input Data


When you create a video frame object detection or object tracking labeling job, you can choose video
files (MP4 files) or video frames for input data. All worker tasks are created using video frames, so if you
choose video files, use the Ground Truth frame extraction tool to extract video frames (images) from
your video files.


For both of these options, you can use the Automated data setup option in the Ground Truth section
of the Amazon SageMaker console to set up a connection between Ground Truth and your input data in
Amazon S3 so that Ground Truth knows where to look for your input data when creating your labeling
tasks. This creates and stores an input manifest file in your Amazon S3 input dataset location. To learn
more, see Automated Video Frame Input Data Setup (p. 773).

Alternatively, you can manually create sequence files for each sequence of video frames that you want
labeled and provide the Amazon S3 location of an input manifest file that references each of these
sequence files using the source-ref key. To learn more, see Create a Video Frame Input Manifest
File (p. 775).

Topics
• Choose Video Files or Video Frames for Input Data (p. 771)
• Input Data Setup (p. 772)

Choose Video Files or Video Frames for Input Data


When you create a video frame object detection or object tracking labeling job, you can provide a
sequence of video frames (images) or you can use the Amazon SageMaker console to have Ground Truth
automatically extract video frames from your video files. Use the following sections to learn more about
these options.

Provide Video Frames

Video frames are sequences of images extracted from a video file. You can create a Ground Truth
labeling job to have workers label multiple sequences of video frames. Each sequence is made up of
images extracted from a single video.

To create a labeling job using video frame sequences, you must store each sequence using a unique key
name prefix in Amazon S3. In the Amazon S3 console, key name prefixes are folders. So in the Amazon
S3 console, each sequence of video frames must be located in its own folder in Amazon S3.

For example, if you have two sequences of video frames, you might use the key name prefixes
sequence1/ and sequence2/ to identify your sequences. In this example, your sequences may be
located in s3://DOC-EXAMPLE-BUCKET/video-frames/sequence1/ and s3://DOC-EXAMPLE-
BUCKET/video-frames/sequence2/.

If you are using the Ground Truth console to create an input manifest file, all of the sequence key name
prefixes should be in the same location in Amazon S3. For example, in the Amazon S3 console, each
sequence could be in a folder in s3://DOC-EXAMPLE-BUCKET/video-frames/. In this example,
your first sequence of video frames (images) may be located in s3://DOC-EXAMPLE-BUCKET/video-
frames/sequence1/ and your second sequence may be located in s3://DOC-EXAMPLE-BUCKET/
video-frames/sequence2/.
Important
Even if you only have a single sequence of video frames that you want workers to label, that
sequence must have a key name prefix in Amazon S3. If you are using the Amazon S3 console,
this means that your sequence is located in a folder. It cannot be located in the root of your S3
bucket.

When creating worker tasks using sequences of video frames, Ground Truth uses one sequence per task.
In each task, Ground Truth orders your video frames using UTF-8 binary order.

For example, video frames might be in the following order in Amazon S3:

[0001.jpg, 0002.jpg, 0003.jpg, ..., 0011.jpg]


They are arranged in the same order in the worker’s task: 0001.jpg, 0002.jpg, 0003.jpg, ...,
0011.jpg.

Frames might also be ordered using a naming convention like the following:

[frame1.jpg, frame2.jpg, ..., frame11.jpg]

In this case, frame10.jpg and frame11.jpg come before frame2.jpg in the worker task. Your
worker sees your video frames in the following order: frame1.jpg, frame10.jpg, frame11.jpg,
frame2.jpg, ..., frame9.jpg.
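
The following minimal Python sketch (not part of Ground Truth) illustrates this UTF-8 binary ordering,
and why zero-padded frame numbers preserve numeric order:

frames = ["frame1.jpg", "frame2.jpg", "frame10.jpg", "frame11.jpg"]
print(sorted(frames))
# ['frame1.jpg', 'frame10.jpg', 'frame11.jpg', 'frame2.jpg']

padded = ["frame0001.jpg", "frame0002.jpg", "frame0010.jpg", "frame0011.jpg"]
print(sorted(padded))
# ['frame0001.jpg', 'frame0002.jpg', 'frame0010.jpg', 'frame0011.jpg']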

Provide Video Files

You can use the Ground Truth frame splitting feature when creating a new labeling job in the console to
extract video frames from video files (MP4 files). A series of video frames extracted from a single video
file is referred to as a sequence of video frames.

You can either have Ground Truth automatically extract all frames, up to 2,000, from the video, or you
can specify a frequency for frame extraction. For example, you can have Ground Truth extract every 10th
frame from your videos.

You can provide up to 50 videos when you use automated data setup to extract frames; however, your
input manifest file cannot reference more than 10 video frame sequence files when you create a video
frame object tracking or video frame object detection labeling job. If you use the automated data setup
console tool to extract video frames from more than 10 video files, you need to modify the manifest
file the tool generates, or create a new one, to include 10 or fewer video frame sequence files. To learn
more about these quotas, see 3D Point Cloud and Video Frame Labeling Job Quotas (p. 744).

To use the video frame extraction tool, see Automated Video Frame Input Data Setup (p. 773).

When all of your video frames have been successfully extracted from your videos, you will see the
following in your S3 input dataset location:

• A key name prefix (a folder in the Amazon S3 console) named after each video. Each of these prefixes
leads to:
• A sequence of video frames extracted from the video used to name that prefix.
• A sequence file used to identify all of the images that make up that sequence.
• An input manifest file with a .manifest extension. This identifies all of the sequence files that will be
used to create your labeling job.

All of the frames extracted from a single video file are used for a labeling task. If you extract video
frames from multiple video files, multiple tasks are created for your labeling job, one for each sequence
of video frames.

Ground Truth stores each sequence of video frames that it extracts in your Amazon S3 location for input
datasets using a unique key name prefix. In the Amazon S3 console, key name prefixes are folders.

Input Data Setup


When you create a video frame labeling job, you need to let Ground Truth know where to look for your
input data. You can do this in one of two ways:

• You can store your input data in Amazon S3 and have Ground Truth automatically detect the input
dataset used for your labeling job. See Automated Video Frame Input Data Setup (p. 773) to learn
more about this option.
• You can create an input manifest file and sequence files and upload them to Amazon S3. See Manual
Input Data Setup (p. 775) to learn more about this option.


Topics
• Automated Video Frame Input Data Setup (p. 773)
• Manual Input Data Setup (p. 775)

Automated Video Frame Input Data Setup

You can use the Ground Truth automated data setup to automatically detect video files in your Amazon
S3 bucket and extract video frames from those files. To learn how, see Provide Video Files (p. 772).

If you already have video frames in Amazon S3, you can use the automated data setup to use these video
frames in your labeling job. For this option, all video frames from a single video must be stored using a
unique prefix. To learn about the requirements to use this option, see Provide Video Frames (p. 771).

Select one of the following sections to learn how to set up your automatic input dataset connection with
Ground Truth.

Provide Video Files and Extract Frames

Use the following procedure to connect your video files with Ground Truth and automatically extract
video frames from those files for video frame object detection and object tracking labeling jobs.
Note
If you use the automated data setup console tool to extract video frames from more than 10
video files, you will need to modify the manifest file the tool generates or create a new one to
include 10 or fewer video frame sequence files. To learn more, see Provide Video Files (p. 772).

Make sure your video files are stored in an Amazon S3 bucket in the same AWS Region that you perform
the automated data setup in.

Automatically connect your video files in Amazon S3 with Ground Truth and extract video
frames:

1. Navigate to the Create labeling job page in the Amazon SageMaker console: https://
console.aws.amazon.com/sagemaker/groundtruth.

Your input and output S3 buckets must be located in the same AWS Region that you create your
labeling job in. This link puts you in the North Virginia (us-east-1) AWS Region. If your input data is
in an Amazon S3 bucket in another Region, switch to that Region. To change your AWS Region, on
the navigation bar, choose the name of the currently displayed Region.
2. Select Create labeling job.
3. Enter a Job name.
4. In the section Input data setup, select Automated data setup.
5. Enter an Amazon S3 URI for S3 location for input datasets. An S3 URI looks like the following:
s3://DOC-EXAMPLE-BUCKET/path-to-files/. This URI should point to the Amazon S3 location
where your video files are stored.
6. Specify your S3 location for output datasets. This is where your output data is stored. You can
choose to store your output data in the Same location as input dataset or Specify a new location
and enter the S3 URI of the location where you want to store your output data.
7. Choose Video Files for your Data type using the dropdown list.
8. Choose Yes, extract frames for object tracking and detection tasks.
9. Choose a method of Frame extraction.

• When you choose Use all frames extracted from the video to create a labeling task, Ground
Truth extracts all frames from each video in your S3 location for input datasets, up to 2,000
frames. If a video in your input dataset contains more than 2,000 frames, the first 2,000 are
extracted and used for that labeling task.
• When you choose Use every x frame from a video to create a labeling task, Ground Truth
extracts every xth frame from each video in your S3 location for input datasets.

For example, if your video is 2 seconds long and has a frame rate of 30 frames per second, there
are 60 frames in your video. If you specify 10 here, Ground Truth extracts every 10th frame from
your video. This means the 1st, 10th, 20th, 30th, 40th, 50th, and 60th frames are extracted.
10. Choose or create an IAM execution role. Make sure that this role has permission to access your
Amazon S3 locations for input and output data specified in steps 5 and 6.
11. Select Complete data setup.

Provide Video Frames

Use the following procedure to connect your sequences of video frames with Ground Truth for video
frame object detection and object tracking labeling jobs.

Make sure your video frames are stored in an Amazon S3 bucket in the same AWS Region that you
perform the automated data setup in. Each sequence of video frames should have a unique prefix.
For example, if you have two sequences stored in s3://DOC-EXAMPLE-BUCKET/video-frames/
sequences/, each should have a unique prefix like sequence1 and sequence2 and should both
be located directly under the /sequences/ prefix. In the example above, the locations of these
two sequences are: s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence1/ and
s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence2/.

Automatically connect your video frames in Amazon S3 with Ground Truth:

1. Navigate to the Create labeling job page in the Amazon SageMaker console: https://
console.aws.amazon.com/sagemaker/groundtruth.

Your input and output S3 buckets must be located in the same AWS Region that you create your
labeling job in. This link puts you in the North Virginia (us-east-1) AWS Region. If your input data is
in an Amazon S3 bucket in another Region, switch to that Region. To change your AWS Region, on
the navigation bar, choose the name of the currently displayed Region.
2. Select Create labeling job.
3. Enter a Job name.
4. In the section Input data setup, select Automated data setup.
5. Enter an Amazon S3 URI for S3 location for input datasets.

This should be the Amazon S3 location where your sequences are stored. For example, if you have
two sequences stored in s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence1/,
s3://DOC-EXAMPLE-BUCKET/video-frames/sequences/sequence2/, enter s3://DOC-
EXAMPLE-BUCKET/video-frames/sequences/ here.
6. Specify your S3 location for output datasets. This is where your output data is stored. You can
choose to store your output data in the Same location as input dataset or Specify a new location
and enter the S3 URI of the location where you want to store your output data.
7. Choose Video frames for your Data type using the dropdown list.
8. Choose or create an IAM execution role. Make sure that this role has permission to access your
Amazon S3 locations for input and output data specified in steps 5 and 6.
9. Select Complete data setup.

These procedures create an input manifest in the Amazon S3 location for input datasets that you
specified in step 5. If you are creating a labeling job using the SageMaker API, AWS CLI, or an AWS
SDK, use the Amazon S3 URI for this input manifest file as the input to the ManifestS3Uri parameter.


Manual Input Data Setup

Choose the manual data setup option if you have created sequence files for each of your video frame
sequences, and a manifest file listing references to those sequence files.

Create a Video Frame Input Manifest File

Ground Truth uses the input manifest file to identify the location of your input dataset when creating
labeling tasks. For video frame object detection and object tracking labeling jobs, each line in the input
manifest file identifies the location of a video frame sequence file. Each sequence file identifies the
images included in a single sequence of video frames.

Use this page to learn how to create a video frame sequence file and an input manifest file for video
frame object tracking and object detection labeling jobs.

If you want Ground Truth to automatically generate your sequence files and input manifest file, see
Automated Video Frame Input Data Setup (p. 773).

Create a Video Frame Sequence Input Manifest

In the video frame sequence input manifest file, each line in the manifest is a JSON object, with a
"source-ref" key that references a sequence file. Each sequence file identifies the location of a
sequence of video frames. This is the manifest file formatting required for all video frame labeling jobs.

The following example demonstrates the syntax used for an input manifest file:

{"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-folder/seq1.json"}
{"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-folder/seq2.json"}

Create a Video Frame Sequence File

The data for each sequence of video frames needs to be stored in a JSON data object. The following is an
example of the format you use for a sequence file. Information about each frame is included as a JSON
object and is listed in the frames list. The following JSON has been expanded for readability.

{
"seq-no": 1,
"prefix": "s3://mybucket/prefix/video1/",
"number-of-frames": 3,
"frames":[
{"frame-no": 1, "unix-timestamp": 1566861644, "frame": "frame0001.jpg" },
{"frame-no": 2, "unix-timestamp": 1566861644, "frame": "frame0002.jpg" },
{"frame-no": 3, "unix-timestamp": 1566861644, "frame": "frame0003.jpg" }
]
}

The following table provides details about the parameters shown in this code example.

Parameter          Required   Accepted Values                  Description

seq-no             Yes        Integer                          The ordered number of the sequence.

prefix             Yes        String                           The Amazon S3 location where the
                              Accepted Values:                 sequence files are located. The prefix
                              s3://<bucket-name>/<prefix>/     must end with a forward slash: /.

number-of-frames   Yes        Integer                          The total number of frames included in
                                                               the sequence file. This number must
                                                               match the total number of frames listed
                                                               in the frames parameter in the next row.

frames             Yes        List of JSON objects             A list of frame data. The length of the
                              Required: frame-no, frame        list must equal number-of-frames. In the
                              Optional: unix-timestamp         worker UI, frames in a sequence are
                                                               ordered in UTF-8 binary order. To learn
                                                               more about this ordering, see Provide
                                                               Video Frames (p. 771).

frame-no           Yes        Integer                          The frame order number. This determines
                                                               the order of a frame in the sequence.

unix-timestamp     No         Integer                          The unix timestamp of a frame: the
                                                               number of seconds since January 1st,
                                                               1970 until the UTC time when the frame
                                                               was captured.

frame              Yes        String                           The name of a video frame image file.
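
If you generate sequence files programmatically, a minimal sketch like the following builds one
sequence file and the matching manifest line; the file names, prefix, and timestamp are placeholders.

import json

frames = ["frame0001.jpg", "frame0002.jpg", "frame0003.jpg"]
sequence = {
    "seq-no": 1,
    "prefix": "s3://mybucket/prefix/video1/",  # must end with a forward slash
    "number-of-frames": len(frames),
    "frames": [
        {"frame-no": i + 1, "unix-timestamp": 1566861644, "frame": name}
        for i, name in enumerate(frames)
    ],
}
with open("seq1.json", "w") as f:
    json.dump(sequence, f)

# each line of the input manifest references one uploaded sequence file
print(json.dumps({"source-ref": "s3://mybucket/prefix/seq1.json"}))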

Output Data
The output from a labeling job is placed in the Amazon S3 location that you specified in the console or in
the call to the CreateLabelingJob operation. Output data appears in this location when the workers have
submitted one or more tasks, or when tasks expire. Note that it may take a few minutes for output data
to appear in Amazon S3 after the worker submits the task or the task expires.

Each line in the output data file is identical to the manifest file with the addition of an attribute
and value for the label assigned to the input object. The attribute name for the value is defined in
the console or in the call to the CreateLabelingJob operation. You can't use -metadata in the
label attribute name. If you are running an image semantic segmentation, 3D point cloud semantic
segmentation, or 3D point cloud object tracking job, the label attribute must end with -ref. For any
other type of job, the attribute name can't end with -ref.

The output of the labeling job is the value of the key-value pair with the label. The label and the value
overwrite any existing JSON data in the input file with the new value.


For example, the following is the output from an image classification labeling job where the input data
files were stored in an Amazon S3 bucket named AWSDOC-EXAMPLE-BUCKET and the label attribute name
was defined as sport. In this example the JSON object is formatted for readability; in the actual output
file, the JSON object is on a single line. For more information about the data format, see JSON Lines.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/image_example.png",
"sport":0,
"sport-metadata":
{
"class-name": "football",
"confidence": 0.00,
"type":"groundtruth/image-classification",
"job-name": "identify-sport",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256"
}
}

The value of the label can be any valid JSON. In this case the label's value is the index of the class in the
classification list. Other job types, such as bounding box, have more complex values.

Any key-value pair in the input manifest file other than the label attribute is unchanged in the output
file. You can use this to pass data to your application.
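
Because the output manifest is in JSON Lines format, a short sketch like the following reads it; the
label attribute name sport is assumed from the example above, and the local file name is a placeholder.

import json

with open("output.manifest") as f:
    for line in f:
        record = json.loads(line)
        print(record["source-ref"],
              record["sport"],
              record["sport-metadata"]["class-name"])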

The output from a labeling job can be used as the input to another labeling job. You can use this when
you are chaining together labeling jobs. For example, you can send one labeling job to determine the
sport that is being played. Then you send another using the same data to determine if the sport is being
played indoors or outdoors. By using the output data from the first job as the manifest for the second
job, you can consolidate the results of the two jobs into one output file for easier processing by your
applications.

The output data file is written to the output location periodically while the job is in progress. These
intermediate files contain one line for each line in the manifest file. If an object is labeled, the label is
included. If the object hasn't been labeled, it is written to the intermediate output file identically to the
manifest file.

Output Directories
Ground Truth creates several directories in your Amazon S3 output path. These directories contain the
results of your labeling job and other artifacts of the job. The top-level directory for a labeling job is
given the same name as your labeling job; the output directories are placed beneath it. For example, if
you named your labeling job find-people, your output would be in the following directories:

s3://AWSDOC-EXAMPLE-BUCKET/find-people/activelearning
s3://AWSDOC-EXAMPLE-BUCKET/find-people/annotations
s3://AWSDOC-EXAMPLE-BUCKET/find-people/inference
s3://AWSDOC-EXAMPLE-BUCKET/find-people/manifests
s3://AWSDOC-EXAMPLE-BUCKET/find-people/training

Each directory contains the following output:

Active Learning Directory

The activelearning directory is only present when you are using automated data labeling. It contains
the input and output validation set for automated data labeling, and the input and output folder for
automatically labeled data.


Annotations Directory

The annotations directory contains all of the annotations made by the workforce. These are the
responses from individual workers that have not been consolidated into a single label for the data object.

There are three subdirectories in the annotations directory.

• The first, worker-response, contains the responses from individual workers. This contains
a subdirectory for each iteration, which in turn contains a subdirectory for each data object in
that iteration. The worker response data for each data object is stored in a timestamped JSON
file that contains the answers submitted by each worker for that data object, and if you use a
private workforce, metadata about those workers. To learn more about this metadata, see Worker
Metadata (p. 779).
• The second, consolidated-annotation, contains information required to consolidate the
annotations in the current batch into labels for your data objects.
• The third, intermediate, contains the output manifest for the current batch with any completed
labels. This file is updated as the label for each data object is completed.

Note
We recommend that you do not use files that are not mentioned in the documentation.

Inference Directory

The inference directory is only present when you are using automated data labeling. This directory
contains the input and output files for the SageMaker batch transform used while labeling data objects.

Manifest Directory

The manifest directory contains the output manifest from your labeling job. There is one subdirectory
in the manifest directory, output. The output directory contains the output manifest file for your
labeling job. The file is named output.manifest.

Training Directory

The training directory is only present when you are using automated data labeling. This directory
contains the input and output files used to train the automated data labeling model.

Confidence Score
When you have more than one worker annotate a single task, your label results from annotation
consolidation. Ground Truth calculates a confidence score for each label. A confidence score is a number
between 0 and 1 that indicates how confident Ground Truth is in the label. You can use the confidence
score to compare labeled data objects to each other, and to identify the least or most confident labels.

You should not interpret the value of a confidence score as an absolute value, or compare confidence
scores across labeling jobs. For example, if all of the confidence scores are between 0.98 and 0.998, you
should only compare the data objects with each other and not rely on the high confidence scores.

You should not compare the confidence scores of human-labeled data objects and auto-labeled data
objects. The confidence scores for humans are calculated using the annotation consolidation function
for the task, while the confidence scores for automated labeling are calculated using a model that
incorporates object features. The two models generally have different scales and average confidence.

For a bounding box labeling job, Ground Truth calculates a confidence score per box. You can compare
confidence scores within one image or across images for the same labeling type (human or auto). You
can't compare confidence scores across labeling jobs.


If a single worker annotates a task (NumberOfHumanWorkersPerDataObject is set to 1, or in the
console, you enter 1 for Number of workers per dataset object), the confidence score is set to 0.00.
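
For example, a minimal sketch like the following ranks the labeled objects in one output manifest by
confidence so you can review the least confident ones first; the label attribute name sport is assumed
from the earlier example.

import json

with open("output.manifest") as f:
    records = [json.loads(line) for line in f]

records.sort(key=lambda r: r["sport-metadata"]["confidence"])
for record in records[:10]:  # ten least confident labels
    print(record["sport-metadata"]["confidence"], record["source-ref"])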

Worker Metadata
Ground Truth provides information that you can use to track individual workers in task output data. The
following data is located in the directories under worker-response in the Annotations
Directory (p. 778):

• The acceptanceTime is the time that the worker accepted the task. The format of this date and time
stamp is YYYY-MM-DDTHH:MM:SS.mmmZ for the year (YYYY), month (MM), day (DD), hour (HH), minute
(MM), second (SS) and millisecond (mmm). The date and time are separated by a T.
• The submissionTime is the time that the worker submitted their annotations using the Submit
button. The format of this date and time stamp is YYYY-MM-DDTHH:MM:SS.mmmZ for the year (YYYY),
month (MM), day (DD), hour (HH), minute (MM), second (SS) and millisecond (mmm). The date and time are
separated by a T.
• timeSpentInSeconds reports the total time, in seconds, that a worker actively worked on that task.
This metric does not include time when a worker paused or took a break.
• The workerId is unique to each worker.
• If you use a private workforce, in workerMetadata, you see the following.
• The identityProviderType is the service used to manage the private workforce.
• The issuer is the Cognito user pool or OIDC Identity Provider (IdP) issuer associated with the work
team assigned to this human review task.
• A unique sub identifier refers to the worker. If you create a workforce using Amazon Cognito, you
can use this ID with Amazon Cognito to retrieve details about the worker (such as the name or user
name). To learn how, see Managing and Searching for User Accounts in the Amazon Cognito Developer
Guide.

The following is an example of the output you may see if you use Amazon Cognito to create a private
workforce. This is identified in the identityProviderType.

"submissionTime": "2020-12-28T18:59:58.321Z",
"acceptanceTime": "2020-12-28T18:59:15.191Z",
"timeSpentInSeconds": 40.543,
"workerId": "a12b3cdefg4h5i67",
"workerMetadata": {
"identityData": {
"identityProviderType": "Cognito",
"issuer": "https://cognito-idp.aws-region.amazonaws.com/aws-region_123456789",
"sub": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
}
}

The following is an example of the workerMetadata you may see if you use your own OIDC IdP to
create a private workforce:

"workerMetadata": {
"identityData": {
"identityProviderType": "Oidc",
"issuer": "https://example-oidc-ipd.com/adfs",
"sub": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
}
}

To learn more about using private workforces, see Use a Private Workforce (p. 868).


Output Metadata
The output from each job contains metadata about the label assigned to data objects. These elements
are the same for all jobs with minor variations. The following example shows the metadata elements:

"confidence": 0.00,
"type": "groundtruth/image-classification",
"job-name": "identify-animal-species",
"human-annotated": "yes",
"creation-date": "2020-10-18T22:18:13.527256"

The elements have the following meaning:

• confidence – The confidence that Ground Truth has that the label is correct. For more information,
see Confidence Score (p. 778).
• type – The type of classification job. For job types, see Built-in Task Types (p. 704).
• job-name – The name assigned to the job when it was created.
• human-annotated – Whether the data object was labeled by a human or by automated data labeling.
For more information, see Automate Data Labeling (p. 807).
• creation-date – The date and time that the label was created.

Classification Job Output


The following are sample outputs (output manifest files) from an image classification job and a text
classification job. They include the label that Ground Truth assigned to the data object, the value for the
label, and metadata that describes the label.

In addition to the standard metadata elements, the metadata for a classification job includes the text
value of the label's class. For more information, see Image Classification - MXNet (p. 1506).

The red, italicized text in the examples below depends on labeling job specifications and output data.

{
"source-ref":"s3://AWSDOC-EXAMPLE-BUCKET/example_image.jpg",
"species":"0",
"species-metadata":
{
"class-name": "dog",
"confidence": 0.00,
"type": "groundtruth/image-classification",
"job-name": "identify-animal-species",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256"
}
}

{
"source":"The food was delicious",
"mood":"1",
"mood-metadata":
{
"class-name": "positive",
"confidence": 0.8,
"type": "groundtruth/text-classification",
"job-name": "label-sentiment",
"human-annotated": "yes",
"creation-date": "2020-10-18T22:18:13.527256"
}
}


Multi-label Classification Job Output


The following are example output manifest files from a multi-label image classification job and a multi-
label text classification job. They include the labels that Ground Truth assigned to the data object (for
example, the image or piece of text) and metadata that describes the labels the worker saw when
completing the labeling task.

The label attribute name parameter (for example, image-label-attribute-name) contains an array
of all of the labels selected by at least one of the workers who completed this task. This array contains
integer keys (for example, [1,0,8]) that correspond to the labels found in class-map. In the multi-
label image classification example, bicycle, person, and clothing were selected by at least one of
the workers who completed the labeling task for the image, exampleimage.jpg.

The confidence-map shows the confidence score that Ground Truth assigned to each label selected by
a worker. To learn more about Ground Truth confidence scores, see Confidence Score (p. 778).

The red, italicized text in the examples below depends on labeling job specifications and output data.

The following is an example of a multi-label image classification output manifest file.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.jpg",
"image-label-attribute-name":[1,0,8],
"image-label-attribute-name-metadata":
{
"job-name":"labeling-job/image-label-attribute-name",
"class-map":
{
"1":"bicycle","0":"person","8":"clothing"
},
"human-annotated":"yes",
"creation-date":"2020-02-27T21:36:25.000201",
"confidence-map":
{
"1":0.95,"0":0.77,"8":0.2
},
"type":"groundtruth/image-classification-multilabel"
}
}

The following is an example of a multi-label text classification output manifest file. In this example,
approving, sad and critical were selected by at least one of the workers who completed the
labeling task for the object exampletext.txt found in AWSDOC-EXAMPLE-BUCKET.

{
"source-ref": "AWSDOC-EXAMPLE-BUCKET/text_file.txt",
"text-label-attribute-name":[1,0,4],
"text-label-attribute-name-metadata":
{
"job-name":"labeling-job/text-label-attribute-name",
"class-map":
{
"1":"approving","0":"sad","4":"critical"
},
"human-annotated":"yes",
"creation-date":"2020-02-20T21:36:25.000201",
"confidence-map":
{
"1":0.95,"0":0.77,"4":0.2
},


"type":"groundtruth/text-classification-multilabel"
}
}

Bounding Box Job Output


The following is sample output (output manifest file) from a bounding box job. For this task, three
bounding boxes are returned. The label value contains information about the size of the image, and the
location of the bounding boxes.

The class_id element is the index of the box's class in the list of available classes for the task. The
class-map metadata element contains the text of the class.

The metadata has a separate confidence score for each bounding box. The metadata also includes the
class-map element that maps the class_id to the text value of the class. For more information, see
Object Detection - MXNet (p. 1530).

The red, italicized text in the examples below depends on labeling job specifications and output data.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.png",
"bounding-box-attribute-name":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],
"annotations":
[
{"class_id": 0, "left": 111, "top": 134,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 20, "top": 20,
"width": 30, "height": 30}
]
},
"bounding-box-attribute-name-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-dogs-and-toys"
}
}
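
A short sketch like the following pairs each box in a record like the one above with its class name and
per-box confidence; the label attribute name bounding-box-attribute-name and the local manifest file
name are assumptions matching the example.

import json

with open("output.manifest") as f:
    record = json.loads(f.readline())

label = record["bounding-box-attribute-name"]
meta = record["bounding-box-attribute-name-metadata"]
for box, obj in zip(label["annotations"], meta["objects"]):
    class_name = meta["class-map"][str(box["class_id"])]
    print(class_name, obj["confidence"],
          box["left"], box["top"], box["width"], box["height"])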

The output of a bounding box adjustment job looks like the following JSON. Note that the original JSON
is kept intact and two new attributes are listed, each with “adjusted-” prepended to the original
attribute’s name.

{
"source-ref": "S3 bucket location",
"bounding-box-attribute-name":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],


"annotations":
[
{"class_id": 0, "left": 111, "top": 134,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 20, "top": 20,
"width": 30, "height": 30}
]
},
"bounding-box-attribute-name-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-dogs-and-toys"
},
"adjusted-bounding-box":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],
"annotations":
[
{"class_id": 0, "left": 110, "top": 135,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 10, "top": 10,
"width": 30, "height": 30}
]
},
"adjusted-bounding-box-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"job-name": "adjust-bounding-boxes-on-dogs-and-toys",
"adjustment-status": "adjusted"
}
}

In this output, the job's type doesn't change, but an adjustment-status field is added. This field has
the value of adjusted or unadjusted. If multiple workers have reviewed the object and at least one
adjusted the label, the status is adjusted.


Named Entity Recognition


The following is an example output manifest file from a named entity recognition (NER) labeling task.
For this task, seven entities are returned.

In the output manifest, the JSON object, annotations, includes a list of the labels (label categories)
that you provided.

Worker responses are in a list named entities. Each entity in this list is a JSON object that contains
a label value that matches one in the labels list, an integer startOffset value for the labeled span's
starting Unicode offset, and an integer endOffset value for the ending Unicode offset.

The metadata has a separate confidence score for each entity. If a single worker labeled each data object,
the confidence value for each entity will be zero.
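
Because startOffset and endOffset are Unicode offsets into the source text, you can recover each
labeled span with string slicing, as in this minimal sketch using the first entity from the example
below:

source = ("Amazon SageMaker is a cloud machine-learning platform that was "
          "launched in November 2017.")
entity = {"label": "Thing", "startOffset": 22, "endOffset": 53}
print(entity["label"], source[entity["startOffset"]:entity["endOffset"]])
# Thing cloud machine-learning platform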

The red, italicized text in the examples below depends on labeling job inputs and worker responses.

{
"source": "Amazon SageMaker is a cloud machine-learning platform that was launched
in November 2017. SageMaker enables developers to create, train, and deploy machine-
learning (ML) models in the cloud. SageMaker also enables developers to deploy ML models on
embedded systems and edge-devices",
"ner-labeling-job-attribute-name": {
"annotations": {
"labels": [
{
"label": "Date",
"shortDisplayName": "dt"
},
{
"label": "Verb",
"shortDisplayName": "vb"
},
{
"label": "Thing",
"shortDisplayName": "tng"
},
{
"label": "People",
"shortDisplayName": "ppl"
}
],
"entities": [
{
"label": "Thing",
"startOffset": 22,
"endOffset": 53
},
{
"label": "Thing",
"startOffset": 269,
"endOffset": 281
},
{
"label": "Verb",
"startOffset": 63,
"endOffset": 71
},
{
"label": "Verb",
"startOffset": 228,
"endOffset": 234
},
{


"label": "Date",
"startOffset": 75,
"endOffset": 88
},
{
"label": "People",
"startOffset": 108,
"endOffset": 118
},
{
"label": "People",
"startOffset": 214,
"endOffset": 224
}
]
}
},
"ner-labeling-job-attribute-name-metadata": {
"job-name": "labeling-job/example-ner-labeling-job",
"type": "groundtruth/text-span",
"creation-date": "2020-10-29T00:40:39.398470",
"human-annotated": "yes",
"entities": [
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
},
{
"confidence": 0
}
]
}
}

Label Verification Job Output


The output (output manifest file) of a bounding box verification job looks different than the output of
a bounding box annotation job. That's because the workers have a different type of task. They're not
labeling objects, but evaluating the accuracy of prior labeling, making a judgment, and then providing
that judgment and perhaps some comments.

If human workers are verifying or adjusting prior bounding box labels, the output of a verification job
would look like the following JSON. The red, italicized text in the examples below depends on labeling
job specifications and output data.

{
"source-ref":"s3://AWSDOC-EXAMPLE-BUCKET/image_example.png",
"bounding-box-attribute-name":
{
"image_size": [{ "width": 500, "height": 400, "depth":3}],


"annotations":
[
{"class_id": 0, "left": 111, "top": 134,
"width": 61, "height": 128},
{"class_id": 5, "left": 161, "top": 250,
"width": 30, "height": 30},
{"class_id": 5, "left": 20, "top": 20,
"width": 30, "height": 30}
]
},
"bounding-box-attribute-name-metadata":
{
"objects":
[
{"confidence": 0.8},
{"confidence": 0.9},
{"confidence": 0.9}
],
"class-map":
{
"0": "dog",
"5": "bone"
},
"type": "groundtruth/object-detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-dogs-and-toys"
},
"verify-bounding-box-attribute-name":"1",
"verify-bounding-box-attribute-name-metadata":
{
"class-name": "bad",
"confidence": 0.93,
"type": "groundtruth/label-verification",
"job-name": "verify-bounding-boxes",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"worker-feedback": [
{"comment": "The bounding box on the bird is too wide on the right side."},
{"comment": "The bird on the upper right is not labeled."}
]
}
}

Although the type on the original bounding box output was groundtruth/object-detection,
the new type is groundtruth/label-verification. Also note that the worker-feedback array
provides worker comments. If the worker doesn't provide comments, the empty fields are excluded
during consolidation.

Semantic Segmentation Job Output


The following is the output manifest file from a semantic segmentation labeling job. The value of the
label for this job is a reference to a PNG file in an Amazon S3 bucket.

In addition to the standard elements, the metadata for the label includes a color map that defines which
color is used to label the image, the class name associated with the color, and the confidence score for
each color. For more information, see Semantic Segmentation Algorithm (p. 1549).

The red, italicized text in the examples below depends on labeling job specifications and output data.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_city_image.png",
"city-streets-ref": "S3 bucket location",


"city-streets-ref-metadata": {
"internal-color-map": {
"0": {
"class-name": "BACKGROUND",
"confidence": 0.9,
"hex-color": "#ffffff"
},
"1": {
"class-name": "buildings",
"confidence": 0.9,
"hex-color": "#2acf59"
},
"2": {
"class-name": "road",
"confidence": 0.9,
"hex-color": "#f28333"
}
},
"type": "groundtruth/semantic-segmentation",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "label-city-streets",
},
"verify-city-streets-ref":"1",
"verify-city-streets-ref-metadata":
{
"class-name": "bad",
"confidence": 0.93,
"type": "groundtruth/label-verification",
"job-name": "verify-city-streets",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"worker-feedback": [
{"comment": "The mask on the leftmost building is assigned the wrong side of
the road."},
{"comment": "The curb of the road is not labeled but the instructions say
otherwise."}
]
}
}

Confidence is scored on a per-image basis. Confidence scores are the same across all classes within an
image.

The output of a semantic segmentation adjustment job looks similar to the following JSON.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_city_image.png",
"city-streets-ref": "S3 bucket location",
"city-streets-ref-metadata": {
"internal-color-map": {
"0": {
"class-name": "BACKGROUND",
"confidence": 0.9,
"hex-color": "#ffffff"
},
"1": {
"class-name": "buildings",
"confidence": 0.9,
"hex-color": "#2acf59"
},
"2": {
"class-name": "road",
"confidence": 0.9,


"hex-color": "#f28333"
}
},
"type": "groundtruth/semantic-segmentation",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "label-city-streets",
},
"adjusted-city-streets-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_city_image.png",
"adjusted-city-streets-ref-metadata": {
"internal-color-map": {
"0": {
"class-name": "BACKGROUND",
"confidence": 0.9,
"hex-color": "#ffffff"
},
"1": {
"class-name": "buildings",
"confidence": 0.9,
"hex-color": "#2acf59"
},
"2": {
"class-name": "road",
"confidence": 0.9,
"hex-color": "#f28333"
}
},
"type": "groundtruth/semantic-segmentation",
"human-annotated": "yes",
"creation-date": "2018-11-20T22:18:13.527256",
"job-name": "adjust-label-city-streets",
}
}

Video Frame Object Detection Output


The following is the output manifest file from a video frame object detection labeling job. The red,
italicized text in the examples below depends on labeling job specifications and output data.

In addition to the standard elements, the metadata includes a class map that lists each class that has at
least one label in the sequence. The metadata also includes job-name, which is the name you assigned
to the labeling job. For adjustment tasks, if one or more bounding boxes were modified, there is an
adjustment-status parameter in the metadata for audit workflows that is set to adjusted.

{
"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-path/input-manifest.json",
"CarObjectDetection-ref": "s3://AWSDOC-EXAMPLE-BUCKET/output/labeling-job-name/
annotations/consolidated-annotation/output/0/SeqLabel.json",
"CarObjectDetection-ref-metadata": {
"class-map": {
"0": "car",
"1": "bus"
},
"job-name": "labeling-job/labeling-job-name",
"human-annotated": "yes",
"creation-date": "2021-09-29T05:50:35.566000",
"type": "groundtruth/video-object-detection"
}
}

Ground Truth creates one output sequence file for each sequence of video frames that was labeled. Each
output sequence file contains the following:


• All annotations for all frames in a sequence in the detection-annotations list of JSON objects.
• For each frame that was annotated by a worker, the frame file name (frame), number (frame-no), a
list of JSON objects containing annotations (annotations), and if applicable, frame-attributes.
The name of this list is defined by the task type you use: polylines, polygons, keypoints, and for
bounding boxes, annotations.

Each JSON object contains information about a single annotation and associated label. The following
table outlines the parameters you'll see for each video frame task type.

Task Type      Parameters

Bounding Box   Box dimensions: height and width.
               Box top, left corner pixel location: top and left.

Keypoint       Keypoint vertices: { "x": int, "y": int }

Polygon        A list of polygon vertices: vertices.
               Polygon vertices: { "x": int, "y": int }
               A polygon is a closed shape, so the first point also represents the last point.

Polyline       A list of polyline vertices: vertices.
               Polyline vertices: { "x": int, "y": int }

In addition to task type specific values, you will see the following in each JSON object:
• Values of any label-category-attributes that were specified for that label.
• The class-id of the box. Use the class-map in the output manifest file to see which label
category this ID maps to.

The following is an example of a SeqLabel.json file from a bounding box video frame object
detection labeling job. This file will be located under s3://your-output-bucket/output-prefix/
annotations/consolidated-annotation/output/annotation-number/

{
"detection-annotations": [
{
"annotations": [
{
"height": 41,
"width": 53,
"top": 152,
"left": 339,
"class-id": "1",
"label-category-attributes": {
"occluded": "no",
"size": "medium"
}
},
{
"height": 24,
"width": 37,
"top": 148,
"left": 183,
"class-id": "0",
"label-category-attributes": {


"occluded": "no",
}
}
],
"frame-no": 0,
"frame": "frame_0000.jpeg",
"frame-attributes": {name: value, name: value}
},
{
"annotations": [
{
"height": 41,
"width": 53,
"top": 152,
"left": 341,
"class-id": "0",
"label-category-attributes": {}
},
{
"height": 24,
"width": 37,
"top": 141,
"left": 177,
"class-id": "0",
"label-category-attributes": {
"occluded": "no",
}
}
],
"frame-no": 1,
"frame": "frame_0001.jpeg",
"frame-attributes": {name: value, name: value}
}
]
}

Video Frame Object Tracking Output


The following is the output manifest file from a video frame object tracking labeling job. The red,
italicized text in the examples below depends on labeling job specifications and output data.

In addition to the standard elements, the metadata includes a class map that lists each class that has at
least one label in the sequence of frames. The metadata also includes job-name, which is the name you
assigned to the labeling job. For adjustment tasks, if one or more bounding boxes were modified, there is
an adjustment-status parameter in the metadata for audit workflows that is set to adjusted.

{
"source-ref": "s3://DOC-EXAMPLE-BUCKET/example-path/input-manifest.json",
"CarObjectTracking-ref": "s3://AWSDOC-EXAMPLE-BUCKET/output/labeling-job-name/
annotations/consolidated-annotation/output/0/SeqLabel.json",
"CarObjectTracking-ref-metadata": {
"class-map": {
"0": "car",
"1": "bus"
},
"job-name": "labeling-job/labeling-job-name",
"human-annotated": "yes",
"creation-date": "2021-09-29T05:50:35.566000",
"type": "groundtruth/video-object-tracking"
}
}


Ground Truth creates one output sequence file for each sequence of video frames that was labeled. Each
output sequence file contains the following:

• All annotations for all frames in a sequence in the tracking-annotations list of JSON objects.
• For each frame that was annotated by a worker, the frame (frame), number (frame-no), a list of
JSON objects containing annotations (annotations), and if applicable, frame attributes (frame-
attributes). The name of this list is defined by the task type you use: polylines, polygons,
keypoints, and for bounding boxes, annotations.

Each JSON object contains information about a single annotation and associated label. The following
table outlines the parameters you'll see for each video frame task type.

Task Type      Parameters

Bounding Box   Box dimensions: height and width.
               Box top, left corner pixel location: top and left.

Keypoint       Keypoint vertices: { "x": int, "y": int }

Polygon        A list of polygon vertices: vertices.
               Polygon vertices: { "x": int, "y": int }
               A polygon is a closed shape, so the first point also represents the last point.

Polyline       A list of polyline vertices: vertices.
               Polyline vertices: { "x": int, "y": int }

In addition to task type specific values, you will see the following in each JSON object:
• Values of any label-category-attributes that were specified for that label.
• The class-id of the box. Use the class-map in the output manifest file to see which label
category this ID maps to.
• An object-id which identifies an instance of a label. This ID will be the same across frames if a
worker identified the same instance of an object in multiple frames. For example, if a car appeared in
multiple frames, all bounding boxes used to identify that car would have the same object-id.
• The object-name, which is the instance ID of that annotation.

The following is an example of a SeqLabel.json file from a bounding box video frame object
tracking labeling job. This file will be located under s3://your-output-bucket/output-prefix/
annotations/consolidated-annotation/output/annotation-number/

{
"tracking-annotations": [
{
"annotations": [
{
"height": 36,
"width": 46,
"top": 178,
"left": 315,
"class-id": "0",
"label-category-attributes": {
"occluded": "no"
},
"object-id": "480dc450-c0ca-11ea-961f-a9b1c5c97972",


"object-name": "car:1"
}
],
"frame-no": 0,
"frame": "frame_0001.jpeg",
"frame-attributes": {}
},
{
"annotations": [
{
"height": 30,
"width": 47,
"top": 163,
"left": 344,
"class-id": "1",
"label-category-attributes": {
"occluded": "no",
"size": "medium"
},
"object-id": "98f2b0b0-c0ca-11ea-961f-a9b1c5c97972",
"object-name": "bus:1"
},
{
"height": 28,
"width": 33,
"top": 150,
"left": 192,
"class-id": "0",
"label-category-attributes": {
"occluded": "partially"
},
"object-id": "480dc450-c0ca-11ea-961f-a9b1c5c97972",
"object-name": "car:1"
}
],
"frame-no": 1,
"frame": "frame_0002.jpeg",
"frame-attributes": {name: value, name: value}
}
]
}
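
To follow a single object instance across frames, you can group a SeqLabel.json file by object-id, as in
this minimal sketch; the local file name is a placeholder.

import json

with open("SeqLabel.json") as f:
    seq = json.load(f)

tracks = {}
for frame in seq["tracking-annotations"]:
    for annotation in frame["annotations"]:
        tracks.setdefault(annotation["object-id"], []).append(
            (frame["frame-no"], annotation["object-name"]))

print(tracks)  # e.g. {"480dc450-...": [(0, "car:1"), (1, "car:1")], ...}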

3D Point Cloud Semantic Segmentation Output


The following is the output manifest file from a 3D point cloud semantic segmentation labeling job.

In addition to the standard elements, the metadata for the label includes a color map that defines
which color is used to label the image, the class name associated with the color, and the confidence
score for each color. Additionally, there is an adjustment-status parameter in the metadata for
audit workflows that is set to adjusted if the color mask is modified. If you added one or more
frameAttributes to your label category configuration file, worker responses for frame attributes are
in the JSON object, dataset-object-attributes.

The your-label-attribute-ref parameter contains the location of a compressed file with a .zlib
extension. When you uncompress this file, it contains an array. Each index in the array corresponds to the
index of an annotated point in the input point cloud. The value of the array at a given index gives the
class of the point at the same index in the point cloud, based on the semantic color map found in the
color-map parameter of the metadata.

You can use Python code similar to the following to decompress a .zlib file:

import zlib
from array import array

# Read the compressed label file (replace the path with your .zlib file location)
with open('zlib_file_path/file.zlib', 'rb') as f:
    compressed_binary_file = f.read()

# Uncompress the label file
binary_content = zlib.decompress(compressed_binary_file)

# Load the labels into an array of unsigned bytes
my_int_array_data = array('B', binary_content)

print(my_int_array_data)

The code block above produces output similar to the following. Each element of the printed array
contains the class of the point at that index in the point cloud. For example, my_int_array_data[0]
= 1 means point[0] in the input point cloud has class 1. In the following output manifest file
example, class 0 corresponds with "Background", 1 with "Car", and 2 with "Pedestrian".

>> array('B', [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
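
As a short sketch that builds on the previous snippet, you can translate the decoded class IDs into
class names by looking them up in the color-map metadata from your output manifest. The color_map
values below are copied from the example manifest that follows, and my_int_array_data comes from the
decompression code above.

from collections import Counter

# Color map copied from the "color-map" metadata in the example manifest below
color_map = {
    "0": {"class-name": "Background"},
    "1": {"class-name": "Car"},
    "2": {"class-name": "Pedestrian"},
}

# Count how many points were assigned to each class and print the class names
counts = Counter(my_int_array_data)
for class_id in sorted(counts):
    class_name = color_map[str(class_id)]["class-name"]
    print(f"{class_name}: {counts[class_id]} points")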

The following is an example of a semantic segmentation 3D point cloud labeling job output manifest file.
The red, italicized text in the examples below depends on labeling job specifications and output data.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/examplefolder/frame1.bin",
"source-ref-metadata":{
"format": "binary/xyzi",
"unix-timestamp": 1566861644.759115,
"ego-vehicle-pose":{...},
"prefix": "s3://AWSDOC-EXAMPLE-BUCKET/lidar_singleframe_dataset/prefix",
"images": [{...}]
},
"lidar-ss-label-attribute-ref": "s3://your-output-bucket/labeling-job-name/annotations/
consolidated-annotation/output/dataset-object-id/filename.zlib",
"lidar-ss-label-attribute-ref-metadata": {
"color-map": {
"0": {
"class-name": "Background",
"hex-color": "#ffffff",
"confidence": 0.00
},
"1": {
"class-name": "Car",
"hex-color": "#2ca02c",
"confidence": 0.00
},
"2": {
"class-name": "Pedestrian",
"hex-color": "#1f77b4",
"confidence": 0.00
},
"3": {
"class-name": "Tree",
"hex-color": "#ff7f0e",
"confidence": 0.00
}
},
"type": "groundtruth/point_cloud_single_frame_semantic_segmentation",
"human-annotated": "yes",
"creation-date": "2019-11-12T01:18:14.271944",
"job-name": "labeling-job-name",
//only present for adjustment audit workflow
"adjustment-status": "adjusted", // "adjusted" means the label was adjusted

793
Amazon SageMaker Developer Guide
Use Input and Output Data

"dataset-object-attributes": {name: value, name: value}


}
}

3D Point Cloud Object Detection Output


The following is sample output from a 3D point cloud object detection job. For this task type, the data
about 3D cuboids is returned in the 3d-bounding-box parameter, in a list named annotations. In this
list, each 3D cuboid is described using the following information.

• Each class, or label category, that you specify in your input manifest is associated with a class-id.
Use the class-map to identify the class associated with each class ID.
• These classes are used to give each 3D cuboid an object-name in the format <class>:<integer>
where integer is a unique number to identify that cuboid in the frame.
• center-x, center-y, and center-z are the coordinates of the center of the cuboid, in the same
coordinate system as the 3D point cloud input data used in your labeling job.
• length, width, and height describe the dimensions of the cuboid.
• yaw is used to describe the orientation (heading) of the cuboid in radians.
Note
yaw is now reported in the right-handed Cartesian system. This change took effect on
September 02, 2022 19:02:17 UTC; to convert a yaw measurement in output data generated
before that time, use the following relationship (all units are in radians):

old_yaw_in_output = pi - yaw

• In our definition, +x is to the right, +y is forward, and +z is up from the ground plane. The
rotation order is x - y - z. Roll, pitch, and yaw are represented in the right-handed Cartesian
system. In 3D space, roll is along the x-axis, pitch is along the y-axis, and yaw is along the
z-axis. All three are counterclockwise.
• If you included label attributes in your input manifest file for a given class, a label-category-
attributes parameter is included for all cuboids for which workers selected label attributes.

If one or more cuboids were modified, there is an adjustment-status parameter in the metadata for
audit workflows that is set to adjusted. If you added one or more frameAttributes to your label
category configuration file, worker responses for frame attributes are in the JSON object, dataset-
object-attributes.

The red, italicized text in the examples below depends on labeling job specifications and output
data. The ellipses (...) denote a continuation of that list, where additional objects with the same format
as the preceding object can appear.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/examplefolder/frame1.txt",
"source-ref-metadata":{
"format": "text/xyzi",
"unix-timestamp": 1566861644.759115,
"prefix": "s3://AWSDOC-EXAMPLE-BUCKET/lidar_singleframe_dataset/prefix",
"ego-vehicle-pose": {
"heading": {
"qx": -0.02111296123795955,
"qy": -0.006495469416730261,
"qz": -0.008024565904865688,
"qw": 0.9997181192298087
},
"position": {
"x": -2.7161461413869947,

794
Amazon SageMaker Developer Guide
Use Input and Output Data

"y": 116.25822288149078,
"z": 1.8348751887989483
}
},
"images": [
{
"fx": 847.7962624528487,
"fy": 850.0340893791985,
"cx": 576.2129134707038,
"cy": 317.2423573573745,
"k1": 0,
"k2": 0,
"k3": 0,
"k4": 0,
"p1": 0,
"p2": 0,
"skew": 0,
"unix-timestamp": 1566861644.759115,
"image-path": "images/frame_0_camera_0.jpg",
"position": {
"x": -2.2722515189268138,
"y": 116.86003310568965,
"z": 1.454614668542299
},
"heading": {
"qx": 0.7594754093069037,
"qy": 0.02181790885672969,
"qz": -0.02461725233103356,
"qw": -0.6496916273040025
},
"camera_model": "pinhole"
}
]
},
"3d-bounding-box":
{
"annotations": [
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -2.616382013657516,
"center-y": 125.04149850484193,
"center-z": 0.311272296465834,
"length": 2.993000265181146,
"width": 1.8355260519692056,
"height": 1.3233490884304047,
"roll": 0,
"pitch": 0,
"yaw": 1.6479308313703527
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:2",
"class-id": 0,
"center-x": -5.188984560617168,
"center-y": 99.7954483288783,
"center-z": 0.2226435567445657,
"length": 4,
"width": 2,

795
Amazon SageMaker Developer Guide
Use Input and Output Data

"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.6243170732068055
}
]
},
"3d-bounding-box-metadata":
{
"objects": [],
"class_map":
{
"0": "Car",
},
"type": "groundtruth/point_cloud_object_detection",
"human-annotated": "yes",
"creation-date": "2018-10-18T22:18:13.527256",
"job-name": "identify-3d-objects",
"adjustment-status": "adjusted",
"dataset-object-attributes": {name: value, name: value}
}
}
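
If you work with output data generated before the yaw convention change described in the note above,
a small helper similar to the following sketch can convert old yaw values to the current right-handed
convention. The sample value is taken from the example above for illustration only.

import math

def convert_old_yaw(old_yaw_in_output):
    """Convert a yaw value recorded before September 02, 2022 to the
    current convention (per the note above, yaw = pi - old_yaw_in_output)."""
    return math.pi - old_yaw_in_output

# Example usage with a yaw value from the output manifest above
print(convert_old_yaw(1.6479308313703527))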

3D Point Cloud Object Tracking Output


The following is an example of an output manifest file from a 3D point cloud object tracking labeling
job. The red, italicized text in the examples below depends on labeling job specifications and
output data. The ellipses (...) denote a continuation of that list, where additional objects with the same
format as the preceding object can appear.

In addition to the standard elements, the metadata includes a class map that lists each class that has at
least one label in the sequence. If one or more cuboids were modified, there is an adjustment-status
parameter in the metadata for audit workflows that is set to adjusted.

{
"source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/myfolder/seq1.json",
"lidar-label-attribute-ref": "s3://<CustomerOutputLocation>/<labelingJobName>/
annotations/consolidated-annotation/output/<datasetObjectId>/SeqLabel.json",
"lidar-label-attribute-ref-metadata": {
"objects":
[
{
"frame-no": 300,
"confidence": []
},
{
"frame-no": 301,
"confidence": []
},
...
],
"class-map": {"0": "Car", "1": "Person"},
"type": "groundtruth/point_cloud_object_tracking",
"human-annotated": "yes",
"creation-date": "2019-11-12T01:18:14.271944",
"job-name": "identify-3d-objects",
"adjustment-status": "adjusted"
}
}

In the above example, the cuboid data for each frame in seq1.json is in SeqLabel.json in the
Amazon S3 location, s3://<customerOutputLocation>/<labelingJobName>/annotations/
consolidated-annotation/output/<datasetObjectId>/SeqLabel.json. The following is an
example of this label sequence file.

For each frame in the sequence, you see the frame-number, the frame-name, frame-attributes (if
applicable), and a list of annotations. This list contains the 3D cuboids that were drawn for that
frame. Each annotation includes the following information:

• An object-name in the format <class>:<integer> where class identifies the label category and
integer is a unique ID across the dataset.
• When workers draw a cuboid, it is associated with a unique object-id which is associated with all
cuboids that identify the same object across multiple frames.
• Each class, or label category, that you specified in your input manifest is associated with a class-id.
Use the class-map to identify the class associated with each class ID.
• center-x, center-y, and center-z are the coordinates of the center of the cuboid, in the same
coordinate system as the 3D point cloud input data used in your labeling job.
• length, width, and height describe the dimensions of the cuboid.
• yaw is used to describe the orientation (heading) of the cuboid in radians.
Note
yaw is now reported in the right-handed Cartesian system. This change took effect on
September 02, 2022 19:02:17 UTC; to convert a yaw measurement in output data generated
before that time, use the following relationship (all units are in radians):

old_yaw_in_output = pi - yaw

• In our definition, +x is to the right, +y is forward, and +z is up from the ground plane. The
rotation order is x - y - z. Roll, pitch, and yaw are represented in the right-handed Cartesian
system. In 3D space, roll is along the x-axis, pitch is along the y-axis, and yaw is along the
z-axis. All three are counterclockwise.
• If you included label attributes in your input manifest file for a given class, a label-category-
attributes parameter is included for all cuboids for which workers selected label attributes.

{
"tracking-annotations": [
{
"frame-number": 0,
"frame-name": "0.txt.pcd",
"frame-attributes": {name: value, name: value},
"annotations": [
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": -2.2906369208300674,
"center-y": 103.73924823843463,
"center-z": 0.37634114027023313,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5827222214406014,
"object-id": "ae5dc770-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"

797
Amazon SageMaker Developer Guide
Use Input and Output Data

},
"object-name": "Car:1",
"class-id": 0,
"center-x": -2.6451293634707413,
"center-y": 124.9534455706848,
"center-z": 0.5020834081743839,
"length": 4,
"width": 2,
"height": 2.080488827301309,
"roll": 0,
"pitch": 0,
"yaw": -1.5963335581398077,
"object-id": "06efb020-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:2",
"class-id": 0,
"center-x": -5.205611313118477,
"center-y": 99.91731932137061,
"center-z": 0.22917217081212138,
"length": 3.8747142207671956,
"width": 1.9999999999999918,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5672228760316775,
"object-id": "26fad020-a782-11ea-b57d-67c51a0561a1"
}
]
},
{
"frame-number": 1,
"frame-name": "1.txt.pcd",
"frame-attributes": {},
"annotations": [
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": -2.2906369208300674,
"center-y": 103.73924823843463,
"center-z": 0.37634114027023313,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5827222214406014,
"object-id": "ae5dc770-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -2.6451293634707413,
"center-y": 124.9534455706848,
"center-z": 0.5020834081743839,
"length": 4,
"width": 2,

798
Amazon SageMaker Developer Guide
Use Input and Output Data

"height": 2.080488827301309,
"roll": 0,
"pitch": 0,
"yaw": -1.5963335581398077,
"object-id": "06efb020-a782-11ea-b57d-67c51a0561a1"
},
{
"label-category-attributes": {
"Occlusion": "Partial",
"Type": "Sedan"
},
"object-name": "Car:2",
"class-id": 0,
"center-x": -5.221311072916759,
"center-y": 100.4639841045424,
"center-z": 0.22917217081212138,
"length": 3.8747142207671956,
"width": 1.9999999999999918,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 1.5672228760316775,
"object-id": "26fad020-a782-11ea-b57d-67c51a0561a1"
}
]
}
]
}

3D-2D Point Cloud Object Tracking Output


The following is an example of an output manifest file from a 3D-2D point cloud object tracking
labeling job. The red, italicized text in the examples below depends on labeling job specifications
and output data. The ellipses (...) denote a continuation of that list, where additional objects with
the same format as the preceding object can appear.

In addition to the standard elements, the metadata includes a class map that lists each class that has at
least one label in the sequence. If one or more cuboids were modified, there is an adjustment-status
parameter in the metadata for audit workflows that is set to adjusted.

{
"source-ref": "s3://iad-groundtruth-lidar-test-bucket/artifacts/gt-point-cloud-demos/
sequences/seq2.json",
"source-ref-metadata": {
"json-paths": [
"number-of-frames",
"prefix",
"frames{frame-no, frame}"
]
},
"3D2D-linking-ref": "s3://iad-groundtruth-lidar-test-bucket/xyz/3D2D-linking/annotations/
consolidated-annotation/output/0/SeqLabel.json",
"3D2D-linking-ref-metadata": {
"objects": [
{
"frame-no": 0,
"confidence": []
},
{
"frame-no": 1,
"confidence": []
},
{
"frame-no": 2,
"confidence": []
},
{
"frame-no": 3,
"confidence": []
},
{
"frame-no": 4,
"confidence": []
},
{
"frame-no": 5,
"confidence": []
},
{
"frame-no": 6,
"confidence": []
},
{
"frame-no": 7,
"confidence": []
},
{
"frame-no": 8,
"confidence": []
},
{
"frame-no": 9,
"confidence": []
}
],
"class-map": {
"0": "Car"
},
"type": "groundtruth/point_cloud_object_tracking",
"human-annotated": "yes",
"creation-date": "2023-01-19T02:55:10.206508",
"job-name": "mcm-linking"
},
"3D2D-linking-chain-ref": "s3://iad-groundtruth-lidar-test-bucket/xyz/3D2D-linking-chain/
annotations/consolidated-annotation/output/0/SeqLabel.json",
"3D2D-linking-chain-ref-metadata": {
"objects": [
{
"frame-no": 0,
"confidence": []
},
{
"frame-no": 1,
"confidence": []
},
{
"frame-no": 2,
"confidence": []
},
{
"frame-no": 3,
"confidence": []
},
{
"frame-no": 4,
"confidence": []
},
{
"frame-no": 5,

800
Amazon SageMaker Developer Guide
Use Input and Output Data

"confidence": []
},
{
"frame-no": 6,
"confidence": []
},
{
"frame-no": 7,
"confidence": []
},
{
"frame-no": 8,
"confidence": []
},
{
"frame-no": 9,
"confidence": []
}
],
"class-map": {
"0": "Car"
},
"type": "groundtruth/point_cloud_object_tracking",
"human-annotated": "yes",
"creation-date": "2023-01-19T03:29:49.149935",
"job-name": "3d2d-linking-chain"
}
}

In the above example, the cuboid data for each frame in seq2.json is in SeqLabel.json in the
Amazon S3 location, s3://<customerOutputLocation>/<labelingJobName>/annotations/
consolidated-annotation/output/<datasetObjectId>/SeqLabel.json. The following is an
example of this label sequence file.

For each frame in the sequence, you see the frame-number, the frame-name, frame-attributes (if
applicable), and a list of annotations. This list contains the 3D cuboids that were drawn for that
frame. Each annotation includes the following information:

• An object-name in the format <class>:<integer> where class identifies the label category and
integer is a unique ID across the dataset.
• When workers draw a cuboid, it is associated with a unique object-id which is associated with all
cuboids that identify the same object across multiple frames.
• Each class, or label category, that you specified in your input manifest is associated with a class-id.
Use the class-map to identify the class associated with each class ID.
• center-x, center-y, and center-z are the coordinates of the center of the cuboid, in the same
coordinate system as the 3D point cloud input data used in your labeling job.
• length, width, and height describe the dimensions of the cuboid.
• yaw is used to describe the orientation (heading) of the cuboid in radians.
Note
yaw is now reported in the right-handed Cartesian system. This change took effect on
September 02, 2022 19:02:17 UTC; to convert a yaw measurement in output data generated
before that time, use the following relationship (all units are in radians):

old_yaw_in_output = pi - yaw

• In our definition, +x is to the right, +y is forward, and +z is up from the ground plane. The
rotation order is x - y - z. Roll, pitch, and yaw are represented in the right-handed Cartesian
system. In 3D space, roll is along the x-axis, pitch is along the y-axis, and yaw is along the
z-axis. All three are counterclockwise.
• If you included label attributes in your input manifest file for a given class, a label-category-
attributes parameter is included for all cuboids for which workers selected label attributes.

{
"lidar": {
"tracking-annotations": [
{
"frame-number": 0,
"frame-name": "0.txt.pcd",
"annotations": [
{
"label-category-attributes": {
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": 12.172361721602815,
"center-y": 120.23067521992364,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "505b39e0-97a4-11ed-8903-dd5b8b903715"
},
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": 17.192725195301094,
"center-y": 114.55705365827872,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
},
{
"frame-number": 1,
"frame-name": "1.txt.pcd",
"annotations": [
{
"label-category-attributes": {
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -1.6841480600695489,
"center-y": 126.20198882749516,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,

802
Amazon SageMaker Developer Guide
Use Input and Output Data

"object-id": "505b39e0-97a4-11ed-8903-dd5b8b903715"
},
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": 17.192725195301094,
"center-y": 114.55705365827872,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
},
{
"frame-number": 2,
"frame-name": "2.txt.pcd",
"annotations": [
{
"label-category-attributes": {
"Type": "Sedan"
},
"object-name": "Car:1",
"class-id": 0,
"center-x": -1.6841480600695489,
"center-y": 126.20198882749516,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "505b39e0-97a4-11ed-8903-dd5b8b903715"
},
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"center-x": 17.192725195301094,
"center-y": 114.55705365827872,
"center-z": 1.590525771183712,
"length": 4,
"width": 2,
"height": 2,
"roll": 0,
"pitch": 0,
"yaw": 0,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
}
]
},
"camera-0": {
"tracking-annotations": [
{
"frame-no": 0,
"frame": "0.txt.pcd",

803
Amazon SageMaker Developer Guide
Enhanced Data Labeling

"annotations": [
{
"label-category-attributes": {
"Occlusion": "Partial"
},
"object-name": "Car:2",
"class-id": 0,
"width": 223,
"height": 164,
"top": 225,
"left": 486,
"object-id": "5229df60-97a4-11ed-8903-dd5b8b903715"
}
],
"frame-attributes": {}
},
{
"frame-no": 1,
"frame": "1.txt.pcd",
"annotations": [
{
"label-category-attributes": {},
"object-name": "Car:4",
"class-id": 0,
"width": 252,
"height": 246,
"top": 237,
"left": 473,
"object-id": "1afcb670-97a9-11ed-9a84-ff627d099e16"
}
],
"frame-attributes": {}
}
]
}
}

The cuboid and bounding box for an object are linked through a common object-id.

Enhanced Data Labeling


Amazon SageMaker Ground Truth manages sending your data objects to workers to be labeled. Labeling
each data object is a task. Workers complete each task until the entire labeling job is complete. Ground
Truth divides the total number of tasks into smaller batches that are sent to workers. A new batch is sent
to workers when the previous one is finished.

Ground Truth provides two features that help improve the accuracy of your data labels and reduce the
total cost of labeling your data:

• Annotation consolidation helps to improve the accuracy of your data object labels. It combines the
results of multiple workers' annotation tasks into one high-fidelity label.
• Automated data labeling uses machine learning to label portions of your data automatically without
having to send them to human workers.

Topics
• Control the Flow of Data Objects Sent to Workers (p. 805)
• Consolidate Annotations (p. 806)
• Automate Data Labeling (p. 807)
• Chaining Labeling Jobs (p. 813)


Control the Flow of Data Objects Sent to Workers


Depending on the type of labeling job you create, Amazon SageMaker Ground Truth sends data objects
to workers in batches or in a streaming fashion. You can control the flow of data objects to workers in
the following ways:

• For both types of labeling jobs, you can use MaxConcurrentTaskCount to control the total number
of data objects available to all workers at a given point in time when the labeling job is running.
• For streaming labeling jobs, you can control the flow of data objects to workers by monitoring and
controlling the number of data objects sent to the Amazon SQS queue associated with your labeling job.

Use the following sections to learn more about these options. To learn more about streaming labeling
jobs, see Ground Truth Streaming Labeling Jobs (p. 738).

Topics
• Use MaxConcurrentTaskCount to Control the Flow of Data Objects (p. 805)
• Use Amazon SQS to Control the Flow of Data Objects to Streaming Labeling Jobs (p. 806)

Use MaxConcurrentTaskCount to Control the Flow of Data Objects


MaxConcurrentTaskCount defines the maximum number of data objects that can be labeled by
human workers at the same time. If you use the console, this parameter is set to 1,000. If you use
CreateLabelingJob, you can set this parameter to any integer between 1 and 1,000, inclusive.

When you start a labeling job using an input manifest file, Ground Truth does the following:

1. For each data object listed in your input manifest file, one or more tasks are created, depending on
the value you specify for NumberOfHumanWorkersPerDataObject. For example, if you set the
number of workers per data object to 3, three tasks are created for each dataset object. To be marked
as successfully labeled, at least one worker must label the object; tasks can also expire or be
declined before they are completed.
2. If you are using the Mechanical Turk workforce, Ground Truth first sends a batch of 10 dataset objects
to your workers. It uses this small batch to set up the labeling job and to make sure that the job is
correctly configured.
3. Next, Ground Truth sends MaxConcurrentTaskCount number of dataset objects to workers. For
example, if you have 2,000 input data objects in your input manifest file and have set the number of
workers per data object to 3 and set MaxConcurrentTaskCount to 900, the first 900 data objects in
your input manifest are sent to workers, corresponding to 2,700 tasks (900 x 3). This is the first full-
sized set of objects sent to workers.
4. What happens next depends on the type of labeling job you create. This step assumes that one or
more dataset objects in your input manifest file, or sent using an Amazon SNS input data source (in a
streaming labeling job), were not included in the set sent to workers in step 3.
• Streaming labeling job: As long as the total number of objects available to workers is equal
to MaxConcurrentTaskCount, all remaining dataset objects on your input manifest file and
that you send in real time using Amazon SNS are placed on an Amazon SQS queue. When the
total number of objects available to workers falls below MaxConcurrentTaskCount minus
NumberOfHumanWorkersPerDataObject, a new data object from the queue is used to create
NumberOfHumanWorkersPerDataObject tasks, which are sent to workers in real time.
• Non-streaming labeling job: As workers finish labeling one set of objects, up to
MaxConcurrentTaskCount times NumberOfHumanWorkersPerDataObject number of new
tasks will be sent to workers. This process is repeated until all data objects in the input manifest file
are labeled.
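
As a sketch of how these values fit together, both parameters are set in the HumanTaskConfig of a
CreateLabelingJob request. The values below are illustrative, and the remaining required
HumanTaskConfig fields are omitted.

# Flow-control fields in HumanTaskConfig (other required fields, such as
# UiConfig, WorkteamArn, and PreHumanTaskLambdaArn, are omitted here)
human_task_config = {
    # Maximum number of data objects available to workers at once (1-1,000)
    'MaxConcurrentTaskCount': 900,
    # Number of distinct workers who label each data object
    'NumberOfHumanWorkersPerDataObject': 3,
}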


Use Amazon SQS to Control the Flow of Data Objects to Streaming Labeling
Jobs
When you create a streaming labeling job, an Amazon SQS queue is automatically created in your
account. Data objects are only added to the Amazon SQS queue when the total number of objects sent
to workers is above MaxConcurrentTaskCount. Otherwise, objects are sent directly to workers.

You can use this queue to manage the flow of data objects to your labeling job. To learn more, see
Manage Labeling Requests with an Amazon SQS Queue (p. 740).
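
For example, you can send a new data object to a streaming labeling job by publishing a message to
the job's input Amazon SNS topic. The following sketch assumes placeholder values for the topic ARN
and the Amazon S3 URI.

import json

import boto3

sns = boto3.client('sns')

# Publish one data object to the streaming labeling job's input SNS topic
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:111122223333:your-input-topic',  # placeholder ARN
    Message=json.dumps({'source-ref': 's3://your-bucket/images/image1.jpg'}),
)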

Consolidate Annotations
An annotation is the result of a single worker's labeling task. Annotation consolidation combines the
annotations of two or more workers into a single label for your data objects. A label, which is assigned to
each object in the dataset, is a probabilistic estimate of what the true label should be. Each object in the
dataset typically has multiple annotations, but only one label or set of labels.

You decide how many workers annotate each object in your dataset. Using more workers can increase the
accuracy of your labels, but also increases the cost of labeling. To learn more about Ground Truth pricing,
see Amazon SageMaker Ground Truth pricing .

If you use the Amazon SageMaker console to create a labeling job, the following are the defaults for the
number of workers who can annotate objects:

• Text classification—3 workers


• Image classification—3 workers
• Bounding boxes—5 workers
• Semantic segmentation—3 workers
• Named entity recognition—3 workers

When you use the CreateLabelingJob operation, you set the number of workers to annotate each
data object with the NumberOfHumanWorkersPerDataObject parameter. You can override the
default number of workers that annotate a data object using the console or the CreateLabelingJob
operation.

Ground Truth provides an annotation consolidation function for each of its predefined labeling
tasks: bounding box, image classification, named entity recognition, semantic segmentation, and text
classification. These are the functions:

• Multi-class annotation consolidation for image and text classification uses a variant of the Expectation
Maximization approach to annotations. It estimates parameters for each worker and uses Bayesian
inference to estimate the true class based on the class annotations from individual workers.
• Bounding box annotation consolidation combines bounding boxes from multiple workers. This function
finds the most similar boxes from different workers based on the Jaccard index, or intersection over
union, of the boxes and averages them.
• Semantic segmentation annotation consolidation treats each pixel in a single image as a multi-
class classification. This function treats the pixel annotations from workers as "votes," with more
information from surrounding pixels incorporated by applying a smoothing function to the image.
• Named entity recognition clusters text selections by Jaccard similarity and calculates selection
boundaries based on the mode, or the median if the mode isn't clear. The label resolves to the most
assigned entity label in the cluster, breaking ties by random selection.

You can use other algorithms to consolidate annotations. For information, see Create Your Own
Annotation Consolidation Function (p. 807).


Create Your Own Annotation Consolidation Function


You can choose to use your own annotation consolidation function to determine the final labels for your
labeled objects. There are many possible approaches for writing a function and the approach that you
take depends on the nature of the annotations to consolidate. Broadly, consolidation functions look
at the annotations from workers, measure the similarity between them, and then use some form of
probabilistic judgment to determine what the most probable label should be.

If you want to use other algorithms to create annotation consolidation functions, you can find the
worker responses in the [project-name]/annotations/worker-response folder of the Amazon S3
bucket where you direct the job output.

Assess Similarity
To assess the similarity between labels, you can use one of the following strategies, or you can
devise your own that meets your data labeling needs:

• For label spaces that consist of discrete, mutually exclusive categories, such as multi-class
classification, assessing similarity can be straightforward. Discrete labels either match or do not match.
• For label spaces that don't have discrete values, such as bounding box annotations, find a broad
measure of similarity. For bounding boxes, one such measure is the Jaccard index. This measures
the ratio of the intersection of two boxes with the union of the boxes to assess how similar they
are. For example, if there are three annotations, then there can be a function that determines which
annotations represent the same object and should be consolidated.
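
For example, a minimal Jaccard index computation for two boxes in the Ground Truth bounding box
format (left, top, width, height) might look like the following sketch.

def jaccard_index(box_a, box_b):
    """Compute the Jaccard index (IoU) of two boxes given as (left, top, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Dimensions of the intersection rectangle (zero if the boxes don't overlap)
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

# Two workers' boxes for the same object
print(jaccard_index((175, 87, 99, 62), (180, 90, 95, 60)))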

Assess the Most Probable Label


With one of the strategies detailed in the previous sections in mind, make some sort of probabilistic
judgment on what the consolidated label should be. In the case of discrete, mutually exclusive
categories, this can be straightforward. One of the most common ways to do this is to take the results of
a majority vote between the annotations. This weights the annotations equally.
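
For instance, a simple majority vote over discrete class annotations could be sketched as follows.

from collections import Counter

def majority_vote(annotations):
    """Return the class chosen by the most workers (ties are broken arbitrarily)."""
    return Counter(annotations).most_common(1)[0][0]

# Three workers annotated the same text passage
print(majority_vote(['positive', 'positive', 'negative']))  # prints 'positive'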

Some approaches attempt to estimate the accuracy of different annotators and weight their annotations
in proportion to the probability of correctness. An example of this is the Expectation Maximization
method, which is used in the default Ground Truth consolidation function for multi-class annotations.

For more information about creating an annotation consolidation function, see Step 3: Processing with
AWS Lambda (p. 678).

Automate Data Labeling


If you choose, Amazon SageMaker Ground Truth can use active learning to automate the labeling of your
input data for certain built-in task types. Active learning is a machine learning technique that identifies
data that should be labeled by your workers. In Ground Truth, this functionality is called automated data
labeling. Automated data labeling helps to reduce the cost and time that it takes to label your dataset
compared to using only humans. When you use automated labeling, you incur SageMaker training and
inference costs.

We recommend using automated data labeling on large datasets because the neural networks used with
active learning require a significant amount of data for every new dataset. Typically, as you provide more
data, the potential for high accuracy predictions goes up. Data will only be auto-labeled if the neural
network used in the auto-labeling model can achieve an acceptably high level of accuracy. Therefore,
with larger datasets, there is more potential to automatically label the data because the neural network
can achieve high enough accuracy for auto-labeling. Automated data labeling is most appropriate when
you have thousands of data objects. The minimum number of objects allowed for automated data
labeling is 1,250, but we strongly suggest providing a minimum of 5,000 objects.

Automated data labeling is available only for the following Ground Truth built-in task types:


• Image Classification (Single Label) (p. 545)


• Image Semantic Segmentation (p. 538)
• Object detection (Bounding Box (p. 532))
• Text Classification (Single Label) (p. 556)

Streaming labeling jobs do not support automated data labeling.

To learn how to create a custom active learning workflow using your own model, see Set up an active
learning workflow with your own model (p. 813).

Input data quotas apply for automated data labeling jobs. See Input Data Quotas (p. 742) for
information about dataset size, input data size and resolution limits.
Note
Before you use the automated-labeling model in production, you need to fine-tune or test
it, or both. You might fine-tune the model (or create and tune another supervised model of
your choice) on the dataset produced by your labeling job to optimize the model’s architecture
and hyperparameters. If you decide to use the model for inference without fine-tuning it,
we strongly recommend making sure that you evaluate its accuracy on a representative (for
example, randomly selected) subset of the dataset labeled with Ground Truth and that it
matches your expectations.

How it Works
You enable automated data labeling when you create a labeling job. This is how it works:

1. When Ground Truth starts an automated data labeling job, it selects a random sample of input data
objects and sends them to human workers. If more than 10% of these data objects fail, the labeling
job will fail. If the labeling job fails, in addition to reviewing any error message Ground Truth returns,
check that your input data is displaying correctly in the worker UI, instructions are clear, and that you
have given workers enough time to complete tasks.
2. When the labeled data is returned, it is used to create a training set and a validation set. Ground Truth
uses these datasets to train and validate the model used for auto-labeling.
3. Ground Truth runs a batch transform job, using the validated model for inference on the validation
data. Batch inference produces a confidence score and quality metric for each object in the validation
data.
4. The auto labeling component will use these quality metrics and confidence scores to create a
confidence score threshold that ensures quality labels.
5. Ground Truth runs a batch transform job on the unlabeled data in the dataset, using the same
validated model for inference. This produces a confidence score for each object.
6. The Ground Truth auto labeling component determines if the confidence score produced in step 5
for each object meets the required threshold determined in step 4. If the confidence score meets the
threshold, the expected quality of automatically labeling exceeds the requested level of accuracy and
that object is considered auto-labeled.
7. Step 6 produces a dataset of unlabeled data with confidence scores. Ground Truth selects data points
with low confidence scores from this dataset and sends them to human workers.
8. Ground Truth uses the existing human-labeled data and this additional labeled data from human
workers to update the model.
9. The process is repeated until the dataset is fully labeled or until another stopping condition is met. For
example, auto-labeling stops if your human annotation budget is reached.
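
To make steps 3 through 6 concrete, the following is a hedged sketch of one way a confidence
threshold could be derived from validation results. The actual threshold-selection logic that Ground
Truth uses is internal and may differ.

def find_confidence_threshold(validation_results, target_accuracy=0.95):
    """Return the lowest confidence threshold whose admitted predictions
    still meet the target accuracy.

    validation_results: list of (confidence, is_correct) pairs from batch
    inference on the human-labeled validation set.
    """
    # Walk from the most to the least confident prediction and keep the
    # lowest threshold that still satisfies the accuracy requirement
    ordered = sorted(validation_results, key=lambda pair: pair[0], reverse=True)
    correct = 0
    threshold = None
    for count, (confidence, is_correct) in enumerate(ordered, start=1):
        correct += int(is_correct)
        if correct / count >= target_accuracy:
            threshold = confidence
    return threshold

print(find_confidence_threshold(
    [(0.99, True), (0.97, True), (0.90, True), (0.80, False)], target_accuracy=0.9))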

The preceding steps happen in iterations. Select each tab in the following table to see an example of the
processes that happen in each iteration for an object detection automated labeling job. The number of
data objects used in a given step in these images (for example, 200) is specific to this example. If there
are fewer than 5,000 objects to label, the validation set size is 20% of the whole dataset. If there are
more than 5,000 objects in your input dataset, the validation set size is 10% of the whole dataset. You
can control the number of human labels collected per active learning iteration by changing the value for
MaxConcurrentTaskCount when using the API operation CreateLabelingJob. This value is set to
1,000 when you create a labeling job using the console. In the active learning flow illustrated under the
Active Learning tab, this value is set to 200.

Model Training

Automated Labeling

Active Learning

Accuracy of Automated Labels

The definition of accuracy depends on the built-in task type that you use with automated labeling. For
all task types, these accuracy requirements are pre-determined by Ground Truth and cannot be manually
configured.

• For image classification and text classification, Ground Truth uses logic to find a label-prediction
confidence level that corresponds to at least 95% label accuracy. This means Ground Truth expects the
accuracy of the automated labels to be at least 95% when compared to the labels that human labelers
would provide for those examples.
• For bounding boxes, the expected mean Intersection Over Union (IoU) of the auto-labeled images is
0.6. To find the mean IoU, Ground Truth calculates the mean IoU of all the predicted and missed boxes
on the image for every class, and then averages these values across classes.
• For semantic segmentation, the expected mean IoU of the auto-labeled images is 0.7. To find the
mean IoU, Ground Truth takes the mean of the IoU values of all the classes in the image (excluding the
background).

At every iteration of Active Learning (steps 3-6 in the list above), the confidence threshold is found using
the human-annotated validation set so that the expected accuracy of the auto-labeled objects satisfies
certain predefined accuracy requirements.

Create an Automated Data Labeling Job (Console)


To create a labeling job that uses automated labeling in the SageMaker console, use the following
procedure.

To create an automated data labeling job (console)

1. Open the Ground Truth Labeling jobs section of the SageMaker console: https://
console.aws.amazon.com/sagemaker/groundtruth.
2. Using Create a Labeling Job (Console) (p. 706) as a guide, complete the Job overview and Task
type sections. Note that auto labeling is not supported for custom task types.


3. Under Workers, choose your workforce type.


4. In the same section, choose Enable automated data labeling.
5. Using Step 4: Configure the Bounding Box Tool (p. 531) as a guide, create worker instructions in
the section Task Type labeling tool. For example, if you chose Semantic segmentation as your
labeling job type, this section is called Semantic segmentation labeling tool.
6. To preview your worker instructions and dashboard, choose Preview.
7. Choose Create. This creates and starts your labeling job and the auto labeling process.

You can see your labeling job appear in the Labeling jobs section of the SageMaker console. Your
output data appears in the Amazon S3 bucket that you specified when creating the labeling job. For
more information about the format and file structure of your labeling job output data, see Output
Data (p. 776).

Create an Automated Data Labeling Job (API)


To create an automated data labeling job using the SageMaker API, use the
LabelingJobAlgorithmsConfig parameter of the CreateLabelingJob operation. To learn how to
start a labeling job using the CreateLabelingJob operation, see Create a Labeling Job (API) (p. 709).

Specify the Amazon Resource Name (ARN) of the algorithm that you are using for automated data
labeling in the LabelingJobAlgorithmSpecificationArn parameter. Choose from one of the four Ground
Truth built-in algorithms that are supported with automated labeling:

• Image Classification (Single Label) (p. 545)


• Image Semantic Segmentation (p. 538)
• Object detection (Bounding Box (p. 532))
• Text Classification (Single Label) (p. 556)

When an automated data labeling job finishes, Ground Truth returns the ARN of the model it used for
the automated data labeling job. Use this model as the starting model for similar auto-labeling job types
by providing the ARN, in string format, in the InitialActiveLearningModelArn parameter. To retrieve the
model's ARN, use an AWS SDK for Python (Boto3) call similar to the following.

# Fetch the ARN of the model trained in the final iteration of the previous
# Ground Truth labeling job
pretrained_model_arn = sagemaker_client.describe_labeling_job(
    LabelingJobName=job_name
)['LabelingJobOutput']['FinalActiveLearningModelArn']

To encrypt data on the storage volume attached to the ML compute instance(s) that are used in
automated labeling, include an AWS Key Management Service (AWS KMS) key in the VolumeKmsKeyId
parameter. For information about AWS KMS keys, see What is AWS Key Management Service? in the AWS
Key Management Service Developer Guide.
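
Putting these parameters together, a LabelingJobAlgorithmsConfig argument might look like the
following sketch. The algorithm ARN format is illustrative; substitute the exact value for your task
type and Region from the CreateLabelingJob API reference. pretrained_model_arn comes from the
describe_labeling_job call shown above.

labeling_job_algorithms_config = {
    # Built-in auto-labeling algorithm (illustrative ARN; see the API reference)
    'LabelingJobAlgorithmSpecificationArn':
        'arn:aws:sagemaker:<region>:<account>:labeling-job-algorithm-specification/image-classification',
    # Optional: start from a model produced by a previous auto-labeling job
    'InitialActiveLearningModelArn': pretrained_model_arn,
    # Optional: encrypt the ML compute storage volume used for auto-labeling
    'LabelingJobResourceConfig': {'VolumeKmsKeyId': 'your-kms-key-id'},
}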

For an example that uses the CreateLabelingJob operation to create an automated data labeling job,
see the object_detection_tutorial example in the SageMaker Examples, Ground Truth Labeling Jobs
section of a SageMaker notebook instance. To learn how to create and open a notebook instance, see
Create a Notebook Instance (p. 209). To learn how to access SageMaker example notebooks, see Example
Notebooks (p. 220).

Amazon EC2 Instances Required for Automated Data Labeling


The following table lists the Amazon Elastic Compute Cloud (Amazon EC2) instances that you need to
run automated data labeling for training and batch inference jobs.


Automated Data Labeling Job Type     Training Instance Type     Inference Instance Type

Image classification                 ml.p3.2xlarge*             ml.c5.xlarge

Object detection (bounding box)      ml.p3.2xlarge*             ml.c5.4xlarge

Text classification                  ml.c5.2xlarge              ml.m4.xlarge

Semantic segmentation                ml.p3.2xlarge*             ml.p3.2xlarge*

* In the Asia Pacific (Mumbai) Region (ap-south-1) use ml.p2.8xlarge instead.

Ground Truth manages the instances that you use for automated data labeling jobs. It creates,
configures, and terminates the instances as needed to perform your job. These instances don't appear in
your Amazon EC2 instance dashboard.

Set up an active learning workflow with your own model


You can create an active learning workflow with your own algorithm, running
training and inference in that workflow to auto-label your data. The notebook
bring_your_own_model_for_sagemaker_labeling_workflows_with_active_learning.ipynb demonstrates
this using the SageMaker built-in algorithm, BlazingText. This notebook provides an AWS
CloudFormation stack that you can use to execute this workflow using AWS Step Functions. You can find
the notebook and supporting files in this GitHub repository.

You can also find this notebook in the SageMaker Examples repository. See Use Example Notebooks to
learn how to find an Amazon SageMaker example notebook.

Chaining Labeling Jobs


Amazon SageMaker Ground Truth can reuse datasets from prior jobs in two ways: cloning and chaining.

Cloning copies the setup of a prior labeling job and allows you to make additional changes before setting
it to run.

Chaining uses not only the setup of the prior job, but also the results. This allows you to continue an
incomplete job and add labels or data objects to a completed job. Chaining is a more complex operation.

For data processing:

• Cloning uses the prior job's input manifest, with optional modifications, as the new job's input
manifest.
• Chaining uses the prior job's output manifest as the new job's input manifest.

Chaining is useful when you need to:

• Continue a labeling job that was manually stopped.


• Continue a labeling job that failed mid-job, after fixing issues.
• Switch to automated data labeling after manually labeling part of a job (or the other way around).
• Add more data objects to a completed job and start the job from there.
• Add another annotation to a completed job. For example, you have a collection of phrases labeled for
topic, then want to run the set again, categorizing them by the topic's implied audience.


In Amazon SageMaker Ground Truth you can configure a chained labeling job with either the console or
the API.

Key Term: Label Attribute Name


The label attribute name (LabelAttributeName in the API) is a string used as the key for the key-value
pair formed with the label that a worker assigns to the data object.

The following rules apply for the label attribute name:

• It can't end with -metadata.


• The names source and source-ref are reserved and can't be used.
• For semantic segmentation labeling jobs, it must end with -ref. For all other labeling jobs, it
can't end with -ref. If you use the console to create the job, Amazon SageMaker Ground Truth
automatically appends -ref to the label attribute name for semantic segmentation jobs.
• For a chained labeling job that uses auto-labeling, if you use the same label attribute name as the
originating job and that job was in auto-labeling mode at any point, Ground Truth uses the model from
the originating job.

In an output manifest, the label attribute name appears similar to the following.

"source-ref": "<S3 URI>",


"<label attribute name>": {
"annotations": [{
"class_id": 0,
"width": 99,
"top": 87,
"height": 62,
"left": 175
}],
"image_size": [{
"width": 344,
"depth": 3,
"height": 234
}]
},
"<label attribute name>-metadata": {
"job-name": "<job name>",
"class-map": {
"0": "<label attribute name>"
},
"human-annotated": "yes",
"objects": [{
"confidence": 0.09
}],
"creation-date": "<timestamp>",
"type": "groundtruth/object-detection"
}

If you're creating a job in the console and don't explicitly set the label attribute name value, Ground
Truth uses the job name as the label attribute name for the job.

Start a Chained Job (Console)


Choose a stopped, failed, or completed labeling job from the list of your existing jobs. This enables the
Actions menu.

From the Actions menu, choose Chain.


Job Overview Panel

In the Job overview panel, a new Job name is set based on the title of the job from which you are
chaining this one. You can change it.

You may also specify a label attribute name different from the labeling job name.

If you're chaining from a completed job, the label attribute name uses the name of the new job you're
configuring. To change the name, select the check box.

If you're chaining from a stopped or failed job, the label attribute name defaults to the name of the
job from which you're chaining. Because the label attribute name check box is selected, you can see
and edit the value.
Label attribute naming considerations

• The default uses the label attribute name Ground Truth has selected. All data objects without
data connected to that label attribute name are labeled.
• Using a label attribute name not present in the manifest causes the job to process all the
objects in the dataset.

The input dataset location in this case is automatically set to the output manifest of the job from
which you're chaining. The input field is not available, so you cannot change it.
Adding data objects to a labeling job
You cannot specify an alternate manifest file. Manually edit the output manifest from the
previous job to add new items before starting a chained job. The Amazon S3 URI helps you
locate where you are storing the manifest in your Amazon S3 bucket. Download the manifest
file from there, edit it locally on your computer, and then upload the new version to replace it.
Make sure you are not introducing errors during editing. We recommend using a JSON linter to
check your JSON. Many popular text editors and IDEs have linter plugins available.

Start a Chained Job (API)


The procedure is almost the same as setting up a new labeling job with CreateLabelingJob, except for
two primary differences:

• Manifest location: Rather than use your original manifest from the prior job, the value for the
ManifestS3Uri in the DataSource should point to the Amazon S3 URI of the output manifest from
the prior labeling job.
• Label attribute name: Setting the correct LabelAttributeName value is important here. This is the
key portion of a key-value pair where labeling data is the value. Sample use cases include:
• Adding new or more specific labels to a completed job — Set a new label attribute name.
• Labeling the unlabeled items from a prior job — Use the label attribute name from the prior job.
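
As a hedged sketch, the fields that differ from a fresh labeling job might look like the following;
the bucket and job names are placeholders, and the remaining required CreateLabelingJob parameters
are omitted.

# Arguments that differ when chaining; merge these into a full
# create_labeling_job call (OutputConfig, RoleArn, HumanTaskConfig, and so on)
chained_job_args = {
    'LabelingJobName': 'my-chained-job',  # placeholder
    # Reuse the prior job's label attribute name to label only unlabeled items,
    # or choose a new name to add a new annotation to every object
    'LabelAttributeName': 'my-original-job',
    'InputConfig': {
        'DataSource': {
            'S3DataSource': {
                # The prior job's output manifest becomes the new input manifest
                'ManifestS3Uri': 's3://your-output-bucket/my-original-job/'
                                 'manifests/output/output.manifest'
            }
        }
    },
}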

Use a Partially Labeled Dataset


You can get some chaining benefits if you use an augmented manifest that has already been partially
labeled. Check the Label attribute name check box and set the name so that it matches the name in
your manifest.

If you're using the API, the instructions are the same as those for starting a chained job. However, be sure
to upload your manifest to an Amazon S3 bucket and use it instead of using the output manifest from a
prior job.

The Label attribute name value in the manifest has to conform to the naming considerations discussed
earlier.


Ground Truth Security and Permissions


Use the topics on this page to learn about Ground Truth security features and how to configure AWS
Identity and Access Management (IAM) permissions to allow a user or role to create a labeling job.
Additionally, learn how to create an execution role. An execution role is the role that you specify when
you create a labeling job. This role is used to start your labeling job.

If you are a new user and want to get started quickly, or if you do not require granular permissions, see
Use IAM Managed Policies with Ground Truth (p. 817).

For more information about IAM users and roles, see Identities (Users, Groups, and Roles) in the IAM User
Guide.

To learn more about using IAM with SageMaker, see Identity and Access Management for Amazon
SageMaker (p. 3048).

Topics
• CORS Permission Requirement (p. 816)
• Assign IAM Permissions to Use Ground Truth (p. 817)
• Using Amazon SageMaker Ground Truth in an Amazon Virtual Private Cloud (p. 828)
• Output Data and Storage Volume Encryption (p. 839)
• Workforce Authentication and Restrictions (p. 840)

CORS Permission Requirement


Earlier in 2020, widely used browsers like Chrome and Firefox changed their default behavior for rotating
images based on image metadata, referred to as EXIF data. Previously, browsers would always display
images in exactly the manner in which they are stored on disk, which is typically unrotated. After the
change, images now rotate according to a piece of image metadata called orientation value. This has
important implications for the entire machine learning (ML) community. For example, if applications
that annotate images do not consider the EXIF orientation, they may display images in unexpected
orientations, resulting in incorrect labels.

Starting with Chrome 89, AWS can no longer automatically prevent the rotation of images because the
web standards group W3C has decided that the ability to control rotation of images violates the web’s
Same-origin Policy. Therefore, to ensure human workers annotate your input images in a predictable
orientation when you submit requests to create a labeling job, you must add a CORS header policy to the
Amazon S3 buckets that contain your input images.
Important
If you do not add a CORS configuration to the Amazon S3 buckets that contain your input data,
labeling tasks for those input data objects will fail.

If you create a job through the Ground Truth console, CORS is enabled by default. If all of your input
data is not located in the same Amazon S3 bucket as your input manifest file, you must add a CORS
configuration to all Amazon S3 buckets that contain input data using the following instructions.

If you are using the CreateLabelingJob API to create a Ground Truth labeling job, you can add a CORS
policy to an Amazon S3 bucket that contains input data in the S3 console. To set the required CORS
headers on the Amazon S3 bucket that contains your input images in the Amazon S3 console, follow the
directions detailed in How do I add cross-domain resource sharing with CORS?. Use the following CORS
configuration code for the buckets that host your images. If you use the Amazon S3 console to add the
policy to your bucket, you must use the JSON format.
Important
If you create a 3D point cloud or video frame labeling job, you must add additional rules
to your CORS configuration. To learn more, see 3D Point Cloud Labeling Job Permission
Requirements (p. 633) and Video Frame Job Permission Requirements (p. 579) respectively.


JSON

[{
"AllowedHeaders": [],
"AllowedMethods": ["GET"],
"AllowedOrigins": ["*"],
"ExposeHeaders": ["Access-Control-Allow-Origin"]
}]

XML

<CORSConfiguration>
<CORSRule>
<AllowedOrigin>*</AllowedOrigin>
<AllowedMethod>GET</AllowedMethod>
<ExposeHeader>Access-Control-Allow-Origin</ExposeHeader>
</CORSRule>
</CORSConfiguration>

Assign IAM Permissions to Use Ground Truth


Use the topics in this section to learn how to use AWS Identity and Access Management (IAM) managed
and custom policies to manage access to Ground Truth and associated resources.

You can use the sections on this page to learn the following:

• How to create IAM policies that grant a user or role permission to create a labeling job. Administrators
can use IAM policies to restrict access to Amazon SageMaker and other AWS services that are specific
to Ground Truth.
• How to create a SageMaker execution role. An execution role is the role that you specify when you
create a labeling job. The role is used to start and manage your labeling job.

The following is an overview of the topics you'll find on this page:

• If you are getting started using Ground Truth, or you do not require granular permissions for your use
case, it is recommended that you use the IAM managed policies described in Use IAM Managed Policies
with Ground Truth (p. 817).
• Learn about the permissions required to use the Ground Truth console in Grant IAM Permission to Use
the Amazon SageMaker Ground Truth Console (p. 818). This section includes policy examples that
grant an IAM entity permission to create and modify private work teams, subscribe to vendor work
teams, and create custom labeling workflows.
• When you create a labeling job, you must provide an execution role. Use Create a SageMaker Execution
Role for a Ground Truth Labeling Job (p. 822) to learn about the permissions required for this role.

Use IAM Managed Policies with Ground Truth


SageMaker and Ground Truth provide AWS managed policies that you can use to create a labeling job.
If you are getting started using Ground Truth and you do not require granular permissions for your use
case, it is recommended that you use the following policies:

• AmazonSageMakerFullAccess – Use this policy to give a user or role permission to create a labeling
job. This is a broad policy that grants an entity permission to use SageMaker features, as well as features
of necessary AWS services through the console and API. This policy gives the entity permission to
create a labeling job and to create and manage workforces using Amazon Cognito. To learn more, see
AmazonSageMakerFullAccess Policy.

• AmazonSageMakerGroundTruthExecution – To create an execution role, you can attach the policy
AmazonSageMakerGroundTruthExecution to a role. An execution role is the role that you specify
when you create a labeling job and it is used to start your labeling job. This policy allows you to create
both streaming and non-streaming labeling jobs, and to create a labeling job using any task type. Note
the following limits of this managed policy.
• Amazon S3 permissions: This policy grants an execution role permission to access Amazon S3
buckets with the following strings in the name: GroundTruth, Groundtruth, groundtruth,
SageMaker, Sagemaker, and sagemaker or a bucket with an object tag that includes SageMaker
in the name (case insensitive). Make sure your input and output bucket names include these strings,
or add additional permissions to your execution role to grant it permission to access your Amazon
S3 buckets. You must give this role permission to perform the following actions on your Amazon S3
buckets: AbortMultipartUpload, GetObject, and PutObject.
• Custom Workflows: When you create a custom labeling workflow, this execution role is restricted
to invoking AWS Lambda functions with one of the following strings as part of the function name:
GtRecipe, SageMaker, Sagemaker, sagemaker, or LabelingFunction. This applies to both your
pre-annotation and post-annotation Lambda functions. If you choose to use names without those
strings, you must explicitly provide lambda:InvokeFunction permission to the execution role
used to create the labeling job.

To learn how to attach an AWS managed policy to a user or role, refer to Adding and removing IAM
identity permissions in the IAM User Guide.
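For example, the following Boto3 sketch attaches the AmazonSageMakerGroundTruthExecution
managed policy to an existing role; the role name is a placeholder for illustration.

import boto3

iam = boto3.client("iam")

# Attach the AWS managed execution policy to an existing role.
# "MyGroundTruthExecutionRole" is a placeholder role name.
iam.attach_role_policy(
    RoleName="MyGroundTruthExecutionRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerGroundTruthExecution"
)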

Grant IAM Permission to Use the Amazon SageMaker Ground Truth Console
To use the Ground Truth area of the SageMaker console, you need to grant permission to an entity to
access SageMaker and other AWS services that Ground Truth interacts with. Required permissions to
access other AWS services depend on your use case:

• Amazon S3 permissions are required for all use cases. These permissions must grant access to the
Amazon S3 buckets that contain input and output data.
• AWS Marketplace permissions are required to use a vendor workforce.
• Amazon Cognito permissions are required for private work team setup.
• AWS KMS permissions are required to view available AWS KMS keys that can be used for output data
encryption.
• IAM permissions are required to either list pre-existing execution roles, or to create a new one.
Additionally, you must add a PassRole permission to allow SageMaker to use the execution role
chosen to start the labeling job.

The following sections list policies you may want to grant to a role to use one or more functions of
Ground Truth.

Topics
• Ground Truth Console Permissions (p. 818)
• Custom Labeling Workflow Permissions (p. 821)
• Private Workforce Permissions (p. 822)
• Vendor Workforce Permissions (p. 822)

Ground Truth Console Permissions

To grant permission to a user or role to use the Ground Truth area of the SageMaker console to create
a labeling job, attach the following policy to the user or role. The following policy will give an IAM role
permission to create a labeling job using a built-in task type. If you want to create a custom
labeling workflow, add the policy in Custom Labeling Workflow Permissions (p. 821) to the following
policy. Each Statement included in the following policy is described below this code block.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "SageMakerApis",
"Effect": "Allow",
"Action": [
"sagemaker:*"
],
"Resource": "*"
},
{
"Sid": "KmsKeysForCreateForms",
"Effect": "Allow",
"Action": [
"kms:DescribeKey",
"kms:ListAliases"
],
"Resource": "*"
},
{
"Sid": "AccessAwsMarketplaceSubscriptions",
"Effect": "Allow",
"Action": [
"aws-marketplace:ViewSubscriptions"
],
"Resource": "*"
},
{
"Sid": "SecretsManager",
"Effect": "Allow",
"Action": [
"secretsmanager:CreateSecret",
"secretsmanager:DescribeSecret",
"secretsmanager:ListSecrets"
],
"Resource": "*"
},
{
"Sid": "ListAndCreateExecutionRoles",
"Effect": "Allow",
"Action": [
"iam:ListRoles",
"iam:CreateRole",
"iam:CreatePolicy",
"iam:AttachRolePolicy"
],
"Resource": "*"
},
{
"Sid": "PassRoleForExecutionRoles",
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PassedToService": "sagemaker.amazonaws.com"
}
}
},
{
"Sid": "GroundTruthConsole",
"Effect": "Allow",
"Action": [
"groundtruthlabeling:*",
"lambda:InvokeFunction",
"lambda:ListFunctions",
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketCors",
"s3:PutBucketCors",
"s3:ListAllMyBuckets",
"cognito-idp:AdminAddUserToGroup",
"cognito-idp:AdminCreateUser",
"cognito-idp:AdminDeleteUser",
"cognito-idp:AdminDisableUser",
"cognito-idp:AdminEnableUser",
"cognito-idp:AdminRemoveUserFromGroup",
"cognito-idp:CreateGroup",
"cognito-idp:CreateUserPool",
"cognito-idp:CreateUserPoolClient",
"cognito-idp:CreateUserPoolDomain",
"cognito-idp:DescribeUserPool",
"cognito-idp:DescribeUserPoolClient",
"cognito-idp:ListGroups",
"cognito-idp:ListIdentityProviders",
"cognito-idp:ListUsers",
"cognito-idp:ListUsersInGroup",
"cognito-idp:ListUserPoolClients",
"cognito-idp:ListUserPools",
"cognito-idp:UpdateUserPool",
"cognito-idp:UpdateUserPoolClient"
],
"Resource": "*"
}
]
}

This policy includes the following statements. You can scope down any of these statements by adding
specific resources to the Resource list for that statement.

SageMakerApis

This statement includes sagemaker:*, which allows the user to perform all SageMaker API actions.
You can reduce the scope of this policy by restricting users from performing actions that are not used to
create and monitor a labeling job.

KmsKeysForCreateForms

You only need to include this statement if you want to grant a user permission to list and select AWS
KMS keys in the Ground Truth console to use for output data encryption. The policy above grants a user
permission to list and select any key in the account in AWS KMS. To restrict the keys that a user can list
and select, specify those key ARNs in Resource.

SecretsManager

This statement gives the user permission to describe, list, and create resources in AWS Secrets Manager
required to create the labeling job.

ListAndCreateExecutionRoles

This statement gives a user permission to list (ListRoles) and create (CreateRole) IAM roles
in your account. It also grants the user permission to create (CreatePolicy) policies and attach
(AttachRolePolicy) policies to entities. These are required to list, select, and if required, create an
execution role in the console.

If you have already created an execution role, and want to narrow the scope of this statement so
that users can only select that role in the console, specify the ARNs of the roles you want the user to
have permission to view in Resource and remove the actions CreateRole, CreatePolicy, and
AttachRolePolicy.

AccessAwsMarketplaceSubscriptions

These permissions are required to view and choose vendor work teams that you are already subscribed
to when creating a labeling job. To give the user permission to subscribe to vendor work teams, add the
statement in Vendor Workforce Permissions (p. 822) to the policy above.

PassRoleForExecutionRoles

This is required to give the labeling job creator permission to preview the worker UI and verify that input
data, labels, and instructions display correctly. This statement gives an entity permissions to pass the
IAM execution role used to create the labeling job to SageMaker to render and preview the worker UI.
To narrow the scope of this policy, add the role ARN of the execution role used to create the labeling job
under Resource.

GroundTruthConsole

• groundtruthlabeling – This allows a user to perform actions required to use certain features
of the Ground Truth console. These include permissions to describe the labeling job status
(DescribeConsoleJob), list all dataset objects in the input manifest file (ListDatasetObjects),
filter the dataset if dataset sampling is selected (RunFilterOrSampleDatasetJob), and to generate
input manifest files if automated data labeling is used (RunGenerateManifestByCrawlingJob).
These actions are only available when using the Ground Truth console and cannot be called directly
using an API.
• lambda:InvokeFunction and lambda:ListFunctions – these actions give users permission to list
and invoke Lambda functions that are used to run a custom labeling workflow.
• s3:* – All Amazon S3 permissions included in this statement are used to view Amazon S3 buckets
for automated data setup (ListAllMyBuckets), access input data in Amazon S3 (ListBucket,
GetObject), check for and create a CORS policy in Amazon S3 if needed (GetBucketCors and
PutBucketCors), and write labeling job output files to S3 (PutObject).
• cognito-idp – These permissions are used to create, view, and manage a private workforce using
Amazon Cognito. To learn more about these actions, refer to the Amazon Cognito API References.

Custom Labeling Workflow Permissions


Add the following statement to a policy similar to the one in Ground Truth Console
Permissions (p. 818) to give a user permission to select pre-existing pre-annotation and post-
annotation Lambda functions while creating a custom labeling workflow.

{
"Sid": "GroundTruthConsoleCustomWorkflow",
"Effect": "Allow",
"Action": [
"lambda:InvokeFunction",
"lambda:ListFunctions"
],
"Resource": "*"
}

To learn how to give an entity permission to create and test pre-annotation and post-annotation Lambda
functions, see Required Permissions To Use Lambda With Ground Truth.


Private Workforce Permissions


When added to a permissions policy, the following permission grants access to create and manage a
private workforce and work team using Amazon Cognito. These permissions are not required to use an
OIDC IdP workforce.

{
"Effect": "Allow",
"Action": [
"cognito-idp:AdminAddUserToGroup",
"cognito-idp:AdminCreateUser",
"cognito-idp:AdminDeleteUser",
"cognito-idp:AdminDisableUser",
"cognito-idp:AdminEnableUser",
"cognito-idp:AdminRemoveUserFromGroup",
"cognito-idp:CreateGroup",
"cognito-idp:CreateUserPool",
"cognito-idp:CreateUserPoolClient",
"cognito-idp:CreateUserPoolDomain",
"cognito-idp:DescribeUserPool",
"cognito-idp:DescribeUserPoolClient",
"cognito-idp:ListGroups",
"cognito-idp:ListIdentityProviders",
"cognito-idp:ListUsers",
"cognito-idp:ListUsersInGroup",
"cognito-idp:ListUserPoolClients",
"cognito-idp:ListUserPools",
"cognito-idp:UpdateUserPool",
"cognito-idp:UpdateUserPoolClient"
],
"Resource": "*"
}

To learn more about creating private workforce using Amazon Cognito, see Create and Manage Amazon
Cognito Workforce (p. 869).

Vendor Workforce Permissions


You can add the following statement to the policy in Grant IAM Permission to Use the Amazon
SageMaker Ground Truth Console (p. 818) to grant an entity permission to subscribe to a vendor
workforce.

{
"Sid": "AccessAwsMarketplaceSubscriptions",
"Effect": "Allow",
"Action": [
"aws-marketplace:Subscribe",
"aws-marketplace:Unsubscribe",
"aws-marketplace:ViewSubscriptions"
],
"Resource": "*"
}

Create a SageMaker Execution Role for a Ground Truth Labeling Job


When you configure your labeling job, you need to provide an execution role, which is a role that
SageMaker has permission to assume to start and run your labeling job.

This role must give Ground Truth permission to access the following:

• Amazon S3 to retrieve your input data and write output data to an Amazon S3 bucket. You can either
grant permission for an IAM role to access an entire bucket by providing the bucket ARN, or you can
grant access to the role to access specific resources in a bucket. For example, the ARN for a bucket may
look similar to arn:aws:s3:::awsexamplebucket1 and the ARN of a resource in an Amazon S3
bucket may look similar to arn:aws:s3:::awsexamplebucket1/prefix/file-name.png. To
apply an action to all resources in an Amazon S3 bucket, you can use the wildcard: *. For example,
arn:aws:s3:::awsexamplebucket1/prefix/*. For more information, see Amazon S3
Resources in the Amazon Simple Storage Service User Guide.
• CloudWatch to log worker metrics and labeling job statuses.
• AWS KMS for data encryption. (Optional)
• AWS Lambda for processing input and output data when you create a custom workflow.

Additionally, if you create a streaming labeling job, this role must have permission to access:

• Amazon SQS to create and interact with an SQS queue used to manage labeling requests.
• Amazon SNS to subscribe to and retrieve messages from your Amazon SNS input topic and to send
messages to your Amazon SNS output topic.

All of these permissions can be granted with the AmazonSageMakerGroundTruthExecution managed
policy except:

• Data and storage volume encryption of your Amazon S3 buckets. To learn how to configure these
permissions, see Encrypt Output Data and Storage Volume with AWS KMS (p. 827).
• Permission to select and invoke Lambda functions that do not include GtRecipe, SageMaker,
Sagemaker, sagemaker, or LabelingFunction in the function name.
• Amazon S3 buckets that do not include GroundTruth, Groundtruth, groundtruth,
SageMaker, Sagemaker, or sagemaker in the prefix or bucket name, or an object tag that includes
SageMaker in the name (case insensitive).

If you require more granular permissions than the ones provided in
AmazonSageMakerGroundTruthExecution, use the following policy examples to create an execution
role that fits your specific use case.

Topics
• Built-In Task Types (Non-streaming) Execution Role Requirements (p. 823)
• Built-In Task Types (Streaming) Execution Role Requirements (p. 824)
• Execution Role Requirements for Custom Task Types (p. 826)
• Automated Data Labeling Permission Requirements (p. 826)

Built-In Task Types (Non-streaming) Execution Role Requirements

The following policy grants permission to create a labeling job for a built-in task type. This execution
policy does not include permissions for AWS KMS data encryption or decryption. Replace each
placeholder ARN (shown in angle brackets) with your own Amazon S3 ARNs.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3ViewBuckets",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"

823
Amazon SageMaker Developer Guide
Security and Permissions

],
"Resource": [
"arn:aws:s3:::<input-bucket-name>",
"arn:aws:s3:::<output-bucket-name>"
]
},
{
"Sid": "S3GetPutObjects",
"Effect": "Allow",
"Action": [
"s3:AbortMultipartUpload",
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<input-bucket-name>/*",
"arn:aws:s3:::<output-bucket-name>/*"
]
},
{
"Sid": "CloudWatch",
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData",
"logs:CreateLogStream",
"logs:CreateLogGroup",
"logs:DescribeLogStreams",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}

Built-In Task Types (Streaming) Execution Role Requirements

If you create a streaming labeling job, you must add a policy similar to the following to the execution
role you use to create the labeling job. To narrow the scope of the policy, replace the * in Resource with
specific AWS resources that you want to grant the IAM role permission to access and use.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:AbortMultipartUpload",
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<input-bucket-name>/*",
"arn:aws:s3:::<output-bucket-name>/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "*",
"Condition": {
"StringEqualsIgnoreCase": {
"s3:ExistingObjectTag/SageMaker": "true"

824
Amazon SageMaker Developer Guide
Security and Permissions

}
}
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<input-bucket-name>",
"arn:aws:s3:::<output-bucket-name>"
]
},
{
"Sid": "CloudWatch",
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData",
"logs:CreateLogStream",
"logs:CreateLogGroup",
"logs:DescribeLogStreams",
"logs:PutLogEvents"
],
"Resource": "*"
},
{
"Sid": "StreamingQueue",
"Effect": "Allow",
"Action": [
"sqs:CreateQueue",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl",
"sqs:ReceiveMessage",
"sqs:SendMessage",
"sqs:SendMessageBatch",
"sqs:SetQueueAttributes"
],
"Resource": "arn:aws:sqs:*:*:*GroundTruth*"
},
{
"Sid": "StreamingTopicSubscribe",
"Effect": "Allow",
"Action": "sns:Subscribe",
"Resource": [
"arn:aws:sns:<aws-region>:<aws-account-number>:<input-topic-name>",
"arn:aws:sns:<aws-region>:<aws-account-number>:<output-topic-name>"
],
"Condition": {
"StringEquals": {
"sns:Protocol": "sqs"
},
"StringLike": {
"sns:Endpoint": "arn:aws:sns:<aws-region>:<aws-account-
number>:*GroundTruth*"
}
}
},
{
"Sid": "StreamingTopic",
"Effect": "Allow",
"Action": [
"sns:Publish"
],
"Resource": [

825
Amazon SageMaker Developer Guide
Security and Permissions

"arn:aws:sns:<aws-region>:<aws-account-number>:<input-topic-name>",
"arn:aws:sns:<aws-region>:<aws-account-number>:<output-topic-name>"
]
},
{
"Sid": "StreamingTopicUnsubscribe",
"Effect": "Allow",
"Action": [
"sns:Unsubscribe"
],
"Resource": [
"arn:aws:sns:<aws-region>:<aws-account-number>:<input-topic-name>",
"arn:aws:sns:<aws-region>:<aws-account-number>:<output-topic-name>"
]
}
]
}

Execution Role Requirements for Custom Task Types

If you want to create a custom labeling workflow, add the following statement to an execution role
policy like the ones found in Built-In Task Types (Non-streaming) Execution Role Requirements (p. 823)
or Built-In Task Types (Streaming) Execution Role Requirements (p. 824).

This policy gives the execution role permission to Invoke your pre-annotation and post-annotation
Lambda functions.

{
"Sid": "LambdaFunctions",
"Effect": "Allow",
"Action": [
"lambda:InvokeFunction"
],
"Resource": [
"arn:aws:lambda:<region>:<account-id>:function:<pre-annotation-lambda-name>",
"arn:aws:lambda:<region>:<account-id>:function:<post-annotation-lambda-name>"
]
}

Automated Data Labeling Permission Requirements

If you want to create a labeling job with automated data labeling enabled, you must 1) add one policy to
the IAM policy attached to the execution role and 2) update the trust policy of the execution role.

The following statement allows the IAM execution role to be passed to SageMaker so that it can be
used to run the training and inference jobs used for active learning and automated data labeling
respectively. Add this statement to an execution role policy like the ones found in Built-In Task Types
(Non-streaming) Execution Role Requirements (p. 823) or Built-In Task Types (Streaming) Execution
Role Requirements (p. 824). Replace arn:aws:iam::<account-number>:role/<execution-role-name> with
the execution role ARN. You can find your IAM role ARN in the IAM console under Roles.

{
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": "arn:aws:iam::<account-number>:role/<execution-role-name>",
"Condition": {
"StringEquals": {
"iam:PassedToService": [

826
Amazon SageMaker Developer Guide
Security and Permissions

"sagemaker.amazonaws.com"
]
}
}
}

The following statement allows SageMaker to assume the execution role to create and manage the
SageMaker training and inference jobs. This policy must be added to the trust relationship of the
execution role. To learn how to add or modify an IAM role trust policy, see Modifying a role in the IAM
User Guide.

{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Principal": {"Service": "sagemaker.amazonaws.com" },
"Action": "sts:AssumeRole"
}
}
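If you create the execution role programmatically, you can supply this trust policy at creation time. The
following is a minimal Boto3 sketch; the role name is a placeholder.

import json
import boto3

# The trust policy shown above, which lets SageMaker assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }
}

iam = boto3.client("iam")
# "MyGroundTruthExecutionRole" is a placeholder role name.
iam.create_role(
    RoleName="MyGroundTruthExecutionRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)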

Encrypt Output Data and Storage Volume with AWS KMS


You can use AWS Key Management Service (AWS KMS) to encrypt output data from a labeling job by
specifying a customer managed key when you create the labeling job. If you use the API operation
CreateLabelingJob to create a labeling job that uses automated data labeling, you can also use a
customer managed key to encrypt the storage volume attached to the ML compute instances to run the
training and inference jobs.

This section describes the IAM policies you must attach to your customer managed key to enable output
data encryption and the policies you must attach to your customer managed key and execution role to
use storage volume encryption. To learn more about these options, see Output Data and Storage Volume
Encryption (p. 839).

Encrypt Output Data using KMS

If you specify an AWS KMS customer managed key to encrypt output data, you must add an IAM policy
similar to the following to that key. This policy gives the IAM execution role that you use to create your
labeling job permission to use this key to perform all of the actions listed in "Action". To learn more
about these actions, see AWS KMS permissions in the AWS Key Management Service Developer Guide.

To use this policy, replace the IAM service-role ARN in "Principal" with the ARN of the execution
role you use to create the labeling job. When you create a labeling job in the console, this is the
role you specify for IAM Role under the Job overview section. When you create a labeling job using
CreateLabelingJob, this is the ARN you specify for RoleArn.

{
"Sid": "AllowUseOfKmsKey",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::111122223333:role/service-role/example-role"
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:DescribeKey"
],
"Resource": "*"
}
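Because the PutKeyPolicy operation replaces a key's entire policy document, one way to add the
statement above is to read the current policy, append the statement, and write the merged policy back.
The following Boto3 sketch assumes a placeholder key ID and the example role ARN shown above.

import json
import boto3

kms = boto3.client("kms")
key_id = "<your-kms-key-id>"  # placeholder

# PutKeyPolicy replaces the whole policy, so merge rather than overwrite.
policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
policy["Statement"].append({
    "Sid": "AllowUseOfKmsKey",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/service-role/example-role"},
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
    ],
    "Resource": "*"
})
kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))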

Encrypt Automated Data Labeling ML Compute Instance Storage Volume

If you specify a VolumeKmsKeyId to encrypt the storage volume attached to the ML compute instance
used for automated data labeling training and inference, you must do the following:

• Attach permissions described in Encrypt Output Data using KMS (p. 827) to the customer managed
key.
• Attach a policy similar to the following to the IAM execution role you use to create your labeling
job. This is the IAM role you specify for RoleArn in CreateLabelingJob. To learn more about the
"kms:CreateGrant" action that this policy permits, see CreateGrant in the AWS Key Management
Service API Reference.

{
"Version": "2012-10-17",
"Statement":
[
{
"Effect": "Allow",
"Action": [
"kms:CreateGrant"
],
"Resource": "*"
}
]
}

To learn more about Ground Truth storage volume encryption, see Use Your KMS Key to Encrypt
Automated Data Labeling Storage Volume (API Only) (p. 840).

Using Amazon SageMaker Ground Truth in an Amazon Virtual Private Cloud
Amazon Virtual Private Cloud (Amazon VPC) is a service with which you can launch AWS resources in a
logically isolated virtual network that you define. You can create and run a Ground Truth labeling job
inside of an Amazon VPC instead of connecting over the internet. When you launch a labeling job in an
Amazon VPC, communication between your VPC and Ground Truth is conducted entirely and securely
within the AWS network.

This guide shows how you can use Ground Truth in an Amazon VPC in the following ways:

1. Run an Amazon SageMaker Ground Truth Labeling Job in an Amazon Virtual Private Cloud (p. 828)
2. Use Amazon VPC Mode from a Private Worker Portal (p. 835)

Run an Amazon SageMaker Ground Truth Labeling Job in an Amazon Virtual Private Cloud
Amazon SageMaker Ground Truth supports the following functionalities.

• You can use Amazon S3 bucket policies to control access to buckets from specific Amazon VPC
endpoints, or specific VPCs. If you launch a labeling job and your input data is located in an Amazon S3
bucket with access restricted to users in your VPC, you can add a bucket policy to also grant a Ground
Truth endpoint permission to access the bucket. To learn more, see Allow Ground Truth to Access VPC
Restricted Amazon S3 Buckets (p. 829).
• You can launch an automated data labeling job in your VPC. You use a VPC configuration to specify
VPC subnets and security groups. SageMaker uses this configuration to launch the training and
inference jobs used for automated data labeling in your VPC. To learn more, see Create an Automated
Data Labeling Job in a VPC (p. 833).

You may want to use these options in any of the following ways.

• You can use both of these methods to launch a labeling job using a VPC-protected Amazon S3 bucket
with automated data labeling enabled.
• You can launch a labeling job using any built-in task type using a VPC-protected bucket.
• You can launch a custom labeling workflow using a VPC-protected bucket. Ground Truth interacts with
your pre-annotation and post-annotation Lambda functions using an AWS PrivateLink endpoint.

We recommend that you review Prerequisites to Run a Ground Truth Labeling Job in a VPC (p. 829)
before you create a labeling job in an Amazon VPC.

Prerequisites to Run a Ground Truth Labeling Job in a VPC

Review the following prerequisites before you create a Ground Truth labeling job in an Amazon VPC.

• If you are a new user of Ground Truth, review Getting started to learn how to create a labeling job.
• If your input data is located in a VPC-protected Amazon S3 bucket, your workers must access the
worker portal from your VPC.
Note
When you launch a labeling job in your VPC, you must use a private work team. To learn more
about creating a private work team, see Use a Private Workforce.
• If you want to launch an automated data labeling job in your VPC, review the following prerequisites.
• Use the instructions in Create an Amazon S3 VPC Endpoint. Training and inference containers used
in the automated data labeling workflow use this endpoint to communicate with your buckets in
Amazon S3.
• Review Automate Data Labeling to learn more about this feature. Note that automated data
labeling is supported for the following built-in task types: Image Classification (Single Label), Image
Semantic Segmentation, Bounding Box, and Text Classification (Single Label). Streaming labeling
jobs do not support automated data labeling.

• Review the Ground Truth Security and Permissions section and ensure that you have met the following
conditions.
• The user creating the labeling job has all necessary permissions
• You have created an IAM execution role with required permissions. If you do not require fine-tuned
permissions for your use case, we recommend you use the IAM managed policies described in Grant
General Permissions To Get Started Using Ground Truth.
• Allow your VPC to have access to the sagemaker-labeling-data-region and
sm-bxcb-region-saved-task-states S3 buckets. These are system-owned, regionalized S3 buckets
that are accessed from the worker portal while a worker is working on a task. We use these buckets to
interact with system-managed data.

Allow Ground Truth to Access VPC Restricted Amazon S3 Buckets

The following sections provide details about the permissions Ground Truth requires to launch labeling
jobs using Amazon S3 buckets that have access restricted to your VPC and VPC endpoints. To learn how
to restrict access to an Amazon S3 bucket to a VPC, see Controlling access from VPC endpoints with
bucket policies in the Amazon Simple Storage Service User Guide. To learn how to add a policy to
an S3 bucket, see Adding a bucket policy using the Amazon S3 console.
Note
Modifying policies on existing buckets can cause IN_PROGRESS Ground Truth jobs to fail. We
recommend you start new jobs using a new bucket. If you want to continue using the same
bucket, you can do one of the following.

• Wait for an IN_PROGRESS job to finish.
• Terminate the job using the console or the AWS CLI.

You can restrict Amazon S3 bucket access to users in your VPC using an AWS PrivateLink endpoint.
For example, the following S3 bucket policy allows access to a specific bucket, <bucket-name>, from
<vpc> and the endpoint <vpc-endpoint> only. When you modify this policy, you must replace all
placeholder text with your resources and specifications.
Note
The following policy denies all entities other than users within a VPC to perform the actions
listed in Action. If you do not include actions in this list, they are still accessible to any entity
that has access to this bucket and permission to perform those actions. For example, if a user
has permission to perform GetBucketLocation on your Amazon S3 bucket, the policy below
does not restrict the user from performing this action outside of your VPC.

{
"Version": "2012-10-17",
"Id": "Policy1415115909152",
"Statement": [
{
"Sid": "Access-to-specific-VPCE-only",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Effect": "Deny",
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/*"
],
"Condition": {
"StringNotEquals": {
"aws:sourceVpce": [
"<vpc-endpoint>",
"<vpc>"
]
}
}
}
]
}
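As a sketch, you can apply a bucket policy like the one above with Boto3. PutBucketPolicy overwrites
any existing bucket policy, so merge statements as needed; all angle-bracket values are placeholders.

import json
import boto3

bucket = "<bucket-name>"  # placeholder

# The Deny statement shown above, restricting access to the VPC endpoint.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Access-to-specific-VPCE-only",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Effect": "Deny",
        "Resource": [
            "arn:aws:s3:::" + bucket,
            "arn:aws:s3:::" + bucket + "/*"
        ],
        "Condition": {
            "StringNotEquals": {
                "aws:sourceVpce": ["<vpc-endpoint>", "<vpc>"]
            }
        }
    }]
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))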

Ground Truth must be able to perform the following Amazon S3 actions on the S3 buckets you use to
configure the labeling job.

"s3:AbortMultipartUpload",
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketLocation"


You can do this by adding a Ground Truth endpoint to the bucket policy like the one previously
mentioned. The following table includes Ground Truth service endpoints for each AWS Region. Add the
endpoint for the AWS Region in which you run your labeling job to your bucket policy.

AWS Region Ground Truth endpoint

us-east-2 vpce-02569ba1c40aad0bc

us-east-1 vpce-08408e335ebf95b40

us-west-2 vpce-0ea07aa498eb78469

ca-central-1 vpce-0d46ea4c9ff55e1b7

eu-central-1 vpce-0865e7194a099183d

eu-west-2 vpce-0bccd56798f4c5df0

eu-west-1 vpce-0788e7ed8628e595d

ap-south-1 vpce-0d7fcda14e1783f11

ap-southeast-2 vpce-0b7609e6f305a77d4

ap-southeast-1 vpce-0e7e67b32e9efed27

ap-northeast-2 vpce-007893f89e05f2bbf

ap-northeast-1 vpce-0247996a1a1807dbd

For example, the following policy restricts GetObject and PutObject actions on an Amazon S3 bucket
to the following:

• Users in a VPC (<vpc>)
• A VPC endpoint (<vpc-endpoint>)
• A Ground Truth service endpoint (<ground-truth-endpoint>)

{
"Version": "2012-10-17",
"Id": "1",
"Statement": [
{
"Sid": "DenyAccessFromNonGTandCustomerVPC",
"Effect": "Deny",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/*"
],
"Condition": {
"ForAllValues:StringNotEquals": {
"aws:sourceVpce": [
"<vpc-endpoint>",
"<ground-truth-endpoint>"
],
"aws:SourceVpc": "<vpc>"

831
Amazon SageMaker Developer Guide
Security and Permissions

}
}
}
]
}

If you want a user to have permission to launch a labeling job using the Ground Truth console, you must
also add the user's ARN to the bucket policy using the aws:PrincipalArn condition. This user must
also have permission to perform the following Amazon S3 actions on the bucket you use to launch the
labeling job.

"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketCors",
"s3:PutBucketCors",
"s3:ListAllMyBuckets",

The following code is an example of a bucket policy that restricts permission to perform the actions
listed in Action on the S3 bucket <bucket-name> to the following.

• <role-name>
• The VPC endpoints listed in aws:sourceVpce
• Users within the VPC named <vpc>

{
"Version": "2012-10-17",
"Id": "1",
"Statement": [
{
"Sid": "DenyAccessFromNonGTandCustomerVPC",
"Effect": "Deny",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<bucket-name>/*",
"arn:aws:s3:::<bucket-name>"
],
"Condition": {
"ForAllValues:StringNotEquals": {
"aws:sourceVpce": [
"<vpc-endpoint>",
"<ground-truth-endpoint>"
],
"aws:PrincipalArn": "arn:aws:iam::<aws-account-id>:role/<role-name>",
"aws:SourceVpc": "<vpc>"
}
}
}
]
}

Note
The Amazon VPC interface endpoints and the protected Amazon S3 buckets you use for input
and output data must be located in the same AWS Region that you use to create the labeling
job.


After you have granted Ground Truth permission to access your Amazon S3 buckets, you can use one
of the topics in Create a Labeling Job to launch a labeling job. Specify the VPC-restricted Amazon S3
buckets for your input and output data buckets.

Create an Automated Data Labeling Job in a VPC


To create an automated data labeling job using an Amazon VPC, you provide a VPC configuration using
the Ground Truth console or CreateLabelingJob API operation. SageMaker uses the subnets and
security groups you provide to launch the training and inferences jobs used for automated labeling.
Important
Before you launch an automated data labeling job with a VPC configuration, make sure you
have created an Amazon S3 VPC endpoint using the VPC you want to use for the labeling job. To
learn how, see Create an Amazon S3 VPC Endpoint.
Additionally, if you create an automated data labeling job using a VPC-restricted Amazon
S3 bucket, you must follow the instructions in Allow Ground Truth to Access VPC Restricted
Amazon S3 Buckets (p. 829) to give Ground Truth permission to access the bucket.

Use the following procedures to learn how to add a VPC configuration to your labeling job request.

Add a VPC configuration to an automated data labeling job (console):

1. Follow the instructions in Create a Labeling Job (Console) and complete each step in the procedure,
up to step 15.
2. In the Workers section, select the checkbox next to Enable automated data labeling.
3. Maximize the VPC configuration section of the console by selecting the arrow.
4. Specify the Virtual private cloud (VPC) that you want to use for your automated data labeling job.
5. Choose the dropdown list under Subnets and select one or more subnets.
6. Choose the dropdown list under Security groups and select one or more groups.
7. Complete all remaining steps of the procedure in Create a Labeling Job (Console).

Add a VPC configuration to an automated data labeling job (API):

To configure a labeling job using the Ground Truth API operation, CreateLabelingJob, follow the
instructions in Create an Automated Data Labeling Job (API) to configure your request. In addition
to the parameters described in this documentation, you must include a VpcConfig parameter in
LabelingJobResourceConfig to specify one or more subnets and security groups using the following
schema.

"LabelingJobAlgorithmsConfig": {
"InitialActiveLearningModelArn": "string",
"LabelingJobAlgorithmSpecificationArn": "string",
"LabelingJobResourceConfig": {
"VolumeKmsKeyId": "string",
"VpcConfig": {
"SecurityGroupIds": [ "string" ],
"Subnets": [ "string" ]
}
}
}

The following is an example of an AWS Python SDK (Boto3) request to create an automated data
labeling job in the US East (N. Virginia) Region using a private workforce. Replace all placeholder
text with your labeling job resources and specifications. To learn more about the CreateLabelingJob
operation, see the Create a Labeling Job (API) tutorial and CreateLabelingJob API documentation.

import boto3
client = boto3.client(service_name='sagemaker')


response = client.create_labeling_job(
    LabelingJobName="example-labeling-job",
    LabelAttributeName="label",
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': "s3://bucket/path/manifest-with-input-data.json"
            }
        }
    },
    LabelingJobAlgorithmsConfig={
        'LabelingJobAlgorithmSpecificationArn': "arn:aws:sagemaker:us-east-1:027400017018:labeling-job-algorithm-specification/tasktype",
        'LabelingJobResourceConfig': {
            'VpcConfig': {
                'SecurityGroupIds': ["sg-01233456789", "sg-987654321"],
                'Subnets': ["subnet-e0123456", "subnet-e7891011"]
            }
        }
    },
    OutputConfig={
        'S3OutputPath': "s3://bucket/path/file-to-store-output-data",
        'KmsKeyId': "string"
    },
    RoleArn="arn:aws:iam::*:role/*",
    LabelCategoryConfigS3Uri="s3://bucket/path/label-categories.json",
    StoppingConditions={
        'MaxHumanLabeledObjectCount': 123,
        'MaxPercentageOfInputDatasetLabeled': 123
    },
    HumanTaskConfig={
        'WorkteamArn': "arn:aws:sagemaker:region:*:workteam/private-crowd/*",
        'UiConfig': {
            'UiTemplateS3Uri': "s3://bucket/path/custom-worker-task-template.html"
        },
        'PreHumanTaskLambdaArn': "arn:aws:lambda:us-east-1:432418664414:function:PRE-tasktype",
        'TaskKeywords': [
            "Images",
            "Classification",
            "Multi-label"
        ],
        'TaskTitle': "Add task title here",
        'TaskDescription': "Add description of task here for workers",
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 3600,
        'TaskAvailabilityLifetimeInSeconds': 21600,
        'MaxConcurrentTaskCount': 1000,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': "arn:aws:lambda:us-east-1:432418664414:function:ACS-tasktype"
        }
    },
    Tags=[
        {
            'Key': "string",
            'Value': "string"
        }
    ]
)


Use Amazon VPC Mode from a Private Worker Portal


To restrict worker portal access to labelers working inside of your Amazon VPC, you can add a VPC
configuration when you create a Ground Truth private workforce. You can also add a VPC configuration
to an existing private workforce. Ground Truth automatically creates VPC interface endpoints in your
VPC and sets up AWS PrivateLink between your VPC endpoint and the Ground Truth services. The worker
portal URL associated with the workforce can be accessed from your VPC. The worker portal URL can also
be accessed from the public internet until you restrict public access. When you delete the
workforce or remove the VPC configuration from your workforce, Ground Truth automatically deletes the
VPC endpoints associated with the workforce.
Note
Only one VPC is supported per workforce.

Point cloud and video tasks do not support loading through a VPC.

This guide demonstrates how to satisfy the prerequisites and how to add and delete an Amazon VPC
configuration for your workforce.

Prerequisites

To run a Ground Truth labeling job in an Amazon VPC, review the following prerequisites.

• You have an Amazon VPC configured that you can use. If you have not configured a VPC, follow these
instructions for creating a VPC.
• Depending on how a Worker Task Template is written, labeling data stored in an Amazon S3 bucket
may be accessed directly from Amazon S3 during labeling tasks. In these cases, the VPC network must
be configured to allow traffic from the device used by the human labeler to the S3 bucket containing
labeling data.
• Follow View and update DNS attributes for your VPC to enable DNS hostnames and DNS resolution for
your VPC.

Note
There are two ways to configure your VPC for your workforce. You can do this through the
console or the AWS SageMaker CLI.

Using the SageMaker console to manage a VPC config

You can use the SageMaker console to add or remove a VPC configuration. You can also delete an
existing workforce.

Adding a VPC configuration to your workforce

Create a private workforce

• Create a private workforce using Amazon Cognito
• Create a private workforce using an OpenID Connect (OIDC) Identity Provider (IdP)

After you have created your private workforce, add a VPC configuration to it.

1. Navigate to Amazon SageMaker Runtime in your console.
2. Select Labeling workforces in the left panel.
3. Select Private to access your private workforce. After your Workforce status is Active, select Add next
to VPC.
4. When you are prompted to configure your VPC, provide the following:
a. Your VPC
b. Subnets
i. Ensure that your VPC has an existing subnet.
c. Security groups
i. Note: You cannot select more than 5 security groups.
d. After filling in this information, choose Confirm.
5. After you choose Confirm, you are redirected back to the Private page under Labeling workforces.
You should see a green banner at the top that reads Your private workforce update with VPC
configuration was successfully initialized. The workforce status is Updating. Next to the Delete
workforce button is the Refresh button, which can be used to retrieve the latest Workforce status.
After the workforce status has changed to Active, the VPC endpoint ID is updated as well.

Removing a VPC configuration from your workforce

Use the following information to remove a VPC configuration from your workforce using the console.

1. Navigate to Amazon SageMaker Runtime in your console.
2. Select Labeling workforces in the left panel.
3. Find and select your workforce.
4. Under Private workforce summary, find VPC and choose Remove next to it.
5. Select Remove.

Deleting a workforce through the console

To delete a workforce, it must not have any work teams associated with it. You can delete a workforce
only if the workforce status is Active or Failed.

Use the following information to delete a workforce using the console.

1. Navigate to Amazon SageMaker Runtime in your console.
2. Select Labeling workforces in the left panel.
3. Find and select your workforce.
4. Choose Delete workforce.
5. Choose Delete.

Using the SageMaker AWS API to manage a VPC config

Download the following files to add the new VpcConfig parameter to the SageMaker workforce CLI:

sagemaker-2017-07-24.normal.json

sagemaker-2017-07-24.paginators.json

sagemaker-2017-07-24.waiters-2.json

After downloading the files, run the following commands in your CLI:

aws configure add-model --service-model file://./sagemaker-2017-07-24.normal.json --service-name sagemaker

cp ./sagemaker-2017-07-24.paginators.json ~/.aws/models/sagemaker/2017-07-24/paginators.json


cp ./sagemaker-2017-07-24.waiters-2.json ~/.aws/models/sagemaker/2017-07-24/waiters-2.json

You can now test your API changes using AWS CLI. You can either create a new workforce with a VPC
configuration or update an existing workforce to add a VPC configuration. You can also remove a VPC
configuration from an existing workforce.

Create a workforce with a VPC configuration


If the account already has a workforce, you must delete it first. Alternatively, you can update the
existing workforce with a VPC configuration.

aws sagemaker create-workforce \
    --cognito-config '{"ClientId": "app-client-id","UserPool": "Pool_ID"}' \
    --workforce-vpc-config "{\"VpcId\": \"vpc-id\", \"SecurityGroupIds\": [\"sg-0123456789abcdef0\"], \"Subnets\": [\"subnet-0123456789abcdef0\"]}" \
    --workforce-name workforce-name

{
    "WorkforceArn": "arn:aws:sagemaker:us-west-2:xxxxxxxxx:workforce/workforce-name"
}

Describe the workforce and make sure the status is Initializing.

aws sagemaker describe-workforce --workforce-name workforce-name


{
"Workforce": {
"WorkforceName": "workforce-name",
"WorkforceArn": "arn:aws:sagemaker:us-west-2:xxxxxxxxx:workforce/workforce-name",
"LastUpdatedDate": 1622151252.451,
"SourceIpConfig": {
"Cidrs": []
},
"SubDomain": "subdomain.us-west-2.sagamaker.aws.com",
"CognitoConfig": {
"UserPool": "Pool_ID",
"ClientId": "app-client-id"
},
"CreateDate": 1622151252.451,
"WorkforceVpcConfig": {
"VpcId": "vpc-id",
"SecurityGroupIds": [
"sg-0123456789abcdef0"
],
"Subnets": [
"subnet-0123456789abcdef0"
]
},
"Status": "Initializing"
}
}

Navigate to the Amazon VPC console. Select Endpoints from the left panel. There should be two VPC
endpoints created in your account.

Adding a VPC configuration to your workforce


Update a non-VPC private workforce with a VPC configuration using the following command.


aws sagemaker update-workforce --workforce-name workforce-name \
    --workforce-vpc-config "{\"VpcId\": \"vpc-id\", \"SecurityGroupIds\": [\"sg-0123456789abcdef0\"], \"Subnets\": [\"subnet-0123456789abcdef0\"]}"

Describe the workforce and make sure the status is Updating.

aws sagemaker describe-workforce --workforce-name workforce-name


{
"Workforce": {
"WorkforceName": "workforce-name",
"WorkforceArn": "arn:aws:sagemaker:us-west-2:xxxxxxxxx:workforce/workforce-name",
"LastUpdatedDate": 1622151252.451,
"SourceIpConfig": {
"Cidrs": []
},
"SubDomain": "subdomain.us-west-2.sagamaker.aws.com",
"CognitoConfig": {
"UserPool": "Pool_ID",
"ClientId": "app-client-id"
},
"CreateDate": 1622151252.451,
"WorkforceVpcConfig": {
"VpcId": "vpc-id",
"SecurityGroupIds": [
"sg-0123456789abcdef0"
],
"Subnets": [
"subnet-0123456789abcdef0"
]
},
"Status": "Updating"
}
}

Navigate to your Amazon VPC console. Select Endpoints from the left panel. There should be two VPC
endpoints created in your account.

Removing a VPC configuration from your workforce

Update a VPC private workforce with an empty VPC configuration to remove VPC resources.

aws sagemaker update-workforce --workforce-name workforce-name \
    --workforce-vpc-config "{}"

Describe the workforce and make sure the status is Updating.

aws sagemaker describe-workforce --workforce-name workforce-name


{
"Workforce": {
"WorkforceName": "workforce-name",
"WorkforceArn": "arn:aws:sagemaker:us-west-2:xxxxxxxxx:workforce/workforce-name",
"LastUpdatedDate": 1622151252.451,
"SourceIpConfig": {
"Cidrs": []
},
"SubDomain": "subdomain.us-west-2.sagamaker.aws.com",

838
Amazon SageMaker Developer Guide
Security and Permissions

"CognitoConfig": {
"UserPool": "Pool_ID",
"ClientId": "app-client-id"
},
"CreateDate": 1622151252.451,
"Status": "Updating"
}
}

Navigate to your Amazon VPC console. Select Endpoints from the left panel. The two VPC endpoints
should be deleted.

Restrict public access to the worker portal while maintaining access through a VPC

The workers in a VPC or non-VPC worker portal are able to see the labeling job tasks assigned
to them. Tasks are assigned by adding workers to a work team through OIDC groups. It
is the customer's responsibility to restrict access to their public worker portal by setting the
sourceIpConfig in their workforce.
Note
You can restrict access to the worker portal only through the SageMaker API. This cannot be
done through the console.

Use the following command to restrict public access to the worker portal.

aws sagemaker update-workforce --region us-west-2 \
    --workforce-name workforce-demo --source-ip-config '{"Cidrs":["10.0.0.0/16"]}'
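The equivalent call with the AWS SDK for Python (Boto3) is sketched below; the workforce name and
CIDR range are the same placeholders used in the command above.

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

# Restrict worker portal access to the given CIDR range.
sagemaker.update_workforce(
    WorkforceName="workforce-demo",
    SourceIpConfig={"Cidrs": ["10.0.0.0/16"]}
)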

After the sourceIpConfig is set on the workforce, workers can access the worker portal through the
VPC but not through the public internet.
Note
You cannot set the source IP restriction for a worker portal in a VPC.

Output Data and Storage Volume Encryption


With Amazon SageMaker Ground Truth, you can label highly sensitive data, stay in control of your data,
and employ security best practices. While your labeling job is running, Ground Truth encrypts data in
transit and at rest. Additionally, you can use AWS Key Management Service (AWS KMS) with Ground
Truth to do the following:

• Use a customer managed key to encrypt your output data.
• Use an AWS KMS customer managed key with your automated data labeling job to encrypt the storage
volume attached to the compute instance used for model training and inference.

Use the topics on this page to learn more about these Ground Truth security features.

Use Your KMS Key to Encrypt Output Data


Optionally, you can provide an AWS KMS customer managed key when you create a labeling job, which
Ground Truth uses to encrypt your output data.

If you don't provide a customer managed key, Amazon SageMaker uses the default AWS managed key for
Amazon S3 for your role's account to encrypt your output data.


If you provide a customer managed key, you must add the required permissions to the key described in
Encrypt Output Data and Storage Volume with AWS KMS (p. 827). When you use the API operation
CreateLabelingJob, you can specify your customer managed key ID using the parameter KmsKeyId.
See the following procedure to learn how to add a customer managed key when you create a labeling job
using the console.

To add an AWS KMS key to encrypt output data (console):

1. Complete the first 7 steps in Create a Labeling Job (Console) (p. 706).
2. In step 8, select the arrow next to Additional configuration to expand this section.
3. For Encryption key, select the AWS KMS key that you want to use to encrypt output data.
4. Complete the rest of the steps in Create a Labeling Job (Console) (p. 706) to create a labeling job.

Use Your KMS Key to Encrypt Automated Data Labeling Storage Volume (API Only)
When you create a labeling job with automated data labeling using the CreateLabelingJob API
operation, you have the option to encrypt the storage volume attached to the ML compute instances
that run the training and inference jobs. To add encryption to your storage volume, use the parameter
VolumeKmsKeyId to input an AWS KMS customer managed key. For more information about this
parameter, see LabelingJobResourceConfig.

If you specify a key ID or ARN for VolumeKmsKeyId, your SageMaker execution role must include
permissions to call kms:CreateGrant. To learn how to add this permission to an execution role, see
Create a SageMaker Execution Role for a Ground Truth Labeling Job (p. 822).
Note
If you specify an AWS KMS customer managed key when you create a labeling job in the
console, that key is only used to encrypt your output data. It is not used to encrypt the storage
volume attached to the ML compute instances used for automated data labeling.
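To illustrate where the two keys go in a CreateLabelingJob request, the following Boto3 fragment builds
the relevant configuration objects; all angle-bracket values are placeholders, and the other required
request parameters are shown in the full example earlier on this page.

# Where the two AWS KMS keys are specified in a CreateLabelingJob request.
# All angle-bracket values are placeholders.
output_config = {
    'S3OutputPath': "s3://<output-bucket>/path",
    'KmsKeyId': "<output-data-kms-key-id>"  # encrypts output data only
}
labeling_job_algorithms_config = {
    'LabelingJobAlgorithmSpecificationArn': "<algorithm-specification-arn>",
    'LabelingJobResourceConfig': {
        # Encrypts the storage volume attached to the ML compute instances
        # used for automated data labeling training and inference.
        'VolumeKmsKeyId': "<storage-volume-kms-key-id>"
    }
}
# Pass these as OutputConfig and LabelingJobAlgorithmsConfig to
# client.create_labeling_job(...) along with the other required parameters.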

Workforce Authentication and Restrictions


Ground Truth enables you to use your own private workforce to work on labeling jobs. A private
workforce is an abstract concept which refers to a set of people who work for you. Each labeling job
is created using a work team, composed of workers in your workforce. Ground Truth supports private
workforce creation using Amazon Cognito.

A Ground Truth workforce maps to an Amazon Cognito user pool. A Ground Truth work team maps to
an Amazon Cognito user group. Amazon Cognito manages worker authentication. Amazon Cognito
supports OpenID Connect (OIDC), and customers can set up Amazon Cognito federation with their
own identity provider (IdP).

Ground Truth only allows one workforce per account per AWS Region. Each workforce has a dedicated
Ground Truth work portal login URL.

You can also restrict workers to a Classless Inter-Domain Routing (CIDR) block/IP address range. This
means annotators must be on a specific network to access the annotation site. You can add up to
ten CIDR blocks for one workforce. To learn more, see Manage Private Workforce Using the Amazon
SageMaker API (p. 885).

To learn how you can create a private workforce, see Create a Private Workforce (Amazon
Cognito) (p. 869).

Restrict Access to Workforce Types


Amazon SageMaker Ground Truth work teams fall into one of three workforce types: public (with
Amazon Mechanical Turk), private, and vendor. To restrict user access to a specific work team
using one of these types or the work team ARN, use the sagemaker:WorkteamType and/or the
sagemaker:WorkteamArn condition keys. For the sagemaker:WorkteamType condition key, use
string condition operators. For the sagemaker:WorkteamArn condition key, use Amazon Resource
Name (ARN) condition operators. If the user attempts to create a labeling job with a restricted work
team, SageMaker returns an access denied error.

The policies below demonstrate different ways to use the sagemaker:WorkteamType and
sagemaker:WorkteamArn condition keys with appropriate condition operators and valid condition
values.

The following example uses the sagemaker:WorkteamType condition key with the StringEquals
condition operator to restrict access to a public work team. It accepts condition values in the following
format: workforcetype-crowd, where workforcetype can equal public, private, or vendor.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RestrictWorkteamType",
"Effect": "Deny",
"Action": "sagemaker:CreateLabelingJob",
"Resource": "*",
"Condition": {
"StringEquals": {
"sagemaker:WorkteamType": "public-crowd"
}
}
}
]
}

The following policies show how to restrict access to a public work team using the
sagemaker:WorkteamArn condition key. The first shows how to use it with a valid IAM regex-variant
of the work team ARN and the ArnLike condition operator. The second shows how to use it with the
ArnEquals condition operator and the work team ARN.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RestrictWorkteamType",
"Effect": "Deny",
"Action": "sagemaker:CreateLabelingJob",
"Resource": "*",
"Condition": {
"ArnLike": {
"sagemaker:WorkteamArn": "arn:aws:sagemaker:*:*:workteam/public-crowd/
*"
}
}
}
]
}

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RestrictWorkteamType",
"Effect": "Deny",
"Action": "sagemaker:CreateLabelingJob",

841
Amazon SageMaker Developer Guide
Monitor Labeling Job Status

"Resource": "*",
"Condition": {
"ArnEquals": {
"sagemaker:WorkteamArn": "arn:aws:sagemaker:us-
west-2:394669845002:workteam/public-crowd/default"
}
}
}
]
}

Monitor Labeling Job Status


To monitor the status of your labeling jobs, you can set up an Amazon CloudWatch Events (CloudWatch
Events) rule for Amazon SageMaker Ground Truth (Ground Truth) to send an event to CloudWatch Events
when a labeling job status changes to Completed, Failed, or Stopped or when a worker accepts,
declines, submits, or returns a task.

Once you create a rule, you can add a target to it. CloudWatch Events uses this target to invoke
another AWS service to process the event. For example, you can create a target using an Amazon Simple
Notification Service (Amazon SNS) topic to send a notification to your email when a labeling job status
changes.

Prerequisites:

To create a CloudWatch Events rule, you will need an AWS Identity and Access Management (IAM)
role with an events.amazonaws.com trust policy attached. The following is an example of an
events.amazonaws.com trust policy.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"Service": [
"events.amazonaws.com"
]
},
"Action": "sts:AssumeRole"
}
]
}

Topics
• Send Events to CloudWatch Events (p. 842)
• Set Up a Target to Process Events (p. 843)
• Labeling Job Expiration (p. 844)
• Declining Tasks (p. 844)

Send Events to CloudWatch Events


To configure a CloudWatch Events rule to get status updates, or events, for your Ground Truth labeling
jobs, use the AWS Command Line Interface (AWS CLI) put-rule command. You can filter events that are
sent to your rule by status change. For example, you can create a rule that notifies you only if a labeling
job status changes to Completed. When using the put-rule command, specify the following to receive
labeling job statuses:

• \"source\":[\"aws.sagemaker\"]
• \"detail-type\":[\"SageMaker Ground Truth Labeling Job State Change\"]

To configure a CloudWatch Events rule to watch for all status changes, use the following command and
replace the placeholder text. For example, replace "GTLabelingJobStateChanges" with a unique
CloudWatch Events rule name and "arn:aws:iam::111122223333:role/MyRoleForThisRule"
with the Amazon Resource Number (ARN) of an IAM role with an events.amazonaws.com trust policy
attached.

aws events put-rule --name "GTLabelingJobStateChanges"


--event-pattern "{\"source\":[\"aws.sagemaker\"],\"detail-type\":[\"SageMaker Ground
Truth Labeling Job State Change\"]}"
--role-arn "arn:aws:iam::111122223333:role/MyRoleForThisRule"
--region "region"

To filter by job status, use the \"detail\":{\"LabelingJobStatus\":[\"Status\"]} syntax.
Valid values for Status are Completed, Failed, and Stopped.

The following example creates a CloudWatch Events rule that notifies you when a labeling job in us-
west-2 (Oregon) changes to Completed.

aws events put-rule --name "LabelingJobCompleted"


--event-pattern "{\"source\":[\"aws.sagemaker\"],\"detail-type\":[\"SageMaker Ground
Truth Labeling Job State Change\"], \"detail\":{\"LabelingJobStatus\":[\"Completed\"]}}"
--role-arn "arn:aws:iam::111122223333:role/MyRoleForThisRule"
--region us-west-2

The following example creates a CloudWatch Events rule that notifies you when a labeling job in us-
east-1 (Virginia) changes to Completed or Failed.

aws events put-rule --name "LabelingJobCompletedOrFailed"


--event-pattern "{\"source\":[\"aws.sagemaker\"],\"detail-type\":[\"SageMaker Ground
Truth Labeling Job State Change\"], \"detail\":{\"LabelingJobStatus\":[\"Completed\",
\"Failed\"]}}"
--role-arn "arn:aws:iam::111122223333:role/MyRoleForThisRule"
--region us-east-1

To learn more about the put-rule request, see Event Patterns in CloudWatch Events in the Amazon
CloudWatch Events User Guide.

Set Up a Target to Process Events


After you have created a rule, events similar to the following are sent to CloudWatch Events. In this
example, the labeling job test-labeling-job's status changed to Completed.

{
"version": "0",
"id": "111e1111-11d1-111f-b111-1111b11dcb11",
"detail-type": "SageMaker Ground Truth Labeling Job State Change",
"source": "aws.sagemaker",
"account": "111122223333",
"time": "2018-10-06T12:26:13Z",
"region": "us-east-1",
"resources": [

843
Amazon SageMaker Developer Guide
SageMaker Ground Truth Plus

"arn:aws:sagemaker:us-east-1:111122223333:labeling-job/test-labeling-job"
],
"detail": {
"LabelingJobStatus": "Completed"
}
}

To process events, you need to set up a target. For example, if you want to receive an email when your
labeling job status changes, use a procedure in Setting Up Amazon SNS Notifications in the Amazon
CloudWatch User Guide to set up an Amazon SNS topic and subscribe your email to it. Once you have
created a topic, you can use it to create a target, either in the console (see the following procedure) or
programmatically (see the sketch after the procedure).

To add a target to your CloudWatch Events rule

1. Open the CloudWatch console: https://console.aws.amazon.com/cloudwatch/home


2. In the navigation pane, choose Rules.
3. Choose the rule that you want to add a target to.
4. Choose Actions, and then choose Edit.
5. Under Targets, choose Add Target and choose the AWS service you want to act when a labeling job
status change event is detected.
6. Configure your target. For instructions, see the topic for configuring a target in the AWS
documentation for that service.
7. Choose Configure details.
8. For Name, enter a name and, optionally, provide details about the purpose of the rule in
Description.
9. Make sure that the check box next to State is selected so that your rule is listed as Enabled.
10. Choose Update rule.
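
Alternatively, the following is a minimal Boto3 sketch that creates an SNS topic, subscribes an email
address, and adds the topic as a target for the rule created earlier. The topic name, email address, and
rule name are placeholders.

import boto3

sns = boto3.client("sns")
events = boto3.client("events")

# Create an SNS topic and subscribe an email address to it (both values are
# placeholders); the subscriber must confirm the subscription by email.
topic_arn = sns.create_topic(Name="gt-labeling-job-status")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="user@example.com")

# Attach the topic as a target of the rule created earlier with put-rule.
# The topic's access policy must also allow events.amazonaws.com to publish
# to it (not shown here).
events.put_targets(
    Rule="LabelingJobCompleted",
    Targets=[{"Id": "gt-status-email", "Arn": topic_arn}],
)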

Labeling Job Expiration


If your labeling job is not completed within 30 days, it expires. If your labeling job expires, you
can chain the job to create a new labeling job that sends only unlabeled data to workers. For
more information, and to learn how to create a labeling job using chaining, see Chaining Labeling
Jobs (p. 813).

Declining Tasks
Workers are able to decline tasks.

Workers decline a task if the instructions are not clear, input data is not displaying correctly, or
if they encounter some other issue with the task. If a task is declined by as many workers as the
number of workers per dataset object (NumberOfHumanWorkersPerDataObject), the data object is
marked as expired and is not sent to additional workers.

Use Amazon SageMaker Ground Truth Plus to Label Data
Amazon SageMaker Ground Truth Plus is a turnkey data labeling service that uses an expert workforce to
deliver high-quality annotations quickly and reduces costs by up to 40%. Using SageMaker Ground Truth
Plus, data scientists and business managers, such as data operations managers and program managers,
can create high-quality training datasets without having to build labeling applications and manage
labeling workforces on their own. You can get started with Amazon SageMaker Ground Truth Plus by
uploading data along with the labeling requirements in Amazon S3.

Why use SageMaker Ground Truth Plus?

To train a machine learning (ML) model, data scientists need large, high-quality, labeled datasets. As
ML adoption grows, labeling needs increase. This forces data scientists to spend weeks on building data
labeling workflows and managing a data labeling workforce. Unfortunately, this slows down innovation
and increases cost. To ensure data scientists can spend their time building, training, and deploying ML
models, data scientists typically task other in-house teams consisting of data operations managers and
program managers to produce high-quality training datasets. However, these teams typically don't have
access to the skills required to deliver high-quality training datasets, which affects ML results. As a
result, these teams look for a data labeling partner that can help them create high-quality training
datasets at scale without consuming in-house resources.

When you upload the data, SageMaker Ground Truth Plus sets up the data labeling workflows and
operates them on your behalf. From there, an expert workforce trained on a variety of machine learning
(ML) tasks performs data labeling. SageMaker Ground Truth Plus currently offers two types of expert
workforce: an Amazon employed workforce and a curated list of third-party vendors. SageMaker Ground
Truth Plus provides you with the flexibility to choose the labeling workforce. AWS experts select the best
labeling workforce based on your project requirements. For example, if you need people proficient in
labeling audio files, specify that in the guidelines provided to SageMaker Ground Truth Plus, and the
service automatically selects labelers with those skills.
Note
SageMaker Ground Truth Plus does not support PHI, PCI, or FedRAMP-certified data, and you
should not provide this data to SageMaker Ground Truth Plus.

How does it work?

There are five main components to the SageMaker Ground Truth Plus workflow.

• Requesting a project
• Creating a project team
• Accessing the project portal to monitor progress of training datasets and review labeled data
• Creating a batch
• Receiving the labeled data

How do I use SageMaker Ground Truth Plus?

If you are a first-time user of SageMaker Ground Truth Plus, we recommend that you follow the
procedures outlined in the Getting Started with Amazon SageMaker Ground Truth Plus (p. 845)
section.

Getting Started with Amazon SageMaker Ground Truth Plus
This guide demonstrates how to complete the necessary steps to start an Amazon SageMaker Ground
Truth Plus project, review labels, and satisfy SageMaker Ground Truth Plus prerequisites.

To get started using SageMaker Ground Truth Plus, review Set Up Amazon SageMaker Ground Truth Plus
Prerequisites (p. 845) and Core Components of Amazon SageMaker Ground Truth Plus (p. 846).

Set Up Amazon SageMaker Ground Truth Plus Prerequisites


Use the following information to sign up for an AWS account. If you already have an AWS account, skip
this step.


Sign up for an AWS account


If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.

When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.

AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view
your current account activity and manage your account by going to https://aws.amazon.com/ and
choosing My Account.

Create an administrative user


After you sign up for an AWS account, create an administrative user so that you don't use the root user
for everyday tasks.

Secure your AWS account root user

1. Sign in to the AWS Management Console as the account owner by choosing Root user and entering
your AWS account email address. On the next page, enter your password.

For help signing in by using root user, see Signing in as the root user in the AWS Sign-In User Guide.
2. Turn on multi-factor authentication (MFA) for your root user.

For instructions, see Enable a virtual MFA device for your AWS account root user (console) in the IAM
User Guide.

Create an administrative user

• For your daily administrative tasks, grant administrative access to an administrative user in AWS IAM
Identity Center (successor to AWS Single Sign-On).

For instructions, see Getting started in the AWS IAM Identity Center (successor to AWS Single Sign-On)
User Guide.

Sign in as the administrative user

• To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email
address when you created the IAM Identity Center user.

For help signing in using an IAM Identity Center user, see Signing in to the AWS access portal in the
AWS Sign-In User Guide.

Core Components of Amazon SageMaker Ground Truth Plus


The following terms are key to understanding the capabilities of SageMaker Ground Truth Plus:

• Project: Each qualified engagement with an AWS expert results in a SageMaker Ground Truth Plus
project. A project can be in the pilot or production stage.
• Batch: A batch is a collection of similar recurring data objects, such as images, video frames, and text,
to be labeled. A project can have multiple batches.
• Metrics: Metrics are data about your SageMaker Ground Truth Plus project for a specific date or over a
date range.
• Task type: SageMaker Ground Truth Plus supports five task types for data labeling: text, image, video,
audio, and 3D point cloud. You can also have a custom task type.
• Data objects: Individual items that are to be labeled.

Request a Project
You can request a free pilot by creating a project.

To get started with an Amazon SageMaker Ground Truth Plus pilot, do the following.

1. Under the Ground Truth tab of Amazon SageMaker, choose Plus.


2. On the SageMaker Ground Truth Plus page, choose Request project.
3. A page titled Request a project opens. The page includes fields for General information and
Project overview. Enter the following information.

a. Under General information, enter your First name, Last name, and Business email address. An
AWS expert uses this information to contact you to discuss the project after you submit the
request.
b. Under Project overview, enter your Project name and Project description. Choose the Task
type based on your data and use case. You can also indicate if your data contains personally
identifiable information (PII).
c. Create or select an IAM role that grants SageMaker Ground Truth Plus permissions to perform a
labeling job by choosing one of the options below.

i. You can Create an IAM role that provides access to any S3 bucket you specify.
ii. You can Enter a custom IAM role ARN.
iii. You can choose an existing role.
iv. If you use an existing role or a custom IAM role ARN, make sure the role has the following
permissions policy and trust policy.

IAM permissions policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
                //Ex: "arn:aws:s3:::input-data-to-label/*"
            ]
        }
    ]
}

Trust policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker-ground-truth-plus.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

4. Choose Request a project.

Once you create a project, you can find it on the SageMaker Ground Truth Plus page, under the Projects
section. The project status should be Review in progress.
Note
You cannot have more than 5 projects with the Review in progress status.

Create a Project Team


A project team provides members of your organization or team with access to track projects, view
metrics, and review annotations. You can create a SageMaker Ground Truth Plus project team once you
have shared your data in an Amazon S3 bucket.

To add team members using Amazon Cognito, you have two options:

1. Create a new Amazon Cognito user group

a. Enter an Amazon Cognito user group name. This name cannot be changed.
b. Enter the email addresses of up to 50 team members in the Email addresses field. The
addresses must be separated by a comma.
c. Choose Create project team.

d. Your team members receive an email inviting them to join the SageMaker Ground Truth Plus
project team as shown in the following image.

2. Import team members from existing Amazon Cognito user groups.

a. Choose a user pool that you have created. User pools require a domain and an existing user
group. If you get an error that the domain is missing, set it in the Domain name options on the
App integration page of the Amazon Cognito console for your group.
b. Choose an app client. We recommend using a client generated by Amazon SageMaker.
c. Choose a user group from your pool to import its members.
d. Choose Create project team.

You can view and manage the list of team members through the AWS console.

To add team members after creating the project team:

1. Choose Invite new members in the Members section.


2. Enter the email addresses of up to 50 team members in the Email addresses field. The addresses
must be separated by a comma.
3. Choose Invite new members.

To delete existing team members:

1. Choose the team member to be deleted in the Members section.


2. Choose Delete.

Once you have added members to your project team, you can open the project portal to access your
projects.

Open the Project Portal


Once you have successfully submitted the intake form and created a project team, you can access your
SageMaker Ground Truth Plus projects by choosing Open project portal in the AWS console.

Each project consists of one or more batches. A batch is a collection of recurring similar data objects
(text, image, video frame, and point cloud) to be labeled. The project portal provides you with
transparency into the data labeling process. You can stay updated about a project, create batches within
a project, review the progress of the datasets across multiple projects, and analyze project metrics. The
project portal also allows you to review a subset of the labeled data and provide feedback. You can
configure the columns displayed in your project and batch table.

You can use the SageMaker Ground Truth Plus project portal to track the following details about your
project.

Project name: Each project is identified using a unique name.

Status: A SageMaker Ground Truth Plus project has one of the following status types:

1. Review in progress: You have successfully submitted the project request form. An AWS expert is
currently reviewing your request.
2. Request approved: Your project request is approved. You can now share your data by creating a new
batch from the project portal.
3. Workflow design and setup progress: An AWS expert is setting up your project.
4. Pilot in-progress: Object labeling for the project in the pilot stage is currently in progress.
5. Pilot complete: Object labeling is complete and the labeled data is stored in your Amazon S3 bucket.
6. Pricing complete: An AWS expert shares the pricing for the production project with you.
7. Contract executed: The contract is complete.
8. Production in-progress: Labeling for the project in the production stage is in progress.
9. Production complete: Object labeling is complete and the labeled data is stored in your Amazon S3
bucket.
10. Paused: Project is currently paused at your request.

Task type: SageMaker Ground Truth Plus lets you label five task types: text, image, video, audio, and
point cloud.

Batches: Total number of batches within a project.

Project creation date: Starting date of a project.

Total objects: Total number of objects to be labeled across all batches.

Objects completed: Number of labeled objects.

Remaining objects: Number of objects left to be labeled.

Failed objects: Number of objects that cannot be labeled due to an issue with the input data.

Create a Batch
You can use the project portal to create batches for a project after the project status changes to
Request approved.

To create a batch, do the following.

1. Select a project by choosing the project name.


2. A page with the project name as its title opens. Under the Batches section, choose Create batch.
3. Enter the Batch name, Batch description, S3 location for input datasets, and S3 location for
output datasets.
4. Choose Submit.

To create a batch successfully, make sure you meet the following criteria:

• Your data is in the US East (N. Virginia) Region.
• The maximum size for each file is 2 gigabytes.
• The maximum number of files in a batch is 10,000.
• The total size of a batch is less than 100 gigabytes.
• You have no more than 5 batches with the Data transfer in-progress status.

Note
You cannot create a batch before the project status changes to Request approved.

Review Metrics
Metrics are data about your SageMaker Ground Truth Plus project for a specific date or over a date range.

You can review metrics for all batches or for a specific batch, as shown in the following image.

You can review the following metrics about the batch:

Total objects: Total number of objects in a batch or across all batches.

Objects completed by day: Total number of objects labeled on a specific date or over a date range.

Labels completed by day: Total number of labels completed on a specific date or over a date range. An
object can have more than one label.

Review Batches
Every Amazon SageMaker Ground Truth Plus project consists of one or more batches. Each batch is made
up of data objects to be labeled. You can view all the batches for your project using the project portal as
shown in the following image.

You can use the SageMaker Ground Truth Plus project portal to track the following details about every
batch:

Batch name: Each batch is identified with a unique batch name.

Status: A SageMaker Ground Truth Plus batch has one of the following status types:

1. Request submitted: You have successfully submitted a new batch.


2. Data transfer failed: Data transfer failed with errors. Check the error reason and create a new batch
after fixing the error.
3. Data received: We have received your unlabeled input data.
4. In-progress: Data labeling is in progress.
5. Ready for review: Data labeling is completed. A subset of labeled objects from the batch are ready for
you to review. This is an optional step.
6. Review submission in-progress: Review feedback is currently being processed.
7. Review complete: You have successfully reviewed the batch. Next, you must accept or reject it. This
action cannot be undone.
8. Accepted: You have accepted the labeled data and will receive it in your Amazon S3 bucket shortly.
9. Rejected: Labeled data needs to be reworked.
10. Sent for rework: Labeled data is sent for rework. You can review the batch after its status changes to
Ready for review.
11. Ready for delivery: Labeled data is ready to be transferred to your Amazon S3 bucket.
12. Data delivered: Object labeling is complete and the labeled data is stored in your Amazon S3 bucket.
13. Paused: Batch is paused at your request.

Task type: SageMaker Ground Truth Plus lets you label five task types: text, image, video, audio, and
point cloud.

Batch creation date: Date when the batch was created.

Total objects: Total number of objects to be labeled across a batch.

Completed objects: Number of labeled objects.

Remaining objects: Number of objects left to be labeled.

Failed objects: Number of objects that cannot be labeled due to an issue with the input data.

Objects to review: Number of objects that are ready for your review.

Objects with feedback: Number of objects that have received feedback from team members.

SageMaker Ground Truth Plus lets you review a sample set of your labeled data (determined during the
initial consultation call) through the review UI shown in the following image.

The portal allows you and your project team members to review a small sample set of the labeled
objects for each batch. Through the review UI, you can navigate across that subset and provide
feedback for each labeled object.

You can perform the following actions using the review UI.

• Use the arrow controls on the bottom left to navigate through the data objects.
• You can provide feedback for each object. The Feedback section is in the right panel. Choose Submit
to submit feedback for all images.
• Use the image controls in the bottom tray to zoom, pan, and control contrast.
• If you plan on returning to finish up your review, choose Stop and resume later on the top right.
• Choose Save to save your progress. Your progress is also autosaved every 15 minutes.
• To exit the review UI, choose Close on the upper right corner of the review UI.
• You can verify the Label attributes and Frame attributes on each frame using the panel on the right.
You cannot create new objects or modify existing objects in this task.

Accept or Reject Batches


After you have reviewed a batch, you must choose to accept or reject it.

If you accept a batch, the output from that labeling job is placed in the Amazon S3 bucket that you
specify. Once the data is delivered to your S3 bucket, the status of your batch changes from Accepted to
Data delivered.

If you reject a batch, you can provide feedback and explain your reasons for rejecting the batch.

SageMaker Ground Truth Plus allows you to provide feedback at the data object level as well as the
batch level. You can provide feedback for data objects through the review UI. You can use the project
portal to provide feedback for each batch. When you reject a batch, an AWS expert contacts you to
determine the rework process and the next steps for the batch.
Note
Accepting or rejecting a batch is a one-time action and cannot be undone. You must either
accept or reject every batch of the project.

Use Amazon SageMaker Ground Truth Synthetic Data to Generate and Label Data
Amazon SageMaker Ground Truth synthetic data is a turnkey data generation and labeling service that
makes it quicker and more cost effective for machine learning (ML) scientists to acquire images that are
used to train computer vision (CV) models. To train a CV model, ML scientists need large, high-quality,
labeled datasets. With Ground Truth synthetic data, ML scientists can generate and label thousands
of images within days. Ground Truth synthetic data uses computer-generated 3D models to create
virtual environments representing real-world scenarios, generates synthetic images captured from these
environments, and automatically annotates each image with labels. You can use the labeled synthetic
images with AWS’s CV model training services such as Amazon SageMaker and Amazon Lookout for
Vision.

Why use Ground Truth Synthetic Data?

Collecting and labeling data in dynamic environments with variations in object size, shape, color,
position, background, and lighting is often a time-consuming and expensive process. To effectively
train a model to operate in a dynamic environment, ML scientists must collect a large set of real-world
images to represent all possible scenarios, a process that can take months. For scenarios that don’t occur
frequently, such as rare product defects and faulty product placement, it can take years to capture a
sufficient number of images to train a CV model. To acquire images with product defects, ML scientists
may intentionally damage products in order to acquire defective images. Ground Truth synthetic data
makes it faster and more cost effective for ML scientists to quickly acquire labeled images that represent
real-world scenarios, a core requirement for training CV models. ML scientists can use Ground Truth
synthetic data to generate thousands of synthetic images from 3D virtual environments representing
real world scenarios in hours instead of months. Ground Truth provides a synthetic image fidelity and
diversity report and a manifest file along with the labeled synthetic data. The synthetic image fidelity
and diversity report provides statistics and plots that help you better understand the generated synthetic
images. The manifest file contains information about the images and image labels that you can use to
train and test a model.
Note
Ground Truth synthetic data does not support PHI, PCI, or FedRAMP certified data, and you
should not provide this data to Ground Truth synthetic data.

Ground Truth synthetic data has the following functionalities.

• Full 3D scenes and multiple cameras in a scene
• Ground truth depth maps that provide 3D depth data for all generated images
• Sequences of images (video) from multiple synchronized cameras
• A moving conveyor belt that supports dynamic scenes

How do I use Ground Truth Synthetic Data?

If you are a first-time user of Ground Truth synthetic data, we recommend that you follow the
procedures outlined in the Getting Started with Amazon SageMaker Ground Truth Synthetic
Data (p. 856) section.

Getting Started with Amazon SageMaker Ground Truth Synthetic Data
This guide demonstrates how to complete the necessary steps to satisfy the prerequisites, start a Ground
Truth synthetic data project, and review labels.

To get started using synthetic data, review Set Up Amazon SageMaker Ground Truth Synthetic
Data Prerequisites (p. 856) and Core Components of Amazon SageMaker Ground Truth Synthetic
Data (p. 857).

Set Up Amazon SageMaker Ground Truth Synthetic Data Prerequisites
To use Ground Truth synthetic data, you need an AWS account. If you already have an AWS account, skip
this step.

Sign up for an AWS account


If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account

1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.

When you sign up for an AWS account, an AWS account root user is created. The root user has access
to all AWS services and resources in the account. As a security best practice, assign administrative
access to an administrative user, and use only the root user to perform tasks that require root user
access.

AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view
your current account activity and manage your account by going to https://aws.amazon.com/ and
choosing My Account.

Create an administrative user


After you sign up for an AWS account, create an administrative user so that you don't use the root user
for everyday tasks.

Secure your AWS account root user

1. Sign in to the AWS Management Console as the account owner by choosing Root user and entering
your AWS account email address. On the next page, enter your password.

For help signing in by using root user, see Signing in as the root user in the AWS Sign-In User Guide.
2. Turn on multi-factor authentication (MFA) for your root user.

For instructions, see Enable a virtual MFA device for your AWS account root user (console) in the IAM
User Guide.

Create an administrative user

• For your daily administrative tasks, grant administrative access to an administrative user in AWS IAM
Identity Center (successor to AWS Single Sign-On).

For instructions, see Getting started in the AWS IAM Identity Center (successor to AWS Single Sign-On)
User Guide.

Sign in as the administrative user

• To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email
address when you created the IAM Identity Center user.

For help signing in using an IAM Identity Center user, see Signing in to the AWS access portal in the
AWS Sign-In User Guide.

Core Components of Amazon SageMaker Ground Truth Synthetic Data
The following terms are key to understanding the capabilities of Ground Truth synthetic data:

• Project: Each qualified engagement with an AWS expert results in a Ground Truth synthetic data
project.
• Batch: A batch is a collection of similar labeled images. A batch can be in the test or production stage.
A project can have multiple batches.
• Synthetic Image Fidelity and Diversity Report: Ground Truth synthetic data provides a metrics report
that helps you compare the generated synthetic images with your typical dataset.

Request a Project
To get started with Amazon SageMaker Ground Truth synthetic data, go to the SageMaker console and
complete the intake form.

Once you submit the intake form in the AWS console, an AWS expert from the Ground Truth synthetic
data team reaches out to discuss your data labeling project requirements and pricing.

Share Data from Your Amazon S3 Bucket


After an AWS expert reaches out to discuss your project, you may be required to fill out an intake form
with questions specific to your synthetic data requirements. The intake form, along with assets shared
with Ground Truth synthetic data, allows the Ground Truth synthetic data team to evaluate your project
and the estimated work required to complete your project.

Create an Amazon S3 bucket to share your project assets with Ground Truth synthetic data and store
your project’s output data.

To create an Amazon S3 bucket and share it with us:

1. Follow the instructions in Create a Bucket in the Amazon Simple Storage Service Console User Guide.
2. We recommend using the following naming convention when storing your data in an Amazon S3
bucket (see the sketch after this procedure).

a. The bucket name should contain fewer than 63 characters.
b. The bucket name can include hyphens, but no spaces or underscores.
3. In the Buckets list, choose the name of the bucket you created.
4. Choose Permissions.
5. In the Bucket policy section, choose Edit.
6. Confirm that ACLs disabled is selected.
Note
Object Ownership should be set to ACLs disabled as shown in the image below.

7. Choose Save changes.
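
If you prefer to create the bucket programmatically, the following is a minimal Boto3 sketch; the bucket
name is hypothetical and follows the naming guidance in step 2.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical bucket name: under 63 characters, hyphens allowed,
# no spaces or underscores.
s3.create_bucket(Bucket="my-synthetic-data-project")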

Note
If you have additional requirements for accessing your data in an Amazon S3 bucket, please
contact your AWS expert.

To share your project assets with the Ground Truth synthetic data team for project evaluation, work
estimation, and synthetic data generation, follow the steps in the Send Project Data to Ground Truth
Synthetic Data (p. 859) section below.

After receiving your intake form and project assets, we return a statement of work (SOW) within 5
business days. The SOW outlines your engagement with Ground Truth synthetic data generation and
labeling. After you approve the SOW, the Ground Truth synthetic data team produces a test batch
consisting of 50 synthetic images. An AWS expert meets with you to review the test batch, approve
or reject images, and complete the final production. The timeline for this is based on the responses in
your intake form.

Send Project Data to Ground Truth Synthetic Data


After an AWS expert has been assigned to your project, you can send project data to the Ground Truth
synthetic data team to assist in project evaluation, work estimation, and synthetic data generation.

To send project data to Ground Truth synthetic data:

1. Under the Project data transfers table in the project portal, choose Send project data.
2. Enter the name of the S3 bucket from which you want to send project data as the Amazon S3
source location of the project data transfer.
3. Select an IAM role for the project data transfer. If you select Automatic, Ground Truth synthetic data
creates an IAM role in your account with the required permissions to run the project data transfer
and call other services on your behalf (recommended). If you select an existing IAM role in your
account, Ground Truth synthetic data uses that IAM role to run the project data transfer and call
other services on your behalf.
4. Choose Create to create and start the project data transfer.

After creating a project data transfer, you can view the status of the transfer in the Project data
transfers table on the project details page in the project portal. When the project data transfer status is
Completed, the project data is available to the Ground Truth synthetic data team.

Project Portal
Each project consists of one or more batches. A batch is a collection of similar generated and labeled
images. The project portal provides you access to the projects you have contracted with Ground Truth
synthetic data. You can view the status of your projects and access completed batches along with the
synthetic image fidelity and diversity report. You also review your batches to accept or reject them
through the project portal.

You can use the Ground Truth synthetic data project portal to track the following details about your
project:

Project name: Each project is identified using a unique name.

Status: A Ground Truth synthetic data project has one of the following status types:

1. Request submitted: You have successfully submitted the project request form. Next, an AWS expert
schedules a call with you to discuss the details for your project.
2. Review in progress: We are reviewing your project. An AWS expert has been assigned to your project.
3. Production in progress: We are currently working on generating labeled data for your project.
4. Data ready for review: At least one batch is ready for your review.
5. Project complete: We have completed the generation of the required labeled images. The images are
stored in your Amazon S3 bucket.

Batches: Total number of batches within a project.

Project start date: Starting date of a project.

Total images: Number of images you requested.

Completed images: Number of labeled images generated across all accepted production batches.

Delete a Project
You can delete a project using the console if the project status is Request submitted or Project
complete. To delete a project with any other status, contact your AWS expert. Deleting a Ground Truth
synthetic data project does not delete your data from the Amazon S3 buckets, and that data can still
incur storage charges.

You can delete a project based on its status as follows:

• Request submitted: Deleting a requested project deletes all the project and customer information
from the Ground Truth synthetic data database.
• Review in progress / Production in progress / Data ready for review: You can request your AWS
expert to delete a project having one of these statuses. Deleting a project deletes all the project and
personal information from the Ground Truth synthetic data database and S3 buckets.
• Project complete: Once a project is marked as Project complete, we delete all customer information
from the Ground Truth synthetic data database and S3 buckets. You can view the project and batches
as long as you want, or delete them using the console.

Note
Deleting a project does not delete the images from your S3 bucket. To learn more about
deleting images from your S3 bucket, refer to Deleting Amazon S3 objects.

Review Batches
Every Amazon SageMaker Ground Truth synthetic data project consists of one or more batches. Each
batch is made up of labeled synthetic images. Batches are of two types, Test Batch and Production
Batch. A test batch provides a small preview of how the synthetic images look using your 3D assets and
environment. Images in the test batch are not counted towards the total number of synthetic images you
contract. After you approve the test batch for a specific configuration of images, Ground Truth synthetic
data starts generating images for your production batch. Images in a production batch are counted
towards the total required images.

For every batch, Ground Truth synthetic data provides a Synthetic Image Fidelity and Diversity
Report. This report provides image and object level statistics and plots that help you make sense of
the generated synthetic images. The statistics are used to describe the diversity and the fidelity of the
synthetic images and compare them with real images. Examples of the statistics and plots provided are
the distributions of object classes, object sizes, image brightness, image contrast, as well as the plots
evaluating the indistinguishability between synthetic and real images. The raw data for all the computed
dataset statistics is also provided as CSV files to help you accelerate model debugging and enable further
analyses.

You can view all the batches for your project using the project portal.

You can use the Ground Truth synthetic data project portal to track the following details about every
batch:

Batch name: Each batch is identified with a unique batch name.

Status: A Ground Truth synthetic data batch has one of the following status types:

1. In progress: We are currently generating labeled images for this batch. It will soon be ready for your
review.
2. Ready for review: A batch of labeled synthetic images is now ready for your review. Follow the steps
in the Transfer Batch Data to Your Amazon S3 bucket (p. 862) section to view the images and review
the batch.
3. Accepted: You have accepted this batch.
4. Rejected: You have rejected this batch and it needs to be reworked. When you reject a batch, an AWS
expert contacts you to discuss this further.

Batch type: A batch can either be a test batch or a production batch.

Creation date: Date when the batch was created.

Images: Total number of images in the batch.

Transfer Batch Data to Your Amazon S3 bucket


When the batch status is Ready for review, you must transfer the batch data to your S3 bucket to view
the images and review the batch.

To transfer batch data to your S3 bucket:

1. On the batch details page in the project portal, choose Get batch data.
2. Under S3 destination location, enter the name of the S3 bucket where you would like to receive
your batch data.
3. Select an IAM role for the project data transfer. If you select Automatic, Ground Truth synthetic data
creates an IAM role in your account with the required permissions to run the project data transfer
and call other services on your behalf (recommended). If you select an existing IAM role in your
account, Ground Truth synthetic data uses that IAM role to run the project data transfer and call
other services on your behalf.
4. Choose Create to create and start the batch data transfer.

After creating a batch data transfer, you can view the status of the transfer in the Batch data transfers
table on the batch details page in the project portal. When the batch data transfer status is Completed,
the batch data is available in your S3 bucket, the batch images are viewable on the batch details page in
the project portal, and you can proceed to review the batch.

Accept or Reject Batches


After you have reviewed a batch, you can choose to accept or reject it from the project portal as shown
below.

Accepting a batch informs your AWS expert to continue or complete the project, based on the number of
remaining images you contracted.

When you reject a batch, an AWS expert contacts you to determine the rework process and the next
steps for the batch.

Accepting or rejecting a batch is a one-time action and can only be undone by contacting your AWS
expert. You must either accept or reject every batch of the project.

Create and Manage Workforces


A workforce is the group of workers that you have selected to label your dataset. You can choose either
the Amazon Mechanical Turk workforce, a vendor-managed workforce, or you can create your own
private workforce to label or review your dataset. Whichever workforce type you choose, Amazon
SageMaker takes care of sending tasks to workers.

When you use a private workforce, you also create work teams, groups of workers from your workforce
that are assigned to specific jobs: Amazon SageMaker Ground Truth labeling jobs or Amazon
Augmented AI human review tasks. You can have multiple work teams and can assign one or more work
teams to each job.

You can use Amazon Cognito or your own private OpenID Connect (OIDC) Identity Provider (IdP) to
manage your private workforce and work teams. For more information about the permissions required to
manage your workforce this way, see Permissions Required to Use the Amazon SageMaker Ground Truth
Console (p. 3058).

Topics
• Using the Amazon Mechanical Turk Workforce (p. 863)
• Managing Vendor Workforces (p. 867)
• Use a Private Workforce (p. 868)

Using the Amazon Mechanical Turk Workforce


The Amazon Mechanical Turk (Mechanical Turk) workforce provides the most workers for your Amazon
SageMaker Ground Truth labeling job and Amazon Augmented AI human review task. The Amazon
Mechanical Turk workforce is a worldwide resource. Workers are available 24 hours a day, 7 days a week.
You typically get the fastest turnaround for your human review tasks and labeling jobs when you use the
Amazon Mechanical Turk workforce.

Any Amazon Mechanical Turk workforce billing is handled as part of your Ground Truth or Amazon
Augmented AI billing. You do not need to create a separate Mechanical Turk account to use the Amazon
Mechanical Turk workforce.
Important
You should not share confidential information, personal information, or protected health
information with this workforce. You should not use the Amazon Mechanical Turk workforce
when you use Amazon A2I in conjunction with AWS HIPAA-eligible services, such as Amazon
Textract and Amazon Rekognition, for workloads containing protected health information.

You can choose Mechanical Turk as your workforce when you create a Ground Truth labeling job or
Amazon A2I human review workflow (flow definition). You can create a labeling job and a human review
workflow using the SageMaker console and API.

When you use an API operation to create a labeling job or human review workflow, you use the following
ARN for the Amazon Mechanical Turk workforce for your WorkteamArn. Replace region with the AWS
Region you are using to create the labeling job or human loops. For example, if you create a labeling job
in US West (Oregon), replace region with us-west-2.

• arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default

Ground Truth and Amazon A2I require that your input data is free of personally identifiable information
(PII) when you use Mechanical Turk. If you use the Mechanical Turk workforce and do not specify that
your input data is free of PII, your Ground Truth labeling jobs and Augmented AI tasks will fail. You
specify that your input data is free of PII when you create a Ground Truth labeling job and when you
create a Amazon A2I human loop using a built-in integration or the StartHumanLoop operation.

Use the following sections to learn how to use Mechanical Turk with these services.

Topics
• Use Mechanical Turk with Ground Truth (p. 864)
• Use Mechanical Turk with Amazon A2I (p. 865)
• When is Mechanical Turk Not Supported? (p. 867)

Use Mechanical Turk with Ground Truth


You can use Mechanical Turk with Ground Truth when you create a labeling job using the console, or the
CreateLabelingJob operation.

When you create a labeling job, we recommend you adjust the number of workers that annotate each
data object based on the complexity of the job and the quality that you need. Amazon SageMaker
Ground Truth uses annotation consolidation to improve the quality of the labels. More workers can make
a difference in the quality of the labels for more complex labeling jobs, but might not make a difference
for simpler jobs. For more information, see Consolidate Annotations (p. 806). Note that annotation
consolidation is not supported for Amazon A2I human review workflows.

To use Mechanical Turk when you create a labeling job (console):

1. Use the following to create a labeling job using the Ground Truth area of the SageMaker console:
Create a Labeling Job (Console) (p. 706).
2. When you are selecting Worker types in the Workers section, select Amazon Mechanical Turk.
3. Specify the total amount of time workers have to complete a task using Task timeout.
4. Specify the total amount of time a task remains available to workers in Task expiration. This is how
long workers have to pick up a task before it fails.
5. Select the Price per task using the dropdown list. This is the amount of money a worker receives for
completing a single task.
6. (Optional) If applicable, select The dataset does not contain adult content. SageMaker may restrict
the Mechanical Turk workers that can view your task if it contains adult content.
7. You must read and confirm the following statement by selecting the check box to use the
Mechanical Turk workforce. If your input data contains confidential information, personal
information, or protected health information, you must select another workforce.

You understand and agree that the Mechanical Turk workforce consists of independent
contractors located worldwide and that you should not share confidential information, personal
information, or protected health information with this workforce.
8. (Optional) Select the check box next to Enable automated data labeling if you want to enable
automated data labeling. To learn more about this feature, see Automate Data Labeling (p. 807).
9. You can specify the Number of workers per dataset object under Additional configuration. For
example, if you enter 3 in this field, each data object will be labeled by 3 workers.

When you create your labeling job by selecting Create, your labeling tasks are sent to Mechanical Turk
workers.

To use Mechanical Turk when you create a labeling job (API):

1. Use the following to create a labeling job using the CreateLabelingJob operation: Create a
Labeling Job (API) (p. 709).
2. Use the following for the WorkteamArn. Replace region with the AWS Region you are using to
create the labeling job.

arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default
3. Use TaskTimeLimitInSeconds to specify the total amount of time workers have to complete a
task.
4. Use TaskAvailabilityLifetimeInSeconds to specify the total amount of time a task remains
available to workers. This is how long workers have to pick up a task before it fails.
5. Use NumberOfHumanWorkersPerDataObject to specify the number of workers per dataset
object.
6. Use PublicWorkforceTaskPrice to set the price per task. This is the amount of money a worker
receives for completing a single task.
7. Use DataAttributes to specify that your input data is free of confidential information, personal
information, or protected health information.

Ground Truth requires that your input data is free of personally identifiable information (PII) if you
use the Mechanical Turk workforce. If you use Mechanical Turk and do not specify that your input
data is free of PII using the FreeOfPersonallyIdentifiableInformation flag, your labeling
job will fail.

Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Mechanical Turk workers that can view your task if it contains
adult content.

You can see examples of how to use this API in the following notebooks, found on GitHub: Ground
Truth Jupyter Notebook Examples. You can access these notebooks under the SageMaker Example
Notebooks (p. 220) in a notebook instance.
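
For reference, the following is a minimal Boto3 sketch of steps 2 through 7 for a hypothetical bounding
box job in us-west-2. The S3 paths, IAM role, worker task template, and the pre-annotation and
annotation-consolidation Lambda ARNs are placeholders; substitute the values for your own task type
and Region.

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

sagemaker.create_labeling_job(
    LabelingJobName="my-mturk-labeling-job",
    LabelAttributeName="label",
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://your-bucket/manifest.json"}
        },
        # Required declarations when using the Mechanical Turk workforce.
        "DataAttributes": {
            "ContentClassifiers": [
                "FreeOfPersonallyIdentifiableInformation",
                "FreeOfAdultContent",
            ]
        },
    },
    OutputConfig={"S3OutputPath": "s3://your-bucket/output/"},
    HumanTaskConfig={
        # The Mechanical Turk workforce ARN for us-west-2.
        "WorkteamArn": "arn:aws:sagemaker:us-west-2:394669845002:workteam/public-crowd/default",
        "UiConfig": {"UiTemplateS3Uri": "s3://your-bucket/template.liquid"},
        # Placeholder pre-annotation and consolidation Lambda ARNs for your task type.
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-west-2:111122223333:function:PRE-Example",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-west-2:111122223333:function:ACS-Example"
        },
        "TaskTitle": "Draw bounding boxes",
        "TaskDescription": "Draw a box around each object in the image",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
        "TaskAvailabilityLifetimeInSeconds": 21600,
        "PublicWorkforceTaskPrice": {
            "AmountInUsd": {"Dollars": 0, "Cents": 3, "TenthFractionsOfACent": 6}
        },
    },
)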

Use Mechanical Turk with Amazon A2I


You can specify that you want to use Mechanical Turk with Amazon A2I when you create a human review
workflow, also referred to as a flow definition, in the console, or with the CreateFlowDefinition API
operation. When you use this human review workflow to configure human loops, you must specify that
your input data is free of PII.

To use Mechanical Turk when you create a human review workflow (console):

1. Use the following to create a human review workflow in the Augmented AI section of the SageMaker
console: Create a Human Review Workflow (Console) (p. 2967).
2. When you are selecting Worker types in the Workers section, select Amazon Mechanical Turk.
3. Select the Price per task using the dropdown list. This is the amount of money a worker receives for
completing a single task.
4. (Optional) You can specify the Number of workers per dataset object under Additional
configuration. For example, if you enter 3 in this field, each data object will be labeled by 3 workers.
5. (Optional) Specify the total amount of time workers have to complete a task using Task timeout.
6. (Optional) Specify the total amount of time a task remains available to workers in Task expiration.
This is how long workers have to pick up a task before it fails.
7. Once you have created your human review workflow, you can use it to configure a human loop by
providing its Amazon Resource Name (ARN) in the parameter FlowDefinitionArn. You configure
a human loop using one of the API operations of a built-in task type, or the Amazon A2I runtime API
operation, StartHumanLoop. To learn more, see Create and Start a Human Loop (p. 2985).

When you configure your human loop, you must specify that your input data is free of personally
identifiable information (PII) using the FreeOfPersonallyIdentifiableInformation content
classifier in DataAttributes. If you use Mechanical Turk and do not specify that your input data is
free of PII, your human review tasks will fail.

Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Mechanical Turk workers that can view your task if it contains
adult content.

To use Mechanical Turk when you create a human review workflow (API):

1. Use the following to create a human review workflow using the CreateFlowDefinition
operation: Create a Human Review Workflow (API) (p. 2969).
2. Use the following for the WorkteamArn. Replace region with the AWS Region you are using to
create the labeling job.

arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default
3. Use TaskTimeLimitInSeconds to specify the total amount of time workers have to complete a
task.
4. Use TaskAvailabilityLifetimeInSeconds to specify the total amount of time a task remains
available to workers. This is how long workers have to pick up a task before it fails.
5. Use TaskCount to specify the number of workers per dataset object. For example, if you specify 3
for this parameter, each data object will be labeled by 3 workers.
6. Use PublicWorkforceTaskPrice to set the price per task. This is the amount of money a worker
receives for completing a single task.
7. Once you have created your human review workflow, you can use it to configure a human loop by
providing its Amazon Resource Name (ARN) in the parameter FlowDefinitionArn. You configure
a human loop using one of the API operations of a built-in task type, or the Amazon A2I runtime API
operation, StartHumanLoop. To learn more, see Create and Start a Human Loop (p. 2985).

When you configure your human loop, you must specify that your input data is free of personally
identifiable information (PII) using the FreeOfPersonallyIdentifiableInformation content
classifier in DataAttributes. If you use Mechanical Turk and do not specify that your input data is
free of PII, your human review tasks will fail.

Use the FreeOfAdultContent flag to declare that your input data is free of adult
content. SageMaker may restrict the Mechanical Turk workers that can view your task if it contains
adult content.

You can see examples of how to use this API in the following notebooks, found on GitHub: Amazon A2I
Jupyter Notebook Examples.
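
As a minimal sketch of configuring a human loop (step 7), the following starts a human loop with Boto3
and declares the required content classifiers. The loop name, flow definition ARN, and input content are
placeholders.

import boto3

a2i = boto3.client("sagemaker-a2i-runtime", region_name="us-east-1")

a2i.start_human_loop(
    HumanLoopName="example-loop-1",
    FlowDefinitionArn="arn:aws:sagemaker:us-east-1:111122223333:flow-definition/my-flow",
    HumanLoopInput={"InputContent": '{"initialValue": "sample text to review"}'},
    # Both declarations are required when the flow definition uses
    # the Mechanical Turk workforce.
    DataAttributes={
        "ContentClassifiers": [
            "FreeOfPersonallyIdentifiableInformation",
            "FreeOfAdultContent",
        ]
    },
)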

When is Mechanical Turk Not Supported?


This workforce is not supported under the following scenarios. In each scenario, you must use a private
or vendor workforce.

• This workforce is not supported for Ground Truth video frame labeling jobs and 3D point cloud
labeling jobs.
• You cannot use this workforce if your input data contains personally identifiable information (PII).
• Mechanical Turk is not available in some of the AWS special regions. If applicable, refer to the
documentation for your special region for more information.

Managing Vendor Workforces


You can use a vendor-managed workforce to label your data using Amazon SageMaker Ground Truth
(Ground Truth) and Amazon Augmented AI (Amazon A2I). Vendors have extensive experience in providing
data labeling services for the purpose of machine learning. Vendor workforces for these two services
must be created and managed separately through the Amazon SageMaker console.

Vendors make their services available via the AWS Marketplace. You can find details of the vendor's
services on their detail page, such as the number of workers and the hours that they work. You can use
these details to make estimates of how much the labeling job will cost and the amount of time that you
can expect the job to take. Once you have chosen a vendor you subscribe to their services using the AWS
Marketplace.

A subscription is an agreement between you and the vendor. The agreement spells out the details of the
agreement, such as price, schedule, or refund policy. You work directly with the vendor if there are any
issues with your labeling job.

You can subscribe to any number of vendors to meet your data annotation needs. When you create a
labeling job or human review workflow, you can specify that the job be routed to a specific vendor.
Important
Before you send sensitive data to a vendor, check the vendor's security and compliance
practices on their detail page and review the end user license agreement (EULA) that is part
of your subscription agreement. You are responsible for ensuring that the vendor meets your
compliance requirements for personal or confidential information. Do not share protected
health information with this workforce.

You must use the console to subscribe to a vendor workforce. Once you have a subscription, you can use
the ListSubscribedWorkteams operation to list your subscribed vendors.
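
For example, a minimal Boto3 sketch of that call:

import boto3

sagemaker = boto3.client("sagemaker")

# List the vendor work teams you are subscribed to.
response = sagemaker.list_subscribed_workteams()
for team in response["SubscribedWorkteams"]:
    print(team["WorkteamArn"], team.get("MarketplaceTitle"))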

To subscribe to a vendor workforce

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose the appropriate page in the SageMaker console.

• For Ground Truth labeling jobs, choose Labeling workforces, choose Vendor, and then choose
Find data labeling services.

• For Amazon A2I human review workflows, choose Human review workforces, choose Vendor, and
then choose Find human review services.
3. The console opens the AWS Marketplace with:

• data labeling services category selected for Ground Truth
• human review services category selected for Amazon A2I

Here you see a list of the vendor services available for this service.
4. Choose a vendor. The AWS Marketplace shows detailed information about the data labeling or
human review service. Use this information to determine if the vendor meets your requirements for
your task.
5. If the vendor meets your requirements, choose Continue to subscribe.
6. Review the details of the subscription. If you agree to the terms, choose Subscribe to complete your
subscription to the service.

Use a Private Workforce


A private workforce is a group of workers that you choose. These can be employees of your company or
a group of subject matter experts from your industry. For example, if the task is to label medical images,
you could create a private workforce of people knowledgeable about the images in question.

Each AWS account has access to a single private workforce per region, and the owner has the ability
to create multiple private work teams within that workforce. A single private work team is used to
complete a labeling job or human review task. You can assign each work team to a separate job
or use a single team for multiple jobs. A single worker can be in more than one work team.

Your private workforce can either be created and managed using Amazon Cognito or your own private
OpenID Connect (OIDC) Identity Provider (IdP).

If you are a new user of Amazon SageMaker Ground Truth or Amazon Augmented AI and do not require
your workers to be managed with your own IdP, it is recommended that you use Amazon Cognito to
create and manage your private workforce.

After you create a workforce, in addition to creating and managing work teams, you can do the
following:

• Track worker performance
• Create and manage Amazon SNS topics to notify workers when labeling tasks are available
• Manage Private Workforce Access to Tasks Using IP Addresses

Note
Your private workforce is shared between Ground Truth and Amazon A2I. To create and manage
private work teams used by Augmented AI, use the Ground Truth section of the SageMaker
console.

Topics
• Create and Manage Amazon Cognito Workforce (p. 869)
• Create and Manage OIDC IdP Workforce (p. 876)
• Manage Private Workforce Using the Amazon SageMaker API (p. 885)
• Track Worker Performance (p. 886)
• Create and manage Amazon SNS topics for your work teams (p. 887)

Create and Manage Amazon Cognito Workforce


Create and manage your private workforce using Amazon Cognito when you want to create your
workforce using the Amazon SageMaker console or you don't want the overhead of managing worker
credentials and authentication. When you create a private workforce with Amazon Cognito, it provides
authentication, authorization, and user management for your private workers.

Topics
• Create a Private Workforce (Amazon Cognito) (p. 869)
• Manage a Private Workforce (Amazon Cognito) (p. 871)

Create a Private Workforce (Amazon Cognito)


When you use Amazon Cognito, you can create a private workforce in one of the following ways:

• Create a new workforce while you are creating your labeling job. To learn how, see Create an Amazon
Cognito Workforce When Creating a Labeling Job (p. 870).
• Create a new workforce before you create your labeling job. To learn how, see Create an Amazon
Cognito Workforce Using the Labeling Workforces Page (p. 870).
• Import an existing workforce after creating a user pool in the Amazon Cognito console. To learn how,
see Create a Private Workforce (Amazon Cognito Console) (p. 871).

Once you create a private workforce, that workforce and all work teams and workers associated with it
are available to use for all Ground Truth labeling job tasks and Amazon Augmented AI human review
workflows tasks.

If you are new to Amazon SageMaker and want to test Ground Truth or Amazon A2I, we suggest that you
create a private work team consisting of people from your organization using the console. Use this work
team when creating labeling or human review workflows (flow definitions) to test your worker UI and job
workflow.

Topics
• Create a Private Workforce (Amazon SageMaker Console) (p. 869)
• Create a Private Workforce (Amazon Cognito Console) (p. 871)

Create a Private Workforce (Amazon SageMaker Console)

You can create a private workforce in the Amazon SageMaker console in one of two ways:

• When creating a labeling job in the Labeling jobs page of the Amazon SageMaker Ground Truth
section.
• Using the Labeling workforces page of the Amazon SageMaker Ground Truth section. If you are
creating a private workforce for an Amazon A2I human review workflow, use this method.

Both of these methods also create a default work team containing all of the members of the
workforce. This private workforce is available to use for both Ground Truth and Amazon Augmented AI
jobs.

When you create a private workforce using the console, SageMaker uses Amazon Cognito as an identity
provider for your workforce. If you want to use your own OpenID Connect (OIDC) Identity Provider (IdP)
to create and manage your private workforce, you must create a workforce using the SageMaker API
operation CreateWorkforce. To learn more, see Create a Private Workforce (OIDC IdP) (p. 876).

Create an Amazon Cognito Workforce When Creating a Labeling Job

If you haven't created a private workforce when you create your labeling job and you choose to use
private workers, you are prompted to create a work team. This will create a private workforce using
Amazon Cognito.

To create a workforce while creating a labeling job (console)

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation pane, choose Labeling jobs and fill in all required fields. For instructions on how to
start a labeling job, see Getting started (p. 527). Choose Next.
3. Choose Private for the workforce type.
4. In the Workers section, enter:

a. The Team name.


b. Email addresses for up to 100 workforce members. Email addresses are case sensitive. Your
workers must log in using the same case used when the address was initially entered. You can
add additional workforce members after the job has been created.
c. The name of your organization. SageMaker uses this to customize the email sent to the workers.
d. A contact email address for workers to report issues related to the task.

When you create the labeling job, an email is sent to each worker inviting them to join the workforce.
After creating the workforce, you can add, delete, and disable workers using the SageMaker console or
the Amazon Cognito console.

Create an Amazon Cognito Workforce Using the Labeling Workforces Page

To create and manage your private workforce using Amazon Cognito, you can use the Labeling
workforces page. When following the instructions below, you have the option to create a private
workforce by entering worker emails or by importing a pre-existing workforce from an Amazon Cognito user
pool. To import a workforce, see Create a Private Workforce (Amazon Cognito Console) (p. 871).

To create a private workforce using worker emails

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation pane, choose Labeling workforces.
3. Choose Private, then choose Create private team.
4. Choose Invite new workers by email.
5. Paste or type a list of up to 50 email addresses, separated by commas, into the email addresses box.
6. Enter an organization name and contact email.
7. Optionally, choose an SNS topic to which to subscribe the team so workers are notified by email
when new Ground Truth labeling jobs become available. Amazon SNS notifications are supported
by Ground Truth and are not supported by Augmented AI. If you subscribe workers to receive SNS
notifications, they only receive notifications about Ground Truth labeling jobs. They do not receive
notifications about Augmented AI tasks.
8. Choose Create private team.

After you create your private workforce, refresh the page. On the Private workforce summary page,
you can see information about the Amazon Cognito user pool for your workforce, a list of work teams for
your workforce, and a list of all of the members of your private workforce.
Note
If you delete all of your private work teams, you have to repeat this process to use a private
workforce in that region.

Create a Private Workforce (Amazon Cognito Console)


Amazon Cognito is used to define and manage your private workforce and your work teams. It is a
service that you can use to create identities for your workers and authenticate these identities with
identity providers. A private workforce corresponds to a single Amazon Cognito user pool. Private work
teams correspond to Amazon Cognito user groups within that user pool.

Example identity providers supported by Amazon Cognito:

• Social sign-in providers such as Facebook and Google


• OpenID Connect (OIDC) providers
• Security Assertion Markup Language (SAML) providers such as Active Directory
• The Amazon Cognito built-in identity provider

For more information, see What Is Amazon Cognito?.

To create a private workforce using Amazon Cognito, you must have an existing Amazon Cognito user
pool containing at least one user group. See Tutorial: Creating a User Pool to learn how to create a user
pool. See Adding Groups to a User Pool to learn how to add a user group to a pool.

Once your user pool has been created, follow the steps below to create a private workforce by importing
that user pool into Amazon SageMaker.

To create a private workforce by importing an Amazon Cognito user pool

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation pane, choose Labeling workforces.
3. Choose Private.
4. Choose Create private team. This creates a private workforce and a work team.
5. Choose Import workers from existing Amazon Cognito user groups.
6. Choose a user pool that you have created. User pools require a domain and an existing user group.
If you get an error that the domain is missing, set it in the Domain name options on the App
integration page of the Amazon Cognito console for your group.
7. Choose an app client. We recommend using a client generated by SageMaker.
8. Choose a user group from your pool to import its members.
9. Optionally choose an Amazon Simple Notification Service (Amazon SNS) topic to which to subscribe
the team so that workers are notified by email when new labeling jobs become available. Amazon
SNS notifications are supported by Ground Truth and are not supported by Augmented AI. If you
subscribe workers to receive SNS notifications, they only receive notifications about Ground Truth
labeling jobs. They do not receive notifications about Augmented AI tasks.
10. Choose Create private team.

Important
After you create a workforce using an Amazon Cognito user pool, it should not be deleted
without first deleting all work teams associated with that pool in the SageMaker console.

After you import your private workforce, refresh the page to see the Private workforce summary page.
On this page, you can see information about the Amazon Cognito user pool for your workforce, a list of
work teams for your workforce, and a list of all of the members of your private workforce. This workforce
is now available to use in both Amazon Augmented AI and Amazon SageMaker Ground Truth for human
review tasks and data labeling jobs respectively.

Manage a Private Workforce (Amazon Cognito)


After you have created a private workforce using Amazon Cognito, you can create and manage work
teams using the Amazon SageMaker console and API operations.

You can do the following using either the SageMaker console or Amazon Cognito console.

• Add and delete work teams.


• Add workers to your workforce and one or more work teams.
• Disable or remove workers from your workforce and one or more work teams. If you add workers to a
workforce using the Amazon Cognito console, you must use the same console to remove the worker
from the workforce.

You can restrict access to tasks to workers at specific IP addresses using the SageMaker API. For more
information, see Manage Private Workforce Using the Amazon SageMaker API (p. 885).

Topics
• Manage a Workforce (Amazon SageMaker Console) (p. 872)
• Manage a Private Workforce (Amazon Cognito Console) (p. 874)

Manage a Workforce (Amazon SageMaker Console)

You can use the Amazon SageMaker console to create and manage the work teams and individual
workers that make up a private workforce.

Use a work team to assign members of your private workforce to a labeling or human review job. When
you create your workforce using the SageMaker console, there is a work team called Everyone-in-
private-workforce that enables you to assign your entire workforce to a job. Because an imported
Amazon Cognito user pool may contain members that you don't want to include in your work teams, a
similar work team is not created for Amazon Cognito user pools.

You have two choices to create a new work team:

• You can create a work team in the SageMaker console and add members from your workforce to the
team.
• You can create a user group by using the Amazon Cognito console and then create a work team by
importing the user group. You can import more than one user group into each work team. You manage
the members of the work team by updating the user group in the Amazon Cognito console. See
Manage a Private Workforce (Amazon Cognito Console) (p. 874) for more information.

Create a Work Team Using the SageMaker Console

You can create a new Amazon Cognito user group or import an existing user group using the SageMaker
console, on the Labeling workforces page. For more information on creating a user group in the Amazon
Cognito console, see Manage a Private Workforce (Amazon Cognito Console) (p. 874).

To create a work team using the SageMaker console

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Labeling workforces from the left menu.
3. Under Private, choose Create private team.
4. Under Team details, enter a Team name. The name must be unique in your account in an AWS
Region.
5. Under Add workers, choose a method to add workers to the team using a user group.

• If you chose Create a team by adding workers to a new Amazon Cognito user group, select the
workers to add to the team.
• If you chose Create a team by importing existing Amazon Cognito user groups, choose the user
groups that are part of the new team.

6. If you select an SNS topic, all workers added to the team are subscribed to the Amazon SNS topic
and notified when new work items are available to the team. Select from a list of your existing
Ground Truth related Amazon SNS topics or select Create new topic to open a topic-creation dialog.

Amazon SNS notifications are supported by Ground Truth and are not supported by Augmented AI.
If you subscribe workers to receive SNS notifications, they only receive notifications about Ground
Truth labeling jobs. They do not receive notifications about Augmented AI tasks.

Workers in a work team subscribed to a topic receive notifications when a new Ground Truth labeling job
for that team becomes available and when one is about to expire.

Read Create and manage Amazon SNS topics for your work teams (p. 887) for more information about
using Amazon SNS topics.

Subscriptions

After you have created a work team, you can see more information about the team and change or set the
Amazon SNS topic to which its members are subscribed by visiting the Amazon Cognito console. If you
added any team members before you subscribed the team to a topic, you need to manually subscribe
those members to that topic. Read Create and manage Amazon SNS topics for your work teams for more
information on creating and managing the Amazon SNS topic.

Add or Remove Workers

A work team is a group of workers within your workforce to whom you can assign jobs. A worker can be
added to more than one work team. Once a worker has been added to a work team, that worker can be
disabled or removed.

Add Workers to the Workforce

Adding a worker to the workforce enables you to add that worker to any work team within that
workforce.

To add workers using the private workforce summary page

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Labeling workforces to navigate to your private workforce summary page.
3. Choose Private.
4. Choose Invite new workers.
5. Paste or type a list of email addresses, separated by commas, into the email addresses box. You can
have up to 50 email addresses in this list.

Add a Worker to a Work Team

A worker must be added to the workforce before being added to a work team. To add a worker to a work
team, first navigate to the Private workforce summary page using the steps above.

To add a worker to a work team from the private workforce summary page

1. In the Private teams section, choose the team to which you want to add the workers.
2. Choose the Workers tab.
3. Choose Add workers to team and choose the boxes next to the workers that you want to add.
4. Choose Add workers to team.

Disable and Remove a Worker from the Workforce

Disabling a worker stops the worker from receiving a job. This action does not remove the worker from
the workforce, or from any work team with which the worker is associated. To disable or remove a worker
from a work team, first navigate to the private workforce summary page using the steps above.

To deactivate a worker using the private workforce summary page

1. In the Workers section, choose the worker that you would like to disable.
2. Choose Disable.

If desired, you can subsequently Enable a worker after they have been disabled.

You can remove workers from your private workforce directly in the SageMaker console if that worker
was added in this console. If you added the worker (user) in the Amazon Cognito console, see Manage
a Private Workforce (Amazon Cognito Console) (p. 874) to learn how to remove the worker in the
Amazon Cognito console.

To remove a worker using the private workforce summary page

1. In the Workers section, choose the worker that you would like to delete.
2. If the worker has not been disabled, choose Disable.
3. Select the worker and choose Delete.

Manage a Private Workforce (Amazon Cognito Console)

A private workforce corresponds to a single Amazon Cognito user pool. Private work teams correspond
to Amazon Cognito user groups within that user pool. Workers correspond to Amazon Cognito users
within those groups.

After your workforce has been created, you can add work teams and individual workers through the
Amazon Cognito console. You can also delete workers from your private workforce or remove them from
individual teams in the Amazon Cognito console.
Important
You can't delete work teams from the Amazon Cognito console. Deleting an Amazon Cognito user
group that is associated with an Amazon SageMaker work team will result in an error. To remove
work teams, use the SageMaker console.

Create Work Teams (Amazon Cognito Console)

You can create a new work team to complete a job by adding an Amazon Cognito user group to the user
pool associated with your private workforce. To add an Amazon Cognito user group to an existing user
pool, see Adding groups to a User Pool.

To create a work team using an existing Amazon Cognito user group

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation pane, choose Workforces.
3. For Private teams, choose Create private team.
4. Under Team details, give the team a name. The name must be unique in your account in an AWS
Region.
5. For Add workers, choose Import existing Amazon Cognito user groups, and choose one or more
user groups that are part of the new team.
6. If you choose an SNS topic, all workers added to the team are subscribed to the Amazon Simple
Notification Service (Amazon SNS) topic and notified when new work items are available to the
team. Choose from a list of your existing SNS topics related to SageMaker Ground Truth or Amazon
Augmented AI or choose Create new topic to create one.
Note
Amazon SNS notifications are supported by Ground Truth and are not supported by
Augmented AI. If you subscribe workers to receive SNS notifications, they only receive
notifications about Ground Truth labeling jobs. They do not receive notifications about
Augmented AI tasks.

Subscriptions

After you have created a work team, you can see more information about the team and change or set
the SNS topic to which its members are subscribed using the Amazon Cognito console. If you added
any team members before you subscribed the team to a topic, you need to manually subscribe those
members to that topic. For more information, see Create and manage Amazon SNS topics for your work
teams (p. 887).

Add and Remove Workers (Amazon Cognito Console)

When using the Amazon Cognito console to add workers to a work team, you must add a user to the user
pool associated with the workforce before adding that user to a user group. Users can be added to a user
pool in various ways. For more information, see Signing Up and Confirming User Accounts.

Add a Worker to a Work Team

After a user has been added to a pool, the user can be associated with user groups inside of that pool.
After a user has been added to a user group, that user becomes a worker on any work team created using
that user group.

To add a user to a user group

1. Open the Amazon Cognito console: https://console.aws.amazon.com/cognito/.


2. Choose Manage User Pools.
3. Choose the user pool associated with your SageMaker workforce.
4. Under General Settings, choose Users and Groups and do one of the following:
• Choose Groups, choose the group that you want to add the user to, and choose Add users.
Choose the users that you want to add by choosing the plus-icon to the right of the user's
name.
• Choose Users, choose the user that you want to add to the user group, and choose Add to
group. From the dropdown menu, choose the group and choose Add to group.

Disable and Remove a Worker From a Work Team

Disabling a worker stops the worker from receiving jobs. This action doesn't remove the worker from the
workforce, or from any work team the worker is associated with. To remove a user from a work team in
Amazon Cognito, you remove the user from the user group associated with that team.

To deactivate a worker (Amazon Cognito console)

1. Open the Amazon Cognito console: https://console.aws.amazon.com/cognito/.


2. Choose Manage User Pools.
3. Choose the user pool associated with your SageMaker workforce.
4. Under General Settings, choose Users and Groups.
5. Choose the user that you want to disable.
6. Choose Disable User.

You can enable a disabled user by choosing Enable User.

To remove a user from a user group (Amazon Cognito console)

1. Open the Amazon Cognito console: https://console.aws.amazon.com/cognito/.


2. Choose Manage User Pools.
3. Choose the user pool associated with your SageMaker workforce.
4. Under General Settings, choose Users and Groups.
5. On the Users tab, choose the X icon to the right of the group from which you want to remove the user.

Create and Manage OIDC IdP Workforce


Create a private workforce using an OpenID Connect (OIDC) Identity Provider (IdP) when you want to
manage and authenticate your workers using your own OIDC IdP. Individual worker credentials and other
data will be kept private. Ground Truth and Amazon A2I will only have visibility into worker information
you provide through the claims that you send to these services. To create a workforce using an OIDC IdP,
your IdP must support groups because Ground Truth and Amazon A2I map one or more groups in your
IdP to a work team. To learn more, see Send Required and Optional Claims to Ground Truth and Amazon
A2I (p. 876).

If you are a new user of Ground Truth or Amazon A2I, you can test your worker UI and job workflow by
creating a private work team and adding yourself as a worker. Use this work team when you create a
labeling job or human review workflow. First, create a private OIDC IdP workforce using the instructions
in Create a Private Workforce (OIDC IdP) (p. 876). Next, refer to Manage a Private Workforce (OIDC
IdP) (p. 882) to learn how to create a work team.

Topics
• Create a Private Workforce (OIDC IdP) (p. 876)
• Manage a Private Workforce (OIDC IdP) (p. 882)

Create a Private Workforce (OIDC IdP)


Create a private workforce using an OpenID Connect (OIDC) Identity Provider (IdP) when you want
to authenticate and manage workers using your own identity provider. Use this page to learn how to
configure your IdP to communicate with Amazon SageMaker Ground Truth (Ground Truth) or Amazon
Augmented AI (Amazon A2I) and to learn how to create a workforce using your own IdP.

To create a workforce using an OIDC IdP, your IdP must support groups because Ground Truth and
Amazon A2I use one or more groups that you specify to create work teams. You use work teams to
specify workers for your labeling jobs and human review tasks. Because groups are not a standard
claim, your IdP may have a different naming convention for a group of users (workers). Therefore,
you must identify one or more user groups to which a worker belongs using the custom claim
sagemaker:groups that is sent to Ground Truth or Amazon A2I from your IdP. To learn more, see Send
Required and Optional Claims to Ground Truth and Amazon A2I (p. 876).

You create an OIDC IdP workforce using the SageMaker API operation CreateWorkforce. Once
you create a private workforce, that workforce and all work teams and workers associated with it are
available to use for all Ground Truth labeling job tasks and Amazon A2I human review workflows tasks.
To learn more, see Create an OIDC IdP Workforce (p. 878).

Send Required and Optional Claims to Ground Truth and Amazon A2I

When you use your own IdP, Ground Truth and Amazon A2I use your Issuer, ClientId,
and ClientSecret to authenticate workers by obtaining an authentication CODE from your
AuthorizationEndpoint.

Ground Truth and Amazon A2I will use this CODE to obtain a custom claim from either your IdP's
TokenEndpoint or UserInfoEndpoint. You can either configure TokenEndpoint to return a JSON
web token (JWT) or UserInfoEndpoint to return a JSON object. The JWT or JSON object must contain
required and optional claims that you specify. A claim is a key-value pair that contains information about
a worker or metadata about the OIDC service. The following table lists the claims that must be included,
and that can optionally be included in the JWT or JSON object that your IdP returns.
Note
Some of the parameters in the following table can be specified using a : or a -. For example,
you can specify the groups a worker belongs to using sagemaker:groups or sagemaker-
groups in your claim.

sagemaker:groups or sagemaker-groups

Required: Yes.
Accepted format and values: A string if the worker belongs to a single group; a list of up to 10 strings if the worker belongs to multiple groups. Allowable characters (regex): [\p{L}\p{M}\p{S}\p{N}\p{P}]+. Quotas: 10 groups per worker; 63 characters per group name.
Description: Assigns a worker to one or more groups. Groups are used to map the worker into work teams.
Example: A worker that belongs to a single group: "work_team1". A worker that belongs to more than one group: ["work_team1", "work_team2"].

sagemaker:sub or sagemaker-sub

Required: Yes.
Accepted format and values: Data type: String.
Description: This is mandatory to track a worker identity inside the Ground Truth platform for auditing and to identify tasks worked on by that worker. For ADFS, customers must use the Primary Security Identifier (SID).
Example: "111011101-123456789-3687056"

sagemaker:client_id or sagemaker-client_id

Required: Yes.
Accepted format and values: Data type: String. Allowable characters (regex): [\w+-]+. Quota: 128 characters.
Description: A client ID. All tokens must be issued for this client ID.
Example: "00b600bb-1f00-05d0-bd00-00be00fbd0e0"

sagemaker:name or sagemaker-name

Required: Yes.
Accepted format and values: Data type: String.
Description: The worker name to be displayed in the worker portal.
Example: "Jane Doe"

email

Required: No.
Accepted format and values: Data type: String.
Description: The worker email. Ground Truth uses this email to notify workers that they have been invited to work on labeling tasks. Ground Truth will also use this email to notify your workers when labeling tasks become available if you set up an Amazon SNS topic for a work team that this worker is on.
Example: "example-[email protected]"

email_verified

Required: No.
Accepted format and values: Data type: Bool. Accepted values: True, False.
Description: Indicates if the user email was verified or not.
Example: True

The following is an example of the JSON object syntax your UserInfoEndpoint can return.

{
    "sub":"122",
    "exp":"10000",
    "sagemaker-groups":["group1","group2"],
    "sagemaker-name":"name",
    "sagemaker-sub":"122",
    "sagemaker-client_id":"123456"
}

Ground Truth or Amazon A2I compares the groups listed in sagemaker:groups or sagemaker-groups
to verify that your worker belongs to the work team specified in the labeling job or human review task.
After the work team has been verified, labeling or human review tasks are sent to that worker.

Create an OIDC IdP Workforce

You can create a workforce using the SageMaker API operation CreateWorkforce and associated
language-specific SDKs. Specify a WorkforceName and information about your OIDC IdP in the
parameter OidcConfig. It is recommended that you configure your OIDC IdP with a placeholder redirect
URI, and then update the URI with the worker portal URL after you create the workforce. To learn more,
see Configure your OIDC IdP (p. 879).

The following shows an example of the request. See CreateWorkforce to learn more about each
parameter in this request.

CreateWorkforceRequest: {
    #required fields
    WorkforceName: "example-oidc-workforce",
    OidcConfig: {
        ClientId: "clientId",
        ClientSecret: "secret",
        Issuer: "https://example-oidc-idp.com/adfs",
        AuthorizationEndpoint: "https://example-oidc-idp.com/adfs/oauth2/authorize",
        TokenEndpoint: "https://example-oidc-idp.com/adfs/oauth2/token",
        UserInfoEndpoint: "https://example-oidc-idp.com/adfs/oauth2/userInfo",
        LogoutEndpoint: "https://example-oidc-idp.com/adfs/oauth2/log-out",
        JwksUri: "https://example-oidc-idp.com/adfs/discovery/keys"
    },
    SourceIpConfig: {
        Cidrs: ["string", "string"]
    }
}

Configure your OIDC IdP

How you configure your OIDC IdP depends on the IdP you use, and your business requirements.

When you configure your IdP, you must specify a callback or redirect URI. After Ground Truth or
Amazon A2I authenticates a worker, this URI will redirect the worker to the worker portal where the
workers can access labeling or human review tasks. To create a worker portal URL, you need to create a
workforce with your OIDC IdP details using the CreateWorkforce API operation. Specifically, you must
configure your OIDC IdP with required custom sagemaker claims (see the next section for more details).
Therefore, it is recommended that you configure your OIDC IdP with a placeholder redirect URI, and then
update the URI after you create the workforce. See Create an OIDC IdP Workforce (p. 878) to learn how
to create a workforce using this API.

You can view your worker portal URL in the SageMaker Ground Truth console, or using the SageMaker
API operation, DescribeWorkforce. The worker portal URL is in the SubDomain parameter in the
response.
Important
Make sure you add the workforce subdomain to your OIDC IdP allow list. When you add the
subdomain to your allow list, it must end with /oauth2/idpresponse.

To view your worker portal URL after creating a private workforce (Console):

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. In the navigation pane, choose Labeling workforces.
3. Select the Private tab.
4. In Private workforce summary you will see Labeling portal sign-in URL. This is your worker portal
URL.

To view your worker portal URL after creating a private workforce (API):

When you create a private workforce using CreateWorkforce, you specify a WorkforceName. Use this
name to call DescribeWorkforce. The following table includes examples of requests using the AWS
CLI and AWS SDK for Python (Boto3).

SDK for Python (Boto3)

response = client.describe_workforce(WorkforceName='string')
print(f"The workforce subdomain is: {response['Workforce']['SubDomain']}")

AWS CLI

aws sagemaker describe-workforce --workforce-name 'string'

Validate Your OIDC IdP Workforce Authentication Response

After you have created your OIDC IdP workforce, you can use the following procedure to validate its
authentication workflow using cURL. This procedure assumes you have access to a terminal, and that you
have cURL installed.

To validate your OIDC IdP authorization response:

1. Get an authorization code using a URI configured as follows:

{AUTHORIZE ENDPOINT}?client_id={CLIENT ID}&redirect_uri={REDIRECT URI}&scope={SCOPE}&response_type=code

a. Replace {AUTHORIZE ENDPOINT} with the authorize endpoint for your OIDC IdP.
b. Replace {CLIENT ID} with the Client ID from your OAuth client.
c. Replace {REDIRECT URI} with the worker portal URL. If it is not already present, you must add
/oauth2/idpresponse to the end of the URL.
d. If you have a custom scope, use it to replace {SCOPE}. If you do not have a custom scope,
replace {SCOPE} with openid.

The following is an example of a URI after the modifications above are made:

https://example.com/authorize?client_id=f490a907-9bf1-4471-97aa-6bfd159f81ac&redirect_uri=https%3A%2F%2Fexample.labeling.sagemaker.aws%2Foauth2%2Fidpresponse&response_type=code&scope=openid

2. Copy and paste the modified URI from step 1 into your browser and press Enter on your keyboard.
3. Authenticate using your IdP.
4. Copy the authentication code query parameter in the URI. This parameter begins with code=.
The following is an example of what the response might look like. In this example, copy
code=MCNYDB... and everything thereafter.

https://example.labeling.sagemaker.aws/oauth2/idpresponse?code=MCNYDB....

5. Open a terminal and enter the following command after making required modifications listed
below:

curl --request POST \
  --url '{TOKEN ENDPOINT}' \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data grant_type=authorization_code \
  --data 'client_id={CLIENT ID}' \
  --data client_secret={CLIENT SECRET} \
  --data code={CODE} \
  --data 'redirect_uri={REDIRECT URI}'

a. Replace {TOKEN ENDPOINT} with the token endpoint for your OIDC IdP.

b. Replace {CLIENT ID} with the Client ID from your OAuth client.
c. Replace {CLIENT SECRET} with the Client Secret from your OAuth client.
d. Replace {CODE} with the authentication code query parameter you copied in step 4.
e. Replace {REDIRECT URI} with the worker portal URL.

The following is an example of the cURL request after making the modifications described above:

curl --request POST \
  --url 'https://example.com/token' \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data grant_type=authorization_code \
  --data 'client_id=f490a907-9bf1-4471-97aa-6bfd159f81ac' \
  --data client_secret=client-secret \
  --data code=MCNYDB... \
  --data 'redirect_uri=https://example.labeling.sagemaker.aws/oauth2/idpresponse'

6. This step depends on the type of access_token your IdP returns: a plain text access token or a JWT
access token.

• If your IdP does not support JWT access tokens, access_token may be plain text (for example, a
UUID). The response you see may look similar to the following. In this case, move to step 7.

{
"access_token":"179c144b-fccb-4d96-a28f-eea060f39c13",
"token_type":"Bearer",
"expires_in":3600,
"refresh_token":"ef43e52e-9b4f-410c-8d4c-d5c5ee57631a",
"scope":"openid"
}

• If your IdP supports JWT access tokens, step 5 should generate an access token in JWT format. For
example, the response may look similar to the following:

{
"access_token":"eyJh...JV_adQssw5c",
"refresh_token":"i6mapTIAVSp2oJkgUnCACKKfZxt_H5MBLiqcybBBd04",
"refresh_token_expires_in":6327,
"scope":"openid",
"id_token":"eyJ0eXAiOiJK9...-rDaQzUHl6cQQWNiDpWOl_lxXjQEvQ"
}

Copy the JWT and decode it. You can use a Python script or a third-party website to decode it. For
example, you can go to the website https://jwt.io/ and paste the JWT into the Encoded box to
decode it.

Make sure the decoded response contains the following:


• The Required SageMaker claims in the table found in Send Required and Optional Claims to
Ground Truth and Amazon A2I (p. 876). If it does not, you must reconfigure your OIDC IdP to
contain these claims.
• The Issuer you specified when you set up the IdP workforce.
7. In a terminal, enter the following command after making the required modifications listed below:

curl -X POST -H 'Authorization: Bearer {ACCESS TOKEN}' -d '' -k -v {USERINFO ENDPOINT}

a. Replace {USERINFO ENDPOINT} with the user info endpoint for your OIDC IdP.

b. Replace {ACCESS TOKEN} with the access token in the response you received in step 5. This is
the entry for the "access_token" parameter.

The following is an example of the cURL request after making the modifications described above:

curl -X POST -H 'Authorization: Bearer eyJ0eX...' -d '' -k -v https://example.com/userinfo

8. The response to the final step in the procedure above may look similar to the following code block.

If the access_token returned in step 6 was plain text, you must verify that this response contains
required information. In this case, the response must contain the Required SageMaker claims in the
table found in Send Required and Optional Claims to Ground Truth and Amazon A2I (p. 876). For
example, sagemaker-groups, sagemaker-name.

{
    "sub":"122",
    "exp":"10000",
    "sagemaker-groups":["group1","group2"],
    "sagemaker-name":"name",
    "sagemaker-sub":"122",
    "sagemaker-client_id":"123456"
}

Next Steps

Once you've created a private workforce using your IdP and verified your IdP authentication response,
you can create work teams using your IdP groups. To learn more, see Manage a Private Workforce (OIDC
IdP) (p. 882).

You can restrict worker access to tasks to specific IP addresses, and update or delete your workforce
using the SageMaker API. To learn more, see Manage Private Workforce Using the Amazon SageMaker
API (p. 885).

Manage a Private Workforce (OIDC IdP)


Once you've created a private workforce using your OpenID Connect (OIDC) Identity Provider (IdP), you
can manage your workers using your IdP. For example, you can add, remove, and group workers directly
through your IdP.

To add workers to an Amazon SageMaker Ground Truth (Ground Truth) labeling job or Amazon
Augmented AI (Amazon A2I) human review task, you create work teams using 1-10 IdP groups and
assign that work team to the job or task. You assign a work team to a job or task by specifying that work
team when you create a labeling job (Ground Truth) or a human review workflow (Amazon A2I).

You can only assign one team to each labeling job or human review workflow. You can use the same
team to create multiple labeling jobs or human review tasks. You can also create multiple work teams to
work on different labeling jobs or human review tasks.

Prerequisites

To create and manage private work teams using your OIDC IdP groups, first you must create a workforce
using the SageMaker API operation CreateWorkforce. To learn more, see Create a Private Workforce
(OIDC IdP) (p. 876).

Add work teams

You can use the SageMaker console to create a private work team using your OIDC IdP workforce on the
Labeling workforces page under Ground Truth. If you are creating a Ground Truth labeling job, you can
also create a private work team while creating a labeling job.
Note
You create and manage work teams for Amazon A2I in the Ground Truth area of the SageMaker
console.

You can also use the SageMaker API and associated language-specific SDKs to create a private work
team.

Use the following procedures to learn how to create a private work team using the SageMaker console
and API.

To create a private work team on the Labeling workforces page (console)

1. Go to the Ground Truth area of the SageMaker console: https://console.aws.amazon.com/sagemaker/groundtruth.
2. Select Labeling workforces.
3. Select Private.
4. In the Private teams section, select Create private team.
5. In the Team details section, enter a Team name.
6. In the Add workers section, enter the name of a single user group. All workers associated with this
group in your IdP are added to this work team.
7. To add more than one user group, select Add new user group and enter the names of the user
groups you want to add to this work team. Enter one user group per line.
8. (Optional) For Ground Truth labeling jobs, if you select an SNS topic and provide an email for each
worker in your JWT, Ground Truth notifies workers when a new labeling task is available.
9. Select Create private team.

To create a private work team while creating a Ground Truth labeling job (console)

1. Go to the Ground Truth area of the SageMaker console: https://console.aws.amazon.com/sagemaker/groundtruth.
2. Select Labeling jobs.
3. Use the instructions in Create a Labeling Job (Console) (p. 706) to create a labeling job. Stop when
you get to the Workers section on the second page.
4. Select Private for your worker type.
5. Enter a Team name.
6. In the Add workers section, enter the name of a single user group under User groups. All workers
associated with this group in your IdP are added to this work team.
Important
The group names you specify for User groups must match the group names specified in
your OIDC IdP.
7. To add more than one user group, select Add new user group and enter the names of the user
groups you want to add to this work team. Enter one user group per line.
8. Complete all remaining steps to create your labeling job.

The private team that you create is used for this labeling job, and is listed in the Labeling workforces
section of the SageMaker console.

To create a private work team using the SageMaker API

You can create a private work team using the SageMaker API operation CreateWorkteam.

When you use this operation, list all user groups that you want included in the work team in the
OidcMemberDefinition parameter Groups.
Important
The group names you specify for Groups must match the group names specified in your OIDC
IdP.

For example, if your user group names are group1, group2, and group3 in your OIDC IdP, configure
OidcMemberDefinition as follows:

"OidcMemberDefinition": {
"Groups": ["group1", "group2", "group3"]
}

Additionally, you must give the work team a name using the WorkteamName parameter.
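
The following Boto3 sketch puts these pieces together. The team name, workforce name, and description are placeholder values, and the group names must match the group names in your OIDC IdP.

import boto3

# A sketch with placeholder names: create a work team backed by OIDC IdP groups.
client = boto3.client('sagemaker')

response = client.create_workteam(
    WorkteamName='example-oidc-work-team',
    WorkforceName='example-oidc-workforce',
    MemberDefinitions=[{
        'OidcMemberDefinition': {
            'Groups': ['group1', 'group2', 'group3']
        }
    }],
    Description='Example work team backed by OIDC IdP groups'
)
print(response['WorkteamArn'])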

Add or remove IdP groups from work teams

After you've created a work team, you can use the SageMaker API to manage that work team. Use the
UpdateWorkteam operation to update the IdP user groups included in that work team.

• Use the WorkteamName parameter to identify the work team that you want to update.
• When you use this operation, list all user groups that you want included in the work team in the
OidcMemberDefinition parameter Groups. If a user group is associated with a work team and you
do not include it in this list, that user group is no longer associated with this work team.
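
For example, the following Boto3 sketch, with placeholder names, replaces the full set of groups on a work team:

import boto3

# A sketch with placeholder names: update the complete list of IdP groups on a team.
client = boto3.client('sagemaker')

client.update_workteam(
    WorkteamName='example-oidc-work-team',
    MemberDefinitions=[{
        'OidcMemberDefinition': {
            # group2 is omitted, so it is no longer associated with this work team.
            'Groups': ['group1', 'group3']
        }
    }]
)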

Delete a work team

You can delete a work team using the SageMaker console and SageMaker API.

To delete a private work team in the SageMaker console

1. Go to the Ground Truth area of the SageMaker console: https://console.aws.amazon.com/sagemaker/groundtruth.
2. Select Labeling workforces.
3. Select Private.
4. In the Private teams section, select the work team that you want to delete.
5. Select Delete.

To delete a private work team (API)

You can delete a private work team using the SageMaker API operation DeleteWorkteam.
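
For example, a minimal Boto3 sketch with a placeholder team name:

import boto3

# A minimal sketch: delete a private work team by name.
client = boto3.client('sagemaker')
client.delete_workteam(WorkteamName='example-oidc-work-team')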

Manage Individual Workers

When you create a workforce using your own OIDC IdP, you cannot use Ground Truth or Amazon A2I to
manage individual workers.

• To add a worker to a work team, add that worker to a group associated with that work team.
• To remove a worker from a work team, remove that worker from all user groups associated with that
work team.

Update, Delete, and Describe Your Workforce


You can update, delete, and describe your OIDC IdP workforce using the SageMaker API. The following
is a list of API operations that you can use to manage your workforce. For additional details, including
how you can locate your workforce name, see Manage Private Workforce Using the Amazon SageMaker
API (p. 885).

• UpdateWorkforce – You may want to update a workforce created using your own OIDC IdP to specify
a different authorization endpoint, token endpoint, or issuer. You can update any parameter found in
OidcConfig using this operation.

You can only update your OIDC IdP configuration when there are no work teams associated with your
workforce. To learn how to delete work teams, see Delete a work team (p. 884).
• DeleteWorkforce – Use this operation to delete your private workforce. If you have any work teams
associated with your workforce, you must delete those work teams before you delete your workforce.
For more information, see Delete a work team (p. 884).
• DescribeWorkforce – Use this operation to list private workforce information, including workforce
name, Amazon Resource Name (ARN), and, if applicable, allowed IP address ranges (CIDRs).

Manage Private Workforce Using the Amazon SageMaker API


You can use Amazon SageMaker API operations to manage, update, and delete your private workforce.
For each API operation linked on this page, you can find a list of supported language-specific SDKs and
their documentation in the See Also section of the API documentation.

Find Your Workforce Name


Some of the SageMaker workforce-related API operations require your workforce name as input. You can
see your Amazon Cognito or OIDC IdP private and vendor workforce names in an AWS Region using the
ListWorkforces API operation in that AWS Region.
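
For example, a minimal Boto3 sketch:

import boto3

# A minimal sketch: list workforce names and ARNs in the current AWS Region.
client = boto3.client('sagemaker')

response = client.list_workforces()
for workforce in response['Workforces']:
    print(workforce['WorkforceName'], workforce['WorkforceArn'])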

If you created your workforce using your own OIDC IdP, you can find your workforce name in the Ground
Truth area of the SageMaker console.

To find your workforce name in the SageMaker console

1. Go to the Ground Truth area of the SageMaker console: https://console.aws.amazon.com/sagemaker/groundtruth.
2. Select Labeling workforces.
3. Select Private.
4. In the Private workforce summary section, locate your workforce ARN. Your workforce name
is located at the end of this ARN. For example, if the ARN is arn:aws:sagemaker:us-
east-2:111122223333:workforce/example-workforce, the workforce name is example-
workforce.

Restrict Worker Access to Tasks to Allowable IP Addresses


By default, a workforce isn't restricted to specific IP addresses. You can use the UpdateWorkforce
operation to require that workers use a specific range of IP addresses (CIDRs) to access tasks. If you
specify one or more CIDRs, workers who attempt to access tasks using any IP address outside the
specified ranges are denied and will get an HTTP 204 No Content error message on the worker portal. You
can specify up to 10 CIDR values using UpdateWorkforce.

After you have restricted your workforce to one or more CIDRs, the output of UpdateWorkforce lists all
allowable CIDRs. You can also use the DescribeWorkforce operation to view all allowable CIDRs for a
workforce.
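
For example, the following Boto3 sketch restricts a workforce to two CIDR ranges; the workforce name and the CIDR values are placeholders.

import boto3

# A sketch with placeholder values: allow task access only from these CIDR ranges.
client = boto3.client('sagemaker')

response = client.update_workforce(
    WorkforceName='example-workforce',
    SourceIpConfig={'Cidrs': ['203.0.113.0/24', '198.51.100.0/24']}
)
# The response lists all allowable CIDRs for the workforce.
print(response['Workforce']['SourceIpConfig']['Cidrs'])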

Update OIDC Identity Provider Workforce Configuration


You may want to update a workforce created using your own OIDC IdP to specify a different
authorization endpoint, token endpoint, or issuer. You can update any parameter found in OidcConfig
using the UpdateWorkforce operation.
Important
You can only update your OIDC IdP configuration when there are no work teams associated with
your workforce. You can delete a private work team using the DeleteWorkteam operation.
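
For example, the following Boto3 sketch updates the OIDC configuration; the endpoints mirror the placeholder values from the CreateWorkforce example earlier in this section.

import boto3

# A sketch with placeholder endpoints: point the workforce at an updated IdP configuration.
client = boto3.client('sagemaker')

client.update_workforce(
    WorkforceName='example-oidc-workforce',
    OidcConfig={
        'ClientId': 'clientId',
        'ClientSecret': 'secret',
        'Issuer': 'https://example-oidc-idp.com/adfs',
        'AuthorizationEndpoint': 'https://example-oidc-idp.com/adfs/oauth2/authorize',
        'TokenEndpoint': 'https://example-oidc-idp.com/adfs/oauth2/token',
        'UserInfoEndpoint': 'https://example-oidc-idp.com/adfs/oauth2/userInfo',
        'LogoutEndpoint': 'https://example-oidc-idp.com/adfs/oauth2/log-out',
        'JwksUri': 'https://example-oidc-idp.com/adfs/discovery/keys'
    }
)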

Delete a Private Workforce


You can only have one private workforce in each AWS Region. You may want to delete your private
workforce in an AWS Region when:

• You want to create a workforce using a new Amazon Cognito user pool.
• You have already created a private workforce using Amazon Cognito and you want to create a
workforce using your own OpenID Connect (OIDC) Identity Provider (IdP).

To delete a private workforce, use the DeleteWorkforce API operation. If you have any work teams
associated with your workforce, you must delete those work teams before you delete your workforce.
You can delete a private work team using the DeleteWorkteam operation.
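
For example, the following Boto3 sketch deletes every work team and then the workforce; the workforce name is a placeholder, and it assumes all listed work teams belong to the workforce you are deleting.

import boto3

# A sketch with a placeholder name: delete all associated work teams, then the workforce.
client = boto3.client('sagemaker')

for workteam in client.list_workteams()['Workteams']:
    client.delete_workteam(WorkteamName=workteam['WorkteamName'])

client.delete_workforce(WorkforceName='example-workforce')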

Track Worker Performance


Amazon SageMaker Ground Truth logs worker events to Amazon CloudWatch, such as when a worker
starts or submits a task. Use Amazon CloudWatch metrics to measure and track throughput across a
team or for individual workers.
Important
Worker event tracking is not available for Amazon Augmented AI human review workflows.

Enable Tracking
During the set-up process for a new work team, the permissions for Amazon CloudWatch logging of
worker events are created. Since this feature was added in August 2019, work teams created prior to that
may not have the correct permissions. If all of your work teams were created before August 2019, create
a new work team. It does not need any members and may be deleted after creation, but by creating it,
you establish the permissions and apply them to all of your work teams, regardless of when they were
created.

Examine Logs
After tracking is enabled, the activity of your workers is logged. Open the Amazon CloudWatch
console and choose Logs in the navigation pane. You should see a log group named /aws/sagemaker/groundtruth/WorkerActivity.

Each completed task is represented by a log entry, which contains information about the worker, their
team, the job, when the task was accepted, and when it was submitted.

Example Log entry

{
"worker_id": "cd449a289e129409",
"cognito_user_pool_id": "us-east-2_IpicJXXXX",
"cognito_sub_id": "d6947aeb-0650-447a-ab5d-894db61017fd",

886
Amazon SageMaker Developer Guide
Use a Private Workforce

"task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",


"task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
"task_returned_time": "",
"task_declined_time": "",
"workteam_arn": "arn:aws:sagemaker:us-east-2:############:workteam/private-crowd/Sample-
labeling-team",
"labeling_job_arn": "arn:aws:sagemaker:us-east-2:############:labeling-job/metrics-demo",
"work_requester_account_id": "############",
"job_reference_code": "############",
"job_type": "Private",
"event_type": "TasksSubmitted",
"event_timestamp": "1565798464"
}

A useful data point in each event is the cognito_sub_id. You can match that to an individual worker.

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Under the Ground Truth section, choose Workforces.
3. Choose Private.
4. Choose the name of a team in the Private teams section.
5. In the Team summary section, choose the user group identified under Amazon Cognito user group.
That will take you to the group in the Amazon Cognito console.
6. The Group page lists the users in the group. Choose any user's link in the Username column to see
more information about the user, including a unique sub ID.

To get information about all of the team's members, use the ListUsers action (examples) in the Amazon
Cognito API.
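
For example, the following Boto3 sketch lists each user's name and sub ID; the user pool ID is the placeholder value from the log entry above.

import boto3

# A sketch with a placeholder pool ID: match Amazon Cognito sub IDs to user names.
client = boto3.client('cognito-idp')

response = client.list_users(UserPoolId='us-east-2_IpicJXXXX')
for user in response['Users']:
    sub = next((a['Value'] for a in user['Attributes'] if a['Name'] == 'sub'), None)
    print(user['Username'], sub)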

Use Log Metrics


If you don't want to write your own scripts to process and visualize the raw log information, Amazon
CloudWatch metrics provide insights into worker activity for you.

To view metrics

1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.


2. In the navigation pane, choose Metrics.
3. Choose the AWS/SageMaker/Workteam namespace, then explore the available metrics (p. 3271).
For example, selecting the Workteam and Workforce metrics lets you calculate the average time per
submitted task for a specific labeling job.

For more information, see Using Amazon CloudWatch Metrics.

Create and manage Amazon SNS topics for your work teams
Use the procedures in this topic when you want to:

• Create a topic to which you want an existing work team to subscribe.


• Create a topic before you've created a work team.
• Create or modify the work team with an API call, and specify a topic Amazon Resource Name (ARN).

If you create a work team using the console, the console provides an option to create a new topic for the
team so that you don't have to perform these steps.

Important
The Amazon SNS feature is not supported by Amazon A2I. If you subscribe your work team to
an Amazon SNS topic, workers will only receive notifications about Ground Truth labeling jobs.
Workers will not receive notifications about new Amazon A2I human review tasks.

Create the Amazon SNS topic


The steps for creating Amazon SNS topics for work team notifications are similar to the steps in Getting
Started in the Amazon SNS Developer Guide, with one significant addition—you must add an access
policy so that Amazon SageMaker can publish messages to the topic on your behalf.

To add the policy when you create the topic

1. Open the Amazon SNS console at https://console.aws.amazon.com/sns/v3/home.


2. In Create topic, enter the name of your topic and then choose Next steps.
3. In Access policy, choose Advanced.
4. In the JSON editor, find the Resource property, which displays the topic's ARN.
5. Copy the Resource ARN value.
6. Before the final closing bracket (]), add the following policy.

, {
  "Sid": "AwsSagemaker_SnsAccessPolicy",
  "Effect": "Allow",
  "Principal": {
    "Service": "sagemaker.amazonaws.com"
  },
  "Action": "sns:Publish",
  "Resource": "arn:partition:sns:region:111122223333:MyTopic", # ARN of the topic you copied in the previous step
  "Condition": {
    "ArnLike": {
      "aws:SourceArn": "arn:partition:sagemaker:region:111122223333:workteam/*" # Work team ARN
    },
    "StringEquals": {
      "aws:SourceAccount": "111122223333" # SNS topic account
    }
  }
}

7. Create the topic.

After you create the topic, it appears in your Topics summary screen. For more information about
creating topics, see Creating a Topic in the Amazon SNS Developer Guide.

Manage worker subscriptions

If you subscribe a work team to a topic after you've already created the work team, the individual work
team members who were added to the team when the work team was created are not automatically
subscribed to the topic. For information about subscribing workers' email addresses to the topic, see
Subscribing an Endpoint to an Amazon SNS Topic in the Amazon SNS Developer Guide.

The only situation in which workers are automatically subscribed to your topic is when you create or
import an Amazon Cognito user group at the time that you create a work team and you set up the topic
subscription when you create that work team. For more information about creating and managing your
work teams with Amazon Cognito, see Create Work Teams (Amazon Cognito Console) (p. 874).

Crowd HTML Elements Reference


Crowd HTML Elements are web components, a web standard that abstracts HTML markup, CSS, and
JavaScript functionality into an HTML tag or set of tags. Amazon SageMaker provides customers with the
ability to design their own custom task templates in HTML.

As a starting point, you can use a template built using Crowd HTML Elements from one of the following
GitHub repositories:

• Example task UIs for Amazon SageMaker Ground Truth


• Over 60 example task UIs for Amazon Augmented AI (A2I)

These repositories include templates designed for audio, image, text, video, and other types of data
labeling and annotation tasks.

For more information about how to implement custom templates in Amazon SageMaker Ground Truth,
see Creating Custom Labeling Workflows (p. 671). To learn more about custom templates in Amazon
Augmented AI, see Create Custom Worker Task Templates (p. 2995).

SageMaker Crowd HTML Elements


The following is a list of Crowd HTML Elements that make building a custom template easier and provide
a familiar UI for workers. These elements are supported in Ground Truth, Augmented AI, and Mechanical
Turk.

Topics
• crowd-alert (p. 890)
• crowd-badge (p. 891)
• crowd-button (p. 893)
• crowd-bounding-box (p. 894)
• crowd-card (p. 898)
• crowd-checkbox (p. 900)
• crowd-classifier (p. 903)
• crowd-classifier-multi-select (p. 904)
• crowd-entity-annotation (p. 906)
• crowd-fab (p. 910)
• crowd-form (p. 911)
• crowd-icon-button (p. 912)
• crowd-image-classifier (p. 913)
• crowd-image-classifier-multi-select (p. 917)
• crowd-input (p. 919)
• crowd-instance-segmentation (p. 921)
• crowd-instructions (p. 925)
• crowd-keypoint (p. 927)
• crowd-line (p. 931)
• crowd-modal (p. 934)
• crowd-polygon (p. 935)
• crowd-polyline (p. 940)

• crowd-radio-button (p. 944)


• crowd-radio-group (p. 946)
• crowd-semantic-segmentation (p. 948)
• crowd-slider (p. 951)
• crowd-tab (p. 953)
• crowd-tabs (p. 954)
• crowd-text-area (p. 956)
• crowd-toast (p. 958)
• crowd-toggle-button (p. 959)

crowd-alert
A message that alerts the worker to a current situation.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-alert> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<div id="errorBox"></div>

<crowd-keypoint
src="{{ task.input.taskObject | grant_read_access }}"
labels="['Item A', 'Item B', 'Item C']"
header="Please locate the centers of each item."
name="annotatedResult">
<short-instructions>
Describe your task briefly here and give examples
</short-instructions>
<full-instructions>
Give additional instructions and good/bad examples here
</full-instructions>
</crowd-keypoint>
</crowd-form>

<script>
var num_obj = 1;

document.querySelector('crowd-form').onsubmit = function(e) {
const keypoints = document.querySelector('crowd-keypoint').value.keypoints ||
document.querySelector('crowd-keypoint')._submittableValue.keypoints;
const labels = keypoints.map(function(p) {
return p.label;
});

  // 1. Make sure total number of keypoints is correct.
  var original_num_labels = document.getElementsByTagName("crowd-keypoint")[0].getAttribute("labels");
  original_num_labels = original_num_labels.substring(2, original_num_labels.length - 2).split("\",\"");
  var goalNumKeypoints = num_obj * original_num_labels.length;
  if (keypoints.length != goalNumKeypoints) {
    e.preventDefault();
    errorBox.innerHTML = '<crowd-alert type="error" dismissible>You must add all keypoint annotations and use each label only once.</crowd-alert>';
    errorBox.scrollIntoView();
    return;
  }

  // 2. Make sure all labels are unique.
  var labelCounts = {};
  for (var i = 0; i < labels.length; i++) {
    if (!labelCounts[labels[i]]) {
      labelCounts[labels[i]] = 0;
    }
    labelCounts[labels[i]]++;
  }
  const goalNumSingleLabel = num_obj;
  const numLabels = Object.keys(labelCounts).length;

  Object.entries(labelCounts).forEach(entry => {
    if (entry[1] != goalNumSingleLabel) {
      e.preventDefault();
      errorBox.innerHTML = '<crowd-alert type="error" dismissible>You must use each label only once.</crowd-alert>';
      errorBox.scrollIntoView();
    }
  })
};
</script>

Attributes
The following attributes are supported by this element.

dismissible

A Boolean switch that, if present, allows the message to be closed by the worker.

type

A string that specifies the type of message to be displayed. The possible values are "info" (the default),
"success", "error", and "warning".

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-badge
An icon that floats over the top right corner of another element to which it is attached.


See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a template that uses the <crowd-badge> element. Copy the following
code and save it in a file with the extension .html. Open the file in any browser to preview and interact
with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-image-classifier
name="crowd-image-classifier"
src="https://unsplash.com/photos/NLUkAA-nDdE"
header="Choose the correct category for this image."
categories="['Person', 'Umbrella', 'Chair', 'Dolphin']"
>
<full-instructions header="Classification Instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</full-instructions>

<short-instructions id="short-instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<crowd-badge icon="star" for="short-instructions"/>
</short-instructions>
</crowd-image-classifier>
</crowd-form>

Attributes
The following attributes are supported by this element.

for

A string that specifies the ID of the element to which the badge is attached.

icon

A string that specifies the icon to be displayed in the badge. The string must be either the name of an
icon from the open-source iron-icons set, which is pre-loaded, or the URL to a custom icon.

This attribute overrides the label attribute.

The following is an example of the syntax that you can use to add an iron-icon to a <crowd-badge>
HTML element. Replace icon-name with the name of the icon you'd like to use from this Icons set.

<crowd-badge icon="icon-name" for="short-instructions"/>

label

The text to display in the badge. Three characters or less is recommended because text that is too large
will overflow the badge area. An icon can be displayed instead of text by setting the icon attribute.
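
For example, the following snippet displays a short text badge instead of an icon; the badge text is illustrative.

<crowd-badge label="new" for="short-instructions"/>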

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-button
A styled button that represents some action.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a template that uses the <crowd-button> element. Copy the following
code and save it in a file with the extension .html. Open the file in any browser to preview and interact
with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-image-classifier
name="crowd-image-classifier"
src="https://unsplash.com/photos/NLUkAA-nDdE"
header="Please select the correct category for this image"
categories="['Person', 'Umbrella', 'Chair', 'Dolphin']"
>
<full-instructions header="Classification Instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</full-instructions>
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<crowd-button>
<iron-icon icon="question-answer"/>
</crowd-button>
</short-instructions>
</crowd-image-classifier>
</crowd-form>

Attributes
The following attributes are supported by this element.

disabled

A Boolean switch that, if present, displays the button as disabled and prevents clicks.

form-action

A switch that either submits its parent crowd-form (p. 911) element, if set to "submit", or resets its
parent <crowd-form> element, if set to "reset".

href

The URL to an online resource. Use this property if you need a link styled as a button.


icon

A string that specifies the icon to be displayed next to the button's text. The string must be the name of
an icon from the open-source iron-icons set, which is pre-loaded. For example, to insert the search iron-
icon, use the following:

<crowd-button>
<iron-icon icon="search"/>
</crowd-button>

The icon is positioned to either the left or the right of the text, as specified by the icon-align attribute.

To use a custom icon see icon-url.

icon-align

The left or right position of the icon relative to the button's text. The default is "left".

icon-url

A URL to a custom image for the icon. A custom image can be used in place of a standard icon that is
specified by the icon attribute.

loading

A Boolean switch that, if present, displays the button as being in a loading state. This attribute has
precedence over the disabled attribute if both attributes are present.

target

When you use the href attribute to make the button act as a hyperlink to a specific URL, the target
attribute optionally targets a frame or window where the linked URL should load.

variant

The general style of the button. Use "primary" for primary buttons, "normal" for secondary buttons,
"link" for tertiary buttons, or "icon" to display only the icon without text.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-bounding-box
A widget for drawing rectangles on an image and assigning a label to the portion of the image that is
enclosed in each rectangle.


See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-bounding-box> element. Copy
the following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template. For more examples, see this GitHub repository.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-bounding-box
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Draw bounding boxes around all the cats and dogs in this image"
labels="['Cat', 'Dog']"
>
<full-instructions header="Bounding Box Instructions" >
    <p>Use the bounding box tool to draw boxes around the requested target of interest:</p>
    <ol>
      <li>Draw a rectangle using your mouse over each instance of the target.</li>
      <li>Make sure the box does not cut into the target; leave a 2 to 3 pixel margin.</li>
      <li>
        When targets are overlapping, draw a box around each object,
        include all contiguous parts of the target in the box.
        Do not include parts that are completely overlapped by another object.
      </li>
      <li>
        Do not include parts of the target that cannot be seen,
        even though you think you can interpolate the whole shape of the target.
      </li>
      <li>Avoid shadows; they're not considered part of the target.</li>
      <li>If the target goes off the screen, label up to the edge of the image.</li>
    </ol>
</full-instructions>

<short-instructions>
Draw boxes around the requested target of interest.
</short-instructions>
</crowd-bounding-box>
</crowd-form>

Attributes
The following attributes are supported by this element.

header

The text to display above the image. This is typically a question or simple instruction for the worker.

initial-value

An array of JSON objects, each of which sets a bounding box when the component is loaded. Each
JSON object in the array contains the following properties. Bounding boxes set via the initial-value
property can be adjusted by the worker, and whether a worker adjusted a box is tracked via an
initialValueModified boolean in the worker answer output.

• height – The height of the box in pixels.


• label – The text assigned to the box as part of the labeling task. This text must match one of the labels
defined in the labels attribute of the <crowd-bounding-box> element.
• left – Distance of the top-left corner of the box from the left side of the image, measured in pixels.


• top – Distance of the top-left corner of the box from the top of the image, measured in pixels.
• width – The width of the box in pixels.

You can extract the bounding box initial value from a manifest file of a previous job in a custom
template using the Liquid templating language:

initial-value="[
{% for box in task.input.manifestLine.label-attribute-name-from-prior-job.annotations
%}
{% capture class_id %}{{ box.class_id }}{% endcapture %}
{% assign label = task.input.manifestLine.label-attribute-name-from-prior-job-
metadata.class-map[class_id] %}
{
label: {{label | to_json}},
left: {{box.left}},
top: {{box.top}},
width: {{box.width}},
height: {{box.height}},
},
{% endfor %}
]"

labels

A JSON formatted array of strings, each of which is a label that a worker can assign to the image portion
enclosed by a rectangle. Limit: 10 labels.

name

The name of this widget. It's used as a key for the widget's input in the form output.

src

The URL of the image on which to draw bounding boxes.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: full-instructions (p. 896), short-instructions (p. 896)

Regions
The following regions are required by this element.

full-instructions

General instructions about how to draw bounding boxes.

short-instructions

Important task-specific instructions that are displayed in a prominent place.

Output
The following output is supported by this element.


boundingBoxes

An array of JSON objects, each of which specifies a bounding box that has been created by the worker.
Each JSON object in the array contains the following properties.

• height – The height of the box in pixels.


• label – The text assigned to the box as part of the labeling task. This text must match one of the labels
defined in the labels attribute of the <crowd-bounding-box> element.
• left – Distance of the top-left corner of the box from the left side of the image, measured in pixels.
• top – Distance of the top-left corner of the box from the top of the image, measured in pixels.
• width – The width of the box in pixels.

inputImageProperties

A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.

• height – The height, in pixels, of the image.


• width – The width, in pixels, of the image.

Example: Sample Element Outputs

The following are samples of outputs from common use scenarios for this element.

Single Label, Single Box / Multiple Label, Single Box

[
{
"annotatedResult": {
"boundingBoxes": [
{
"height": 401,
"label": "Dog",
"left": 243,
"top": 117,
"width": 187
}
],
"inputImageProperties": {
"height": 533,
"width": 800
}
}
}
]

Single Label, Multiple Box

[
{
"annotatedResult": {
"boundingBoxes": [
{
"height": 401,
"label": "Dog",
"left": 243,
"top": 117,
"width": 187


},
{
"height": 283,
"label": "Dog",
"left": 684,
"top": 120,
"width": 116
}
],
"inputImageProperties": {
"height": 533,
"width": 800
}
}
}
]

Multiple Label, Multiple Box

[
{
"annotatedResult": {
"boundingBoxes": [
{
"height": 395,
"label": "Dog",
"left": 241,
"top": 125,
"width": 158
},
{
"height": 298,
"label": "Cat",
"left": 699,
"top": 116,
"width": 101
}
],
"inputImageProperties": {
"height": 533,
"width": 800
}
}
}
]

You could have many labels available, but only the ones that are used appear in the output.

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-card
A box with an elevated appearance for displaying information.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.


The following is an example of a template designed for sentiment analysis tasks that uses the <crowd-
card> element. Copy the following code and save it in a file with the extension .html. Open the file in
any browser to preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<style>
h3 {
margin-top: 0;
}

crowd-card {
width: 100%;
}

.card {
margin: 10px;
}

.left {
width: 70%;
margin-right: 10px;
display: inline-block;
height: 200px;
}

.right {
width: 20%;
height: 200px;
display: inline-block;
}
</style>

<crowd-form>
<short-instructions>
Your short instructions here.
</short-instructions>

<full-instructions>
Your full instructions here.
</full-instructions>

<div class="left">
<h3>What sentiment does this text convey?</h3>
<crowd-card>
<div class="card">
Nothing is great.
</div>
</crowd-card>
</div>

<div class="right">
<h3>Select an option</h3>

<select name="sentiment1" style="font-size: large" required>


<option value="">(Please select)</option>
<option>Negative</option>
<option>Neutral</option>
<option>Positive</option>
<option>Text is empty</option>
</select>
</div>

<div class="left">
<h3>What sentiment does this text convey?</h3>


<crowd-card>
<div class="card">
Everything is great!
</div>
</crowd-card>
</div>

<div class="right">
<h3>Select an option</h3>

<select name="sentiment2" style="font-size: large" required>


<option value="">(Please select)</option>
<option>Negative</option>
<option>Neutral</option>
<option>Positive</option>
<option>Text is empty</option>
</select>
</div>
</crowd-form>

Attributes
The following attributes are supported by this element.

heading

The text displayed at the top of the box.

image

A URL to an image to be displayed within the box.
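
The following is a minimal sketch that uses both attributes; the image URL is a placeholder.

<crowd-card heading="Product photo" image="https://example.com/product.jpg">
  <div class="card">Does this photo match the product description?</div>
</crowd-card>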

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-checkbox
A UI component that can be checked or unchecked, allowing a user to select multiple options from a set.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-checkbox> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>


<crowd-form>

<p>Find the official website for: <strong>{{ task.input.company }}</strong></p>


<p>Do not give Yelp pages, LinkedIn pages, etc.</p>
<p>Include the http:// prefix from the website</p>
<crowd-input name="website" placeholder="http://example.com"></crowd-input>

<crowd-checkbox name="website-found">Website Found</crowd-checkbox>

</crowd-form>

Attributes
The following attributes are supported by this element.

checked

A Boolean switch that, if present, displays the check box as checked.

The following is an example of the syntax used to check a checkbox by default.

<crowd-checkbox name="checkedBox" value="checked" checked>This box is checked</crowd-checkbox>

disabled

A Boolean switch that, if present, displays the check box as disabled and prevents it from being checked.

The following is an example of the syntax used to disable a checkbox.

<crowd-checkbox name="disabledCheckBox" value="Disabled" disabled>Cannot be selected</


crowd-checkbox>

name

A string that is used to identify the answer submitted by the worker. This value will match a key in the
JSON object that specifies the answer.

required

A Boolean switch that, if present, requires the worker to provide input.

The following is an example of the syntax used to require a checkbox be selected.

<crowd-checkbox name="work_verified" required>Instructions were clear</crowd-checkbox>

value

A string used as the name for the check box state in the output. Defaults to "on" if not specified.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

Output
Provides a JSON object. The name string is the object name and the value string is the property name
for a Boolean value based on the check box state; true if checked, false if not checked.

Example: Sample Element Outputs

Using the same name value for multiple boxes.

<!-- INPUT -->

<div><crowd-checkbox name="image_attributes" value="blurry"> Blurry </crowd-checkbox></div>
<div><crowd-checkbox name="image_attributes" value="dim"> Too Dim </crowd-checkbox></div>
<div><crowd-checkbox name="image_attributes" value="exposed"> Too Bright </crowd-checkbox></div>

//Output with "blurry" and "dim" checked


[
{
"image_attributes": {
"blurry": true,
"dim": true,
"exposed": false
}
}
]

Note that all three values are properties of a single object.

Using different name values for each box.

<!-- INPUT -->


<div><crowd-checkbox name="Stop" value="Red"> Red </crowd-checkbox></div>
<div><crowd-checkbox name="Slow" value="Yellow"> Yellow </crowd-checkbox></div>
<div><crowd-checkbox name="Go" value="Green"> Green </crowd-checkbox></div>

//Output with "Red" checked


[
{
"Go": {
"Green": false
},
"Slow": {
"Yellow": false
},
"Stop": {
"Red": true
}
}
]

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)


crowd-classifier
A widget for classifying non-image content, such as audio, video, or text.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of an HTML worker task template built using crowd-classifier. This
example uses the Liquid template language to automate:

• Label categories in the categories parameter


• The objects that are being classified in the classification-target parameter.

Copy the following code and save it in a file with the extension .html. Open the file in any browser to
preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-classifier
name="category"
categories="{{ task.input.labels | to_json | escape }}"
header="What type of a document is this?"
>
  <classification-target>
    <iframe style="width: 100%; height: 600px;" src="{{ task.input.taskObject | grant_read_access }}" type="application/pdf"></iframe>
  </classification-target>

<full-instructions header="Document Classification Instructions">


<p>Read the task carefully and inspect the document.</p>
<p>Choose the appropriate label that best suits the document.</p>
</full-instructions>

<short-instructions>
Please choose the correct category for the document
</short-instructions>
</crowd-classifier>
</crowd-form>

Attributes
The following attributes are supported by this element.

categories
A JSON formatted array of strings, each of which is a category that a worker can assign to the text. You
should include "other" as a category; otherwise, the worker may not be able to provide an answer.

header
The text to display above the image. This is typically a question or simple instruction for the worker.

name
The name of this widget. It is used as a key for the widget's input in the form output.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: classification-target (p. 904), full-instructions (p. 904), short-instructions (p. 904)

Regions
The following regions are supported by this element.

classification-target
The content to be classified by the worker. This can be plain text or HTML. Examples of how the HTML
can be used include but are not limited to embedding a video or audio player, embedding a PDF, or
performing a comparison of two or more images.

full-instructions
General instructions about how to do text classification.

short-instructions
Important task-specific instructions that are displayed in a prominent place.

Output
The output of this element is an object using the specified name value as a property name, and a string
from the categories as the property's value.

Example: Sample Element Outputs


The following is a sample of output from this element.

[
{
"<name>": {
"label": "<value>"
}
}
]

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-classifier-multi-select
A widget for classifying various forms of content—such as audio, video, or text—into one or more
categories. The content to classify is referred to as an object.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of an HTML worker task template built using this element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>


<crowd-classifier-multi-select
name="category"
categories="['Positive', 'Negative', 'Neutral']"
header="Select the relevant categories"
exclusion-category="{ text: 'None of the above' }"
>
<classification-target>
{{ task.input.taskObject }}
</classification-target>

<full-instructions header="Text Categorization Instructions">


<p><strong>Positive</strong> sentiment include: joy, excitement, delight</p>
<p><strong>Negative</strong> sentiment include: anger, sarcasm, anxiety</p>
<p><strong>Neutral</strong>: neither positive or negative, such as stating a fact</
p>
<p><strong>N/A</strong>: when the text cannot be understood</p>
<p>When the sentiment is mixed, such as both joy and sadness, choose both labels.</
p>
</full-instructions>

<short-instructions>
Choose all categories that are expressed by the text.
</short-instructions>
</crowd-classifier-multi-select>
</crowd-form>

Attributes
The following attributes are supported by the crowd-classifier-multi-select element. Each
attribute accepts a string value or string values.

categories
Required. A JSON-formatted array of strings, each of which is a category that a worker can assign to the
object.

header
Required. The text to display above the image. This is typically a question or simple instruction for
workers.

name
Required. The name of this widget. In the form output, the name is used as a key for the widget's input.

exclusion-category
Optional. A JSON-formatted string with the following format: "{ text: 'default-value' }". This
attribute sets a default value that workers can choose if none of the labels applies to the object shown in
the worker UI.

Element Hierarchy
This element has the following parent and child elements:

• Parent elements: crowd-form (p. 911)


• Child elements: classification-target (p. 904), full-instructions (p. 904), short-instructions (p. 904)

Regions
This element uses the following regions.


classification-target

The content to be classified by the worker. Content can be plain text or an object that you specify in
the template using HTML. For example, you can use HTML elements to include a video or audio player,
embed a PDF file, or compare two or more images.

full-instructions

General instructions about how to classify text.

short-instructions

Important task-specific instructions. These instructions are displayed prominently.

Output
The output of this element is an object that uses the specified name value as a property name, and a
string from categories as the property's value.

Example: Sample Element Outputs

The following is a sample of output from this element.

[
{
"<name>": {
labels: ["label_a", "label_b"]
}
}
]

See Also
For more information, see the following:

• Text Classification (Multi-label) (p. 559)


• Use Amazon SageMaker Ground Truth to Label Data (p. 526)
• Crowd HTML Elements Reference (p. 889)

crowd-entity-annotation
A widget for labeling words, phrases, or character strings within a longer text. Workers select a label, and
highlight the text that the label applies to.
Important: Self-contained Widget
Do not use the <crowd-entity-annotation> element with the <crowd-form> element. It
contains its own form submission logic and Submit button.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a template that uses the <crowd-entity-annotation> element. Copy
the following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>


<crowd-entity-annotation
name="crowd-entity-annotation"
header="Highlight parts of the text below"
labels="[{'label': 'person', 'shortDisplayName': 'per', 'fullDisplayName': 'Person'},
{'label': 'date', 'shortDisplayName': 'dat', 'fullDisplayName': 'Date'}, {'label':
'company', 'shortDisplayName': 'com', 'fullDisplayName': 'Company'}]"
text="Amazon SageMaker Ground Truth helps you build highly accurate training datasets for
machine learning quickly."
>
<full-instructions header="Named entity recognition instructions">
<ol>
<li><strong>Read</strong> the text carefully.</li>
<li><strong>Highlight</strong> words, phrases, or sections of the text.</li>
<li><strong>Choose</strong> the label that best matches what you have highlighted.</
li>
<li>To <strong>change</strong> a label, choose highlighted text and select a new
label.</li>
<li>To <strong>remove</strong> a label from highlighted text, choose the X next to
the abbreviated label name on the highlighted text.</li>
<li>You can select all of a previously highlighted text, but not a portion of it.</
li>
</ol>
</full-instructions>

<short-instructions>
Apply labels to words or phrases.
</short-instructions>

<div id="additionalQuestions" style="margin-top: 20px">


<h3>
What is the overall subject of this text?
</h3>
<crowd-radio-group>
<crowd-radio-button name="tech" value="tech">Technology</crowd-radio-button>
<crowd-radio-button name="politics" value="politics">Politics</crowd-radio-button>
</crowd-radio-group>
</div>
</crowd-entity-annotation>

<script>
document.addEventListener('all-crowd-elements-ready', () => {
document
.querySelector('crowd-entity-annotation')
.shadowRoot
.querySelector('crowd-form')
.form
.appendChild(additionalQuestions);
});
</script>

Attributes
The following attributes are supported by this element.

header
The text to display above the image. This is typically a question or simple instruction for the worker.

initial-value
A JSON formatted array of objects, each of which defines an annotation to apply to the text at
initialization. Objects contain a label value that matches one in the labels attribute, an integer
startOffset value for labeled span's starting unicode offset, and an integer endOffset value for the
ending unicode offset.


Example

[
{
label: 'person',
startOffset: 0,
endOffset: 16
},
...
]

labels

A JSON formatted array of objects, each of which contains:

• label (required): The name used to identify entities.


• fullDisplayName (optional): Used for the label list in the task widget. Defaults to the label value if
not specified.
• shortDisplayName (optional): An abbreviation of 3-4 letters to display above selected entities.
Defaults to the label value if not specified.
shortDisplayName is highly recommended
Values displayed above the selections can overlap and create difficulty managing labeled
entities in the workspace. Providing a 3-4 character shortDisplayName for each label
is highly recommended to prevent overlap and keep the workspace manageable for your
workers.

Example

[
{
label: 'person',
shortDisplayName: 'per',
fullDisplayName: 'person'
}
]

name

Serves as the widget's name in the DOM. It is also used as the label attribute name in form output and
the output manifest.

text

The text to be annotated. The templating system escapes quotes and HTML strings by default. If your
code is already escaped or partially escaped, see Variable filters (p. 676) for more ways to control
escaping.

Element Hierarchy
This element has the following parent and child elements.

• Child elements: full-instructions (p. 909), short-instructions (p. 909)

Regions
The following regions are supported by this element.


full-instructions

General instructions about how to work with the widget.

short-instructions

Important task-specific instructions that are displayed in a prominent place.

Output
The following output is supported by this element.

entities

A JSON object that specifies the start, end, and label of an annotation. This object contains the following
properties.

• label – The assigned label.


• startOffset – The Unicode offset of the beginning of the selected text.
• endOffset – The Unicode offset of the first character after the selection.

Example: Sample Element Outputs

The following is a sample of the output from this element.

{
"myAnnotatedResult": {
"entities": [
{
"endOffset": 54,
"label": "person",
"startOffset": 47
},
{
"endOffset": 97,
"label": "event",
"startOffset": 93
},
{
"endOffset": 219,
"label": "date",
"startOffset": 212
},
{
"endOffset": 271,
"label": "location",
"startOffset": 260
}
]
}
}

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)


crowd-fab
A floating button with an image in its center.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template designed for image classification that uses the
<crowd-fab> element. This template uses JavaScript to enable workers to report issues with the worker
UI. Copy the following code and save it in a file with the extension .html. Open the file in any browser
to preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">

<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>If there is an issue with the image or tools, please select
<b>None of the Above</b>, describe the issue in the text box and click the
button below.</p>
<crowd-input label="Report an Issue" name="template-issues"></crowd-input>
<crowd-fab id="button1" icon="report-problem" title="Issue"/>
</short-instructions>

<full-instructions header="Classification Instructions">


<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.
Use the <b>None of the Above</b> option if none of the other labels suit the
image.</p>
</full-instructions>

</crowd-image-classifier>
</crowd-form>

<script>
[
button1,
].forEach(function(button) {
button.addEventListener('click', function() {
document.querySelector('crowd-form').submit();
});
});
</script>

Attributes
The following attributes are supported by this element.

disabled

A Boolean switch that, if present, displays the floating button as disabled and prevents clicks.

icon

A string that specifies the icon to be displayed in the center of the button. The string must be either the
name of an icon from the open-source iron-icons set, which is pre-loaded, or the URL to a custom icon.


The following is an example of the syntax that you can use to add an iron-icon to a <crowd-fab> HTML
element. Replace icon-name with the name of the icon you'd like to use from this Icons set.

<crowd-fab "id="button1" icon="icon-name" title="Issue"/>

label
A string consisting of a single character that can be used instead of an icon. Emojis or multiple characters
may result in the button displaying an ellipsis instead.

title
A string that will display as a tool tip when the mouse hovers over the button.
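
The following is a minimal sketch that uses label and title together; the id, label, and title values are illustrative.

<crowd-fab id="helpButton" label="?" title="Get help"/>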

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-form
The form wrapper for all custom tasks. Sets and implements important actions for the proper
submission of your form data.

If a crowd-button (p. 893) of type "submit" is not included inside the <crowd-form> element, it will
automatically be appended within the <crowd-form> element.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of an image classification template that uses the <crowd-form> element.
Copy the following code and save it in a file with the extension .html. Open the file in any browser to
preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">

<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</short-instructions>


<full-instructions header="Classification Instructions">


<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.
Use the <b>None of the Above</b> option if none of the other labels suit the
image.</p>
</full-instructions>

</crowd-image-classifier>
</crowd-form>

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: none


• Child elements: Any of the UI Template (p. 889) elements

Element Events
The crowd-form element extends the standard HTML form element and inherits its events, such as
onclick and onsubmit.
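
For example, the following sketch intercepts the submit event to run custom validation. It assumes a crowd-input named tag1 exists elsewhere in the template and that the element exposes its text through a value property, as in the keypoint validation example earlier in this reference.

<script>
  document.querySelector('crowd-form').onsubmit = function(e) {
    // Block submission when the worker leaves the assumed field empty.
    if (!document.querySelector('crowd-input[name="tag1"]').value) {
      e.preventDefault();
    }
  };
</script>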

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-icon-button
A button with an image placed in the center. When the user touches the button, a ripple effect emanates
from the center of the button.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template designed for image classification that uses the
<crowd-icon-button> element. This template uses JavaScript to enable workers to report issues with
the worker UI. Copy the following code and save it in a file with the extension .html. Open the file in
any browser to preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">

<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>If there is an issue with the image or tools, please select
<b>None of the Above</b>, describe the issue in the text box and click the
button below.</p>
<crowd-input label="Report an Issue" name="template-issues"/></crowd-input>
<crowd-icon-button id="button1" icon="report-problem" title="Issue"/>
</short-instructions>


<full-instructions header="Classification Instructions">


<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.
Use the <b>None of the Above</b> option if none of the other labels suit the
image.</p>
</full-instructions>

</crowd-image-classifier>
</crowd-form>

<script>
[
button1,
].forEach(function(button) {
button.addEventListener('click', function() {
document.querySelector('crowd-form').submit();
});
});
</script>

Attributes
The following attributes are supported by this element.

disabled
A Boolean switch that, if present, displays the button as disabled and prevents clicks.

icon
A string that specifies the icon to be displayed in the center of the button. The string must be either the
name of an icon from the open-source iron-icons set, which is pre-loaded, or the URL to a custom icon.

The following is an example of the syntax that you can use to add an iron-icon to a <crowd-icon-
button> HTML element. Replace icon-name with the name of the icon you'd like to use from this Icons
set.

<crowd-icon-button id="button1" icon="icon-name" title="Issue"/>

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-image-classifier
A widget for classifying an image. Use one of the following supported image formats: APNG, BMP, GIF,
ICO, JPEG, PNG, SVG. Images do not have a size limit.


See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of an image classification template that uses the <crowd-image-
classifier> element. Copy the following code and save it in a file with the extension .html. Open the
file in any browser to preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-image-classifier
src="${image_url}"
categories="['Cat', 'Dog', 'Bird', 'None of the Above']"
header="Choose the correct category for the image"
name="category">

<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</short-instructions>

<full-instructions header="Classification Instructions">


<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.
Use the <b>None of the Above</b> option if none of the other labels suit the
image.</p>
</full-instructions>

</crowd-image-classifier>
</crowd-form>

Attributes
The following attributes are required by this element.

categories

A JSON formatted array of strings, each of which is a category that a worker can assign to the image. You
should include "other" as a category, so that the worker can provide an answer. You can specify up to 10
categories.

header

The text to display above the image. This is typically a question or simple instruction for the worker.

name

The name of this widget. It is used as a key for the widget's input in the form output.

overlay

Information to be overlaid on the source image. This is for verification workflows of bounding-box,
semantic-segmentation, and instance-segmentation tasks.

It is a JSON object containing an object with the name of the task-type in camelCase as the key. That
key's value is an object that contains the labels and other necessary information from the previous task.

An example of a crowd-image-classifier element with attributes for verifying a bounding-box task
follows:

<crowd-image-classifier


name="boundingBoxClassification"
header="Rate the quality of the annotations based on the background section
in the instructions on the left hand side."
src="https://i.imgur.com/CIPKVJo.jpg"
categories="['good', 'bad', 'okay']"
overlay='{
"boundingBox": {
labels: ["bird", "cat"],
value: [
{
height: 284,
label: "bird",
left: 230,
top: 974,
width: 223
},
{
height: 69,
label: "bird",
left: 79,
top: 889,
width: 247
}
]
},
}'
> ... </crowd-image-classifier>

A semantic segmentation verification task would use the overlay value as follows:

<crowd-image-classifier
name='crowd-image-classifier'
categories='["good", "bad"]'
src='URL of image to be classified'
header='Please classify'
overlay='{
"semanticSegmentation": {
"labels": ["Cat", "Dog", "Bird", "Cow"],
"labelMappings": {
"Bird": {
"color": "#ff7f0e"
},
"Cat": {
"color": "#2ca02c"
},
"Cow": {
"color": "#d62728"
},
"Dog": {
"color": "#2acf59"
}
},
"src": "URL of overlay image",
}
}'
> ... </crowd-image-classifier>

An instance-segmentation task would use the overlay value as follows:

<crowd-image-classifier
name='crowd-image-classifier'
categories='["good", "bad"]'
src='URL of image to be classified'


header='Please classify instances of each category'


overlay='{
"instanceSegmentation": {
"labels": ["Cat", "Dog", "Bird", "Cow"],
"instances": [
{
"color": "#2ca02c",
"label": "Cat"
},
{
"color": "#1f77b4",
"label": "Cat"
},
{
"color": "#d62728",
"label": "Dog"
}
],
"src": "URL of overlay image",
}
}'
> ... </crowd-image-classifier>

src

The URL of the image to be classified.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: full-instructions (p. 916), short-instructions (p. 916), worker-comment (p. 916)

Regions
The following regions are used by this element.

full-instructions

General instructions for the worker on how to classify an image.

short-instructions

Important task-specific instructions that are displayed in a prominent place.

worker-comment

Use this in verification workflows when you need workers to explain why they made the choice they
did. Use the text between the opening and closing tags to provide instructions for workers on what
information should be included in the comment.

It uses the following attributes:

header

A phrase with a call to action for leaving a comment. Used as the title text for a modal window where the
comment is added.

Optional. Defaults to "Add a comment."


link-text

This text appears below the categories in the widget. When clicked, it opens a modal window where the
worker may add a comment.

Optional. Defaults to "Add a comment."

placeholder

An example text in the comment text area that is overwritten when worker begins to type. This does not
appear in output if the worker leaves the field blank.

Optional. Defaults to blank.
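
The following is a minimal sketch of a verification template that uses the worker-comment region; the name, header, and instruction text are illustrative.

<crowd-image-classifier
  name="annotationVerification"
  src="${image_url}"
  categories="['good', 'bad']"
  header="Rate the quality of the annotations"
>
  <worker-comment
    header="Explain your rating"
    link-text="Explain why you chose this rating"
    placeholder="Describe any problems with the annotations."
  >Tell us which annotations look wrong and why.</worker-comment>
  <short-instructions>Inspect the annotations on the image.</short-instructions>
  <full-instructions>Choose good if every object is labeled correctly; otherwise, choose bad.</full-instructions>
</crowd-image-classifier>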

Output
The output of this element is a string that specifies one of the values defined in the categories attribute
of the <crowd-image-classifier> element.

Example: Sample Element Outputs

The following is a sample of output from this element.

[
  {
    "<name>": {
      "label": "<value>",
      "workerComment": "Comment - if no comment is provided, this field will not be present"
    }
  }
]

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-image-classifier-multi-select
A widget for classifying an image into one or more categories. Use one of the following supported image
formats: APNG, BMP, GIF, ICO, JPEG, PNG, SVG. Images do not have a size limit.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of an HTML worker task template built using this crowd element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-image-classifier-multi-select
name="animals"
categories="['Cat', 'Dog', 'Horse', 'Pig', 'Bird']"


src="https://images.unsplash.com/photo-1509205477838-a534e43a849f?
ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1998&q=80"
header="Please identify the animals in this image"
exclusion-category="{ text: 'None of the above' }"
>
<full-instructions header="Classification Instructions">
<p>If more than one label applies to the image, select multiple labels.</p>
<p>If no labels apply, select <b>None of the above</b></p>
</full-instructions>

<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label(s) that best suit the image.</p>
</short-instructions>
</crowd-image-classifier-multi-select>
</crowd-form>

Attributes
The following attributes are supported by the crowd-image-classifier-multi-select element.
Each attribute accepts a string value or string values.

categories

Required. A JSON-formatted array of strings, each of which is a category that a worker can assign to the
image. A worker must choose at least one category and can choose all categories.

header

Required. The text to display above the image. This is typically a question or simple instruction for
workers.

name

Required. The name of this widget. In the form output, the name is used as a key for the widget's input.

src

Required. The URL of the image to be classified.

exclusion-category

Optional. A JSON-formatted string with the following format: "{ text: 'default-value' }". This
attribute sets a default value that workers can choose if none of the labels applies to the image shown in
the worker UI.

Element Hierarchy
This element has the following parent and child elements:

• Parent elements: crowd-form (p. 911)


• Child elements: full-instructions (p. 916), short-instructions (p. 916), worker-comment (p. 916)

Regions
This element uses the following regions.

full-instructions

General instructions for the worker on how to classify an image.


short-instructions

Important task-specific instructions. These instructions are displayed prominently.

Output
The output of this element is a string that specifies one or more of the values defined in the
categories attribute of the <crowd-image-classifier-multi-select> element.

Example: Sample Element Outputs

The following is a sample of output from this element.

[
{
"<name>": {
labels: ["label_a", "label_b"]
}
}
]

See Also
For more information, see the following:

• Image Classification (Multi-label) (p. 547)


• Use Amazon SageMaker Ground Truth to Label Data (p. 526)
• Crowd HTML Elements Reference (p. 889)

crowd-input
A box that accepts input data.
Cannot be self-closing
Unlike the input element in the HTML standard, this element cannot be self-closed by putting
a slash before the ending bracket, e.g. <crowd-input ... />. It must be followed by a closing </
crowd-input> tag.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-input> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<img style="max-width: 35vw; max-height: 50vh" src="{{ task.input.taskObject |
grant_read_access }}">
<crowd-input name="tag1" label="Word/phrase 1" required></crowd-input>
<crowd-input name="tag2" label="Word/phrase 2" required></crowd-input>
<crowd-input name="tag3" label="Word/phrase 3" required></crowd-input>

<short-instructions>
Your custom quick instructions and examples
</short-instructions>

<full-instructions>

919
Amazon SageMaker Developer Guide
SageMaker Crowd HTML Elements

Your custom detailed instructions and more examples


</full-instructions>
</crowd-form>

Attributes
The following attributes are supported by this element.

allowed-pattern

A regular expression that is used with the auto-validate attribute to ignore non-matching characters as
the worker types.

auto-focus

When the value is set to true, the browser places focus inside the input area after loading. This way, the
worker can start typing without having to select it first.

auto-validate

A Boolean switch that, if present, turns on input validation. The behavior of the validator can be
modified by the error-message and allowed-pattern attributes.

disabled

A Boolean switch that, if present, displays the input area as disabled.

error-message

The text to be displayed below the input field, on the left side, if validation fails.
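
For example, the following sketch combines auto-validate, allowed-pattern, and error-message so that only digits are accepted as the worker types; the name and label values are illustrative.

<crowd-input
  name="zipCode"
  label="ZIP code"
  auto-validate
  allowed-pattern="[0-9]"
  error-message="Enter digits only"
></crowd-input>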

label

A string that is displayed inside a text field.

This text shrinks and rises up above a text field when the worker starts typing in the field or when the
value attribute is set.

max-length

A maximum number of characters the input will accept. Input beyond this limit is ignored.

min-length

A minimum length for the input in the field.

name

Sets the name of the input to be used in the DOM and the output of the form.

placeholder

A string value that is used as placeholder text, displayed until the worker starts entering data into the
input. It is not used as a default value.

required

A Boolean switch that, if present, requires the worker to provide input.


type
Takes a string to set the HTML5 input-type behavior for the input. Examples include file and date.

value
A preset that becomes the default if the worker does not provide input. The preset appears in a text field.
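
For example, the following sketch uses type and value together to request a date with a preset default; the name, label, and date are illustrative.

<crowd-input name="reviewDate" label="Review date" type="date" value="2023-01-01"></crowd-input>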

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

Output
Provides a name string as the property name, and the text that was entered in the field as its value.

Example: Sample JSON Output


The values for multiple elements are output in the same object, with their name attribute value as their
property name. Elements with no input do not appear in the output. For example, let's use three inputs:

<crowd-input name="tag1" label="Word/phrase 1"></crowd-input>


<crowd-input name="tag2" label="Word/phrase 2"></crowd-input>
<crowd-input name="tag3" label="Word/phrase 3"></crowd-input>

This is the output if only two have input:

[
{
"tag1": "blue",
"tag2": "red"
}
]

This means any code built to parse these results should be able to handle the presence or absence of
each input in the answers.

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-instance-segmentation
A widget for identifying individual instances of specific objects within an image and creating a colored
overlay for each labeled instance.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-instance-segmentation>
element. Copy the following code and save it in a file with the extension .html. Open the file in any
browser to preview and interact with this template.


<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-instance-segmentation
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Please label each of the requested objects in this image"
labels="['Cat', 'Dog', 'Bird']"
>
<full-instructions header="Segmentation Instructions">
<ol>
<li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to
understand more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits the image.</li>
</ol>
</full-instructions>

<short-instructions>
<p>Use the tools to label all instances of the requested items in the image</p>
</short-instructions>
</crowd-instance-segmentation>
</crowd-form>

Use a template similar to the following to allow workers to add their own categories (labels).

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-instance-segmentation
id="annotator"
name="myTexts"
src="{{ task.input.taskObject | grant_read_access }}"
header="Click Instructions to add new labels."
labels="['placeholder']"
>
<short-instructions>
<h3>Add a label to describe each type of object in this image.</h3>
<h3>Cover each instance of each object with a segmentation mask.</h3>
<br>
<h3>
Add new label
</h3>
<crowd-input name="_customLabel" id="customLabel"></crowd-input>
<crowd-button id="addLabel">Add</crowd-button>

<br><br><br>
<h3>
Manage labels
</h3>
<div id="labelsSection"></div>
</short-instructions>

<full-instructions>
Describe your task in more detail here.
</full-instructions>
</crowd-instance-segmentation>
</crowd-form>

<script>
document.addEventListener('all-crowd-elements-ready', function(event) {
document.querySelector('crowd-instance-segmentation').labels = [];
});

function populateLabelsSection() {
labelsSection.innerHTML = '';
annotator.labels.forEach(function(label) {
const labelContainer = document.createElement('div');
labelContainer.innerHTML = label + ' <a href="javascript:void(0)">(Delete)</a>';
labelContainer.querySelector('a').onclick = function() {
annotator.labels = annotator.labels.filter(function(l) {
return l !== label;
});
populateLabelsSection();
};
labelsSection.appendChild(labelContainer);
});
}

addLabel.onclick = function() {
annotator.labels = annotator.labels.concat([customLabel.value]);
customLabel.value = null;

populateLabelsSection();
};
</script>

Attributes
The following attributes are supported by this element.

header

The text to display above the image. This is typically a question or simple instruction for the worker.

labels

A JSON formatted array of strings, each of which is a label that a worker can assign to an instance of an
object in the image. Workers can generate different overlay colors for each relevant instance by selecting
"add instance" under the label in the tool.

name

The name of this widget. It is used as a key for the labeling data in the form output.

src

The URL of the image that is to be labeled.

initial-value
A JSON object containing the color mappings of a prior instance segmentation job and a link to the
overlay image output by the prior job. Include this when you want a human worker to verify the results
of a prior labeling job and adjust it if necessary.

The attribute will appear as follows:

initial-value='{
  "instances": [
    {
      "color": "#2ca02c",
      "label": "Cat"
    },
    {
      "color": "#1f77b4",
      "label": "Cat"
    },
    {
      "color": "#d62728",
      "label": "Dog"
    }
  ],
  "src": {{ "S3 file URL for image" | grant_read_access }}
}'

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: full-instructions (p. 924), short-instructions (p. 924)

Regions
The following regions are supported by this element.

full-instructions

General instructions about how to do image segmentation.

short-instructions

Important task-specific instructions that are displayed in a prominent place.

Output
The following output is supported by this element.

labeledImage

A JSON Object containing a Base64 encoded PNG of the labels.

instances

A JSON Array containing objects with the instance labels and colors.

• color – The hexadecimal value of the label's RGB color in the labeledImage PNG.
• label – The label given to overlay(s) using that color. This value may repeat, because the different
instances of the label are identified by their unique color.

inputImageProperties

A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.

• height – The height, in pixels, of the image.


• width – The width, in pixels, of the image.

Example : Sample Element Outputs

The following is an example of output from this element.


[
  {
    "annotatedResult": {
      "inputImageProperties": {
        "height": 533,
        "width": 800
      },
      "instances": [
        {
          "color": "#1f77b4",
          "label": "<Label 1>"
        },
        {
          "color": "#2ca02c",
          "label": "<Label 1>"
        },
        {
          "color": "#ff7f0e",
          "label": "<Label 3>"
        }
      ],
      "labeledImage": {
        "pngImageData": "<Base-64 Encoded Data>"
      }
    }
  }
]

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-instructions
An element that displays instructions on three tabbed pages, Summary, Detailed Instructions, and
Examples, when the worker clicks on a link or button.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-instructions> element. Copy
the following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-instructions link-text="View instructions" link-type="button">
<short-summary>
<p>Given an image, write three words or short phrases that summarize its contents.</
p>
</short-summary>
<detailed-instructions>
<p>Imagine that you are describing an image to a friend or tagging it for a news
website. Provide three specific words or short phrases that describe it.</p>
</detailed-instructions>
<positive-example>
<p><img src="https://s3.amazonaws.com/cv-demo-images/highway.jpg"/></p>
<p>
<ul>
<li>Highway</li>
<li>Cars</li>
<li>Gas station</li>
</ul>
</p>
</positive-example>
<negative-example>
<p><img src="https://s3.amazonaws.com/cv-demo-images/highway.jpg"/></p>
<p>
These are not specific enough:
<ol>
<li>Trees</li>
<li>Outside</li>
<li>Daytime</li>
</ol>
</p>
</negative-example>
</crowd-instructions>
<p><strong>Instructions: </strong>Given an image, write three words or short phrases
that summarize its contents.</p>
<p>If someone were to see these three words or phrases, they should understand the
subject and context of the image, as well as any important actions.</p>
<p>View the instructions for detailed instructions and examples.</p>
<p><img style="max-width: 100%; max-height: 100%" src="{{ task.input.taskObject |
grant_read_access }}"></p>
<crowd-input name="tag1" label="Word/phrase 1" required></crowd-input>
<crowd-input name="tag2" label="Word/phrase 2" required></crowd-input>
<crowd-input name="tag3" label="Word/phrase 3" required></crowd-input>
</crowd-form>

Attributes
The following attributes are supported by this element.

link-text

The text to display for opening the instructions. The default is Click for instructions.

link-type

A string that specifies the type of trigger for the instructions. The possible values are "link" (default) and
"button".

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

Regions
The following regions are supported by this element.

detailed-instructions

Content that provides specific instructions for a task. This appears on the page of the "Detailed
Instructions" tab.


negative-example

Content that provides examples of inadequate task completion. This appears on the page of the
"Examples" tab. More than one example may be provided within this element.

positive-example

Content that provides examples of proper task completion. This appears on the page of the "Examples"
tab.

short-summary

A brief statement that summarizes the task to be completed. This appears on the page of the "Summary"
tab.

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-keypoint
Generates a tool to select and annotate key points on an image.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-keypoint> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<div id="errorBox"></div>

<crowd-keypoint
src="{{ task.input.taskObject | grant_read_access }}"
labels="['Item A', 'Item B', 'Item C']"
header="Please locate the centers of each item."
name="annotatedResult">
<short-instructions>
Describe your task briefly here and give examples
</short-instructions>
<full-instructions>
Give additional instructions and good/bad examples here
</full-instructions>
</crowd-keypoint>
</crowd-form>

<script>
var num_obj = 1;

document.querySelector('crowd-form').onsubmit = function(e) {
const keypoints = document.querySelector('crowd-keypoint').value.keypoints ||
document.querySelector('crowd-keypoint')._submittableValue.keypoints;
const labels = keypoints.map(function(p) {
return p.label;
});

// 1. Make sure total number of keypoints is correct.
var original_num_labels = document.getElementsByTagName("crowd-keypoint")[0].getAttribute("labels");
original_num_labels = original_num_labels.substring(2, original_num_labels.length - 2).split("\",\"");
var goalNumKeypoints = num_obj*original_num_labels.length;
if (keypoints.length != goalNumKeypoints) {
e.preventDefault();
errorBox.innerHTML = '<crowd-alert type="error" dismissible>You must add all keypoint
annotations and use each label only once.</crowd-alert>';
errorBox.scrollIntoView();
return;
}

// 2. Make sure all labels are unique.
labelCounts = {};
for (var i = 0; i < labels.length; i++) {
if (!labelCounts[labels[i]]) {
labelCounts[labels[i]] = 0;
}
labelCounts[labels[i]]++;
}
const goalNumSingleLabel = num_obj;

const numLabels = Object.keys(labelCounts).length;

Object.entries(labelCounts).forEach(entry => {
if (entry[1] != goalNumSingleLabel) {
e.preventDefault();
errorBox.innerHTML = '<crowd-alert type="error" dismissible>You must use each label
only once.</crowd-alert>';
errorBox.scrollIntoView();
}
})
};
</script>

Attributes
The following attributes are supported by this element.

header

The text to display above the image. This is typically a question or simple instruction for the worker.

initial-value

An array, in JSON format, of keypoints to be applied to the image on start. For example:

initial-value="[
  {
    'label': 'Left Eye',
    'x': 1022,
    'y': 429
  },
  {
    'label': 'Beak',
    'x': 941,
    'y': 403
  }
]"


Note
Label values used in this attribute must have a matching value in the labels
attribute, or the point will not be rendered.
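For example, in the following sketch (the image URL is hypothetical) both initial-value labels match entries in the labels attribute, so both points render:

<crowd-keypoint
  name="annotatedResult"
  src="https://example.com/bird.jpg"
  header="Please locate the requested points."
  labels="['Left Eye', 'Beak']"
  initial-value="[{ 'label': 'Left Eye', 'x': 1022, 'y': 429 }, { 'label': 'Beak', 'x': 941, 'y': 403 }]">
  <short-instructions>Place each point on its matching feature.</short-instructions>
  <full-instructions>Place each labeled point on its matching feature in the image.</full-instructions>
</crowd-keypoint>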

labels

An array, in JSON format, of strings to be used as keypoint annotation labels.

name

A string used to identify the answer submitted by the worker. This value will match a key in the JSON
object that specifies the answer.

src

The source URI of the image to be annotated.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: full-instructions (p. 929), short-instructions (p. 929)

Regions
The following regions are required by this element.

full-instructions

General instructions about how to annotate the image.

short-instructions

Important task-specific instructions that are displayed in a prominent place.

Output
The following output is supported by this element.

inputImageProperties

A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.

• height – The height, in pixels, of the image.


• width – The width, in pixels, of the image.

keypoints

An array of JSON objects containing the coordinates and label of a keypoint. Each object contains the
following properties.

• label – The assigned label for the keypoint.


• x – The X coordinate, in pixels, of the keypoint on the image.


• y – The Y coordinate, in pixels, of the keypoint on the image.

Note
X and Y coordinates are based on 0,0 being the top left corner of the image.

Example : Sample Element Outputs

The following is a sample output from using this element.

[
{
"crowdKeypoint": {
"inputImageProperties": {
"height": 1314,
"width": 962
},
"keypoints": [
{
"label": "dog",
"x": 155,
"y": 275
},
{
"label": "cat",
"x": 341,
"y": 447
},
{
"label": "cat",
"x": 491,
"y": 513
},
{
"label": "dog",
"x": 714,
"y": 578
},
{
"label": "cat",
"x": 712,
"y": 763
},
{
"label": "cat",
"x": 397,
"y": 814
}
]
}
}
]

You may have many labels available, but only the ones that are used appear in the output.

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)


crowd-line
A widget for drawing lines on an image. Each line is associated with a label, and output data will report
the starting and ending points of each line.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-line> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template. For more examples, see this GitHub repository.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-line
name="crowdLine"
src="{{ task.input.taskObject | grant_read_access }}"
header="Add header here to describe the task"
labels="['car','pedestrian','street car']"
>
<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>Draw a line on each object that the label applies to.</p>
</short-instructions>

<full-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>Draw a line along each object that the label applies to.
Make sure that the line does not extend beyond the boundaries
of the object.
</p>
<p>Each line is defined by a starting and ending point. Carefully
place the starting and ending points on the boundaries of the object.</p>
</full-instructions>

</crowd-line>
</crowd-form>

Attributes
The following attributes are supported by this element.

header
Optional. The text to display above the image. This is typically a question or simple instruction for the
worker.

initial-value
Optional. An array of JSON objects, each of which sets a line when the component is loaded. Each JSON
object in the array contains the following properties:

• label – The text assigned to the line as part of the labeling task. This text must match one of the labels
defined in the labels attribute of the <crowd-line> element.
• vertices – The x and y pixel coordinates of the start point and end point of the line, relative to the
top-left corner of the image.

initial-value="{
lines: [
{
label: 'sideline', // label of this line annotation
vertices:[ // an array of vertices which decide the position of the line
{
x: 84,
y: 110
},
{
x: 60,
y: 100
}
]
},
{
label: 'yardline',
vertices:[
{
x: 651,
y: 498
},
{
x: 862,
y: 869
}
]
}
]
}"

Lines set via the initial-value property can be adjusted. Whether or not a worker answer was
adjusted is tracked via an initialValueModified boolean in the worker answer output.
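For example, a worker answer for an adjustment task might take a shape like the following (a sketch; the element name crowdLine matches the example above, and the exact placement of the flag is illustrative):

{
  "crowdLine": {
    "initialValueModified": true,
    "inputImageProperties": {
      "height": 1254,
      "width": 2048
    },
    "lines": [
      {
        "label": "sideline",
        "vertices": [
          { "x": 84, "y": 110 },
          { "x": 60, "y": 100 }
        ]
      }
    ]
  }
}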

labels

Required. A JSON formatted array of strings, each of which is a label that a worker can assign to the line.

Limit: 10 labels

label-colors

Optional. An array of strings. Each string is a hexadecimal (hex) code for a label.
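For example, a sketch with illustrative hex values, one per label:

label-colors="['#d62728', '#2ca02c', '#1f77b4']"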

name

Required. The name of this widget. It's used as a key for the widget's input in the form output.

src

Required. The URL of the image on which to draw lines.

Regions
The following regions are required by this element.

full-instructions

General instructions about how to draw lines.

short-instructions

Important task-specific instructions that are displayed in a prominent place.


Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: short-instructions (p. 932), full-instructions (p. 932)

Output
inputImageProperties

A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.

• height – The height, in pixels, of the image.


• width – The width, in pixels, of the image.

lines

A JSON Array containing objects with the line labels and vertices.

• label – The label given to a line.


• vertices – The x and y pixel coordinates of the start point and end point of the line, relative to the
top-left corner of the image.

Example : Sample Element Outputs

The following is an example of output from this element.

{
"crowdLine": { //This is the name you set for the crowd-line
"inputImageProperties": {
"height": 1254,
"width": 2048
},
"lines": [
{
"label": "yardline",
"vertices": [
{
"x": 58,
"y": 295
},
{
"x": 1342,
"y": 398
}
]
},
{
"label": "sideline",
"vertices": [
{
"x": 472,
"y": 910
},
{
"x": 1480,
"y": 600
}
]
}
]
}
}

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-modal
A small window that pops up on the display when it is opened.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of the syntax that you can use with the <crowd-modal> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-modal
link-text = "See Examples"
link-type = "button">
Example Modal Text</crowd-modal>

Attributes
The following attributes are supported by this element.

link-text

The text to display for opening the modal. The default is "Click to open modal".

link-type

A string that specifies the type of trigger for the modal. The possible values are "link" (default) and
"button".

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.


• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-polygon
A widget for drawing polygons on an image and assigning a label to the portion of the image that is
enclosed in each polygon.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-polygon> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-polygon
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Draw a polygon around each of the requested target(s) of interest"
labels="['Cat', 'Dog', 'Bird']"
>
<full-instructions header="Polygon instructions">
<ul>
<li>Make the polygon tight around the object</li>
<li>You need to select a label before starting a polygon</li>
<li>You will need to select a label again after completing a polygon</li>
<li>To select a polygon, you can click on its borders</li>
<li>You can start drawing a polygon from inside another polygon</li>
<li>You can undo and redo while you're drawing a polygon to go back and forth
between points you've placed</li>
<li>You are prevented from drawing lines that overlap other lines from the same
polygon</li>
</ul>
</full-instructions>

<short-instructions>
<p>Draw a polygon around each of the requested target(s) of interest</p>
<p>Make the polygon tight around the object</p>
</short-instructions>
</crowd-polygon>
</crowd-form>

Attributes
The following attributes are supported by this element.

header

The text to display above the image. This is typically a question or simple instruction for the worker.

labels

A JSON formatted array of strings, each of which is a label that a worker can assign to the image portion
enclosed by a polygon.

name

The name of this widget. It's used as a key for the widget's input in the form output.


src
The URL of the image on which to draw polygons.

initial-value
An array of JSON objects, each of which defines a polygon to be drawn when the component is loaded.
Each JSON object in the array contains the following properties.

• label – The text assigned to the polygon as part of the labeling task. This text must match one of the
labels defined in the labels attribute of the <crowd-polygon> element.
• vertices – An array of JSON objects. Each object contains an x and y coordinate value for a point in the
polygon.

Example
An initial-value attribute might look something like this.

initial-value =
'[
{
"label": "dog",
"vertices":
[
{
"x": 570,
"y": 239
},
...
{
"x": 759,
"y": 281
}
]
}
]'

Because this will be within an HTML element, the JSON array must be enclosed in single or double
quotes. The example above uses single quotes to encapsulate the JSON and double quotes within the
JSON itself. If you must mix single and double quotes inside your JSON, replace them with their HTML
entity codes (&quot; for double quote, &#39; for single) to safely escape them.
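For example, the following sketch (label text and coordinates are hypothetical) uses &quot; to place a literal double quote inside a label without breaking the JSON:

initial-value='[
  {
    "label": "6&quot; ruler",
    "vertices": [
      { "x": 10, "y": 20 },
      { "x": 80, "y": 20 },
      { "x": 45, "y": 90 }
    ]
  }
]'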

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: full-instructions (p. 936), short-instructions (p. 936)

Regions
The following regions are required.

full-instructions
General instructions about how to draw polygons.

short-instructions
Important task-specific instructions that are displayed in a prominent place.


Output
The following output is supported by this element.

polygons
An array of JSON objects, each of which describes a polygon that has been created by the worker. Each
JSON object in the array contains the following properties.

• label – The text assigned to the polygon as part of the labeling task.
• vertices – An array of JSON objects. Each object contains an x and y coordinate value for a point in the
polygon. The top left corner of the image is 0,0.

inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.

• height – The height, in pixels, of the image.


• width – The width, in pixels, of the image.

Example : Sample Element Outputs


The following are samples of outputs from common use scenarios for this element.

Single Label, Single Polygon

{
"annotatedResult":
{
"inputImageProperties": {
"height": 853,
"width": 1280
},
"polygons":
[
{
"label": "dog",
"vertices":
[
{
"x": 570,
"y": 239
},
{
"x": 603,
"y": 513
},
{
"x": 823,
"y": 645
},
{
"x": 901,
"y": 417
},
{
"x": 759,
"y": 281
}
]
}
]
}
}

Single Label, Multiple Polygons

[
{
"annotatedResult": {
"inputImageProperties": {
"height": 853,
"width": 1280
},
"polygons": [
{
"label": "dog",
"vertices": [
{
"x": 570,
"y": 239
},
{
"x": 603,
"y": 513
},
{
"x": 823,
"y": 645
},
{
"x": 901,
"y": 417
},
{
"x": 759,
"y": 281
}
]
},
{
"label": "dog",
"vertices": [
{
"x": 870,
"y": 278
},
{
"x": 908,
"y": 446
},
{
"x": 1009,
"y": 602
},
{
"x": 1116,
"y": 519
},
{
"x": 1174,
"y": 498
},
{
"x": 1227,
"y": 479
},
{
"x": 1179,
"y": 405
},
{
"x": 1179,
"y": 337
}
]
}
]
}
}
]

Multiple Labels, Multiple Polygons

[
{
"annotatedResult": {
"inputImageProperties": {
"height": 853,
"width": 1280
},
"polygons": [
{
"label": "dog",
"vertices": [
{
"x": 570,
"y": 239
},
{
"x": 603,
"y": 513
},
{
"x": 823,
"y": 645
},
{
"x": 901,
"y": 417
},
{
"x": 759,
"y": 281
}
]
},
{
"label": "cat",
"vertices": [
{
"x": 870,
"y": 278
},
{
"x": 908,
"y": 446
},
{
"x": 1009,
"y": 602
},
{
"x": 1116,
"y": 519
},
{
"x": 1174,
"y": 498
},
{
"x": 1227,
"y": 479
},
{
"x": 1179,
"y": 405
},
{
"x": 1179,
"y": 337
}
]
}
]
}
}
]

You could have many labels available, but only the ones that are used appear in the output.

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-polyline
A widget for drawing polylines or lines on an image. Each polyline is associated with a label and can
include two or more vertices. A polyline can intersect itself and its starting and ending points can be
placed anywhere on the image.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-polyline> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template. For more examples, see this GitHub repository.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-polyline
name="crowdPolyline"
src="{{ task.input.taskObject | grant_read_access }}"
header="Add header here to describe the task"
labels="['car','pedestrian','street car']"
>
<full-instructions>


<p>Read the task carefully and inspect the image.</p>


<p>Choose the appropriate label that best suits the image.</p>
<p>Draw a polyline around the boundaries of all objects
that the label applies to.</p>
<p>Use the <b>Enter</b> key to complete a polyline.</p>
<p>Make sure that the polyline fits tightly around the boundary
of the object.</p>
</full-instructions>

<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Review the tool guide to learn how to use the polyline tool.</p>
<p>Choose the appropriate label that best suits the image.</p>
<p>To draw a polyline, select a label that applies to an object of interest
and add a single point to the photo by clicking on that point. Continue to
draw the polyline around the object by adding additional points
around the object boundary.</p>
<p>After you place the final point on the polyline, press <b>Enter</b> on your
keyboard to complete the polyline.</p>

</short-instructions>
</crowd-polyline>
</crowd-form>

Attributes
The following attributes are supported by this element.

header
Optional. The text to display above the image. This is typically a question or simple instruction for the
worker.

initial-value
Optional. An array of JSON objects, each of which sets a polyline when the component is loaded. Each
JSON object in the array contains the following properties:

• label – The text assigned to the polyline as part of the labeling task. This text must match one of the
labels defined in the labels attribute of the <crowd-polyline> element.
• vertices – The x and y pixel coordinates of the vertices of a polyline, relative to the top-left corner of
the image.

initial-value= "{
polylines: [
{
label: 'sideline', // label of this line annotation
vertices:[ // an array of vertices which decide the position of the line
{
x: 84,
y: 110
},
{
x: 60,
y: 100
}
]
},
{
label: 'yardline',
vertices:[
{
x: 651,
y: 498
},
{
x: 862,
y: 869
},
{
x: 1000,
y: 869
}
]
}
]
}"

Polylines set via the initial-value property can be adjusted. Whether or not a worker answer was
adjusted is tracked via an initialValueModified boolean in the worker answer output.

labels
Required. A JSON formatted array of strings, each of which is a label that a worker can assign to the line.

Limit: 10 labels

label-colors
Optional. An array of strings. Each string is a hexadecimal (hex) code for a label.

name
Required. The name of this widget. It's used as a key for the widget's input in the form output.

src
Required. The URL of the image on which to draw polylines.

Regions
The following regions are required by this element.

full-instructions
General instructions about how to draw polylines.

short-instructions
Important task-specific instructions that are displayed in a prominent place.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: short-instructions (p. 942), full-instructions (p. 942)

Output
inputImageProperties
A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.


• height – The height, in pixels, of the image.


• width – The width, in pixels, of the image.

polylines

A JSON Array containing objects with polylines' labels and vertices.

• label – The label given to a line.


• vertices – The x and y pixel coordinates of the vertices of a polyline, relative to the top-left corner of
the image.

Example : Sample Element Outputs

The following is an example of output from this element.

{
"crowdPolyline": { //This is the name you set for the crowd-polyline
"inputImageProperties": {
"height": 1254,
"width": 2048
},
"polylines": [
{
"label": "sideline",
"vertices": [
{
"x": 651,
"y": 498
},
{
"x": 862,
"y": 869
},
{
"x": 1449,
"y": 611
}
]
},
{
"label": "yardline",
"vertices": [
{
"x": 1148,
"y": 322
},
{
"x": 1705,
"y": 474
},
{
"x": 1755,
"y": 474
}
]
}
]
}
}


See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-radio-button
A button that can be either checked or unchecked. When radio buttons are inside a radio group, exactly
one radio button in the group can be checked at any time. The following is an example of how to
configure a crowd-radio-button element inside of a crowd-radio-group element.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of the syntax that you can use with the <crowd-radio-button> element.
Copy the following code and save it in a file with the extension .html. Open the file in any browser to
preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-radio-group>
<crowd-radio-button name="tech" value="tech">Technology</crowd-radio-button>
<crowd-radio-button name="politics" value="politics">Politics</crowd-radio-button>
</crowd-radio-group>
</crowd-form>

The previous example can be seen in a custom worker task template in this GitHub example: entity
recognition labeling job custom template.

Crowd HTML Element radio buttons do not support the HTML required attribute. To make a radio button
selection required, use <input type="radio"> elements to create radio buttons and add the
required attribute. The name attribute for all <input> elements that belong to the same group of radio
buttons must be the same. For example, the following template requires the user to select a radio button
in the animal-type group before submitting.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<p>Select an animal type:</p>
<img src="https://images.unsplash.com/photo-1537151608828-ea2b11777ee8?
ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1539&q=80" style="height:
500; width: 400;"/>
<br><br>
<div>
<input type="radio" id="cat" name="animal-type" value="cat" required>
<label for="cat">Cat</label>
</div>
<div>
<input type="radio" id="dog" name="animal-type" value="dog">
<label for="dog">Dog</label>
</div>
<div>
<input type="radio" id="unknown" name="animal-type" value="unknown">
<label for="unknown">Unknown</label>
</div>
<full-instructions header="Classification Instructions">
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</full-instructions>

<short-instructions>
<p>Read the task carefully and inspect the image.</p>
<p>Choose the appropriate label that best suits the image.</p>
</short-instructions>
</crowd-form>

Attributes
The following attributes are supported by this element.

checked
A Boolean switch that, if present, displays the radio button as checked.

disabled
A Boolean switch that, if present, displays the button as disabled and prevents it from being checked.

name
A string that is used to identify the answer submitted by the worker. This value will match a key in the
JSON object that specifies the answer.
Note
If you use the buttons outside of a crowd-radio-group (p. 946) element, but with the same
name string and different value strings, the name object in the output will contain a Boolean
value for each value string. To ensure that only one button in a group is selected, make them
children of a crowd-radio-group (p. 946) element and use different name values.

value
A property name for the element's boolean value. If not specified, it uses "on" as the default, e.g.
{ "<name>": { "<value>": <true or false> } }.
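For example, a sketch of a button with no value attribute (the name is illustrative):

<crowd-radio-button name="agree">I agree</crowd-radio-button>

If the worker checks this button, the output takes the default form { "agree": { "on": true } }.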

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-radio-group (p. 946)


• Child elements: none

Output
Outputs an object with the following pattern: { "<name>": { "<value>": <true or false> } }.
If you use the buttons outside of a crowd-radio-group (p. 946) element, but with the same name
string and different value strings, the name object will contain a Boolean value for each value
string. To ensure that only one in a group of buttons is selected, make them children of a crowd-radio-
group (p. 946) element and use different name values.

Example Sample output of this element

[
  {
    "btn1": {
      "yes": true
    },
    "btn2": {
      "no": false
    }
  }
]


See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-radio-group
A group of radio buttons. Only one radio button within the group can be selected. Choosing one radio
button clears any previously chosen radio button within the same group. For an example of a custom UI
template that uses the crowd-radio-group element, see this entity recognition labeling job custom
template.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of the syntax that you can use with the <crowd-radio-group> element.
Copy the following code and save it in a file with the extension .html. Open the file in any browser to
preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<style>
body {
padding-left: 20px;
margin-bottom: 20px;
}
#outer-container {
display: flex;
justify-content: space-around;
max-width: 900px;
margin-left: 100px;
}
.left-container {
margin-right: auto;
padding-right: 50px;
}
.right-container {
margin-left: auto;
padding-left: 50px;
}
#vertical-separator {
border: solid 1px #d5dbdb;
}
</style>

<crowd-form>
<div>
<h1>Instructions</h1>
Lorem ipsum...
</div>
<div>
<h2>Background</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
</div>
<div id="outer-container">
<span class="left-container">
<h2>Option 1</h2>
<p>Nulla facilisi morbi tempus iaculis urna. Orci dapibus ultrices in iaculis nunc sed
augue lacus.</p>
</span>
<span id="vertical-separator"></span>
<span class="right-container">
<h2>Option 2</h2>
<p>Ultrices vitae auctor eu augue ut. Pellentesque massa placerat duis ultricies lacus
sed turpis tincidunt id.</p>
</span>
</div>
<div>
<h2>Question</h2>
<p>Which do you agree with?</p>
<crowd-radio-group>
<crowd-radio-button name="option1" value="Option 1">Option 1</crowd-radio-button>
<crowd-radio-button name="option2" value="Option 2">Option 2</crowd-radio-button>
</crowd-radio-group>

<p>Why did you choose this answer?</p>


<crowd-text-area name="explanation" placeholder="Explain how you reached your
conclusion..."></crowd-text-area>
</div>
</crowd-form>

Attributes
No special attributes are supported by this element.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: crowd-radio-button (p. 944)

Output
Outputs an array of objects representing the crowd-radio-button (p. 944) elements within it.

Example Sample of Element Output

[
{
"btn1": {
"yes": true
},
"btn2": {
"no": false
}
}
]

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)


crowd-semantic-segmentation
A widget for segmenting an image and assigning a label to each image segment.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-semantic-segmentation>
element. Copy the following code and save it in a file with the extension .html. Open the file in any
browser to preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-semantic-segmentation
name="annotatedResult"
src="{{ task.input.taskObject | grant_read_access }}"
header="Please label each of the requested objects in this image"
labels="['Cat', 'Dog', 'Bird']"
>
<full-instructions header="Segmentation Instructions">
<ol>
<li><strong>Read</strong> the task carefully and inspect the image.</li>
<li><strong>Read</strong> the options and review the examples provided to
understand more about the labels.</li>
<li><strong>Choose</strong> the appropriate label that best suits the image.</li>
</ol>
</full-instructions>

<short-instructions>
<p>Use the tools to label the requested items in the image</p>
</short-instructions>
</crowd-semantic-segmentation>
</crowd-form>

Attributes
The following attributes are supported by this element.

header

The text to display above the image. This is typically a question or simple instruction for the worker.

initial-value

A JSON object containing the color mappings of a prior semantic segmentation job and a link to the
overlay image output by the prior job. Include this when you want a human worker to verify the results
of a prior labeling job and adjust it if necessary.

The attribute would appear as follows:

initial-value='{
"labelMappings": {
"Bird": {
"color": "#ff7f0e"
},
"Cat": {
"color": "#2ca02c"
},
"Cow": {
"color": "#d62728"
},
"Dog": {
"color": "#1f77b4"
}
},
"src": {{ "S3 file URL for image" | grant_read_access }}
}'

When using Ground Truth built-in task types with annotation consolidation (where more than one worker
labels a single image), label mappings are included in individual worker output records; however, the
overall result is represented as the internal-color-map in the consolidated results.

You can convert the internal-color-map to label-mappings in a custom template using the Liquid
templating language:

initial-value="{
'src' : '{{ task.input.manifestLine.label-attribute-name-from-prior-job|
grant_read_access }}',
'labelMappings': {
{% for box in task.input.manifestLine.label-attribute-name-from-prior-job-
metadata.internal-color-map %}
{% if box[1]['class-name'] != 'BACKGROUND' %}
{{ box[1]['class-name'] | to_json }}: {
'color': {{ box[1]['hex-color'] | to_json }}
},
{% endif %}
{% endfor %}
}
}"

labels
A JSON formatted array of strings, each of which is a label that a worker can assign to a segment of the
image.

name
The name of this widget. It is used as a key for the widget's input in the form output.

src
The URL of the image that is to be segmented.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: full-instructions (p. 949), short-instructions (p. 949)

Regions
The following regions are supported by this element.

full-instructions
General instructions about how to do image segmentation.

short-instructions
Important task-specific instructions that are displayed in a prominent place.


Output
The following output is supported by this element.

labeledImage

A JSON Object containing a Base64 encoded PNG of the labels.

labelMappings

A JSON Object containing objects named with the segmentation labels.

• color – The hexadecimal value of the label's RGB color in the labeledImage PNG.

initialValueModified

A boolean representing whether the initial values have been modified. This is only included when the
output is from an adjustment task.

inputImageProperties

A JSON object that specifies the dimensions of the image that is being annotated by the worker. This
object contains the following properties.

• height – The height, in pixels, of the image.


• width – The width, in pixels, of the image.

Example : Sample Element Outputs

The following is a sample of output from this element.

[
{
"annotatedResult": {
"inputImageProperties": {
"height": 533,
"width": 800
},
"labelMappings": {
"<Label 2>": {
"color": "#ff7f0e"
},
"<label 3>": {
"color": "#2ca02c"
},
"<label 1>": {
"color": "#1f77b4"
}
},
"labeledImage": {
"pngImageData": "<Base-64 Encoded Data>"
}
}
}
]

See Also
For more information, see the following.


• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-slider
A bar with a sliding knob that allows a worker to select a value from a range of values by moving
the knob. The slider is a great choice for settings that reflect intensity levels, such as volume,
brightness, or color saturation.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a survey template that uses the <crowd-slider> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-instructions link-text="View instructions" link-type="button">
<short-summary>
<p>Provide a brief instruction here</p>
</short-summary>

<detailed-instructions>
<h3>Provide more detailed instructions here</h3>
<p>Include additional information</p>
</detailed-instructions>

<positive-example>
<p>Provide an example of a good answer here</p>
<p>Explain why it's a good answer</p>
</positive-example>

<negative-example>
<p>Provide an example of a bad answer here</p>
<p>Explain why it's a bad answer</p>
</negative-example>
</crowd-instructions>

<div>
<p>What is your favorite color for a bird?</p>
<crowd-input name="favoriteColor" placeholder="example: pink" required></crowd-input>
</div>

<div>
<p>Check this box if you like birds</p>
<crowd-checkbox name="likeBirds" checked="true" required></crowd-checkbox>
</div>

<div>
<p>On a scale of 1-10, how much do you like birds?</p>
<crowd-slider name="howMuch" min="1" max="10" step="1" pin="true" required></crowd-slider>
</div>

<div>
<p>Write a short essay describing your favorite bird</p>
<crowd-text-area name="essay" rows="4" placeholder="Lorem ipsum..." required></crowd-text-area>
</div>
</crowd-form>


Attributes
The following attributes are supported by this element.

disabled

A Boolean switch that, if present, displays the slider as disabled.

editable

A Boolean switch that, if present, displays an up/down button that can be chosen to select the value.

Selecting the value via the up/down button is an alternative to selecting the value by moving the knob
on the slider. The knob on the slider will move synchronously with the up/down button choices.

max

A number that specifies the maximum value on the slider.

min

A number that specifies the minimum value on the slider.

name

A string that is used to identify the answer submitted by the worker. This value will match a key in the
JSON object that specifies the answer.

pin

A Boolean switch that, if present, displays the current value above the knob as the knob is moved.

required

A Boolean switch that, if present, requires the worker to provide input.

secondary-progress

When used with a crowd-slider-secondary-color CSS attribute, the progress bar is colored
to the point represented by the secondary-progress. For example, if this was representing the
progress on a streaming video, the value would represent where the viewer was in the video timeline.
The secondary-progress value would represent the point on the timeline to which the video had
buffered.
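For example, a sketch (names and numbers are illustrative) of a slider representing playback position, with buffered content shown as secondary progress:

<crowd-slider name="playbackPosition" min="0" max="100" value="40" secondary-progress="60" pin="true"></crowd-slider>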

step

A number that specifies the difference between selectable values on the slider.

value

A preset that becomes the default if the worker does not provide input.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none


See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-tab
A component styled to look like a tab with information below.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example template that uses the <crowd-tab> element. Copy the following code and
save it in a file with the extension .html. Open the file in any browser to preview and interact with this
template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-tabs>
<crowd-tab header="Tab 1">
<h2>Image</h2>

<img
src="https://images.unsplash.com/photo-1478382188900-5bb598fe27d3?
ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1351&q=80"
style="max-width: 40%"
>

<h2>Text</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.
</p>
<p>
Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas sed
sed risus.
</p>
</crowd-tab>

<crowd-tab header="Tab 2">


<h2>Description</h2>
<p>
Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas sed
sed risus.
</p>
</crowd-tab>

<crowd-tab header="Tab 3">


<div style="width: 40%; display: inline-block">
<img
src="https://images.unsplash.com/photo-1472747459646-91fd6f13995f?
ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1350&q=80"
style="max-width: 80%"
>
<crowd-input label="Input inside tab" name="inputInsideTab"></crowd-input>
<input type="checkbox" name="checkbox" value="foo">Foo
<input type="checkbox" name="checkbox" value="bar">Bar
<crowd-button>Some button</crowd-button>
</div>


<div style="width: 40%; display: inline-block; vertical-align: top">


Lorem ipsum dolor sit amet, lorem a wisi nibh, in pulvinar, consequat praesent
vestibulum tellus ante felis auctor, vitae lobortis dictumst mauris.
Pellentesque nulla ipsum ante quisque quam augue.
Class lacus id euismod, blandit tempor mauris quisque tortor mauris, urna gravida
nullam pede libero, ut suscipit orci faucibus lacus varius ornare, pellentesque ipsum.
At etiam suspendisse est elementum luctus netus, vel sem nulla sodales, potenti
magna enim ipsum diam tortor rutrum,
quam donec massa elit ac, nam adipiscing sed at leo ipsum consectetuer. Ac turpis
amet wisi, porttitor sint lacus ante, turpis accusantium, ac maecenas deleniti,
nisl leo sem integer ac dignissim. Lobortis etiam luctus lectus odio auctor. Justo
vitae, felis integer id, bibendum accumsan turpis eu est mus eros, ante id eros.
</div>
</crowd-tab>

</crowd-tabs>

<crowd-input label="Input outside tabs" name="inputOutsideTab"></crowd-input>

<short-instructions>
<p>Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas
sed sed risus.</p>
</short-instructions>

<full-instructions header="Classification Instructions">


<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.</p>
<p> Tempus egestas sed sed risus.</p>
</full-instructions>

</crowd-form>

Attributes
The following attributes are supported by this element.

header

The text appearing on the tab. This is usually some short descriptive name indicative of the information
contained below the tab.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-tabs (p. 954)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-tabs
A container for tabbed information.


See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example template that uses the <crowd-tabs> element. Copy the following code
and save it in a file with the extension .html. Open the file in any browser to preview and interact with
this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<crowd-tabs>
<crowd-tab header="Tab 1">
<h2>Image</h2>

<img
src="https://images.unsplash.com/photo-1478382188900-5bb598fe27d3?
ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1351&q=80"
style="max-width: 40%"
>

<h2>Text</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.
</p>
<p>
Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas sed
sed risus.
</p>
</crowd-tab>

<crowd-tab header="Tab 2">


<h2>Description</h2>
<p>
Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas sed
sed risus.
</p>
</crowd-tab>

<crowd-tab header="Tab 3">


<div style="width: 40%; display: inline-block">
<img
src="https://images.unsplash.com/photo-1472747459646-91fd6f13995f?
ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1350&q=80"
style="max-width: 80%"
>
<crowd-input label="Input inside tab" name="inputInsideTab"></crowd-input>
<input type="checkbox" name="checkbox" value="foo">Foo
<input type="checkbox" name="checkbox" value="bar">Bar
<crowd-button>Some button</crowd-button>
</div>

<div style="width: 40%; display: inline-block; vertical-align: top">


Lorem ipsum dolor sit amet, lorem a wisi nibh, in pulvinar, consequat praesent
vestibulum tellus ante felis auctor, vitae lobortis dictumst mauris.
Pellentesque nulla ipsum ante quisque quam augue.
Class lacus id euismod, blandit tempor mauris quisque tortor mauris, urna gravida
nullam pede libero, ut suscipit orci faucibus lacus varius ornare, pellentesque ipsum.
At etiam suspendisse est elementum luctus netus, vel sem nulla sodales, potenti
magna enim ipsum diam tortor rutrum,
quam donec massa elit ac, nam adipiscing sed at leo ipsum consectetuer. Ac turpis
amet wisi, porttitor sint lacus ante, turpis accusantium, ac maecenas deleniti,
nisl leo sem integer ac dignissim. Lobortis etiam luctus lectus odio auctor. Justo
vitae, felis integer id, bibendum accumsan turpis eu est mus eros, ante id eros.
</div>
</crowd-tab>


</crowd-tabs>

<crowd-input label="Input outside tabs" name="inputOutsideTab"></crowd-input>

<short-instructions>
<p>Sed risus ultricies tristique nulla aliquet enim tortor at auctor. Tempus egestas
sed sed risus.</p>
</short-instructions>

<full-instructions header="Classification Instructions">


<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.</p>
<p> Tempus egestas sed sed risus.</p>
</full-instructions>

</crowd-form>

Attributes
This element has no attributes.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: crowd-tab (p. 953)

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-text-area
A field for text input.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template designed to transcribe audio clips that uses the
<crowd-text-area> element. Copy the following code and save it in a file with the extension .html.
Open the file in any browser to preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<audio controls>
<source src="{{ task.input.taskObject | grant_read_access }}" type="audio/mpeg">
Your browser does not support the audio element.
</audio>
<h3>Instructions</h3>
<p>Transcribe the audio</p>
<p>Ignore "umms", "hmms", "uhs" and other non-textual phrases</p>
<crowd-text-area name="transcription" rows="4"></crowd-text-area>
</crowd-form>


Attributes
The following attributes are supported by this element.

allowed-pattern
A regular expression that is used with the auto-validate attribute to ignore non-matching characters as
the worker types.

auto-focus
A Boolean switch that, if present, puts the cursor in this element on-load so that users can immediately
begin typing without having to click inside the element.

auto-validate
A Boolean switch that, if present, turns on input validation. The behavior of the validator can be
modified by the error-message and allowed-pattern attributes.

char-counter
A Boolean switch that, if present, puts a small text field beneath the lower-right corner of the element,
displaying the number of characters inside the element.

disabled
A Boolean switch that, if present, displays the input area as disabled.

error-message
The text to be displayed below the input field, on the left side, if validation fails.

label
A string that is displayed inside a text field.

This text shrinks and rises up above a text field when the worker starts typing in the field or when the
value attribute is set.

max-length
An integer that specifies the maximum number of characters allowed by the element. Characters typed
or pasted beyond the maximum are ignored.

max-rows
An integer that specifies the maximum number of rows of text that are allowed within a crowd-text-area.
Normally the element expands to accommodate new rows. If this is set, after the number of rows
exceeds it, content scrolls upward out of view and a scrollbar control appears.

name
A string used to represent the element's data in the output.

placeholder
A string presented to the user as placeholder text. It disappears after the user puts something in the
input area.

rows
An integer that specifies the height of the element in rows of text.

957
Amazon SageMaker Developer Guide
SageMaker Crowd HTML Elements

value

A preset that becomes the default if the worker does not provide input. The preset appears in a text field.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

Output
This element outputs the name as a property name and the element's text contents as the value.
Carriage returns in the text are represented as \n.

Example Sample output for this element

[
{
"textInput1": "This is the text; the text that\nmakes the crowd go wild."
}
]
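To show how this output might be consumed downstream, the following is a minimal Python sketch that reads the sample answers above. The file name answers.json is a hypothetical stand-in for wherever your consolidated annotations land; it is not part of the Ground Truth output specification.

import json

# Load consolidated worker answers; "answers.json" is a hypothetical file
# holding the sample output shown above.
with open("answers.json") as f:
    answers = json.load(f)

for answer in answers:
    # Each property name is the element's "name" attribute; carriage
    # returns in the submitted text are encoded as "\n".
    for name, text in answer.items():
        lines = text.split("\n")
        print(name, "->", len(lines), "line(s) of text")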

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-toast
A subtle notification that temporarily appears on the display. Only one crowd-toast is visible at a time.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following is an example of a Liquid template that uses the <crowd-toast> element. Copy the
following code and save it in a file with the extension .html. Open the file in any browser to preview
and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<p>Find the official website for: <strong>{{ task.input.company }}</strong></p>
<p>Do not give Yelp pages, LinkedIn pages, etc.</p>
<p>Include the http:// prefix from the website</p>
<crowd-input name="website" placeholder="http://example.com"></crowd-input>

<crowd-toast duration="10000" opened>


This is a message that you want users to see when opening the template. This message
will disappear in 10 seconds.
</crowd-toast>


</crowd-form>

Attributes
The following attributes are supported by this element.

duration

A number that specifies the duration, in milliseconds, that the notification appears on the screen.

text

The text to display in the notification.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

See Also
For more information, see the following.

• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

crowd-toggle-button
A button that acts as an ON/OFF switch, toggling a state.

See an interactive example of an HTML template that uses this Crowd HTML Element in CodePen.

The following example shows different ways you can use the <crowd-toggle-button> HTML
element. Copy the following code and save it in a file with the extension .html. Open the file in any
browser to preview and interact with this template.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
<!--Toggle button without value-->
<crowd-toggle-button name="toggleButtonWithoutValue"></crowd-toggle-button>

<!--Toggle button with value-->
<crowd-toggle-button name="toggleButtonWithValue" value="someValue"></crowd-toggle-button>

<!--Toggle button disabled-->
<crowd-toggle-button name="toggleButtonDisabled" disabled></crowd-toggle-button>

<!--Toggle button marked invalid-->
<crowd-toggle-button name="toggleButtonInvalid" invalid></crowd-toggle-button>

<!--Toggle button marked required-->
<crowd-toggle-button name="toggleButtonRequired" required></crowd-toggle-button>
</crowd-form>


Attributes
The following attributes are supported by this element.

checked

A Boolean switch that, if present, displays the button switched to the ON position.

disabled

A Boolean switch that, if present, displays the button as disabled and prevents toggling.

invalid

When in the off position, a button with this attribute displays in an alert color. The standard is red,
but you can change it in CSS. When toggled on, the button displays in the same color as other buttons
in the on position.

name

A string that is used to identify the answer submitted by the worker. This value matches a key in the
JSON object that specifies the answer.

required

A Boolean switch that, if present, requires the worker to provide input.

value

A value used in the output as the property name for the element's Boolean state. Defaults to "on" if not
provided.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements: crowd-form (p. 911)


• Child elements: none

Output
This element outputs the name as the name of an object that contains the value as a property name
and the element's state as the Boolean value for the property. If no value for the element is specified,
the property name defaults to "on".

Example Sample output for this element

[
{
"theToggler": {
"on": true
}
}
]
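As a minimal sketch of consuming this output, the following Python snippet flattens the nested structure above into simple name-to-Boolean pairs. The sample record is copied from this page and is not a live Ground Truth result.

# Sample record copied from the output example above.
records = [{"theToggler": {"on": True}}]

flattened = {}
for record in records:
    for name, state in record.items():
        # The inner key is the element's "value" attribute ("on" by
        # default); its Boolean is the switch position.
        [(value_key, is_on)] = state.items()
        flattened[name] = is_on

print(flattened)  # {'theToggler': True}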

See Also
For more information, see the following.


• Use Amazon SageMaker Ground Truth to Label Data (p. 526)


• Crowd HTML Elements Reference (p. 889)

Augmented AI Crowd HTML Elements


The following Crowd HTML Elements are only available for Amazon Augmented AI human workflow
tasks.

Topics
• crowd-textract-analyze-document (p. 961)
• crowd-rekognition-detect-moderation-labels (p. 964)

crowd-textract-analyze-document
A widget to enable human review of an Amazon Textract document analysis result.

Attributes
The following attributes are supported by this element.

header
This is the text that is displayed as the header.

src
This is a link to the image to be analyzed by the worker.

initialValue
This sets initial values for attributes found in the worker UI.

The following is an example of an initialValue input:

[
{
"blockType": "KEY_VALUE_SET",
"confidence": 38.43309020996094,
"geometry": {
"boundingBox": {
"width": 0.32613086700439453,
"weight": 0.0942094624042511,
"left": 0.4833833575248718,
"top": 0.5227988958358765
},
"polygon": [
{"x": 0.123, "y": 0.345}, ...
]
},
"id": "8c97b240-0969-4678-834a-646c95da9cf4",
"relationships": [
{
"type": "CHILD",
"ids": [
"7ee7b7da-ee1b-428d-a567-55a3e3affa56",
"4d6da730-ba43-467c-a9a5-c6137ba0c472"
]
},
{


"type": "VALUE",
"ids": [
"6ee7b7da-ee1b-428d-a567-55a3e3affa54"
]
}
],
"entityTypes": [
"KEY"
],
"text": "Foo bar"
},
]

blockTypes

This determines the kind of analysis the workers can do. Only KEY_VALUE_SET is currently supported.

keys

This specifies new keys and the associated text value the worker can add. The input values for keys can
include the following elements:

• importantFormKey accepts strings, and is used to specify a single key.


• importantFormKeyAliases can be used to specify aliases that are acceptable alternatives to the
keys supplied. Use this element to identify alternative spellings or presentations of your keys. This
parameter accepts a list of one or more strings.

The following is an example of an input for keys.

[
{
importantFormKey: 'Address',
importantFormKeyAliases: [
'address',
'Addr.',
'Add.',
]
},
{
importantFormKey: 'Last name',
importantFormKeyAliases: ['Surname']
}
]

no-key-edit

This prevents workers from editing the keys of annotations passed through initialValue, including
keys that have been detected in your documents. This attribute is required.

no-geometry-edit

This prevents workers from editing the polygons of annotations passed through initialValue. For
example, this would prevent the worker from editing the bounding box around a given key. This is
required.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements – crowd-form


• Child elements – full-instructions (p. 963), short-instructions (p. 963)

Regions
The following regions are supported by this element. You can use custom HTML and CSS code within
these regions to format your instructions to workers. For example, use the short-instructions
section to provide good and bad examples of how to complete a task.

full-instructions
General instructions about how to work with the widget.

short-instructions
Important task-specific instructions that are displayed in a prominent place.

Example of a Worker Template Using the crowd Element


An example of a worker template using this crowd element would look like the following.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
{% capture s3_uri %}http://s3.amazonaws.com/
{{ task.input.aiServiceRequest.document.s3Object.bucket }}/
{{ task.input.aiServiceRequest.document.s3Object.name }}{% endcapture %}

<crowd-form>
<crowd-textract-analyze-document
src="{{ s3_uri | grant_read_access }}"
initial-value="{{ task.input.selectedAiServiceResponse.blocks }}"
header="Review the key-value pairs listed on the right and correct them if they don't
match the following document."
no-key-edit
no-geometry-edit
keys="{{ task.input.humanLoopContext.importantFormKeys }}"
block-types="['KEY_VALUE_SET']"
>
<short-instructions header="Instructions">
<style>
.instructions {
white-space: pre-wrap;
}
.instructionsImage {
display: inline-block;
max-width: 100%;
}
</style>
<p class='instructions'>Click on a key-value block to highlight the corresponding
key-value pair in the document.

If it is a valid key-value pair, review the content for the value. If the content is
incorrect, correct it.

If the text of the value is incorrect, correct it.

<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/correct-value-text.png" />

If a wrong value is identified, correct it.

<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/correct-value.png" />

If it is not a valid key-value relationship, choose No.

<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/not-a-key-value-pair.png" />


If you can’t find the key in the document, choose Key not found.

<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/key-is-not-found.png" />

If the content of a field is empty, choose Value is blank.

<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/value-is-blank.png" />

<b>Examples</b>
Key and value are often displayed next to or below each other.

Key and value displayed in one line.

<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/sample-key-value-pair-1.png" />

Key and value displayed in two lines.

<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/sample-key-value-pair-2.png" />

If the content of the value has multiple lines, enter all the text without line breaks.
Include all value text even if it extends beyond the highlight box.
<img class='instructionsImage' src="https://assets.crowd.aws/images/a2i-console/multiple-lines.png" /></p>
</short-instructions>

<full-instructions header="Instructions"></full-instructions>
</crowd-textract-analyze-document>
</crowd-form>

Output
The following is a sample of the output from this element. You can find a detailed explanation of this
output in the Amazon Textract AnalyzeDocument API documentation.

{
"AWS/Textract/AnalyzeDocument/Forms/V1": {
blocks: [
{
"blockType": "KEY_VALUE_SET",
"id": "8c97b240-0969-4678-834a-646c95da9cf4",
"relationships": [
{
"type": "CHILD",
"ids": ["7ee7b7da-ee1b-428d-a567-55a3e3affa56", "4d6da730-ba43-467c-a9a5-
c6137ba0c472"]
},
{
"type": "VALUE",
"ids": ["6ee7b7da-ee1b-428d-a567-55a3e3affa54"]
}
],
"entityTypes": ["KEY"],
"text": "Foo bar baz"
}
]
}
}
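The following Python sketch shows one plausible way to pull the reviewed KEY blocks out of output shaped like the sample above. The literal dictionary is abbreviated from the sample; the top-level key is copied from it and may differ across human loop versions.

# Abbreviated copy of the sample output shown above.
output = {
    "AWS/Textract/AnalyzeDocument/Forms/V1": {
        "blocks": [
            {
                "blockType": "KEY_VALUE_SET",
                "id": "8c97b240-0969-4678-834a-646c95da9cf4",
                "entityTypes": ["KEY"],
                "text": "Foo bar baz",
            }
        ]
    }
}

blocks = output["AWS/Textract/AnalyzeDocument/Forms/V1"]["blocks"]

# Keep only the KEY side of each reviewed key-value pair.
key_texts = [
    b["text"]
    for b in blocks
    if b["blockType"] == "KEY_VALUE_SET" and "KEY" in b.get("entityTypes", [])
]
print(key_texts)  # ['Foo bar baz']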

crowd-rekognition-detect-moderation-labels
A widget to enable human review of an Amazon Rekognition image moderation result.


Attributes
The following attributes are supported by this element.

header

This is the text that is displayed as the header.

src

This is a link to the image to be analyzed by the worker.

categories

This supports categories as an array of strings or an array of objects where each object has a name
field.

If the categories come in as objects, the following applies:

• The displayed categories are the value of the name field.


• The returned answer contains the full objects of any selected categories.

If the categories come in as strings, the following applies:

• The returned answer is an array of all the strings that were selected.

exclusion-category

By setting this attribute you create a button underneath the categories in the UI.

• When a user chooses the button, all categories are deselected and disabled.
• Choosing the button again re-enables the categories so that users can choose them.
• If you submit after choosing the button, it returns an empty array.

Element Hierarchy
This element has the following parent and child elements.

• Parent elements – crowd-form


• Child elements – full-instructions (p. 965), short-instructions (p. 965)

Regions
The following regions are supported by this element. You can use custom HTML and CSS
code within these regions to format your instructions to workers. For example, use the short-
instructions section to provide good and bad examples of how to complete a task.

full-instructions

General instructions about how to work with the widget.

short-instructions

Important task-specific instructions that are displayed in a prominent place.


Example Worker Template with the crowd Element


An example of a worker template using the crowd element would look like the following.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
{% capture s3_uri %}http://s3.amazonaws.com/
{{ task.input.aiServiceRequest.image.s3Object.bucket }}/
{{ task.input.aiServiceRequest.image.s3Object.name }}{% endcapture %}

<crowd-form>
<crowd-rekognition-detect-moderation-labels
categories='[
{% for label in task.input.selectedAiServiceResponse.moderationLabels %}
{
name: "{{ label.name }}",
parentName: "{{ label.parentName }}",
},
{% endfor %}
]'
src="{{ s3_uri | grant_read_access }}"
header="Review the image and choose all applicable categories."
>
<short-instructions header="Instructions">
<style>
.instructions {
white-space: pre-wrap;
}
</style>
<p class='instructions'>Review the image and choose all applicable categories.
If no categories apply, choose None.

<b>Nudity</b>
Visuals depicting nude male or female person or persons

<b>Graphic Male Nudity</b>


Visuals depicting full frontal male nudity, often close ups

<b>Graphic Female Nudity</b>


Visuals depicting full frontal female nudity, often close ups

<b>Sexual Activity</b>
Visuals depicting various types of explicit sexual activities and pornography

<b>Illustrated Nudity or Sexual Activity</b>


Visuals depicting animated or drawn sexual activity, nudity or pornography

<b>Adult Toys</b>
Visuals depicting adult toys, often in a marketing context

<b>Female Swimwear or Underwear</b>


Visuals depicting female person wearing only swimwear or underwear

<b>Male Swimwear Or Underwear</b>


Visuals depicting male person wearing only swimwear or underwear

<b>Partial Nudity</b>
Visuals depicting covered up nudity, for example using hands or pose

<b>Revealing Clothes</b>
Visuals depicting revealing clothes and poses, such as deep cut dresses

<b>Graphic Violence or Gore</b>


Visuals depicting prominent blood or bloody injuries

<b>Physical Violence</b>


Visuals depicting violent physical assault, such as kicking or punching

<b>Weapon Violence</b>
Visuals depicting violence using weapons like firearms or blades, such as shooting

<b>Weapons</b>
Visuals depicting weapons like firearms and blades

<b>Self Injury</b>
Visuals depicting self-inflicted cutting on the body, typically in distinctive patterns
using sharp objects

<b>Emaciated Bodies</b>
Visuals depicting extremely malnourished human bodies

<b>Corpses</b>
Visuals depicting human dead bodies

<b>Hanging</b>
Visuals depicting death by hanging</p>
</short-instructions>

<full-instructions header="Instructions"></full-instructions>
</crowd-rekognition-detect-moderation-labels>
</crowd-form>

Output
The following is a sample of the output from this element. For details about this output, see Amazon
Rekognition DetectModerationLabels API documentation.

{
"AWS/Rekognition/DetectModerationLabels/Image/V3": {
"ModerationLabels": [
{ name: 'Gore', parentName: 'Violence' },
{ name: 'Corpses', parentName: 'Violence' },
]
}
}


Prepare and Analyze Datasets


You can use Amazon SageMaker Data Wrangler to import, prepare, transform, visualize, and analyze data.
You can integrate Data Wrangler into your machine learning workflows to simplify and streamline data
pre-processing and feature engineering using little to no coding. You can also add your own Python
scripts and transformations to customize your data prep workflow.

Import data from Amazon S3, Amazon Redshift, and Amazon Athena, and use Data Wrangler to create
sophisticated machine learning data prep workflows with built-in and custom data transformations and
analyses, including feature target leakage and quick modeling.

After you have defined a data prep workflow, or data flow, you can integrate it with SageMaker
Processing, SageMaker Pipelines, and SageMaker Feature Store to simplify the task of processing, sharing,
and storing ML training data. You can also export your data flow to a Python script and create a custom
ML data prep pipeline.

For more information, see Prepare ML Data with Amazon SageMaker Data Wrangler (p. 981).

For fast data preparation at scale, Amazon SageMaker Studio provides a built-in integration with
Amazon EMR. You can use SageMaker Studio to connect to, provision, or manage EMR clusters from
your notebook interface for petabyte-scale data processing, interactive analytics, and machine learning.
Amazon EMR uses open-source frameworks such as Apache Spark, Apache Hive, or Presto. For more
information about using Amazon EMR within SageMaker Studio, see Prepare data using Amazon
EMR (p. 1164).

Alternatively, you can use the Apache Spark-based serverless engine of AWS Glue interactive
sessions to aggregate and transform data from multiple sources. You can aggregate and transform
data from your analytics and ETL (extract, transform, and load) pipelines without needing to manage
infrastructure. For more information about using AWS Glue interactive sessions within SageMaker Studio,
see Prepare data using AWS Glue Interactive Sessions (p. 1192).

The data that you’re using to train your machine learning model might contain bias. Bias can result in
machine learning models that discriminate against certain individuals or groups. You can use Amazon
SageMaker Clarify to determine whether the data that you’re using to train models or your resulting
model encodes any bias. SageMaker Clarify can also help you explain models created with tabular, image,
or NLP data with partial dependence plots, feature importance, and more. For more information about
SageMaker Clarify, see Detect Pre-training Data Bias (p. 968).

Topics
• Detect Pre-training Data Bias (p. 968)
• Prepare ML Data with Amazon SageMaker Data Wrangler (p. 981)
• Prepare data at scale from Studio notebooks with Amazon EMR or AWS Glue (p. 1163)

Detect Pre-training Data Bias


Algorithmic bias, discrimination, fairness, and related topics have been studied across disciplines such
as law, policy, and computer science. A computer system might be considered biased if it discriminates
against certain individuals or groups of individuals. The machine learning models powering these
applications learn from data and this data could reflect disparities or other inherent biases. For example,
the training data may not have sufficient representation of various demographic groups or may contain
biased labels. The machine learning models trained on datasets that exhibit these biases could end
up learning them and then reproduce or even exacerbate those biases in their predictions. The field of
machine learning provides an opportunity to address biases by detecting them and measuring them at
each stage of the ML lifecycle. You can use Amazon SageMaker Clarify to determine whether data used
for training models encodes any bias.


Bias can be measured before training and after training, and monitored against baselines after deploying
models to endpoints for inference. Pre-training bias metrics are designed to detect and measure bias
in the raw data before it is used to train a model. The metrics used are model-agnostic because they do
not depend on any model outputs. However, there are different concepts of fairness that require distinct
measures of bias. Amazon SageMaker Clarify provides bias metrics to quantify various fairness criteria.

For additional information about bias metrics, see Learn How Amazon SageMaker Clarify Helps Detect
Bias and Fairness Measures for Machine Learning in Finance.

Amazon SageMaker Clarify Terms for Bias and Fairness

SageMaker Clarify uses the following terminology to discuss bias and fairness.

Feature

An individual measurable property or characteristic of a phenomenon being observed, contained in a
column for tabular data.
Label

Feature that is the target for training a machine learning model. Referred to as the observed label or
observed outcome.
Predicted label

The label as predicted by the model. Also referred to as the predicted outcome.
Sample

An observed entity described by feature values and label value, contained in a row for tabular data.
Dataset

A collection of samples.
Bias

An imbalance in the training data or the prediction behavior of the model across different groups,
such as age or income bracket. Biases can result from the data or algorithm used to train your
model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may
be less accurate when making predictions involving younger and older people.
Bias metric

A function that returns numerical values indicating the level of a potential bias.
Bias report

A collection of bias metrics for a given dataset, or a combination of a dataset and a model.
Positive label values

Label values that are favorable to a demographic group observed in a sample. In other words,
designates a sample as having a positive result.
Negative label values

Label values that are unfavorable to a demographic group observed in a sample. In other words,
designates a sample as having a negative result.
Group variable

Categorical column of the dataset that is used to form subgroups for the measurement of
Conditional Demographic Disparity (CDD). Required only for this metric with regard to Simpson’s
paradox.


Facet

A column or feature that contains the attributes with respect to which bias is measured.
Facet value

The feature values of attributes that bias might favor or disfavor.


Predicted probability

The probability, as predicted by the model, of a sample having a positive or negative outcome.

Sample Notebooks
Amazon SageMaker Clarify provides the following sample notebook for bias detection:

• Explainability and bias detection with Amazon SageMaker Clarify – Use SageMaker Clarify to create a
processing job for detecting bias and explaining model predictions with feature attributions.

This notebook has been verified to run in Amazon SageMaker Studio only. If you need instructions on
how to open a notebook in Amazon SageMaker Studio, see Create or Open an Amazon SageMaker Studio
Notebook (p. 148). If you're prompted to choose a kernel, choose Python 3 (Data Science).

Topics
• Measure Pre-training Bias (p. 970)
• Generate Reports for Bias in Pre-training Data in SageMaker Studio (p. 980)

Measure Pre-training Bias


Measuring bias in ML models is a first step to mitigating bias. Each measure of bias corresponds to
a different notion of fairness. Even considering simple concepts of fairness leads to many different
measures applicable in various contexts. For example, consider fairness with respect to age, and, for
simplicity, assume that the middle-aged group and the rest of the age groups are the two relevant demographics, referred to
as facets. In the case of an ML model for lending, we may want small business loans to be issued to equal
numbers of both demographics. Or, when processing job applicants, we may want to see equal numbers
of members of each demographic hired. However, this approach may assume that equal numbers of both
age groups apply to these jobs, so we may want to condition on the number that apply. Further, we may
want to consider not whether equal numbers apply, but whether we have equal numbers of qualified
applicants. Or, we may consider fairness to be an equal acceptance rate of qualified applicants across
both age demographics, or, an equal rejection rate of applicants, or both. You might use datasets with
different proportions of data on the attributes of interest. This imbalance can confound the bias measure
you choose. The models might be more accurate in classifying one facet than in the other. Thus, you
need to choose bias metrics that are conceptually appropriate for the application and the situation.

We use the following notation to discuss the bias metrics. The conceptual model described here is for
binary classification, where events are labeled as having only two possible outcomes in their sample
space, referred to as positive (with value 1) and negative (with value 0). This framework is usually
extensible to multicategory classification in a straightforward way or to cases involving continuous
valued outcomes when needed. In the binary classification case, positive and negative labels are assigned
to outcomes recorded in a raw dataset for a favored facet a and for a disfavored facet d. These labels y
are referred to as observed labels to distinguish them from the predicted labels y' that are assigned by
a machine learning model during the training or inference stages of the ML lifecycle. These labels are
used to define probability distributions Pa(y) and Pd(y) for their respective facet outcomes.

• labels:
  • y represents the n observed labels for event outcomes in a training dataset.
  • y' represents the predicted labels for the n observed labels in the dataset by a trained model.
• outcomes:
  • A positive outcome (with value 1) for a sample, such as an application acceptance.
    • n(1) is the number of observed labels for positive outcomes (acceptances).
    • n'(1) is the number of predicted labels for positive outcomes (acceptances).
  • A negative outcome (with value 0) for a sample, such as an application rejection.
    • n(0) is the number of observed labels for negative outcomes (rejections).
    • n'(0) is the number of predicted labels for negative outcomes (rejections).
• facet values:
  • facet a – The feature value that defines a demographic that bias favors.
    • na is the number of observed labels for the favored facet value: na = na(1) + na(0), the sum of the
      positive and negative observed labels for the facet value a.
    • n'a is the number of predicted labels for the favored facet value: n'a = n'a(1) + n'a(0), the sum of the
      positive and negative predicted outcome labels for the facet value a. Note that n'a = na.
  • facet d – The feature value that defines a demographic that bias disfavors.
    • nd is the number of observed labels for the disfavored facet value: nd = nd(1) + nd(0), the sum of the
      positive and negative observed labels for the facet value d.
    • n'd is the number of predicted labels for the disfavored facet value: n'd = n'd(1) + n'd(0), the sum of
      the positive and negative predicted labels for the facet value d. Note that n'd = nd.
• probability distributions for outcomes of the labeled facet data outcomes:
  • Pa(y) is the probability distribution of the observed labels for facet a. For binary labeled data, this
    distribution is given by the ratio of the number of samples in facet a labeled with positive outcomes
    to the total number, Pa(y = 1) = na(1)/na, and the ratio of the number of samples with negative
    outcomes to the total number, Pa(y = 0) = na(0)/na.
  • Pd(y) is the probability distribution of the observed labels for facet d. For binary labeled data, this
    distribution is given by the ratio of the number of samples in facet d labeled with positive outcomes
    to the total number, Pd(y = 1) = nd(1)/nd, and the ratio of the number of samples with negative
    outcomes to the total number, Pd(y = 0) = nd(0)/nd.

Models trained on data biased by demographic disparities might learn and even exacerbate them. To
identify bias in the data before expending resources to train models on it, SageMaker Clarify provides
data bias metrics that you can compute on raw datasets before training. All of the pretraining metrics are
model-agnostic because they do not depend on model outputs and so are valid for any model. The first
bias metric examines facet imbalance, but not outcomes. It determines the extent to which the amount
of training data is representative across different facets, as desired for the application. The remaining
bias metrics compare the distribution of outcome labels in various ways for facets a and d in the data.
The metrics that range over negative values can detect negative bias. The following table contains a
cheat sheet for quick guidance and links to the pretraining bias metrics.

Pre-training Bias Metrics

Class Imbalance (CI) (p. 975)
Description: Measures the imbalance in the number of members between different facet values.
Example question: Could there be age-based biases due to not having enough data for the demographic
outside a middle-aged facet?
Interpreting metric values: Normalized range [-1, +1].
• Positive values indicate that facet a has more training samples in the dataset.
• Values near zero indicate that the facets are balanced in the number of training samples in the dataset.
• Negative values indicate that facet d has more training samples in the dataset.

Difference in Proportions of Labels (DPL) (p. 975)
Description: Measures the imbalance of positive outcomes between different facet values.
Example question: Could there be age-based biases in ML predictions due to biased labeling of facet
values in the data?
Interpreting metric values: Range for normalized binary and multicategory labels [-1, +1]; range for
continuous labels (-∞, +∞).
• Positive values indicate that facet a has a higher proportion of positive outcomes.
• Values near zero indicate a more equal proportion of positive outcomes between facets.
• Negative values indicate that facet d has a higher proportion of positive outcomes.

Kullback-Leibler Divergence (KL) (p. 976)
Description: Measures how much the outcome distributions of different facets diverge from each other
entropically.
Example question: How different are the distributions for loan application outcomes for different
demographic groups?
Interpreting metric values: Range for binary, multicategory, and continuous outcomes [0, +∞).
• Values near zero indicate the labels are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.

Jensen-Shannon Divergence (JS) (p. 976)
Description: Measures how much the outcome distributions of different facets diverge from each other
entropically.
Example question: How different are the distributions for loan application outcomes for different
demographic groups?
Interpreting metric values: Range for binary, multicategory, and continuous outcomes [0, +∞).
• Values near zero indicate the labels are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.

Lp-norm (LP) (p. 977)
Description: Measures a p-norm difference between distinct demographic distributions of the outcomes
associated with different facets in a dataset.
Example question: How different are the distributions for loan application outcomes for different
demographics?
Interpreting metric values: Range for binary, multicategory, and continuous outcomes [0, +∞).
• Values near zero indicate the labels are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.

Total Variation Distance (TVD) (p. 977)
Description: Measures half of the L1-norm difference between distinct demographic distributions of the
outcomes associated with different facets in a dataset.
Example question: How different are the distributions for loan application outcomes for different
demographics?
Interpreting metric values: Range for binary, multicategory, and continuous outcomes [0, +∞).
• Values near zero indicate the labels are similarly distributed.
• Positive values indicate the label distributions diverge; the more positive, the larger the divergence.

Kolmogorov-Smirnov (KS) (p. 978)
Description: Measures the maximum divergence between outcomes in distributions for different facets
in a dataset.
Example question: Which college application outcomes manifest the greatest disparities by demographic
group?
Interpreting metric values: Range of KS values for binary, multicategory, and continuous outcomes [0, +1].
• Values near zero indicate the labels were evenly distributed between facets in all outcome categories.
• Values near one indicate the labels for one category were all in one facet, so very imbalanced.
• Intermittent values indicate relative degrees of maximum label imbalance.

Conditional Demographic Disparity (CDD) (p. 978)
Description: Measures the disparity of outcomes between different facets as a whole, but also by
subgroups.
Example question: Do some groups have a larger proportion of rejections for college admission outcomes
than their proportion of acceptances?
Interpreting metric values: Range of CDD [-1, +1].
• Positive values indicate outcomes where facet d is rejected more than accepted.
• Values near zero indicate no demographic disparity on average.
• Negative values indicate outcomes where facet a is rejected more than accepted.

For additional information about bias metrics, see Fairness Measures for Machine Learning in Finance.

Topics
• Class Imbalance (CI) (p. 975)
• Difference in Proportions of Labels (DPL) (p. 975)
• Kullback-Leibler Divergence (KL) (p. 976)
• Jensen-Shannon Divergence (JS) (p. 976)
• Lp-norm (LP) (p. 977)
• Total Variation Distance (TVD) (p. 977)
• Kolmogorov-Smirnov (KS) (p. 978)
• Conditional Demographic Disparity (CDD) (p. 978)


Class Imbalance (CI)


Class imbalance (CI) bias occurs when a facet value d has fewer training samples when compared with
another facet a in the dataset. This is because models preferentially fit the larger facets at the expense
of the smaller facets and so can result in a higher training error for facet d. Models are also at higher risk
of overfitting the smaller data sets, which can cause a larger test error for facet d. Consider the example
where a machine learning model is trained primarily on data from middle-aged individuals (facet a): it
might be less accurate when making predictions involving younger and older people (facet d).

The formula for the (normalized) facet imbalance measure:

CI = (na - nd)/(na + nd)

Where na is the number of members of facet a and nd the number for facet d. Its values range over the
interval [-1, 1].

• Positive CI values indicate the facet a has more training samples in the dataset and a value of 1
indicates the data only contains members of the facet a.
• Values of CI near zero indicate a more equal distribution of members between facets and a value of
zero indicates a perfectly equal partition between facets and represents a balanced distribution of
samples in the training data.
• Negative CI values indicate the facet d has more training samples in the dataset and a value of -1
indicates the data only contains members of the facet d.
• CI values near either of the extremes values of -1 or 1 are very imbalanced and are at a substantial risk
of making biased predictions.

If a significant facet imbalance is found to exist among the facets, you might want to rebalance the
sample before proceeding to train models on it.
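For a concrete illustration, the following Python sketch computes CI from facet counts. The counts are illustrative assumptions, not values from a real dataset.

# Class imbalance from illustrative facet counts.
n_a = 900  # observed members of facet a (for example, middle-aged applicants)
n_d = 100  # observed members of facet d (all other age groups)

ci = (n_a - n_d) / (n_a + n_d)
print(ci)  # 0.8: close to +1, so facet a dominates the training data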

Difference in Proportions of Labels (DPL)


The difference in proportions of labels (DPL) compares the proportion of observed outcomes with
positive labels for facet d with the proportion of observed outcomes with positive labels of facet a in a
training dataset. For example, you could use it to compare the proportion of middle-aged individuals
(facet a) and other age groups (facet d) approved for financial loans. Machine learning models try to
mimic the training data decisions as closely as possible. So a machine learning model trained on a
dataset with a high DPL is likely to reflect the same imbalance in its future predictions.

The formula for the difference in proportions of labels is as follows:

DPL = (qa - qd)

Where:

• qa = na(1)/na is the proportion of facet a who have an observed label value of 1. For example, the
proportion of a middle-aged demographic who get approved for loans. Here na(1) represents the
number of members of facet a who get a positive outcome and na is the number of members of facet a.
• qd = nd(1)/nd is the proportion of facet d who have an observed label value of 1. For example, the
proportion of people outside the middle-aged demographic who get approved for loans. Here nd(1)
represents the number of members of facet d who get a positive outcome and nd is the number of
members of facet d.

If DPL is close enough to 0, then we say that demographic parity has been achieved.


For binary and multicategory facet labels, the DPL values range over the interval (-1, 1). For continuous
labels, we set a threshold to collapse the labels to binary.

• Positive DPL values indicate that facet a has a higher proportion of positive outcomes when
compared with facet d.
• Values of DPL near zero indicate a more equal proportion of positive outcomes between facets and a
value of zero indicates perfect demographic parity.
• Negative DPL values indicate that facet d has a higher proportion of positive outcomes when
compared with facet a.

Whether or not a high magnitude of DPL is problematic varies from one situation to another. In a
problematic case, a high-magnitude DPL might be a signal of underlying issues in the data. For example,
a dataset with high DPL might reflect historical biases or prejudices against age-based demographic
groups that would be undesirable for a model to learn.
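The following minimal Python sketch makes the DPL arithmetic concrete; the acceptance counts are illustrative assumptions.

# Difference in proportions of labels from illustrative counts.
n_a_pos, n_a = 70, 100  # facet a: positive labels, total members
n_d_pos, n_d = 40, 100  # facet d: positive labels, total members

q_a = n_a_pos / n_a  # proportion of facet a with label value 1
q_d = n_d_pos / n_d  # proportion of facet d with label value 1

dpl = q_a - q_d
print(dpl)  # 0.3: facet a has a higher proportion of positive outcomes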

Kullback-Leibler Divergence (KL)


The Kullback-Leibler divergence (KL) measures how much the observed label distribution of facet a,
Pa(y), diverges from distribution of facet d, Pd(y). It is also known as the relative entropy of Pa(y) with
respect to Pd(y) and quantifies the amount of information lost when moving from Pa(y) to Pd(y).

The formula for the Kullback-Leibler divergence is as follows:

KL(Pa || Pd) = ∑yPa(y)*log[Pa(y)/Pd(y)]

It is the expectation of the logarithmic difference between the probabilities Pa(y) and Pd(y), where the
expectation is weighted by the probabilities Pa(y). This is not a true distance between the distributions
as it is asymmetric and does not satisfy the triangle inequality. The implementation uses natural
logarithms, giving KL in units of nats. Using different logarithmic bases gives proportional results but in
different units. For example, using base 2 gives KL in units of bits.

For example, assume that a group of applicants for loans have a 30% approval rate (facet d) and that
the approval rate for other applicants (facet a) is 80%. The Kullback-Leibler formula gives you the label
distribution divergence of facet a from facet d as follows:

KL = 0.8*ln(0.8/0.3) + 0.2*ln(0.2/0.7) = 0.53

There are two terms in the formula here because labels are binary in this example. This measure can
be applied to multiple labels in addition to binary ones. For example, in a college admissions scenario,
assume an applicant may be assigned one of three category labels: yi = {y0, y1, y2} = {rejected, waitlisted,
accepted}.

Range of values for the KL metric for binary, multicategory, and continuous outcomes is [0, +∞).

• Values near zero mean the outcomes are similarly distributed for the different facets.
• Positive values mean the label distributions diverge, the more positive the larger the divergence.
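The worked example above can be checked with a few lines of Python; the distributions are the 80% and 30% approval rates from the text.

import math

P_a = [0.8, 0.2]  # facet a: P(approve), P(reject)
P_d = [0.3, 0.7]  # facet d: P(approve), P(reject)

# KL(Pa || Pd) in nats, using natural logarithms as described above.
kl = sum(pa * math.log(pa / pd) for pa, pd in zip(P_a, P_d))
print(round(kl, 2))  # 0.53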

Jensen-Shannon Divergence (JS)


The Jensen-Shannon divergence (JS) measures how much the label distributions of different facets
diverge from each other entropically. It is based on the Kullback-Leibler divergence, but it is symmetric.

The formula for the Jensen-Shannon divergence is as follows:

JS = ½*[KL(Pa || P) + KL(Pd || P)]

Where P = ½( Pa + Pd ), the average label distribution across facets a and d.


The range of JS values for binary, multicategory, continuous outcomes is [0, ln(2)).

• Values near zero mean the labels are similarly distributed.


• Positive values mean the label distributions diverge, the more positive the larger the divergence.

This metric indicates whether there is a big divergence in one of the labels across facets.
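As a minimal sketch, the following Python snippet computes JS for the same illustrative distributions used in the KL example; the printed value is specific to those numbers.

import math

def kl(p, q):
    # Kullback-Leibler divergence in nats; skips zero-probability terms.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P_a = [0.8, 0.2]
P_d = [0.3, 0.7]
P = [(pa + pd) / 2 for pa, pd in zip(P_a, P_d)]  # average distribution

js = 0.5 * (kl(P_a, P) + kl(P_d, P))
print(round(js, 3))  # ≈ 0.133, within the [0, ln(2)) range noted above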

Lp-norm (LP)
The Lp-norm (LP) measures the p-norm distance between the facet distributions of the observed labels in
a training dataset. This metric is non-negative and so cannot detect reverse bias.

The formula for the Lp-norm is as follows:

Lp(Pa, Pd) = (∑y |Pa(y) - Pd(y)|^p)^(1/p)

Where the p-norm distance between the points x and y is defined as follows:

Lp(x, y) = (|x1 - y1|^p + |x2 - y2|^p + … + |xn - yn|^p)^(1/p)

The 2-norm is the Euclidean norm. Assume you have an outcome distribution with three categories, for
example, yi = {y0, y1, y2} = {accepted, waitlisted, rejected} in a college admissions multicategory scenario.
You take the sum of the squares of the differences between the outcome counts for facets a and d. The
resulting Euclidean distance is calculated as follows:

L2(Pa, Pd) = [(na(0) - nd(0))^2 + (na(1) - nd(1))^2 + (na(2) - nd(2))^2]^(1/2)

Where:

• na(i) is the number of the ith category outcomes in facet a: for example, na(0) is the number of facet a
acceptances.
• nd(i) is the number of the ith category outcomes in facet d: for example, nd(2) is the number of facet d
rejections.

The range of LP values for binary, multicategory, and continuous outcomes is [0, √2), where:
• Values near zero mean the labels are similarly distributed.
• Positive values mean the label distributions diverge, the more positive the larger the divergence.
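For illustration, the following Python sketch computes the 2-norm distance between two made-up three-category outcome distributions (the same proportions are reused in the TVD and KS examples that follow).

# Euclidean (2-norm) distance between illustrative outcome distributions.
P_a = [0.4, 0.4, 0.2]  # facet a: rejected, waitlisted, accepted
P_d = [0.2, 0.1, 0.7]  # facet d: rejected, waitlisted, accepted

l2 = sum((pa - pd) ** 2 for pa, pd in zip(P_a, P_d)) ** 0.5
print(round(l2, 3))  # ≈ 0.616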

Total Variation Distance (TVD)


The total variation distance data bias metric (TVD) is half the L1-norm. The TVD is the largest possible
difference between the probability distributions for label outcomes of facets a and d. The L1-norm is
the Hamming distance, a metric used to compare two binary data strings by determining the minimum
number of substitutions required to change one string into another. If the strings were to be copies of
each other, it determines the number of errors that occurred when copying. In the bias detection context,
TVD quantifies how many outcomes in facet a would have to be changed to match the outcomes in facet
d.

The formula for the Total variation distance is as follows:

TVD = ½*L1(Pa, Pd)

For example, assume you have an outcome distribution with three categories, yi = {y0, y1, y2} = {accepted,
waitlisted, rejected}, in a college admissions multicategory scenario. You take the differences between
the counts of facets a and d for each outcome to calculate TVD. The result is as follows:
L1(Pa, Pd) = |na(0) - nd(0)| + |na(1) - nd(1)| + |na(2) - nd(2)|


Where:

• na(i) is the number of the ith category outcomes in facet a: for example, na(0) is the number of facet a
acceptances.
• nd(i) is the number of the ith category outcomes in facet d: for example, nd(2) is the number of facet d
rejections.

The range of TVD values for binary, multicategory, and continuous outcomes is [0, 1), where:
• Values near zero mean the labels are similarly distributed.
• Positive values mean the label distributions diverge, the more positive the larger the divergence.
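Using the same illustrative proportions, a minimal Python sketch of the TVD calculation looks like the following.

# Total variation distance as half the L1-norm of the difference.
P_a = [0.4, 0.4, 0.2]  # facet a: rejected, waitlisted, accepted
P_d = [0.2, 0.1, 0.7]  # facet d: rejected, waitlisted, accepted

tvd = 0.5 * sum(abs(pa - pd) for pa, pd in zip(P_a, P_d))
print(tvd)  # 0.5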

Kolmogorov-Smirnov (KS)
The Kolmogorov-Smirnov bias metric (KS) is equal to the maximum divergence between labels in the
distributions for facets a and d of a dataset. The two-sample KS test implemented by SageMaker Clarify
complements the other measures of label imbalance by finding the most imbalanced label.

The formula for the Kolmogorov-Smirnov metric is as follows:

KS = max(|Pa(y) - Pd(y)|)

For example, assume a group of applicants (facet a) to college are rejected, waitlisted, or accepted at
40%, 40%, 20% respectively and that these rates for other applicants (facet d) are 20%, 10%, 70%. Then
the Kolmogorov-Smirnov bias metric value is as follows:

KS = max(|0.4-0.2|, |0.4-0.1|, |0.2-0.7|) = 0.5

This tells us the maximum divergence between facet distributions is 0.5 and occurs in the acceptance
rates. There are three terms in the equation because labels are multiclass of cardinality three.

The range of KS values for binary, multicategory, and continuous outcomes is [0, +1], where:

• Values near zero indicate the labels were evenly distributed between facets in all outcome categories.
For example, both facets applying for a loan got 50% of the acceptances and 50% of the rejections.
• Values near one indicate the labels for one outcome were all in one facet. For example, facet a got
100% of the acceptances and facet d got none.
• Intermittent values indicate relative degrees of maximum label imbalance.
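The worked example above reduces to one line of Python; the 40%/40%/20% and 20%/10%/70% rates are taken from the text.

P_a = [0.4, 0.4, 0.2]  # facet a: rejected, waitlisted, accepted
P_d = [0.2, 0.1, 0.7]  # facet d: rejected, waitlisted, accepted

# Maximum divergence across the outcome categories.
ks = max(abs(pa - pd) for pa, pd in zip(P_a, P_d))
print(ks)  # 0.5, occurring in the acceptance rates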

Conditional Demographic Disparity (CDD)


The demographic disparity metric (DD) determines whether a facet has a larger proportion of the
rejected outcomes in the dataset than of the accepted outcomes. In the binary case where there are
two facets, men and women for example, that constitute the dataset, the disfavored one is labelled
facet d and the favored one is labelled facet a. For example, in the case of college admissions, if
women applicants comprised 46% of the rejected applicants and comprised only 32% of the accepted
applicants, we say that there is demographic disparity because the rate at which women were rejected
exceeds the rate at which they are accepted. Women applicants are labelled facet d in this case. If men
applicants comprised 54% of the rejected applicants and 68% of the accepted applicants, then there is
not a demographic disparity for this facet as the rate of rejection is less than the rate of acceptance. Men
applicants are labelled facet a in this case.

The formula for the demographic disparity for the less favored facet d is as follows:

DDd = nd(0)/n(0) - nd(1)/n(1) = PdR(y0) - PdA(y1)

Where:


• n(0) = na(0) + nd(0) is the total number of rejected outcomes in the dataset for the favored facet a and
disadvantaged facet d.
• n(1) = na(1) + nd(1) is the total number of accepted outcomes in the dataset for the favored facet a and
disadvantaged facet d.
• PdR(y0) is the proportion of rejected outcomes (with value 0) in facet d.
• PdA(y1) is the proportion of accepted outcomes (with value 1) in facet d.

For the college admission example, the demographic disparity for women is DDd = 0.46 - 0.32 = 0.14. For
men, DDa = 0.54 - 0.68 = -0.14.

A conditional demographic disparity (CDD) metric that conditions DD on attributes that define strata of
subgroups of the dataset is needed to rule out Simpson's paradox. The regrouping can provide insights
into the cause of apparent demographic disparities for less favored facets. The classic case arose with
Berkeley admissions, where men were accepted at a higher rate overall than women. The statistics
for this case were used in the example calculations of DD. However, when departmental subgroups
were examined, women were shown to have higher admission rates than men when conditioned by
department. The explanation was that women had applied to departments with lower acceptance rates
than men had. Examining the subgrouped acceptance rates revealed that women were actually accepted
at a higher rate than men for the departments with lower acceptance rates.

The CDD metric gives a single measure for all of the disparities found in the subgroups defined by
an attribute of a dataset by averaging them. It is defined as the weighted average of demographic
disparities (DDi) for each of the subgroups, with each subgroup disparity weighted in proportion to the
number of observations it contains. The formula for the conditional demographic disparity is as follows:

CDD = (1/n) * ∑i ni * DDi

Where:

• ∑i ni = n is the total number of observations and ni is the number of observations for each subgroup.
• DDi = ni(0)/n(0) - ni(1)/n(1) = PiR(y0) - PiA(y1) is the demographic disparity for the ith subgroup.

The demographic disparity for a subgroup (DDi) is the difference between the proportion of rejected
outcomes and the proportion of accepted outcomes for each subgroup.

The range of DD values for binary outcomes for the full dataset DDd or for its conditionalized subgroups
DDi is [-1, +1].

• +1: when there are no rejections in facet a or subgroup and no acceptances in facet d or subgroup
• Positive values indicate there is a demographic disparity as facet d or subgroup has a greater
proportion of the rejected outcomes in the dataset than of the accepted outcomes. The higher the
value the less favored the facet and the greater the disparity.
• Negative values indicate there is not a demographic disparity as facet d or subgroup has a larger
proportion of the accepted outcomes in the dataset than of the rejected outcomes. The lower the
value the more favored the facet.
• -1: when there are no rejections in facet d or subgroup and no acceptances in facet a or subgroup

If you don't condition on anything then CDD is zero if and only if DPL is zero.

This metric is useful for exploring the concepts of direct and indirect discrimination and of objective
justification in EU and UK non-discrimination law and jurisprudence. For additional information, see Why
Fairness Cannot Be Automated. This paper also contains the relevant data and analysis of the Berkeley
admissions case that shows how conditionalizing on departmental admission rate subgroups illustrates
Simpson's paradox.
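To make the weighted average concrete, the following Python sketch computes CDD over two hypothetical subgroups (for example, departments); all counts are invented for the illustration.

# Per-subgroup counts: (facet d rejections, total rejections,
# facet d acceptances, total acceptances).
subgroups = [
    (60, 100, 20, 100),  # subgroup 1: DD = 0.60 - 0.20 = 0.40
    (10, 100, 30, 100),  # subgroup 2: DD = 0.10 - 0.30 = -0.20
]

n_total = 0
weighted_sum = 0.0
for d_rej, rej, d_acc, acc in subgroups:
    n_i = rej + acc                    # observations in subgroup i
    dd_i = d_rej / rej - d_acc / acc   # demographic disparity DD_i
    weighted_sum += n_i * dd_i
    n_total += n_i

cdd = weighted_sum / n_total
print(round(cdd, 2))  # 0.1: on average, facet d is rejected more than accepted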


Generate Reports for Bias in Pre-training Data in SageMaker Studio

SageMaker Clarify is integrated with Amazon SageMaker Data Wrangler, which can help you identify bias
during data preparation without having to write your own code. Data Wrangler provides an end-to-end
solution to import, prepare, transform, featurize, and analyze data with Amazon SageMaker Studio. For
an overview of the Data Wrangler data prep workflow, see Prepare ML Data with Amazon SageMaker
Data Wrangler (p. 981).

You specify attributes of interest, such as gender or age, and SageMaker Clarify runs a set of algorithms
to detect the presence of bias in those attributes. After the algorithm runs, SageMaker Clarify provides
a visual report with a description of the sources and severity of possible bias so that you can plan steps
to mitigate. For example, in a financial dataset that contains few examples of business loans to one
age group as compared to others, SageMaker flags the imbalance so that you can avoid a model that
disfavors that age group.

To analyze and report on data bias

To get started with Data Wrangler, see Get Started with Data Wrangler (p. 983).

1. In Amazon SageMaker Studio, from the Home menu in the left panel, navigate to the Data
node, then choose Data Wrangler. This opens the Data Wrangler landing page in Studio.
2. Choose the + Import data button to create a new flow.
3. In your flow page, from the Import tab, choose Amazon S3, navigate to your Amazon S3 bucket,
find your dataset, then choose Import.
4. After you have imported your data, on the flow graph in the Data flow tab, choose the + sign to the
right of the Data types node.
5. Choose Add analysis.
6. On the Create Analysis page, choose Bias Report for the Analysis type.
7. Configure the bias report by providing a report Name; the column to predict and whether it is a
value or threshold; and the column to analyze for bias (the facet) and whether it is a value or threshold.
8. Continue configuring the bias report by choosing the bias metrics.


9. Choose Check for bias to generate and view the bias report. Scroll down to view all of the reports.

10. Choose the caret to the right of each bias metric description to see documentation that can help you
interpret the significance of the metric values.
11. To view a table summary of the bias metric values, choose the Table toggle. To save the report,
choose Save in the lower-right corner of the page. You can see the report on the flow graph in the
Data flow tab. Double-click on the report to open it.

Prepare ML Data with Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler (Data Wrangler) is a feature of Amazon SageMaker Studio that
provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can
integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify
and streamline data pre-processing and feature engineering using little to no coding. You can also add
your own Python scripts and transformations to customize workflows.


Data Wrangler provides the following core functionalities to help you analyze and prepare data for
machine learning applications.

• Import – Connect to and import data from Amazon Simple Storage Service (Amazon S3), Amazon
Athena (Athena), Amazon Redshift, Snowflake, and Databricks.
• Data Flow – Create a data flow to define a series of ML data prep steps. You can use a flow to combine
datasets from different data sources, identify the number and types of transformations you want to
apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline.
• Transform – Clean and transform your dataset using standard transforms like string, vector, and
numeric data formatting tools. Featurize your data using transforms like text and date/time
embedding and categorical encoding.
• Generate Data Insights – Automatically verify data quality and detect abnormalities in your data with
Data Wrangler Data Insights and Quality Report.
• Analyze – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-
in data visualization tools like scatter plots and histograms, as well as data analysis tools like target
leakage analysis and quick modeling to understand feature correlation.
• Export – Export your data preparation workflow to a different location. The following are example
locations:
• Amazon Simple Storage Service (Amazon S3) bucket
• Amazon SageMaker Model Building Pipelines – Use SageMaker Pipelines to automate model
deployment. You can export the data that you've transformed directly to the pipelines.
• Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
• Python script – Store the data and their transformations in a Python script for your custom
workflows.

To start using Data Wrangler, see Get Started with Data Wrangler (p. 983).
Important
Data Wrangler no longer supports Jupyter Lab Version 1 (JL1). To access the latest features and
updates, update to Jupyter Lab Version 3. For more information about upgrading, see View and
update the JupyterLab version of an application from the console (p. 140).
Important
The information and procedures in this guide use the latest version of Amazon SageMaker
Studio. For information about updating Studio to the latest version, see Amazon SageMaker
Studio UI Overview (p. 129).

Topics
• Get Started with Data Wrangler (p. 983)
• Import (p. 991)
• Create and Use a Data Wrangler Flow (p. 1034)
• Get Insights On Data and Data Quality (p. 1045)
• Automatically Train Models on Your Data Flow (p. 1057)
• Transform Data (p. 1058)
• Analyze and Visualize (p. 1101)
• Reusing Data Flows for Different Datasets (p. 1109)
• Export (p. 1116)
• Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Notebook to Get Data
Insights (p. 1138)
• Security and Permissions (p. 1141)
• Release Notes (p. 1152)
• Troubleshoot (p. 1156)


• Increase Amazon EC2 Instance Limit (p. 1160)
• Update Data Wrangler (p. 985)
• Shut Down Data Wrangler (p. 1162)

Get Started with Data Wrangler


Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio. Use this section to learn
how to access and get started using Data Wrangler. Do the following:

1. Complete each step in Prerequisites (p. 983).


2. Follow the procedure in Access Data Wrangler (p. 983) to start using Data Wrangler.

Prerequisites
To use Data Wrangler, you must complete the following prerequisites.

1. To use Data Wrangler, you need access to an Amazon Elastic Compute Cloud (Amazon EC2) instance.
For more information about the Amazon EC2 instances that you can use, see Instances (p. 1034). To
learn how to view your quotas and, if necessary, request a quota increase, see AWS service quotas.
2. Configure the required permissions described in Security and Permissions (p. 1141).

To use Data Wrangler, you need an active Studio instance. To learn how to launch a new instance, see
Onboard to Amazon SageMaker Domain (p. 37). When your Studio instance is Ready, use the instructions
in Access Data Wrangler (p. 983).

Access Data Wrangler


The following procedure assumes you have completed the Prerequisites (p. 983).

To access Data Wrangler in Studio, do the following.

1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. You can also create a Data Wrangler flow by doing the following.

a. In the top navigation bar, select File.


b. Select New.
c. Select Data Wrangler Flow.


9. (Optional) Rename the new directory and the .flow file.


10. When you create a new .flow file in Studio, you might see a carousel that introduces you to Data
Wrangler.

This may take a few minutes.

This messaging persists as long as the KernelGateway app on your User Details page is Pending.
To see the status of this app, in the SageMaker console on the Amazon SageMaker Studio page,
select the name of the user you are using to access Studio. On the User Details page, you see a
KernelGateway app under Apps. Wait until this app status is Ready to start using Data Wrangler.
This can take around 5 minutes the first time you launch Data Wrangler.

11. To get started, choose a data source and use it to import a dataset. See Import (p. 991) to learn
more.

When you import a dataset, it appears in your data flow. To learn more, see Create and Use a Data
Wrangler Flow (p. 1034).
12. After you import a dataset, Data Wrangler automatically infers the type of data in each column.
Choose + next to the Data types step and select Edit data types.
Important
After you add transforms to the Data types step, you cannot bulk-update column types
using Update types.
13. Use the data flow to add transforms and analyses. To learn more, see Transform Data (p. 1058) and
Analyze and Visualize (p. 1101).
14. To export a complete data flow, choose Export and choose an export option. To learn more, see
Export (p. 1116).


15. Finally, choose the Components and registries icon, and select Data Wrangler from the dropdown
list to see all the .flow files that you've created. You can use this menu to find and move between
data flows.

After you have launched Data Wrangler, you can use the following section to walk through how you
might use Data Wrangler to create an ML data prep flow.

Update Data Wrangler


We recommend that you periodically update the Data Wrangler Studio app to access the latest features
and updates. The Data Wrangler app name starts with sagemaker-data-wrang. To learn how to update a
Studio app, see Shut down and Update Studio Apps (p. 200).

Demo: Data Wrangler Titanic Dataset Walkthrough


The following sections provide a walkthrough to help you get started using Data Wrangler. This
walkthrough assumes that you have already followed the steps in Access Data Wrangler (p. 983) and
have a new data flow file open that you intend to use for the demo. You may want to rename this .flow
file to something similar to titanic-demo.flow.

This walkthrough uses the Titanic dataset. It's a modified version of the Titanic dataset that you can
import into your Data Wrangler flow more easily. This dataset contains the survival status, age, gender,
and class (which serves as a proxy for economic status) of passengers aboard the maiden voyage of the
RMS Titanic in 1912.

In this tutorial, you perform the following steps.

1. Do one of the following:


• Open your Data Wrangler flow and choose Use Sample Dataset.
• Upload the Titanic dataset to Amazon Simple Storage Service (Amazon S3), and then import this
dataset into Data Wrangler.
2. Analyze this dataset using Data Wrangler analyses.
3. Define a data flow using Data Wrangler data transforms.
4. Export your flow to a Jupyter Notebook that you can use to create a Data Wrangler job.
5. Process your data, and kick off a SageMaker training job to train an XGBoost binary classifier.

Upload Dataset to S3 and Import


To get started, you can use one of the following methods to import the Titanic dataset into Data
Wrangler:

• Importing the dataset directly from the Data Wrangler flow


• Uploading the dataset to Amazon S3 and then importing it into Data Wrangler

To import the dataset directly into Data Wrangler, open the flow and choose Use Sample Dataset.

Uploading the dataset to Amazon S3 and importing it into Data Wrangler is closer to the experience
you have importing your own data. The following information tells you how to upload your dataset and
import it.

Before you start importing the data into Data Wrangler, download the Titanic dataset and upload it to an Amazon S3 bucket in the AWS Region in which you want to complete this demo.


If you are a new user of Amazon S3, you can do this using drag and drop in the Amazon S3 console.
To learn how, see Uploading Files and Folders by Using Drag and Drop in the Amazon Simple Storage
Service User Guide.
Important
Upload your dataset to an S3 bucket in the same AWS Region you want to use to complete this
demo.

When your dataset has been successfully uploaded to Amazon S3, you can import it into Data Wrangler.

Import the Titanic dataset to Data Wrangler

1. Choose the Import data button in your Data flow tab or choose the Import tab.
2. Select Amazon S3.
3. Use the Import a dataset from S3 table to find the bucket to which you added the Titanic dataset.
Choose the Titanic dataset CSV file to open the Details pane.
4. Under Details, the File type should be CSV. Check First row is header to specify that the first row of the dataset is a header. You can also name the dataset something more friendly, such as Titanic-train.
5. Choose the Import button.

When your dataset is imported into Data Wrangler, it appears in your Data Flow tab. You can double-click a node to enter the node detail view, which allows you to add transformations or analyses. You can use the plus icon for quick access to the navigation. In the next section, you use this data flow to add analysis and transform steps.

Data Flow
In the data flow section, the only steps in the data flow are your recently imported dataset and a Data
type step. After applying transformations, you can come back to this tab and see what the data flow
looks like. Now, add some basic transformations under the Prepare and Analyze tabs.

Prepare and Visualize

Data Wrangler has built-in transformations and visualizations that you can use to analyze, clean, and
transform your data.

The Data tab of the node detail view lists all built-in transformations in the right panel, which also
contains an area in which you can add custom transformations. The following use case showcases how to
use these transformations.

To get information that might help you with data exploration and feature engineering, create a data
quality and insights report. The information from the report can help you clean and process your data.
It gives you information such as the number of missing values and the number of outliers. If you have
issues with your data, such as target leakage or imbalance, the insights report can bring those issues
to your attention. For more information about creating a report, see Get Insights On Data and Data
Quality (p. 1045).

Data Exploration

First, create a table summary of the data using an analysis. Do the following:

1. Choose the + next to the Data type step in your data flow and select Add analysis.
2. In the Analysis area, select Table summary from the dropdown list.
3. Give the table summary a Name.


4. Select Preview to preview the table that will be created.


5. Choose Save to save it to your data flow. It appears under All Analyses.

Using the statistics you see, you can make observations similar to the following about this dataset:

• Fare average (mean) is around $33, while the max is over $500. This column likely has outliers.
• This dataset uses ? to indicate missing values. A number of columns have missing values: cabin, embarked, and home.dest.
• The age category is missing over 250 values.

Next, clean your data using the insights gained from these stats.

Drop Unused Columns

Using the analysis from the previous section, clean up the dataset to prepare it for training. To add a
new transform to your data flow, choose + next to the Data type step in your data flow and choose Add
transform.

First, drop columns that you don't want to use for training. You can use pandas data analysis library to
do this, or you can use one of the built-in transforms.

Use the following procedure to drop the unused columns.

To drop the unused columns.

1. Open the Data Wrangler flow.


2. There are two nodes in your Data Wrangler flow. Choose the + to the right of the Data types node.
3. Choose Add transform.
4. In the All steps column, choose Add step.
5. In the Standard transform list, choose Manage Columns. The standard transformations are ready-
made, built-in transformations. Make sure that Drop column is selected.
6. Under Columns to drop, check the following column names:

• cabin
• ticket
• name
• sibsp
• parch
• home.dest
• boat
• body
7. Choose Preview.
8. Verify that the columns have been dropped, then choose Add.

To do this using pandas, follow these steps.

1. In the All steps column, choose Add step.


2. In the Custom transform list, choose Custom transform.
3. Provide a name for your transformation, and choose Python (Pandas) from the dropdown list.
4. Enter the following Python script in the code box.


cols = ['name', 'ticket', 'cabin', 'sibsp', 'parch', 'home.dest', 'boat', 'body']
df = df.drop(cols, axis=1)

5. Choose Preview to preview the change, and then choose Add to add the transformation.

Clean up Missing Values

Now, clean up missing values. You can do this with the Handling missing values transform group.

A number of columns have missing values. Of the remaining columns, age and fare contain missing
values. Inspect this using a Custom Transform.

Using the Python (Pandas) option, use the following to quickly review the number of entries in each
column:

df.info()
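
The walkthrough noted earlier that this dataset uses ? to mark missing values. If those markers haven't been converted, the built-in Drop missing transform won't treat them as missing. The following is a minimal Python (Pandas) custom-transform sketch, an addition to the walkthrough rather than a required step, that converts the markers first (Data Wrangler provides df inside a custom transform; the column names are from this demo's dataset):

import numpy as np
import pandas as pd

# Replace the '?' placeholder with NaN so pandas recognizes it as missing.
df = df.replace('?', np.nan)

# The '?' markers can force numeric columns to be read as strings;
# coerce age and fare back to numbers (invalid entries become NaN).
for col in ['age', 'fare']:
    df[col] = pd.to_numeric(df[col], errors='coerce')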

To drop rows with missing values in the age category, do the following:

1. Choose Handle missing.


2. Choose Drop missing for the Transformer.
3. Choose age for the Input column.
4. Choose Preview to see the new data frame, and then choose Add to add the transform to your flow.
5. Repeat the same process for fare.

You can use df.info() in the Custom transform section to confirm that all columns now have 1,045 non-null entries.


Custom Pandas: Encode

Try one-hot encoding using Pandas. Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are Dog and Cat, you may encode this information into two vectors: [1,0] to represent Dog, and [0,1] to represent Cat.

1. In the Custom Transform section, choose Python (Pandas) from the dropdown list.
2. Enter the following in the code box.

import pandas as pd

dummies = []
cols = ['pclass', 'sex', 'embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))

encoded = pd.concat(dummies, axis=1)
df = pd.concat((df, encoded), axis=1)

3. Choose Preview to preview the change. The encoded version of each column is added to the dataset.
4. Choose Add to add the transformation.

Custom SQL: SELECT Columns

Now, select the columns you want to keep using SQL. For this demo, select the columns listed in the
following SELECT statement. Because survived is your target column for training, put that column first.

1. In the Custom Transform section, select SQL (PySpark SQL) from the dropdown list.
2. Enter the following in the code box.

SELECT survived, age, fare, `1`, `2`, `3`, female, male, C, Q, S FROM df;

3. Choose Preview to preview the change. The columns listed in your SELECT statement are the only
remaining columns.
4. Choose Add to add the transformation.

Export to a Data Wrangler Notebook


When you've finished creating a data flow, you have a number of export options. The following section
explains how to export to a Data Wrangler job notebook. A Data Wrangler job is used to process
your data using the steps defined in your data flow. To learn more about all export options, see
Export (p. 1116).

Export to Data Wrangler Job Notebook

When you export your data flow using a Data Wrangler job, the process automatically creates a Jupyter
Notebook. This notebook automatically opens in your Studio instance and is configured to run a
SageMaker processing job to run your Data Wrangler data flow, which is referred to as a Data Wrangler
job.

1. Save your data flow. Select File and then select Save Data Wrangler Flow.
2. Back in the Data Flow tab, select the last step in your data flow (SQL), then choose the + to open
the navigation.
3. Choose Export, and Amazon S3 (via Jupyter Notebook). This opens a Jupyter Notebook.


4. Choose any Python 3 (Data Science) kernel for the Kernel.


5. When the kernel starts, run the cells in the notebook until Kick off SageMaker Training Job (Optional).
6. Optionally, you can run the cells in Kick off SageMaker Training Job (Optional) if you want
to create a SageMaker training job to train an XGBoost classifier. You can find the cost to run a
SageMaker training job in Amazon SageMaker Pricing.

Alternatively, you can add the code blocks found in Training XGBoost Classifier (p. 990) to the
notebook and run them to use the XGBoost open source library to train an XGBoost classifier.
7. Uncomment the cell under Cleanup and run it to revert the SageMaker Python SDK to its original version.

You can monitor your Data Wrangler job status in the SageMaker console in the Processing tab.
Additionally, you can monitor your Data Wrangler job using Amazon CloudWatch. For additional
information, see Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics.

If you kicked off a training job, you can monitor its status using the SageMaker console under Training
jobs in the Training section.

Training XGBoost Classifier

You can train an XGBoost binary classifier using either a Jupyter notebook or Amazon SageMaker
Autopilot. You can use Autopilot to automatically train and tune models on the data that you've
transformed directly from your Data Wrangler flow. For information about Autopilot, see Automatically
Train Models on Your Data Flow (p. 1057).


In the same notebook that kicked off the Data Wrangler job, you can pull the data and train an XGBoost binary classifier on the prepared data with minimal additional preparation.

1. First, upgrade the necessary modules using pip and remove the _SUCCESS file (this file causes problems when using awswrangler).

! pip install --upgrade awscli awswrangler boto3 scikit-learn
! aws s3 rm {output_path} --recursive --exclude "*" --include "*_SUCCESS*"

2. Read the data from Amazon S3. You can use awswrangler to recursively read all the CSV files in
the S3 prefix. The data is then split into features and labels. The label is the first column of the
dataframe.

import awswrangler as wr

df = wr.s3.read_csv(path=output_path, dataset=True)
# The label (survived) is the first column; the remaining columns are features.
X, y = df.iloc[:, 1:], df.iloc[:, 0]

3. Finally, create DMatrices (the XGBoost primitive structure for data) and do cross-validation using the XGBoost binary classification objective.

import xgboost as xgb

dmatrix = xgb.DMatrix(data=X, label=y)

params = {"objective": "binary:logistic", "learning_rate": 0.1, "max_depth": 5,
          "alpha": 10}

xgb.cv(
    dtrain=dmatrix,
    params=params,
    nfold=3,
    num_boost_round=50,
    early_stopping_rounds=10,
    metrics="rmse",
    as_pandas=True,
    seed=123)
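
The xgb.cv call reports cross-validation metrics but doesn't return a fitted model. If you also want a trained booster, the following is a minimal follow-on sketch, an addition to the walkthrough that reuses dmatrix and params from the preceding cell:

# Train a final booster with the same hyperparameters.
booster = xgb.train(params=params, dtrain=dmatrix, num_boost_round=50)

# binary:logistic outputs survival probabilities; show the first five.
print(booster.predict(dmatrix)[:5])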

Shut down Data Wrangler

When you are finished using Data Wrangler, we recommend that you shut down the instance it runs on
to avoid incurring additional charges. To learn how to shut down the Data Wrangler app and associated
instance, see Shut Down Data Wrangler (p. 1162).

Import
You can use Amazon SageMaker Data Wrangler to import data from the following data sources: Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, Databricks, Snowflake, and Software as a Service (SaaS) platforms. The dataset that you import can include up to 1000 columns.

Topics
• Import data from Amazon S3 (p. 993)
• Import data from Athena (p. 997)
• Import data from Amazon Redshift (p. 1000)
• Import data from Amazon EMR (p. 1004)
• Import data from Databricks (JDBC) (p. 1011)


• Import data from Snowflake (p. 1013)


• Import Data From Software as a Service (SaaS) Platforms (p. 1030)
• Imported Data Storage (p. 1033)

Some data sources allow you to add multiple data connections:

• You can connect to multiple Amazon Redshift clusters. Each cluster becomes a data source.
• You can query any Athena database in your account to import data from that database.

When you import a dataset from a data source, it appears in your data flow. Data Wrangler automatically
infers the data type of each column in your dataset. To modify these types, select the Data types step
and select Edit data types.

When you import data from Athena or Amazon Redshift, the imported data is automatically stored
in the default SageMaker S3 bucket for the AWS Region in which you are using Studio. Additionally,
Athena stores data you preview in Data Wrangler in this bucket. To learn more, see Imported Data
Storage (p. 1033).
Important
The default Amazon S3 bucket may not have the least permissive security settings, such as
bucket policy and server-side encryption (SSE). We strongly recommend that you Add a Bucket
Policy To Restrict Access to Datasets Imported to Data Wrangler.
Important
In addition, if you use the managed policy for SageMaker, we strongly recommend that you
scope it down to the most restrictive policy that allows you to perform your use case. For more
information, see Grant an IAM Role Permission to Use Data Wrangler (p. 1143).

All data sources except for Amazon Simple Storage Service (Amazon S3) require you to specify a SQL
query to import your data. For each query, you must specify the following:

• Data catalog
• Database
• Table

You can specify the name of the database or the data catalog in either the dropdown menus or within the query. The following are example queries:

• select * from example-data-catalog-name.example-database-name.example-table-name – The query doesn't use anything specified in the dropdown menus of the user interface (UI). It queries example-table-name within example-database-name within example-data-catalog-name.
• select * from example-database-name.example-table-name – The query uses the data catalog that you've specified in the Data catalog dropdown menu. It queries example-table-name within example-database-name within the data catalog that you've specified.
• select * from example-table-name – The query requires you to select values for both the Data catalog and Database name dropdown menus. It queries example-table-name within the database and data catalog that you've specified.

The link between Data Wrangler and the data source is a connection. You use the connection to import
data from your data source.

There are two types of connections:


• Direct
• Cataloged

Data Wrangler always has access to the most recent data in a direct connection. If the data in the data
source has been updated, you can use the connection to import the data. For example, if someone adds a
file to one of your Amazon S3 buckets, you can import the file.

A cataloged connection is the result of a data transfer. The data in a cataloged connection isn't necessarily the most recent. For example, you might set up a data transfer between Salesforce
and Amazon S3. If there's an update to the Salesforce data, you must transfer the data again. You can
automate the process of transferring data. For more information about data transfers, see Import Data
From Software as a Service (SaaS) Platforms (p. 1030).

Import data from Amazon S3


You can use Amazon Simple Storage Service (Amazon S3) to store and retrieve any amount of data,
at any time, from anywhere on the web. You can accomplish these tasks using the AWS Management
Console, which is a simple and intuitive web interface, and the Amazon S3 API. If you've stored your
dataset locally, we recommend that you add it to an S3 bucket for import into Data Wrangler. To learn
how, see Uploading an object to a bucket in the Amazon Simple Storage Service User Guide.

Data Wrangler uses S3 Select to allow you to preview your Amazon S3 files in Data Wrangler. You incur
standard charges for each file preview. To learn more about pricing, see the Requests & data retrievals
tab on Amazon S3 pricing.
Important
If you plan to export a data flow and launch a Data Wrangler job, ingest data into a SageMaker
feature store, or create a SageMaker pipeline, be aware that these integrations require Amazon
S3 input data to be located in the same AWS Region.
Important
If you're importing a CSV file, make sure it meets the following requirements:

• A record in your dataset can't be longer than one line.


• A backslash, \, is the only valid escape character.
• Your dataset must use one of the following delimiters:
• Comma – ,
• Colon – :
• Semicolon – ;
• Pipe – |
• Tab – [TAB]

To save space, you can import compressed CSV files.

Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For
Amazon S3, it provides the following sampling options:

• None – Import the entire dataset.


• First K – Sample the first K rows of the dataset, where K is an integer that you specify.
• Randomized – Takes a random sample of a size that you specify.
• Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a
column.

After you've imported your data, you can also use the sampling transformer to take one or more samples
from your entire dataset. For more information about the sampling transformer, see Sampling (p. 1092).
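
To make the stratified option concrete, the following is a small standalone pandas sketch with a made-up toy dataset; it illustrates what preserving the ratio of values in a column means and is not Data Wrangler's implementation:

import pandas as pd

# A toy dataset with an 80:20 split between labels 'a' and 'b'.
df = pd.DataFrame({'label': ['a'] * 80 + ['b'] * 20, 'x': range(100)})

# Sample 10% from each label group; the 80:20 ratio is preserved.
stratified = df.groupby('label', group_keys=False).sample(frac=0.1, random_state=0)
print(stratified['label'].value_counts(normalize=True))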


You can import either a single file or multiple files as a dataset. You can use the multifile import
operation when you have a dataset that is partitioned into separate files. It takes all of the files from an
Amazon S3 directory and imports them as a single dataset. For information on the types of files that you
can import and how to import them, see the following sections.

Single File Import

You can import single files in the following formats:

• Comma Separated Values (CSV)


• Parquet
• Javascript Object Notation (JSON)
• Optimized Row Columnar (ORC)
• Image – Data Wrangler uses OpenCV to import images. For more information about supported
image formats, see Image file reading and writing.

For files formatted in JSON, Data Wrangler supports both JSON lines (.jsonl) and JSON documents
(.json). When you preview your data, it automatically shows the JSON in tabular format. For nested
JSON documents that are larger than 5 MB, Data Wrangler shows the schema for the structure
and the arrays as values in the dataset. Use the Flatten structured and Explode array operators to
display the nested values in tabular format. For more information, see Unnest JSON Data (p. 1097)
and Explode Array (p. 1098).
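
As an illustration of what flattening a structure and exploding an array do, here is a standalone pandas sketch with hypothetical records; it is only an analogy, not the Data Wrangler implementation:

import pandas as pd

# Hypothetical nested JSON records.
records = [
    {"id": 1, "address": {"city": "Seattle", "zip": "98101"}, "tags": ["a", "b"]},
    {"id": 2, "address": {"city": "Boston", "zip": "02101"}, "tags": ["c"]},
]

# Flatten the nested 'address' struct into address.city and address.zip columns.
df = pd.json_normalize(records)

# Explode the 'tags' array so that each element gets its own row.
df = df.explode("tags")
print(df)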

When you choose a dataset, you can rename it, specify the file type, and identify the first row as a
header.

You can import a dataset that you've partitioned into multiple files in an Amazon S3 bucket in a
single import step.

To import a dataset into Data Wrangler from a single file that you've stored in Amazon
S3:

1. If you are not currently on the Import tab, choose Import.


2. Under Available, choose Amazon S3 to see the Import S3 Data Source view.
3. From the table of available S3 buckets, select a bucket and navigate to the dataset you want to
import.
4. Select the file that you want to import. If your dataset does not have a .csv or .parquet
extension, select the data type from the File Type dropdown list.
5. If your CSV file has a header, select the checkbox next to Add header to table.
6. Use the Preview table to preview your dataset. This table shows up to 100 rows.
7. In the Details pane, verify or change the Name and File Type for your dataset. If you add a
Name that contains spaces, these spaces are replaced with underscores when your dataset is
imported.
8. Specify the sampling configuration that you'd like to use.
9. Choose Import dataset.


Multifile Import

The following are the requirements for importing multiple files:

• The files must be in the same folder of your Amazon S3 bucket.


• The files must either share the same header or have no header.

Each file must be in one of the following formats:

• CSV
• Parquet
• Optimized Row Columnar (ORC)
• Image – Data Wrangler uses OpenCV to import images. For more information about supported
image formats, see Image file reading and writing.

Use the following procedure to import multiple files.

To import a dataset into Data Wrangler from multiple files that you've stored in an
Amazon S3 directory

1. If you are not currently on the Import tab, choose Import.


2. Under Available, choose Amazon S3 to see the Import S3 Data Source view.


3. From the table of available S3 buckets, select the bucket containing the folder that you want to
import.
4. Select the folder containing the files that you want to import. Each file must be in one of the
supported formats. Your files must be the same data type.
5. If your folder contains CSV files with headers, select the checkbox next to First row is header.
6. If your files are nested within other folders, select the checkbox next to Include nested
directories.
7. (Optional) Choose Add filename column to add a column to the dataset that shows the filename for each observation.
8. (Optional) By default, Data Wrangler doesn't show you a preview of a folder. You can activate
previewing by choosing the blue Preview off button. A preview shows the first 10 rows of
the first 10 files in the folder. The following images show you how to activate a preview for a
dataset created from nested directories.


9. In the Details pane, verify or change the Name and File Type for your dataset. If you add a
Name that contains spaces, these spaces are replaced with underscores when your dataset is
imported.
10. Specify the sampling configuration that you'd like to use.
11. Choose Import dataset.

You can also use parameters to import a subset of files that match a pattern. Parameters help you more
selectively pick the files that you're importing. To start using parameters, edit the data source and apply
them to the path that you're using to import the data. For more information, see Reusing Data Flows for
Different Datasets (p. 1109).

Import data from Athena


Use Amazon Athena to import your data from Amazon Simple Storage Service (Amazon S3) into Data
Wrangler. In Athena, you write standard SQL queries to select the data that you're importing from
Amazon S3. For more information, see What is Amazon Athena?

You can use the AWS Management Console to set up Amazon Athena. You must create at least one
database in Athena before you start running queries. For more information about getting started with
Athena, see Getting started.

Athena is directly integrated with Data Wrangler. You can write Athena queries without having to leave
the Data Wrangler UI.

In addition to writing simple Athena queries in Data Wrangler, you can also use:

• Athena workgroups for query result management. For more information about workgroups, see
Managing query results (p. 999).
• Lifecycle configurations for setting data retention periods. For more information about data retention,
see Setting data retention periods (p. 999).


Query Athena within Data Wrangler


Note
Data Wrangler does not support federated queries.

If you use AWS Lake Formation with Athena, make sure your Lake Formation IAM permissions do not
override IAM permissions for the database sagemaker_data_wrangler.

Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For
Athena, it provides the following sampling options:

• None – Import the entire dataset.


• First K – Sample the first K rows of the dataset, where K is an integer that you specify.
• Randomized – Takes a random sample of a size that you specify.
• Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a
column.

The following procedure shows how to import a dataset from Athena into Data Wrangler.

To import a dataset into Data Wrangler from Athena

1. Sign in to the Amazon SageMaker console.


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. Choose Import data.
9. Under Available, choose Amazon Athena.
10. For Data Catalog, choose a data catalog.
11. Use the Database dropdown list to select the database that you want to query. When you select a
database, you can preview all tables in your database using the Tables listed under Details.
12. (Optional) Choose Advanced configuration.

a. Choose a Workgroup.
b. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a
workgroup, specify a value for Amazon S3 location of query results.
c. (Optional) For Data retention period, select the checkbox to set a data retention period and
specify the number of days to store the data before it's deleted.
d. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the
checkbox and not save the connection.
13. For Sampling, choose a sampling method. Choose None to turn off sampling.
14. Enter your query in the query editor and use the Run button to run the query. After a successful
query, you can preview your result under the editor.
Note
Salesforce data uses the timestamptz type. If you're querying the timestamp column that
you've imported to Athena from Salesforce, cast the data in the column to the timestamp
type. The following query casts the timestamp column to the correct type.

-- Cast column timestamptz_col to the timestamp type, and name it timestamp_col.
select cast(timestamptz_col as timestamp) as timestamp_col from table


15. To import the results of your query, select Import.

After you complete the preceding procedure, the dataset that you've queried and imported appears in
the Data Wrangler flow.

By default, Data Wrangler saves the connection settings as a new connection. When you import your
data, the query that you've already specified appears as a new connection. The saved connections store
information about the Athena workgroups and Amazon S3 buckets that you're using. When you're
connecting to the data source again, you can choose the saved connection.

Managing query results


Data Wrangler supports using Athena workgroups to manage the query results within an AWS account.
You can specify an Amazon S3 output location for each workgroup. You can also specify whether
the output of the query can go to different Amazon S3 locations. For more information, see Using
Workgroups to Control Query Access and Costs.

Your workgroup might be configured to enforce the Amazon S3 query output location. You can't change
the output location of the query results for those workgroups.

If you don't use a workgroup or specify an output location for your queries, Data Wrangler uses the
default Amazon S3 bucket in the same AWS Region in which your Studio instance is located to store
Athena query results. It creates temporary tables in this database to move the query output to this
Amazon S3 bucket. It deletes these tables after data has been imported; however, the database,
sagemaker_data_wrangler, persists. To learn more, see Imported Data Storage (p. 1033).

To use Athena workgroups, set up the IAM policy that gives access to workgroups. If you're using a
SageMaker-Execution-Role, we recommend adding the policy to the role. For more information
about IAM policies for workgroups, see IAM policies for accessing workgroups. For example workgroup
policies, see Workgroup example policies.

Setting data retention periods


Data Wrangler automatically sets a data retention period for the query results. The results are deleted
after the length of the retention period. For example, with the default retention period of five days, query results are deleted five days after they're created. This configuration is designed to help you clean up data
that you're no longer using. Cleaning up your data prevents unauthorized users from gaining access. It
also helps control the costs of storing your data on Amazon S3.

If you don't set a retention period, the Amazon S3 lifecycle configuration determines the duration that
the objects are stored. The data retention policy that you've specified for the lifecycle configuration
removes any query results that are older than the Lifecycle configuration that you've specified. For more
information, see Setting lifecycle configuration on a bucket.
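
For context, the following boto3 sketch shows the kind of S3 lifecycle rule this involves; the bucket name and prefix are placeholders, not values that Data Wrangler actually uses:

import boto3

s3 = boto3.client("s3")

# Expire objects under a query-results prefix after 5 days,
# mirroring the default retention period described above.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-default-sagemaker-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-query-results",
                "Filter": {"Prefix": "athena/query-results/"},  # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 5},
            }
        ]
    },
)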

Data Wrangler uses S3 lifecycle configurations to manage data retention and expiration. You must
give your Amazon SageMaker Studio IAM execution role permissions to manage bucket lifecycle
configurations. Use the following procedure to give permissions.

To give permissions to manage the lifecycle configuration, do the following.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. Choose Roles.
3. In the search bar, specify the Amazon SageMaker execution role that Amazon SageMaker Studio is
using.
4. Choose the role.


5. Choose Add permissions.


6. Choose Create inline policy.
7. For Service, specify S3 and choose it.
8. Under the Read section, choose GetLifecycleConfiguration.
9. Under the Write section, choose PutLifecycleConfiguration.
10. For Resources, choose Specific.
11. For Actions, select the arrow icon next to Permissions management.
12. Choose PutResourcePolicy.
13. For Resources, choose Specific.
14. Choose the checkbox next to Any in this account.
15. Choose Review policy.
16. For Name, specify a name.
17. Choose Create policy.
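
If you'd rather script the lifecycle permissions than use the console, the following is a boto3 sketch covering the two S3 lifecycle actions from the procedure above; the role name and policy name are placeholders, and the sketch omits the PutResourcePolicy permission from steps 11 through 14:

import json
import boto3

iam = boto3.client("iam")

lifecycle_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetLifecycleConfiguration",
                "s3:PutLifecycleConfiguration",
            ],
            "Resource": "arn:aws:s3:::*",
        }
    ],
}

# Attach the policy inline to the Studio execution role (placeholder name).
iam.put_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-example",
    PolicyName="DataWranglerLifecycleConfiguration",
    PolicyDocument=json.dumps(lifecycle_policy),
)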

Import data from Amazon Redshift


Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The first step
to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you
provision your cluster, you can upload your dataset and then perform data analysis queries.

You can connect to and query one or more Amazon Redshift clusters in Data Wrangler. To use this import
option, you must create at least one cluster in Amazon Redshift. To learn how, see Getting started with
Amazon Redshift.

You can output the results of your Amazon Redshift query in one of the following locations:

• The default Amazon S3 bucket


• An Amazon S3 output location that you specify

You can either import the entire dataset or sample a portion of it. For Amazon Redshift, Data Wrangler provides the following sampling options:

• None – Import the entire dataset.


• First K – Sample the first K rows of the dataset, where K is an integer that you specify.
• Randomized – Takes a random sample of a size that you specify.
• Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a
column.

Amazon Redshift query results are stored in the default Amazon S3 bucket, which is in the same AWS Region in which your Studio instance is located. For more information, see Imported Data Storage (p. 1033).

For either the default Amazon S3 bucket or the bucket that you specify, you have the following
encryption options:

• The default AWS server-side encryption with an Amazon S3 managed key (SSE-S3)
• An AWS Key Management Service (AWS KMS) key that you specify

An AWS KMS key is an encryption key that you create and manage. For more information on KMS keys,
see AWS Key Management Service.

You can specify an AWS KMS key using either the key ARN or the alias ARN.


If you use the IAM managed policy, AmazonSageMakerFullAccess, to grant a role permission to use
Data Wrangler in Studio, your Database User name must have the prefix sagemaker_access.

Use the following procedures to learn how to add a new cluster.


Note
Data Wrangler uses the Amazon Redshift Data API with temporary credentials. To learn
more about this API, refer to Using the Amazon Redshift Data API in the Amazon Redshift
Management Guide.

To connect to an Amazon Redshift cluster

1. Sign in to the Amazon SageMaker console.


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. Choose Import data.
9. Under Available, choose Amazon Redshift.
10. Choose Temporary credentials (IAM) for Type.
11. Enter a Connection Name. This is a name used by Data Wrangler to identify this connection.
12. Enter the Cluster Identifier to specify the cluster to which you want to connect. Note: Enter only the cluster identifier and not the full endpoint of the Amazon Redshift cluster.
13. Enter the Database Name of the database to which you want to connect.
14. Enter a Database User to identify the user you want to use to connect to the database.
15. For UNLOAD IAM Role, enter the IAM role ARN of the role that the Amazon Redshift cluster should assume to move and write data to Amazon S3. For more information about this role, see Authorizing Amazon Redshift to access other AWS services on your behalf in the Amazon Redshift Management Guide.
16. Choose Connect.
17. (Optional) For Amazon S3 output location, specify the S3 URI to store the query results.
18. (Optional) For KMS key ID, specify the ARN of the AWS KMS key or alias. The following image shows you where you can find either key in the AWS Management Console.


The following image shows all the fields from the preceding procedure.

After your connection is successfully established, it appears as a data source under Data Import. Select
this data source to query your database and import data.

To query and import data from Amazon Redshift

1. Select the connection that you want to query from Data Sources.
2. Select a Schema. To learn more about Amazon Redshift Schemas, see Schemas in the Amazon
Redshift Database Developer Guide.
3. (Optional) Under Advanced configuration, specify the Sampling method that you'd like to use.
4. Enter your query in the query editor and choose Run to run the query. After a successful query, you
can preview your result under the editor.
5. Select Import dataset to import the dataset that has been queried.
6. Enter a Dataset name. If you add a Dataset name that contains spaces, these spaces are replaced
with underscores when your dataset is imported.
7. Choose Add.

To edit a dataset, do the following.


1. Navigate to your Data Wrangler flow.


2. Choose the + next to Source - Sampled.
3. Change the data that you're importing.
4. Choose Apply.

Import data from Amazon EMR


You can use Amazon EMR as a data source for your Amazon SageMaker Data Wrangler flow. Amazon EMR is a managed cluster platform that you can use to process and analyze large amounts of data. For more information about Amazon EMR, see What is Amazon EMR?. To import a dataset from Amazon EMR, you connect to the cluster and query it.
Important
You must meet the following prerequisites to connect to an Amazon EMR cluster:

Prerequisites

• Network configurations

• You have an Amazon VPC in the Region that you're using to launch Amazon SageMaker
Studio and Amazon EMR.
• Both Amazon EMR and Amazon SageMaker Studio must be launched in private subnets.
They can be in the same subnet or in different ones.
• Amazon SageMaker Studio must be in VPC-only mode.

For more information about creating a VPC, see Create a VPC.

For more information about connecting Studio notebooks in a VPC to external resources, see Connect SageMaker Studio Notebooks in a VPC to External Resources.
• The Amazon EMR clusters that you're running must be in the same Amazon VPC.
• The Amazon EMR clusters and the Amazon VPC must be in the same AWS account.
• Your Amazon EMR clusters are running Hive or Presto.
• Hive clusters must allow inbound traffic from Studio security groups on port 10000.
• Presto clusters must allow inbound traffic from Studio security groups on port 8889.
• SageMaker Studio

• Amazon SageMaker Studio must run Jupyter Lab Version 3. For information about updating
the Jupyter Lab Version, see View and update the JupyterLab version of an application from
the console (p. 140).
• Amazon SageMaker Studio has an IAM role that controls user access. The default IAM role that you're using to run Amazon SageMaker Studio doesn't have policies that give you access to Amazon EMR clusters. You must attach a policy granting those permissions to the IAM role. For more information, see Configure the discoverability of Amazon EMR clusters (for administrators) (p. 1178).
• The IAM role must also have the following permission attached: secretsmanager:PutResourcePolicy.
• If you're using a Studio domain that you've already created, make sure that its
AppNetworkAccessType is in VPC-only mode. For information about updating a domain
to use VPC-only mode, see Shut down and Update SageMaker Studio (p. 199).
• Amazon EMR clusters

• You must have Hive or Presto installed on your cluster.


• The Amazon EMR release must be version 5.5.0 or later.

Note
Amazon EMR supports auto termination. Auto termination stops idle clusters from
running and prevents you from incurring costs. The following are the releases that
support auto termination:
• For 6.x releases, version 6.1.0 or later.
• For 5.x releases, version 5.30.0 or later.

An Amazon VPC is a virtual network that is logically isolated from other networks on the AWS cloud.
Amazon SageMaker Studio and your Amazon EMR cluster only exist within the Amazon VPC.

Use the following procedure to launch Amazon SageMaker Studio in an Amazon VPC.

To launch Studio within a VPC, do the following.

1. Navigate to the SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Launch SageMaker Studio.
3. Choose Standard setup.
4. For Default execution role, choose the IAM role to set up Studio.
5. Choose the VPC where you've launched the Amazon EMR clusters.
6. For Subnet, choose a private subnet.
7. For Security group(s), specify the security groups that you're using to control traffic within your VPC.
8. Choose VPC Only.
9. (Optional) AWS uses a default encryption key. You can specify an AWS Key Management Service key
to encrypt your data.
10. Choose Next.
11. Under Studio settings, choose the configurations that are best suited to you.
12. Choose Next to skip the SageMaker Canvas settings.
13. Choose Next to skip the RStudio settings.

If you don't have an Amazon EMR cluster ready, you can use the following procedure to create one. For
more information about Amazon EMR, see What is Amazon EMR?

To create a cluster, do the following.

1. Navigate to the AWS Management Console.


2. In the search bar, specify Amazon EMR.
3. Choose Create cluster.
4. For Cluster name, specify the name of your cluster.
5. For Release, select the release version of the cluster.
Note
Amazon EMR supports auto termination for the following releases:

• For 6.x releases, releases 6.1.0 or later


• For 5.x releases, releases 5.30.0 or later

Auto termination stops idle clusters from running and prevents you from incurring costs.
6. (Optional) For Applications, choose Presto.
7. Choose the application that you're running on the cluster.


8. Under Networking, for Hardware configuration, specify the hardware configuration settings.
Important
For Networking, choose the VPC that is running Amazon SageMaker Studio and choose a
private subnet.
9. Under Security and access, specify the security settings.
10. Choose Create.

For a tutorial about creating an Amazon EMR cluster, see Getting started with Amazon EMR. For
information about best practices for configuring a cluster, see Considerations and best practices.
Note
For security best practices, Data Wrangler can only connect to VPCs on private subnets. You
won't be able to connect to the master node unless you use AWS Systems Manager for your
EMR instances. For more information, see Securing access to EMR clusters using AWS Systems
Manager.

You can currently use the following methods to access an Amazon EMR cluster:

• No authentication
• Lightweight Directory Access Protocol (LDAP)

Use the following sections to create a Presto or Hive Amazon EMR cluster with LDAP activated.

Presto
Important
To use AWS Glue as a metastore for Presto tables, select Use for Presto table metadata to store the results of your Amazon EMR queries in an AWS Glue data catalog when you're launching an EMR cluster. Storing the query results in an AWS Glue data catalog can save you from incurring charges.
To be able to query large datasets on Amazon EMR clusters, add the following properties to the Presto configuration file on your EMR clusters:

[{"classification":"presto-config","properties":{
"http-server.max-request-header-size":"5MB",
"http-server.max-response-header-size":"5MB"}}]

You can also modify the configuration settings when you launch the Amazon EMR cluster.
The configuration file for your Amazon EMR cluster is located under the following path: /
etc/presto/conf/config.properties.

Use the following procedure to create a Presto cluster with LDAP activated.

To create a cluster, do the following.

1. Navigate to the AWS Management Console.


2. In the search bar, specify Amazon EMR.
3. Choose Create cluster.
4. For Cluster name, specify the name of your cluster.
5. For Release, select the release version of the cluster.
Note
Amazon EMR supports auto termination for the following releases:


• For 6.x releases, releases 6.1.0 or later


• For 5.x releases, releases 5.30.0 or later

Auto termination stops idle clusters from running and prevents you from incurring costs.
6. Choose the application that you're running on the cluster.
7. Under Networking, for Hardware configuration, specify the hardware configuration settings.
Important
For Networking, choose the VPC that is running Amazon SageMaker Studio and choose
a private subnet.
8. Under Security and access, specify the security settings.
9. Choose Create.

Hive
Important
To use AWS Glue as a metastore for Hive tables, select Use for Hive table metadata to store the results of your Amazon EMR queries in an AWS Glue data catalog when you're launching an EMR cluster. Storing the query results in an AWS Glue data catalog can save you from incurring charges.
To be able to query large datasets on Amazon EMR clusters, add the following properties to the Hive configuration file on your EMR clusters:

[{"classification":"hive-site", "properties"
:{"hive.resultset.use.unique.column.names":"false"}}]

You can also modify the configuration settings when you launch the Amazon EMR cluster.
The configuration file for your Amazon EMR cluster is located under the following path: /
etc/hive/conf/hive-site.xml. You can specify the following property and restart the
cluster:

<property>
<name>hive.resultset.use.unique.column.names</name>
<value>false</value>
</property>

Use the following procedure to create a Hive cluster with LDAP activated.

To create a Hive cluster with LDAP activated, do the following.

1. Navigate to the AWS Management Console.


2. In the search bar, specify Amazon EMR.
3. Choose Create cluster.
4. Choose Go to advanced options.
5. For Release, select an Amazon EMR release version.
6. The Hive configuration option is selected by default. Make sure the Hive option has a checkbox
next to it.
7. (Optional) You can also select Presto as a configuration option to activate both Hive and Presto
on your cluster.


8. (Optional) Select Use for Hive table metadata to store the results of your Amazon EMR queries in an AWS Glue data catalog. Storing the query results in an AWS Glue catalog can save you from incurring charges. For more information, see Using the AWS Glue Data Catalog as the metastore for Hive.
Note
Storing the query results in a data catalog requires Amazon EMR version 5.8.0 or later.
9. Under Enter configuration, specify the following JSON:

[
  {
    "classification": "hive-site",
    "properties": {
      "hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
      "hive.server2.authentication": "LDAP",
      "hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
    }
  }
]

Note
As a security best practice, we recommend enabling SSL for HiveServer by adding a few
properties in the preceding hive-site JSON. For more information, see Enable SSL on
HiveServer2.
10. Specify the remaining cluster settings and create a cluster.

Use the following sections to use LDAP authentication for Amazon EMR clusters that you've already
created.

LDAP for Presto

Using LDAP on a cluster running Presto requires access to the Presto coordinator through HTTPS. Do
the following to provide access:

• Activate access on port 636


• Enable SSL for the Presto coordinator

Use the following template to configure Presto:

- Classification: presto-config
  ConfigurationProperties:
    http-server.authentication.type: 'PASSWORD'
    http-server.https.enabled: 'true'
    http-server.https.port: '8889'
    http-server.http.port: '8899'
    node-scheduler.include-coordinator: 'true'
    http-server.https.keystore.path: '/path/to/keystore/path/for/presto'
    http-server.https.keystore.key: 'keystore-key-password'
    discovery.uri: 'http://master-node-dns-name:8899'
- Classification: presto-password-authenticator
  ConfigurationProperties:
    password-authenticator.name: 'ldap'
    ldap.url: !Sub 'ldaps://ldap-server-dns-name:636'
    ldap.user-bind-pattern: "uid=${USER},dc=example,dc=org"
    internal-communication.authentication.ldap.user: "ldap-user-name"
    internal-communication.authentication.ldap.password: "ldap-password"


For information about setting up LDAP in Presto, see the following resources:

• LDAP Authentication
• Using LDAP Authentication for Presto on Amazon EMR

Note
As a security best practice, we recommend enabling SSL for Presto. For more information,
see Secure Internal Communication.
LDAP for Hive

To use LDAP for Hive on a cluster that you've already created, use the procedure Reconfigure an instance group in the console and apply the following configuration. In the configuration, you specify the DNS name of the LDAP server to which you're connecting.

[
  {
    "classification": "hive-site",
    "properties": {
      "hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
      "hive.server2.authentication": "LDAP",
      "hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
    }
  }
]

Use the following procedure to import data from a cluster.

To import data from a cluster, do the following.

1. Open a Data Wrangler flow.


2. Choose Create Connection.
3. Choose Amazon EMR.
4. Do one of the following.
• (Optional) For Secrets ARN, specify the Amazon Resource Name (ARN) of the secret for the database within the cluster. Secrets provide additional security. For more information about secrets, see What is AWS Secrets Manager? For information about creating a secret for your cluster, see Creating an AWS Secrets Manager secret for your cluster (p. 1010).
• From the dropdown table, choose a cluster.
5. Choose Next.
6. For Select an endpoint for example-cluster-name cluster, choose a query engine.
7. (Optional) Select Save connection.
8. Choose Next, select a login type, and choose one of the following:

• No authentication
• LDAP
9. For Login into example-cluster-name cluster, specify the Username and Password for the
cluster.
10. Choose Connect.
11. In the query editor, specify a SQL query.


12. Choose Run.


13. Choose Import.

Creating an AWS Secrets Manager secret for your cluster


A Secrets Manager secret stores the JDBC URL of the Amazon EMR cluster as a secret. Using a secret is more secure than directly entering your credentials.

Use the following procedure to store the JDBC URL as a secret.

To store the JDBC URL as a secret, do the following.

1. Navigate to the AWS Management Console.


2. In the search bar, specify Secrets Manager.
3. Choose AWS Secrets Manager.
4. Choose Store a new secret.
5. For Secret type, choose Other type of secret.
6. For Key/value pairs, specify jdbcURL as the key and a valid JDBC URL as the value.

The format of a valid JDBC URL depends on whether you use authentication and whether you use
Hive or Presto as the query engine. The following list shows the valid JDBC URL formats for the different possible configurations.

• Hive, no authentication – jdbc:hive2://emr-cluster-master-public-dns:10000/;


• Hive, LDAP authentication – jdbc:hive2://emr-cluster-master-public-dns-
name:10000/;AuthMech=3;UID=david;PWD=welcome123;
• For Hive with SSL enabled, the JDBC URL format depends on whether you use a Java Keystore File
for the TLS configuration. The Java Keystore File helps verify the identity of the master node of
the Amazon EMR cluster. To use a Java Keystore File, generate it on an EMR cluster and upload
it to Data Wrangler. To generate a file, use the following command on the Amazon EMR cluster,
keytool -genkey -alias hive -keyalg RSA -keysize 1024 -keystore hive.jks.
For information about running commands on an Amazon EMR cluster, see Securing access to EMR
clusters using AWS Systems Manager. To upload a file, choose the upward arrow on the left-hand
navigation of the Data Wrangler UI.

The following are the valid JDBC URL formats for Hive with SSL enabled:
• Without a Java Keystore File – jdbc:hive2://emr-cluster-
master-public-dns:10000/;AuthMech=3;UID=user-
name;PWD=password;SSL=1;AllowSelfSignedCerts=1;
• With a Java Keystore File – jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=user-name;PWD=password;SSL=1;SSLKeyStore=/home/sagemaker-user/data/Java-keystore-file-name;SSLKeyStorePwd=Java-keystore-file-password;
• Presto, no authentication – jdbc:presto://emr-cluster-master-public-dns:8889/;
• For Presto with LDAP authentication and SSL enabled, the JDBC URL format depends on whether
you use a Java Keystore File for the TLS configuration. The Java Keystore File helps verify the
identity of the master node of the Amazon EMR cluster. To use a Java Keystore File, generate it
on an EMR cluster and upload it to Data Wrangler. To upload a file, choose the upward arrow on
the left-hand navigation of the Data Wrangler UI. For information about creating a Java Keystore
File for Presto, see Java Keystore File for TLS. For information about running commands on an
Amazon EMR cluster, see Securing access to EMR clusters using AWS Systems Manager.
• Without a Java Keystore File – jdbc:presto://emr-cluster-master-public-
dns:8889/;SSL=1;AuthenticationType=LDAP Authentication;UID=user-
name;PWD=password;AllowSelfSignedServerCert=1;AllowHostNameCNMismatch=1;


• With a Java Keystore File – jdbc:presto://emr-cluster-master-public-dns:8889/;SSL=1;AuthenticationType=LDAP Authentication;SSLTrustStorePath=/home/sagemaker-user/data/Java-keystore-file-name;SSLTrustStorePwd=Java-keystore-file-password;UID=user-name;PWD=password;
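
If you prefer to create the secret programmatically instead of through the console procedure above, the following is a minimal boto3 sketch. The secret name and the JDBC URL are placeholder values; the only requirement Data Wrangler imposes is that the URL is stored under the jdbcURL key.

import boto3

secretsmanager = boto3.client("secretsmanager")

# Store the JDBC URL under the jdbcURL key, which is the format
# that Data Wrangler expects. Both values below are placeholders.
response = secretsmanager.create_secret(
    Name="example-emr-cluster-jdbc-secret",
    SecretString='{"jdbcURL": "jdbc:hive2://emr-cluster-master-public-dns:10000/;"}',
)

# Use this ARN for the Secrets ARN field in Data Wrangler.
print(response["ARN"])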

Throughout the process of importing data from an Amazon EMR cluster, you might run into issues. For
information about troubleshooting them, see Troubleshooting issues with Amazon EMR (p. 1159).

Import data from Databricks (JDBC)


You can use Databricks as a data source for your Amazon SageMaker Data Wrangler flow. To import a
dataset from Databricks, use the JDBC (Java Database Connectivity) import functionality to access
your Databricks database. After you access the database, specify a SQL query to get the data and import
it.

We assume that you have a running Databricks cluster and that you've configured the JDBC driver for it.
For more information, see the following Databricks documentation pages:

• JDBC driver
• JDBC configuration and connection parameters
• Authentication parameters

Data Wrangler stores your JDBC URL in AWS Secrets Manager. You must give your Amazon SageMaker
Studio IAM execution role permissions to use Secrets Manager. Use the following procedure to give
permissions.

To give permissions to Secrets Manager, do the following.

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. Choose Roles.
3. In the search bar, specify the Amazon SageMaker execution role that Amazon SageMaker Studio is
using.
4. Choose the role.
5. Choose Add permissions.
6. Choose Create inline policy.
7. For Service, specify Secrets Manager and choose it.
8. For Actions, select the arrow icon next to Permissions management.
9. Choose PutResourcePolicy.
10. For Resources, choose Specific.
11. Choose the checkbox next to Any in this account.
12. Choose Review policy.
13. For Name, specify a name.
14. Choose Create policy.

You can use partitions to import your data more quickly. Partitions give Data Wrangler the ability to
process the data in parallel. By default, Data Wrangler uses 2 partitions. For most use cases, 2 partitions
give you near-optimal data processing speeds.

If you choose to specify more than 2 partitions, you can also specify a column to partition the data. The
type of the values in the column must be numeric or date.


We recommend using partitions only if you understand the structure of the data and how it's processed.

You can either import the entire dataset or sample a portion of it. For a Databricks database, Data
Wrangler provides the following sampling options:

• None – Import the entire dataset.


• First K – Sample the first K rows of the dataset, where K is an integer that you specify.
• Randomized – Takes a random sample of a size that you specify.
• Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a
column.

Use the following procedure to import your data from a Databricks database.

To import data from Databricks, do the following.

1. Sign into Amazon SageMaker Console.


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. From the Import data tab of your Data Wrangler flow, choose Databricks.
6. Specify the following fields:

• Dataset name – A name that you want to use for the dataset in your Data Wrangler flow.
• Driver – com.simba.spark.jdbc.Driver.
• JDBC URL – The URL of the Databricks database. The URL formatting can vary between Databricks
instances. For information about finding the URL and specifying the parameters within
it, see JDBC configuration and connection parameters. The following is an example of how a
URL can be formatted: jdbc:spark://aws-sagemaker-datawrangler.cloud.databricks.com:443/
default;transportMode=http;ssl=1;httpPath=sql/protocolv1/
o/3122619508517275/0909-200301-cut318;AuthMech=3;UID=token;PWD=personal-
access-token.
Note
You can specify a secret ARN that contains the JDBC URL instead of specifying the
JDBC URL itself. The secret must contain a key-value pair with the following format:
jdbcURL:JDBC-URL. For more information, see What is Secrets Manager?.
7. Specify a SQL SELECT statement.
Note
Data Wrangler doesn't support Common Table Expressions (CTE) or temporary tables within
a query.
8. For Sampling, choose a sampling method.
9. Choose Run.
10. (Optional) For the PREVIEW, choose the gear to open the Partition settings.
The gear for the additional settings is located to the far right of the PREVIEW title.

• Specify the number of partitions. You can partition by column if you specify the number of
partitions:

• Enter number of partitions – Specify a value greater than 2.


• (Optional) Partition by column – Specify the following fields. You can only partition by a
column if you've specified a value for Enter number of partitions.
• Select column – Select the column that you're using for the data partition. The data type of
the column must be numeric or date.


• Upper bound – From the values in the column that you've specified, the upper bound is the
value that you're using in the partition. The value that you specify doesn't change the data
that you're importing. It only affects the speed of the import. For the best performance,
specify an upper bound that's close to the column's maximum.
• Lower bound – From the values in the column that you've specified, the lower bound is the
value that you're using in the partition. The value that you specify doesn't change the data
that you're importing. It only affects the speed of the import. For the best performance,
specify a lower bound that's close to the column's minimum.
11. Choose Import.
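
To see why the upper and lower bounds matter, the following is a minimal sketch of the range splitting that JDBC-style partitioned reads typically perform. It is not Data Wrangler's internal implementation; it only illustrates that bounds close to the column's minimum and maximum produce evenly sized partitions.

def partition_ranges(lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound] into evenly sized ranges,
    one per partition, so each range can be read in parallel."""
    stride = (upper_bound - lower_bound) / num_partitions
    return [
        (lower_bound + i * stride, lower_bound + (i + 1) * stride)
        for i in range(num_partitions)
    ]

# With bounds close to the column's true minimum and maximum,
# the four partitions cover the data evenly.
print(partition_ranges(0, 1_000_000, 4))
# [(0.0, 250000.0), (250000.0, 500000.0), (500000.0, 750000.0), (750000.0, 1000000.0)]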

Import data from Snowflake


You can use Snowflake as a data source in SageMaker Data Wrangler to prepare data in Snowflake for
machine learning.

With Snowflake as a data source in Data Wrangler, you can quickly connect to Snowflake without writing
a single line of code. You can join your data in Snowflake with data from any other data source in Data
Wrangler.

Once connected, you can interactively query data stored in Snowflake, transform data with more than
300 preconfigured data transformations, understand data and identify potential errors and extreme
values with a set of robust preconfigured visualization templates, quickly identify inconsistencies in your
data preparation workflow, and diagnose issues before models are deployed into production. Finally, you
can export your data preparation workflow to Amazon S3 for use with other SageMaker features such as
Amazon SageMaker Autopilot, Amazon SageMaker Feature Store and Amazon SageMaker Model Building
Pipelines.

You can encrypt the output of your queries using an AWS Key Management Service key that you've
created. For more information about AWS KMS, see AWS Key Management Service.

Topics
• Administrator Guide (p. 1013)
• Data Scientist Guide (p. 1026)

Administrator Guide
Important
To learn more about granular access control and best practices, see Security Access Control.

This section is for Snowflake administrators who are setting up access to Snowflake from within
SageMaker Data Wrangler.
Important
You are responsible for managing and monitoring the access control within Snowflake. This
includes what data a user can access, what storage integration a user can use, and what queries
a user can run. Data Wrangler does not add a layer of access control with respect to Snowflake.
Access control includes the following:

• The data that a user accesses.


• The storage integration that provides Snowflake the ability to write query results to an
Amazon S3 bucket.
• The queries that a user can run.

Data Wrangler does not add a layer of access control to Snowflake. For more information, see
Configure Snowflake Data Import Permissions (p. 1014).


Important
Note that granting monitor privileges can permit users to see details within an object, such as
queries or usage within a warehouse.

Configure Snowflake Data Import Permissions


To import data from Snowflake, configure access from Data Wrangler using Amazon S3.

This feature is currently not available in the opt-in Regions.

Snowflake requires the following permissions on an S3 bucket and directory to be able to access files in
the directory:

• s3:GetObject
• s3:GetObjectVersion
• s3:ListBucket
• s3:ListObjects
• s3:GetBucketLocation

Create an IAM policy

You must create an IAM policy to configure access permissions for Snowflake to load and unload data
from an Amazon S3 bucket.

The following is the JSON policy document that you use to create the policy:

# Example policy for S3 write access


{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:GetObjectVersion",
"s3:DeleteObject",
"s3:DeleteObjectVersion"
],
"Resource": "arn:aws:s3:::bucket/prefix/*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::bucket/",
"Condition": {
"StringLike": {
"s3:prefix": ["prefix/*"]
}
}
}
]
}
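
If you'd rather create the policy programmatically, the following boto3 sketch creates it from the preceding document. The policy name is a placeholder, and bucket and prefix are the same placeholders used in the document above.

import json
import boto3

iam = boto3.client("iam")

# The policy document from the preceding example. Replace bucket
# and prefix with your own values.
snowflake_s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
            ],
            "Resource": "arn:aws:s3:::bucket/prefix/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::bucket",
            "Condition": {"StringLike": {"s3:prefix": ["prefix/*"]}},
        },
    ],
}

iam.create_policy(
    PolicyName="snowflake-datawrangler-s3-access",  # placeholder name
    PolicyDocument=json.dumps(snowflake_s3_policy),
)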

For information and procedures about creating policies with policy documents, see Creating IAM policies.

For documentation that provides an overview of using IAM permissions with Snowflake, see the
following resources:


• What is IAM?
• Create the IAM Role in AWS
• Create a Cloud Storage Integration in Snowflake
• Retrieve the AWS IAM User for your Snowflake Account
• Grant the IAM User Permissions to Access Bucket.

To grant the data scientist's Snowflake role usage permission to the storage integration, you must run
GRANT USAGE ON INTEGRATION integration_name TO snowflake_role;.

• integration_name is the name of your storage integration.


• snowflake_role is the name of the default Snowflake role given to the data scientist user.

Setting up Snowflake OAuth Access

Instead of having your users directly enter their credentials into Data Wrangler, you can have them use
an identity provider to access Snowflake. The following are links to the Snowflake documentation for the
identity providers that Data Wrangler supports.

• Azure AD
• Okta
• Ping Federate

Use the documentation from the preceding links to set up access to your identity provider. The
information and procedures in this section help you understand how to properly use the documentation
to access Snowflake within Data Wrangler.

Your identity provider needs to recognize Data Wrangler as an application. Use the following procedure
to register Data Wrangler as an application within the identity provider:

1. Select the configuration that starts the process of registering Data Wrangler as an application.
2. Provide the users within the identity provider access to Data Wrangler.
3. Turn on OAuth client authentication by storing the client credentials as an AWS Secrets Manager
secret.
4. Specify a redirect URL using the following format: https://Domain-ID.studio.AWS-Region.sagemaker.aws/jupyter/default/lab
Important
You're specifying the Amazon SageMaker Domain ID and AWS Region that you're using to
run Data Wrangler.
Important
You must register a URL for each Amazon SageMaker Domain and AWS Region where you're
running Data Wrangler. Users from a Domain and AWS Region that don't have redirect
URLs set up for them won't be able to authenticate with the identity provider to access the
Snowflake connection.
5. Make sure that the authorization code and refresh token grant types are allowed for the Data
Wrangler application.

Within your identity provider, you must set up a server that sends OAuth tokens to Data Wrangler at the
user level. The server sends the tokens with Snowflake as the audience.

Snowflake uses the concept of roles, which are distinct from the IAM roles used in AWS. You must
configure the identity provider to use any role, which uses the default role associated with the
Snowflake account. For example, if a user has systems administrator as the default role in their
Snowflake profile, the
connection from Data Wrangler to Snowflake uses systems administrator as the role.

Use the following procedure to set up the server.

To set up the server, do the following. You're working within Snowflake for all steps except the last one.

1. Start setting up the server or API.


2. Configure the authorization server to use the authorization code and refresh token grant types.
3. Specify the lifetime of the access token.
4. Set the refresh token idle timeout. The idle timeout is the time that the refresh token expires if it's
not used.
Note
If you're scheduling jobs in Data Wrangler, we recommend making the idle timeout time
greater than the frequency of the processing job. Otherwise, some processing jobs might
fail because the refresh token expired before they could run. When the refresh token
expires, the user must re-authenticate by accessing the connection that they've made to
Snowflake through Data Wrangler.
5. Specify session:role-any as the new scope.
Note
For Azure AD, copy the unique identifier for the scope. Data Wrangler requires you to
provide it with the identifier.
6. Important: Within the External OAuth Security Integration for Snowflake, enable
external_oauth_any_role_mode.

Important
Data Wrangler doesn't support rotating refresh tokens. Using rotating refresh tokens might
result in access failures or users needing to log in frequently.
Important
If the refresh token expires, your users must reauthenticate by accessing the connection that
they've made to Snowflake through Data Wrangler.

After you've set up the OAuth provider, you provide Data Wrangler with the information it needs to
connect to the provider. You can use the documentation from your identity provider to get values for the
following fields:

• Token URL – The URL of the token that the identity provider sends to Data Wrangler.
• Authorization URL – The URL of the authorization server of the identity provider.
• Client ID – The ID of the identity provider.
• Client secret – The secret that only the authorization server or API recognizes.
• (Azure AD only) The OAuth scope credentials that you've copied.

You store the fields and values in an AWS Secrets Manager secret and add it to the Amazon SageMaker
Studio Lifecycle Configuration that you're using for Data Wrangler. A Lifecycle Configuration is a shell
script. Use it to make the Amazon Resource Name (ARN) of the secret accessible to Data Wrangler. For
information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information
about using lifecycle configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker
Studio (p. 182).
Important
Before you create a Secrets Manager secret, make sure that the SageMaker execution role that
you're using for Amazon SageMaker Studio has permissions to create and update secrets in
Secrets Manager. For more information about adding permissions, see Example: Permission to
create secrets.

For Okta and Ping Federate, the following is the format of the secret:

{
"token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
"client_id":"example-client-id",
"client_secret":"example-client-secret",
"identity_provider":"OKTA"|"PING_FEDERATE",
"authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"
}

For Azure AD, the following is the format of the secret:

{
"token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
"client_id":"example-client-id",
"client_secret":"example-client-secret",
"identity_provider":"AZURE_AD",
"authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize",
"datasource_oauth_scope":"api://appuri/session:role-any"
}

You must have a Lifecycle Configuration that uses the Secrets Manager secret that you've created.
You can either create a Lifecycle Configuration or modify one that has already been created. The
configuration must use the following script.

#!/bin/bash

set -eux

## Script Body

cat > ~/.snowflake_identity_provider_oauth_config <<EOL


{
"secret_arn": "example-secret-arn"
}
EOL
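
As a sketch of how you might register the preceding script programmatically, the following uses the CreateStudioLifecycleConfig API through boto3. The configuration name is a placeholder; the API requires the script content to be base64 encoded.

import base64
import boto3

sagemaker = boto3.client("sagemaker")

# The Lifecycle Configuration script from the preceding example.
script = """#!/bin/bash

set -eux

## Script Body

cat > ~/.snowflake_identity_provider_oauth_config <<EOL
{
    "secret_arn": "example-secret-arn"
}
EOL
"""

sagemaker.create_studio_lifecycle_config(
    StudioLifecycleConfigName="snowflake-oauth-config",  # placeholder name
    StudioLifecycleConfigContent=base64.b64encode(script.encode()).decode(),
    StudioLifecycleConfigAppType="JupyterServer",
)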

For information about setting up Lifecycle Configurations, see Creating and Associating a Lifecycle
Configuration (p. 183). When you're going through the process of setting up, do the following:

• Set the application type of the configuration to Jupyter Server.


• Attach the configuration to the Amazon SageMaker Domain that has your users.
• Have the configuration run by default. It must run every time a user logs into Studio. Otherwise,
the credentials saved in the configuration won't be available to your users when they're using Data
Wrangler.
• The Lifecycle Configuration creates a file named
.snowflake_identity_provider_oauth_config in the user's home folder. The file contains the
ARN of the Secrets Manager secret. Make sure that it's in the user's home folder every time the
Jupyter Server's instance is initialized.


Private Connectivity between Data Wrangler and Snowflake via AWS PrivateLink

This section explains how to use AWS PrivateLink to establish a private connection between Data
Wrangler and Snowflake. The steps are explained in the following sections.

Create a VPC

If you do not have a VPC set up, then follow the Create a new VPC instructions to create one.

Once you have chosen a VPC that you would like to use for establishing a private connection, provide
the following information to your Snowflake Administrator to enable AWS PrivateLink:

• VPC ID
• AWS Account ID
• Your corresponding account URL you use to access Snowflake

Important
As described in Snowflake's documentation, enabling your Snowflake account can take up to
two business days.

Set up Snowflake AWS PrivateLink Integration

After AWS PrivateLink is activated, retrieve the AWS PrivateLink configuration for your Region by running
the following command in a Snowflake worksheet. Log into your Snowflake console and enter the
following under Worksheets: select SYSTEM$GET_PRIVATELINK_CONFIG();

1. Retrieve the values for the following: privatelink-account-name, privatelink-vpce-id,
privatelink-account-url, and privatelink_ocsp-url from the resulting JSON object.
Examples of each value are shown in the following snippet. Store these values for later use.

privatelink-account-name: xxxxxxxx.region.privatelink
privatelink-vpce-id: com.amazonaws.vpce.region.vpce-svc-xxxxxxxxxxxxxxxxx
privatelink-account-url: xxxxxxxx.region.privatelink.snowflakecomputing.com
privatelink_ocsp-url: ocsp.xxxxxxxx.region.privatelink.snowflakecomputing.com

2. Switch to your AWS Console and navigate to the VPC menu.


3. From the left side panel, choose the Endpoints link to navigate to the VPC Endpoints setup.

Once there, choose Create Endpoint.


4. Select the radio button for Find service by name, as shown in the following screenshot.

5. In the Service Name field, paste in the value for privatelink-vpce-id that you retrieved in the
preceding step and choose Verify.

If the connection is successful, a green alert saying Service name found appears on your screen and
the VPC and Subnet options automatically expand, as shown in the following screenshot. Depending
on your targeted Region, your resulting screen may show another AWS Region name.


6. Select the same VPC ID that you sent to Snowflake from the VPC dropdown list.
7. If you have not yet created a subnet, then perform the following set of instructions to create one.
8. Select Subnets from the VPC dropdown list. Then select Create subnet and follow the prompts to
create a subnet in your VPC. Ensure that you select the VPC ID that you sent to Snowflake.
9. Under Security Group Configuration, select Create New Security Group to open the default Security
Group screen in a new tab. In this new tab, select Create Security Group.
10.Provide a name for the new security group (such as datawrangler-doc-snowflake-
privatelink-connection) and a description. Be sure to select the VPC ID you have used in
previous steps.
11.Add two rules to allow traffic from within your VPC to this VPC endpoint.

Navigate to your VPC under Your VPCs in a separate tab, and retrieve the CIDR block for your VPC.
Then choose Add Rule in the Inbound Rules section. Select HTTPS for the type, leave the Source as
Custom in the form, and paste in the CIDR block that you retrieved (such as 10.0.0.0/16).
12.Choose Create Security Group. Retrieve the Security Group ID from the newly created security group
(such as sg-xxxxxxxxxxxxxxxxx).
13.In the VPC Endpoint configuration screen, remove the default security group. Paste in the security
group ID in the search field and select the checkbox.

14.Select Create Endpoint.


15.If the endpoint creation is successful, you see a page that has a link to your VPC endpoint
configuration, specified by the VPC ID. Select the link to view the configuration in full.


Retrieve the topmost record in the DNS names list. This can be differentiated from other DNS names
because it only includes the Region name (such as us-west-2), and no Availability Zone letter
notation (such as us-west-2a). Store this information for later use.

Configure DNS for Snowflake Endpoints in your VPC

This section explains how to configure DNS for Snowflake endpoints in your VPC. This allows your VPC to
resolve requests to the Snowflake AWS PrivateLink endpoint.

1. Navigate to the Route 53 menu within your AWS console.


2. Select the Hosted Zones option (if necessary, expand the left-hand menu to find this option).
3. Choose Create Hosted Zone.
a. In the Domain name field, reference the value that was stored for privatelink-account-url
in the preceding steps. In this field, remove your Snowflake account ID from the DNS name and
use only the value starting with the Region identifier. A Resource Record Set is also created later
for the subdomain, such as region.privatelink.snowflakecomputing.com.
b. Select the radio button for Private Hosted Zone in the Type section. Your Region code may not be
us-west-2. Reference the DNS name returned to you by Snowflake.


c. In the VPCs to associate with the hosted zone section, select the Region in which your VPC is
located and the VPC ID used in previous steps.

d. Choose Create hosted zone.


4. Next, create two records, one for privatelink-account-url and one for privatelink_ocsp-url.
• In the Hosted Zone menu, choose Create Record Set.
a. Under Record name, enter your Snowflake Account ID only (the first 8 characters in
privatelink-account-url).
b. Under Record type, select CNAME.
c. Under Value, enter the DNS name for the regional VPC endpoint you retrieved in the last step of
the Set up the Snowflake AWS PrivateLink Integration section.

d. Choose Create records.


e. Repeat the preceding steps for the OCSP record we notated as privatelink-ocsp-url,
starting with ocsp through the 8-character Snowflake ID for the record name (such as
ocsp.xxxxxxxx).
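
The two CNAME records can also be created with boto3, as in the following sketch. The hosted zone ID, the 8-character Snowflake account ID, the Region, and the VPC endpoint DNS name are placeholders that you replace with the values you retrieved earlier.

import boto3

route53 = boto3.client("route53")

def create_cname(hosted_zone_id, record_name, vpce_dns_name):
    """Point a Snowflake PrivateLink hostname at the regional
    VPC endpoint DNS name with a CNAME record."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "CREATE",
                    "ResourceRecordSet": {
                        "Name": record_name,
                        "Type": "CNAME",
                        "TTL": 300,
                        "ResourceRecords": [{"Value": vpce_dns_name}],
                    },
                }
            ]
        },
    )

# Placeholder values throughout.
zone_id = "Z0123456789EXAMPLE"
vpce_dns = "vpce-xxxxxxxxxxxxxxxxx.vpce-svc-xxxxxxxxxxxxxxxxx.region.vpce.amazonaws.com"
create_cname(zone_id, "xxxxxxxx.region.privatelink.snowflakecomputing.com", vpce_dns)
create_cname(zone_id, "ocsp.xxxxxxxx.region.privatelink.snowflakecomputing.com", vpce_dns)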


Configure Route 53 Resolver Inbound Endpoint for your VPC

This section explains how to configure Route 53 resolvers inbound endpoints for your VPC.

1. Navigate to the Route 53 menu within your AWS console.


• In the left hand panel in the Security section, select the Security Groups option.
2. Choose Create Security Group.
• Provide a name for your security group (such as datawrangler-doc-route53-resolver-sg) and
a description.
• Select the VPC ID used in previous steps.
• Create rules that allow for DNS over UDP and TCP from within the VPC CIDR block.

• Choose Create Security Group. Note the security group ID, because you add a rule later to allow
traffic from it to the VPC endpoint security group.
3. Navigate to the Route 53 menu within your AWS console.
• In the Resolver section, select the Inbound Endpoint option.
4. Choose Create Inbound Endpoint.
• Provide an endpoint name.
• From the VPC in the Region dropdown list, select the VPC ID you have used in all previous steps.
• In the Security group for this endpoint dropdown list, select the security group ID from Step 2 in
this section.


• In the IP Address section, select an Availability Zone, select a subnet, and leave the radio selector
Use an IP address that is selected automatically selected for each IP address.

• Choose Submit.


5. Select the Inbound endpoint after it has been created.


6. Once the inbound endpoint is created, note the two IP addresses for the resolvers.

SageMaker VPC Endpoints

This section explains how to create VPC endpoints for the following: Amazon SageMaker Studio,
SageMaker Notebooks, the SageMaker API, SageMaker Runtime, and Amazon SageMaker Feature Store
Runtime.

Create a security group that is applied to all endpoints.

1. Navigate to the EC2 menu in the AWS Console.


2. In the Network & Security section, select the Security groups option.
3. Choose Create security group.
4. Provide a security group name and description (such as datawrangler-doc-sagemaker-vpce-sg).
A rule is added later to allow traffic over HTTPS from SageMaker to this group.

Creating the endpoints

1. Navigate to the VPC menu in the AWS console.


2. Select the Endpoints option.
3. Choose Create Endpoint.
4. Search for the service by entering its name in the Search field.
5. From the VPC dropdown list, select the VPC in which your Snowflake AWS PrivateLink connection
exists.
6. In the Subnets section, select the subnets which have access to the Snowflake PrivateLink connection.
7. Leave the Enable DNS Name checkbox selected.
8. In the Security Groups section, select the security group you created in the preceding section.
9. Choose Create Endpoint.
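
The following boto3 sketch shows what one of these endpoint creations looks like for the SageMaker API. The VPC, subnet, and security group IDs are placeholders; repeat the call with the appropriate service name for each of the SageMaker services listed above.

import boto3

ec2 = boto3.client("ec2")

# Interface endpoint for the SageMaker API. Repeat with the service
# names for Studio, notebooks, runtime, and Feature Store runtime.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-xxxxxxxxxxxxxxxxx",  # the VPC with the Snowflake connection
    ServiceName="com.amazonaws.us-east-1.sagemaker.api",
    SubnetIds=["subnet-xxxxxxxxxxxxxxxxx"],
    SecurityGroupIds=["sg-xxxxxxxxxxxxxxxxx"],
    PrivateDnsEnabled=True,
)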

Configure Studio and Data Wrangler

This section explains how to configure Studio and Data Wrangler.

1. Configure the security group.


a. Navigate to the Amazon EC2 menu in the AWS Console.
b. Select the Security Groups option in the Network & Security section.
c. Choose Create Security Group.
d. Provide a name and description for your security group (such as datawrangler-doc-sagemaker-
studio).
e. Create the following inbound rules.
• The HTTPS connection to the security group you provisioned for the Snowflake PrivateLink
connection you created in the Set up the Snowflake PrivateLink Integration step.


• The HTTP connection to the security group you provisioned for the Snowflake PrivateLink
connection you created in the Set up the Snowflake PrivateLink Integration step.
• The UDP and TCP for DNS (port 53) to Route 53 Resolver Inbound Endpoint security group you
create in step 2 of Configure Route 53 Resolver Inbound Endpoint for your VPC.
f. Choose Create Security Group button in the lower right hand corner.
2. Configure Studio.
• Navigate to the SageMaker menu in the AWS console.
• From the left-hand console, select the SageMaker Studio option.
• If you do not have any domains configured, the Get Started menu is present.
• Select the Standard Setup option from the Get Started menu.
• Under Authentication method, select AWS Identity and Access Management (IAM).
• From the Permissions menu, you can create a new role or use a pre-existing role, depending on your
use case.
• If you choose Create a new role, you are presented the option to provide an S3 bucket name, and
a policy is generated for you.
• If you already have a role created with permissions for the S3 buckets to which you require access,
select the role from the dropdown list. This role should have the AmazonSageMakerFullAccess
policy attached to it.
• Select the Network and Storage dropdown list to configure the VPC, security, and subnets
SageMaker uses.
• Under VPC, select the VPC in which your Snowflake PrivateLink connection exists.
• Under Subnet(s), select the subnets which have access to the Snowflake PrivateLink connection.
• Under Network Access for Studio, select VPC Only.
• Under Security Group(s), select the security group you created in step 1.
• Choose Submit.
3. Edit the SageMaker security group.
• Create the following inbound rules:
• Port 2049 to the inbound and outbound NFS Security Groups created automatically by
SageMaker in step 2 (the security group names contain the Studio domain ID).
• Access to all TCP ports to itself (required for SageMaker for VPC Only).
4. Edit the VPC Endpoint Security Groups:
• Navigate to the Amazon EC2 menu in the AWS console.
• Locate the security group you created in a preceding step.
• Add an inbound rule allowing for HTTPS traffic from the security group created in step 1.
5. Create a user profile.
• From the SageMaker Studio Control Panel, choose Add User.
• Provide a user name.
• For the Execution Role, choose to create a new role or to use a pre-existing role.
• If you choose Create a new role, you are presented the option to provide an Amazon S3 bucket
name, and a policy is generated for you.
• If you already have a role created with permissions to the Amazon S3 buckets to which
you require access, select the role from the dropdown list. This role should have the
AmazonSageMakerFullAccess policy attached to it.
• Choose Submit.
6. Create a data flow (follow the data scientist guide outlined in a preceding section).
• When adding a Snowflake connection, enter the value of privatelink-account-name (from the
Set up Snowflake PrivateLink Integration step) into the Snowflake account name (alphanumeric)
field, instead of the plain Snowflake account name. Everything else is left unchanged.

Provide information to the data scientist


Provide the data scientist with the information that they need to access Snowflake from Amazon
SageMaker Data Wrangler.

1. To allow your data scientist to access Snowflake from SageMaker Data Wrangler, provide them with
one of the following:

• For Basic Authentication, a Snowflake account name, user name, and password.
• For OAuth, a user name and password in the identity provider.
• For ARN, the Secrets Manager secret Amazon Resource Name (ARN).
• A secret created with AWS Secrets Manager and the ARN of the secret. Use the following
procedure to create the secret for Snowflake if you choose this option.
Important
If your data scientists use the Snowflake Credentials (User name and Password) option
to connect to Snowflake, you can use Secrets Manager to store the credentials in a
secret. Secrets Manager rotates secrets as part of a best practice security plan. The
secret created in Secrets Manager is only accessible with the Studio role configured
when you set up a Studio user profile. This requires you to add this permission,
secretsmanager:PutResourcePolicy, to the policy that is attached to your Studio
role.
We strongly recommend that you scope the role policy to use different roles for different
groups of Studio users. You can add additional resource-based permissions for the Secrets
Manager secrets. See Manage Secret Policy for condition keys you can use.
For information about creating a secret, see Create a secret. You're charged for the secrets
that you create.
2. Provide the data scientist with the name of the storage integration you created in Step 3: Create
a Cloud Storage Integration in Snowflake. This is the name of the new integration and is called
integration_name in the CREATE INTEGRATION SQL command you ran, which is shown in the
following snippet:

CREATE STORAGE INTEGRATION integration_name


TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'iam_role'
[ STORAGE_AWS_OBJECT_ACL = 'bucket-owner-full-control' ]
STORAGE_ALLOWED_LOCATIONS = ('s3://bucket/path/', 's3://bucket/path/')
[ STORAGE_BLOCKED_LOCATIONS = ('s3://bucket/path/', 's3://bucket/path/') ]

Data Scientist Guide


Use the following to connect Snowflake and access your data in Data Wrangler.
Important
Your administrator needs to use the information in the preceding sections to set up Snowflake.
If you're running into issues, contact them for troubleshooting help.

You must use Studio version 1.3.0 or later. Use the following procedure to open Amazon SageMaker
Studio and see which version you're running.

To open Studio and check its version, do the following.

1. Use the steps in Prerequisites (p. 983) to access Data Wrangler through Amazon SageMaker Studio.
2. Next to the user you want to use to launch Studio, select Launch app.


3. Choose Studio.
4. After Studio loads, select File, then New, and then Terminal.
5. Enter cat /opt/conda/share/jupyter/lab/staging/yarn.lock | grep -A 1 "@amzn/
sagemaker-ui-data-prep-plugin@" to print the version of your Studio instance. You must have
Studio version 1.3.0 or later to use Snowflake.

You can update Amazon SageMaker Studio from within the AWS Management Console. For more
information about updating Studio, see Amazon SageMaker Studio UI Overview (p. 129).

You can connect to Snowflake in one of the following ways:

• Specifying your Snowflake credentials (account name, user name, and password) in Data Wrangler.
• Providing an Amazon Resource Name (ARN) of a secret containing the credentials.
• Using an open standard for access delegation (OAuth) provider that connects to Snowflake. Your
administrator can give you access to one of the following OAuth providers:


• Azure AD
• Okta
• Ping Federate

Talk to your administrator about the method that you need to use to connect to Snowflake.

The following sections have information about how you can connect to Snowflake using the preceding
methods.

Specifying your Snowflake Credentials

To import a dataset into Data Wrangler from Snowflake using your credentials

1. Sign into Amazon SageMaker Console.


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. Choose Import data.
9. Under Available, choose Snowflake.
10. For Authentication method, choose Basic Username-Password.
11. Specify the following:

• Snowflake account name (alphanumeric) – The full name of the Snowflake account.
• Username – The username that you use to access the account.
• Password – The password associated with the username.
• Storage integration – Your administrator provides you with the storage integration
information. It's the configuration that specifies the IAM role that Snowflake uses to save the
query results to an Amazon S3 bucket.
• Connection name – The name that you're specifying to uniquely identify the connection.
• (Optional) KMS key ID – A KMS key that you've created. You can specify its ARN to encrypt the
output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.
12. Choose Connect.

Providing an Amazon Resource Name (ARN)

To import a dataset into Data Wrangler from Snowflake using an ARN

1. Sign into Amazon SageMaker Console.


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. Choose Import data.


9. Under Available, choose Snowflake.


10. For Authentication method, choose ARN.
11. Specify the following:

• Secrets Manager ARN – The ARN of the AWS Secrets Manager secret used to store the
credentials used to connect to Snowflake.
• Storage integration – Your administrator provides you with the storage integration
information. It's the configuration that specifies the IAM role that Snowflake uses to save the
query results to an Amazon S3 bucket.
• Connection name – The name that you're specifying to uniquely identify the connection.
• (Optional) KMS key ID – A KMS key that you've created. You can specify its ARN to encrypt the
output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.
12. Choose Connect.

Using an OAuth Connection


Important
Your administrator must customize your Studio environment to provide the functionality that
you use to make an OAuth connection. You might need to restart the Jupyter server application
to use the functionality.
Use the following procedure to update the Jupyter server application.

1. Within Studio, choose File.


2. Choose Shut down.
3. Choose Shut down server.
4. Close the tab or window that you're using to access Studio.
5. From the Amazon SageMaker console, open Studio.

To import a dataset into Data Wrangler from Snowflake using an OAuth connection

1. Sign into Amazon SageMaker Console.


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. Choose Import data.
9. Under Available, choose Snowflake.
10. For Authentication method, choose OAuth.
11. Specify the following:

• Connection name – The name that you're specifying to uniquely identify the connection.
• Snowflake account name (alphanumeric) – The full name of the Snowflake account.
• (Optional) KMS key ID – A KMS key that you've created. You can specify its ARN to encrypt the
output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.
12. Choose Connect.


You can begin the process of importing your data from Snowflake after you've connected to it.

Within Data Wrangler, you can view your data warehouses, databases, and schemas, along with the eye
icon with which you can preview your table. After you select the Preview Table icon, the schema preview
of that table is generated. You must select a warehouse before you can preview a table.
Important
If you're importing a dataset with columns of type TIMESTAMP_TZ or TIMESTAMP_LTZ, add
::string to the column names of your query. For more information, see How To: Unload
TIMESTAMP_TZ and TIMESTAMP_LTZ data to a Parquet file.

After you select a data warehouse, database, and schema, you can write queries and run them. The
output of your query shows under Query results.

After you have settled on the output of your query, you can import it into a Data Wrangler flow to
perform data transformations.

After you've queried your data, navigate to the Data flow screen to start transforming your data.

Import Data From Software as a Service (SaaS) Platforms


You can use Data Wrangler to import data from more than forty software as a service (SaaS) platforms.
To import your data from your SaaS platform, you or your administrator must use Amazon AppFlow
to transfer the data from the platform to Amazon S3 or Amazon Redshift. For more information
about Amazon AppFlow, see What is Amazon AppFlow? If you don't need to use Amazon Redshift, we
recommend transferring the data to Amazon S3 for a simpler process.

Data Wrangler supports transferring data from the following SaaS platforms:

• Amplitude
• CircleCI
• DocuSign Monitor
• Domo
• Datadog
• Dynatrace
• Facebook Ads
• Facebook Page Insights
• Google Ads
• Google Analytics 4
• Google Search Console
• GitHub
• GitLab
• Infor Nexus
• Instagram Ads
• Jira Cloud
• LinkedIn Ads
• Mailchimp
• Marketo
• Microsoft Teams
• Mixpanel


• Okta
• Salesforce
• Salesforce Marketing Cloud
• Salesforce Pardot
• SAP OData
• SendGrid
• ServiceNow
• Singular
• Slack
• Stripe
• Trend Micro
• Typeform
• Veeva
• Zendesk
• Zendesk Chat
• Zendesk Sell
• Zendesk Sunshine
• Zoom Meetings

The preceding list has links to more information about setting up your data source. You or your
administrator can refer to the preceding links after you've read the following information.

When you navigate to the Import tab of your Data Wrangler flow, you see data sources under the
following sections:

• Available
• Set up data sources

You can connect to data sources under Available without needing additional configuration. You can
choose the data source and import your data.

Data sources under Set up data sources require you or your administrator to use Amazon AppFlow to
transfer the data from the SaaS platform to Amazon S3 or Amazon Redshift. For information about
performing a transfer, see Using Amazon AppFlow to transfer your data (p. 1031).

After you perform the data transfer, the SaaS platform appears as a data source under Available. You
can choose it and import the data that you've transferred into Data Wrangler. The data that you've
transferred appears as tables that you can query.

Using Amazon AppFlow to transfer your data


Amazon AppFlow is a platform that you can use to transfer data from your SaaS platform to Amazon
S3 or Amazon Redshift without having to write any code. To perform a data transfer, you use the AWS
Management Console.
Important
You must make sure you've set up the permissions to perform a data transfer. For more
information, see Amazon AppFlow Permissions (p. 1150).

After you've added permissions, you can transfer the data. Within Amazon AppFlow, you create a flow
to transfer the data. A flow is a series of configurations. You can use it to specify whether you're running


the data transfer on a schedule or whether you're partitioning the data into separate files. After you've
configured the flow, you run it to transfer the data.

For information about creating a flow, see Creating flows in Amazon AppFlow. For information about
running a flow, see Activate an Amazon AppFlow flow.

After the data has been transferred, use the following procedure to access the data in Data Wrangler.

Important
Before you try to access your data, make sure your IAM role has the following policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "glue:SearchTables",
"Resource": [
"arn:aws:glue:*:*:table/*/*",
"arn:aws:glue:*:*:database/*",
"arn:aws:glue:*:*:catalog"
]
}
]
}

By default, the IAM role that you use to access Data Wrangler is the
SageMakerExecutionRole. For more information about adding policies, see Adding IAM
identity permissions (console).

To connect to a data source, do the following.

1. Sign into Amazon SageMaker Console.


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Data Wrangler.
8. Choose Import data.
9. Under Available, choose the data source.
10. For the Name field, specify the name of the connection.
11. (Optional) Choose Advanced configuration.

a. Choose a Workgroup.
b. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a
workgroup, specify a value for Amazon S3 location of query results.
c. (Optional) For Data retention period, select the checkbox to set a data retention period and
specify the number of days to store the data before it's deleted.
d. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the
checkbox and not save the connection.
12. Choose Connect.
13. Specify a query.


Note
To help you specify a query, you can choose a table on the left-hand navigation panel. Data
Wrangler shows the table name and a preview of the table. Choose the icon next to the
table name to copy the name. You can use the table name in the query.
14. Choose Run.
15. Choose Import query.
16. For Dataset name, specify the name of the dataset.
17. Choose Add.

When you navigate to the Import data screen, you can see the connection that you've created. You can
use the connection to import more data.

Imported Data Storage


Important
We strongly recommend that you follow the best practices around protecting your Amazon S3
bucket by following Security best practices.

When you query data from Amazon Athena or Amazon Redshift, the queried dataset is automatically
stored in Amazon S3. Data is stored in the default SageMaker S3 bucket for the AWS Region in which you
are using Studio.

Default S3 buckets have the following naming convention: sagemaker-region-account-number.


For example, if your account number is 111122223333 and you are using Studio in us-east-1, your
imported datasets are stored in sagemaker-us-east-1-111122223333.
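
As a small illustration, the following sketch derives the default bucket name for your own account and Region under this naming convention. It assumes your AWS credentials and default Region are already configured.

import boto3

session = boto3.session.Session()
account_id = session.client("sts").get_caller_identity()["Account"]
region = session.region_name

# Default SageMaker bucket name under the convention described above.
default_bucket = f"sagemaker-{region}-{account_id}"
print(default_bucket)  # for example, sagemaker-us-east-1-111122223333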

Data Wrangler flows depend on this Amazon S3 dataset location, so you should not modify this dataset
in Amazon S3 while you are using a dependent flow. If you do modify this S3 location, and you want to
continue using your data flow, you must remove all objects in trained_parameters in your .flow file.
To do this, download the .flow file from Studio and for each instance of trained_parameters, delete
all entries. When you are done, trained_parameters should be an empty JSON object:

"trained_parameters": {}

When you export and use your data flow to process your data, the .flow file you export refers to this
dataset in Amazon S3. Use the following sections to learn more.

Amazon Redshift Import Storage


Data Wrangler stores the datasets that result from your query in a Parquet file in your default SageMaker
S3 bucket.

This file is stored under the following prefix (directory): redshift/uuid/data/, where uuid is a unique
identifier that gets created for each query.

For example, if your default bucket is sagemaker-us-east-1-111122223333, a single dataset queried
from Amazon Redshift is located in s3://sagemaker-us-east-1-111122223333/redshift/uuid/data/.

Amazon Athena Import Storage


When you query an Athena database and import a dataset, Data Wrangler stores the dataset, as well as a
subset of that dataset, or preview files, in Amazon S3.

The dataset you import by selecting Import dataset is stored in Parquet format in Amazon S3.


Preview files are written in CSV format when you select Run on the Athena import screen, and contain
up to 100 rows from your queried dataset.

The dataset you query is located under the prefix (directory): athena/uuid/data/, where uuid is a
unique identifier that gets created for each query.

For example, if your default bucket is sagemaker-us-east-1-111122223333, a single dataset
queried from Athena is located in s3://sagemaker-us-east-1-111122223333/athena/uuid/
data/example_dataset.parquet.

The subset of the dataset that is stored to preview dataframes in Data Wrangler is stored under the
prefix: athena/.

Create and Use a Data Wrangler Flow


Use an Amazon SageMaker Data Wrangler flow, or a data flow, to create and modify a data preparation
pipeline. The data flow connects the datasets, transformations, and analyses, or steps, you create and can
be used to define your pipeline.

Instances
When you create a Data Wrangler flow in Amazon SageMaker Studio, Data Wrangler uses an Amazon
EC2 instance to run the analyses and transformations in your flow. By default, Data Wrangler uses
the m5.4xlarge instance. m5 instances are general purpose instances that provide a balance between
compute and memory. You can use m5 instances for a variety of compute workloads.

Data Wrangler also gives you the option of using r5 instances. r5 instances are designed to deliver fast
performance that processes large datasets in memory.

We recommend that you choose an instance that is best optimized around your workloads. For example,
the r5.8xlarge might have a higher price than the m5.4xlarge, but the r5.8xlarge might be better
optimized for your workloads. With better optimized instances, you can run your data flows in less time
at lower cost.

The following table shows the instances that you can use to run your Data Wrangler flow.

Standard Instances      vCPU    Memory

ml.m5.4xlarge           16      64 GiB
ml.m5.8xlarge           32      128 GiB
ml.m5.16xlarge          64      256 GiB
ml.m5.24xlarge          96      384 GiB
r5.4xlarge              16      128 GiB
r5.8xlarge              32      256 GiB
r5.24xlarge             96      768 GiB

For more information about r5 instances, see Amazon EC2 R5 Instances. For more information about m5
instances, see Amazon EC2 M5 Instances.

Each Data Wrangler flow has an Amazon EC2 instance associated with it. You might have multiple flows
that are associated with a single instance.


For each flow file, you can seamlessly switch the instance type. If you switch the instance type, the
instance that you used to run the flow continues to run.

To switch the instance type of your flow, do the following.

1. Choose the home icon.
2. Navigate to the instance that you're using and choose it.
3. Choose the instance type that you want to use.

4. Choose Save.

You are charged for all running instances. To avoid incurring additional charges, shut down the instances
that you aren't using manually. To shut down an instance that is running, use the following procedure.

To shut down a running instance.

1. Choose the instance icon. The following image shows you where to select the RUNNING INSTANCES
icon.

2. Choose Shut down next to the instance that you want to shut down.


If you shut down an instance used to run a flow, you temporarily can't access the flow. If you get an error
while attempting to open a flow that was running on an instance you previously shut down, wait for
about 5 minutes and try opening it again.

When you export your data flow to a location such as Amazon Simple Storage Service or Amazon
SageMaker Feature Store, Data Wrangler runs an Amazon SageMaker processing job. You can use one
of the following instances for the processing job. For more information on exporting your data, see
Export (p. 1116).

Standard Instances      vCPU    Memory

ml.m5.4xlarge           16      64 GiB
ml.m5.12xlarge          48      192 GiB
ml.m5.24xlarge          96      384 GiB

For more information about the cost per hour for using the available instance types, see SageMaker
Pricing.

The Data Flow UI


When you import a dataset, the original dataset appears on the data flow and is named Source. If
you turned on sampling when you imported your data, this dataset is named Source - sampled. Data
Wrangler automatically infers the types of each column in your dataset and creates a new dataframe
named Data types. You can select this frame to update the inferred data types. You see results similar to
those shown in the following image after you upload a single dataset:

Each time you add a transform step, you create a new dataframe. When multiple transform steps (other
than Join or Concatenate) are added to the same dataset, they are stacked.

Join and Concatenate create standalone steps that contain the new joined or concatenated dataset.

The following diagram shows a data flow with a join between two datasets, as well as two stacks of
steps. The first stack (Steps (2)) adds two transforms to the type inferred in the Data types dataset. The
downstream stack, or the stack to the right, adds transforms to the dataset resulting from a join named
demo-join.


The small, gray box in the bottom right corner of the data flow provides an overview of the number of
stacks and steps in the flow and the layout of the flow. The lighter box inside the gray box indicates the
steps that are within the UI view. You can use this box to see sections of your data flow that fall outside
of the UI view. Use the fit screen icon to fit all steps and datasets into your UI view.

The bottom left navigation bar includes icons that you can use to zoom in and out of your data flow and
to resize the data flow to fit the screen. Use the lock icon to lock and unlock the location of each step on
the screen.


Add a Step to Your Data Flow


Select + next to any dataset or previously added step and then select one of the following options:

• Edit data types (For a Data types step only): If you have not added any transforms to a Data types
step, you can select Edit data types to update the data types Data Wrangler inferred when importing
your dataset.
• Add transform: Adds a new transform step. See Transform Data (p. 1058) to learn more about the
data transformations you can add.
• Add analysis: Adds an analysis. You can use this option to analyze your data at any point in the data
flow. When you add one or more analyses to a step, an analysis icon appears on that step. See
Analyze and Visualize (p. 1101) to learn more about the analyses you can add.
• Join: Joins two datasets and adds the resulting dataset to the data flow. To learn more, see Join
Datasets (p. 1064).
• Concatenate: Concatenates two datasets and adds the resulting dataset to the data flow. To learn
more, see Concatenate Datasets (p. 1064).

Delete a Step from Your Data Flow


To delete a step, select the step and select Delete. If the node has a single input, you delete only the
step that you select. Deleting a step that has a single input doesn't delete the steps that follow it. If
you're deleting a step for a source, join, or concatenate node, all the steps that follow it are also deleted.

To delete a step from a stack of steps, select the stack and then select the step you want to delete.

You can use one of the following procedures to delete a step without deleting the downstream steps.

Delete a step in the Data Wrangler flow

You can delete an individual step for nodes in your data flow that have a single input. You can't
delete individual steps for source, join, and concatenate nodes.

Use the following procedure to delete a step in the Data Wrangler flow.

1. Choose the group of steps that has the step that you're deleting.
2. Choose the icon next to the step.
3. Choose Delete step.



Delete a step in the table view

Use the following procedure to delete a step in the table view.

You can delete an individual step for nodes in your data flow that have a single input. You can't
delete individual steps for source, join, and concatenate nodes.

1. Choose the step and open the table view for the step.
2. Move your cursor over the step so the ellipsis icon appears.
3. Choose the icon next to the step.
4. Choose Delete.



Edit a Step in Your Data Wrangler Flow


You can edit each step that you've added in your Data Wrangler flow. By editing steps, you can change
the transformations or the data types of the columns. You can edit the steps to make changes that help
you perform better analyses.

There are many ways that you can edit a step. Some examples include changing the imputation method
or changing the threshold for considering a value to be an outlier.

Use the following procedure to edit a step.

To edit a step, do the following.

1. Choose a step in the Data Wrangler flow to open the table view.


2. Choose a step in the data flow.
3. Edit the step.

The following image shows an example of editing a step.


Note
You can use the shared spaces within your Amazon SageMaker Domain to work collaboratively
on your Data Wrangler flows. Within a shared space, you and your collaborators can edit a flow
file in real time. However, neither you nor your collaborators can see each other's changes in real
time. When anyone makes a change to the Data Wrangler flow, they must save it immediately. When
someone saves the file, a collaborator can't see the change unless they close the file and reopen it.
Any changes that aren't saved by one person are overwritten by the person who saved their changes.

Get Insights On Data and Data Quality
Use the Data Quality and Insights Report to perform an analysis of the data that you've imported into
Data Wrangler. We recommend that you create the report after you import your dataset. You can use the
report to help you clean and process your data. It gives you information such as the number of missing
values and the number of outliers. If you have issues with your data, such as target leakage or imbalance,
the insights report can bring those issues to your attention.
Note
If you've sampled the data that you've imported, Data Wrangler creates the report from the
sampled data. For information about turning off sampling, see Import (p. 991).

The following topics show the sections of the report:

Topics
• Summary (p. 1045)
• Target column (p. 1047)
• Quick model (p. 1050)
• Feature summary (p. 1052)
• Samples (p. 1054)
• Definitions (p. 1055)

You can either download the report or view it online. To download the report, choose the download
button at the top right corner of the screen. The following image shows the button.

Summary
The insights report has a brief summary of the data that includes general information such as missing
values, invalid values, feature types, outlier counts, and more. It can also include high severity warnings
that point to probable issues with the data. We recommend that you investigate the warnings.

The following is an example of a report summary.


Target column
When you create the data quality and insights report, Data Wrangler gives you the option to select a
target column. A target column is a column that you're trying to predict. When you choose a target
column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the
order of their predictive power. When you select a target column, you must specify whether you’re trying
to solve a regression or a classification problem.

For classification, Data Wrangler shows a table and a histogram of the most common classes. A class is a
category. It also presents observations, or rows, with a missing or invalid target value.

The following image shows an example target column analysis for a classification problem.


For regression, Data Wrangler shows a histogram of all the values in the target column. It also presents
observations, or rows, with a missing, invalid, or outlier target value.

The following image shows an example target column analysis for a regression problem.


Quick model
The Quick model provides an estimate of the expected prediction quality of a model that you train on
your data.

Data Wrangler splits your data into training and validation folds. It uses 80% of the samples for training
and 20% of the samples for validation. For classification, the split is stratified. With a stratified split,
each data partition has the same ratio of labels. For classification problems, it's important to have the
same ratio of labels between the training and validation folds. Data Wrangler trains the XGBoost
model with the default hyperparameters. It applies early stopping on the validation data and performs
minimal feature preprocessing.
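
The following is a minimal sketch of this kind of quick-model evaluation using scikit-learn and XGBoost
(assuming xgboost >= 1.6). The exact hyperparameters and preprocessing that Data Wrangler applies are
internal, so treat the dataframe name df and the target column name as assumptions.

# A rough approximation of the quick model procedure, assuming a pandas
# DataFrame `df` with a classification target column named "target".
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = df.drop(columns=["target"]), df["target"]

# 80/20 split, stratified so both folds keep the same label ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Default hyperparameters with early stopping on the validation fold.
model = XGBClassifier(early_stopping_rounds=10, eval_metric="logloss")
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.score(X_val, y_val))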

For classification models, Data Wrangler returns both a model summary and a confusion matrix.

The following is an example of a classification model summary. To learn more about the information that
it returns, see Definitions (p. 1055).

The following is an example of a confusion matrix that the quick model returns.


A confusion matrix gives you the following information:

• The number of times the predicted label matches the true label.
• The number of times the predicted label doesn't match the true label.

The true label represents an actual observation in your data. For example, if you're using a model to
detect fraudulent transactions, the true label represents a transaction that is actually fraudulent or non-
fraudulent. The predicted label represents the label that your model assigns to the data.

You can use the confusion matrix to see how well the model predicts the presence or the absence of a
condition. If you're predicting fraudulent transactions, you can use the confusion matrix to get a sense of
both the sensitivity and the specificity of the model. The sensitivity refers to the model's ability to detect
fraudulent transactions. The specificity refers to the model's ability to avoid detecting non-fraudulent
transactions as fraudulent.

The following is an example of the quick model outputs for a regression problem.


Feature summary
When you specify a target column, Data Wrangler orders the features by their prediction power.
Prediction power is measured on the data after it was split into 80% training and 20% validation folds.
Data Wrangler fits a model for each feature separately on the training fold. It applies minimal feature
preprocessing and measures prediction performance on the validation data.

It normalizes the scores to the range [0,1]. Higher prediction scores indicate columns that are more
useful for predicting the target on their own. Lower scores point to columns that aren’t predictive of the
target column.
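
As a hedged illustration of this per-feature scoring, assuming a pandas DataFrame df with a numeric
target column named "target" and numeric feature columns without missing values, you might fit a small
model on each column alone and score it on the validation fold:

# Hypothetical per-feature prediction power, assuming a numeric pandas
# DataFrame `df` with a target column named "target".
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

train, val = train_test_split(df, test_size=0.2, random_state=0)
scores = {}
for col in df.columns.drop("target"):
    model = DecisionTreeRegressor(max_depth=4)
    model.fit(train[[col]], train["target"])
    # R^2 on the validation fold, clipped to [0, 1] as a rough analogue
    # of the normalized prediction power score.
    scores[col] = max(0.0, model.score(val[[col]], val["target"]))
print(sorted(scores.items(), key=lambda kv: -kv[1]))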

It’s uncommon for a column that isn’t predictive on its own to be predictive when it’s used in tandem
with other columns. You can confidently use the prediction scores to determine whether a feature in your
dataset is predictive.

A low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities,
which often indicates target leakage. Target leakage usually happens when the dataset contains a
column that isn’t available at the prediction time. For example, it could be a duplicate of the target
column.

The following are examples of the table and the histogram that show the prediction value of each
feature.


Samples
Data Wrangler provides information about whether your samples are anomalous or if there are
duplicates in your dataset.

Data Wrangler detects anomalous samples using the isolation forest algorithm. The isolation forest
associates an anomaly score with each sample (row) of the dataset. Low anomaly scores indicate
anomalous samples. High scores are associated with non-anomalous samples. Samples with a negative
anomaly score are usually considered anomalous and samples with positive anomaly score are
considered non-anomalous.
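
The following sketch reproduces that scoring convention with scikit-learn's isolation forest; the numeric
DataFrame df is an assumption, and Data Wrangler's own implementation may differ.

# Anomaly scoring with an isolation forest, assuming a numeric pandas
# DataFrame `df` with no missing values.
from sklearn.ensemble import IsolationForest

forest = IsolationForest(random_state=0).fit(df)

# decision_function returns negative values for anomalous samples and
# positive values for non-anomalous samples, matching the text above.
df["anomaly_score"] = forest.decision_function(df)
print(df.sort_values("anomaly_score").head())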

When you look at a sample that might be anomalous, we recommend that you pay attention to
unusual values. For example, you might have anomalous values that result from errors in gathering and
processing the data. The following is an example of the most anomalous samples according to the Data
Wrangler’s implementation of the isolation forest algorithm. We recommend using domain knowledge
and business logic when you examine the anomalous samples.

Data Wrangler detects duplicate rows and calculates the ratio of duplicate rows in your data. Some data
sources could include valid duplicates. Other data sources could have duplicates that point to problems
in data collection. Duplicate samples that result from faulty data collection could interfere with machine
learning processes that rely on splitting the data into independent training and validation folds.

The following are elements of the insights report that can be impacted by duplicated samples:

• Quick model
• Prediction power estimation
• Automatic hyperparameter tuning

You can remove duplicate samples from the dataset using the Drop duplicates transform under Manage
rows. Data Wrangler shows you the most frequently duplicated rows.
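
In plain pandas, a comparable check might look like the following; the DataFrame name df is an
assumption.

# Duplicate-row ratio and the most frequently duplicated rows, assuming
# a pandas DataFrame `df`.
dup_ratio = df.duplicated().mean()
print(f"duplicate row ratio: {dup_ratio:.2%}")

# Count each fully duplicated row and show the most frequent ones.
counts = df.groupby(list(df.columns)).size()
print(counts[counts > 1].sort_values(ascending=False).head())

# The equivalent of the Drop duplicates transform.
df = df.drop_duplicates()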

Definitions
The following are definitions for the technical terms that are used in the data insights report.

Feature types

The following are the definitions for each of the feature types:

• Numeric – Numeric values can be either floats or integers, such as age or income. The machine
learning models assume that numeric values are ordered and a distance is defined over them. For
example, 3 is closer to 4 than to 10 and 3 < 4 < 10.
• Categorical – The column entries belong to a set of unique values, which is usually much smaller
than the number of entries in the column. For example, a column of length 100 could contain
the unique values Dog, Cat, and Mouse. The values could be numeric, text, or a combination of
both. Horse, House, 8, Love, and 3.1 would all be valid values and could be found in the same
categorical column. The machine learning model does not assume order or distance on the values
of categorical features, as opposed to numeric features, even when all the values are numbers.
• Binary – Binary features are a special categorical feature type in which the cardinality of the set of
unique values is 2.
• Text – A text column contains many non-numeric unique values. In extreme cases, all the elements
of the column are unique, so that no two entries are the same.
• Datetime – A datetime column contains information about the date or time. It can have
information about both the date and time.

Feature statistics

The following are definitions for each of the feature statistics:

• Prediction power – Prediction power measures how useful the column is in predicting the target.
• Outliers (in numeric columns) – Data Wrangler detects outliers using two statistics that are robust
to outliers: the median and the robust standard deviation (RSTD). RSTD is derived by clipping the
feature values to the range [5th percentile, 95th percentile] and calculating the standard deviation of
the clipped vector. All values larger than median + 5 * RSTD or smaller than median - 5 * RSTD are
considered to be outliers; a sketch of this calculation follows these definitions.
• Skew (in numeric columns) – Skew measures the symmetry of the distribution and is defined as
the third moment of the distribution divided by the third power of the standard deviation. The
skewness of the normal distribution or any other symmetric distribution is zero. Positive values
imply that the right tail of the distribution is longer than the left tail. Negative values imply that
the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is
considered skewed when the absolute value of the skew is larger than 3.
• Kurtosis (in numeric columns) – Pearson's kurtosis measures the heaviness of the tail of the
distribution. It's defined as the fourth moment of the distribution divided by the square of the
second moment. The kurtosis of the normal distribution is 3. Kurtosis values lower than 3 imply
that the distribution is concentrated around the mean and the tails are lighter than the tails of the
normal distribution. Kurtosis values higher than 3 imply heavier tails or outliers.
• Missing values – Null-like objects, empty strings and strings composed of only white spaces are
considered missing.


• Valid values for numeric features or regression target – All values that you can cast to finite
floats are valid. Missing values are not valid.
• Valid values for categorical, binary, or text features, or for classification target – All values that
are not missing are valid.
• Datetime features – All values that you can cast to a datetime object are valid. Missing values are
not valid.
• Invalid values – Values that are either missing or that you can't properly cast. For example, in a
numeric column, you can't cast the string "six" or a null value.
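
The outlier rule above can be sketched in a few lines of pandas; the sample values are hypothetical.

# Outlier detection with the median and robust standard deviation (RSTD).
import pandas as pd

col = pd.Series(list(range(100)) + [10_000])  # hypothetical numeric column

# Clip to the [5th, 95th] percentile range, then take the standard deviation.
clipped = col.clip(col.quantile(0.05), col.quantile(0.95))
median, rstd = col.median(), clipped.std()

# Values beyond median +/- 5 * RSTD are flagged as outliers.
outliers = col[(col > median + 5 * rstd) | (col < median - 5 * rstd)]
print(outliers)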

Quick model metrics for regression

The following are the definitions for the quick model metrics:

• R2 (or coefficient of determination) – R2 is the proportion of the variation in the target that is
predicted by the model. R2 is in the range of [-infty, 1]. 1 is the score of the model that predicts
the target perfectly and 0 is the score of the trivial model that always predicts the target mean.
• MSE or mean squared error – MSE is in the range [0, infty]. 0 is the score of the model that
predicts the target perfectly.
• MAE or mean absolute error – MAE is in the range [0, infty] where 0 is the score of the model that
predicts the target perfectly.
• RMSE or root mean square error – RMSE is in the range [0, infty] where 0 is the score of the model
that predicts the target perfectly.
• Max error – The maximum absolute value of the error over the dataset. Max error is in the range
[0, infty]. 0 is the score of the model that predicts the target perfectly.
• Median absolute error – Median absolute error is in the range [0, infty]. 0 is the score of the model
that predicts the target perfectly.

Quick model metrics for classification

The following are the definitions for the quick model metrics:

• Accuracy – Accuracy is the ratio of samples that are predicted accurately. Accuracy is in the range
[0, 1]. 0 is the score of the model that predicts all samples incorrectly and 1 is the score of the
perfect model.
• Balanced accuracy – Balanced accuracy is the ratio of samples that are predicted accurately when
the class weights are adjusted to balance the data. All classes are given the same importance,
regardless of their frequency. Balanced accuracy is in the range [0, 1]. 0 is the score of the model
that predicts all samples wrong. 1 is the score of the perfect model.
• AUC (binary classification) – This is the area under the receiver operating characteristic curve.
AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model
returns a score of 1.
• AUC (OVR) – For multiclass classification, this is the area under the receiver operating
characteristic curve calculated separately for each label using one versus rest. Data Wrangler
reports the average of the areas. AUC is in the range [0, 1] where a random model returns a score
of 0.5 and the perfect model returns a score of 1.
• Precision – Precision is defined for a specific class. Precision is the fraction of true positives out
of all the instances that the model classified as that class. Precision is in the range [0, 1]. 1 is the
score of the model that has no false positives for the class. For binary classification, Data Wrangler
reports the precision of the positive class.
• Recall – Recall is defined for a specific class. Recall is the fraction of the relevant class instances
that are successfully retrieved. Recall is in the range [0, 1]. 1 is the score of the model that
classifies all the instances of the class correctly. For binary classification, Data Wrangler reports the
recall of the positive class.


• F1 – F1 is defined for a specific class. It's the harmonic mean of the precision and recall. F1 is in the
range [0, 1]. 1 is the score of the perfect model. For binary classification, Data Wrangler reports
the F1 of the positive class.

Textual patterns

Patterns describe the textual format of a string using an easy-to-read format. The following are
examples of textual patterns:

• "{digits:4-7}" describes a sequence of digits that have a length between 4 and 7.
• "{alnum:5}" describes an alpha-numeric string with a length of exactly 5.

Data Wrangler infers the patterns by looking at samples of non-empty strings from your data. It can
describe many of the commonly used patterns. The confidence expressed as a percentage indicates
how much of the data is estimated to match the pattern. Using the textual pattern, you can see
which rows in your data you need to correct or drop.

The following describes the patterns that Data Wrangler can recognize:

Pattern Textual Format

{alnum} Alphanumeric strings

{any} Any string of word characters

{digits} A sequence of digits

{lower} A lowercase word

{mixed} A mixed-case word

{name} A word beginning with a capital letter

{upper} An uppercase word

{whitespace} Whitespace characters

A word character is either an underscore or a character that might appear in a word in any language.
For example, the strings 'Hello_word' and 'écoute' both consist of word characters. 'H' and 'é' are
both examples of word characters.

Automatically Train Models on Your Data Flow
You can use Amazon SageMaker Autopilot to automatically train, tune, and deploy models on the
data that you've transformed in your data flow. Amazon SageMaker Autopilot can go through several
algorithms and use the one that works best with your data. For more information about Amazon
SageMaker Autopilot, see Automate model development with Amazon SageMaker Autopilot (p. 467).

When you train and tune a model, Data Wrangler exports your data to an Amazon S3 location where
Amazon SageMaker Autopilot can access it.

You can prepare and deploy a model by choosing a node in your Data Wrangler flow and choosing
Export and Train in the data preview. You can use this method to view your dataset before you choose to
train a model on it.

You can also train and deploy a model directly from your data flow.


The following procedure prepares and deploys a model from the data flow. For Data Wrangler flows
with multi-row transforms, you can't use the transforms from the Data Wrangler flow when you're
deploying the model. You can use the following procedure to process the data before you use it to
perform inference.

To train and deploy a model directly from your data flow, do the following.

1. Choose the + next to the node containing the training data.
2. Choose Train model.
3. (Optional) Specify an AWS KMS key or ID. For more information about creating and controlling
cryptographic keys to protect your data, see AWS Key Management Service.
4. Choose Export and train.
5. After Amazon SageMaker Autopilot trains the model on the data that Data Wrangler exported,
specify a name for Experiment name.
6. Under Input data, choose Preview to verify that Data Wrangler properly exported your data to
Amazon SageMaker Autopilot.
7. For Target, choose the target column.
8. (Optional) For S3 location under Output data, specify an Amazon S3 location other than the default
location.
9. Choose Next: Training method.
10. Choose a training method. For more information, see Training modes (p. 476).
11. (Optional) For Auto deploy endpoint, specify a name for the endpoint.
12. For Deployment option, choose a deployment method. You can choose to deploy with or without
the transformations that you've made to your data.
Important
You can't deploy an Amazon SageMaker Autopilot model with the transformations that
you've made in your Data Wrangler flow if you've used the following transformations:

• Join
• Concatenate
• Group by

You can export the dataset from a join to Amazon S3. You can create a new flow using the
dataset that you've exported. You can use the dataset to train and deploy a model.
13. Choose Next: Review and create.
14. Choose Create experiment.

For more information about model training and deployment, see Create an Amazon SageMaker
Autopilot experiment (p. 470). Autopilot shows you analyses about the best model's performance. For
more information about model performance, see View an Autopilot Model Performance Report (p. 498).

Transform Data
Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning,
transforming, and featurizing your data. When you add a transform, it adds a step to the data flow. Each
transform you add modifies your dataset and produces a new dataframe. All subsequent transforms
apply to the resulting dataframe.

Data Wrangler includes built-in transforms, which you can use to transform columns without any code.
You can also add custom transformations using PySpark, Python (User-Defined Function), pandas, and
PySpark SQL. Some transforms operate in place, while others create a new output column in your
dataset.

You can apply transforms to multiple columns at once. For example, you can delete multiple columns in
a single step.

You can apply the Process numeric and Handle missing transforms only to a single column.

Use this page to learn more about these built-in and custom transforms.

Transform UI
Most of the built-in transforms are located in the Prepare tab of the Data Wrangler UI. You can access
the join and concatenate transforms through the data flow view. Use the following table to preview
these two views.

Transform

You can add a transform to any step in your data flow. Use the following procedure to add a
transform to your data flow.

To add a step to your data flow, do the following.

1. Choose the + next to the step in the data flow.
2. Choose Add transform.
3. Choose Add step.

4. Choose a transform.
5. (Optional) You can search for the transform that you want to use. Data Wrangler highlights the
query in the results.


Join View

To join two datasets, select the first dataset in your data flow and choose Join. When you choose
Join, you see results similar to those shown in the following image. Your left and right datasets are
displayed in the left panel. The main panel displays your data flow, with the newly joined dataset
added.


When you choose Configure to configure your join, you see results similar to those shown in the
following image. Your join configuration is displayed in the left panel. You can use this panel to
choose the joined dataset name, join type, and columns to join. The main panel displays three tables.
The top two tables display the left and right datasets on the left and right respectively. Under these
tables, you can preview the joined dataset.

See Join Datasets (p. 1064) to learn more.
Concatenate View

To concatenate two datasets, you select the first dataset in your data flow and choose Concatenate.
When you select Concatenate, you see results similar to those shown in the following image. Your
left and right datasets are displayed in the left panel. The main panel displays your data flow, with
the newly concatenated dataset added.

When you choose Configure to configure your concatenation, you see results similar to those
shown in the following image. Your concatenate configuration displays in the left panel. You can
use this panel to choose the concatenated dataset's name, and choose to remove duplicates after
concatenation and add columns to indicate the source dataframe. The main panel displays three
tables. The top two tables display the left and right datasets on the left and right respectively. Under
these tables, you can preview the concatenated dataset.

See Concatenate Datasets (p. 1064) to learn more.


Join Datasets
You join dataframes directly in your data flow. When you join two datasets, the resulting joined dataset
appears in your flow. The following join types are supported by Data Wrangler.

• Left Outer – Include all rows from the left table. If the value for the column joined on a left table row
does not match any right table row values, that row contains null values for all right table columns in
the joined table.
• Left Anti – Include rows from the left table that do not contain values in the right table for the joined
column.
• Left semi – Include a single row from the left table for all identical rows that satisfy the criteria in the
join statement. This excludes duplicate rows from the left table that match the criteria of the join.
• Right Outer – Include all rows from the right table. If the value for the joined column in a right table
row does not match any left table row values, that row contains null values for all left table columns in
the joined table.
• Inner – Include rows from left and right tables that contain matching values in the joined column.
• Full Outer – Include all rows from the left and right tables. If the row value for the joined column in
either table does not match, separate rows are created in the joined table. If a row doesn’t contain a
value for a column in the joined table, null is inserted for that column.
• Cartesian Cross – Include rows which combine each row from the first table with each row from the
second table. This is a Cartesian product of rows from tables in the join. The result of this product is
the size of the left table times the size of the right table. Therefore, we recommend caution in using
this join between very large datasets.
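
Data Wrangler flows run on Spark, so as a hedged illustration, the preceding join types map to the
PySpark DataFrame API roughly as follows; the DataFrames left and right and the key column
customer_id are assumptions.

# Hypothetical Spark DataFrames `left` and `right`, joined on "customer_id".
joined = left.join(right, on="customer_id", how="left")        # Left Outer
anti = left.join(right, on="customer_id", how="left_anti")     # Left Anti
semi = left.join(right, on="customer_id", how="left_semi")     # Left Semi
right_outer = left.join(right, on="customer_id", how="right")  # Right Outer
inner = left.join(right, on="customer_id", how="inner")        # Inner
full = left.join(right, on="customer_id", how="full")          # Full Outer
cross = left.crossJoin(right)                                  # Cartesian Cross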

Use the following procedure to join two dataframes.

1. Select + next to the left dataframe that you want to join. The first dataframe you select is always the
left table in your join.
2. Choose Join.
3. Select the right dataframe. The second dataframe you select is always the right table in your join.
4. Choose Configure to configure your join.
5. Give your joined dataset a name using the Name field.
6. Select a Join type.
7. Select a column from the left and right tables to join.
8. Choose Apply to preview the joined dataset on the right.
9. To add the joined table to your data flow, choose Add.

Concatenate Datasets
Concatenate two datasets:

1. Choose + next to the left dataframe that you want to concatenate. The first dataframe you select is
always the left table in your concatenate.
2. Choose Concatenate.
3. Select the right dataframe. The second dataframe you select is always the right table in your
concatenate.
4. Choose Configure to configure your concatenate.
5. Give your concatenated dataset a name using the Name field.

6. (Optional) Select the checkbox next to Remove duplicates after concatenation to remove duplicate
rows.
7. (Optional) Select the checkbox next to Add column to indicate source dataframe if, for each
column in the new dataset, you want to add an indicator of the column's source.
8. Choose Apply to preview the new dataset.
9. Choose Add to add the new dataset to your data flow.

Balance Data
You can balance the data for datasets with an underrepresented category. Balancing a dataset can help
you create better models for binary classification.
Note
You can't balance datasets containing column vectors.

You can use the Balance data operation to balance your data using one of the following operators:

• Random oversampling – Randomly duplicates samples in the minority category. For example, if
you're trying to detect fraud, you might only have cases of fraud in 10% of your data. For an equal
proportion of fraudulent and non-fraudulent cases, this operator randomly duplicates fraud cases
within the dataset 8 times.
• Random undersampling – Roughly equivalent to random oversampling. Randomly removes samples
from the overrepresented category to get the proportion of samples that you desire.
• Synthetic Minority Oversampling Technique (SMOTE) – Uses samples from the underrepresented
category to interpolate new synthetic minority samples. For more information about SMOTE, see the
following description.

You can use SMOTE on datasets containing both numeric and non-numeric features. SMOTE
interpolates values by using neighboring samples. Data Wrangler uses the R-squared distance to
determine the neighborhood to interpolate the additional samples. Data Wrangler only uses numeric
features to calculate the distances between samples in the underrepresented group.

For two real samples in the underrepresented group, Data Wrangler interpolates the numeric features
by using a weighted average. It randomly assigns weights to those samples in the range of [0, 1].
For samples A and B, Data Wrangler could randomly assign a weight of 0.7 to A and 0.3 to B. The
interpolated sample has a value of 0.7A + 0.3B.

Data Wrangler interpolates non-numeric features by copying them from either of the real samples used
for interpolation. It copies the samples with a probability that it randomly assigns to each sample. For
samples A and B, it can assign probabilities 0.8 to A and 0.2 to B. In that case, it copies A 80% of the
time.
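
The following toy sketch illustrates that interpolation scheme; the feature values are made up.

# A toy illustration of SMOTE-style interpolation between two minority
# samples A and B with numeric and non-numeric parts.
import numpy as np

rng = np.random.default_rng(0)

a_num, b_num = np.array([1.0, 10.0]), np.array([3.0, 30.0])
a_cat, b_cat = "red", "blue"

w = rng.random()                       # random weight in [0, 1]
new_num = w * a_num + (1 - w) * b_num  # weighted average of numeric features
new_cat = a_cat if rng.random() < w else b_cat  # copy non-numeric features
print(new_num, new_cat)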

Custom Transforms
The Custom Transforms group allows you to use Python (User-Defined Function), PySpark, pandas,
or PySpark (SQL) to define custom transformations. For each option, you use the variable df to
access the dataframe to which you want to apply the transform. To apply your custom code to your
dataframe, assign the dataframe with the transformations that you've made to the df variable. If you're
not using Python (User-Defined Function), you don't need to include a return statement. Choose Preview
to preview the result of the custom transform. Choose Add to add the custom transform to your list of
Previous steps.

You can import the popular libraries with an import statement in the custom transform code block,
such as the following:


• NumPy version 1.19.0
• scikit-learn version 0.23.2
• SciPy version 1.5.4
• pandas version 1.0.3
• PySpark version 3.0.0

Important
Custom transform doesn't support columns with spaces or special characters in the name.
We recommend that you specify column names that only have alphanumeric characters and
underscores. You can use the Rename column transform in the Manage columns transform
group to remove spaces from a column's name. You can also add a Python (Pandas) Custom
transform similar to the following to remove spaces from multiple columns in a single step.
This example changes columns named A column and B column to A_column and B_column
respectively.

df = df.rename(columns={"A column": "A_column", "B column": "B_column"})

If you include print statements in the code block, the result appears when you select Preview. You can
resize the custom code transformer panel. Resizing the panel provides more space to write code. The
following image shows the resizing of the panel.


The following sections provide additional context and examples for writing custom transform code.

Python (User-Defined Function)

The Python function gives you the ability to write custom transformations without needing to know
Apache Spark or pandas. Data Wrangler is optimized to run your custom code quickly. You get similar
performance using custom Python code and an Apache Spark plugin.

To use the Python (User-Defined Function) code block, you specify the following:

• Input column – The input column where you're applying the transform.
• Mode – The scripting mode, either pandas or Python.
• Return type – The data type of the value that you're returning.

Using the pandas mode gives better performance. The Python mode makes it easier for you to write
transformations by using pure Python functions.

The following video shows an example of how to use custom code to create a transformation. It uses the
Titanic dataset to create a column with the person's salutation.
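
In plain pandas, that kind of transformation might look like the following sketch; the column name Name
and the regular expression are assumptions rather than the exact code from the video.

# Hypothetical pandas-mode transformation: extract a salutation such as
# "Mr" or "Mrs" from a Titanic-style "Name" column.
import pandas as pd

df = pd.DataFrame({"Name": ["Braund, Mr. Owen", "Cumings, Mrs. John"]})
df["Salutation"] = df["Name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)
print(df)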


PySpark

The following example extracts date and time from a timestamp.

from pyspark.sql.functions import from_unixtime, to_date, date_format

df = df.withColumn('DATE_TIME', from_unixtime('TIMESTAMP'))
df = df.withColumn('EVENT_DATE', to_date('DATE_TIME')).withColumn(
    'EVENT_TIME', date_format('DATE_TIME', 'HH:mm:ss'))

pandas

The following example provides an overview of the dataframe to which you are adding transforms.

df.info()

PySpark (SQL)

The following example creates a new dataframe with four columns: name, fare, pclass, survived.

SELECT name, fare, pclass, survived FROM df

If you don’t know how to use PySpark, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of code snippets. You can use the code snippets to perform
tasks such as dropping columns, grouping by columns, or modeling.

To use a code snippet, choose Search example snippets and specify a query in the search bar. The text
you specify in the query doesn’t have to match the name of the code snippet exactly.

The following example shows a Drop duplicate rows code snippet that can delete rows with similar data
in your dataset. You can find the code snippet by searching for one of the following:

• Duplicates
• Identical
• Remove

The following snippet has comments to help you understand the changes that you need to make. For
most snippets, you must specify the column names of your dataset in the code.

# Specify the subset of columns;
# all rows having identical values in these columns will be dropped.
subset = ["col1", "col2", "col3"]
df = df.dropDuplicates(subset)

# To drop the fully duplicated rows, run:
# df = df.dropDuplicates()

To use a snippet, copy and paste its content into the Custom transform field. You can copy and paste
multiple code snippets into the custom transform field.

Custom Formula
Use Custom formula to define a new column using a Spark SQL expression to query data in the current
dataframe. The query must use the conventions of Spark SQL expressions.


Important
Custom formula doesn't support columns with spaces or special characters in the name.
We recommend that you specify column names that only have alphanumeric characters and
underscores. You can use the Rename column transform in the Manage columns transform
group to remove spaces from a column's name. You can also add a Python (Pandas) Custom
transform similar to the following to remove spaces from multiple columns in a single step.
This example changes columns named A column and B column to A_column and B_column
respectively.

df = df.rename(columns={"A column": "A_column", "B column": "B_column"})

You can use this transform to perform operations on columns, referencing the columns by name. For
example, assuming the current dataframe contains columns named col_a and col_b, you can use the
following operation to produce an Output column that is the product of these two columns with the
following code:

col_a * col_b

Other common operations include the following, assuming a dataframe contains col_a and col_b
columns:

• Concatenate two columns: concat(col_a, col_b)
• Add two columns: col_a + col_b
• Subtract two columns: col_a - col_b
• Divide two columns: col_a / col_b
• Take the absolute value of a column: abs(col_a)

For more information, see the Spark documentation on selecting data.

Reduce Dimensionality within a Dataset
Reduce the dimensionality in your data by using Principal Component Analysis (PCA). The dimensionality
of your dataset corresponds to the number of features. When you use dimensionality reduction in
Data Wrangler, you get a new set of features called components. Each component accounts for some
variability in the data.

The first component accounts for the largest amount of variation in the data. The second component
accounts for the second largest amount of variation in the data, and so on.

You can use dimensionality reduction to reduce the size of the datasets that you use to train models.
Instead of using the features in your dataset, you can use the principal components.

To perform PCA, Data Wrangler creates axes for your data. An axis is an affine combination of columns
in your dataset. The first principal component is the value on the axis that has the largest amount of
variance. The second principal component is the value on the axis that has the second largest amount
of variance. The nth principal component is the value on the axis that has the nth largest amount of
variance.

You can configure the number of principal components that Data Wrangler returns. You can either
specify the number of principal components directly or you can specify the variance threshold
percentage. Each principal component explains an amount of variance in the data. For example, you
might have a principal component with a value of 0.5. The component would explain 50% of the
variation in the data. When you specify a variance threshold percentage, Data Wrangler returns the
smallest number of components that meet the percentage that you specify.


The following are example principal components with the amount of variance that they explain in the
data.

• Component 1 – 0.5
• Component 2 – 0.45
• Component 3 – 0.05

If you specify a variance threshold percentage of 94 or 95, Data Wrangler returns Component 1 and
Component 2. If you specify a variance threshold percentage of 96, Data Wrangler returns all three
principal components.
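
As a hedged sketch of the same idea with scikit-learn, passing a fraction to n_components keeps the
smallest number of components whose cumulative explained variance meets that threshold; the numeric
DataFrame df is an assumption.

# PCA with a variance threshold, assuming a numeric pandas DataFrame `df`.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(df)  # center and scale the data

pca = PCA(n_components=0.95).fit(scaled)
print(pca.n_components_)              # number of components kept
print(pca.explained_variance_ratio_)  # variance explained by each component

components = pca.transform(scaled)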

To run PCA on your dataset, do the following.

1. Open your Data Wrangler data flow.
2. Choose the +, and select Add transform.
3. Choose Add step.
4. Choose Dimensionality Reduction.
5. For Input Columns, choose the features that you're reducing into the principal components.
6. (Optional) For Number of principal components, choose the number of principal components that
Data Wrangler returns in your dataset. If you specify a value for the field, you can't specify a value for
Variance threshold percentage.
7. (Optional) For Variance threshold percentage, specify the percentage of variation in the data that
you want explained by the principal components. Data Wrangler uses the default value of 95 if you
don't specify a value for the variance threshold. You can't specify a variance threshold percentage if
you've specified a value for Number of principal components.
8. (Optional) Deselect Center to not use the mean of the columns as the center of the data. By default,
Data Wrangler centers the data with the mean before scaling.
9. (Optional) Deselect Scale to not scale the data with the unit standard deviation.
10. (Optional) Choose Columns to output the components to separate columns. Choose Vector to
output the components as a single vector.
11. (Optional) For Output column, specify a name for an output column. If you're outputting the
components to separate columns, the name that you specify is a prefix. If you're outputting the
components to a vector, the name that you specify is the name of the vector column.
12. (Optional) Select Keep input columns. We don't recommend selecting this option if you plan on
only using the principal components to train your model.
13. Choose Preview.
14. Choose Add.

Encode Categorical
Categorical data is usually composed of a finite number of categories, where each category is
represented with a string. For example, if you have a table of customer data, a column that indicates the
country a person lives in is categorical. The categories would be Afghanistan, Albania, Algeria, and so
on. Categorical data can be nominal or ordinal. Ordinal categories have an inherent order, and nominal
categories do not. The highest degree obtained (High school, Bachelors, Masters, and so on) is an example
of ordinal categories.

Encoding categorical data is the process of creating a numerical representation for categories. For
example, if your categories are Dog and Cat, you may encode this information into two vectors, [1,0] to
represent Dog, and [0,1] to represent Cat.


When you encode ordinal categories, you may need to translate the natural order of categories into your
encoding. For example, you can represent the highest degree obtained with the following map: {"High
school": 1, "Bachelors": 2, "Masters":3}.

Use categorical encoding to encode categorical data that is in string format into arrays of integers.
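
For comparison, here's a brief scikit-learn sketch of both encodings; Data Wrangler's own encoders are
configured in the UI, so this is only an illustration.

# Ordinal and one-hot encodings of an ordinal "highest degree" feature.
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

order = ["High school", "Bachelors", "Masters"]
degrees = [["High school"], ["Bachelors"], ["Masters"]]

ordinal = OrdinalEncoder(categories=[order])
print(ordinal.fit_transform(degrees))  # [[0.], [1.], [2.]]

# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False.
onehot = OneHotEncoder(categories=[order], sparse_output=False)
print(onehot.fit_transform(degrees))   # [[1,0,0], [0,1,0], [0,0,1]]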

The Data Wrangler categorical encoders create encodings for all categories that exist in a column at the
time the step is defined. If new categories have been added to a column when you start a Data Wrangler
job to process your dataset at time t, and this column was the input for a Data Wrangler categorical
encoding transform at time t-1, these new categories are considered missing in the Data Wrangler job.
The option you select for Invalid handling strategy is applied to these missing values. Examples of when
this can occur are:

• When you use a .flow file to create a Data Wrangler job to process a dataset that was updated after the
creation of the data flow. For example, you may use a data flow to regularly process sales data each
month. If that sales data is updated weekly, new categories may be introduced into columns for which
an encode categorical step is defined.
• When you select Sampling when you import your dataset, some categories may be left out of the
sample.

In these situations, these new categories are considered missing values in the Data Wrangler job.

You can choose from and configure an ordinal and a one-hot encode. Use the following sections to learn
more about these options.

Both transforms create a new column named Output column name. You specify the output format of
this column with Output style:

• Select Vector to produce a single column with a sparse vector.
• Select Columns to create a column for every category with an indicator variable for whether the text
in the original column contains a value that is equal to that category.

Ordinal Encode
Select Ordinal encode to encode categories into an integer between 0 and the total number of
categories in the Input column you select.

Invalid handling strategy: Select a method to handle invalid or missing values.

• Choose Skip if you want to omit the rows with missing values.
• Choose Keep to retain missing values as the last category.
• Choose Error if you want Data Wrangler to throw an error if missing values are encountered in the
Input column.
• Choose Replace with NaN to replace missing with NaN. This option is recommended if your ML
algorithm can handle missing values. Otherwise, the first three options in this list may produce better
results.

One-Hot Encode
Select One-hot encode for Transform to use one-hot encoding. Configure this transform using the
following:

• Drop last category: If True, the last category does not have a corresponding index in the one-hot
encoding. When missing values are possible, a missing category is always the last one and setting this
to True means that a missing value results in an all zero vector.


• Invalid handling strategy: Select a method to handle invalid or missing values.
• Choose Skip if you want to omit the rows with missing values.
• Choose Keep to retain missing values as the last category.
• Choose Error if you want Data Wrangler to throw an error if missing values are encountered in the
Input column.
• Is input ordinal encoded: Select this option if the input vector contains ordinal encoded data. This
option requires that input data contain non-negative integers. If True, input i is encoded as a vector
with a non-zero in the ith location.

Similarity encode
Use similarity encoding when you have the following:

• A large number of categorical variables
• Noisy data

The similarity encoder creates embeddings for columns with categorical data. An embedding is a
mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to
vectors containing similar values. For example, it creates very similar encodings for "California" and
"Calfornia".

Data Wrangler converts each category in your dataset into a set of tokens using a 3-gram tokenizer. It
converts the tokens into an embedding using min-hash encoding.

The following example shows how the similarity encoder creates vectors from strings.
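
As a complement, the following illustrative sketch builds a character 3-gram min-hash embedding by
hand; Data Wrangler's actual encoder is internal, so the hashing scheme here is an assumption that only
demonstrates why similar strings get similar vectors.

# An illustrative character 3-gram min-hash embedding.
import hashlib

def minhash_embedding(text, dim=8):
    grams = {text[i:i + 3] for i in range(len(text) - 2)}  # character 3-grams
    vector = []
    for seed in range(dim):
        # For each seed, keep the minimum hash over all 3-grams.
        vector.append(min(
            int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) % 10**6
            for g in grams
        ))
    return vector

print(minhash_embedding("California"))
print(minhash_embedding("Calfornia"))  # shares most 3-grams, so most slots match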


The similarity encodings that Data Wrangler creates:

• Have low dimensionality
• Are scalable to a large number of categories
• Are robust and resistant to noise

For the preceding reasons, similarity encoding is more versatile than one-hot encoding.

To use similarity encoding, do the following.

1. Sign in to the Amazon SageMaker Console.
2. Choose Open Studio.
3. Choose Launch app.
4. Choose Studio.
5. Specify your data flow.
6. Choose a step with a transformation.
7. Choose Add step.
8. Choose Encode categorical.
9. Specify the following:

• Transform – Similarity encode
• Input column – The column containing the categorical data that you're encoding.
• Target dimension – (Optional) The dimension of the categorical embedding vector. The default
value is 30. We recommend using a larger target dimension if you have a large dataset with many
categories.
• Output style – Choose Vector for a single vector with all of the encoded values. Choose Column
to have the encoded values in separate columns.
• Output column – (Optional) The name of the output column for a vector-encoded output. For a
column-encoded output, this is the prefix of the column names, followed by a number.

Featurize Text
Use the Featurize Text transform group to inspect string-typed columns and use text embedding to
featurize these columns.


This feature group contains two features, Character statistics and Vectorize. Use the following sections to
learn more about these transforms. For both options, the Input column must contain text data (string
type).

Character Statistics
Use Character statistics to generate statistics for each row in a column containing text data.

This transform computes the following ratios and counts for each row, and creates a new column to
report the result. The new column is named using the input column name as a prefix and a suffix that is
specific to the ratio or count.

• Number of words: The total number of words in that row. The suffix for this output column is
-stats_word_count.
• Number of characters: The total number of characters in that row. The suffix for this output column is
-stats_char_count.
• Ratio of upper: The number of uppercase characters, from A to Z, divided by all characters in the
column. The suffix for this output column is -stats_capital_ratio.
• Ratio of lower: The number of lowercase characters, from a to z, divided by all characters in the
column. The suffix for this output column is -stats_lower_ratio.
• Ratio of digits: The ratio of digits in a single row over the sum of digits in the input column. The suffix
for this output column is -stats_digit_ratio.
• Special characters ratio: The ratio of non-alphanumeric characters (such as #$&%:@) over the sum of
all characters in the input column. The suffix for this output column is -stats_special_ratio.
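
Comparable per-row statistics can be sketched in pandas as follows; the column name text is an
assumption.

# Per-row character statistics for a string column named "text".
import pandas as pd

df = pd.DataFrame({"text": ["Hello World 42", "abc DEF!"]})
s = df["text"]

df["text-stats_word_count"] = s.str.split().str.len()
df["text-stats_char_count"] = s.str.len()
df["text-stats_capital_ratio"] = s.str.count(r"[A-Z]") / s.str.len()
df["text-stats_lower_ratio"] = s.str.count(r"[a-z]") / s.str.len()
df["text-stats_digit_ratio"] = s.str.count(r"\d") / s.str.len()
print(df)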

Vectorize
Text embedding involves mapping words or phrases from a vocabulary to vectors of real numbers. Use
the Data Wrangler text embedding transform to tokenize and vectorize text data into term frequency–
inverse document frequency (TF-IDF) vectors.

When TF-IDF is calculated for a column of text data, each word in each sentence is converted to a real
number that represents its semantic importance. Higher numbers are associated with less frequent
words, which tend to be more meaningful.

When you define a Vectorize transform step, Data Wrangler uses the data in your dataset to define the
count vectorizer and TF-IDF methods. Running a Data Wrangler job uses these same methods.

You configure this transform using the following:

• Output column name: This transform creates a new column with the text embedding. Use this field to
specify a name for this output column.
• Tokenizer: A tokenizer converts the sentence into a list of words, or tokens.

Choose Standard to use a tokenizer that splits by white space and converts each word to lowercase.
For example, "Good dog" is tokenized to ["good","dog"].

Choose Custom to use a customized tokenizer. If you choose Custom, you can use the following fields
to configure the tokenizer:
• Minimum token length: The minimum length, in characters, for a token to be valid. Defaults to 1.
For example, if you specify 3 for minimum token length, words like a, at, in are dropped from
the tokenized sentence.
• Should regex split on gaps: If selected, regex splits on gaps. Otherwise, it matches tokens. Defaults
to True.


• Regex pattern: Regex pattern that defines the tokenization process. Defaults to '\s+'.
• To lowercase: If chosen, Data Wrangler converts all characters to lowercase before tokenization.
Defaults to True.

To learn more, see the Spark documentation on Tokenizer.
• Vectorizer: The vectorizer converts the list of tokens into a sparse numeric vector. Each token
corresponds to an index in the vector and a non-zero indicates the existence of the token in the input
sentence. You can choose from two vectorizer options, Count and Hashing.
• Count vectorize allows customizations that filter infrequent or too common tokens. Count vectorize
parameters include the following:
• Minimum term frequency: In each row, terms (tokens) with smaller frequency are filtered. If you
specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0
(inclusive) and 1, the threshold is relative to the total term count. Defaults to 1.
• Minimum document frequency: Minimum number of rows in which a term (token) must appear
to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a
fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to
1.
• Maximum document frequency: Maximum number of documents (rows) in which a term (token)
can appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you
specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count.
Defaults to 0.999.
• Maximum vocabulary size: Maximum size of the vocabulary. The vocabulary is made up of all
terms (tokens) in all rows of the column. Defaults to 262144.
• Binary outputs: If selected, the vector outputs do not include the number of appearances of a
term in a document, but rather are a binary indicator of its appearance. Defaults to False.

To learn more about this option, see the Spark documentation on CountVectorizer.
• Hashing is computationally faster. Hash vectorize parameters include the following:
• Number of features during hashing: A hash vectorizer maps tokens to a vector index according to
their hash value. This feature determines the number of possible hash values. Large values result
in fewer collisions between hash values but a higher dimension output vector.

To learn more about this option, see the Spark documentation on FeatureHasher.
• Apply IDF applies an IDF transformation, which multiplies the term frequency with the standard
inverse document frequency used for TF-IDF embedding. IDF parameters include the following:
• Minimum document frequency: Minimum number of documents (rows) in which a term (token)
must appear to be included. If count_vectorize is the chosen vectorizer, we recommend that you
keep the default value and only modify the min_doc_freq field in Count vectorize parameters.
Defaults to 5.
• Output format: The output format of each row.
• Select Vector to produce a single column with a sparse vector.
• Select Flattened to create a column for every category with an indicator variable for whether the
text in the original column contains a value that is equal to that category. You can only choose
flattened when Vectorizer is set as Count vectorizer.
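
Under stated assumptions (a Spark DataFrame df with a string column named review), a comparable
Spark ML pipeline for the default tokenize, count vectorize, and IDF flow might look like this sketch:

# Tokenize, count vectorize, and apply IDF, roughly mirroring the
# Vectorize transform's defaults.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="review", outputCol="tokens"),
    CountVectorizer(inputCol="tokens", outputCol="counts", minDF=1.0),
    IDF(inputCol="counts", outputCol="tfidf", minDocFreq=5),
])

vectorized = pipeline.fit(df).transform(df)
vectorized.select("tfidf").show(truncate=False)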

Transform Time Series
In Data Wrangler, you can transform time series data. The values in a time series dataset are indexed to
a specific time. For example, a dataset that shows the number of customers in a store for each hour in a
day is a time series dataset. The following table shows an example of a time series dataset.


Hourly number of customers in a store

Number of customers Time (hour)

4 09:00

10 10:00

14 11:00

25 12:00

20 13:00

18 14:00

For the preceding table, the Number of Customers column contains the time series data. The time series
data is indexed on the hourly data in the Time (hour) column.

You might need to perform a series of transformations on your data to get it in a format that you can
use for your analysis. Use the Time series transform group to transform your time series data. For more
information about the transformations that you can perform, see the following sections.

Topics
• Group by a Time Series (p. 1078)
• Resample Time Series Data (p. 1079)
• Handle Missing Time Series Data (p. 1081)
• Validate the Timestamp of Your Time Series Data (p. 1082)
• Standardizing the Length of the Time Series (p. 1083)
• Extract Features from Your Time Series Data (p. 1084)
• Use Lagged Features from Your Time Series Data (p. 1085)
• Create a Datetime Range In Your Time Series (p. 1085)
• Use a Rolling Window In Your Time Series (p. 1086)

Group by a Time Series
You can use the group by operation to group time series data for specific values in a column.

For example, you have the following table that tracks the average daily electricity usage in a household.

Average daily household electricity usage

Household ID Daily timestamp Electricity usage (kWh) Number of household occupants

household_0 1/1/2020 30 2

household_0 1/2/2020 40 2

household_0 1/4/2020 35 3

household_1 1/2/2020 45 3

household_1 1/3/2020 55 4


If you choose to group by ID, you get the following table.

Electricity usage grouped by household ID

Household ID Electricity usage series (kWh) Number of household occupants series

household_0 [30, 40, 35] [2, 2, 3]

household_1 [45, 55] [3, 4]

Each entry in the time series sequence is ordered by the corresponding timestamp. The first element of
the sequence corresponds to the first timestamp of the series. For household_0, 30 is the first value of
the Electricity Usage Series. The value of 30 corresponds to the first timestamp of 1/1/2020.

You can include the starting timestamp and ending timestamp. The following table shows how that
information appears.

Electricity usage grouped by household ID

Household ID Electricity usage series (kWh) Number of household occupants series Start_time End_time

household_0 [30, 40, 35] [2, 2, 3] 1/1/2020 1/4/2020

household_1 [45, 55] [3, 4] 1/2/2020 1/3/2020
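
The same grouping can be sketched in pandas; the column names are assumptions based on the tables
above.

# Group time series values into ordered lists per household.
grouped = (
    df.sort_values("timestamp")  # order each series by timestamp
      .groupby("household_id")
      .agg(kwh_series=("kwh", list),
           occupants_series=("occupants", list),
           start_time=("timestamp", "min"),
           end_time=("timestamp", "max"))
      .reset_index()
)
print(grouped)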

You can use the following procedure to group by a time series column.

1. Open your Data Wrangler data flow.
2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Time Series.
6. Under Transform, choose Group by.
7. Specify a column in Group by this column.
8. For Apply to columns, specify a value.
9. Choose Preview to generate a preview of the transform.
10. Choose Add to add the transform to the Data Wrangler data flow.

Resample Time Series Data
Time series data usually has observations that aren't taken at regular intervals. For example, a dataset
could have some observations that are recorded hourly and other observations that are recorded every
two hours.

Many analyses, such as forecasting algorithms, require the observations to be taken at regular intervals.
Resampling gives you the ability to establish regular intervals for the observations in your dataset.

You can either upsample or downsample a time series. Downsampling increases the interval between
observations in the dataset. For example, if you downsample observations that are taken either
every hour or every two hours, each observation in your dataset is taken every two hours. The hourly


observations are aggregated into a single value using an aggregation method such as the mean or
median.

Upsampling reduces the interval between observations in the dataset. For example, if you upsample
observations that are taken every two hours into hourly observations, you can use an interpolation
method to infer hourly observations from the ones that have been taken every two hours. For
information on interpolation methods, see pandas.DataFrame.interpolate.

You can resample both numeric and non-numeric data.

Use the Resample operation to resample your time series data. If you have multiple time series in your
dataset, Data Wrangler standardizes the time interval for each time series.

The following table shows an example of downsampling time series data by using the mean as the
aggregation method. The data is downsampled from every hour to every two hours.

Hourly temperature readings over a day before downsampling

Timestamp   Temperature (Celsius)
12:00       30
1:00        32
2:00        35
3:00        32
4:00        30

Temperature readings downsampled to every two hours

Timestamp   Temperature (Celsius)
12:00       31
2:00        33.5
4:00        30
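This operation is conceptually similar to resampling with pandas. The following sketch is illustrative only, using the temperature readings above; it is not the code that Data Wrangler runs.

import pandas as pd

# Hourly readings from the preceding table.
df = pd.DataFrame(
    {"Temperature (Celsius)": [30, 32, 35, 32, 30]},
    index=pd.date_range("2020-01-01 12:00", periods=5, freq="h"),
)

# Downsample from hourly to two-hourly observations, aggregating with the mean.
downsampled = df.resample("2h").mean()

# Upsample back to hourly observations, inferring the gaps by interpolation.
upsampled = downsampled.resample("h").interpolate(method="linear")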

You can use the following procedure to resample time series data.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Resample.
6. For Timestamp, choose the timestamp column.
7. For Frequency unit, specify the frequency that you're resampling.
8. (Optional) Specify a value for Frequency quantity.
9. Configure the transform by specifying the remaining fields.
10. Choose Preview to generate a preview of the transform.
11. Choose Add to add the transform to the Data Wrangler data flow.


Handle Missing Time Series Data


If you have missing values in your dataset, you can do one of the following:

• For datasets that have multiple time series, drop the time series that have missing values that are
greater than a threshold that you specify.
• Impute the missing values in a time series by using other values in the time series.

Imputing a missing value involves replacing the data by either specifying a value or by using an
inferential method. The following are the methods that you can use for imputation:

• Constant value – Replace all the missing data in your dataset with a value that you specify.
• Most common value – Replace all the missing data with the value that has the highest frequency in the
dataset.
• Forward fill – Use a forward fill to replace the missing values with the non-missing value that precedes
the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced
with 7. The sequence that results from using a forward fill is [2, 4, 7, 7, 7, 7, 8].
• Backward fill – Use a backward fill to replace the missing values with the non-missing value that
follows the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are
replaced with 8. The sequence that results from using a backward fill is [2, 4, 7, 8, 8, 8, 8].
• Interpolate – Uses an interpolation function to impute the missing values. For more information on the
functions that you can use for interpolation, see pandas.DataFrame.interpolate.

Some of the imputation methods might not be able to impute all of the missing values in your dataset.
For example, a forward fill can't impute a missing value that appears at the beginning of the time series,
but a backward fill can.

You can either impute missing values within a cell or within a column.

The following example shows how values are imputed within a cell.

Electricity usage with missing values

Household ID Electricity usage series (kWh)

household_0 [30, 40, 35, NaN, NaN]

household_1 [45, NaN, 55]

Electricity usage with values imputed using a forward fill

Household ID Electricity usage series (kWh)

household_0 [30, 40, 35, 35, 35]

household_1 [45, 45, 55]

The following example shows how values are imputed within a column.

Average daily household electricity usage with missing values

Household ID   Electricity usage (kWh)
household_0    30
household_0    40
household_0    NaN
household_1    NaN
household_1    NaN

Average daily household electricity usage with values imputed using a forward fill

Household ID   Electricity usage (kWh)
household_0    30
household_0    40
household_0    40
household_1    40
household_1    40
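A forward fill behaves like the pandas ffill method. The following sketch is illustrative only and is not Data Wrangler's implementation.

import numpy as np
import pandas as pd

usage = pd.Series([30, 40, np.nan, np.nan, np.nan], name="Electricity usage (kWh)")

# Forward fill: each missing value takes the last non-missing value before it.
filled = usage.ffill()            # [30, 40, 40, 40, 40]

# Alternatives, for example when the series starts with a missing value,
# which a forward fill can't impute:
filled_back = usage.bfill()       # backward fill
interpolated = usage.interpolate(method="linear")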

You can use the following procedure to handle missing values.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Handle missing.
6. For Time series input type, choose whether you want to handle missing values inside of a cell or
along a column.
7. For Impute missing values for this column, specify the column that has the missing values.
8. For Method for imputing values, select a method.
9. Configure the transform by specifying the remaining fields.
10. Choose Preview to generate a preview of the transform.
11. If you have missing values, you can specify a method for imputing them under Method for imputing
values.
12. Choose Add to add the transform to the Data Wrangler data flow.

Validate the Timestamp of Your Time Series Data


You might have timestamp data that is invalid. You can use the Validate timestamps transform to
determine whether the timestamps in your dataset are valid. Your timestamp can be invalid for one or
more of the following reasons:

• Your timestamp column has missing values.


• The values in your timestamp column are not formatted correctly.

If you have invalid timestamps in your dataset, you can't perform your analysis successfully. You can use
Data Wrangler to identify invalid timestamps and understand where you need to clean your data.


You can configure Data Wrangler to do one of the following if it encounters missing or invalid values in
your dataset:

• Drop the rows that have the missing or invalid values.


• Identify the rows that have the missing or invalid values.
• Throw an error if it finds any missing or invalid values in your dataset.

You can validate the timestamps on columns that either have the timestamp type or the string type.
If the column has the string type, Data Wrangler converts the type of the column to timestamp and
performs the validation.
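Conceptually, the validation resembles coercing a column to the datetime type and flagging the rows that fail to parse. A minimal pandas sketch, with hypothetical column names:

import pandas as pd

df = pd.DataFrame({"timestamp": ["2020-01-01 09:00", "not a timestamp", None, "2020-01-01 11:00"]})

# Values that are missing or can't be parsed become NaT (not a time).
parsed = pd.to_datetime(df["timestamp"], errors="coerce")

df["timestamp_is_valid"] = parsed.notna()   # identify the invalid rows
df_clean = df[parsed.notna()]               # or drop them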

You can use the following procedure to validate the timestamps in your dataset.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Validate timestamps.
6. For Timestamp Column, choose the timestamp column.
7. For Policy, choose how you want to handle missing or invalid timestamps.
8. (Optional) For Output column, specify a name for the output column.
9. If the date time column is formatted for the string type, choose Cast to datetime.
10. Choose Preview to generate a preview of the transform.
11. Choose Add to add the transform to the Data Wrangler data flow.

Standardizing the Length of the Time Series


If you have time series data stored as arrays, you can standardize each time series to the same length.
Standardizing the length of the time series array might make it easier for you to perform your analysis
on the data.

You can standardize your time series for data transformations that require the length of your data to be
fixed.

Many ML algorithms require you to flatten your time series data before you use them. Flattening time
series data means separating each value of the time series into its own column in a dataset. Because the
number of columns in a dataset can't change, the lengths of the time series need to be standardized
before you flatten each array into a set of features.

Each time series is set to the length that you specify as a quantile or percentile of the time series set. For
example, you can have three sequences that have the following lengths:

• 3
• 4
• 5

You can set the length of all of the sequences as the length of the sequence that has the 50th percentile
length.


Time series arrays that are shorter than the length you've specified have missing values added. The
following is an example format of standardizing the time series to a longer length: [2, 4, 5, NaN, NaN,
NaN].

You can use different approaches to handle the missing values. For information on those approaches, see
Handle Missing Time Series Data (p. 1081).

The time series arrays that are longer than the length that you specify are truncated.
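The following plain-Python sketch illustrates the idea: compute a target length from a quantile of the observed lengths, then truncate or pad each sequence. It is illustrative only.

import numpy as np

sequences = [[2, 4, 5], [1, 3, 5, 7], [9, 8, 7, 6, 5]]

# Target length from the 50th percentile of the observed lengths (here, 4).
target_len = int(np.percentile([len(s) for s in sequences], 50))

def standardize(seq, length):
    # Truncate sequences that are too long; pad short ones with NaN.
    seq = list(seq[:length])
    return seq + [np.nan] * (length - len(seq))

standardized = [standardize(s, target_len) for s in sequences]
# [[2, 4, 5, nan], [1, 3, 5, 7], [9, 8, 7, 6]]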

You can use the following procedure to standardize the length of the time series.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Standardize length.
6. For Standardize the time series length for the column, choose a column.
7. (Optional) For Output column, specify a name for the output column. If you don't specify a name,
the transform is done in place.
8. If the datetime column is formatted for the string type, choose Cast to datetime.
9. Choose Cutoff quantile and specify a quantile to set the length of the sequence.
10. Choose Flatten the output to output the values of the time series into separate columns.
11. Choose Preview to generate a preview of the transform.
12. Choose Add to add the transform to the Data Wrangler data flow.

Extract Features from Your Time Series Data


If you're running a classification or a regression algorithm on your time series data, we recommend
extracting features from the time series before running the algorithm. Extracting features might improve
the performance of your algorithm.

Use the following options to choose how you want to extract features from your data:

• Use Minimal subset to specify extracting 8 features that you know are useful in downstream analyses.
You can use a minimal subset when you need to perform computations quickly. You can also use it
when your ML algorithm has a high risk of overfitting and you want to provide it with fewer features.
• Use Efficient subset to specify extracting the most features possible without extracting features that
are computationally intensive in your analyses.
• Use All features to specify extracting all features from the time series.
• Use Manual subset to choose a list of features that you think explain the variation in your data well.

Use the following procedure to extract features from your time series data.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Extract features.
6. For Extract features for this column, choose a column.
7. (Optional) Select Flatten to output the features into separate columns.


8. For Strategy, choose a strategy to extract the features.


9. Choose Preview to generate a preview of the transform.
10. Choose Add to add the transform to the Data Wrangler data flow.
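As a rough illustration of a small, manually chosen feature subset, the following pandas sketch computes a few summary statistics for one series. The feature names are hypothetical; libraries such as tsfresh automate much larger feature sets.

import numpy as np
import pandas as pd

series = pd.Series([30, 40, 35, 45, 55], name="usage")

features = {
    "mean": series.mean(),
    "std": series.std(),
    "min": series.min(),
    "max": series.max(),
    "last": series.iloc[-1],
    # Slope of a linear fit as a simple trend feature.
    "trend": np.polyfit(np.arange(len(series)), series, deg=1)[0],
}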

Use Lagged Features from Your Time Series Data


For many use cases, the best way to predict the future behavior of your time series is to use its most
recent behavior.

The most common uses of lagged features are the following:

• Collecting a handful of past values. For example, to predict the value at time t + 1, you collect the values at times t, t - 1, t - 2, and t - 3.
• Collecting values that correspond to seasonal behavior in the data. For example, to predict the
occupancy in a restaurant at 1:00 PM, you might want to use the features from 1:00 PM on the
previous day. Using the features from 12:00 PM or 11:00 AM on the same day might not be as
predictive as using the features from previous days.

Use the following procedure to create lagged features.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Lag features.
6. For Generate lag features for this column, choose a column.
7. For Timestamp Column, choose the column containing the timestamps.
8. For Lag, specify the duration of the lag.
9. (Optional) Configure the output using one of the following options:

• Include the entire lag window


• Flatten the output
• Drop rows without history
10. Choose Preview to generate a preview of the transform.
11. Choose Add to add the transform to the Data Wrangler data flow.
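Lagged features correspond to shifting a column against its timestamps. A minimal pandas sketch, with hypothetical column names:

import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=6, freq="h"),
    "value": [10, 14, 24, 40, 30, 20],
}).sort_values("timestamp")

# A handful of recent values: lags 1 through 3. For hourly data, a seasonal
# lag such as the same hour on the previous day would be shift(24).
for lag in range(1, 4):
    df[f"value_lag_{lag}"] = df["value"].shift(lag)

# Rows without enough history contain NaN; dropping them mirrors the
# "Drop rows without history" option.
df = df.dropna(subset=["value_lag_3"])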

Create a Datetime Range In Your Time Series


You might have time series data that doesn't have timestamps. If you know that the observations were
taken at regular intervals, you can generate timestamps for the time series in a separate column.
To generate timestamps, you specify the value for the start timestamp and the frequency of the
timestamps.

For example, you might have the following time series data for the number of customers at a restaurant.

Time series data on the number of customers at a restaurant

Number of customers
10
14
24
40
30
20

If you know that the restaurant opened at 1:00 PM and that the observations are taken hourly, you can
add a timestamp column that corresponds to the time series data. You can see the timestamp column in
the following table.

Time series data on the number of customers at a restaurant

Number of customers   Timestamp
10                    1:00 PM
14                    2:00 PM
24                    3:00 PM
40                    4:00 PM
30                    5:00 PM
20                    6:00 PM
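Generating the timestamp column resembles pandas date_range. The sketch below reproduces the restaurant example; the start date is a hypothetical placeholder.

import pandas as pd

df = pd.DataFrame({"Number of customers": [10, 14, 24, 40, 30, 20]})

# Hourly timestamps starting when the restaurant opens at 1:00 PM.
df["Timestamp"] = pd.date_range(start="2020-01-01 13:00", periods=len(df), freq="h")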

Use the following procedure to add a datetime range to your data.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.
3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Datetime range.
6. For Frequency type, choose the unit used to measure the frequency of the timestamps.
7. For Starting timestamp, specify the start timestamp.
8. For Output column, specify a name for the output column.
9. (Optional) Configure the output using the remaining fields.
10. Choose Preview to generate a preview of the transform.
11. Choose Add to add the transform to the Data Wrangler data flow.

Use a Rolling Window In Your Time Series


You can extract features over a time period. For example, for time t and a time window length of 3, for
the row that corresponds to the tth timestamp, we append the features that are extracted from the time
series at times t - 3, t - 2, and t - 1. For information on extracting features, see Extract Features from Your
Time Series Data (p. 1084).

You can use the following procedure to extract features over a time period.

1. Open your Data Wrangler data flow.


2. If you haven't imported your dataset, import it under the Import data tab.


3. In your data flow, under Data types, choose the +, and select Add transform.
4. Choose Add step.
5. Choose Rolling window features.
6. For Generate rolling window features for this column, choose a column.
7. For Timestamp Column, choose the column containing the timestamps.
8. (Optional) For Output Column, specify the name of the output column.
9. For Window size, specify the window size.
10. For Strategy, choose the extraction strategy.
11. Choose Preview to generate a preview of the transform.
12. Choose Add to add the transform to the Data Wrangler data flow.
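Rolling window features correspond to aggregations over a trailing window. A minimal pandas sketch with hypothetical column names; shifting by one first keeps the current row out of its own window:

import pandas as pd

df = pd.DataFrame({"value": [10, 14, 24, 40, 30, 20]})

# For the row at time t, extract features from times t - 3, t - 2, and t - 1.
window = df["value"].shift(1).rolling(window=3)
df["rolling_mean"] = window.mean()
df["rolling_min"] = window.min()
df["rolling_max"] = window.max()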

Featurize Datetime
Use Featurize date/time to create a vector embedding representing a datetime field. To use this
transform, your datetime data must be in one of the following formats:

• Strings describing datetime: For example, "January 1st, 2020, 12:44pm".


• A Unix timestamp: A Unix timestamp describes the number of seconds, milliseconds, microseconds, or
nanoseconds from 1/1/1970.

You can choose to Infer datetime format and provide a Datetime format. If you provide a datetime
format, you must use the codes described in the Python documentation. The options you select for these
two configurations have implications for the speed of the operation and the final results.

• The most manual and computationally fastest option is to specify a Datetime format and select No
for Infer datetime format.
• To reduce manual labor, you can choose Infer datetime format and not specify a datetime format. It
is also a computationally fast operation; however, the first datetime format encountered in the input
column is assumed to be the format for the entire column. If there are other formats in the column,
these values are NaN in the final output. Inferring the datetime format can give you unparsed strings.
• If you don't specify a format and select No for Infer datetime format, you get the most robust results.
All the valid datetime strings are parsed. However, this operation can be an order of magnitude slower
than the first two options in this list.

When you use this transform, you specify an Input column that contains datetime data in one of the
formats listed above. The transform creates an output column named Output column name. The format
of the output column depends on the configuration that you choose:

• Vector: Outputs a single column as a vector.


• Columns: Creates a new column for every feature. For example, if the output contains a year, month,
and day, three separate columns are created for year, month, and day.

Additionally, you must choose an Embedding mode. For linear models and deep networks, we
recommend choosing cyclic. For tree-based algorithms, we recommend choosing ordinal.
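The following pandas sketch illustrates the difference between the two embedding modes; it is a conceptual illustration, not Data Wrangler's implementation. Cyclic encoding maps each component onto a circle so that, for example, hour 23 and hour 0 end up close together.

import numpy as np
import pandas as pd

df = pd.DataFrame({"datetime": pd.to_datetime(["2020-01-01 12:44", "2020-07-15 03:10"])})
dt = df["datetime"].dt

# Ordinal encoding: raw component values, suited to tree-based algorithms.
df["hour"] = dt.hour
df["month"] = dt.month

# Cyclic encoding: sine/cosine pairs, suited to linear models and deep networks.
df["hour_sin"] = np.sin(2 * np.pi * dt.hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * dt.hour / 24)
df["month_sin"] = np.sin(2 * np.pi * (dt.month - 1) / 12)
df["month_cos"] = np.cos(2 * np.pi * (dt.month - 1) / 12)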

Format String
The Format string transform group contains standard string formatting operations. For example, you can
use these operations to remove special characters, normalize string lengths, and update string casing.


This feature group contains the following transforms. All transforms return copies of the strings in the
Input column and add the result to a new, output column.

• Left pad – Left-pads the string with a given Fill character to the given width. If the string is longer than width, the return value is shortened to width characters.
• Right pad – Right-pads the string with a given Fill character to the given width. If the string is longer than width, the return value is shortened to width characters.
• Center (pad on either side) – Center-pads the string (adds padding on both sides of the string) with a given Fill character to the given width. If the string is longer than width, the return value is shortened to width characters.
• Prepend zeros – Left-fills a numeric string with zeros, up to a given width. If the string is longer than width, the return value is shortened to width characters.
• Strip left and right – Returns a copy of the string with the leading and trailing characters removed.
• Strip characters from left – Returns a copy of the string with leading characters removed.
• Strip characters from right – Returns a copy of the string with trailing characters removed.
• Lower case – Converts all letters in the text to lowercase.
• Upper case – Converts all letters in the text to uppercase.
• Capitalize – Capitalizes the first letter in each sentence.
• Swap case – Converts all uppercase characters of the given string to lowercase and all lowercase characters to uppercase, and returns the result.
• Add prefix or suffix – Adds a prefix and a suffix to the string column. You must specify at least one of Prefix and Suffix.
• Remove symbols – Removes given symbols from a string. All listed characters are removed. Defaults to white space.

Handle Outliers
Machine learning models are sensitive to the distribution and range of your feature values. Outliers, or
rare values, can negatively impact model accuracy and lead to longer training times. Use this feature
group to detect and update outliers in your dataset.

When you define a Handle outliers transform step, the statistics used to detect outliers are generated on
the data available in Data Wrangler when defining this step. These same statistics are used when running
a Data Wrangler job.

Use the following sections to learn more about the transforms this group contains. You specify an
Output name and each of these transforms produces an output column with the resulting data.


Robust standard deviation numeric outliers


This transform detects and fixes outliers in numeric features using statistics that are robust to outliers.

You must define an Upper quantile and a Lower quantile for the statistics used to calculate outliers. You
must also specify the number of Standard deviations from which a value must vary from the mean to be
considered an outlier. For example, if you specify 3 for Standard deviations, a value must fall more than
3 standard deviations from the mean to be considered an outlier.

The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:

• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.

Standard Deviation Numeric Outliers


This transform detects and fixes outliers in numeric features using the mean and standard deviation.

You specify the number of Standard deviations a value must vary from the mean to be considered an
outlier. For example, if you specify 3 for Standard deviations, a value must fall more than 3 standard
deviations from the mean to be considered an outlier.

The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:

• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.

Quantile Numeric Outliers


Use this transform to detect and fix outliers in numeric features using quantiles. You can define an Upper
quantile and a Lower quantile. All values that fall above the upper quantile or below the lower quantile
are considered outliers.

The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:

• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.

Min-Max Numeric Outliers


This transform detects and fixes outliers in numeric features using upper and lower thresholds. Use this
method if you know the threshold values that demarcate outliers.

You specify an Upper threshold and a Lower threshold. If values fall above or below those thresholds,
respectively, they are considered outliers.

The Fix method is the method used to handle outliers when they are detected. You can choose from the
following:


• Clip: Use this option to clip the outliers to the corresponding outlier detection bound.
• Remove: Use this option to remove rows with outliers from the dataframe.
• Invalidate: Use this option to replace outliers with invalid values.

Replace Rare
When you use the Replace rare transform, you specify a threshold and Data Wrangler finds all values
that meet that threshold and replaces them with a string that you specify. For example, you may want to
use this transform to categorize all outliers in a column into an "Others" category.

• Replacement string: The string with which to replace outliers.


• Absolute threshold: A category is rare if the number of instances is less than or equal to this absolute
threshold.
• Fraction threshold: A category is rare if the number of instances is less than or equal to this fraction
threshold multiplied by the number of rows.
• Max common categories: The maximum number of categories that remain classified as common (not
rare) after the operation. If the thresholds don't filter out enough categories, the categories with the
most appearances are classified as not rare. If set to 0 (the default), there is no hard limit on the
number of categories.

Handle Missing Values


Missing values are a common occurrence in machine learning datasets. In some situations, it is
appropriate to impute missing data with a calculated value, such as an average or categorically common
value. You can process missing values using the Handle missing values transform group. This group
contains the following transforms.

Fill Missing
Use the Fill missing transform to replace missing values with a Fill value you define.

Impute Missing
Use the Impute missing transform to create a new column that contains imputed values where missing
values were found in input categorical and numerical data. The configuration depends on your data type.

For numeric data, choose an imputing strategy, the strategy used to determine the new value to impute.
You can choose to impute the mean or the median over the values that are present in your dataset. Data
Wrangler uses the value that it computes to impute the missing values.

For categorical data, Data Wrangler imputes missing values using the most frequent value in the column.
To impute a custom string, use the Fill missing transform instead.

Add Indicator for Missing


Use the Add indicator for missing transform to create a new indicator column, which contains a Boolean
"false" if a row contains a value, and "true" if a row contains a missing value.

Drop Missing
Use the Drop missing option to drop rows that contain missing values from the Input column.

Manage Columns
You can use the following transforms to quickly update and manage columns in your dataset:


• Drop Column – Delete a column.
• Duplicate Column – Duplicate a column.
• Rename Column – Rename a column.
• Move Column – Move a column's location in the dataset. Choose to move your column to the start or end of the dataset, before or after a reference column, or to a specific index.

Manage Rows
Use this transform group to quickly perform sort and shuffle operations on rows. This group contains the
following:

• Sort: Sort the entire dataframe by a given column. Select the check box next to Ascending order to
sort in ascending order; otherwise, clear the check box to sort in descending order.
• Shuffle: Randomly shuffle all rows in the dataset.

Manage Vectors
Use this transform group to combine or flatten vector columns. This group contains the following
transforms.

• Assemble: Use this transform to combine Spark vectors and numeric data into a single column. For
example, you can combine three columns: two containing numeric data and one containing vectors.
Add all the columns you want to combine in Input columns and specify an Output column name for
the combined data.
• Flatten: Use this transform to flatten a single column containing vector data. The input column must
contain PySpark vectors or array-like objects. You can control the number of columns created by
specifying a Method to detect number of outputs. For example, if you select Length of first vector,
the number of elements in the first valid vector or array found in the column determines the number
of output columns that are created. All other input vectors with too many items are truncated. Inputs
with too few items are filled with NaNs.

You also specify an Output prefix, which is used as the prefix for each output column.

Process Numeric
Use the Process Numeric feature group to process numeric data. Each scaler in this group is defined
using the Spark library. The following scalers are supported; a brief PySpark sketch follows the list:

• Standard Scaler: Standardize the input column by subtracting the mean from each value and scaling
to unit variance. To learn more, see the Spark documentation for StandardScaler.
• Robust Scaler: Scale the input column using statistics that are robust to outliers. To learn more, see
the Spark documentation for RobustScaler.
• Min Max Scaler: Transform the input column by scaling each feature to a given range. To learn more,
see the Spark documentation for MinMaxScaler.
• Max Absolute Scaler: Scale the input column by dividing each value by the maximum absolute value.
To learn more, see the Spark documentation for MaxAbsScaler.
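The following PySpark sketch shows what standard scaling does, using a toy column; the data and column names are hypothetical, and this is an illustration rather than Data Wrangler's implementation. Spark scalers operate on vector columns, so the input is assembled into a vector first.

from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(30.0,), (40.0,), (35.0,)], ["usage"])

# Pack the numeric column into a vector column, then standardize it by
# subtracting the mean and scaling to unit variance.
assembler = VectorAssembler(inputCols=["usage"], outputCol="usage_vec")
scaler = StandardScaler(
    inputCol="usage_vec", outputCol="usage_scaled", withMean=True, withStd=True
)

assembled = assembler.transform(df)
scaled = scaler.fit(assembled).transform(assembled)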


Sampling
After you've imported your data, you can use the Sampling transformer to take one or more samples of
it. When you use the sampling transformer, Data Wrangler samples your original dataset.

You can choose one of the following sample methods:

• Limit: Samples the dataset starting from the first row up to the limit that you specify.
• Randomized: Takes a random sample of a size that you specify.
• Stratified: Takes a stratified random sample.

You can stratify a randomized sample to make sure that it represents the original distribution of the
dataset.

You might be performing data preparation for multiple use cases. For each use case, you can take a
different sample and apply a different set of transformations.

The following procedure describes the process of creating a random sample.

To take a random sample from your data.

1. Choose the + to the right of the dataset that you've imported. The name of your dataset is located
below the +.
2. Choose Add transform.
3. Choose Sampling.
4. For Sampling method, choose the sampling method.
5. For Approximate sample size, choose the approximate number of observations that you want in
your sample.
6. (Optional) Specify an integer for Random seed to create a reproducible sample.

The following procedure describes the process of creating a stratified sample.

To take a stratified sample from your data.

1. Choose the + to the right of the dataset that you've imported. The name of your dataset is located
below the +.
2. Choose Add transform.
3. Choose Sampling.
4. For Sampling method, choose the sampling method.
5. For Approximate sample size, choose the approximate number of observations that you want in
your sample.
6. For Stratify column, specify the name of the column that you want to stratify on.
7. (Optional) Specify an integer for Random seed to create a reproducible sample.
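A stratified sample keeps the per-category proportions of the stratify column. Conceptually, it resembles PySpark's sampleBy; the dataset path and label column below are hypothetical placeholders, and this is not the code that Data Wrangler runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/my-dataset/")  # hypothetical path

# Sample 10% of the rows within each value of the stratify column, so the
# sample preserves the column's original distribution.
fractions = {row["label"]: 0.1 for row in df.select("label").distinct().collect()}
sample = df.sampleBy("label", fractions=fractions, seed=42)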

Search and Edit


Use this section to search for and edit specific patterns within strings. For example, you can find and
update strings within sentences or documents, split strings by delimiters, and find occurrences of specific
strings.

The following transforms are supported under Search and edit. All transforms return copies of the
strings in the Input column and add the result to a new output column.


• Find substring – Returns the index of the first occurrence of the Substring for which you searched. You can start and end the search at Start and End, respectively.
• Find substring (from right) – Returns the index of the last occurrence of the Substring for which you searched. You can start and end the search at Start and End, respectively.
• Matches prefix – Returns a Boolean value if the string contains a given Pattern. A pattern can be a character sequence or regular expression. Optionally, you can make the pattern case sensitive.
• Find all occurrences – Returns an array with all occurrences of a given pattern. A pattern can be a character sequence or regular expression.
• Extract using regex – Returns a string that matches a given Regex pattern.
• Extract between delimiters – Returns a string with all characters found between Left delimiter and Right delimiter.
• Extract from position – Returns a string, starting from Start position in the input string, that contains all characters up to the start position plus Length.
• Find and replace substring – Returns a string with all matches of a given Pattern (regular expression) replaced by Replacement string.
• Replace between delimiters – Returns a string with the substring found between the first appearance of a Left delimiter and the last appearance of a Right delimiter replaced by Replacement string. If no match is found, nothing is replaced.
• Replace from position – Returns a string with the substring between Start position and Start position plus Length replaced by Replacement string. If Start position plus Length is greater than the length of the replacement string, the output contains ….
• Convert regex to missing – Converts a string to None if invalid and returns the result. Validity is defined with a regular expression in Pattern.
• Split string by delimiter – Returns an array of strings from the input string, split by Delimiter, with up to Max number of splits (optional). The delimiter defaults to white space.

Split data
Use the Split data transform to split your dataset into two or three datasets. For example, you can split
your dataset into a dataset used to train your model and a dataset used to test it. You can determine the


proportion of the dataset that goes into each split. For example, if you’re splitting one dataset into two
datasets, the training dataset can have 80% of the data while the testing dataset has 20%.

Splitting your data into three datasets gives you the ability to create training, validation, and test
datasets. You can see how well the model performs on the test dataset by dropping the target column.

Your use case determines how much of the original dataset each of your datasets get and the method
you use to split the data. For example, you might want to use a stratified split to make sure that the
distribution of the observations in the target column are the same across datasets. You can use the
following split transforms:

• Randomized split – Each split is a random, non-overlapping sample of the original dataset. For larger
datasets, using a randomized split might be computationally expensive and take longer than an
ordered split.
• Ordered split – Splits the dataset based on the sequential order of the observations. For example, for
an 80/20 train-test split, the first observations that make up 80% of the dataset go to the training
dataset. The last 20% of the observations go to the testing dataset. Ordered splits are effective in
keeping the existing order of the data between splits.
• Stratified split – Splits the dataset to make sure that the number of observations in the input column
have proportional representation. For an input column that has the observations 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, an 80/20 split on the column would mean that approximately 80% of the
1s, 80% of the 2s, and 80% of the 3s go to the training set. About 20% of each type of observation go
to the testing set.
• Split by key – Avoids data with the same key occurring in more than one split. For example, if you have
a dataset with the column 'customer_id' and you're using it as a key, no customer id is in more than
one split.

After you split the data, you can apply additional transformations to each dataset. For most use cases,
they aren't necessary.

Data Wrangler calculates the proportions of the splits for performance. You can choose an error
threshold to set the accuracy of the splits. Lower error thresholds more accurately reflect the proportions
that you specify for the splits. If you set a higher error threshold, you get better performance, but lower
accuracy.

For perfectly split data, set the error threshold to 0. You can specify a threshold between 0 and 1 for
better performance. If you specify a value greater than 1, Data Wrangler interprets that value as 1.

If you have 10,000 rows in your dataset and you specify an 80/20 split with an error threshold of 0.001,
you get observations approximating one of the following results:

• 8010 observations in the training set and 1990 in the testing set
• 7990 observations in the training set and 2010 in the testing set

The number of observations in the training set in the preceding example falls in the interval between
7990 and 8010.

By default, Data Wrangler uses a random seed to make the splits reproducible. You can specify a
different value for the seed to create a different reproducible split.
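An approximate randomized split with a fixed seed resembles PySpark's randomSplit, shown in the sketch below; the dataset path is a hypothetical placeholder, and Data Wrangler's internal mechanism may differ.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/my-dataset/")  # hypothetical path

# An approximate 80/20 randomized split; the seed makes it reproducible.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Or three splits for training, validation, and test datasets.
train, validation, test = df.randomSplit([0.7, 0.15, 0.15], seed=42)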

Randomized split

Use the following procedure to perform a randomized split on your dataset.

To split your dataset randomly, do the following.

1. Choose the + next to the node containing the dataset that you're splitting.


2. Choose Add transform.


3. Choose Split data.
4. (Optional) For Splits, specify the names and proportions of each split. The proportions must
sum to 1.
5. (Optional) Choose the + to create an additional split.

• Specify the names and proportions of all the splits. The proportions must sum to 1.
6. (Optional) Specify a value for Error threshold other than the default value.
7. (Optional) Specify a value for Random seed.
8. Choose Preview.
9. Choose Add.

Ordered split

Use the following procedure to perform an ordered split on your dataset.

To make an ordered split in your dataset, do the following.

1. Choose the + next to the node containing the dataset that you're splitting.
2. Choose Add transform.
3. Choose Split data.
4. For Transform, choose Ordered split.
5. (Optional) For Splits, specify the names and proportions of each split. The proportions must
sum to 1.
6. (Optional) Choose the + to create an additional split.

• Specify the names and proportions of all the splits. The proportions must sum to 1.
7. (Optional) Specify a value for Error threshold other than the default value.
8. (Optional) For Input column, specify a column with numeric values. Data Wrangler uses the
values of the column to infer which records are in each split. The smaller values go to one split,
and the larger values go to the other splits.
9. (Optional) Select Handle duplicates to add noise to duplicate values and create a dataset of
entirely unique values.
10. (Optional) Specify a value for Random seed.
11. Choose Preview.
12. Choose Add.

Stratified split

Use the following procedure to perform a stratified split on your dataset.

To make a stratified split in your dataset, do the following.

1. Choose the + next to the node containing the dataset that you're splitting.
2. Choose Add transform.
3. Choose Split data.
4. For Transform, choose Stratified split.
5. (Optional) For Splits, specify the names and proportions of each split. The proportions must
sum to 1.
6. (Optional) Choose the + to create an additional split.

• Specify the names and proportions of all the splits. The proportions must sum to 1.


7. For Input column, specify a column with up to 100 unique values. Data Wrangler can't stratify a
column with more than 100 unique values.
8. (Optional) Specify a value for Error threshold other than the default value.
9. (Optional) Specify a value for Random seed.
10. Choose Preview.
11. Choose Add.

Split by column keys

Use the following procedure to split by the column keys in your dataset.

To split by the column keys in your dataset, do the following.

1. Choose the + next to the node containing the dataset that you're splitting.
2. Choose Add transform.
3. Choose Split data.
4. For Transform, choose Split by key.
5. (Optional) For Splits, specify the names and proportions of each split. The proportions must
sum to 1.
6. (Optional) Choose the + to create an additional split.

• Specify the names and proportions of all the splits. The proportions must sum to 1.
7. For Key columns, specify the columns with values that you don't want to appear in both
datasets.
8. (Optional) Specify a value for Error threshold other than the default value.
9. Choose Preview.
10. Choose Add.

Parse Value as Type


Use this transform to cast a column to a new type. The supported Data Wrangler data types are:

• Long
• Float
• Boolean
• Date, in the format dd-MM-yyyy, representing day, month, and year respectively.
• String

Validate String
Use the Validate string transforms to create a new column that indicates that a row of text data meets
a specified condition. For example, you can use a Validate string transform to verify that a string only
contains lowercase characters.

The following transforms are included in this transform group. If a transform outputs a Boolean value,
True is represented with a 1 and False is represented with a 0.

• String length – Returns True if the string length equals a specified length. Otherwise, returns False.
• Starts with – Returns True if the string starts with a specified prefix. Otherwise, returns False.
• Ends with – Returns True if the string ends with a specified suffix. Otherwise, returns False.
• Is alphanumeric – Returns True if the string only contains numbers and letters. Otherwise, returns False.
• Is alpha (letters) – Returns True if the string only contains letters. Otherwise, returns False.
• Is digit – Returns True if the string only contains digits. Otherwise, returns False.
• Is space – Returns True if the string only contains white space. Otherwise, returns False.
• Is title – Returns True if the string is titlecased, that is, each word starts with an uppercase letter. Otherwise, returns False.
• Is lowercase – Returns True if the string only contains lowercase letters. Otherwise, returns False.
• Is uppercase – Returns True if the string only contains uppercase letters. Otherwise, returns False.
• Is numeric – Returns True if the string only contains numbers. Otherwise, returns False.
• Is decimal – Returns True if the string only contains decimal numbers. Otherwise, returns False.

Unnest JSON Data


If you have a .csv file, you might have values in your dataset that are JSON strings. Similarly, you might
have nested data in columns of either a Parquet file or a JSON document.

Use the Flatten structured operator to separate the first level keys into separate columns. A first level
key is a key that isn't nested within a value.

For example, you might have a dataset that has a person column with demographic information on each
person stored as JSON strings. A JSON string might look like the following.

{"seq": 1, "name": {"first": "Nathaniel", "last": "Ferguson"}, "age": 59, "city": "Posbotno", "state": "WV"}

The Flatten structured operator converts the following first level keys into additional columns in your
dataset:

• seq
• name
• age


• city
• state

Data Wrangler puts the values of the keys as values under the columns. The following shows the column
names and values of the JSON.

seq, name, age, city, state
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV

For each value in your dataset containing JSON, the Flatten structured operator creates columns for the
first-level keys. To create columns for nested keys, call the operator again. For the preceding example,
calling the operator creates the columns:

• name_first
• name_last

The following example shows the dataset that results from calling the operation again.

seq, name, age, city, state, name_first, name_last
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV, Nathaniel, Ferguson

Choose Keys to flatten on to specify the first-level keys that you want to extract as separate columns. If
you don't specify any keys, Data Wrangler extracts all the keys by default.
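Flattening first-level keys resembles pandas json_normalize with max_level=0. The sketch below uses the example record above and is illustrative only.

import json
import pandas as pd

record = '{"seq": 1, "name": {"first": "Nathaniel", "last": "Ferguson"}, "age": 59, "city": "Posbotno", "state": "WV"}'

# max_level=0 flattens only the first-level keys, like a single call to the
# Flatten structured operator; the nested name object stays in one column.
first_level = pd.json_normalize(json.loads(record), max_level=0)

# max_level=1 also expands the nested keys, producing columns such as
# name_first and name_last.
nested = pd.json_normalize(json.loads(record), max_level=1, sep="_")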

Explode Array
Use Explode array to expand the values of the array into separate output rows. For example, the
operation can take each value in the array [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and create a new column with the
following rows:

[1, 2, 3]
[4, 5, 6]
[7, 8, 9]

Data Wrangler names the new column, input_column_name_flatten.

You can call the Explode array operation multiple times to get the nested values of the array into
separate output columns. The following example shows the result of calling the operation multiple times
on a dataset with a nested array.

Putting the values of a nested array into separate columns

Original dataset:

id   array
1    [[cat, dog], [bat, frog]]
2    [[rose, petunia], [lily, daisy]]

After calling Explode array once:

id   array_items
1    [cat, dog]
1    [bat, frog]
2    [rose, petunia]
2    [lily, daisy]

After calling Explode array a second time:

id   array_items_items
1    cat
1    dog
1    bat
1    frog
2    rose
2    petunia
2    lily
2    daisy
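The operation resembles PySpark's explode function, applied once per nesting level. A minimal sketch using the example above; it is illustrative only.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [["cat", "dog"], ["bat", "frog"]]),
     (2, [["rose", "petunia"], ["lily", "daisy"]])],
    ["id", "array"],
)

# One explode per nesting level, mirroring repeated use of Explode array.
level_1 = df.select("id", explode(col("array")).alias("array_items"))
level_2 = level_1.select("id", explode(col("array_items")).alias("array_items_items"))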

Transform Image Data


Use Data Wrangler to import and transform the images that you're using for your machine learning (ML)
pipelines. After you've prepared your image data, you can export it from your Data Wrangler flow to your
ML pipeline.

You can use the information provided here to familiarize yourself with importing and transforming
image data in Data Wrangler. Data Wrangler uses OpenCV to import images. For more information about
supported image formats, see Image file reading and writing.

After you've familiarized yourself with the concepts of transforming your image data, go through the
following tutorial, Prepare image data with Amazon SageMaker Data Wrangler.

The following industries and use cases are examples where applying machine learning to transformed
image data can be useful:

• Manufacturing – Identifying defects in items from the assembly line


• Food – Identifying spoiled or rotten food
• Medicine – Identifying lesions in tissues

When you work with image data in Data Wrangler, you go through the following process:

1. Import – Select the images by choosing the directory containing them in your Amazon S3 bucket.
2. Transform – Use the built-in transformations to prepare the images for your machine learning
pipeline.
3. Export – Export the images that you’ve transformed to a location that can be accessed from the
pipeline.

Use the following procedure to import your image data.

To import your image data

1. Navigate to the Create connection page.


2. Choose Amazon S3.
3. Specify the Amazon S3 file path that contains the image data.
4. For File type, choose Image.
5. (Optional) Choose Import nested directories to import images from multiple Amazon S3 paths.
6. Choose Import.


Data Wrangler uses the open-source imgaug library for its built-in image transformations. You can use
the following built-in transformations:

• ResizeImage
• EnhanceImage
• CorruptImage
• SplitImage
• DropCorruptedImages
• DropImageDuplicates
• Brightness
• ColorChannels
• Grayscale
• Rotate

Use the following procedure to transform your images without writing code.

To transform the image data without writing code

1. From your Data Wrangler flow, choose the + next to the node representing the images that you've
imported.
2. Choose Add transform.
3. Choose Add step.
4. Choose the transform and configure it.
5. Choose Preview.
6. Choose Add.

In addition to using the transformations that Data Wrangler provides, you can also use your own
custom code snippets. For more information about using custom code snippets, see Custom
Transforms (p. 1065). You can import the OpenCV and imgaug libraries within your code snippets and
use the transforms associated with them. The following is an example of a code snippet that detects
edges within the images.

# A table with your image data is stored in the `df` variable

import cv2
import numpy as np
from pyspark.sql.functions import column

from sagemaker_dataprep.compute.operators.transforms.image.constants import (
    DEFAULT_IMAGE_COLUMN,
    IMAGE_COLUMN_TYPE,
)
from sagemaker_dataprep.compute.operators.transforms.image.decorators import (
    BasicImageOperationDecorator,
    PandasUDFOperationDecorator,
)


@BasicImageOperationDecorator
def my_transform(image: np.ndarray) -> np.ndarray:
    # To use the code snippet on your image data, modify the following lines
    # within the function. The two thresholds control the hysteresis step of
    # the Canny edge detector.
    HYST_THRLD_1, HYST_THRLD_2 = 100, 200
    edges = cv2.Canny(image, HYST_THRLD_1, HYST_THRLD_2)
    return edges


# Wrap the per-image function as a pandas UDF so it can run on the dataframe.
@PandasUDFOperationDecorator(IMAGE_COLUMN_TYPE)
def custom_image_udf(image_row):
    return my_transform(image_row)


df = df.withColumn(DEFAULT_IMAGE_COLUMN, custom_image_udf(column(DEFAULT_IMAGE_COLUMN)))

When you apply transformations in your Data Wrangler flow, Data Wrangler applies them only to a
sample of the images in your dataset. To optimize your experience with the application, Data Wrangler
doesn't apply the transforms to all of your images.

To apply the transformations to all of your images, export your Data Wrangler flow to an Amazon S3
location. You can use the images that you've exported in your training or inference pipelines. Use a
destination node or a Jupyter Notebook to export your data. You can access either method for exporting
your data from the Data Wrangler flow. For information about using these methods, see Export to
Amazon S3 (p. 1118).

Map Columns for Amazon Personalize


Data Wrangler integrates with Amazon Personalize, a fully managed machine learning service that
generates item recommendations and user segments. You can use the Map columns for Amazon
Personalize transform to get your data into a format that Amazon Personalize can interpret. For more
information about the transforms specific to Amazon Personalize, see Importing data using Amazon
SageMaker Data Wrangler. For more information about Amazon Personalize, see What is Amazon
Personalize?

Analyze and Visualize


Amazon SageMaker Data Wrangler includes built-in analyses that help you generate visualizations and
data analyses in a few clicks. You can also create custom analyses using your own code.

You add an analysis to a dataframe by selecting a step in your data flow, and then choosing Add
analysis. To access an analysis you've created, select the step that contains the analysis, and select the
analysis.

All analyses are generated using 100,000 rows of your dataset.

You can add the following analysis to a dataframe:

• Data visualizations, including histograms and scatter plots.


• A quick summary of your dataset, including number of entries, minimum and maximum values (for
numeric data), and most and least frequent categories (for categorical data).
• A quick model of the dataset, which can be used to generate an importance score for each feature.
• A target leakage report, which you can use to determine if one or more features are strongly
correlated with your target feature.
• A custom visualization using your own code.

Use the following sections to learn more about these options.

Histogram
Use histograms to see the counts of feature values for a specific feature. You can inspect the
relationships between features using the Color by option. For example, the following histogram charts
the distribution of user ratings of the best-selling books on Amazon from 2009–2019, colored by genre.


You can use the Facet by feature to create histograms of one column, for each value in another column.
For example, the following diagram shows histograms of user reviews of best-selling books on Amazon,
faceted by year.

Scatter Plot
Use the Scatter Plot feature to inspect the relationship between features. To create a scatter plot, select
a feature to plot on the X axis and the Y axis. Both of these columns must be numeric typed columns.

You can color scatter plots by an additional column. For example, the following example shows a scatter
plot comparing the number of reviews against user ratings of top-selling books on Amazon between
2009 and 2019. The scatter plot is colored by book genre.


Additionally, you can facet scatter plots by features. For example, the following image shows an example
of the same review versus user rating scatter plot, faceted by year.

Table Summary
Use the Table Summary analysis to quickly summarize your data.

For columns with numerical data, including long and float data, a table summary reports the number of
entries (count), minimum (min), maximum (max), mean, and standard deviation (stddev) for each column.

For columns with non-numerical data, including columns with string, Boolean, or date/time data, a table
summary reports the number of entries (count), least frequent value (min), and most frequent value
(max).


Quick Model
Use the Quick Model visualization to quickly evaluate your data and produce importance scores for
each feature. A feature importance score indicates how useful a feature is at predicting a target
label. The feature importance score is in the range [0, 1], and a higher number indicates that the feature
is more important to the whole dataset. At the top of the quick model chart, there is a model score. A
classification problem shows an F1 score. A regression problem has a mean squared error (MSE) score.

When you create a quick model chart, you select a dataset you want evaluated, and a target label against
which you want feature importance to be compared. Data Wrangler does the following (a conceptual
sketch follows the list):

• Infers the data types for the target label and each feature in the dataset selected.
• Determines the problem type. Based on the number of distinct values in the label column, Data
Wrangler determines if this is a regression or classification problem type. Data Wrangler sets a
categorical threshold to 100. If there are more than 100 distinct values in the label column, Data
Wrangler classifies it as a regression problem; otherwise, it is classified as a classification problem.
• Pre-processes features and label data for training. The algorithm used requires encoding features to
vector type and encoding labels to double type.
• Trains a random forest algorithm with 70% of data. Spark’s RandomForestRegressor is used to train a
model for regression problems. The RandomForestClassifier is used to train a model for classification
problems.
• Evaluates a random forest model with the remaining 30% of data. Data Wrangler evaluates
classification models using an F1 score and evaluates regression models using an MSE score.
• Calculates feature importance for each feature using the Gini importance method.
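The following sketch mirrors that flow using scikit-learn rather than the Spark implementation that Data Wrangler uses; the file and label names are hypothetical placeholders.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# A dataset with numerically encoded features and a "label" target column.
df = pd.read_csv("my-dataset.csv")  # hypothetical file
X, y = df.drop(columns=["label"]), df["label"]

# Train on 70% of the data, evaluate on the remaining 30%.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("F1:", f1_score(y_test, model.predict(X_test), average="weighted"))

# Gini-based feature importances, comparable to the quick model scores.
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))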

The following image shows the user interface for the quick model feature.

Target Leakage
Target leakage occurs when there is data in a machine learning training dataset that is strongly
correlated with the target label, but is not available in real-world data. For example, you may have a
column in your dataset that serves as a proxy for the column you want to predict with your model.

When you use the Target Leakage analysis, you specify the following:

• Target: This is the feature about which you want your ML model to be able to make predictions.


• Problem type: This is the ML problem type on which you are working. Problem type can either be
classification or regression.
• (Optional) Max features: This is the maximum number of features to present in the visualization, which
shows features ranked by their risk of being target leakage.

For classification, the target leakage analysis uses the area under the receiver operating characteristic,
or AUC - ROC curve for each column, up to Max features. For regression, it uses a coefficient of
determination, or R2 metric.

The AUC-ROC curve provides a predictive metric, computed individually for each column using cross-
validation, on a sample of up to around 1000 rows. A score of 1 indicates perfect predictive abilities,
which often indicates target leakage. A score of 0.5 or lower indicates that the information in the
column could not provide, on its own, any useful information towards predicting the target. Although it
can happen that a column is uninformative on its own but is useful in predicting the target when used in
tandem with other features, a low score could indicate the feature is redundant.

For example, the following image shows a target leakage report for a diabetes classification problem,
that is, predicting if a person has diabetes or not. An AUC-ROC curve is used to calculate the predictive
ability of five features, and all are determined to be safe from target leakage.
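As a rough illustration of this per-column scoring, the following sketch computes a cross-validated AUC-ROC for each feature on its own. The estimator, the dataset file, and the target column name are assumptions for illustration, not the exact model that Data Wrangler uses; the features are assumed numeric and the target binary.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("example-dataset.csv")              # hypothetical dataset
sample = df.sample(n=min(1000, len(df)), random_state=0)
y = sample["target"]                                 # hypothetical binary target column

for col in sample.columns.drop("target"):
    X = sample[[col]].fillna(0)                      # score each column on its own
    auc = cross_val_score(
        DecisionTreeClassifier(max_depth=3), X, y, cv=3, scoring="roc_auc"
    ).mean()
    print(col, round(auc, 3))                        # scores near 1.0 suggest target leakage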

Multicollinearity
Multicollinearity is a circumstance where two or more predictor variables are related to each other. The
predictor variables are the features in your dataset that you're using to predict a target variable. When
you have multicollinearity, the predictor variables are not only predictive of the target variable, but also
predictive of each other.

You can use the Variance Inflation Factor (VIF), Principal Component Analysis (PCA), or Lasso feature
selection as measures for the multicollinearity in your data. For more information, see the following.

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) is a measure of collinearity among variable pairs. Data Wrangler
returns a VIF score as a measure of how closely the variables are related to each other. A VIF score is
a positive number that is greater than or equal to 1.

A score of 1 means that the variable is uncorrelated with the other variables. Scores greater than 1
indicate higher correlation.


Theoretically, you can have a VIF score with a value of infinity. Data Wrangler clips high scores to 50.
If you have a VIF score greater than 50, Data Wrangler sets the score to 50.

You can use the following guidelines to interpret your VIF scores:

• A VIF score less than or equal to 5 indicates that the variables are moderately correlated with the
other variables.
• A VIF score greater than 5 indicates that the variables are highly correlated with the other
variables.
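The following is a minimal sketch of how a VIF score can be computed and clipped at 50, assuming a DataFrame of numeric predictor columns; it isn't Data Wrangler's implementation.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_scores(df):
    scores = {}
    for col in df.columns:
        # Regress each variable on all the others; an R-squared close to 1
        # means the variable is nearly a linear combination of the rest.
        X, y = df.drop(columns=[col]), df[col]
        r2 = LinearRegression().fit(X, y).score(X, y)
        vif = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
        scores[col] = min(vif, 50.0)                 # clip high (or infinite) scores to 50
    return scores

df = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
print(vif_scores(df))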

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) measures the variance of the data along different directions in
the feature space. The feature space consists of all the predictor variables that you use to predict the
target variable in your dataset.

For example, if you're trying to predict who survived on the RMS Titanic after it hit an iceberg, your
feature space can include the passengers' age, gender, and the fare that they paid.

From the feature space, PCA generates an ordered list of variances. These variances are also known
as singular values. The values in the list of variances are greater than or equal to 0. We can use them
to determine how much multicollinearity there is in our data.

When the numbers are roughly uniform, the data has very few instances of multicollinearity. When
there is a lot of variability among the values, we have many instances of multicollinearity. Before it
performs PCA, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation
of 1.
Note
PCA in this circumstance can also be referred to as Singular Value Decomposition (SVD).
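A minimal sketch of this check, assuming a matrix of numeric predictors:

import numpy as np

X = np.random.rand(100, 4)                           # hypothetical predictor matrix
X = (X - X.mean(axis=0)) / X.std(axis=0)             # normalize: mean 0, standard deviation 1
singular_values = np.linalg.svd(X, compute_uv=False)
print(singular_values)                               # values near 0 indicate nearly collinear directions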
Lasso feature selection

Lasso feature selection uses the L1 regularization technique to only include the most predictive
features in your dataset.

For both classification and regression, the regularization technique generates a coefficient for
each feature. The absolute value of the coefficient provides an importance score for the feature. A
higher importance score indicates that the feature is more predictive of the target variable. A common
feature selection method is to use all the features that have a non-zero lasso coefficient.
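A minimal sketch of this selection method with scikit-learn, assuming numeric, scaled features; the regularization strength alpha is a hypothetical value.

import numpy as np
from sklearn.linear_model import Lasso

X = np.random.rand(100, 4)                           # hypothetical features
y = 2.0 * X[:, 0] + 0.1 * np.random.rand(100)        # hypothetical target
model = Lasso(alpha=0.01).fit(X, y)

# The absolute value of each coefficient is the feature's importance score.
importance = {f"feature_{i}": abs(w) for i, w in enumerate(model.coef_)}
selected = [name for name, w in importance.items() if w > 0]
print(selected)                                      # keep features with non-zero coefficients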

Detect Anomalies In Time Series Data


You can use the anomaly detection visualization to see outliers in your time series data. To understand
what determines an anomaly, you need to understand that we decompose the time series into a
predicted term and an error term. We treat the seasonality and trend of the time series as the predicted
term. We treat the residuals as the error term.

For the error term, you specify a threshold as the number of standard deviations that the residual can be
away from the mean for it to be considered an anomaly. For example, you can specify a threshold of
3 standard deviations. Any residual greater than 3 standard deviations away from the mean is an
anomaly.
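A minimal sketch of this residual-based rule, assuming statsmodels' STL implementation and a synthetic daily series with weekly seasonality:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

index = pd.date_range("2022-01-01", periods=365, freq="D")
values = np.sin(np.arange(365) * 2 * np.pi / 7) + np.random.randn(365) * 0.1
values[100] += 5                                     # inject an anomaly
ts = pd.Series(values, index=index)

resid = STL(ts, period=7).fit().resid                # the error term (residuals)
threshold = 3                                        # standard deviations from the mean
anomalies = ts[np.abs(resid - resid.mean()) > threshold * resid.std()]
print(anomalies)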

You can use the following procedure to perform an Anomaly detection analysis.

1. Open your Data Wrangler data flow.


2. In your data flow, under Data types, choose the +, and select Add analysis.
3. For Analysis type, choose Time Series.
4. For Visualization, choose Anomaly detection.


5. For Anomaly threshold, choose the threshold at which a value is considered an anomaly.
6. Choose Preview to generate a preview of the analysis.
7. Choose Add to add the transform to the Data Wrangler data flow.

Seasonal Trend Decomposition In Time Series Data


You can determine whether there's seasonality in your time series data by using the Seasonal Trend
Decomposition visualization. We use the STL (Seasonal Trend decomposition using LOESS) method
to perform the decomposition. We decompose the time series into its seasonal, trend, and residual
components. The trend reflects the long term progression of the series. The seasonal component is a
signal that recurs in a time period. After removing the trend and the seasonal components from the time
series, you have the residual.
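A minimal sketch of the decomposition, again assuming statsmodels' STL implementation and a synthetic series:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

index = pd.date_range("2022-01-01", periods=365, freq="D")
ts = pd.Series(np.sin(np.arange(365) * 2 * np.pi / 7) + 0.01 * np.arange(365), index=index)

result = STL(ts, period=7).fit()
# The series is split into its trend, seasonal, and residual components.
trend, seasonal, residual = result.trend, result.seasonal, result.resid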

You can use the following procedure to perform a Seasonal-Trend decomposition analysis.

1. Open your Data Wrangler data flow.


2. In your data flow, under Data types, choose the +, and select Add analysis.
3. For Analysis type, choose Time Series.
4. For Visualization, choose Seasonal-Trend decomposition.
5. For Anomaly threshold, choose the threshold at which a value is considered an anomaly.
6. Choose Preview to generate a preview of the analysis.
7. Choose Add to add the transform to the Data Wrangler data flow.

Bias Report
You can use the bias report in Data Wrangler to uncover potential biases in your data. To generate a bias
report, you must specify the target column, or Label, that you want to predict and a Facet, or the column
that you want to inspect for biases.

Label: The feature about which you want a model to make predictions. For example, if you are predicting
customer conversion, you may select a column containing data on whether or not a customer has placed
an order. You must also specify whether this feature is a label or a threshold. If you specify a label, you
must specify what a positive outcome looks like in your data. In the customer conversion example, a
positive outcome may be a 1 in the orders column, representing the positive outcome of a customer
placing an order within the last three months. If you specify a threshold, you must specify a lower bound
defining a positive outcome. For example, if your customer orders column contains the number of
orders placed in the last year, you may want to specify 1.

Facet: The column that you want to inspect for biases. For example, if you are trying to predict customer
conversion, your facet may be the age of the customer. You may choose this facet because you believe
that your data is biased toward a certain age group. You must identify whether the facet is measured as a
value or threshold. For example, if you wanted to inspect one or more specific ages, you select Value and
specify those ages. If you want to look at an age group, you select Threshold and specify the threshold
of ages you want to inspect.

After you select your feature and label, you select the types of bias metrics you want to calculate.

To learn more, see Generate reports for bias in pre-training data.
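As a rough illustration of what a pre-training bias metric measures, the following sketch computes class imbalance (CI) for a hypothetical facet column; the bias report computes metrics like this for you.

import pandas as pd

df = pd.DataFrame({"age_group": ["young", "old", "young", "young"]})  # hypothetical facet
n_a = (df["age_group"] == "young").sum()             # advantaged facet count
n_d = (df["age_group"] == "old").sum()               # disadvantaged facet count
ci = (n_a - n_d) / (n_a + n_d)                       # ranges from -1 to 1; 0 means balanced
print(ci)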

Create Custom Visualizations


You can add an analysis to your Data Wrangler flow to create a custom visualization. Your dataset, with
all the transformations you've applied, is available as a Pandas DataFrame. Data Wrangler uses the df
variable to store the dataframe. You access the dataframe by referencing the variable.


You must provide the output variable, chart, to store an Altair output chart. For example, you can use
the following code block to create a custom histogram using the Titanic dataset.

import altair as alt

# Plot the ages of the first 30 rows as a histogram, with a red rule
# marking the mean age.
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]
base = alt.Chart(df)
bar = base.mark_bar().encode(x=alt.X('value', bin=True, axis=None), y=alt.Y('count'))
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5))
chart = bar + rule  # Data Wrangler renders the chart stored in this variable

To create a custom visualization:

1. Next to the node containing the transformation that you'd like to visualize, choose the +.
2. Choose Add analysis.
3. For Analysis type, choose Custom Visualization.
4. For Analysis name, specify a name.
5. Enter your code in the code box.
6. Choose Preview to preview your visualization.
7. Choose Save to add your visualization.

If you don’t know how to use the Altair visualization package in Python, you can use custom code
snippets to help you get started.

Data Wrangler has a searchable collection of visualization snippets. To use a visualization snippet, choose
Search example snippets and specify a query in the search bar.


The following example uses the Binned scatterplot code snippet. It plots a binned histogram across two dimensions.

The snippets have comments to help you understand the changes that you need to make to the code.
You usually need to specify the column names of your dataset in the code.

import altair as alt

# Specify the number of top rows for plotting
rows_number = 1000
df = df.head(rows_number)
# You can also choose bottom rows or randomly sampled rows
# df = df.tail(rows_number)
# df = df.sample(rows_number)

chart = (
    alt.Chart(df)
    .mark_circle()
    .encode(
        # Specify the column names for binning and number of bins for X and Y axis
        x=alt.X("col1:Q", bin=alt.Bin(maxbins=20)),
        y=alt.Y("col2:Q", bin=alt.Bin(maxbins=20)),
        size="count()",
    )
)

# :Q specifies that the column has a quantitative type.
# For more details on Altair typing, refer to
# https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types

Reusing Data Flows for Different Datasets


For Amazon Simple Storage Service (Amazon S3) data sources, you can create and use parameters. A
parameter is a variable that you've saved in your Data Wrangler flow. Its value can be any portion of the
data source's Amazon S3 path. Use parameters to quickly change the data that you're importing into a
Data Wrangler flow or exporting to a processing job. You can also use parameters to select and import a
specific subset of your data.

After you've created a Data Wrangler flow, you might train a model on the data that you've
transformed. For datasets that have the same schema, you can use parameters to apply the same
transformations to a different dataset and train a different model. You can use the new datasets to
perform inference with your model, or to retrain your model.

In general, parameters have the following attributes:

• Name – The name you specify for the parameter


• Type – The type of value that the parameter represents
• Default value – The value of the parameter when you don't specify a new value

Note
Datetime parameters have a time range attribute that they use as the default value.

Data Wrangler uses curly braces, {{}}, to indicate that a parameter is being used in the Amazon S3 path.
For example, you can have a URL such as
s3://DOC-EXAMPLE-BUCKET1/{{example_parameter_name}}/example-dataset.csv.


You create a parameter when you're editing the Amazon S3 data source that you've imported. You can
set any portion of the file path to a parameter value. You can set the parameter value to either a value or
a pattern. The following are the available parameter value types in the Data Wrangler flow:

• Number
• String
• Pattern
• Datetime

Note
You can't create a pattern parameter or a datetime parameter for the name of the bucket in the
Amazon S3 path.

You must set a number as the default value of a number parameter. You can change the value of
the parameter to a different number when you're editing a parameter or when you're launching a
processing job. For example, in the S3 path, s3://DOC-EXAMPLE-BUCKET/example-prefix/
example-file-1.csv, you can create a number parameter named number_parameter in the place
of 1. Your S3 path now appears as s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-
{{number_parameter}}.csv. The path continues to point to the example-file-1.csv dataset
until you change the value of the parameter. If you change the value of number_parameter to 2, the
path is now s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-2.csv. You can import
example-file-2.csv into Data Wrangler if you've uploaded the file to that Amazon S3 location.

A string parameter stores a string as its default value. For example, in the S3 path, s3://DOC-EXAMPLE-
BUCKET/example-prefix/example-file-1.csv, you can create a string parameter named
string_parameter in the place of the filename, example-file-1.csv. The path now appears as
s3://DOC-EXAMPLE-BUCKET/example-prefix/{{string_parameter}}. It continues to match
s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-1.csv, until you change the value
of the parameter.

Instead of specifying the filename as a string parameter, you can create a string parameter using the
entire Amazon S3 path. You can specify a dataset from any Amazon S3 location in the string parameter.

A pattern parameter stores a regular expression (Python REGEX) string as its default value. You can use
a pattern parameter to import multiple data files at the same time. To import more than one object at a
time, specify a parameter value that matches the Amazon S3 objects that you're importing.

You can also create a pattern parameter for the following datasets:

• s3://DOC-EXAMPLE-BUCKET1/example-prefix/example-file-1.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/example-file-2.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/example-file-10.csv
• s3://DOC-EXAMPLE-BUCKET/example-prefix/example-file-0123.csv

For s3://DOC-EXAMPLE-BUCKET1/example-prefix/example-file-1.csv, you can create a pattern
parameter in the place of 1, and set the default value of the parameter to \d+. The \d+ REGEX string
matches one or more decimal digits. If you create a pattern parameter named pattern_parameter,
your S3 path appears as s3://DOC-EXAMPLE-BUCKET1/example-prefix/
example-file-{{pattern_parameter}}.csv.

You can also use pattern parameters to match all CSV objects within your bucket. To match all objects
in a bucket, create a pattern parameter with the default value of .* and set the path to s3://DOC-
EXAMPLE-BUCKET/{{pattern_parameter}}.csv. The .* pattern matches any string of characters in
the path.


The s3://DOC-EXAMPLE-BUCKET/{{pattern_parameter}}.csv path can match the following
datasets:

• example-file-1.csv
• other-example-file.csv
• example-file-a.csv

A datetime parameter stores the following information:

• A format for parsing strings inside an Amazon S3 path.
• A relative time range to limit the datetime values that match.

For example, in the Amazon S3 file path s3://DOC-EXAMPLE-BUCKET/2020/01/01/example-
dataset.csv, 2020/01/01 represents a datetime in the format of year/month/day. You can set the
parameter’s time range to an interval such as 1 years or 24 hours. An interval of 1 years matches
all S3 paths with datetimes that fall between the current time and the time exactly a year before the
current time. The current time is the time when you start exporting the transformations that you've
made to the data. For more information about exporting data, see Export (p. 1116). If the current date is
2022/01/01 and the time range is 1 years, the S3 path matches datasets such as the following:

• s3://DOC-EXAMPLE-BUCKET/2021/01/01/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET/2021/06/30/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET/2021/12/31/example-dataset.csv

The datetime values within a relative time range change as time passes. The S3 paths that fall within the
relative time range might also differ.
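The following sketch illustrates in plain Python how a yyyy/MM/dd parameter with a one-year range could match such keys. The matching logic is an assumption for illustration, not Data Wrangler's implementation.

from datetime import datetime, timedelta

def in_time_range(key, fmt="%Y/%m/%d", days=365):
    # Parse the leading year/month/day portion of a key such as
    # "2021/06/30/example-dataset.csv".
    stamp = datetime.strptime("/".join(key.split("/")[:3]), fmt)
    now = datetime.now()
    return now - timedelta(days=days) <= stamp <= now

print(in_time_range("2021/06/30/example-dataset.csv"))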

For the Amazon S3 file path s3://DOC-EXAMPLE-BUCKET1/20200101/example-dataset.csv,
20200101 is an example of a timestamp that can become a datetime parameter.

To view a table of all parameters that you've created in your Data Wrangler flow, choose the {{}} to the
right of the text box containing the Amazon S3 path. If you no longer need a parameter that you've
created, you can edit or delete it by choosing the icons to the right of the parameter.
Important
Before you delete a parameter, make sure that you haven't used it anywhere in your Data
Wrangler flow. Deleted parameters that are still within the flow cause errors.

You can create parameters for any step of your Data Wrangler flow. You can edit or delete any parameter
that you create. If you're applying transformations to data that is no longer relevant to your use case,
you can modify the values of parameters. Modifying the values of the parameters changes the data that
you're importing.

The following sections provide additional examples and general guidance on using parameters. You can
use the sections to understand the parameters that work best for you.
Note
The following sections contain procedures that use the Data Wrangler interface to override the
parameters and create a processing job.
You can also override the parameters by using the following procedures.

To export your Data Wrangler flow and override the value of a parameter, do the following.

1. Choose the + next to the node that you want to export.


2. Choose Export to.
3. Choose the location where you're exporting the data.


4. Under parameter_overrides, specify different values for the parameters that you've
created.
5. Run the Jupyter Notebook.

Applying a Data Wrangler flow to files using patterns


You can use parameters to apply transformations in your Data Wrangler flow to different files that match
a pattern in the Amazon S3 URI path. This helps you specify the files in your S3 bucket that you want
to transform with high specificity. For example, you might have a dataset with the path s3://DOC-
EXAMPLE-BUCKET1/example-prefix-0/example-prefix-1/example-prefix-2/example-
dataset.csv. Different datasets named example-dataset.csv are stored under many different
example prefixes. The prefixes might also be numbered sequentially. You can create patterns for the
numbers in the Amazon S3 URI. Pattern parameters use REGEX to select any number of files that match
the pattern of the expression. The following are REGEX patterns that might be useful:

• .* – Matches zero or more of any character, except newline characters


• .+ – Matches one or more of any character, excluding newline characters
• \d+ – Matches one or more of any decimal digit
• \w+ – Matches one or more of any alphanumeric character
• [abc_-]{2,4} – Matches a string of two, three, or four characters composed of the characters
provided within the brackets (the hyphen is listed last so that it's matched literally)
• abc|def – Matches one string or another. For example, the operation matches either abc or def.

You can replace each number in the following paths with a single parameter that has a value of \d+.

• s3://DOC-EXAMPLE-BUCKET1/example-prefix-3/example-prefix-4/example-prefix-5/
example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix-8/example-prefix-12/example-
prefix-13/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix-4/example-prefix-9/example-
prefix-137/example-dataset.csv
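The following sketch shows how such a \d+ pattern selects matching keys, using Python's re module; the key list is hypothetical.

import re

# The parameterized path with \d+ substituted for each number.
pattern = re.compile(
    r"example-prefix-\d+/example-prefix-\d+/example-prefix-\d+/example-dataset\.csv"
)
keys = [
    "example-prefix-3/example-prefix-4/example-prefix-5/example-dataset.csv",
    "example-prefix-8/example-prefix-12/example-prefix-13/example-dataset.csv",
    "other-prefix/example-dataset.csv",              # does not match
]
print([k for k in keys if pattern.fullmatch(k)])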

The following procedure creates a pattern parameter for a dataset with the path s3://DOC-EXAMPLE-
BUCKET1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv.

To create a pattern parameter, do the following.

1. Next to the dataset that you've imported, choose Edit dataset.


2. Highlight the 0 in example-prefix-0.
3. Specify values for the following fields:

• Name – A name for the parameter
• Type – Pattern
• Value – \d+, a regular expression that matches one or more digits
4. Choose Create.
5. Replace the 1 and the 2 in S3 URI path with the parameter. The path should have the following
format: s3://DOC-EXAMPLE-BUCKET1/example-prefix-{{example_parameter_name}}/
example-prefix-{{example_parameter_name}}/example-prefix-
{{example_parameter_name}}/example-dataset.csv

The following is a general procedure for creating a pattern parameter.


1. Navigate to your Data Wrangler flow.


2. Next to the dataset that you've imported, choose Edit dataset.
3. Highlight the portion of the URI that you're using as the value of the pattern parameter.
4. Choose Create custom parameter.
5. Specify values for the following fields:

• Name – A name for the parameter


• Type – Pattern
• Value – A regular expression containing the pattern that you'd like to store.
6. Choose Create.

Applying a Data Wrangler flow to files using numeric values


You can use parameters to apply transformations in your Data Wrangler flow to different files that have
similar paths. For example, you might have a dataset with the path s3://DOC-EXAMPLE-BUCKET1/
example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv.

You might have the transformations from your Data Wrangler flow that you've applied to datasets under
example-prefix-1. You might want to apply the same transformations to example-dataset.csv
that falls under example-prefix-10 or example-prefix-20.

You can create a parameter that stores the value 1. If you want to apply the transformations to different
datasets, you can create processing jobs that replace the value of the parameter with a different value.
The parameter acts as a placeholder for you to change when you want to apply the transformations
from your Data Wrangler flow to new data. You can override the value of the parameter when you create
a Data Wrangler processing job to apply the transformations in your Data Wrangler flow to different
datasets.

Use the following procedure to create numeric parameters for s3://DOC-EXAMPLE-BUCKET1/


example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv.

To create parameters for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.


2. Next to the dataset that you've imported, choose Edit dataset.
3. Highlight the number in an example prefix of example-prefix-number.
4. Choose Create custom parameter.
5. For Name, specify a name for the parameter.
6. For Type, choose Number.
7. For Value, specify the number.
8. Create parameters for the remaining numbers by repeating the procedure.

After you've created the parameters, apply the transforms to your dataset and create a destination node
for them. For more information about destination nodes, see Export (p. 1116).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different
dataset. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a numeric parameter in a Data Wrangler processing job, do the following.

1. From your Data Wrangler flow, choose Create job.
2. Select only the destination node that contains the transformations to the dataset containing the
numeric parameters.
3. Choose Configure job.


4. Choose Parameters.
5. Choose the name of a parameter that you've created.
6. Change the value of the parameter.
7. Repeat the procedure for the other parameters.
8. Choose Run.

Applying a Data Wrangler flow to files using strings


You can use parameters to apply transformations in your Data Wrangler flow to different files that have
similar paths. For example, you might have a dataset with the path s3://DOC-EXAMPLE-BUCKET1/
example-prefix/example-dataset.csv.

You might have transformations from your Data Wrangler flow that you've applied to datasets under
example-prefix. You might want to apply the same transformations to example-dataset.csv
under another-example-prefix or example-prefix-20.

You can create a parameter that stores the value example-prefix. If you want to apply the
transformations to different datasets, you can create processing jobs that replace the value of the
parameter with a different value. The parameter acts as a placeholder for you to change when you want
to apply the transformations from your Data Wrangler flow to new data. You can override the value of
the parameter when you create a Data Wrangler processing job to apply the transformations in your
Data Wrangler flow to different datasets.

Use the following procedure to create a string parameter for s3://DOC-EXAMPLE-BUCKET1/example-


prefix/example-dataset.csv.

To create a parameter for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.


2. Next to the dataset that you've imported, choose Edit dataset.
3. Highlight the example prefix, example-prefix.
4. Choose Create custom parameter.
5. For Name, specify a name for the parameter.
6. For Type, choose String.
7. For Value, specify the prefix.

After you've created the parameter, apply the transforms to your dataset and create a destination node
for them. For more information about destination nodes, see Export (p. 1116).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different
dataset. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a string parameter in a Data Wrangler processing job, do the following:

1. From your Data Wrangler flow, choose Create job.
2. Select only the destination node that contains the transformations to the dataset containing the
string parameter.
3. Choose Configure job.
4. Choose Parameters.
5. Choose the name of a parameter that you've created.
6. Change the value of the parameter.
7. Repeat the procedure for the other parameters.
8. Choose Run.


Applying a Data Wrangler flow to different datetime ranges


Use datetime parameters to apply transformations in your Data Wrangler flow to different time ranges.
Highlight the portion of the Amazon S3 URI that has a timestamp and create a parameter for it.
When you create a parameter, you specify a time range from the current time to a time in the past.
For example, you might have an Amazon S3 URI that looks like the following: s3://DOC-EXAMPLE-
BUCKET1/example-prefix/2022/05/15/example-dataset.csv. You can save 2022/05/15 as a
datetime parameter. If you specify a year as the time range, the time range includes the moment that
you run the processing job containing the datetime parameter and the time exactly one year ago. If the
moment you're running the processing job is September 6th, 2022 or 2022/09/06, the time ranges can
include the following:

• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2022/03/15/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2022/01/08/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2022/07/31/example-dataset.csv
• s3://DOC-EXAMPLE-BUCKET1/example-prefix/2021/09/07/example-dataset.csv

The transformations in the Data Wrangler flow apply to all of the preceding prefixes. Changing the value
of the parameter in the processing job doesn't change the value of the parameter in the Data Wrangler
flow. To apply the transformations to datasets within a different time range, do the following:

1. Create a destination node containing all the transformations that you'd like to use.
2. Create a Data Wrangler job.
3. Configure the job to use a different time range for the parameter. Changing the value of the
parameter in the processing job doesn't change the value of the parameter in the Data Wrangler flow.

For more information about destination nodes and Data Wrangler jobs, see Export (p. 1116).

The following procedure creates a datetime parameter for the Amazon S3 path: s3://DOC-EXAMPLE-
BUCKET1/example-prefix/2022/05/15/example-dataset.csv.

To create a datetime parameter for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.


2. Next to the dataset that you've imported, choose Edit dataset.
3. Highlight the portion of the URI that you're using as the value of the datetime parameter.
4. Choose Create custom parameter.
5. For Name, specify a name for the parameter.
6. For Type, choose Datetime.
Note
By default, Data Wrangler selects Predefined, which provides a dropdown menu for you
to select a date format. However, the timestamp format that you're using might not be
available. Instead of using Predefined as the default option, you can choose Custom and
specify the timestamp format manually.
7. For Date format, open the dropdown menu following Predefined and choose yyyy/MM/dd. The
format, yyyy/MM/dd, corresponds to the year/month/day of the timestamp.
8. For Timezone, choose a time zone.
Note
The data that you're analyzing might have time stamps taken in a different time zone from
your time zone. Make sure that the time zone that you select matches the time zone of the
data.
9. For Time range, specify the time range for the parameter.


10. (Optional) Enter a description to describe how you're using the parameter.
11. Choose Create.

After you've created the datetime parameters, apply the transforms to your dataset and create a
destination node for them. For more information about destination nodes, see Export (p. 1116).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different
time range. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a datetime parameter in a Data Wrangler processing job, do the following:

1. From your Data Wrangler flow, choose Create job.


2. Select only the destination node that contains the transformations to the dataset containing the
datetime parameters.
3. Choose Configure job.
4. Choose Parameters.
5. Choose the name of the datetime parameter that you've created.
6. For Time range, change the time range for the datasets.
7. Choose Run.

Export
In your Data Wrangler flow, you can export some or all of the transformations that you've made to your
data processing pipelines.

A Data Wrangler flow is the series of data preparation steps that you've performed on your data. In your
data preparation, you perform one or more transformations to your data. Each transformation is done
using a transform step. The flow has a series of nodes that represent the import of your data and the
transformations that you've performed. For an example of nodes, see the following image.


The preceding image shows a Data Wrangler flow with two nodes. The Source - sampled node shows the
data source from which you've imported your data. The Data types node indicates that Data Wrangler
has performed a transformation to convert the dataset into a usable format.

Each transformation that you add to the Data Wrangler flow appears as an additional node. For
information on the transforms that you can add, see Transform Data (p. 1058). The following image
shows a Data Wrangler flow that has a Rename-column node to change the name of a column in a
dataset.

You can export your data transformations to the following:

• Amazon S3
• SageMaker Pipelines
• Amazon SageMaker Feature Store
• Python Code

Important
We recommend that you use the AmazonSageMakerFullAccess IAM managed policy to grant
permissions to use Data Wrangler. If you don't use the managed policy, you can use an IAM
policy that gives Data Wrangler access to an Amazon S3 bucket. For more information on the
policy, see Security and Permissions (p. 1141).

When you export your data flow, you're charged for the AWS resources that you use. You can use cost
allocation tags to organize and manage the costs of those resources. You create these tags for your user
profile, and Data Wrangler automatically applies them to the resources used to export the data flow. For
more information, see Using Cost Allocation Tags.


Export to Amazon S3
Data Wrangler gives you the ability to export your data to a location within an Amazon S3 bucket. You
can specify the location using one of the following methods:

• Destination node – Where Data Wrangler stores the data after it has processed it.
• Export to – Exports the data resulting from a transformation to Amazon S3.
• Export data – For small datasets, quickly exports the data that you've transformed.

Use the following sections to learn more about each of these methods.

Destination Node

If you want to output a series of data processing steps that you've performed to Amazon S3, you
create a destination node. A destination node tells Data Wrangler where to store the data after
you've processed it. After you create a destination node, you create a processing job to output the
data. A processing job is an Amazon SageMaker processing job. When you're using a destination
node, it runs the computational resources needed to output the data that you've transformed to
Amazon S3.

You can use a destination node to export some of the transformations or all of the transformations
that you've made in your Data Wrangler flow.

You can use multiple destination nodes to export different transformations or sets of
transformations. The following example shows two destination nodes in a single Data Wrangler flow.

You can use the following procedure to create destination nodes and export them to an Amazon S3
bucket.

To export your data flow, you create destination nodes and a Data Wrangler job to export the data.
Creating a Data Wrangler job starts a SageMaker processing job to export your flow. You can choose
the destination nodes that you want to export after you've created them.


Note
You can choose Create job in the Data Wrangler flow to view the instructions to use a
processing job.

Use the following procedure to create destination nodes.

1. Choose the + next to the nodes that represent the transformations that you want to export.
2. Choose Add destination.

3. Choose Amazon S3.


4. Specify the following fields.

• Dataset name – The name that you specify for the dataset that you're exporting.
• File type – The format of the file that you're exporting.
• Delimiter (CSV and Parquet files only) – The character used to separate values.
• Compression (CSV and Parquet files only) – The compression method used to reduce the file
size. You can use the following compression methods:
• bzip2
• deflate
• gzip
• (Optional) Amazon S3 location – The S3 location that you're using to output the files.
• (Optional) Number of partitions – The number of partitions or files that the processing job
writes as its output.
• (Optional) Partition by column – Writes all data with the same unique value from the column.
• (Optional) Inference Parameters – Selecting Generate inference artifact applies all of the
transformations you've used in the Data Wrangler flow to data coming into your inference
pipeline. The model in your pipeline makes predictions on the transformed data.
5. Choose Add destination.

Use the following procedure to create a processing job.

Create a job from the Data flow page and choose the destination nodes that you want to export.
Note
You can choose Create job in the Data Wrangler flow to view the instructions for creating a
processing job.

1. Choose Create job. The following image shows the pane that appears after you select Create
job.

2. For Job name, specify the name of the export job.


3. Choose the destination nodes that you want to export.


4. (Optional) Specify an AWS KMS key ARN. An AWS KMS key is a cryptographic key that you can use
to protect your data. For more information about AWS KMS keys, see AWS Key Management
Service.
5. (Optional) Under Trained parameters, choose Refit if you've done the following:

• Sampled your dataset


• Applied a transform that uses your data to create a new column in the dataset

For more information about refitting the transformations you've made to an entire dataset, see
Refit Transforms to The Entire Dataset and Export Them (p. 1132).
Note
For image data, Data Wrangler exports the transformations that you've made to all of
the images. Refitting the transformations isn't applicable to your use case.
6. Choose Configure job. The following image shows the Configure job page.

7. (Optional) Configure the Data Wrangler job. You can make the following configurations:

• Job configuration
• Spark memory configuration


• Network configuration
• Tags
• Parameters
• Associate Schedules
8. Choose Run.

Export to

As an alternative to using a destination node, you can use the Export to option to export your Data
Wrangler flow to Amazon S3 using a Jupyter notebook. You can choose any data node in your Data
Wrangler flow and export it. Exporting the data node exports the transformation that the node
represents and the transformations that precede it.

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler
flow to Amazon S3.

1. Choose the + next to the node that you want to export.


2. Choose Export to.
3. Choose Amazon S3 (via Jupyter Notebook).
4. Run the Jupyter notebook.

When you run the notebook, it exports your data flow (.flow file) in the same AWS Region as the
Data Wrangler flow.

The notebook provides options that you can use to configure the processing job and the data that it
outputs.
Important
We provide you with job configurations to configure the output of your data. For the
partitioning and driver memory options, we strongly recommend that you don't specify a
configuration unless you already have knowledge about them.

Under Job Configurations, you can configure the following:

• output_content_type – The content type of the output file. Uses CSV as the default format,
but you can specify Parquet.


• delimiter – The character used to separate values in the dataset when writing to a CSV file.
• compression – If set, compresses the output file. Uses gzip as the default compression format.
• num_partitions – The number of partitions or files that Data Wrangler writes as the output.
• partition_by – The names of the columns that you use to partition the output.

To change the output file format from CSV to Parquet, change the value from "CSV" to "Parquet".
For the rest of the preceding fields, uncomment the lines containing the fields that you want to
specify.

Under (Optional) Configure Spark Cluster Driver Memory you can configure Spark properties for
the job, such as the Spark driver memory, in the config dictionary.

The following shows the config dictionary.

import json

# driver_memory_in_mb is assumed to be defined earlier in the notebook.
config = json.dumps({
    "Classification": "spark-defaults",
    "Properties": {
        "spark.driver.memory": f"{driver_memory_in_mb}m",
    }
})

To apply the configuration to the processing job, uncomment the following lines:

# data_sources.append(ProcessingInput(
#     source=config_s3_uri,
#     destination="/opt/ml/processing/input/conf",
#     input_name="spark-config",
#     s3_data_type="S3Prefix",
#     s3_input_mode="File",
#     s3_data_distribution_type="FullyReplicated"
# ))

Export data

If you have a transformation on a small dataset that you want to export quickly, you can use the
Export data method. When you choose Export data, Data Wrangler works synchronously to
export the data that you've transformed to Amazon S3. You can't use Data Wrangler until either it
finishes exporting your data or you cancel the operation.

For information on using the Export data method in your Data Wrangler flow, see the following
procedure.

To use the Export data method:

1. Choose a node in your Data Wrangler flow by opening (double-clicking on) it.


2. Configure how you want to export the data.


3. Choose Export data.

When you export your data flow to an Amazon S3 bucket, Data Wrangler stores a copy of the flow file
in the S3 bucket. It stores the flow file under the data_wrangler_flows prefix. If you use the default
Amazon S3 bucket to store your flow files, it uses the following naming convention:
sagemaker-region-account number. For example, if your account number is 111122223333 and
you are using Studio in us-east-1, your imported datasets are stored in
sagemaker-us-east-1-111122223333. In this example, your .flow files created in us-east-1 are
stored in s3://sagemaker-region-account number/data_wrangler_flows/.

Export to SageMaker Pipelines


When you want to build and deploy large-scale machine learning (ML) workflows, you can use
SageMaker Pipelines to create workflows that manage and deploy SageMaker jobs. With SageMaker
Pipelines, you can build workflows that manage your SageMaker data preparation, model training,
and model deployment jobs. You can use the first-party algorithms that SageMaker offers by using
SageMaker Pipelines. For more information on SageMaker Pipelines, see SageMaker Pipelines.

When you export one or more steps from your data flow to SageMaker Pipelines, Data Wrangler creates
a Jupyter notebook that you can use to define, instantiate, run, and manage a pipeline.

Use a Jupyter Notebook to Create a Pipeline


Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler
flow to SageMaker Pipelines.

1. Choose the + next to the node that you want to export.


2. Choose Export to.
3. Choose SageMaker Pipelines (via Jupyter Notebook).
4. Run the Jupyter notebook.


You can use the Jupyter notebook that Data Wrangler produces to define a pipeline. The pipeline
includes the data processing steps that are defined by your Data Wrangler flow.

You can add additional steps to your pipeline by adding steps to the steps list in the following code in
the notebook:

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=[step_process],  # Add more steps to this list to run in your Pipeline
)

For more information on defining pipelines, see Define SageMaker Pipeline.
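After you define the pipeline, you can run it from the same notebook. The following is a minimal sketch; it assumes that role holds a SageMaker execution role ARN, which the generated notebook typically defines for you.

pipeline.upsert(role_arn=role)                       # create or update the pipeline definition
execution = pipeline.start()                         # start a pipeline execution
execution.wait()                                     # block until the execution finishes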

Export to an Inference Endpoint


Use your Data Wrangler flow to process data at the time of inference. You can create a SageMaker serial
inference pipeline from your Data Wrangler flow and a machine learning model. You can either use your
own model or use the notebook we provide to train one with Amazon SageMaker Autopilot or XGBoost
using the data that you've transformed in your Data Wrangler flow.

The pipeline provides the ability to perform either batch or real-time inference. You can also add the
Data Wrangler flow to SageMaker Model Registry. For more information about hosting models, see Host
multiple models in one container behind one endpoint (p. 2205).

When you export one or more steps from your data flow to an inference endpoint, Data Wrangler creates
a Jupyter notebook that you can use to define, instantiate, run, and manage the inference pipeline.

Use a Jupyter Notebook to create an inference endpoint


Use the following procedure to export your Data Wrangler flow to create an inference pipeline.

To create an inference pipeline using a Jupyter notebook, do the following.

1. Choose the + next to the node that you want to export.


2. Choose Export to.


3. Choose SageMaker Inference Pipeline (via Jupyter Notebook).
4. Run the Jupyter notebook.

Export to Python Code


To export all steps in your data flow to a Python file that you can manually integrate into any data
processing workflow, use the following procedure.

1. Choose the + next to the node that you want to export.


2. Choose Export to.
3. Choose Python Code.
4. Run the Jupyter notebook.

You might need to configure the Python script to make it run in your pipeline. For example, if you're
running a Spark environment, make sure that you are running the script from an environment that has
permission to access AWS resources.

Export to Amazon SageMaker Feature Store


You can use Data Wrangler to export features you've created to Amazon SageMaker Feature Store. A
feature is a column in your dataset. Feature Store is a centralized store for features and their associated
metadata. You can use Feature Store to create, share, and manage curated data for machine learning
(ML) development. Centralized stores make your data more discoverable and reusable. For more
information about Feature Store, see Amazon SageMaker Feature Store.

A core concept in Feature Store is a feature group. A feature group is a collection of features, their
records (observations), and associated metadata. It's similar to a table in a database.

You can use Data Wrangler to do one of the following:


• Update an existing feature group with new records. A record is an observation in the dataset.
• Create a new feature group from a node in your Data Wrangler flow. Data Wrangler adds the
observations from your datasets as records in your feature group.

If you're updating an existing feature group, your dataset's schema must match the schema of the
feature group. All the records in the feature group are replaced with the observations in your dataset.

You can use either a Jupyter notebook or a destination node to update your feature group with the
observations in the dataset.

If your feature groups with the Iceberg table format have a custom offline store encryption key, make
sure you grant the IAM role that you're using for the Amazon SageMaker Processing job permissions to
use it. At a minimum, you must grant it permissions to encrypt the data that you're writing to Amazon
S3. To grant the permissions, give the IAM role the ability to use the GenerateDataKey operation. For
more information about granting IAM roles permissions to use AWS KMS keys, see
https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html.

Destination Node

If you want to output a series of data processing steps that you've performed to a feature group, you
can create a destination node. When you create and run a destination node, Data Wrangler updates
a feature group with your data. You can also create a new feature group from the destination node
UI. After you create a destination node, you create a processing job to output the data. A processing
job is an Amazon SageMaker processing job. When you're using a destination node, it runs the
computational resources needed to output the data that you've transformed to the feature group.

You can use a destination node to export some of the transformations or all of the transformations
that you've made in your Data Wrangler flow.

Use the following procedure to create a destination node to update a feature group with the
observations from your dataset.

To update a feature group using a destination node, do the following.


Note
You can choose Create job in the Data Wrangler flow to view the instructions for using a
processing job to update the feature group.

1. Choose the + symbol next to the node containing the dataset that you'd like to export.
2. Under Add destination, choose SageMaker Feature Store.


3. Choose (double-click) the feature group. Data Wrangler checks whether the schema of the
feature group matches the schema of the data that you're using to update the feature group.
4. (Optional) Select Export to offline store only for feature groups that have both an online store
and an offline store. This option only updates the offline store with observations from your
dataset.
5. After Data Wrangler validates the schema of your dataset, choose Add.

Use the following procedure to create a new feature group with data from your dataset.

You can store your feature group in one of the following ways:

• Online – Low-latency, high-availability cache for a feature group that provides real-time lookup of
records. The online store allows quick access to the latest value for a record in a feature group.
• Offline – Stores data for your feature group in an Amazon S3 bucket. You can store your data
offline when you don't need low-latency (sub-second) reads. You can use an offline store for
features used in data exploration, model training, and batch inference.
• Both online and offline – Stores your data in both an online store and an offline store.

To create a feature group using a destination node, do the following.

1. Choose the + symbol next to the node containing the dataset that you'd like to export.
2. Under Add destination, choose SageMaker Feature Store.
3. Choose Create Feature Group.
4. In the following dialog box, if your dataset doesn't have an event time column, select Create
"EventTime" column.
5. Choose Next.
6. Choose Copy JSON Schema. When you create a feature group, you paste the schema into the
feature definitions.
7. Choose Create.
8. For Feature group name, specify a name for your feature group.
9. For Description (optional), specify a description to make your feature group more discoverable.
10. To create a feature group for an online store, do the following.

a. Select Enable storage online.


b. For Online store encryption key, specify an AWS managed encryption key or an encryption
key of your own.
11. To create a feature group for an offline store, do the following.

a. Select Enable storage offline.
b. Specify values for the following fields:

• S3 bucket name – The name of the Amazon S3 bucket that stores the feature group.
• (Optional) Dataset directory name – The Amazon S3 prefix that you're using to store the
feature group.
• IAM Role ARN – The IAM role that has access to Feature Store.
• Table Format – Table format of your offline store. You can specify Glue or Iceberg. Glue
is the default format.
• Offline store encryption key – By default, Feature Store uses an AWS Key Management
Service managed key, but you can use the field to specify a key of your own.
12. Choose Continue.
13. Choose JSON.
14. Remove the placeholder brackets in the window.
15. Paste the JSON text from Step 6.
16. Choose Continue.
17. For RECORD IDENTIFIER FEATURE NAME, choose the column in your dataset that has unique
identifiers for each record in your dataset.
18. For EVENT TIME FEATURE NAME, choose the column with the timestamp values.
19. Choose Continue.
20. (Optional) Add tags to make your feature group more discoverable.
21. Choose Continue.
22. Choose Create feature group.
23. Navigate back to your Data Wrangler flow and choose the refresh icon next to the Feature
Group search bar.

Note
If you've already created a destination node for a feature group within a flow, you can't
create another destination node for the same feature group. If you want to create another
destination node for the same feature group, you must create another flow file.

Use the following procedure to create a Data Wrangler job.

Create a job from the Data flow page and choose the destination nodes that you want to export.

1. Choose Create job. The following image shows the pane that appears after you select Create
job.
2. For Job name, specify the name of the export job.
3. Choose the destination nodes that you want to export.
4. (Optional) For Output KMS Key, specify an ARN, ID, or alias of an AWS KMS key. A KMS key is
a cryptographic key. You can use the key to encrypt the output data from the job. For more
information about AWS KMS keys, see AWS Key Management Service.
5. The following image shows the Configure job page with the Job configuration tab open.


(Optional) Under Trained parameters, choose Refit if you've done the following:

• Sampled your dataset


• Applied a transform that uses your data to create a new column in the dataset

For more information about refitting the transformations you've made to an entire dataset, see
Refit Transforms to The Entire Dataset and Export Them (p. 1132).
6. Choose Configure job.
7. (Optional) Configure the Data Wrangler job. You can make the following configurations:

• Job configuration
• Spark memory configuration
• Network configuration
• Tags
• Parameters
• Associate Schedules
8. Choose Run.


Jupyter notebook

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler
flow to Amazon SageMaker Feature Store.

1. Choose the + next to the node that you want to export.


2. Choose Export to.
3. Choose Amazon SageMaker Feature Store (via Jupyter Notebook).
4. Run the Jupyter notebook.

Running a Jupyter notebook runs a Data Wrangler job. Running a Data Wrangler job starts a
SageMaker processing job. The processing job ingests the flow into an online and offline feature
store.
Important
The IAM role you use to run this notebook must have the following
AWS managed policies attached: AmazonSageMakerFullAccess and
AmazonSageMakerFeatureStoreAccess.

You need to enable at least one store, online or offline, when you create a feature group. You can
also enable both. To disable online store creation, set EnableOnlineStore to False:

# Online Store Configuration
online_store_config = {
    "EnableOnlineStore": False
}

The notebook uses the column names and types of the dataframe you export to create a feature
group schema, which is used to create a feature group. A feature group is a group of features
defined in the feature store to describe a record. The feature group defines the schema and features
contained in the feature group. A feature group definition is composed of a list of features, a record
identifier feature name, an event time feature name, and configurations for its online store and
offline store.


Each feature in a feature group can have one of the following types: String, Fractional, or Integral. If
a column in your exported dataframe is not one of these types, it defaults to String.

The following is an example of a feature group schema.

column_schema = [
    {
        "name": "Height",
        "type": "long"
    },
    {
        "name": "Input",
        "type": "string"
    },
    {
        "name": "Output",
        "type": "string"
    },
    {
        "name": "Sum",
        "type": "string"
    },
    {
        "name": "Time",
        "type": "string"
    }
]

Additionally, you must specify a record identifier name and event time feature name:

• The record identifier name is the name of the feature whose value uniquely identifies a record
defined in the feature store. Only the latest record per identifier value is stored in the online store.
The record identifier feature name must be one of the feature definitions' names.
• The event time feature name is the name of the feature that stores the EventTime of a record
in a feature group. An EventTime is a point in time when a new event occurs that corresponds
to the creation or update of a record in a feature. All records in the feature group must have a
corresponding EventTime.

The notebook uses these configurations to create a feature group, process your data at scale, and
then ingest the processed data into your online and offline feature stores. To learn more, see Data
Sources and Ingestion.


Refit Transforms to The Entire Dataset and Export Them


When you import data, Data Wrangler uses a sample of the data to apply the encodings. By default, Data
Wrangler uses the first 50,000 rows as a sample, but you can import the entire dataset or use a different
sampling method. For more information, see Import (p. 991).

The following transformations use your data to create a column in the dataset:

• Encode Categorical (p. 1072)


• Featurize Text (p. 1075)
• Handle Outliers (p. 1088)
• Handle Missing Values (p. 1090)


If you used sampling to import your data, the preceding transforms only use the data from the sample
to create the column. The transform might not have used all of the relevant data. For example, if you use
the Encode Categorical transform, there might have been a category in the entire dataset that wasn't
present in the sample.

You can either use a destination node or a Jupyter notebook to refit the transformations to the entire
dataset. When Data Wrangler exports the transformations in the flow, it creates a SageMaker processing
job. When the processing job finishes, Data Wrangler saves the following files in either the default
Amazon S3 location or an S3 location that you specify:

• The Data Wrangler flow file that specifies the transformations that are refit to the dataset
• The dataset with the refit transformations applied to it

You can open a Data Wrangler flow file within Data Wrangler and apply the transformations to a
different dataset. For example, if you've applied the transformations to a training dataset, you can open
and use the Data Wrangler flow file to apply the transformations to a dataset used for inference.

For information about using destination nodes to refit transforms and export them, see the following pages:

• Export to Amazon S3 (p. 1118)


• Export to Amazon SageMaker Feature Store (p. 1126)

To run a Jupyter notebook that refits the transformations and exports your Data Wrangler flow, do the
following.

1. Choose the + next to the node that you want to export.


2. Choose Export to.
3. Choose the location to which you're exporting the data.
4. For the refit_trained_params object, set refit to True (see the example cell after this procedure).
5. For the output_flow field, specify the name of the output flow file with the refit transformations.
6. Run the Jupyter notebook.
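The following is a minimal sketch of what the refit cell can look like; the generated notebook defines the actual object, and the output flow file name here is a hypothetical placeholder.

refit_trained_params = {
    "refit": True,                     # refit trained parameters on the entire dataset
    "output_flow": "refit-flow.flow",  # hypothetical name for the output flow file
}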

Create a Schedule to Automatically Process New Data


If you're processing data periodically, you can create a schedule to run the processing job automatically.
For example, you can create a schedule that runs a processing job automatically when you get new data.
For more information about processing jobs, see Export to Amazon S3 (p. 1118) and Export to Amazon
SageMaker Feature Store (p. 1126).

When you create a job, you must specify an IAM role that has permissions to create the job. By default,
the IAM role that you use to access Data Wrangler is the SageMakerExecutionRole.

The following permissions allow Data Wrangler to access EventBridge and allow EventBridge to run
processing jobs:

• Add the following AWS Managed policy to the Amazon SageMaker Studio execution role that provides
Data Wrangler with permissions to use EventBridge:

arn:aws:iam::aws:policy/AmazonEventBridgeFullAccess

For more information about the policy, see AWS managed policies for EventBridge.


• Add the following policy to the IAM role that you specify when you create a job in Data Wrangler:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sagemaker:StartPipelineExecution",
"Resource": "arn:aws:sagemaker:Region:AWS-account-id:pipeline/data-wrangler-
*"
}
]
}

If you're using the default IAM role, you add the preceding policy to the Amazon SageMaker Studio
execution role.

Add the following trust policy to the role to allow EventBridge to assume it.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "events.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Important
When you create a schedule, Data Wrangler creates an event rule in EventBridge. You incur
charges for both the event rules that you create and the instances used to run the processing
job.
For information about EventBridge pricing, see Amazon EventBridge pricing. For information
about processing job pricing, see Amazon SageMaker Pricing.

You can set a schedule using one of the following methods:

• CRON expressions
Note
Data Wrangler doesn't support the following expressions:
• The L, W, and # special characters
• Abbreviations for days
• Abbreviations for months
• RATE expressions
• Recurring – Set an hourly or daily interval to run the job.
• Specific time – Set specific days and times to run the job.

The following sections provide procedures on creating jobs.

CRON

Use the following procedure to create a schedule with a CRON expression.


To specify a schedule with a CRON expression, do the following.

1. Open your Data Wrangler flow.


2. Choose Create job.
3. (Optional) For Output KMS key, specify an AWS KMS key to configure the output of the job.
4. Choose Next, 2. Configure job.
5. Select Associate Schedules.
6. Choose Create a new schedule.
7. For Schedule Name, specify the name of the schedule.
8. For Run Frequency, choose CRON.
9. Specify a valid CRON expression (see the example after this procedure).
10. Choose Create.
11. (Optional) Choose Add another schedule to run the job on an additional schedule.
Note
You can associate a maximum of two schedules. The schedules are independent and
don't affect each other unless the times overlap.
12. Choose one of the following:

• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on
the schedules.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.
13. Choose Run.
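For step 9, you need a valid CRON expression. As an example, assuming the six-field EventBridge format (minutes, hours, day-of-month, month, day-of-week, year), the following expression runs the job at 06:00 UTC every day; remember that abbreviations for days and months aren't supported.

0 6 * * ? *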

RATE

Use the following procedure to create a schedule with a RATE expression.

To specify a schedule with a RATE expression, do the following.

1. Open your Data Wrangler flow.


2. Choose Create job.
3. (Optional) For Output KMS key, specify an AWS KMS key to configure the output of the job.
4. Choose Next, 2. Configure job.
5. Select Associate Schedules.
6. Choose Create a new schedule.
7. For Schedule Name, specify the name of the schedule.
8. For Run Frequency, choose Rate.
9. For Value, specify an integer.
10. For Unit, select one of the following:

• Minutes
• Hours
• Days
11. Choose Create.
12. (Optional) Choose Add another schedule to run the job on an additional schedule.
Note
You can associate a maximum of two schedules. The schedules are independent and
don't affect each other unless the times overlap.
13. Choose one of the following:


• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on
the schedules.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.
14. Choose Run.

Recurring

Use the following procedure to create a schedule that runs a job on a recurring basis.

To specify a recurring schedule, do the following.

1. Open your Data Wrangler flow.


2. Choose Create job.
3. (Optional) For Output KMS key, specify an AWS KMS key to configure the output of the job.
4. Choose Next, 2. Configure job.
5. Select Associate Schedules.
6. Choose Create a new schedule.
7. For Schedule Name, specify the name of the schedule.
8. For Run Frequency, make sure Recurring is selected by default.
9. For Every x hours, specify the hourly frequency that the job runs during the day. Valid values are
integers from 1 through 23, inclusive.
10. For On days, select one of the following options:

• Every Day
• Weekends
• Weekdays
• Select Days

• (Optional) If you've selected Select Days, choose the days of the week to run the job.

Note
The schedule resets every day. If you schedule a job to run every five hours, it runs at
the following times during the day:

• 00:00
• 05:00
• 10:00
• 15:00
• 20:00
11. Choose Create.
12. (Optional) Choose Add another schedule to run the job on an additional schedule.
Note
You can associate a maximum of two schedules. The schedules are independent and
don't affect each other unless the times overlap.
13. Choose one of the following:

• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on
the schedules.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.


14. Choose Run.

Specific time

Use the following procedure to create a schedule that runs a job at specific times.

To specify a schedule that runs at specific times, do the following.

1. Open your Data Wrangler flow.


2. Choose Create job.
3. (Optional) For Output KMS key, specify an AWS KMS key to configure the output of the job.
4. Choose Next, 2. Configure job.
5. Select Associate Schedules.
6. Choose Create a new schedule.
7. For Schedule Name, specify the name of the schedule.
8. Choose Create.
9. (Optional) Choose Add another schedule to run the job on an additional schedule.
Note
You can associate a maximum of two schedules. The schedules are independent and
don't affect each other unless the times overlap.
10. Choose one of the following:

• Schedule and run now – Data Wrangler runs the job immediately and subsequently runs it on
the schedules.
• Schedule only – Data Wrangler runs the job only on the schedules that you specify.
11. Choose Run.

You can use Amazon SageMaker Studio to view the jobs that are scheduled to run. Your processing jobs
run within SageMaker Pipelines. Each processing job has its own pipeline and runs as a processing step
within that pipeline. You can view the schedules that you've created within a pipeline. For information
about viewing a pipeline, see View a Pipeline (p. 2792).

Use the following procedure to view the jobs that you've scheduled.

To view the jobs you've scheduled, do the following.

1. Open Amazon SageMaker Studio.


2. Open SageMaker Pipelines.
3. View the pipelines for the jobs that you've created.

The pipeline running the job uses the job name as a prefix. For example, if you've created a job
named housing-data-feature-engineering, the name of the pipeline is data-wrangler-
housing-data-feature-engineering.
4. Choose the pipeline containing your job.
5. View the status of the pipelines. Pipelines with a Status of Succeeded have run the processing job
successfully.
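If you prefer to list these pipelines programmatically, the following is a minimal sketch using boto3; it relies on the data-wrangler- prefix described in step 3.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Scheduled Data Wrangler jobs run in pipelines prefixed with "data-wrangler-".
response = sagemaker_client.list_pipelines(PipelineNamePrefix="data-wrangler-")
for pipeline in response["PipelineSummaries"]:
    print(pipeline["PipelineName"])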

To stop a processing job from running, delete the event rule that specifies the schedule. Deleting an
event rule stops all the jobs associated with the schedule from running. For information about deleting a
rule, see Disabling or deleting an Amazon EventBridge rule.
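The following is a minimal programmatic sketch of deleting a schedule's event rule with boto3; the rule name is a hypothetical placeholder.

import boto3

events_client = boto3.client("events")

rule_name = "data-wrangler-example-schedule"  # hypothetical rule name

# A rule can't be deleted while it still has targets, so remove the targets first.
targets = events_client.list_targets_by_rule(Rule=rule_name)["Targets"]
if targets:
    events_client.remove_targets(Rule=rule_name, Ids=[t["Id"] for t in targets])
events_client.delete_rule(Name=rule_name)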


You can stop and delete the pipelines associated with the schedules as well. For information about
stopping a pipeline, see StopPipelineExecution. For information about deleting a pipeline, see
DeletePipeline.

Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Notebook to Get Data Insights
Use the Data Wrangler data preparation widget to interact with your data, get visualizations, explore
actionable insights, and fix data quality issues.

You can access the data preparation widget from an Amazon SageMaker Studio notebook. For each
column, the widget creates a visualization that helps you better understand its distribution. If a column
has data quality issues, a warning appears in its header.

To see the data quality issues, select the column header showing the warning. You can use the
information that you get from the insights and the visualizations to apply the widget's built-in
transformations to help you fix the issues.

For example, the widget might detect that you have a column that only has one unique value and show
you a warning. The warning provides the option to drop the column from the dataset.

Getting started with running the widget


Use the following information to help you get started with running a notebook.

Open a notebook in Amazon SageMaker Studio. For information about opening a notebook, see Create
or Open an Amazon SageMaker Studio Notebook (p. 148).
Important
To run the widget, the notebook must use one of the following images:

• Python 3 (Data Science) with Python 3.7


• Python 3 (Data Science 2.0) with Python 3.8
• Python 3 (Data Science 3.0) with Python 3.10
• SparkAnalytics 1.0
• SparkAnalytics 2.0

For more information about images, see Available Amazon SageMaker Images (p. 164).

Use the following code to import the data preparation widget and pandas. The widget uses pandas
dataframes to analyze your data.

import pandas as pd
import sagemaker_datawrangler  # importing this module activates the widget for pandas dataframes

The following example code loads a file into the dataframe called df.

df = pd.read_csv("example-dataset.csv")

You can use a dataset in any format that you can load as a pandas dataframe object. For more
information about pandas formats, see IO tools (text, CSV, HDF5, …).
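For example, a Parquet file loads the same way (the file name here is a hypothetical placeholder):

df = pd.read_parquet("example-dataset.parquet")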

The following cell runs the df variable to start the widget.


df

The top of the dataframe has the following options:

• View the Pandas table – Switches between the interactive visualization and a pandas table.
• Use all of the rows in your dataset to compute the insights. Using the entire dataset might increase
the time it takes to generate the insights. – If you don't select the option, Data Wrangler computes
the insights for the first 10,000 rows of the dataset.

The dataframe shows the first 1000 rows of the dataset. Each column header has a stacked bar chart
that shows the column's characteristics. It shows the proportion of valid values, invalid values, and
missing values. You can hover over the different portions of the stacked bar chart to get the calculated
percentages.

Each column has a visualization in the header. The following shows the types of visualizations the
columns can have:

• Categorical – Bar chart


• Numeric – Histogram
• Datetime – Bar chart
• Text – Bar chart

For each visualization, the data preparation widget highlights outliers in orange.

When you choose a column, it opens a side panel. The side panel shows you the Insights tab. The pane
provides a count for the following types of values:

• Invalid values – Values whose type doesn’t match the column type.
• Missing values – Values that are missing, such as NaN or None.
• Valid values – Values that are neither missing nor invalid.

For numeric columns, the Insights tab shows the following summary statistics:

• Minimum – The smallest value.


• Maximum – The largest value.
• Mean – The mean of the values.
• Mode – The value that appears most frequently.
• Standard deviation – The standard deviation of the values.

For categorical columns, the Insights tab shows the following summary statistics:

• Unique values – The number of unique values in the column.


• Top – The value that appears most frequently.

The columns that have warning icons in their headers have data quality issues. Choosing a column opens
a Data quality tab that you can use to find transforms to help you fix the issue. A warning has one of the
following severity levels:

• Low – Issues that might not affect your analysis, but can be useful to fix.
• Medium – Issues that are likely to affect your analysis, but are likely not critical to fix.
• High – Severe issues that we strongly recommend fixing.


Note
The widget sorts the column to show the values that have data quality issues at the top of the
dataframe. It also highlights the values that are causing the issues. The color of the highlighting
corresponds to the severity level.

Under SUGGESTED TRANSFORMS, you can choose a transform to fix the data quality issue. The widget
can offer multiple transforms that can fix the issue. It can offer recommendations for the transforms that
are best suited to the problem. You can move your cursor over the transform to get more information
about it.

To apply a transform to the dataset, choose Apply and export code. The transform modifies the
dataset and updates the visualization with modified values. The code for the transform appears in the
following cell of the notebook. If you apply additional transforms to the dataset, the widget appends the
transforms to the cell. You can use the code that the widget generates to do the following:

• Customize it to better fit your needs.


• Use it in your own workflows.

You can reproduce all the transforms you've made by rerunning all of the cells in the notebook.
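For example, applying a Drop column suggested transform appends pandas code similar to the following to the notebook. The column name is a hypothetical placeholder, and the widget's actual output may differ.

# Pandas code for a "Drop column" transform (hypothetical example)
output_df = df.copy(deep=True)
output_df = output_df.drop(columns=["constant_column"])  # hypothetical column name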

The widget can provide insights and warnings for the target column. The target column is the column
that you're trying to predict. Use the following procedure to get target column insights.

To get target column insights, do the following.

1. Choose the column that you're using as the target column.


2. Choose Select as target column.
3. Choose the problem type. The widget's insights and warnings are tailored to the problem types. The
following are the problem types:

• Classification – The target column has categorical data.


• Regression – The target column has numeric data.
4. Choose Run.
5. (Optional) Under Target Column Insights, choose one of the suggested transforms.

Reference for the insights and transforms in the widget


For feature columns (columns that aren't the target column), you can get the following insights to warn
you about issues with your dataset.

• Missing values – The column has missing values such as None, NaN (not a number), or NaT (not a
timestamp). Many machine learning algorithms don’t support missing values in the input data. Filling
them in or dropping the rows with missing data is therefore a crucial data preparation step. If you see
the missing values warning, you can use one of the following transforms to correct the issue.
• Drop missing – Drops rows with missing values. We recommend dropping rows when the percentage
of rows with missing data is small and imputing the missing values isn't appropriate.
• Replace with new value – Replaces textual missing values with Other. You can change Other to a
different value in the output code. Replaces numeric missing values with 0.
• Replace with mean – Replaces missing values with the mean of the column.
• Replace with median – Replaces missing values with the median of the column.
• Drop column – Drops the column with missing values from the dataset. We recommend dropping
the entire column when there's a high percentage of rows with missing data.
• Disguised missing values – The column has disguised missing values. A disguised missing value is a
value that isn't explicitly encoded as a missing value. For example, instead of using a NaN to indicate
a missing value, the value could be Placeholder. You can use one of the following transforms to
handle the missing values:
• Drop missing – Drops rows with missing values
• Replace with new value – Replaces textual missing values with Other. You can change Other to a
different value in the output code. Replaces numeric missing values with 0.
• Constant column – The column only has one value. It therefore has no predictive power. We strongly
recommend using the Drop column transform to drop the column from the dataset.
• ID column – The column has no repeating values. All of the values in the column are unique. They
might be either IDs or database keys. Without additional information, the column has no predictive
power. We strongly recommend using the Drop column transform to drop the column from the
dataset.
• High cardinality – The column has a high percentage of unique values. High cardinality limits the
predictive power of categorical columns. Examine the importance of the column in your analysis and
consider using the Drop column transform to drop it.

For the target column, you can get the following insights to warn you about issues with your dataset. You
can use the suggested transformation provided with the warning to correct the issue.

• Mixed data types in target (Regression) – There are some non-numeric values in the target column.
There might be data entry errors. We recommend removing the rows that have the values that can't be
converted.
• Frequent label – Certain values in the target column appear more frequently than what would
be normal in the context of regression. There might be an error in data collection or processing. A
frequently appearing category might indicate that either the value is used as a default value or that
it’s a placeholder for missing values. We recommend using the Replace with new value transform to
replace the missing values with Other.
• Too few instances per class – The target column has categories that appear rarely. Some of the
categories don't have enough rows for the target column to be useful. You can use one of the
following transforms:
• Drop rare target – Drops unique values with fewer than ten observations. For example, drops the
value cat if it appears nine times in the column.
• Replace rare target – Replaces categories that appear rarely in the dataset with the value Other.
• Classes too imbalanced (multi-class classification) – There are categories in the dataset that appear
much more frequently than the other categories. The class imbalance might affect prediction accuracy.
For the most accurate predictions possible, we recommend updating the dataset with rows that have
the categories that currently appear less frequently.
• Large amount of classes/too many classes – There's a large number of classes in the target column.
Having many classes might result in longer training times or poor predictive quality. We recommend
doing one of the following:
• Grouping some of the categories into their own category. For example, if six categories are closely
related, we recommend using a single category for them.
• Using an ML algorithm that's resilient to multiple categories.

Security and Permissions


When you query data from Athena or Amazon Redshift, the queried dataset is automatically stored in
the default SageMaker S3 bucket for the AWS Region in which you are using Studio. Additionally, when
you export a Jupyter Notebook from Amazon SageMaker Data Wrangler and run it, your data flows,
or .flow files, are saved to the same default bucket, under the prefix data_wrangler_flows.

For high-level security needs, you can configure a bucket policy that restricts the AWS roles that have
access to this default SageMaker S3 bucket. Use the following section to add this type of policy to an S3
bucket. To follow the instructions on this page, use the AWS Command Line Interface (AWS CLI). To learn
how, see Configuring the AWS CLI in the IAM User Guide.

Additionally, you need to grant each IAM role that uses Data Wrangler permissions to access required
resources. If you do not require granular permissions for the IAM role you use to access Data Wrangler,
you can add the IAM managed policy, AmazonSageMakerFullAccess, to an IAM role that you use to
create your Studio user. This policy grants you full permission to use Data Wrangler. If you require more
granular permissions, refer to the section, Grant an IAM Role Permission to Use Data Wrangler (p. 1143).

Add a Bucket Policy To Restrict Access to Datasets Imported to Data Wrangler
You can add a policy to the S3 bucket that contains your Data Wrangler resources using an Amazon S3
bucket policy. Resources that Data Wrangler uploads to your default SageMaker S3 bucket in the AWS
Region you are using Studio in include the following:

• Queried Amazon Redshift results. These are stored under the redshift/ prefix.
• Queried Athena results. These are stored under the athena/ prefix.
• The .flow files uploaded to Amazon S3 when you run an exported Jupyter Notebook Data Wrangler
produces. These are stored under the data_wrangler_flows/ prefix.

Use the following procedure to create an S3 bucket policy that you can add to restrict IAM role access to
that bucket. To learn how to add a policy to an S3 bucket, see How do I add an S3 Bucket policy?.

To set up a bucket policy on the S3 bucket that stores your Data Wrangler resources:

1. Configure one or more IAM roles that you want to be able to access Data Wrangler.
2. Open a command prompt or shell. For each role that you create, replace role-name with the name
of the role and run the following:

$ aws iam get-role --role-name role-name

In the response, you see a RoleId string which begins with AROA. Copy this string.
3. Add the following policy to the SageMaker default bucket in the AWS Region in which you are using
Data Wrangler. Replace region with the AWS Region in which the bucket is located, and account-
id with your AWS account ID. Replace the userId values starting with AROAEXAMPLEID with the IDs of
the AWS roles to which you want to grant permission to use Data Wrangler.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/",
"arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/*",
"arn:aws:s3:::sagemaker-region-account-id/athena",
"arn:aws:s3:::sagemaker-region-account-id/athena/*",
"arn:aws:s3:::sagemaker-region-account-id/redshift",
"arn:aws:s3:::sagemaker-region-account-id/redshift/*"

],
"Condition": {
"StringNotLike": {
"aws:userId": [

"AROAEXAMPLEID_1:*",
"AROAEXAMPLEID_2:*"
]
}
}
}
]
}

Create an Allowlist for Data Wrangler


Whenever a user starts running Data Wrangler from the Amazon SageMaker Studio user interface,
they make a call to the SageMaker application programming interface (API) to create a Data Wrangler
application.

Your organization might not provide your users with permissions to make those API calls by default.
To provide permissions, you must create and attach a policy to the user's IAM roles using the following
policy template: Data Wrangler Allow List Example.
Note
The preceding policy example only gives your users access to the Data Wrangler application.

For information about creating a policy, see Creating policies on the JSON tab. When you're creating a
policy, copy and paste the JSON policy from Data Wrangler Allow List Example in the JSON tab.
Important
Delete any IAM policies that prevent users from running the following operations:

• CreateApp
• DescribeApp

If you don't delete the policies, your users could still be affected by them.

After you've created the policy using the template, attach it to the IAM roles of your users. For
information about attaching a policy, see Adding IAM identity permissions (console).

Grant an IAM Role Permission to Use Data Wrangler


You can grant an IAM role permission to use Data Wrangler with the general IAM managed policy,
AmazonSageMakerFullAccess. This is a general policy that includes permissions required to use all
SageMaker services. This policy grants an IAM role full access to Data Wrangler. You should be aware of
the following when using AmazonSageMakerFullAccess to grant access to Data Wrangler:

• If you import data from Amazon Redshift, the Database User name must have the prefix
sagemaker_access.
• This managed policy only grants permission to access buckets with one of the following in the name:
SageMaker, Sagemaker, sagemaker, or aws-glue. If you want to use Data Wrangler to import from an
S3 bucket without these phrases in the name, refer to the last section on this page to learn how to
grant permission to an IAM entity to access your S3 buckets.

If you have high-security needs, you can attach the policies in this section to an IAM entity to grant
permissions required to use Data Wrangler.

If you have datasets in Amazon Redshift or Athena that an IAM role needs to import from Data Wrangler,
you must add a policy to that entity to access these resources. The following policies are the most
restrictive policies you can use to give an IAM role permission to import data from Amazon Redshift and
Athena.


To learn how to attach a custom policy to an IAM role, refer to Managing IAM policies in the IAM User
Guide.

Policy example to grant access to an Athena dataset import

The following policy assumes that the IAM role has permission to access the underlying S3 bucket where
data is stored through a separate IAM policy.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:StartQueryExecution",
"athena:StopQueryExecution"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateTable"
],
"Resource": [
"arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
"arn:aws:glue:*:*:table/sagemaker_featurestore/*",
"arn:aws:glue:*:*:catalog",
"arn:aws:glue:*:*:database/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:DeleteTable"
],
"Resource": [
"arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
"arn:aws:glue:*:*:catalog",
"arn:aws:glue:*:*:database/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:*:*:table/*",
"arn:aws:glue:*:*:catalog",
"arn:aws:glue:*:*:database/*"
]
},
{
"Effect": "Allow",
"Action": [

"glue:CreateDatabase",
"glue:GetDatabase"
],
"Resource": [
"arn:aws:glue:*:*:catalog",
"arn:aws:glue:*:*:database/sagemaker_featurestore",
"arn:aws:glue:*:*:database/sagemaker_processing",
"arn:aws:glue:*:*:database/default",
"arn:aws:glue:*:*:database/sagemaker_data_wrangler"
]
}
]
}

Policy example to grant access to an Amazon Redshift dataset import

The following policy grants permission to set up an Amazon Redshift connection to Data Wrangler using
database users that have the prefix sagemaker_access in the name. To grant permission to connect
using additional database users, add additional entries under "Resources" in the following policy. The
following policy assumes that the IAM role has permission to access the underlying S3 bucket where data
is stored through a separate IAM policy, if applicable.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"redshift-data:ExecuteStatement",
"redshift-data:DescribeStatement",
"redshift-data:CancelStatement",
"redshift-data:GetStatementResult",
"redshift-data:ListSchemas",
"redshift-data:ListTables"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"redshift:GetClusterCredentials"
],
"Resource": [
"arn:aws:redshift:*:*:dbuser:*/sagemaker_access*",
"arn:aws:redshift:*:*:dbname:*"
]
}
]
}

Policy to grant access to an S3 bucket

If your dataset is stored in Amazon S3, you can grant an IAM role permission to access this bucket with a
policy similar to the following. This example grants programmatic read-write access to the bucket named
test.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",

"Action": ["s3:ListBucket"],
"Resource": ["arn:aws:s3:::test"]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": ["arn:aws:s3:::test/*"]
}
]
}

To import data from Athena and Amazon Redshift, you must grant an IAM role permission to access
the following prefixes under the default Amazon S3 bucket in the AWS Region in which Data Wrangler
is being used: athena/, redshift/. If a default Amazon S3 bucket does not already exist in the AWS
Region, you must also give the IAM role permission to create a bucket in this Region.

Additionally, if you want the IAM role to be able to use the Amazon SageMaker Feature Store,
SageMaker Pipelines, and Data Wrangler job export options, you must grant access to the prefix
data_wrangler_flows/ in this bucket.

Data Wrangler uses the athena/ and redshift/ prefixes to store preview files and imported datasets.
To learn more, see Imported Data Storage (p. 1033).

Data Wrangler uses the data_wrangler_flows/ prefix to store .flow files when you run a Jupyter
Notebook exported from Data Wrangler. To learn more, see Export (p. 1116).

Use a policy similar to the following to grant the permissions described in the preceding paragraphs.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/",
"arn:aws:s3:::sagemaker-region-account-id/data_wrangler_flows/*",
"arn:aws:s3:::sagemaker-region-account-id/athena",
"arn:aws:s3:::sagemaker-region-account-id/athena/*",
"arn:aws:s3:::sagemaker-region-account-id/redshift",
"arn:aws:s3:::sagemaker-region-account-id/redshift/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:CreateBucket",
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::sagemaker-region-account-id"
},
{
"Effect": "Allow",
"Action": [
"s3:ListAllMyBuckets",
"s3:GetBucketLocation"

],
"Resource": "*"
}
]
}

You can also access data in your Amazon S3 bucket from another AWS account by specifying the Amazon
S3 bucket URI. To do this, the IAM policy that grants access to the Amazon S3 bucket in the other
account should use a policy similar to the following example, where BucketFolder is the specific
directory in the user's bucket UserBucket. This policy should be added to the user granting access to
their bucket for another user.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource": "arn:aws:s3:::UserBucket/BucketFolder/*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::UserBucket",
"Condition": {
"StringLike": {
"s3:prefix": [
"BucketFolder/*"
]
}
}
}
]
}

The user that is accessing the bucket (not the bucket owner) must add a policy similar to the following
example to their user. Note that AccountX and TestUser below refer to the bucket owner and their
user, respectively.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::AccountX:user/TestUser"
},
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::UserBucket/BucketFolder/*"
]
},
{

"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::AccountX:user/TestUser"
},
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::UserBucket"
]
}
]
}

Policy example to grant access to use SageMaker Studio

Use a policy like the following to create an IAM execution role that can be used to set up a Studio
instance.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreatePresignedDomainUrl",
"sagemaker:DescribeDomain",
"sagemaker:ListDomains",
"sagemaker:DescribeUserProfile",
"sagemaker:ListUserProfiles",
"sagemaker:*App",
"sagemaker:ListApps"
],
"Resource": "*"
}
]
}

Snowflake and Data Wrangler


All permissions for AWS resources are managed via your IAM role attached to your Studio instance.
The Snowflake administrator manages Snowflake-specific permissions, as they can grant granular
permissions and privileges to each Snowflake user. This includes databases, schemas, tables, warehouses,
and storage integration objects. You must ensure that the correct permissions are set up outside of Data
Wrangler.

Note that the Snowflake COPY INTO Amazon S3 command moves data from Snowflake to Amazon S3
over the public internet by default, but data in transit is secured using SSL. Data at rest in Amazon S3 is
encrypted with SSE-KMS using the default AWS KMS key.

With respect to Snowflake credentials storage, Data Wrangler does not store customer credentials. Data
Wrangler uses Secrets Manager to store the credentials in a secret and rotates secrets as part of a best
practice security plan. The Snowflake or Studio administrator needs to ensure that the data scientist’s
Studio execution role is granted permission to perform GetSecretValue on the secret storing the
credentials. If already attached to the Studio execution role, the AmazonSageMakerFullAccess policy
has the necessary permissions to read secrets created by Data Wrangler and secrets created by following
the naming and tagging convention in the instructions above. Secrets that do not follow the conventions
must be separately granted access. We recommend using Secrets Manager to prevent sharing credentials
over unsecured channels; however, note that a logged-in user can retrieve the plain-text password by
launching a terminal or Python notebook in Studio and then invoking API calls from the Secrets Manager
API.
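For example, a user with GetSecretValue permission could run code similar to the following in a Studio notebook; the secret name is a hypothetical placeholder.

import boto3

secrets_client = boto3.client("secretsmanager")

# Hypothetical secret name; Data Wrangler creates and names the actual secret.
response = secrets_client.get_secret_value(SecretId="example-data-wrangler-snowflake-secret")
print(response["SecretString"])  # plain-text credentials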


Data Encryption with AWS KMS


Within Data Wrangler, you can decrypt encrypted files and add them to your Data Wrangler flow. You
can also encrypt the output of the transforms using either a default AWS KMS key or one that you
provide.

You can import files if they have the following:

• server-side encryption
• SSE-KMS as the encryption type

To decrypt the file and import it to a Data Wrangler flow, you must add the SageMaker Studio user that
you're using as a key user of the AWS KMS key. To do this, follow the instructions in Allows key users to
use a customer managed key.

Amazon S3 customer managed key setup for Data Wrangler imported data storage
By default, Data Wrangler uses Amazon S3 buckets that have the following naming convention:
sagemaker-region-account-number. For example, if your account number is 111122223333
and you are using Studio in us-east-1, your imported datasets are stored with the following naming
convention: sagemaker-us-east-1-111122223333.

The following instructions explain how to set up a customer managed key for your default Amazon S3
bucket.

1. To enable server-side encryption and set up a customer managed key for your default S3 bucket, see
Using KMS Encryption.
2. After completing step 1, navigate to AWS KMS in the AWS Management Console. Find the customer
managed key that you selected in step 1 and add the Studio role as a key user. To do this, follow the
instructions in Allows key users to use a customer managed key.

Encrypting the Data That You Export


You can encrypt the data that you export using one of the following methods:

• Specifying that the objects in your Amazon S3 bucket use SSE-KMS encryption.
• Specifying an AWS KMS key to encrypt the data that you export from Data Wrangler.

On the Export data page, specify a value for the AWS KMS key ID or ARN.


For more information on using AWS KMS keys, see Protecting Data Using Server-Side Encryption with
AWS KMS keys Stored in AWS Key Management Service (SSE-KMS).

Amazon AppFlow Permissions


When you're performing a transfer, you must specify an IAM role that has permissions to perform the
transfer. You can use the same IAM role that has permissions to use Data Wrangler. By default, the IAM
role that you use to access Data Wrangler is the SageMakerExecutionRole.

The IAM role must have the following permissions:

• Permissions to Amazon AppFlow


• Permissions to the AWS Glue Data Catalog
• Permissions for AWS Glue to discover the data sources that are available

When you run a transfer, Amazon AppFlow stores metadata from the transfer in the AWS Glue Data
Catalog. Data Wrangler uses the metadata from the catalog to determine whether it's available for you
to query and import.

To add permissions to Amazon AppFlow, add the AmazonAppFlowFullAccess AWS managed policy
to the IAM role. For more information about adding policies, see Adding or removing IAM identity
permissions.

If you're transferring data to Amazon S3, you must also attach the following policy.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:GetBucketTagging",
"s3:ListBucketVersions",
"s3:CreateBucket",
"s3:ListBucket",
"s3:GetBucketPolicy",
"s3:PutEncryptionConfiguration",
"s3:GetEncryptionConfiguration",
"s3:PutBucketTagging",
"s3:GetObjectTagging",
"s3:GetBucketOwnershipControls",
"s3:PutObjectTagging",
"s3:DeleteObject",
"s3:DeleteBucket",
"s3:DeleteObjectTagging",
"s3:GetBucketPublicAccessBlock",
"s3:GetBucketPolicyStatus",
"s3:PutBucketPublicAccessBlock",
"s3:PutAccountPublicAccessBlock",
"s3:ListAccessPoints",
"s3:PutBucketOwnershipControls",
"s3:PutObjectVersionTagging",
"s3:DeleteObjectVersionTagging",
"s3:GetBucketVersioning",
"s3:GetBucketAcl",
"s3:PutObject",
"s3:GetObject",
"s3:GetAccountPublicAccessBlock",
"s3:ListAllMyBuckets",

"s3:GetAnalyticsConfiguration",
"s3:GetBucketLocation"
],
"Resource": "*"
}
]
}

To add AWS Glue permissions, add the AWSGlueConsoleFullAccess managed policy to the IAM role.
For more information about AWS Glue permissions with Amazon AppFlow, see [link-to-appflow-page].

Amazon AppFlow needs to access AWS Glue and Data Wrangler for you to import the data that you've
transferred. To grant Amazon AppFlow access, add the following trust policy to the IAM role.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:root",
"Service": [
"appflow.amazonaws.com"
]
},
"Action": "sts:AssumeRole"
}
]
}

To display the Amazon AppFlow data in Data Wrangler, add the following policy to the IAM role:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "glue:SearchTables",
"Resource": [
"arn:aws:glue:*:*:table/*/*",
"arn:aws:glue:*:*:database/*",
"arn:aws:glue:*:*:catalog"
]
}
]
}

Using Lifecycle Configurations in Data Wrangler


You might have an Amazon EC2 instance that is configured to run Kernel Gateway applications, but not
the Data Wrangler application. Kernel Gateway applications provide access to the environment and the
kernels that you use to run Studio notebooks and terminals. The Data Wrangler application is the UI
application that runs Data Wrangler. Amazon EC2 instances that aren't Data Wrangler instances require
a modification to their lifecycle configurations to run Data Wrangler. Lifecycle configurations are shell
scripts that automate the customization of your Amazon SageMaker Studio environment.

For more information about lifecycle configurations, see Use Lifecycle Configurations with Amazon
SageMaker Studio (p. 182).


The default lifecycle configuration for your instance doesn't support using Data Wrangler. You can make
the following modifications to the default configuration to use Data Wrangler with your instance.

#!/bin/bash
set -eux
STATUS=$(
python3 -c "import sagemaker_dataprep"
echo $?
)
if [ "$STATUS" -eq 0 ]; then
echo 'Instance is of Type Data Wrangler'
else
echo 'Instance is not of Type Data Wrangler'

# Replace this with the URL of your git repository


export REPOSITORY_URL="https://github.com/aws-samples/sagemaker-studio-lifecycle-config-
examples.git"

git -C /root clone $REPOSITORY_URL

fi

You can save the script as lifecycle_configuration.sh.

You attach the lifecycle configuration to your Studio domain or user profile. For more information
about creating and attaching a lifecycle configuration, see Creating and Associating a Lifecycle
Configuration (p. 183).
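As an illustration, the following sketch registers the script as a Studio lifecycle configuration using boto3; the configuration name is a hypothetical placeholder.

import base64

import boto3

sagemaker_client = boto3.client("sagemaker")

# The API expects the script content to be base64-encoded.
with open("lifecycle_configuration.sh", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

sagemaker_client.create_studio_lifecycle_config(
    StudioLifecycleConfigName="install-data-wrangler",  # hypothetical name
    StudioLifecycleConfigContent=content,
    StudioLifecycleConfigAppType="KernelGateway",
)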

You might run into errors when you're creating or attaching a lifecycle configuration. For information
about debugging lifecycle configuration errors, see KernelGateway App failure (p. 190).

Release Notes
Data Wrangler is regularly updated with new features and bug fixes. To upgrade the version of Data
Wrangler you are using in Studio, follow the instructions in Shut down and Update Studio Apps (p. 200).


4/18/2022

New functionality:

You can now get your data in a format that Amazon Personalize can interpret. For more information,
see Map Columns for Amazon Personalize (p. 1101).

3/1/2022

New functionality:

You can now use Hive to import your data from Amazon EMR. For more information, see Import data
from Amazon EMR (p. 1004).

12/10/2022

New functionality:

You can now export your Data Wrangler flow to an inference endpoint. For more information, see
Export to an Inference Endpoint (p. 1125).

New functionality:

You can now use an interactive notebook widget for data preparation. For more information, see
Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Notebook to Get Data
Insights (p. 1138).

New functionality:

You can now import data from SaaS platforms. For more information, see Import Data From Software
as a Service (SaaS) Platforms (p. 1030).

10/12/2022

New functionality:

You can now reuse data flows for different data sets. For more information, see Reusing Data Flows for
Different Datasets (p. 1109).

10/05/2022

New functionality:

You can now use Principal Component Analysis (PCA) as a transform. For more information, see Reduce
Dimensionality within a Dataset (p. 1071).

10/05/2022

New functionality:

You can now refit parameters in your Data Wrangler flow. For more information, see Export (p. 1116).

10/03/2022

New functionality:

You can now deploy models from your Data Wrangler flow. For more information, see Automatically
Train Models on Your Data Flow (p. 1057).

9/20/2022

New functionality:

You can now set data retention periods in Athena. For more information, see Import data from
Athena (p. 997).

6/9/2022

New functionality:

You can now use Amazon SageMaker Autopilot to train a model directly from your Data Wrangler flow.
For more information, see Automatically Train Models on Your Data Flow (p. 1057).

5/6/2022

New functionality:

You can now use additional m5 and r5 instances. For more information, see Instances (p. 1034).

4/27/2022

New functionalities:

• You can now get a data quality report. For more information, see Get Insights On Data and Data
Quality (p. 1045)
• You can now perform random sampling and stratified sampling. For more information, see
Sampling (p. 1092).

4/1/2022

New functionality:

You can now use Databricks as a data source. For more information, see Import data from Databricks
(JDBC) (p. 1011).

2/2/2022

New functionalities:

• You can now export using destination nodes. For more information, see Export (p. 1116)
• You can import ORC and JSON files. For more information about file types, see Import (p. 991).
• Data Wrangler now supports using the SMOTE transform. For more information, see Balance
Data (p. 1065).
• Data Wrangler now supports similarity encoding for categorical data. For more information, see
Similarity encode (p. 1074).
• Data Wrangler now supports unnesting JSON data. For more information, see Unnest JSON
Data (p. 1097).
• Data Wrangler now supports expanding the values of an array into separate columns. For more
information, see Explode Array (p. 1098).
• Data Wrangler now supports reaching out to the service team when you're having issues. For more
information, see Troubleshoot (p. 1156).
• Data Wrangler supports editing and deleting steps in your data flow. For more information, see
Delete a Step from Your Data Flow (p. 1038) and Edit a Step in Your Data Wrangler Flow (p. 1042).
• You can now perform transformations on multiple columns. For more information, see Transform
Data (p. 1058).
• Data Wrangler now supports cost allocation tags. For more information, see Using Cost Allocation
Tags.

10/16/2021

New functionality:

Data Wrangler now supports Athena workgroups. For more information, see Import data from
Athena (p. 997).

10/6/2021

New functionality:

Data Wrangler now supports transforming time series data. For more information, see Transform Time
Series (p. 1077).

7/15/2021

New functionalities:

• Snowflake and Data Wrangler (p. 1148) is now supported. You can use Snowflake as a data source in
Data Wrangler.
• Added support for custom field delimiter in CSV. Now comma, colon, semicolon, pipe (|) and Tab are
supported.
• Now you can export results directly to Amazon S3.
• Added a few new multicollinearity analyzers: Variance Inflation Factors, Principal Component
Analysis and Lasso feature selection.

Enhancements:

• Analysis charts no longer display overlapping labels.

Bug Fixes:

• The one-hot encoder now handles empty strings gracefully.


• Fixed crashes that occurred when a dataframe column name contained dots.

4/26/2021

Enhancements:

• Added support for distributed processing jobs. You can use multiple instances when running a
processing job.
• Data Wrangler processing jobs now automatically coalesce small outputs when the estimated result
size is less than 1 gigabyte.
• Feature Store notebook: improved Feature Store ingestion performance.
• Data Wrangler processing jobs now use 1.x as the authoritative container tag for future releases.

Bug Fixes:

• Fixed rendering issues for faceted histogram.


• Fixed Export to Processing Job to support vector type columns.
• Fixed Extract using regex operator to return the first captured group if one or more exists in
the regular expression or regex.

2/8/2021

New Functionalities:

• Data Wrangler Flows supports multiple instances.


• Updated Export to Data Wrangler Job Notebook to use SageMaker SDK 2.20.0.
• Updated Export to Pipeline Notebook to use SageMaker SDK 2.20.0.
• Updated Export to Pipeline Notebook to add XGBoost training example as an optional step.

Enhancements:

• To improve performance, importing CSV files that contain multiple lines in a single field is no longer
supported.

Bug Fixes:

• Fixed type inference issue in Quick model.


• Fixed the bias metric bug in bias reports.
• Fixed the Featurize text transform to work with columns with missing values.
• Fixed Histogram and Scatter plot built-in visualizations to work with datasets that contain array-like
columns.
• Athena query now re-runs if the query execution ID has expired.

Troubleshoot
If an issue arises when using Amazon SageMaker Data Wrangler, we recommend you do the following:

• If an error message is provided, read the message and resolve the issue it reports if possible.
• Make sure the IAM role of your Studio user has the required permissions to perform the action. For
more information, see Security and Permissions (p. 1141).
• If the issue occurs when you are trying to import from another AWS service, such as Amazon Redshift
or Athena, make sure that you have configured the necessary permissions and resources to perform
the data import. For more information, see Import (p. 991).
• If you're still having issues, choose Get help at the top right of your screen to reach out to the Data
Wrangler team.


As a last resort, you can try restarting the kernel on which Data Wrangler is running.

1. Save and exit the .flow file for which you want to restart the kernel.
2. Select the Running Terminals and Kernels icon.


3. Select the Stop icon to the right of the .flow file for which you want to terminate the kernel.


4. Refresh the browser.


5. Reopen the .flow file on which you were working.

Troubleshooting issues with Amazon EMR


Use the following information to help you troubleshoot errors that might come up when you're using
Amazon EMR.

• Connection failure – If the connection fails with the error message The IP address of the
EMR cluster isn't private, your Amazon EMR cluster might not have been launched in a
private subnet. As a security best practice, Data Wrangler only supports connecting to private Amazon
EMR clusters. Choose a private EC2 subnet when you launch an EMR cluster.
• Connection hanging and timing out – The issue is most likely due to a network connectivity issue.
After you start connecting to the cluster, the screen doesn't refresh. After about 2 minutes, the
following error might display at the top of the screen: JdbcAddConnectionError: An error
occurred when trying to connect to presto: xxx: Connect to xxx failed:
Connection timed out (Connection timed out).


The errors might have two root causes:


• The Amazon EMR cluster and Amazon SageMaker Studio are in different VPCs. We recommend launching
both Amazon EMR and Studio in the same VPC. You can also use VPC peering. For more information,
see What is VPC peering?.
• The Amazon EMR master security group lacks the inbound traffic rule for the security group of
Amazon SageMaker Studio on the port used for Presto. To resolve the issue, allow inbound traffic on
port 8889.
• Connection fails due to the connection type being misconfigured – You might see the following error
message: Data Wrangler couldn't create a connection to {connection_source}
successfully. Try connecting to {connection_source} again. For more
information, see Troubleshoot. If you’re still experiencing issues, contact
support.

Check the authentication method. The authentication method that you've specified in Data Wrangler
should match the authentication method that you're using on the cluster.
• You don't have HDFS permissions for LDAP authentication – To resolve the issue, set up HDFS
permissions using Linux credentials. Log in to the cluster and run the following commands:

hdfs dfs -mkdir /user/USERNAME
hdfs dfs -chown USERNAME:USERNAME /user/USERNAME

• LDAP authentication missing connection key error – You might see the following error message:
Data Wrangler couldn't connect to EMR hive successfully. JDBC connection is
missing required connection key(s): PWD.

For LDAP authentication, you must specify both a username and a password. The JDBC URL stored in
Secrets Manager is missing property PWD.
• When you're troubleshooting the LDAP configuration: We recommend making sure that the LDAP
authenticator (LDAP server) is correctly configured to connect to the Amazon EMR cluster. Use the
ldapwhoami command to help you resolve the configuration issue. The following are example
commands that you can run:
• For LDAPS – ldapwhoami -x -H ldaps://ldap-server
• For LDAP – ldapwhoami -x -H ldap://ldap-server

Either command should return Anonymous if you've configured the authenticator successfully.

Increase Amazon EC2 Instance Limit


You might see the following error message when you're using Data Wrangler: The following
instance type is not available: ml.m5.4xlarge. Try selecting a different
instance below.

The message can indicate that you need to select a different instance type, but it can also indicate that
you don't have enough Amazon EC2 instances to successfully run Data Wrangler on your workflow. You
can increase the number of instances by using the following procedure.

To increase the number of instances, do the following.

1. Open the AWS Management Console.


2. In the search bar, specify Service Quotas.
3. Choose Service Quotas.


4. Choose AWS services.


5. In the search bar, specify Amazon SageMaker.
6. Choose Amazon SageMaker.
7. Under Service quotas, specify Studio KernelGateway Apps running on ml.m5.4xlarge
instance.
Note
ml.m5.4xlarge is the default instance type for Data Wrangler. You can use other instance
types and request quota increases for them. For more information, see Instances (p. 1034).
8. Select Studio KernelGateway Apps running on ml.m5.4xlarge instance.
9. Choose Request quota increase.
10. For Change quota value, specify a value greater than Applied quota value.
11. Choose Request.

If your request is approved, AWS sends a notification to the email address associated with your account.
You can also check the status of your request by choosing Quota request history on the Service Quotas
page. Processed requests have a Status of Closed.
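
If you prefer the AWS CLI, the following is a minimal sketch of the same request; the quota code is a placeholder that you would first look up with list-service-quotas.

# Look up the quota code for the Studio KernelGateway quota on ml.m5.4xlarge
aws service-quotas list-service-quotas --service-code sagemaker \
    --query "Quotas[?contains(QuotaName, 'ml.m5.4xlarge')].[QuotaName,QuotaCode]"

# Request the increase with the quota code returned above (L-xxxxxxxx is a placeholder)
aws service-quotas request-service-quota-increase \
    --service-code sagemaker --quota-code L-xxxxxxxx --desired-value 6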

Update Data Wrangler


To update Data Wrangler to the latest release, first shut down the corresponding KernelGateway app
from the Amazon SageMaker Studio control panel. After the KernelGateway app is shut down, restart
it by opening a new or existing Data Wrangler flow in Studio. When you open a new or existing Data
Wrangler flow, the kernel that starts contains the latest version of Data Wrangler.

Update your Studio and Data Wrangler instance

1. Navigate to your SageMaker Console.
2. Choose SageMaker and then Studio.
3. Choose your user name.
4. Under Apps, in the row displaying the App name, choose Delete app for the app that starts with
sagemaker-data-wrang, and for the JupyterServer app.
5. Choose Yes, delete app.
6. Type delete in the confirmation box.
7. Choose Delete.
8. Reopen your Studio instance. When you begin to create a Data Wrangler flow, your instance now
uses the latest version of Data Wrangler.

Alternatively, if you are using a Data Wrangler application version that is not the latest and you
have an existing Data Wrangler flow open, the Studio UI prompts you to update your Data Wrangler
application version.
Important
This updates the Data Wrangler kernel gateway app only. You still need to shut down the
JupyterServer app in your user account. To do this, follow the preceding steps.


You can also choose Remind me later, in which case an Update button appears in the top-right corner of
the screen.

Shut Down Data Wrangler


When you are not using Data Wrangler, it is important to shut down the instance on which it runs to
avoid incurring additional fees.

To avoid losing work, save your data flow before shutting Data Wrangler down. To save your data flow in
Studio, choose File and then choose Save Data Wrangler Flow. Data Wrangler automatically saves your
data flow every 60 seconds.


To shut down the Data Wrangler instance in Studio

1. In Studio, select the Running Instances and Kernels icon.
2. Under RUNNING APPS, find the sagemaker-data-wrangler-1.0 app and select the shutdown icon
next to it.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING
INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler
flow file. This can take a few minutes.

Prepare data at scale from Studio notebooks with Amazon EMR or AWS Glue
Amazon SageMaker Studio gives data scientists, machine learning (ML) engineers, and general
practitioners tools to perform data analytics and data preparation at scale. Analyzing, transforming, and
preparing large amounts of data is a foundational step of any data science and ML workflow. SageMaker
Studio comes with built-in integration of Amazon EMR and AWS Glue Interactive Sessions to handle
your large-scale interactive data preparation and machine learning workflows, all within your Studio
notebook.


Amazon EMR is a managed big data platform with resources to help you run petabyte-scale distributed
data processing jobs using open-source analytics frameworks on AWS such as Apache Spark, Apache
Hive, Presto, HBase, Flink, and Hudi among others. Data engineers and data scientists use Amazon
EMR for a wide variety of use cases, including big data analytics, what-if analyses, real-time analytics,
and data preparation for machine learning. With Studio integration with Amazon EMR, you can create,
browse, discover, and connect to Amazon EMR clusters without leaving your Studio notebook. You can
also monitor and debug your Spark workloads with one-click access to the Spark UI from within the
notebook. You should consider Amazon EMR for your data preparation workloads if you want maximum
control over hardware and software versions, containers, and big data processing applications.

AWS Glue Interactive Sessions is a serverless service that you can use to collect, transform, cleanse,
and prepare data for storage in your data lakes and data pipelines. AWS Glue Interactive Sessions
provides an on-demand, serverless Apache Spark runtime environment that you can initialize in seconds
on a dedicated Data Processing Unit (DPU) without having to worry about provisioning and managing
complex compute cluster infrastructure. After initialization, you can quickly browse the AWS Glue data
catalog, run large queries, access data governed by AWS Lake Formation, and interactively analyze and
prepare data using Spark, right in your Studio notebook. You can then use the prepared data to train,
tune, and deploy models using the purpose-built ML tools within SageMaker Studio. You should consider
AWS Glue Interactive Sessions for your data preparation workloads when you want a serverless Spark
service with moderate control of configurability and flexibility.

Content
• Prepare data using Amazon EMR (p. 1164)
• Prepare data using AWS Glue Interactive Sessions (p. 1192)

Prepare data using Amazon EMR


Amazon SageMaker Studio comes with built-in integration of Amazon EMR, with which data scientists
and data engineers can perform petabyte-scale interactive data preparation and machine learning (ML)
right from their Studio notebook. Within a notebook, they can discover and connect to existing Amazon
EMR clusters, then interactively explore, visualize, and prepare large-scale data for machine learning
using Apache Spark, Apache Hive, or Presto. Additionally, users can access the Spark UI with a single click to
monitor their Spark jobs from their Studio notebooks.

Administrators can use the AWS Service Catalog to define AWS CloudFormation templates of Amazon
EMR clusters accessible to Studio users. Data scientists can then choose a predefined template to self-
provision an Amazon EMR cluster directly from Amazon SageMaker Studio notebooks. Administrators
can further parameterize the templates to let users choose aspects of the cluster to match their
workloads within predefined values. For example, a data scientist or data engineer may want to specify
the number of core nodes of the cluster up to a predetermined maximum value, or select the instance
type of a node from a dropdown menu.

• If you are an administrator, make sure that you have enabled communication between Amazon
SageMaker Studio notebooks and Amazon EMR clusters. For instructions, see the Configure
networking (for administrators) (p. 1165) section. Once this communication is enabled, you have the
option to:
• Define cluster templates in AWS Service Catalog and ensure the availability of these templates
through Studio's notebooks: Configure Amazon EMR templates in AWS Service Catalog (for
administrators) (p. 1168).
• Configure the discoverability of existing Amazon EMR clusters directly from Studio's notebooks:
Configure the discoverability of Amazon EMR clusters (for administrators) (p. 1178).
• If you are a data scientist or data engineer looking to self-provision an Amazon EMR cluster, see
Launch an Amazon EMR cluster from Studio (p. 1175).
• If you are a data scientist or data engineer looking to discover and connect to existing Amazon EMR
clusters from Studio, see Use Amazon EMR clusters from Studio notebooks (p. 1177).


List of topics
• Configure networking (for administrators) (p. 1165)
• Create an Amazon EMR cluster from Studio notebooks (p. 1168)
• Use Amazon EMR clusters from Studio notebooks (p. 1177)
• Access Spark UI from Studio (p. 1189)
• Walkthroughs and whitepapers (p. 1190)
• Additional Configuration for cross accounts use cases (for administrators) (p. 1191)

Configure networking (for administrators)


This section provides information about how administrators can configure their network to allow
communication between Amazon SageMaker Studio notebooks and an Amazon EMR cluster.

The networking instructions vary based on whether SageMaker Studio and Amazon EMR are deployed
within a private Amazon Virtual Private Cloud (VPC) or communicate over the internet.

By default, SageMaker Studio runs in an AWS managed VPC with internet access. When using an internet
connection, Studio accesses AWS resources, such as Amazon S3 buckets, over the internet. However,
if you have security requirements to control access to your data and job containers, we recommend
that you configure Studio and Amazon EMR so that your data and containers aren’t accessible over the
internet. To control access to your resources or run SageMaker Studio without public internet access, you
can specify the VPC only network access type when you onboard to Amazon SageMaker Domain (p. 37).
In this scenario, SageMaker Studio establishes connections with other AWS services via private VPC
endpoints. For information about configuring SageMaker Studio in VPC only mode, see Connect
SageMaker Studio notebooks in an Amazon VPC to external resources.
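
For example, the following is a minimal sketch of onboarding a Domain in VPC only mode with the AWS CLI; the domain name, VPC, subnet, and role identifiers are hypothetical placeholders.

# Create a Studio domain whose traffic stays inside the VPC (no public internet)
aws sagemaker create-domain \
    --domain-name my-private-domain \
    --auth-mode IAM \
    --app-network-access-type VpcOnly \
    --vpc-id vpc-0123456789abcdef0 \
    --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
    --default-user-settings ExecutionRole=arn:aws:iam::111122223333:role/StudioExecutionRole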

The first two sections describe how to ensure communication between SageMaker Studio and an
Amazon EMR cluster in VPCs without public internet access. The last section covers how to ensure
communication between SageMaker Studio and Amazon EMR using an internet connection. Prior
to connecting SageMaker Studio and Amazon EMR without internet access, make sure to establish
endpoints for Amazon Simple Storage Service (data storage), Amazon CloudWatch (logging and
monitoring), and Amazon SageMaker Runtime (fine-grained role-based access control (RBAC)).
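
As a sketch of that prerequisite, the following AWS CLI commands create those endpoints in the Studio VPC; the Region and all resource identifiers are hypothetical placeholders.

# Gateway endpoint for Amazon S3 (data storage)
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0

# Interface endpoints for CloudWatch (logging and monitoring) and SageMaker (API and runtime)
for svc in logs monitoring sagemaker.api sagemaker.runtime; do
    aws ec2 create-vpc-endpoint \
        --vpc-id vpc-0123456789abcdef0 \
        --vpc-endpoint-type Interface \
        --service-name com.amazonaws.us-east-1.$svc \
        --subnet-ids subnet-aaaa1111 \
        --security-group-ids sg-0123456789abcdef0
done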

• If your Amazon SageMaker Studio and Amazon EMR cluster are set up in different VPCs in the
same AWS account or in different accounts, see Studio and Amazon EMR are deployed in separate
VPCs (p. 1165).
• If your Amazon SageMaker Studio and Amazon EMR cluster are set up in the same VPC, see Amazon
SageMaker Studio and Amazon EMR are in the same VPC (p. 1167).
• If you chose to connect Amazon SageMaker Studio and Amazon EMR cluster over public internet, see
Amazon SageMaker Studio and Amazon EMR communicate over public internet (p. 1168).

Studio and Amazon EMR are deployed in separate VPCs


To allow communication between SageMaker Studio and an Amazon EMR cluster when they are
deployed in different VPCs:

1. Start by connecting your VPCs through a VPC peering connection.
2. Update your routing tables in each VPC to route the network traffic between Studio subnets and
Amazon EMR subnets both ways.
3. Configure your security groups to allow inbound and outbound traffic.

For a sketch of these three steps using the AWS CLI, see the example at the end of this section.


The steps are similar, regardless of whether Amazon SageMaker Studio and the Amazon EMR cluster
are deployed within the same AWS account (Single account use case) or different AWS accounts (Cross
accounts use case).

1. VPC peering

Create a VPC peering connection to facilitate the networking between the two VPCs (SageMaker
Studio and Amazon EMR).

a. From your SageMaker Studio account, on the Amazon VPC dashboard, choose Peering
connections, then Create peering connection.
b. Create your request to peer the Studio VPC with the Amazon EMR VPC. When requesting
peering in another AWS account, choose Another account in Select another VPC to peer with.

For cross accounts peering, the administrator must accept the request from the Amazon EMR
account.

When peering private subnets, you should enable private IP DNS resolution at the VPC peering
connection level.
2. Routing tables

Send the network traffic between SageMaker Studio subnets and Amazon EMR subnets both ways.

After you establish the peering connection, the administrator (on each account for cross accounts
access) can add routes to the private subnet route tables to route the traffic between the notebooks
and the cluster subnets. You can define those routes by going to the Route Tables section of each
VPC in the Amazon VPC dashboard.

The following illustration of the route table of a Studio VPC subnet shows an example of an
outbound route from the Studio account to the Amazon EMR VPC IP range (here 2.0.1.0/24)
through the peering connection.

The following illustration of a route table of an Amazon EMR VPC subnet shows an example of
return routes from the Amazon EMR VPC to Studio VPC IP range (here 10.0.20.0/24) through the
peering connection.

3. Security groups

Lastly, the security group of your Studio domain must allow outbound traffic, and the security group
of the Amazon EMR primary node must allow inbound traffic on Apache Livy, Hive, or Presto TCP
ports (respectively 8998, 10000, and 8889) from the Studio instance security group. Apache Livy is a
service that enables interaction with an Amazon EMR cluster over a REST interface.

The following illustration is an example of a VPC setup that allows SageMaker Studio notebooks to
provision Amazon EMR clusters from AWS CloudFormation templates in the Service Catalog, then
connect to an Amazon EMR cluster deployed in the same AWS account. The diagram also illustrates
the required endpoints when Studio and Amazon EMR communicate without access to the internet,
as well as the option to use a NAT gateway, which enables internet connectivity through an
internet gateway.
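
As a sketch of the three steps at the start of this section, the following AWS CLI commands wire up the single-account case; all resource IDs are hypothetical placeholders, and the CIDR ranges match the examples above.

# 1. Peer the Studio VPC with the Amazon EMR VPC, then accept the request
aws ec2 create-vpc-peering-connection \
    --vpc-id vpc-studio-id --peer-vpc-id vpc-emr-id
aws ec2 accept-vpc-peering-connection \
    --vpc-peering-connection-id pcx-0123456789abcdef0

# 2. Route traffic both ways through the peering connection
aws ec2 create-route --route-table-id rtb-studio-subnet-id \
    --destination-cidr-block 2.0.1.0/24 \
    --vpc-peering-connection-id pcx-0123456789abcdef0
aws ec2 create-route --route-table-id rtb-emr-subnet-id \
    --destination-cidr-block 10.0.20.0/24 \
    --vpc-peering-connection-id pcx-0123456789abcdef0

# 3. Open the Livy, Hive, and Presto ports on the EMR primary node security group
for port in 8998 10000 8889; do
    aws ec2 authorize-security-group-ingress \
        --group-id sg-emr-primary-id --protocol tcp --port $port \
        --source-group sg-studio-id
done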

Amazon SageMaker Studio and Amazon EMR are in the same VPC
If Amazon SageMaker Studio and the cluster are in different subnets, add routes to each private subnet
route table to route the traffic between the notebooks and the cluster subnets. You can define those
routes by going to the Route Tables section of each VPC in the Amazon VPC dashboard. If you deployed
Amazon SageMaker Studio and an Amazon EMR cluster in the same VPC and the same subnet, you do
not need to route the traffic between the notebooks and the cluster.

Whether or not you needed to update your routing tables, the security group of your Studio domain
must allow outbound traffic, and the security group of the Amazon EMR primary node must allow
inbound traffic on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from the
Studio instance security group. Apache Livy is a service that enables interaction with an Amazon EMR
cluster over a REST interface.


Amazon SageMaker Studio and Amazon EMR communicate over public internet
By default, SageMaker Studio provides a network interface that allows communication with the internet
through an internet gateway in the VPC associated with the SageMaker Domain. If you choose to connect
to Amazon EMR through the public internet, your Amazon EMR cluster needs to accept inbound traffic
on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from its internet gateway.
Apache Livy is a service that enables interaction with an Amazon EMR cluster over a REST interface.

Keep in mind that any port on which you allow inbound traffic represents a potential security
vulnerability. Carefully review custom security groups to ensure that you minimize vulnerabilities. For
more information, see Control network traffic with security groups.

Alternatively, see Walkthroughs and whitepapers (p. 1190) for a detailed walkthrough of how to enable
Kerberos on Amazon EMR, set the cluster in a private subnet, and access the cluster using a Network
Load Balancer (NLB) to expose only specific ports, which are access-controlled via security groups.
Note
When connecting to your Apache Livy endpoint through the public internet, we recommend that
you secure communications between Amazon SageMaker Studio and your Amazon EMR cluster
using TLS.
For information on setting up HTTPS with Apache Livy, see Enabling HTTPS with Apache
Livy. For information on setting an Amazon EMR cluster with transit encryption enabled, see
Providing certificates for encrypting data in transit with Amazon EMR encryption. Additionally,
you need to configure Studio to access your certificate key as specified in Connect to an Amazon
EMR cluster over HTTPS (p. 1183).

Create an Amazon EMR cluster from Studio notebooks


Administrators can use AWS Service Catalog to define AWS CloudFormation templates of Amazon
EMR clusters as products of a portfolio, then make them available to selected users. Using the Service
Catalog, administrators can fully control the organizational, security, and networking setup of Amazon
EMR clusters. Data scientists and data engineers can then view, select, and customize those templates
for their specific workloads to create on-demand Amazon EMR clusters directly from their SageMaker
Studio notebooks. This can be done without manually setting up complex configurations. Users can also
terminate Amazon EMR clusters from Studio notebooks after use.

• If you are an administrator looking to configure AWS CloudFormation templates as AWS Service
Catalog products so users can create Amazon EMR clusters from Studio, see Configure Amazon EMR
templates in AWS Service Catalog (for administrators) (p. 1168).
• If you are a data scientist or data engineer looking to self-provision an Amazon EMR cluster to process
data at scale using open-source frameworks such as Apache Spark, Apache Hive, or Presto, see Launch
an Amazon EMR cluster from Studio (p. 1175).
• If you are looking to discover and connect to existing Amazon EMR clusters from Studio, see Use
Amazon EMR clusters from Studio notebooks (p. 1177).

Topics
• Configure Amazon EMR templates in AWS Service Catalog (for administrators) (p. 1168)
• Launch an Amazon EMR cluster from Studio (p. 1175)

Configure Amazon EMR templates in AWS Service Catalog (for administrators)


This section provides details about how administrators can configure an AWS Service Catalog product
so users can independently self-provision Amazon EMR clusters from Amazon SageMaker Studio
notebooks. Additionally, administrators can configure the Amazon EMR cluster templates in a way so


end-users can customize various aspects of the cluster to suit their specific requirements. For example,
the administrator can define a list of permissible instance types from which users can choose when
creating a cluster.

This topic assumes that you are familiar with the creation of portfolios and products in AWS Service
Catalog as well as Amazon EMR, and AWS CloudFormation (CFN).
Note
You can refer to the CFN templates in the aws-samples/sagemaker-studio-emr GitHub repository as
examples of AWS CloudFormation stacks to deploy IAM roles, VPCs, a sandbox Studio Domain,
a user profile, as well as an AWS CloudFormation template to launch an Amazon EMR cluster.
Several options are available depending on your authentication method between Studio and the
Amazon EMR cluster. In these examples, a parent CFN template passes the SageMaker VPC ID,
security group, and subnet ID parameters to the CFN template of an Amazon EMR cluster.
You can access various examples of CFN Amazon EMR templates in the nested repository
sagemaker-studio-emr/cloudformation/emr_servicecatalog_templates, with options for both
single-account and cross-account deployments.
For more information about the authentication methods available when connecting to an
Amazon EMR cluster, see Use Amazon EMR clusters from Studio notebooks (p. 1177).

To simplify the creation of Amazon EMR clusters, administrators can register the CloudFormation
template of an Amazon EMR cluster as a product in the portfolio of the AWS Service Catalog. Then
they associate the Service Catalog portfolio with the Studio execution role to ensure the availability of
the template in Studio. Furthermore, to make sure that data scientists can discover those templates,
provision Amazon EMR clusters, and connect to Amazon EMR clusters from their Studio notebooks,
administrators need to provide the Studio execution role with additional permissions.

The following list provides the additional settings that administrators need to apply to a baseline CFN
stack to enable Studio to access the Service Catalog products and provision Amazon EMR clusters. Those
settings must be applied at multiple levels:

• In the Service Catalog portfolio
• In the Service Catalog product
• In the CFN Amazon EMR template declared as the Service Catalog product

Finally, administrators need to assign a set of necessary permissions to the Studio execution role and the
account where Amazon EMR is deployed, depending on whether Studio and Amazon EMR are deployed
within the same or different AWS accounts.

• Prerequisites: Networking and authentication requirements

As a prerequisite, ensure that you have reviewed the networking and security requirements in
Configure networking (for administrators) (p. 1165) and that you have created a baseline CFN stack
supporting the authentication method of your choice. You can find examples of CFN templates in aws-
samples/sagemaker-studio-emr.
• In your Service Catalog portfolio:

Add the following section to your portfolio CFN template (see the example in YAML format) to
associate your portfolio with the Studio execution role used by the user profiles.

SageMakerStudioEMRProductPortfolioPrincipalAssociation:
  Type: AWS::ServiceCatalog::PortfolioPrincipalAssociation
  Properties:
    PrincipalARN: SageMakerExecutionRole.Arn
    PortfolioId: SageMakerStudioEMRProductPortfolio ID
    PrincipalType: IAM

• In your Service Catalog product:


Add the tag key "sagemaker:studio-visibility:emr" with the value "true" (shown here in YAML)
to the Service Catalog product referencing the Amazon EMR template resource. This ensures the
visibility of the template in Studio.

SMStudioEMRNoAuthProduct:
  Type: AWS::ServiceCatalog::CloudFormationProduct
  Properties:
    Owner: AWS
    Name: SageMaker Studio Domain No Auth EMR
    ProvisioningArtifactParameters:
      - Name: SageMaker Studio Domain No Auth EMR
        Description: Provisions a SageMaker domain and No Auth EMR Cluster
        Info:
          LoadTemplateFromURL: Link to your CFN template. For example, https://aws-ml-blog.s3.amazonaws.com/artifacts/astra-m4-sagemaker/end-to-end/CFN-EMR-NoStudioNoAuthTemplate-v3.yaml
    Tags:
      - Key: "sagemaker:studio-visibility:emr"
        Value: "true"

• In the CFN template of the Amazon EMR cluster within your Service Catalog product:

Add the following mandatory stack parameters as placeholders. These parameters are populated with
the Studio project name and identifier when a user provisions a cluster from Studio.

SageMakerProjectName:
  Type: String
  Description: Name of the project

SageMakerProjectId:
  Type: String
  Description: Service generated Id of the project.

Administrators have the option to incorporate choices in the parameters section of a template
so users can input or select custom values when creating a cluster by specifying Default and
AllowedValues. The following example illustrates additional input parameters that administrators
can set when creating an Amazon EMR template.

"Parameters": {
"EmrClusterName": {
"Type": "String",
"Description": "EMR cluster Name."
},
"MasterInstanceType": {
"Type": "String",
"Description": "Instance type of the EMR master node.",
"Default": "m5.xlarge",
"AllowedValues": [
"m5.xlarge",
"m5.2xlarge",
"m5.4xlarge"
]
},
"CoreInstanceType": {
"Type": "String",
"Description": "Instance type of the EMR core nodes.",
"Default": "m5.xlarge",
"AllowedValues": [
"m5.xlarge",
"m5.2xlarge",
"m5.4xlarge",

1170
Amazon SageMaker Developer Guide
Prepare data using Amazon EMR

"m3.medium",
"m3.large",
"m3.xlarge",
"m3.2xlarge"
]
},
"CoreInstanceCount": {
"Type": "String",
"Description": "Number of core instances in the EMR cluster.",
"Default": "2",
"AllowedValues": [
"2",
"5",
"10"
]
},
"EmrReleaseVersion": {
"Type": "String",
"Description": "The release version of EMR to launch.",
"Default": "emr-5.33.1",
"AllowedValues": [
"emr-5.33.1",
"emr-6.4.0"
]
}
}

• Last, attach the required IAM policies to enable the visibility of CFN Amazon EMR templates and the
self-provisioning of Amazon EMR clusters from the Studio notebooks. The role to which you must add
those policies depends on whether Studio and Amazon EMR are deployed within the same account
(single account) or in different accounts (cross accounts).
• If your Amazon EMR cluster is deployed in the same AWS account as the Studio account, see the
Single Account tab.
• If your Amazon EMR cluster is deployed in a different AWS account than the Studio account, see the
Cross Accounts tab.
Single account

Attach the following permissions to the Studio execution role accessing your cluster.

The following list provides a breakdown of the permissions required.

• AllowEMRTemplateDiscovery allows the discovery of Amazon EMR templates.
• AllowSagemakerProjectManagement enables the creation of SageMaker projects. In Studio,
access to the AWS Service Catalog is granted through Projects.
• AllowClusterDetailsDiscovery and AllowClusterDiscovery allow the discovery of and
connection to Amazon EMR clusters.
• AllowPresignedUrl allows the creation of pre-signed URLs to access the Spark UI.

The following is a comprehensive JSON that includes these permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:studio-region:studio-account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:DescribeSecurityConfiguration"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:studio-region:studio-account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowEMRTemplateDiscovery",
            "Effect": "Allow",
            "Action": [
                "servicecatalog:SearchProducts"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowSagemakerProjectManagement",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateProject",
                "sagemaker:DeleteProject"
            ],
            "Resource": "arn:aws:sagemaker:studio-region:studio-account-id:project/*"
        }
    ]
}

Cross accounts

If your Amazon EMR clusters and Studio are deployed in separate AWS accounts, you configure the
permissions in multiple steps. A sketch of these steps using the AWS CLI appears at the end of this tab.
• On the trusting account (the account in which Amazon EMR is deployed), create a custom
IAM role (referred to as ASSUME-ROLE in this page) with the following trust relationship and
permissions.

For information about creating a role on an AWS account, see Creating an IAM role (console).
• To grant the trusted account (the account in which the Studio account is deployed) the
permission to assume a role in the trusting account, add the following trust relationship.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::studio-account:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

• Add a policy defining the following permissions.
• AllowClusterDetailsDiscovery and AllowClusterDiscovery to allow the discovery of
and connection to Amazon EMR clusters.
• AllowPresignedUrl to allow the creation of pre-signed URLs to access the Spark UI.

The following is a comprehensive JSON that includes these permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:DescribeSecurityConfiguration"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        }
    ]
}

• On the trusted account (the account in which Studio is deployed), add the following
policies to the Studio execution role.
• To grant SageMaker Studio's execution role the permission to assume the ASSUME-ROLE in
the trusting account, add the following policy.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRoleAssumptionForCrossAccountDiscovery",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": ["arn:aws:iam::emr-account:role/ASSUME-ROLE"]
        }
    ]
}

• Additionally, add a policy defining the following permissions.
• AllowSagemakerProjectManagement to allow the creation of SageMaker projects. In
Studio, access to the AWS Service Catalog is granted through Projects.
• AllowEMRTemplateDiscovery to allow the discoverability of Amazon EMR templates.

The following is a comprehensive JSON that includes these permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSagemakerProjectManagement",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateProject",
                "sagemaker:DeleteProject"
            ],
            "Resource": "arn:aws:sagemaker:::project/*"
        },
        {
            "Sid": "AllowEMRTemplateDiscovery",
            "Effect": "Allow",
            "Action": [
                "servicecatalog:SearchProducts"
            ],
            "Resource": "*"
        }
    ]
}

• Last, see Additional Configuration for cross accounts use cases (for administrators) (p. 1191) to
provide the ARN of the previously created IAM role ASSUME-ROLE to the Studio execution role.
The ARN of this assumable cross-accounts role is loaded by the Studio Jupyter server at launch.
The Studio execution role assumes that remote role to discover and connect to Amazon EMR
clusters in the trusting account.
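
As a sketch of the trusting-account steps above, the following hypothetical AWS CLI commands create ASSUME-ROLE from the trust relationship and permission policy shown earlier, assuming you have saved them locally as trust.json and emr-access.json.

# Create the cross-account access role with the trust relationship above
aws iam create-role \
    --role-name ASSUME-ROLE \
    --assume-role-policy-document file://trust.json

# Attach the Amazon EMR discovery and pre-signed URL permissions as an inline policy
aws iam put-role-policy \
    --role-name ASSUME-ROLE \
    --policy-name EmrDiscoveryAndPresignedUrl \
    --policy-document file://emr-access.json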

Once the CFN templates are available in Amazon SageMaker Studio, data scientists can self-provision
Amazon EMR clusters from those templates. Each item in the list of "Parameters" provided
in the template becomes an input box of the cluster creation form in Studio where the values in
"AllowedValues" appear in a dropdown menu.

The following illustration shows the dynamic form assembled from a CFN Amazon EMR template to
create an Amazon EMR cluster in SageMaker Studio.


Visit Launch an Amazon EMR cluster from Studio (p. 1175) to learn about how to launch a cluster from
Studio using those Amazon EMR templates.

Launch an Amazon EMR cluster from Studio


Data scientists and data engineers can self-provision Amazon EMR clusters from Studio using AWS
CloudFormation templates configured by their administrators. If you are an administrator looking
to configure AWS CloudFormation templates as AWS Service Catalog products so users can create
Amazon EMR clusters from Studio, see Configure Amazon EMR templates in AWS Service Catalog (for
administrators) (p. 1168).


To provision a new Amazon EMR cluster from Studio:

1. Select the Home icon in the Studio UI's left-side panel, then select the Data node in the
navigation menu. Navigate down to the Clusters node. This opens up a page listing the Amazon EMR
clusters that you can access from SageMaker Studio.
2. Choose Create cluster. This opens up a page, in the main working area, listing the cluster templates
available to you.
3. Select a cluster configuration template by choosing a template name. The selection of a template
activates the Select template button. Choose Select template. This opens up a cluster creation
form.
4. Enter the cluster's details, such as a cluster name and any specific configurable parameter set by
your administrator, then choose Create cluster. The creation of the cluster might take a couple of
minutes.

Once the cluster is provisioned, the Studio UI displays the message "The cluster has been successfully
created".

To connect to your cluster, see Use Amazon EMR clusters from Studio notebooks (p. 1177).


Use Amazon EMR clusters from Studio notebooks


In this section, you learn how to discover, connect to, or terminate an Amazon EMR cluster from
SageMaker Studio notebooks.

• If you are an administrator, see Configure the discoverability of Amazon EMR clusters (for
administrators) (p. 1178) to configure the discoverability of Amazon EMR clusters from SageMaker
Studio notebooks.
• If you are a data scientist or data engineer looking to discover Amazon EMR clusters from your Studio
notebooks, see Discover Amazon EMR clusters from SageMaker Studio (p. 1180).
• If you are a data scientist or data engineer looking to connect to existing Amazon EMR clusters from
your Studio notebooks, see Connect to an Amazon EMR cluster from SageMaker Studio (p. 1181).

When connecting to your Amazon EMR cluster from SageMaker Studio, you can authenticate to
your cluster with Kerberos, Lightweight Directory Access Protocol (LDAP), or use runtime IAM role
authentication. Your authentication method depends on your cluster configuration. You can refer
to this example Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon
EMR cluster to set up an Amazon EMR cluster that uses Kerberos. Alternatively, you can look at the
CloudFormation example templates using Kerberos or LDAP in the aws-samples/sagemaker-studio-emr
GitHub repository.

Find the list of available connection commands per authentication method in Enter the connection
command to an Amazon EMR cluster manually (p. 1182).

Supported images and kernels to connect to an Amazon EMR cluster from SageMaker Studio

SageMaker Studio provides built-in support to connect to Amazon EMR clusters in the following images
and kernels:

• DataScience – Python 3 kernel
• DataScience 2.0 – Python 3 kernel
• DataScience 3.0 – Python 3 kernel
• SparkAnalytics 1.0 – SparkMagic and PySpark kernels
• SparkAnalytics 2.0 – SparkMagic and PySpark kernels
• SparkMagic – SparkMagic and PySpark kernels
• PyTorch 1.8 – Python 3 kernel
• TensorFlow 2.6 – Python 3 kernel
• TensorFlow 2.11 – Python 3 kernel

Those images and kernels come with sagemaker-studio-analytics-extension, a notebook extension that
enables connection to a remote Spark (Amazon EMR) cluster via the SparkMagic library using Apache
Livy.

To connect to Amazon EMR clusters using another built-in image or your own image, follow the
instructions in Bring your own image (p. 1177).

Bring your own image


To bring your own image in SageMaker Studio and allow your notebooks to connect to Amazon EMR
clusters, install the sagemaker-studio-analytics-extension extension in your kernel. It supports
connecting SageMaker Studio notebooks to Spark (Amazon EMR) clusters through the SparkMagic library.


pip install sparkmagic
pip install sagemaker-studio-sparkmagic-lib
pip install sagemaker-studio-analytics-extension

Additionally, to connect to Amazon EMR with Kerberos authentication, you must install the kinit client.
Depending on your OS, the command to install the kinit client can vary. To bring an Ubuntu (Debian
based) image, use the apt-get install -y -qq krb5-user command.

For more information on bringing your own image in SageMaker Studio, see Bring your own SageMaker
image.
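
Putting these pieces together, the following is a minimal sketch of the commands you might run when building your custom image, assuming a Debian-based base image; it is not the only possible layout.

# Install the extensions that enable the Amazon EMR connection from the kernel
pip install sparkmagic sagemaker-studio-sparkmagic-lib sagemaker-studio-analytics-extension

# For clusters that use Kerberos authentication, also install the kinit client
apt-get install -y -qq krb5-user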

Configure the discoverability of Amazon EMR clusters (for administrators)


This section provides details about how administrators can configure the discoverability of existing
Amazon EMR clusters from SageMaker Studio. The clusters can be deployed in the same AWS account
(Single Account tab) or across accounts (Cross Accounts tab).

Single Account

Attach the following permissions to the SageMaker Studio's execution role accessing your cluster.

The following list provides a breakdown of the permissions required.

• AllowSagemakerProjectManagement enables the creation of SageMaker projects. In Studio,
access to the AWS Service Catalog is granted through Projects.
• AllowClusterDetailsDiscovery and AllowClusterDiscovery allow the discovery of and
connection to Amazon EMR clusters.
• AllowPresignedUrl allows the creation of pre-signed URLs to access the Spark UI.

The following is a comprehensive JSON that includes these permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:region:account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:DescribeSecurityConfiguration"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:region:account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowSagemakerProjectManagement",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateProject",
                "sagemaker:DeleteProject"
            ],
            "Resource": "arn:aws:sagemaker:region:account-id:project/*"
        }
    ]
}

Cross Accounts

If your Amazon EMR clusters and SageMaker Studio are deployed in separate AWS accounts, you
configure the permissions in multiple steps.

• On the trusting account (the account in which Amazon EMR is deployed), create a custom IAM role
(referred to as ASSUME-ROLE in this page) with the following trust relationship and permissions.

For information about creating a role on an AWS account, see Creating an IAM role (console).
• To grant the trusted account (the account in which SageMaker Studio is deployed) the
permission to assume a role in the trusting account, add the following trust relationship.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::studio-account:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

• Add a policy defining the following permissions.
• AllowClusterDetailsDiscovery and AllowClusterDiscovery to allow the discovery of
and connection to Amazon EMR clusters.
• AllowPresignedUrl to allow the creation of pre-signed URLs to access the Spark UI.

The following is a comprehensive JSON that includes these permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPresignedUrl",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:CreatePersistentAppUI",
                "elasticmapreduce:DescribePersistentAppUI",
                "elasticmapreduce:GetPersistentAppUIPresignedURL",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDetailsDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:DescribeSecurityConfiguration"
            ],
            "Resource": [
                "arn:aws:elasticmapreduce:emr-region:emr-account-id:cluster/*"
            ]
        },
        {
            "Sid": "AllowClusterDiscovery",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters"
            ],
            "Resource": "*"
        }
    ]
}

• On the trusted account (the account in which SageMaker Studio is deployed), add the following
policy to SageMaker Studio's execution role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRoleAssumptionForCrossAccountDiscovery",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": ["arn:aws:iam::emr-account:role/ASSUME-ROLE"]
        }
    ]
}

• Last, see Additional Configuration for cross accounts use cases (for administrators) (p. 1191)
to provide the ARN of the previously created IAM role ASSUME-ROLE to SageMaker Studio's
execution role. The ARN of this assumable cross accounts role is loaded by the Studio Jupyter
server at launch. SageMaker Studio's execution role assumes that remote role to discover and
connect to Amazon EMR clusters in the trusting account.

Visit Discover Amazon EMR clusters from SageMaker Studio (p. 1180) to learn about how to discover and
connect to Amazon EMR clusters from Studio notebooks.

Discover Amazon EMR clusters from SageMaker Studio


Data scientists and data engineers can discover, connect to, and manage Amazon EMR clusters from
Amazon SageMaker Studio. The Amazon EMR clusters may be in the same AWS account as Amazon
SageMaker Studio or in a different AWS account.


If your administrator configured the cross-account discovery of Amazon EMR clusters, you can see a
consolidated list of clusters in the AWS account used by SageMaker Studio as well as in the remote
accounts.

If you are an administrator looking to set up the discoverability of Amazon EMR clusters from SageMaker
Studio, see Configure the discoverability of Amazon EMR clusters (for administrators) (p. 1178).

To view the list of available Amazon EMR clusters from SageMaker Studio:

1. Select the Home icon in the Studio UI's left-side panel, then select the Data node in the
navigation menu.
2. Navigate down to the Clusters node. This opens up a page listing the Amazon EMR clusters that you
can access from SageMaker Studio.

The list displays the status of each cluster. A cluster status can be Starting, Bootstrapping,
Running/Waiting, Terminating, Terminated, or Terminated with error. You can filter clusters by
status by selecting the filter icon.

3. To connect to a particular Running/Waiting cluster, see Connect to an Amazon EMR cluster from
SageMaker Studio (p. 1181).

Connect to an Amazon EMR cluster from SageMaker Studio


This section explains how you can connect to an Amazon EMR cluster from a Studio notebook when you
use any of the supported kernels.

Connect to an Amazon EMR cluster automatically


To connect to your cluster using the Studio UI, you can either initiate a connection from the list of
clusters accessed in Discover Amazon EMR clusters from SageMaker Studio (p. 1180), or from a notebook
in SageMaker Studio.

To connect to a particular cluster from your list of clusters

1. Choose the name of the cluster in your list. This activates the Attach to new notebook button.
2. Choose Attach to new notebook. This opens up the images and kernels selection box.
3. Select your image and kernel, then choose Select. For a list of supported images, see Supported
images and kernels to connect to an Amazon EMR cluster from SageMaker Studio (p. 1177) or refer
to Bring your own image (p. 1177).
4. If the cluster you select does not use Kerberos, LDAP, or runtime role authentication, Studio prompts
you to select the credential type. Choose from Http basic authentication or No credentials,
then enter your credentials, if applicable. A connection command populates the first cell of your
notebook and initiates the connection with the Amazon EMR cluster.

Once the connection succeeds, a message confirms the connection and the start of the Spark
application.


Alternatively, you can connect to a cluster from a notebook.

1. Choose Cluster at the top of your notebook.

Cluster is only visible when you use a kernel from Supported images and kernels to connect to an
Amazon EMR cluster from SageMaker Studio (p. 1177) or from Bring your own image (p. 1177). If
you cannot see Cluster at the top of your notebook, ensure that your administrator has configured
the discoverability of your clusters and switch to a supported kernel.

This opens up a list of available clusters.


2. Select the cluster to which you want to connect, then choose Connect.
3. If you configured your Amazon EMR clusters to support runtime IAM roles and your administrator
preloaded your roles in an execution role configuration JSON, you can select your Amazon EMR
access role from the Amazon EMR execution role dropdown menu. If your roles are not preloaded,
Studio uses your Studio execution role by default. For information about using runtime roles with
Amazon EMR, see Connect to an Amazon EMR cluster from Studio using runtime IAM roles (p. 1184).
When you connect to a cluster, Studio adds a code block to an active cell to establish the connection.

Otherwise, if the cluster you choose does not use Kerberos, LDAP, or runtime role authentication,
Studio prompts you to select the credential type. You can choose HTTP basic authentication or No
credential.
4. An active cell populates and runs. This cell contains the connection command to connect to your
Amazon EMR cluster.

Once the connection succeeds, a message confirms the connection and the start of the Spark
application.

Enter the connection command to an Amazon EMR cluster manually

You can manually connect to your Amazon EMR cluster from a Studio notebook whether or not your
Studio application and cluster reside in the same AWS account.

For each of the following authentication types, use the specified command to manually connect to your
cluster from your Studio notebook.

• Kerberos

Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.

%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Kerberos --language python
[--assumable-role-arn EMR_access_role_ARN ]
[--verify-certificate /home/user/certificateKey.pem]

• LDAP

Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.

%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Basic_Access --language python
[--assumable-role-arn EMR_access_role_ARN ]
[--verify-certificate /home/user/certificateKey.pem]

• NoAuth

Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.

%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type None --language python
[--assumable-role-arn EMR_access_role_ARN ]
[--verify-certificate /home/user/certificateKey.pem]

• Runtime IAM roles

Append the --assumable-role-arn argument if you need cross-account Amazon EMR access.
Append the --verify-certificate argument if you connect to your cluster with HTTPS.

For more information on connecting to an Amazon EMR cluster using runtime IAM roles, see Connect
to an Amazon EMR cluster from Studio using runtime IAM roles (p. 1184).

%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Basic_Access \
--emr-execution-role-arn arn:aws:iam::studio_account_id:role/emr-execution-role-name
[--assumable-role-arn EMR_access_role_ARN]
[--verify-certificate /home/user/certificateKey.pem]

Connect to an Amazon EMR cluster over HTTPS

If you have configured your Amazon EMR cluster with transit encryption enabled and the Apache Livy
server for HTTPS, and you would like Studio to communicate with Amazon EMR using HTTPS, you need
to configure Studio to access your certificate key.

For self-signed or local Certificate Authority (CA) signed certificates, you can do this in two steps:

1. Download the PEM file of your certificate to your local file system using one of the following options:
• Jupyter's built-in file upload function.
• A notebook cell.
• A lifecycle configuration (LCC) script.

For information on how to use an LCC script, see Customize a Notebook Instance Using a Lifecycle
Configuration Script.
2. Enable the validation of the certificate by providing the path to your certificate in the --verify-
certificate argument of your connection command.

%sm_analytics emr connect --cluster-id cluster_id \
--verify-certificate /home/user/certificateKey.pem ...

For public CA-issued certificates, enable certificate validation by setting the --verify-certificate
parameter to true.

Alternatively, you can disable certificate validation by setting the --verify-certificate
parameter to false.
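
For example, a minimal sketch of the two steps from a Studio notebook might look like the following; the S3 path and cluster ID are hypothetical placeholders.

# Step 1: copy the certificate PEM file into the Studio file system
!aws s3 cp s3://example-bucket/certificateKey.pem /home/user/certificateKey.pem

# Step 2: connect over HTTPS with certificate validation enabled
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Basic_Access --language python \
--verify-certificate /home/user/certificateKey.pem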


You can find the list of available connection commands to an Amazon EMR cluster in Enter the
connection command to an Amazon EMR cluster manually (p. 1182).

Connect to an Amazon EMR cluster from Studio using runtime IAM roles

When you connect to an Amazon EMR cluster from your Amazon SageMaker Studio notebook, you can
visually browse a list of IAM roles, known as runtime roles, and select one on the fly. Subsequently, all
your Apache Spark, Apache Hive, or Presto jobs created from your Studio notebook access only the data
and resources permitted by policies attached to the runtime role. Also, when data is accessed from data
lakes managed with AWS Lake Formation, you can enforce table-level and column-level access using
policies attached to the runtime role.

With this capability, you and your teammates can connect to the same cluster, each using a runtime
role scoped with permissions matching your individual level of access to data. Your sessions are also
isolated from one another on the shared cluster. With this ability to control fine-grained access to data
on the same shared cluster, you can simplify provisioning of Amazon EMR clusters, reducing operational
overhead and saving costs.

To try out this feature, see Apply fine-grained data access controls with AWS Lake Formation and
Amazon EMR from Amazon SageMaker Studio. This blog post helps you set up a demo environment
where you can try using preconfigured runtime roles to connect to Amazon EMR clusters.

Prerequisites

Before you get started, make sure you meet the following prerequisites:

• Use Amazon EMR version 6.9 or above.
• Use JupyterLab version 3 in the Studio Jupyter server application configuration. This version supports
Studio connection to Amazon EMR clusters using runtime roles.
• Allow the use of runtime roles in your cluster’s security configuration. For more information, see
Runtime roles for Amazon EMR steps.
• Create a notebook with any of the kernels listed in Use Amazon EMR clusters from Studio
notebooks (p. 1177).
• Make sure you review the instructions in Set up Studio to use runtime IAM roles (p. 1185) to configure
runtime roles with Studio.

Cross-account connection scenarios

Runtime role authentication supports a variety of cross-account connection scenarios when your data
resides outside of your Studio account. The following image shows three different ways you can assign
your Amazon EMR cluster, data, and even Amazon EMR execution role between your Studio and data
accounts:


In option 1, your Amazon EMR cluster and Amazon EMR execution role are in a separate data account
from your Studio account. You define a separate Amazon EMR access role permission policy which grants
permission to your Studio execution role to assume the Amazon EMR access role. The Amazon EMR
access role then calls the Amazon EMR API GetClusterSessionCredentials on behalf of your Studio
execution role, giving you access to the cluster.

In option 2, your Amazon EMR cluster and Amazon EMR execution role are in your Studio account. Your
Studio execution role has permission to use the Amazon EMR API GetClusterSessionCredentials
to gain access to your cluster. To access the Amazon S3 bucket, give the Amazon EMR execution role
cross-account Amazon S3 bucket access permissions — you grant these permissions within your Amazon
S3 bucket policy.

In option 3, your Amazon EMR clusters are in your Studio account, and the Amazon EMR execution
role is in the data account. Your Studio execution role has permission to use the Amazon EMR API
GetClusterSessionCredentials to gain access to your cluster. Add the Amazon EMR execution role
into the execution role configuration JSON. Then you can select the role in the UI when you choose your
cluster. For details about how to set up your execution role configuration JSON file, see Preload your
execution roles into Studio (p. 1188).

Set up Studio to use runtime IAM roles

To establish runtime role authentication for your Amazon EMR clusters, configure the required IAM
policies, network, and usability enhancements. Your setup depends on whether your Amazon EMR
clusters, your Amazon EMR execution role, or both reside outside of your Amazon SageMaker Studio
account. The following discussion guides you through the policies to install, how to configure the
network to allow traffic between the accounts, and the local configuration file to set up to automate
your Amazon EMR connection.

Configure runtime role authentication when your Amazon EMR cluster and Studio are in the
same account

If your Amazon EMR cluster resides in your Studio account, add the basic policy to
connect to your Amazon EMR cluster and set permissions to call the Amazon EMR API
GetClusterSessionCredentials, which gives you access to the cluster. Complete the following steps
to add necessary permissions to your Studio execution policy:


1. Add the required IAM policy to connect to Amazon EMR clusters. For details, see Discover Amazon EMR
clusters from SageMaker Studio (p. 1180).
2. Grant permission to call the Amazon EMR API GetClusterSessionCredentials when you pass
one or more permitted Amazon EMR execution roles specified in the policy.
3. (Optional) Grant permission to pass IAM roles that follow any user-defined naming conventions.
4. (Optional) Grant permission to access Amazon EMR clusters that are tagged with specific user-defined
strings.
5. If you don't want to manually call the Amazon EMR connection command, install a SageMaker
configuration file in your local Amazon EFS and select the role to use when you select your Amazon
EMR cluster. For details about how to preload your IAM roles, see Preload your execution roles into
Studio (p. 1188).

The following example policy permits Amazon EMR execution roles belonging to the modeling and
training groups to call GetClusterSessionCredentials. In addition, the policyholder can access
Amazon EMR clusters tagged with the strings modeling or training.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "elasticmapreduce:GetClusterSessionCredentials",
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "elasticmapreduce:ExecutionRoleArn": [
                        "arn:aws:iam::123456780910:role/emr-execution-role-ml-modeling*",
                        "arn:aws:iam::123456780910:role/emr-execution-role-ml-training*"
                    ],
                    "elasticmapreduce:ResourceTag/group": [
                        "*modeling*",
                        "*training*"
                    ]
                }
            }
        }
    ]
}

Configure runtime role authentication when your cluster and Studio are in different accounts

If your Amazon EMR cluster is not in your Studio account, allow your Studio execution role to assume the
cross-account Amazon EMR access role so you can connect to the cluster. Complete the following steps
to set up your cross-account configuration:

1. Create your Studio execution role permission policy so that the execution role can assume the Amazon
EMR access role. The following policy is an example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAssumeCrossAccountEMRAccessRole",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::emr_account_id:role/emr-access-role-name"
        }
    ]
}


2. Create the trust policy to specify which Studio account IDs are trusted to assume the Amazon EMR
access role. The following policy is an example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountSageMakerExecutionRoleToAssumeThisRole",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::studio_account_id:role/studio_execution_role"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

3. Create the Amazon EMR access role permission policy, which grants the Amazon EMR execution role
the needed permissions to carry out the intended tasks on the cluster. Configure the Amazon EMR
access role to call the API GetClusterSessionCredentials with the Amazon EMR execution roles
specified in the access role permission policy. The following policy is an example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCallingEmrGetClusterSessionCredentialsAPI",
            "Effect": "Allow",
            "Action": "elasticmapreduce:GetClusterSessionCredentials",
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "elasticmapreduce:ExecutionRoleArn": [
                        "arn:aws:iam::emr_account_id:role/emr-execution-role-name"
                    ]
                }
            }
        }
    ]
}

4. Set up the cross-account network so that traffic can move back and forth between your accounts. For
guided instruction, see Set up the network in the blog post Create and manage Amazon EMR Clusters
from SageMaker Studio to run interactive Spark and ML workloads – Part 2. The steps in the blog post
help you complete the following tasks:
a. VPC-peer your Studio account and your Amazon EMR account to establish a connection.
b. Manually add routes to the private subnet route tables in both accounts. This permits creation
and connection of Amazon EMR clusters from the Studio account to the remote account’s private
subnet.
c. Set up the security group attached to your Studio domain to allow outbound traffic and the
security group of the Amazon EMR primary node to allow inbound TCP traffic from the Studio
instance security group.
5. If you don't want to manually call the Amazon EMR connection command, install a SageMaker
configuration file in your local Amazon EFS so you can select the role to use when you choose your
Amazon EMR cluster. For details about how to preload your IAM roles, see Preload your execution roles
into Studio (p. 1188).


Configure Lake Formation access


When you access data from data lakes managed by AWS Lake Formation, you can enforce table-level
and column-level access using policies attached to your runtime role. To configure permission for Lake
Formation access, see Integrate Amazon EMR with AWS Lake Formation.

Preload your execution roles into Studio


If you don't want to manually call the Amazon EMR connection command, you can install a SageMaker
configuration file in your local Amazon EFS so you can select the execution role to use when you choose
your Amazon EMR cluster.

To write a configuration file for the Amazon EMR execution roles, associate a Lifecycle Configuration
(LCC) with the Jupyter server application (see Use Lifecycle Configurations with Amazon SageMaker
Studio (p. 182)). Alternatively, you can write or update the configuration file and restart the Jupyter
server with the command: restart-jupyter-server.

The following snippet is an example LCC bash script you can apply if your Studio application and cluster
are in the same account:

#!/bin/bash

set -eux

FILE_DIRECTORY="/home/sagemaker-user/.sagemaker-analytics-configuration-DO_NOT_DELETE"
FILE_NAME="emr-configurations-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat << 'EOF' > "$FILE"


{
"emr-execution-role-arns":
{
"123456789012": [
"arn:aws:iam::123456789012:role/emr-execution-role-1",
"arn:aws:iam::123456789012:role/emr-execution-role-2"
]
}
}
EOF

If your Studio application and clusters are in different accounts, specify the Amazon EMR execution roles
that can use the cluster. In the following example, 123456789012 is the account ID of the Amazon EMR
cluster account, and 212121212121 and 434343434343 are the account IDs of the accounts that own the
permitted Amazon EMR execution roles.

#!/bin/bash

set -eux

FILE_DIRECTORY="/home/sagemaker-user/.sagemaker-analytics-configuration-DO_NOT_DELETE"
FILE_NAME="emr-configurations-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat << 'EOF' > "$FILE"


{
"emr-execution-role-arns":
{
"123456789012": [
"arn:aws:iam::212121212121:role/emr-execution-role-1",

1188
Amazon SageMaker Developer Guide
Prepare data using Amazon EMR

"arn:aws:iam::434343434343:role/emr-execution-role-2"
]
}
}
EOF

# add your cross-account EMR access role


FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat << 'EOF' > "$FILE"


{
"123456789012": "arn:aws:iam::123456789012:role/cross-account-emr-access-role"
}
EOF

Terminate an Amazon EMR cluster from Studio


The following procedure shows how to terminate an Amazon EMR cluster from SageMaker Studio.

To terminate a cluster in a Running state, navigate to the list of available Amazon EMR
clusters.

1. In SageMaker Studio, select the Home icon in the left-side panel of the Studio UI, then select the Data
node in the navigation menu.
2. Navigate down to the Clusters node. This opens up a page listing the Amazon EMR clusters that you
can access from SageMaker Studio.
3. Select the name of the cluster that you want to terminate, then choose Terminate.
4. This opens up a confirmation window informing you that any pending work or data on your cluster
will be lost permanently after termination. Confirm by choosing Terminate again.
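If you prefer to script this step instead of using the Studio UI, the following is a minimal sketch using the
boto3 Amazon EMR client; the cluster ID j-XXXXXXXXXXXXX is a placeholder.

import boto3

emr = boto3.client("emr")

# Terminating a cluster permanently discards any pending work and data on it.
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])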

Access Spark UI from Studio


The following sections give instructions for accessing the Spark UI from SageMaker Studio notebooks.
The Spark UI allows you to monitor and debug Spark jobs submitted to run on Amazon EMR from
Studio notebooks. SSH tunneling and presigned URLs are the two ways to access the Spark UI.

Set up SSH tunneling for Spark UI access


To set up SSH tunneling to access the Spark UI, follow one of the two options in this section.

Options for setting up SSH tunneling:

• Option 1: Set up an SSH tunnel to the master node using local port forwarding
• Option 2, part 1: Set up an SSH tunnel to the master node using dynamic port forwarding
• Option 2, part 2: Configure proxy settings to view websites hosted on the master node

For information about viewing web interfaces hosted on Amazon EMR clusters, see View web interfaces
hosted on Amazon EMR Clusters. You can also visit your Amazon EMR console to get access to the Spark
UI.
Note
You can set up an SSH tunnel even if presigned URLs are not available to you.


Presigned URLs
To create one-click URLs that can access the Spark UI on Amazon EMR from SageMaker Studio notebooks,
you must enable the following IAM permissions. Choose the option that applies to you:

• For Amazon EMR clusters that are in the same account as the SageMaker Studio notebook: Add the
following permissions to the SageMaker Studio IAM execution role.
• For Amazon EMR clusters that are in a different account than the SageMaker Studio notebook: Add the
following permissions to the cross-account role that you created for Discover Amazon EMR clusters
from SageMaker Studio (p. 1180).

Note
You can access presigned URLs from the console in the following regions:

• US East (N. Virginia) Region
• US West (N. California) Region
• Canada (Central) Region
• Europe (Frankfurt) Region
• Europe (Stockholm) Region
• Europe (Ireland) Region
• Europe (London) Region
• Europe (Paris) Region
• Asia Pacific (Tokyo) Region
• Asia Pacific (Seoul) Region
• Asia Pacific (Sydney) Region
• Asia Pacific (Mumbai) Region
• Asia Pacific (Singapore) Region
• South America (São Paulo) Region

The following policy gives access to presigned URLs for your execution role.

{
    "Sid": "AllowPresignedUrl",
    "Effect": "Allow",
    "Action": [
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:CreatePersistentAppUI",
        "elasticmapreduce:DescribePersistentAppUI",
        "elasticmapreduce:GetPersistentAppUIPresignedURL",
        "elasticmapreduce:GetOnClusterAppUIPresignedURL"
    ],
    "Resource": [
        "arn:aws:elasticmapreduce:region:account-id:cluster/*"
    ]
}
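As a rough sketch of what these permissions enable, a recent version of the boto3 Amazon EMR client can
request a presigned URL for an on-cluster application UI. The cluster ID below is a placeholder, and the
method and response field names should be verified against your boto3 version.

import boto3

emr = boto3.client("emr")

# Request a one-click presigned URL for the Spark UI on the cluster.
response = emr.get_on_cluster_app_ui_presigned_url(
    ClusterId="j-XXXXXXXXXXXXX",
    OnClusterAppUIType="SparkHistoryServer",
)
print(response["PresignedURL"])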

Walkthroughs and whitepapers


The following blogs use a case study of sentiment prediction for a movie review to illustrate the process
of executing a complete machine learning workflow. This includes data preparation, monitoring Spark
jobs, and training and deploying an ML model to get predictions directly from your Studio notebook.


• Create and manage Amazon EMR clusters from SageMaker Studio to run interactive Spark and ML
workloads.
• To extend the use case to a cross-account configuration where SageMaker Studio and your Amazon
EMR cluster are deployed in separate AWS accounts, see Create and manage Amazon EMR clusters
from SageMaker Studio to run interactive Spark and ML workloads - Part 2.

See also:

• A walkthrough of the configuration of Access Apache Livy using a Network Load Balancer on a
Kerberos-enabled Amazon EMR cluster.
• AWS whitepapers for SageMaker Studio best practices.

Additional configuration for cross-account use cases (for administrators)

To enable cluster discovery across accounts, administrators need to provide the ARN of an assumable
cross-account IAM role to the execution role of SageMaker Studio. SageMaker Studio's execution role
assumes that remote role to discover and connect to Amazon EMR clusters in the trusting account.
The ARN of this assumable cross-account role is loaded by the Studio Jupyter server at launch.

You can specify this information in two ways.

• Write this remote role in a file named emr-discovery-iam-role-arns-DO_NOT_DELETE.json placed in
the directory .cross-account-configuration-DO_NOT_DELETE in your home directory on the Amazon
EFS storage volume used by SageMaker Studio.
• Alternatively, you can automate this process by using Lifecycle Configuration (LCC) scripts. You can
attach the LCC to your Domain or a specific user profile. You have the option to set your LCC script to
run by default when your Jupyter server starts. The LCC script that you use must be a JupyterServer
configuration. For more information on how to create and use your LCC script and how to attach it at
the Domain and UserProfile level, see Use Lifecycle Configurations with Studio.

The following is an example LCC script. To modify the script, replace ASSUMABLE-ROLE and emr-
account with your role name and remote account ID, respectively. The number of cross-account roles is
limited to five.

#!/bin/bash

# This script creates the file that informs SageMaker Studio that the role
# "arn:aws:iam::emr-account:role/ASSUMABLE-ROLE" in remote account "emr-account" must be
# assumed to list and describe Amazon EMR clusters in the remote account.

set -eux

FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat > "$FILE" <<- "EOF"
{
    "emr-account": "arn:aws:iam::emr-account:role/ASSUMABLE-ROLE"
}
EOF


After the LCC runs and the files are written, the server reads the file /home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE/emr-discovery-iam-role-arns-DO_NOT_DELETE.json and stores that cross-account ARN.

Prepare data using AWS Glue Interactive Sessions


AWS Glue Interactive Sessions is a serverless service that equips you with the tools to collect, transform,
cleanse, and prepare the data that will populate your data lakes and pipelines. Glue Interactive Sessions
provides an on-demand, serverless Apache Spark runtime environment that data scientists and engineers
can use to rapidly build, test, and run data preparation and analytics applications.

Starting a Glue interactive session from a SageMaker Studio notebook is simple. When you create
your Studio notebook, choose the built-in Glue PySpark or Glue Spark kernel and start coding in
your interactive, serverless Spark session in just seconds. You don’t have to worry about provisioning or
managing complex compute cluster infrastructure. After initialization, you can quickly browse the Glue
data catalog, run large queries, and interactively analyze and prepare data using Spark, all within your
Studio notebook. You can then use the prepared data to build, train, tune, and deploy models using the
purpose-built ML tools within SageMaker Studio.

Before you start your AWS Glue interactive session in SageMaker Studio, you need to set the appropriate
roles and policies. You may also need access to additional resources, such as Amazon S3, which may
require additional policies. For more information about required and additional IAM policies, see
Permissions for AWS Glue Interactive Sessions in SageMaker Studio (p. 1193).

SageMaker Studio provides a default configuration for your AWS Glue interactive session, but you
can use Glue’s full catalog of Jupyter magic commands to further customize your environment. For
information about the default and additional Jupyter magics that you can use in your Glue interactive
session, see Configure your Glue interactive session in SageMaker Studio (p. 1194).

The supported images and kernels for connecting to a Glue interactive session are as follows:

• Images: SparkAnalytics 1.0, SparkAnalytics 2.0
• Kernels: Glue Python [PySpark and Ray] and Glue Spark

Prerequisites:

The SparkAnalytics image that you select to launch your Glue session in Studio is a combination of two
frameworks - the SparkMagic framework (used with Amazon EMR), and AWS Glue. For this reason, the
prerequisites for both frameworks apply. However, you do not have to set up the EMR cluster if you
only plan to use Glue Interactive Sessions. Before you start your first Glue interactive session in Studio,
complete the following:

• Complete the prerequisites required to use the SparkMagic image. For a list of the prerequisites, see
the Prerequisites section in Prepare Data at Scale with Studio Notebooks.
• Create an execution role with permissions for both AWS Glue and SageMaker Studio. Add the managed
policy AwsGlueSessionUserRestrictedServiceRole, and create a custom policy that includes
permissions sts:GetCallerIdentity, iam:GetRole, and iam:PassRole. For instructions
about how to create the necessary permissions, see Permissions for AWS Glue Interactive Sessions in
SageMaker Studio (p. 1193).
• Create a SageMaker domain with the execution role you created. For instructions about how to create
a domain, see Onboard to Amazon SageMaker Domain Using IAM (p. 43).

Get Started with AWS Glue Interactive Sessions


This guide will show you how to set up the necessary permissions, launch your first AWS Glue interactive
session in Studio, and manage your environment with Jupyter magics.


Permissions for AWS Glue Interactive Sessions in SageMaker Studio


This section lists the required policies to run AWS Glue interactive sessions in Studio and explains how to
set them up. In particular, it details how to:

• Attach the AwsGlueSessionUserRestrictedServiceRole managed policy to your SageMaker execution role.
• Create an inline custom policy on your SageMaker execution role.
• Modify the trust relationship of your SageMaker execution role.

To attach the AwsGlueSessionUserRestrictedServiceRole managed policy to your execution role

1. Open the IAM console.


2. Select Roles in the left-side panel.
3. Find the Studio execution role that you will use, and choose the role name to go to the role
summary page.
4. Under the Permissions tab, select Attach policies from the Add Permissions dropdown menu.
5. Select the checkbox next to the managed policy AwsGlueSessionUserRestrictedServiceRole.
6. Choose Attach policies.

The summary page shows your newly-added managed policies.

To create the inline custom policy on your execution role

1. Select Create inline policy in the Add Permissions dropdown menu.


2. Select the JSON tab.
3. Copy and paste in the following policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "unique_statement_id",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole",
                "sts:GetCallerIdentity"
            ],
            "Resource": "*"
        }
    ]
}

4. Choose Review policy.


5. Enter a Name and choose Create policy.

The summary page shows your newly-added custom policy.


To modify the trust relationship of your execution role

1. Select the Trust relationships tab.


2. Choose Edit trust policy.
3. Copy and paste in the following policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "glue.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

4. Choose Update policy.

You can add additional roles and policies if you need to access other AWS resources. For a description
of the additional roles and policies you can include, see Interactive sessions with IAM in the AWS Glue
documentation.

Launch your Glue interactive session on SageMaker Studio


After you create the roles, policies, and SageMaker domain, you can launch your Glue interactive session
in SageMaker Studio.

To launch Glue in SageMaker Studio

1. Create a SageMaker domain. For instructions on how to create a new domain, see Onboard to
Amazon SageMaker Domain (p. 37).
2. Sign in to the SageMaker console at https://console.aws.amazon.com/sagemaker/.
3. Select Control Panel in the left-side panel.
4. In the Launch App dropdown menu next to the user name, select Studio.
5. In the Jupyter view, choose File, then New, then Notebook.
6. In the Image dropdown menu, select SparkAnalytics 1.0 or SparkAnalytics 2.0. In the kernel
dropdown menu, select Glue Spark or Glue Python [PySpark and Ray]. Choose Select.
7. (optional) Use Jupyter magics to customize your environment. For more information about Jupyter
magics, see Configure your Glue interactive session in SageMaker Studio (p. 1194).
8. Start writing your Spark data processing scripts.

Configure your Glue interactive session in SageMaker Studio


You can use Jupyter magics in your AWS Glue interactive session to modify your session and
configuration parameters. Magics are short commands prefixed with % at the start of Jupyter cells that
provide a quick and easy way to help you control your environment. In your Glue interactive session, the
following magics are set for you by default:


Magic                 Default value

%glue_version         3.0
%iam_role             the execution role attached to your SageMaker domain
%region               your AWS Region

You can use magics to further customize your environment. For example, if you want to change
the number of workers allocated to your job from the default five to 10, you can specify
%number_of_workers 10. If you want to configure your session to stop after 10 minutes of idle time
instead of the default 2880, you can specify %idle_timeout 10.
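For example, the following cell, run at the top of your notebook before any Spark code, applies both of
these illustrative settings:

%number_of_workers 10
%idle_timeout 10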

All of the Jupyter magics currently available in AWS Glue are also available in SageMaker Studio. For the
complete list of AWS Glue magics available, see Configuring AWS Glue interactive sessions for Jupyter
and AWS Glue Studio notebooks.

AWS Glue Interactive Session Pricing


When you use AWS Glue Interactive Sessions on SageMaker Studio notebooks, you are charged
separately for resource usage on AWS Glue and Studio notebooks.

AWS charges for Glue Interactive Sessions based on how long the session is active and the number of
Data Processing Units (DPU) used. You are charged an hourly rate for the number of DPUs used to run
your workloads, billed in increments of one second. Glue Interactive Sessions assigns a default of five
DPUs and requires a minimum of two DPUs. There is also a one-minute minimum billing duration for
each interactive session. To see the AWS Glue rates and pricing examples, or to estimate your costs using
the AWS Pricing Calculator, see AWS Glue pricing.
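As a worked example, at an illustrative rate of $0.44 per DPU-hour, a session that runs the default five
DPUs for 10 minutes would cost about 5 × (10/60) × $0.44 ≈ $0.37; check AWS Glue pricing for the actual
rate in your Region.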

Your SageMaker Studio notebook runs on an Amazon EC2 instance, and you are charged for the instance
type you choose, based on the duration of use. Studio assigns you a default EC2 instance type of
ml.t3.medium when you select the SparkAnalytics image and associated kernel. You can change the
instance type of your Studio notebook to suit your workload. For information about SageMaker
Studio pricing, see Amazon SageMaker Pricing.


Process Data
To analyze data and evaluate machine learning models on Amazon SageMaker, use Amazon SageMaker
Processing. With Processing, you can use a simplified, managed experience on SageMaker to run your
data processing workloads, such as feature engineering, data validation, model evaluation, and model
interpretation. You can also use the Amazon SageMaker Processing APIs during the experimentation
phase and after the code is deployed in production to evaluate performance.

The preceding diagram shows how Amazon SageMaker spins up a Processing job. Amazon SageMaker
takes your script, copies your data from Amazon Simple Storage Service (Amazon S3), and then pulls a
processing container. The processing container image can either be an Amazon SageMaker built-in image
or a custom image that you provide. The underlying infrastructure for a Processing job is fully managed
by Amazon SageMaker. Cluster resources are provisioned for the duration of your job, and cleaned up
when a job completes. The output of the Processing job is stored in the Amazon S3 bucket you specified.
Note
Your input data must be stored in an Amazon S3 bucket. Alternatively, you can use Amazon
Athena or Amazon Redshift as input sources.
Tip
To learn best practices for distributed computing of machine learning (ML) training and
processing jobs in general, see Distributed computing with SageMaker best practices (p. 1944).

Use Amazon SageMaker Processing Sample Notebooks
We provide two sample Jupyter notebooks that show how to perform data preprocessing, model
evaluation, or both.

For a sample notebook that shows how to run scikit-learn scripts to perform data preprocessing
and model training and evaluation with the SageMaker Python SDK for Processing, see scikit-learn
Processing. This notebook also shows how to use your own custom container to run processing
workloads with your Python libraries and other specific dependencies.

For a sample notebook that shows how to use Amazon SageMaker Processing to perform distributed
data preprocessing with Spark, see Distributed Processing (Spark). This notebook also shows how to train
a regression model using XGBoost on the preprocessed dataset.

For instructions on how to create and access Jupyter notebook instances that you can use to run these
samples in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, choose the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.

Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics
Amazon SageMaker Processing provides Amazon CloudWatch logs and metrics to monitor processing
jobs. CloudWatch provides CPU, GPU, memory, GPU memory, and disk metrics, and event logging. For
more information, see Monitor Amazon SageMaker with Amazon CloudWatch (p. 3271) and Log Amazon
SageMaker Events with Amazon CloudWatch (p. 3284).

Data Processing with Apache Spark


Apache Spark is a unified analytics engine for large-scale data processing. Amazon SageMaker
provides prebuilt Docker images that include Apache Spark and other dependencies needed to run
distributed data processing jobs. With the Amazon SageMaker Python SDK, you can easily apply data
transformations and extract features (feature engineering) using the Spark framework. For information
about using the SageMaker Python SDK to run Spark processing jobs, see Data Processing with Spark in
the Amazon SageMaker Python SDK.

A code repository that contains the source code and Dockerfiles for the Spark images is available on
GitHub.

Running a Spark Processing Job


You can use the sagemaker.spark.PySparkProcessor or
sagemaker.spark.SparkJarProcessor class to run your Spark application inside of a processing job.
Note that you can set MaxRuntimeInSeconds to a maximum runtime limit of 5 days. With respect to
execution time and the number of instances used, simple Spark workloads see a near-linear relationship
between the number of instances and the time to completion.

The following code example shows how to run a processing job that invokes your PySpark script
preprocess.py.

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
base_job_name="spark-preprocessor",
framework_version="2.4",
role=role,
instance_count=2,
instance_type="ml.m5.xlarge",
max_runtime_in_seconds=1200,
)

spark_processor.run(
submit_app="preprocess.py",
arguments=['s3_input_bucket', bucket,
's3_input_key_prefix', input_prefix,
's3_output_bucket', bucket,
's3_output_key_prefix', output_prefix]
)


For an in-depth look, see the Distributed Data Processing with Apache Spark and SageMaker Processing
example notebook.

If you are not using the Amazon SageMaker Python SDK and one of its Processor classes to retrieve the
pre-built images, you can retrieve these images yourself. The SageMaker prebuilt Docker images are
stored in Amazon Elastic Container Registry (Amazon ECR). For a complete list of the available pre-built
Docker images, see the available images document.

To learn more about using the SageMaker Python SDK with Processing containers, see Amazon
SageMaker Python SDK.

Data Processing with scikit-learn


For a sample notebook that shows how to run scikit-learn scripts using a Docker image provided and
maintained by SageMaker to preprocess data and evaluate models, see scikit-learn Processing. To use
this notebook, you need to install the SageMaker Python SDK for Processing.

This notebook runs a processing job using the SKLearnProcessor class from the SageMaker Python
SDK to run a scikit-learn script that you provide. The script preprocesses data, trains a model using a
SageMaker training job, and then runs a processing job to evaluate the trained model. The processing job
estimates how the model is expected to perform in production.

To learn more about using the SageMaker Python SDK with Processing containers, see the SageMaker
Python SDK. For a complete list of pre-built Docker images available for processing jobs, see Docker
Registry Paths and Example Code.

The following code example shows how the notebook uses SKLearnProcessor to run your own scikit-
learn script using a Docker image provided and maintained by SageMaker, instead of your own Docker
image.

from sagemaker.sklearn.processing import SKLearnProcessor


from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
role=role,
instance_type='ml.m5.xlarge',
instance_count=1)

sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                          source='s3://path/to/my/input-data.csv',
                          destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                               ProcessingOutput(source='/opt/ml/processing/output/validation'),
                               ProcessingOutput(source='/opt/ml/processing/output/test')]
                      )

To process data in parallel using Scikit-Learn on Amazon SageMaker Processing, you can shard
input objects by S3 key by setting s3_data_distribution_type='ShardedByS3Key' inside a
ProcessingInput so that each instance receives about the same number of input objects.
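The following is a minimal sketch of such a sharded input; the S3 prefix is a placeholder.

from sagemaker.processing import ProcessingInput

# Each processing instance receives a roughly equal share of the objects
# under the S3 prefix instead of a full copy of the dataset.
sharded_input = ProcessingInput(
    source='s3://path/to/my/input-data/',
    destination='/opt/ml/processing/input',
    s3_data_distribution_type='ShardedByS3Key'
)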

Data Processing with Framework Processors


A FrameworkProcessor can run Processing jobs with a specified machine learning framework,
providing you with an Amazon SageMaker–managed container for whichever machine learning
framework you choose. FrameworkProcessor provides premade containers for the following machine
learning frameworks: Hugging Face, MXNet, PyTorch, TensorFlow, and XGBoost.

The FrameworkProcessor class also provides you with customization over the container configuration.
The FrameworkProcessor class supports specifying a source directory source_dir for your
processing scripts and dependencies. With this capability, you can give the processor access to multiple
scripts in a directory instead of only specifying one script. FrameworkProcessor also supports
including a requirements.txt file in the source_dir for customizing the Python libraries to install in
the container.

For more information on the FrameworkProcessor class and its methods and parameters, see
FrameworkProcessor in the Amazon SageMaker Python SDK.

To see examples of using a FrameworkProcessor for each of the supported machine learning
frameworks, see the following topics.

Topics
• Hugging Face Framework Processor (p. 1199)
• MXNet Framework Processor (p. 1200)
• PyTorch Framework Processor (p. 1201)
• TensorFlow Framework Processor (p. 1202)
• XGBoost Framework Processor (p. 1203)

Hugging Face Framework Processor


Hugging Face is an open-source provider of natural language processing (NLP) models. The
HuggingFaceProcessor in the Amazon SageMaker Python SDK provides you with the ability to
run processing jobs with Hugging Face scripts. When you use the HuggingFaceProcessor, you can
leverage an Amazon-built Docker container with a managed Hugging Face environment so that you don't
need to bring your own container.

The following code example shows how you can use the HuggingFaceProcessor to run your
Processing job using a Docker image provided and maintained by SageMaker. Note that when you
run the job, you can specify a directory containing your scripts and dependencies in the source_dir
argument, and you can have a requirements.txt file located inside your source_dir directory that
specifies the dependencies for your processing script(s). SageMaker Processing installs the dependencies
in requirements.txt in the container for you.

from sagemaker.huggingface import HuggingFaceProcessor


from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the HuggingFaceProcessor


hfp = HuggingFaceProcessor(
role=get_execution_role(),
instance_count=1,
instance_type='ml.g4dn.xlarge',
transformers_version='4.4.2',
pytorch_version='1.6.0',
base_job_name='frameworkprocessor-hf'
)

#Run the processing job


hfp.run(
code='processing-script.py',
source_dir='scripts',
inputs=[
ProcessingInput(
input_name='data',
source=f's3://{BUCKET}/{S3_INPUT_PATH}',
destination='/opt/ml/processing/input/data/'
)
],
outputs=[
ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train/',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test/',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
ProcessingOutput(output_name='val', source='/opt/ml/processing/output/val/',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
]
)

If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the HuggingFaceProcessor class, see Hugging Face
Estimator in the Amazon SageMaker Python SDK.

MXNet Framework Processor


Apache MXNet is an open-source deep learning framework commonly used for training and deploying
neural networks. The MXNetProcessor in the Amazon SageMaker Python SDK provides you with the
ability to run processing jobs with MXNet scripts. When you use the MXNetProcessor, you can leverage
an Amazon-built Docker container with a managed MXNet environment so that you don’t need to bring
your own container.

The following code example shows how you can use the MXNetProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.

from sagemaker.mxnet import MXNetProcessor


from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the MXNetProcessor


mxp = MXNetProcessor(
framework_version='1.8.0',
py_version='py37',
role=get_execution_role(),
instance_count=1,
instance_type='ml.c5.xlarge',
base_job_name='frameworkprocessor-mxnet'
)

#Run the processing job


mxp.run(
code='processing-script.py',
source_dir='scripts',
inputs=[
ProcessingInput(
input_name='data',
source=f's3://{BUCKET}/{S3_INPUT_PATH}',
destination='/opt/ml/processing/input/data/'
)


],
outputs=[
ProcessingOutput(
output_name='processed_data',
source='/opt/ml/processing/output/',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
)
]
)

If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the MXNetProcessor class, see MXNet Estimator in the
Amazon SageMaker Python SDK.

PyTorch Framework Processor


PyTorch is an open-source machine learning framework. The PyTorchProcessor in the Amazon
SageMaker Python SDK provides you with the ability to run processing jobs with PyTorch scripts. When
you use the PyTorchProcessor, you can leverage an Amazon-built Docker container with a managed
PyTorch environment so that you don’t need to bring your own container.

The following code example shows how you can use the PyTorchProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.

from sagemaker.pytorch.processing import PyTorchProcessor


from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the PyTorchProcessor


pytorch_processor = PyTorchProcessor(
framework_version='1.8',
role=get_execution_role(),
instance_type='ml.m5.xlarge',
instance_count=1,
base_job_name='frameworkprocessor-PT'
)

#Run the processing job


pytorch_processor.run(
code='processing-script.py',
source_dir='scripts',
inputs=[
ProcessingInput(
input_name='data',
source=f's3://{BUCKET}/{S3_INPUT_PATH}',
destination='/opt/ml/processing/input'
)
],
outputs=[
    ProcessingOutput(output_name='data_structured', source='/opt/ml/processing/tmp/data_structured',
                     destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
    ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train',
                     destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
    ProcessingOutput(output_name='validation', source='/opt/ml/processing/output/val',
                     destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
    ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test',
                     destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
    ProcessingOutput(output_name='logs', source='/opt/ml/processing/logs',
                     destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
]
)

If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the PyTorchProcessor class, see PyTorch Estimator in
the Amazon SageMaker Python SDK.

TensorFlow Framework Processor


TensorFlow is an open-source machine learning and artificial intelligence library. The
TensorFlowProcessor in the Amazon SageMaker Python SDK provides you with the ability to run
processing jobs with TensorFlow scripts. When you use the TensorFlowProcessor, you can leverage an
Amazon-built Docker container with a managed TensorFlow environment so that you don’t need to bring
your own container.

The following code example shows how you can use the TensorFlowProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.

from sagemaker.tensorflow import TensorFlowProcessor


from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the TensorFlowProcessor


tp = TensorFlowProcessor(
framework_version='2.3',
role=get_execution_role(),
instance_type='ml.m5.xlarge',
instance_count=1,
base_job_name='frameworkprocessor-TF',
py_version='py37'
)

#Run the processing job


tp.run(
code='processing-script.py',
source_dir='scripts',
inputs=[
ProcessingInput(
input_name='data',
source=f's3://{BUCKET}/{S3_INPUT_PATH}',
destination='/opt/ml/processing/input/data'
),
ProcessingInput(
input_name='model',
source=f's3://{BUCKET}/{S3_PATH_TO_MODEL}',
destination='/opt/ml/processing/input/model'
)
],
outputs=[
ProcessingOutput(
output_name='predictions',
source='/opt/ml/processing/output',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
)
]
)

If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use
an Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory
you specify for source_dir. To learn more about the TensorFlowProcessor class, see TensorFlow
Estimator in the Amazon SageMaker Python SDK.

XGBoost Framework Processor


XGBoost is an open-source machine learning framework. The XGBoostProcessor in the Amazon
SageMaker Python SDK provides you with the ability to run processing jobs with XGBoost scripts. When
you use the XGBoostProcessor, you can leverage an Amazon-built Docker container with a managed
XGBoost environment so that you don’t need to bring your own container.

The following code example shows how you can use the XGBoostProcessor to run your Processing
job using a Docker image provided and maintained by SageMaker. Note that when you run the job,
you can specify a directory containing your scripts and dependencies in the source_dir argument,
and you can have a requirements.txt file located inside your source_dir directory that specifies
the dependencies for your processing script(s). SageMaker Processing installs the dependencies in
requirements.txt in the container for you.

from sagemaker.xgboost import XGBoostProcessor


from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the XGBoostProcessor


xgb = XGBoostProcessor(
framework_version='1.2-2',
role=get_execution_role(),
instance_type='ml.m5.xlarge',
instance_count=1,
base_job_name='frameworkprocessor-XGB',
)

#Run the processing job


xgb.run(
code='processing-script.py',
source_dir='scripts',
inputs=[
ProcessingInput(
input_name='data',
source=f's3://{BUCKET}/{S3_INPUT_PATH}',
destination='/opt/ml/processing/input/data'
)
],
outputs=[
ProcessingOutput(
output_name='processed_data',
source='/opt/ml/processing/output/',
destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
)
]
)

If you have a requirements.txt file, it should be a list of libraries you want to install in the container.
The path for source_dir can be a relative, absolute, or Amazon S3 URI path. However, if you use an
Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you
specify for source_dir. To learn more about the XGBoostProcessor class, see XGBoost Estimator in
the Amazon SageMaker Python SDK.

Use Your Own Processing Code


You can install libraries to run your scripts in your own processing container or, in a more advanced
scenario, you can build your own processing container that satisfies the contract to run in Amazon
SageMaker. For more information about containers in SageMaker, see Using Docker containers with
SageMaker (p. 2668). For a formal specification that defines the contract for an Amazon SageMaker
Processing container, see Build Your Own Processing Container (Advanced Scenario) (p. 1205).

Topics
• Run Scripts with Your Own Processing Container (p. 1204)
• Build Your Own Processing Container (Advanced Scenario) (p. 1205)

Run Scripts with Your Own Processing Container


You can use scikit-learn scripts to preprocess data and evaluate your models. To see how to run scikit-
learn scripts to perform these tasks, see the scikit-learn Processing sample notebook. This notebook uses
the ScriptProcessor class from the Amazon SageMaker Python SDK for Processing.

The following example shows a general workflow for using a ScriptProcessor class with your own
processing container. The workflow shows how to create your own image, build your container, and use
a ScriptProcessor class to run a Python preprocessing script with the container. The processing job
processes your input data and saves the processed data in Amazon Simple Storage Service (Amazon S3).

Before using the following examples, you need to have your own input data and a Python script
prepared to process your data. For an end-to-end, guided example of this process, refer back to the
scikit-learn Processing sample notebook.

1. Create a Docker directory and add the Dockerfile used to create the processing container. Install
pandas and scikit-learn into it. (You could also install your own dependencies with a similar RUN
command.)

mkdir docker

%%writefile docker/Dockerfile

FROM python:3.7-slim-buster

RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3


ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

2. Build the container using the docker command, create an Amazon Elastic Container Registry (Amazon
ECR) repository, and push the image to Amazon ECR.

import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.Session().region_name
ecr_repository = 'sagemaker-processing-container'


tag = ':latest'
processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)

# Create ECR repository and push docker image


!docker build -t $ecr_repository docker
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

3. Set up the ScriptProcessor from the SageMaker Python SDK to run the script. Replace image_uri
with the URI for the image you created, and replace role_arn with the ARN for an AWS Identity and
Access Management role that has access to your target Amazon S3 bucket.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
image_uri='image_uri',
role='role_arn',
instance_count=1,
instance_type='ml.m5.xlarge')

4. Run the script. Replace preprocessing.py with the name of your own Python processing script, and
replace s3://path/to/my/input-data.csv with the Amazon S3 path to your input data.

script_processor.run(code='preprocessing.py',
                     inputs=[ProcessingInput(
                         source='s3://path/to/my/input-data.csv',
                         destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                              ProcessingOutput(source='/opt/ml/processing/output/validation'),
                              ProcessingOutput(source='/opt/ml/processing/output/test')])

You can use the same procedure with any other library or system dependencies. You can also use existing
Docker images. This includes images that you run on other platforms such as Kubernetes.

Build Your Own Processing Container (Advanced Scenario)
You can provide Amazon SageMaker Processing with a Docker image that has your own code and
dependencies to run your data processing, feature engineering, and model evaluation workloads.

The following example of a Dockerfile builds a container with the Python libraries scikit-learn and
pandas, which you can run as a processing job.

FROM python:3.7-slim-buster

# Install scikit-learn and pandas


RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3

# Add a Python script and configure Docker to run it


ADD processing_script.py /
ENTRYPOINT ["python3", "/processing_script.py"]


For an example of a processing script, see Get started with SageMaker Processing.

Build and push this Docker image to an Amazon Elastic Container Registry (Amazon ECR) repository and
ensure that your SageMaker IAM role can pull the image from Amazon ECR. Then you can run this image
on Amazon SageMaker Processing.

How Amazon SageMaker Processing Runs Your Processing Container Image
Amazon SageMaker Processing runs your processing container image in a similar way as the following
command, where AppSpecification.ImageUri is the Amazon ECR image URI that you specify in a
CreateProcessingJob operation.

docker run [AppSpecification.ImageUri]

This command runs the ENTRYPOINT command configured in your Docker image.

You can also override the entrypoint command in the image or give command-line arguments
to your entrypoint command using the AppSpecification.ContainerEntrypoint and
AppSpecification.ContainerArguments parameters in your CreateProcessingJob request.
Specifying these parameters configures Amazon SageMaker Processing to run the container similar to
the way that the following command does.

docker run --entrypoint [AppSpecification.ContainerEntrypoint] [AppSpecification.ImageUri] [AppSpecification.ContainerArguments]

For example, if you specify the ContainerEntrypoint to be [python3, -v, /processing_script.py]
in your CreateProcessingJob request, and ContainerArguments to be [data-format, csv],
Amazon SageMaker Processing runs your container with the following command.

python3 -v /processing_script.py data-format csv
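For reference, the following is a minimal boto3 sketch of such a request; the job name, image URI, and
role ARN are placeholders.

import boto3

sm = boto3.client("sagemaker")

# Override the image's ENTRYPOINT and pass command-line arguments.
sm.create_processing_job(
    ProcessingJobName="my-processing-job",
    AppSpecification={
        "ImageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
        "ContainerEntrypoint": ["python3", "-v", "/processing_script.py"],
        "ContainerArguments": ["data-format", "csv"],
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerProcessingRole",
)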

When building your processing container, consider the following details:

• Amazon SageMaker Processing decides whether the job completes or fails depending on the exit code
of the command run. A processing job completes if all of the processing containers exit successfully
with an exit code of 0, and fails if any of the containers exits with a non-zero exit code.
• Amazon SageMaker Processing lets you override the processing container's entrypoint and set
command-line arguments just like you can with the Docker API. Docker images can also configure
the entrypoint and command-line arguments using the ENTRYPOINT and CMD instructions. The
way CreateProcessingJob's ContainerEntrypoint and ContainerArguments parameters
configure a Docker image's entrypoint and arguments mirrors how Docker overrides the entrypoint
and arguments through the Docker API:
• If neither ContainerEntrypoint nor ContainerArguments are provided, Processing uses the
default ENTRYPOINT or CMD in the image.
• If ContainerEntrypoint is provided, but not ContainerArguments, Processing runs the image
with the given entrypoint, and ignores the ENTRYPOINT and CMD in the image.
• If ContainerArguments is provided, but not ContainerEntrypoint, Processing runs the image
with the default ENTRYPOINT in the image and with the provided arguments.
• If both ContainerEntrypoint and ContainerArguments are provided, Processing runs the
image with the given entrypoint and arguments, and ignores the ENTRYPOINT and CMD in the
image.


• You must use the exec form of the ENTRYPOINT instruction in your Dockerfile (ENTRYPOINT
["executable", "param1", "param2"]) instead of the shell form (ENTRYPOINT command
param1 param2). This lets your processing container receive SIGINT and SIGKILL signals, which
Processing uses to stop processing jobs with the StopProcessingJob API.
• /opt/ml and all its subdirectories are reserved by SageMaker. When building your Processing Docker
image, don't place any data required by your processing container in these directories.
• If you plan to use GPU devices, make sure that your containers are nvidia-docker compatible. Include
only the CUDA toolkit in containers. Don't bundle NVIDIA drivers with the image. For more information
about nvidia-docker, see NVIDIA/nvidia-docker.

How Amazon SageMaker Processing Configures Input and Output For Your Processing Container
When you create a processing job using the CreateProcessingJob operation, you can specify multiple
ProcessingInput and ProcessingOutput values.

You use the ProcessingInput parameter to specify an Amazon Simple Storage Service (Amazon
S3) URI to download data from, and a path in your processing container to download the data to.
The ProcessingOutput parameter configures a path in your processing container from which to
upload data, and where in Amazon S3 to upload that data to. For both ProcessingInput and
ProcessingOutput, the path in the processing container must begin with /opt/ml/processing/.

For example, you might create a processing job with one ProcessingInput parameter that downloads
data from s3://your-data-bucket/path/to/input/csv/data into /opt/ml/processing/csv in your
processing container, and a ProcessingOutput parameter that uploads data from
/opt/ml/processing/processed_csv to s3://your-data-bucket/path/to/output/csv/data.
Your processing job would read the input data and write output data to
/opt/ml/processing/processed_csv. Then it uploads the data written to this path to the specified
Amazon S3 output location.
Important
Symbolic links (symlinks) cannot be used to upload output data to Amazon S3. Symlinks are not
followed when uploading output data.
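Expressed as CreateProcessingJob parameters, the earlier example would look similar to the following
sketch (the bucket paths are the placeholders from that example):

processing_inputs = [
    {
        "InputName": "input-1",
        "S3Input": {
            "S3Uri": "s3://your-data-bucket/path/to/input/csv/data",
            "LocalPath": "/opt/ml/processing/csv",
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
        },
    }
]

processing_output_config = {
    "Outputs": [
        {
            "OutputName": "output-1",
            "S3Output": {
                "LocalPath": "/opt/ml/processing/processed_csv",
                "S3Uri": "s3://your-data-bucket/path/to/output/csv/data",
                "S3UploadMode": "EndOfJob",
            },
        }
    ]
}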

How Amazon SageMaker Processing Provides Logs and Metrics for Your Processing Container
When your processing container writes to stdout or stderr, Amazon SageMaker Processing saves the
output from each processing container and puts it in Amazon CloudWatch logs. For information about
logging, see Log Amazon SageMaker Events with Amazon CloudWatch (p. 3284).

Amazon SageMaker Processing also provides CloudWatch metrics for each instance running your
processing container. For information about metrics, see Monitor Amazon SageMaker with Amazon
CloudWatch (p. 3271).

How Amazon SageMaker Processing Configures Your Processing Container
Amazon SageMaker Processing provides configuration information to your processing container through
environment variables and two JSON files, /opt/ml/config/processingjobconfig.json and
/opt/ml/config/resourceconfig.json, at predefined locations in the container.

When a processing job starts, it uses the environment variables that you specified with
the Environment map in the CreateProcessingJob request. The /opt/ml/config/
processingjobconfig.json file contains information about the hostnames of your processing
containers, and is also specified in the CreateProcessingJob request.

The following example shows the format of the /opt/ml/config/processingjobconfig.json file.

{
    "ProcessingJobArn": "<processing_job_arn>",
    "ProcessingJobName": "<processing_job_name>",
    "AppSpecification": {
        "ImageUri": "<image_uri>",
        "ContainerEntrypoint": null,
        "ContainerArguments": null
    },
    "Environment": {
        "KEY": "VALUE"
    },
    "ProcessingInputs": [
        {
            "InputName": "input-1",
            "S3Input": {
                "LocalPath": "/opt/ml/processing/input/dataset",
                "S3Uri": "<s3_uri>",
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "S3CompressionType": "None",
                "S3DownloadMode": "StartOfJob"
            }
        }
    ],
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "output-1",
                "S3Output": {
                    "LocalPath": "/opt/ml/processing/output/dataset",
                    "S3Uri": "<s3_uri>",
                    "S3UploadMode": "EndOfJob"
                }
            }
        ],
        "KmsKeyId": null
    },
    "ProcessingResources": {
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
            "VolumeKmsKeyId": null
        }
    },
    "RoleArn": "<IAM role>",
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    }
}

The /opt/ml/config/resourceconfig.json file contains information about the hostnames of your
processing containers. Use the following hostnames when creating or running distributed processing
code.

{
    "current_host": "algo-1",
    "hosts": ["algo-1", "algo-2", "algo-3"]
}

Don't use the information about hostnames contained in /etc/hostname or /etc/hosts because it
might be inaccurate.

Hostname information might not be immediately available to the processing container. We recommend
adding a retry policy on hostname resolution operations as nodes become available in the cluster.
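The following is a minimal sketch of such a retry, assuming the file format shown above; the retry count
and delay are illustrative.

import json
import socket
import time

with open("/opt/ml/config/resourceconfig.json") as f:
    resource_config = json.load(f)

# Wait until every peer host resolves before starting distributed work.
for host in resource_config["hosts"]:
    for attempt in range(12):
        try:
            socket.gethostbyname(host)
            break
        except socket.gaierror:
            time.sleep(5)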

Save and Access Metadata Information About Your Processing Job
To save metadata from the processing container after exiting it, containers can write UTF-8 encoded
text to the /opt/ml/output/message file. After the processing job enters any terminal status
("Completed", "Stopped", or "Failed"), the "ExitMessage" field in DescribeProcessingJob
contains the first 1 KB of this file. Access that initial part of the file with a call to DescribeProcessingJob,
which returns it through the ExitMessage parameter. For failed processing jobs, you can use this field
to communicate information about why the processing container failed.
Important
Don't write sensitive data to the /opt/ml/output/message file.

If the data in this file isn't UTF-8 encoded, the job fails and returns a ClientError. If multiple
containers exit with an ExitMessage, the content of the ExitMessage from each processing container
is concatenated, then truncated to 1 KB.
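For example, a container can record a short, UTF-8 encoded failure reason before exiting; the message
text here is illustrative.

# Run inside the processing container before exiting with a non-zero code.
with open("/opt/ml/output/message", "w", encoding="utf-8") as f:
    f.write("Input validation failed: expected column 'label' was not found.")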

Run Your Processing Container Using the SageMaker Python SDK
You can use the SageMaker Python SDK to run your own processing image by using the Processor
class. The following example shows how to run your own processing container with one input from
Amazon Simple Storage Service (Amazon S3) and one output to Amazon S3.

from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(image_uri='<your_ecr_image_uri>',
role=role,
instance_count=1,
instance_type="ml.m5.xlarge")

processor.run(inputs=[ProcessingInput(
source='<s3_uri or local path>',
destination='/opt/ml/processing/input_data')],
outputs=[ProcessingOutput(
source='/opt/ml/processing/processed_data',
destination='<s3_uri>')],
)

Instead of building your processing code into your processing image, you can provide a
ScriptProcessor with your image and the command that you want to run, along with the code
that you want to run inside that container. For an example, see Run Scripts with Your Own Processing
Container (p. 1204).
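
As an illustration, a ScriptProcessor invocation might look like the following sketch, where the image URI, script name, and S3 URIs are placeholders.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Run a script inside your own container image instead of baking the code into it.
script_processor = ScriptProcessor(image_uri='<your_ecr_image_uri>',
                                   command=['python3'],
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.m5.xlarge')

script_processor.run(code='preprocess.py',  # placeholder script name
                     inputs=[ProcessingInput(
                         source='<s3_uri>',
                         destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(
                         source='/opt/ml/processing/output',
                         destination='<s3_uri>')])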

You can also use the scikit-learn image that Amazon SageMaker Processing provides through
SKLearnProcessor to run scikit-learn scripts. For an example, see Data Processing with scikit-
learn (p. 1198).


Create, store, and share features with Amazon SageMaker Feature Store

The machine learning (ML) development process often begins with extracting data signals, also known
as features, from data to train ML models. Amazon SageMaker Feature Store makes it easy for data
scientists, machine learning engineers, and general practitioners to create, share, and manage features
for ML development. Feature Store accelerates this process by reducing repetitive data processing and
curation work required to convert raw data into features for training an ML algorithm.

Further, the processing logic for your data is authored only once, and features generated are used for
both training and inference, reducing the training-serving skew. Feature Store is a centralized store for
features and associated metadata so features can be easily discovered and reused. You can create an
online or an offline store. The online store is used for low latency real-time inference use cases, and the
offline store is used for training and batch inference.

The following diagram shows how you can use Feature Store as part of your machine learning pipeline.
First, you read in your raw data and process it. You can ingest data via streaming to the online and offline
store, or in batches directly to the offline store. You first create a FeatureGroup and configure it to an
online or offline store, or both. Then, you can ingest data into your FeatureGroup and store it in your
store. A FeatureGroup is a group of features that is defined via a schema in Feature Store to describe a
record.

The online store is primarily designed for supporting real-time predictions that need low millisecond
latency reads and high throughput writes. The offline store is primarily intended for batch predictions
and model training. The offline store is an append-only store and can be used to store and access
historical feature data. It can help you store and serve features for exploration and model training.
The online store retains only the latest feature data. Feature groups are mutable and can evolve their
schema after creation.

How Feature Store works


In Feature Store, features are stored in a collection called a feature group. You can visualize a feature
group as a table in which each column is a feature, with a unique identifier for each row. In principle, a
feature group is composed of features and values specific to each feature. A Record is a collection of
values for features that correspond to a unique RecordIdentifier. Altogether, a FeatureGroup is a
group of features defined in your FeatureStore to describe a Record.

You can use Feature Store in the following modes:

• Online – In online mode, features are read with low latency (milliseconds) reads and used for high
throughput predictions. This mode requires a feature group to be stored in an online store.
• Offline – In offline mode, large streams of data are fed to an offline store, which can be used for
training and batch inference. This mode requires a feature group to be stored in an offline store. The
offline store uses your S3 bucket for storage and can also fetch data using Athena queries.
• Online and Offline – This includes both online and offline modes.

You can ingest data into feature groups in Feature Store in two ways: streaming or in batches. When
you ingest data through streaming, a collection of records is pushed to Feature Store by calling a
synchronous PutRecord API call. This API enables you to maintain the latest feature values in Feature
Store and to push new feature values as soon as an update is detected.

Alternatively, Feature Store can process and ingest data in batches. You can author features using
Amazon SageMaker Data Wrangler, create feature groups in Feature Store and ingest features in batches
using a SageMaker Processing job with a notebook exported from Data Wrangler. This mode allows for
batch ingestion into the offline store. It also supports ingestion into the online store if the feature group
is configured for both online and offline use.

Create feature groups


To ingest features into Feature Store, you must first define the feature group and the feature definitions
(feature name and data type) for all features that belong to the feature group. After they are created,
feature groups are mutable and can evolve their schema. Feature group names are unique within an
AWS Region and AWS account. When creating a feature group, you can also create the metadata for the
feature group, such as a short description, storage configuration, features for identifying each record,
and the event time, as well as tags to store information such as the author, data source, version, and
more.
Important
FeatureGroup names or associated metadata, such as the description or tags, should not contain
any personally identifiable information (PII) or confidential information.

Find, discover, and share features


After you create a feature group in Feature Store, other authorized users of the feature store can share
and discover it. Users can browse through a list of all feature groups in Feature Store or discover existing
feature groups by searching by feature group name, description, record identifier name, creation date,
and tags.

Real-time inference for features stored in the online store

With Feature Store, you can enrich your features stored in the online store in real time with data from
a streaming source (clean stream data from another application) and serve the features with low
millisecond latency for real-time inference.


You can also perform joins across different FeatureGroups for real-time inference by querying two
different FeatureGroups in the client application.

Offline store for model training and batch inference

Feature Store provides offline storage for feature values in your S3 bucket. Your data is stored in your S3
bucket using a prefixing scheme based on event time. The offline store is an append-only store, enabling
Feature Store to maintain a historical record of all feature values. Data is stored in the offline store in
Parquet format for optimized storage and query access.

You can query, explore, and visualize features using Data Wrangler from Amazon SageMaker Studio.
Feature Store supports combining data to produce, train, validate, and test data sets, and allows you to
extract data at different points in time.

Feature data ingestion


Feature generation pipelines can be created to process large batches (1 million rows of data or more) or
small batches, and to write feature data to the offline or online store. Streaming sources such as Amazon
Managed Streaming for Apache Kafka or Amazon Kinesis can also be used as data sources from which
features are extracted and directly fed to the online store for training, inference, or feature creation.

You can push records to Feature Store by calling the synchronous PutRecord API call. Because this is a
synchronous API call, it allows small batches of updates to be pushed in a single API call. This enables
you to maintain high freshness of the feature values and publish values as soon as an update is detected.
These are also called streaming features.

When feature data is ingested and updated, Feature Store stores historical data for all features in the
offline store. You can also use Data Wrangler to process and engineer new features that can then be
exported to a chosen S3 bucket to be accessed by Feature Store. For batch ingestion, you can configure a
processing job to batch ingest your data into Feature Store, or you can pull feature values from your S3
bucket using Athena queries.

To remove a Record from your online store, use the DeleteRecord API call. This will also add the
deleted record to the offline store.
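
The following is a minimal sketch of a DeleteRecord call with the AWS SDK for Python (Boto3); the feature group name and record identifier are placeholders, and the epoch-seconds format shown assumes your event time feature accepts it. An EventTime value is required so that the deletion event can be recorded in the offline store.

import time
import boto3

featurestore_runtime = boto3.client('sagemaker-featurestore-runtime')

# Remove the record from the online store; the deletion event is appended
# to the offline store with the supplied event time.
featurestore_runtime.delete_record(
    FeatureGroupName='your-feature-group-name',  # placeholder name
    RecordIdentifierValueAsString='573291',
    EventTime=str(round(time.time()))
)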

Resilience in Feature Store


Feature Store is distributed across multiple Availability Zones (AZs). An AZ is an isolated location within
an AWS Region. If some AZs fail, Feature Store can use other AZs. For more information about AZs, see
Resilience in Amazon SageMaker (p. 3208).

Get started with Amazon SageMaker Feature Store


The following topics give information about using Amazon SageMaker Feature Store. First learn the
Feature Store concepts, how to manage permissions to use Feature Store, how to create and use feature
groups using an Amazon SageMaker Studio Jupyter or JupyterLab notebook, how to use Feature Store in
the Studio User Interface, and how to delete feature groups using the SDK for Python and Studio.

Topics
• Feature Store concepts (p. 1213)
• Adding required policies to your IAM role (p. 1215)
• Create feature groups (p. 1215)
• Use Amazon SageMaker Feature Store with Amazon SageMaker Studio (p. 1227)
• Delete a feature group (p. 1227)

Feature Store concepts


We list common terms used in Amazon SageMaker Feature Store, followed by an example diagram to
visualize a few concepts:

• Feature Store: Storage and data management layer for machine learning (ML) features. Serves as the
single source of truth to store, retrieve, remove, track, share, discover, and control access to features.
In the following example diagram, the Feature Store is a store for your feature groups, which contains
your ML data, and provides additional services.
• Online store: Low latency, high availability store for a feature group that enables real-time lookup of
records. The online store allows quick access to the latest record via the GetRecord API.
• Offline store: Stores historical data in your Amazon S3 bucket. The offline store is used when low (sub-
second) latency reads are not needed. For example, the offline store can be used when you want to
store and serve features for exploration, model training, and batch inference.
• Feature group: The main resource of Feature Store that contains the data and metadata used for
training or predicting with a ML model. A feature group is a logical grouping of features used to
describe records. In the following example diagram, a feature group contains your ML data.
• Feature: A property that is used as one of the inputs to train or predict using your ML model. In the
Feature Store API a feature is an attribute of a record. In the following example diagram, a feature
describes a column in your ML data table.
• Feature definition: Consists of a name and one of the data types: integral, string or fractional. A
feature group contains a list of feature definitions. For more information on Feature Store data types,
see Data types (p. 1266).
• Record: Collection of values for features for a single record identifier. A combination of record
identifier and event time values uniquely identify a record within a feature group. In the following
example diagram, a record is a row in your ML data table.
• Record identifier name: The record identifier name is the name of the feature that identifies
the records. It must refer to one of the names of a feature defined in the feature group's feature
definitions. Each feature group is defined with a record identifier name.
• Event time: Timestamp that you provide corresponding to when the record event occurred. All records
in a feature group must have a corresponding event time. The online store only contains the record
corresponding to the latest event time, whereas the offline store contains all historic records. For more
information on event time formats, see Data types (p. 1266).
• Ingestion: Adding new records to a feature group. Ingestion is typically achieved via the PutRecord
API.

The following example diagram conceptualizes a few Feature Store concepts:


The Feature Store contains your feature groups, and a feature group contains your ML data. In the
example diagram, the feature group contains ML data (a table) with three features (each describing a
column) and two records (rows).

• A feature (a column) is made up of a feature definition, which describes the feature name and the
data type of the feature values associated with records.
• A record (a row) must be uniquely identified by its record identifier (diamond markers) and include
the event time (circle markers) of when the record event occurred.

Ingestion is the action of adding new data to a feature group. Records are added to a feature group
differently, depending on whether you are ingesting into the online store or the offline store. When you
ingest a record whose record identifier does not already exist within the feature group, the record is
added to both stores. When you ingest a record whose record identifier already exists within the feature
group:

• Only the record with the latest event time is kept in the online store.
• All records are kept and act as historical records in the offline store.

Adding required policies to your IAM role


To get started with Amazon SageMaker Feature Store, you must have a role with the required policy,
AmazonSageMakerFeatureStoreAccess, attached to it. Below is a walkthrough on how to view the
policies attached to a role and how to add a policy to your role. For information on how to create a role
or get your execution role in a notebook within SageMaker, see SageMaker Roles (p. 3086).

1. Open the IAM console at https://console.aws.amazon.com/iam/.
2. In the navigation pane on the left of the IAM console, choose Roles.
3. In the search bar enter the role you are using for Amazon SageMaker Feature Store.

For examples on how to find your execution role ARN for a notebook within SageMaker (from the
SageMaker console or Amazon SageMaker Studio), see Get execution role (p. 3086). The role is at
the end of the execution role ARN.
4. After you enter the role in the search bar, choose the role.

Under Permissions policies you can view the policies attached to the role.
5. After you choose the role, choose Add permissions, then choose Attach policies.
6. In the search bar under Other permissions policies enter
AmazonSageMakerFeatureStoreAccess and press enter. If the policy does not show, you may
already have the policy attached, listed under your Current permissions policies.
7. After you press enter, select the check box next to the policy and then choose Add permissions.
8. After you have attached the policy to your role, the policy will appear under Permissions policies for
your IAM role.

Create feature groups


A FeatureGroup is the main Feature Store resource that contains the metadata for all the data stored
in Amazon SageMaker Feature Store. A feature group is a logical grouping of features, defined in the
feature store, to describe records. A feature group’s definition is composed of a list of feature definitions,
a record identifier name, and configurations for its online and offline store. The example code in this
topic uses the SageMaker Python SDK. The underlying APIs are available for developers using other
languages.


Prior to using a feature store, you typically load your dataset, run transformations, and set up your
features for ingestion. This process has a lot of variation and is highly dependent on your data. The
example code in the following topics refers to the Introduction to Feature Store and Fraud Detection with
Amazon SageMaker Feature Store example notebooks, respectively. We recommend that you run these
notebooks in Amazon SageMaker Studio because the code in this guide is conceptual and not fully
functional if copied.

Feature Store supports the following data types: String, Fractional (IEEE 64-bit floating point
value), and Integral (Int64 - 64 bit signed integral value). The default type is set to String. This
means that, if a column in your dataset is not a float or long type, it defaults to String in your
feature store.

You may use a schema to describe your data’s columns and data types. You pass this schema
into FeatureDefinitions, a required parameter for a FeatureGroup. You can use
the SageMaker Python SDK, which has automatic data type detection when you use the
load_feature_definitions function.
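
As an illustration of preparing for automatic detection, the following sketch casts pandas object columns to the string dtype before calling load_feature_definitions, so that they map to the String feature type; it assumes a pandas DataFrame named df.

# Cast pandas 'object' columns to the 'string' dtype so that automatic
# data type detection maps them to the String feature type.
for column in df.select_dtypes(include='object').columns:
    df[column] = df[column].astype('string')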

The default behavior when a new feature record is added with an already existing record ID is as follows.
In the offline store, the new record is appended. In the online store, if the event time of the new record
is less than the existing event time, then nothing happens; however, if the event time of the new record
is greater than or equal to the existing event time, the record is overwritten.

When you create a new feature group you can choose one of the following table formats:

• AWS Glue (Default)
• Apache Iceberg

Ingesting data, especially when streaming, can result in a large number of small files deposited into the
offline store. This can negatively impact query performance due to the higher number of file operations
required. To avoid potential performance issues, use the Apache Iceberg table format when creating new
feature groups. With Iceberg you can compact the small data files into fewer large files in the partition,
resulting in significantly faster queries. This compaction operation is concurrent and does not affect
ongoing read and write operations on the feature group. If you choose the Iceberg option when creating
new feature groups, Amazon SageMaker Feature Store will create the Iceberg tables using Parquet file
format, and register the tables with the AWS Glue Data Catalog.
Important
Note that for feature groups in Iceberg table format, you must specify String as the value for
the event time. If you specify any other type, you can't create the feature group successfully.
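
The following is a minimal sketch of choosing the Iceberg format with the SageMaker Python SDK. The TableFormatEnum import path reflects recent SDK versions, and the feature group and feature names are placeholders; check your SDK version's documentation.

from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import TableFormatEnum

feature_group = FeatureGroup(name='my-iceberg-feature-group',  # placeholder name
                             sagemaker_session=sagemaker_session)

# table_format selects Apache Iceberg instead of the default AWS Glue format.
feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name='record_id',    # placeholder feature name
    event_time_feature_name='EventTime',   # must be a String feature for Iceberg
    role_arn=role,
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG
)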

Topics
• Introduction to Feature Store notebook (p. 1216)
• Fraud detection with Feature Store Notebook (p. 1221)

Introduction to Feature Store notebook


The example code on this page refers to the Introduction to Feature Store notebook. We recommend
that you run this notebook in Amazon SageMaker Studio, Jupyter, or JupyterLab because the code in
this guide is conceptual and not fully functional if copied. To open the notebook in Studio, first clone
the aws/amazon-sagemaker-examples GitHub repository to Studio by following the steps in Clone a Git
Repository in SageMaker Studio (p. 194), then navigate to the amazon-sagemaker-examples/sagemaker-
featurestore directory and choose the file.


Step 1: Set up
To start using Feature Store, create a SageMaker session and set up the Amazon S3 bucket you want
to use for your features. The Amazon S3 bucket is your offline store. The following code uses the
SageMaker default bucket and adds a custom prefix to it.
Note
The role that you use to run the notebook must have the following managed policies attached
to it: AmazonS3FullAccess and AmazonSageMakerFeatureStoreAccess. For information
on adding policies to your IAM role, see Adding required policies to your IAM role (p. 1215).

# SageMaker Python SDK version 2.x is required
import sagemaker
import sys

import boto3
import pandas as pd
import numpy as np
import io
from sagemaker.session import Session
from sagemaker import get_execution_role

prefix = 'sagemaker-featurestore-introduction'
role = get_execution_role()

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3_bucket_name = sagemaker_session.default_bucket()

Step 2: Inspect your data


In this notebook example, we ingest synthetic data from the GitHub repository that hosts the full
notebook.

customer_data = pd.read_csv("data/feature_store_introduction_customer.csv")
orders_data = pd.read_csv("data/feature_store_introduction_orders.csv")

print(customer_data.head())
print(orders_data.head())

The following diagram illustrates the steps the data goes through before it is ingested into Feature
Store. In this notebook, we illustrate the use-case where you have data from multiple sources and want
to store them independently in a Feature Store. Our example considers data from a data warehouse
(customer data), and data from a real-time streaming service (order data).


Step 3: Create feature groups


We first start by creating feature group names for customer_data and orders_data. Following this, we
create two Feature Groups, one for customer_data and another for orders_data.

import time
from time import strftime, gmtime

customers_feature_group_name = 'customers-feature-group-' + strftime('%d-%H-%M-%S', gmtime())
orders_feature_group_name = 'orders-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

Instantiate a FeatureGroup object for customers_data and orders_data.

from sagemaker.feature_store.feature_group import FeatureGroup

customers_feature_group = FeatureGroup(
    name=customers_feature_group_name, sagemaker_session=sagemaker_session
)
orders_feature_group = FeatureGroup(
    name=orders_feature_group_name, sagemaker_session=sagemaker_session
)

import time
current_time_sec = int(round(time.time()))
record_identifier_feature_name = "customer_id"

Append the EventTime feature to your data frame. This parameter is required, and timestamps each data
point.

customer_data["EventTime"] = pd.Series([current_time_sec]*len(customer_data),
dtype="float64")
orders_data["EventTime"] = pd.Series([current_time_sec]*len(orders_data), dtype="float64")

Load feature definitions to your feature group.

customers_feature_group.load_feature_definitions(data_frame=customer_data)
orders_feature_group.load_feature_definitions(data_frame=orders_data)

Below we call create to create two feature groups, customers_feature_group and
orders_feature_group, respectively.

customers_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True
)

orders_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True
)


To confirm that your feature groups have been created, we use the DescribeFeatureGroup and
ListFeatureGroups APIs to display the created feature groups.

customers_feature_group.describe()

orders_feature_group.describe()

sagemaker_session.boto_session.client(
    'sagemaker', region_name=region
).list_feature_groups()  # We use the boto client to list FeatureGroups

Step 4: Ingest data into a feature group


After the FeatureGroups have been created, we can put data into the FeatureGroups. If you are using the
SageMaker Python SDK, use the ingest API call. If you are using boto3, use the PutRecord API.
Ingesting the data takes less than 1 minute with either of these options. This example uses the SageMaker
Python SDK, so it uses the ingest API call.

def check_feature_group_status(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group to be Created")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    print(f"FeatureGroup {feature_group.name} successfully created.")

check_feature_group_status(customers_feature_group)
check_feature_group_status(orders_feature_group)

customers_feature_group.ingest(
    data_frame=customer_data, max_workers=3, wait=True
)

orders_feature_group.ingest(
    data_frame=orders_data, max_workers=3, wait=True
)

Using an arbitrary customer record ID, 573291, we use get_record to check that the data has been
ingested into the feature group.

customer_id = 573291
sample_record = sagemaker_session.boto_session.client(
    'sagemaker-featurestore-runtime', region_name=region
).get_record(
    FeatureGroupName=customers_feature_group_name,
    RecordIdentifierValueAsString=str(customer_id)
)

print(sample_record)

The following demonstrates how to use batch_get_record to get a batch of records.

all_records = sagemaker_session.boto_session.client(
    "sagemaker-featurestore-runtime", region_name=region
).batch_get_record(
    Identifiers=[
        {
            "FeatureGroupName": customers_feature_group_name,
            "RecordIdentifiersValueAsString": ["573291", "109382", "828400", "124013"],
        },
        {
            "FeatureGroupName": orders_feature_group_name,
            "RecordIdentifiersValueAsString": ["573291", "109382", "828400", "124013"],
        },
    ]
)

print(all_records)

Step 5: Clean up
Here we remove the Feature Groups we created.

customers_feature_group.delete()
orders_feature_group.delete()

Step 6: Next steps


In this example notebook, you learned how to quickly get started with Feature Store, create feature
groups, and ingest data into them.

For an advanced example on how to use Feature Store for a Fraud Detection use-case, see Fraud
Detection with Feature Store.

Step 7: Programmers note


In this notebook we used a variety of different API calls. Most of them are accessible through
the Python SDK; however, some exist only within boto3. You can invoke the Python SDK
API calls directly on your Feature Store objects, whereas to invoke API calls that exist within
boto3, you must first access a boto client through your boto and SageMaker sessions, for example:
sagemaker_session.boto_session.client().

Below we list API calls used in this notebook that exist within the Python SDK and ones that exist in
boto3 for your reference.

Python SDK API Calls

describe()
ingest()
delete()
create()
load_feature_definitions()

Boto3 API Calls

list_feature_groups()
get_record()

Fraud detection with Feature Store Notebook


The example code on this page refers to the Fraud Detection with Amazon SageMaker Feature Store
notebook. To open the notebook in Studio, first clone the aws/amazon-sagemaker-examples GitHub
repository to Studio by following the steps in Clone a Git Repository in SageMaker Studio (p. 194), then
navigate to the amazon-sagemaker-examples/sagemaker-featurestore directory and choose the file.


Step 1: Set up Feature Store


To start using Feature Store, create a SageMaker session, boto3 session, and a Feature Store session.
Also, set up the S3 bucket you want to use for your features. This is your offline store. The following code
uses the SageMaker default bucket and adds a custom prefix to it.
Note
The role that you use to run the notebook must have the following managed policies attached
to it: AmazonSageMakerFullAccess and AmazonSageMakerFeatureStoreAccess. For
information on adding policies to your IAM role, see Adding required policies to your IAM
role (p. 1215).

import boto3
import sagemaker
from sagemaker.session import Session

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
boto_session = boto3.Session(region_name=region)
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-featurestore'
offline_feature_store_bucket = 's3://{}/{}'.format(default_bucket, prefix)

sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

Step 2: Load datasets and partition data into feature groups


Load your data into data frames for each of your features. You use these data frames after you set up the
feature group. In the fraud detection example, you can see these steps in the following code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io

# The example data is stored in an S3 bucket; create an S3 client to read it.
s3_client = boto3.client('s3', region_name=region)

fraud_detection_bucket_name = 'sagemaker-featurestore-fraud-detection'
identity_file_key = 'sampled_identity.csv'
transaction_file_key = 'sampled_transactions.csv'

identity_data_object = s3_client.get_object(Bucket=fraud_detection_bucket_name, Key=identity_file_key)
transaction_data_object = s3_client.get_object(Bucket=fraud_detection_bucket_name, Key=transaction_file_key)

identity_data = pd.read_csv(io.BytesIO(identity_data_object['Body'].read()))
transaction_data = pd.read_csv(io.BytesIO(transaction_data_object['Body'].read()))

identity_data = identity_data.round(5)
transaction_data = transaction_data.round(5)

identity_data = identity_data.fillna(0)
transaction_data = transaction_data.fillna(0)


# Feature transformations for this dataset are applied before ingestion into FeatureStore.
# One hot encode card4, card6
encoded_card_bank = pd.get_dummies(transaction_data['card4'], prefix='card_bank')
encoded_card_type = pd.get_dummies(transaction_data['card6'], prefix='card_type')

transformed_transaction_data = pd.concat([transaction_data, encoded_card_type, encoded_card_bank], axis=1)
transformed_transaction_data = transformed_transaction_data.rename(
    columns={"card_bank_american express": "card_bank_american_express"})

Step 3: Set up feature groups


When you set up your feature groups, you need to customize the feature names with a unique name and
set up each feature group with the FeatureGroup class.

from sagemaker.feature_store.feature_group import FeatureGroup

feature_group_name = "some string for a name"
feature_group = FeatureGroup(
    name=feature_group_name, sagemaker_session=feature_store_session
)

For example, in the fraud detection example, the two feature groups are identity and transaction.
In the following code you can see how the names are customized with a timestamp, and then each group
is set up by passing in the name and the session.

import time
from time import gmtime, strftime, sleep
from sagemaker.feature_store.feature_group import FeatureGroup

identity_feature_group_name = 'identity-feature-group-' + strftime('%d-%H-%M-%S', gmtime())
transaction_feature_group_name = 'transaction-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

identity_feature_group = FeatureGroup(
    name=identity_feature_group_name, sagemaker_session=feature_store_session
)
transaction_feature_group = FeatureGroup(
    name=transaction_feature_group_name, sagemaker_session=feature_store_session
)

Step 4: Set up record identifier and event time features


In this step, you specify a record identifier name and an event time feature name. This name maps to
the column of the corresponding features in your data. For example, in the fraud detection example, the
column of interest is TransactionID. EventTime can be appended to your data when no timestamp
is available. In the following code, you can see how these variables are set, and then EventTime is
appended to both feature’s data.

record_identifier_name = "TransactionID"
event_time_feature_name = "EventTime"
current_time_sec = int(round(time.time()))
identity_data[event_time_feature_name] = pd.Series([current_time_sec]*len(identity_data),
dtype="float64")
transformed_transaction_data[event_time_feature_name] =
pd.Series([current_time_sec]*len(transaction_data), dtype="float64")

Step 5: Load feature definitions


You can now load the feature definitions by passing a data frame containing the feature data. In the
following code for the fraud detection example, the identity feature and transaction feature are each
loaded by using load_feature_definitions, and this function automatically detects the data type
of each column of data. For developers using a schema rather than automatic detection, see the Export
Feature Groups from Data Wrangler example for code that shows how to load the schema, map it, and
add it as a FeatureDefinition that you can use to create the FeatureGroup. This example also
covers a boto3 implementation, which you can use instead of the SageMaker Python SDK.

identity_feature_group.load_feature_definitions(data_frame=identity_data);  # output is suppressed
transaction_feature_group.load_feature_definitions(data_frame=transformed_transaction_data);  # output is suppressed

Step 6: Create a feature group


In this step, you use the create function to create the feature group. The following code shows all of
the available parameters. The online store is not created by default, so you must set this as True if you
want to enable it. The s3_uri is the S3 bucket location of your offline store.

# create a FeatureGroup
feature_group.create(
    description="Some info about the feature group",
    feature_group_name=feature_group_name,
    record_identifier_name=record_identifier_name,
    event_time_feature_name=event_time_feature_name,
    feature_definitions=feature_definitions,
    role_arn=role,
    s3_uri=offline_feature_store_bucket,
    enable_online_store=True,
    online_store_kms_key_id=None,
    offline_store_kms_key_id=None,
    disable_glue_table_creation=False,
    data_catalog_config=None,
    tags=["tag1", "tag2"])

The following code from the fraud detection example shows a minimal create call for each of the two
feature groups being created.

identity_feature_group.create(
    s3_uri=offline_feature_store_bucket,
    record_identifier_name=record_identifier_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True
)

transaction_feature_group.create(
    s3_uri=offline_feature_store_bucket,
    record_identifier_name=record_identifier_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True
)

When you create a feature group, it takes time to load the data, and you need to wait until the feature
group is created before you can use it. You can check status using the following method.

status = feature_group.describe().get("FeatureGroupStatus")

While the feature group is being created, you receive Creating as a response. When this step has
finished successfully, the response is Created. Other possible statuses are CreateFailed, Deleting,
or DeleteFailed.


Step 7: Work with feature groups


Now that you've set up your feature group, you can perform any of the following tasks:

Topics
• Describe a feature group (p. 1225)
• List feature groups (p. 1225)
• Put records in a feature group (p. 1225)
• Get records from a feature group (p. 1225)
• Generate hive DDL commands (p. 1226)
• Build a training dataset (p. 1226)
• Write and execute an Athena query (p. 1226)
• Delete a feature group (p. 1227)

Describe a feature group

You can retrieve information about your feature group with the describe function.

feature_group.describe()

List feature groups

You can list all of your feature groups with the list_feature_groups function.

sagemaker_client.list_feature_groups()
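
The boto3 call also accepts filters; for example, the following sketch lists only feature groups whose names contain a given substring.

# List only feature groups whose names contain 'transaction'.
response = sagemaker_client.list_feature_groups(NameContains='transaction')
for summary in response['FeatureGroupSummaries']:
    print(summary['FeatureGroupName'], summary['FeatureGroupStatus'])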

Put records in a feature group

You can use the ingest function to load your feature data. You pass in a data frame of feature data, set
the number of workers, and choose to wait for it to return or not. The following example demonstrates
using the ingest function.

feature_group.ingest(
    data_frame=feature_data, max_workers=3, wait=True
)

For each feature group you have, run the ingest function on the feature data you want to load.

Get records from a feature group

You can use the get_record function to retrieve the data for a specific feature by its record identifier.
The following example uses an example identifier to retrieve the record.

record_identifier_value = str(2990130)
featurestore_runtime.get_record(
    FeatureGroupName=transaction_feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value
)

An example response from the fraud detection example:

...
'Record': [{'FeatureName': 'TransactionID', 'ValueAsString': '2990130'},
{'FeatureName': 'isFraud', 'ValueAsString': '0'},
{'FeatureName': 'TransactionDT', 'ValueAsString': '152647'},
{'FeatureName': 'TransactionAmt', 'ValueAsString': '75.0'},
{'FeatureName': 'ProductCD', 'ValueAsString': 'H'},
{'FeatureName': 'card1', 'ValueAsString': '4577'},
...

Generate hive DDL commands

The SageMaker Python SDK's FeatureStore class also provides the functionality to generate Hive DDL
commands. The schema of the table is generated based on the feature definitions. Columns are named
after the feature names, and data types are inferred based on the feature types.

print(feature_group.as_hive_ddl())

Example output:

CREATE EXTERNAL TABLE IF NOT EXISTS sagemaker_featurestore.identity-feature-group-27-19-33-00 (
  TransactionID INT
  id_01 FLOAT
  id_02 FLOAT
  id_03 FLOAT
  id_04 FLOAT
...

Build a training dataset

Feature Store automatically builds an AWS Glue data catalog when you create feature groups, and you
can turn this off if you want. The following describes how to create a single training dataset with feature
values from both the identity and transaction feature groups created earlier in this topic, by running an
Amazon Athena query that joins data stored in the offline store from both feature groups.

To start, create an Athena query using athena_query() for both the identity and transaction feature
groups. The table_name is the AWS Glue table that is autogenerated by Feature Store.

identity_query = identity_feature_group.athena_query()
transaction_query = transaction_feature_group.athena_query()

identity_table = identity_query.table_name
transaction_table = transaction_query.table_name

Write and execute an Athena query

You write your query using SQL on these feature groups, and then execute the query with the .run()
command, specifying the S3 bucket location where the dataset should be saved.

# Athena query
query_string = 'SELECT * FROM "'+transaction_table+'" LEFT JOIN "'+identity_table+'" ON "'+transaction_table+'".transactionid = "'+identity_table+'".transactionid'

# Run the Athena query. The output is loaded to a pandas DataFrame.
# default_bucket is the default bucket set up in step 1.
dataset = pd.DataFrame()
identity_query.run(query_string=query_string,
                   output_location='s3://'+default_bucket+'/query_results/')
identity_query.wait()
dataset = identity_query.as_dataframe()

From here you can train a model using this data set and then perform inference.


Delete a feature group

You can delete a feature group with the delete function.

feature_group.delete()

The following code example is from the fraud detection example.

identity_feature_group.delete()
transaction_feature_group.delete()

For more information, see the Delete a feature group API.

Use Amazon SageMaker Feature Store with Amazon SageMaker Studio

You can use Amazon SageMaker Studio to create and view details about your feature groups.

Topics
• Create a feature group in Amazon SageMaker Studio (p. 1227)
• View feature group details in Studio (p. 1229)

Create a feature group in Amazon SageMaker Studio


The create feature group process has four steps: enter feature group information, feature definitions,
required features, and feature group tags.

Consider which of the following options best fits your use case:

• Create an online store, an offline store, or both. For more information on the differences between
online and offline stores, see Feature Store concepts (p. 1213).
• Use a default AWS KMS key or your own AWS KMS key. The default key is the AWS managed
encryption key (SSE-S3), though you can reduce AWS KMS request costs by using Amazon S3 bucket
keys. For more information on reducing the cost by using Amazon S3 bucket keys, see Reducing the
cost of SSE-KMS with Amazon S3 Bucket Keys.

You can use the same key for both online and offline stores, or have a unique key for each. For more
information on AWS KMS, see AWS Key Management Service.
• If you create an offline store:
• You should decide if you want to create an Amazon S3 bucket or use an existing one. When using an
existing one, you need to know the Amazon S3 bucket URL or Amazon S3 bucket name and dataset
directory name, if applicable.
• You should choose which IAM role ARN to use. For more information on how to find your role and
attached policies, see Adding required policies to your IAM role (p. 1215).
• You should decide whether to use the AWS Glue (Default) or Apache Iceberg table format. In most
use cases you will want to use the Apache Iceberg table format. For more information on table
formats, see Create feature groups (p. 1215).

Steps to create a feature group using Studio

1. Open Studio. For more information, see Launch Amazon SageMaker Studio (p. 133).


2. Choose the Home icon ( ) on the left panel.
3. Choose Data.
4. From the dropdown list, choose Feature Store.
5. Choose Create feature group.
6. Under Feature group details, enter a feature group name.
7. (Optional) Enter a description of the feature group.
8. Under Feature group storage configuration, choose a storage type from the Storage type
dropdown list.

If you choose offline storage:

a. From the S3 bucket name dropdown list, you may choose an existing Amazon S3 bucket name,
enter a new bucket name, or choose Enter bucket URL manually and enter the URL under S3
bucket address.
b. (Optional) If you have a specified directory name for your dataset, choose from the Dataset
directory name dropdown list.
c. From the Table format dropdown list, choose the table format. In most use cases, you should
use the Apache Iceberg Table format. For more information on table formats, see Create
feature groups (p. 1215).
d. Under IAM role ARN, choose the IAM role ARN you want to attach to this feature group. For
more information on how to find your role and attached policies, see Adding required policies to
your IAM role (p. 1215).
9. Under the Offline store encryption key dropdown list, choose Use AWS managed AWS KMS key
(default) or Enter an AWS KMS key ARN and enter your AWS KMS key ARN under Offline store
encryption key ARN. For more information about AWS KMS, see AWS Key Management Service.
10. If you have chosen offline storage with the AWS Glue (default) table format, under Data catalog,
you have the option to choose Use default values for your AWS Glue data catalog or provide your
existing data catalog name, table name, and database name to extend your existing AWS Glue
catalog.
11. Once all of the required information has been specified, the Continue button is available. Choose
Continue.
12. Under Specify feature definitions, you have two options for providing a schema for your features: a
JSON editor, or a table editor. In the JSON tab, type in or copy and paste your feature definitions in
the JSON format. For the table editor, type in the name and choose the corresponding data type for
each feature in your feature group. Choose Add feature definitions to include more features.

There must be at least two features in a feature group representing the record identifier and event
time:

• The record identifier Type can be a string, fractional, or an integral.
• The event time Type must be a string or a fractional. However, if you chose the Iceberg table
format, the event time must be a string.
13. Once all of the features are included, choose Continue.
14. Under Select required features you must specify the record identifier and event time features by
choosing the feature name under Record identifier feature name and Event time feature name
dropdown lists, respectively.
15. Once the record identifier and event time features are chosen, choose Continue.
16. (Optional) Add tags for the feature group by first choosing Add new tag and then entering a tag key
and corresponding value under Key and Value, respectively.
17. Choose Continue.


18. Under Review feature group, review the feature group information. You can edit any step by
choosing the Edit button that corresponds to that step, which brings you to that step for editing.
To return to the review step, choose Continue until you reach it again.
19. Once you have finalized the setup for your feature group, choose Create feature group.

If there are any issues with the setup, there is a red alert pop-up message that appears at the
bottom of the page with tips on solving the issue. You can return to previous steps to fix them.

If the feature group has been successfully created, a green pop-up message appears at the bottom
of the page. When the feature group is successfully created, it appears in your feature groups
catalog.

View feature group details in Studio


You can view details of your feature groups once a feature group has successfully been created in the
Feature Store.

View feature group details in Studio

1. Open Studio. For more information, see Launch Amazon SageMaker Studio (p. 133).
2. Choose the Home icon ( ) on the left panel.
3. Choose Data.
4. From the dropdown list, choose Feature Store.
5. Under the Feature group catalog tab, choose your feature group name from the list. This opens the
feature group page.
6. Under the Details tab, you can review your feature group Information and Tags. Choose Add new
tag to add a new tag or remove to remove a tag.
7. On the Features tab, you can find a list of all of the features. Use the filter to refine your list. Choose
a feature to view its details.

Delete a feature group


You can use Amazon SageMaker Studio or the Amazon SageMaker Feature Store API to delete your
feature group.

The following sections provide an overview of using both to delete a feature group.

Topics
• Delete a feature group using Studio (p. 1229)
• Delete feature group example Python code (p. 1230)

Delete a feature group using Studio


1. Open Studio. For more information, see Launch Amazon SageMaker Studio (p. 133).
2. Choose the Home icon ( ) on the left panel.
3. Choose Data.
4. From the dropdown list, choose Feature Store.
5. In the Feature Group Catalog tab choose the feature group you wish to delete under Feature group
name.


6. Choose Delete feature group.
7. In the popup window confirm deletion by typing "delete" in the field, then choose Delete.

Delete feature group example Python code


The following code uses the DeleteFeatureGroup API operation to delete your feature group
using the AWS SDK for Python (Boto3). It assumes that you've set up Feature Store and created
a feature group. For more information about getting started, see Introduction to Feature Store
notebook (p. 1216).

import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sagemaker_session = sagemaker.Session()
fg_name = 'your-feature-group-name'

my_fg = FeatureGroup(name=fg_name, sagemaker_session=sagemaker_session)
my_fg.delete()

Data sources and ingestion


There are multiple ways to bring your data into Amazon SageMaker Feature Store. Feature Store offers
a single API call for data ingestion called PutRecord that enables you to ingest data in batches or from
streaming sources. You can use Amazon SageMaker Data Wrangler to engineer features and then ingest
your features into your Feature Store. You can also use Amazon EMR for batch data ingestion through a
Spark connector.

Topics
• Stream ingestion (p. 1230)
• Data Wrangler with Feature Store (p. 1230)
• Batch ingestion with Amazon SageMaker Feature Store Spark (p. 1232)

Stream ingestion
You can use streaming sources such as Kafka or Kinesis as data sources from which features are
extracted and directly fed to the online feature store for training, inference, or feature creation. Records
can be pushed into the feature store by calling the synchronous PutRecord API call. Because this is a
synchronous API call, it allows small batches of updates to be pushed in a single API call. This enables you
to maintain high freshness of the feature values and publish values as soon as an update is detected.
These are also called streaming features.
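
As a sketch of such a streaming write with the AWS SDK for Python (Boto3), assuming a feature group whose record identifier is customer_id and whose event time feature is EventTime (the names and values are placeholders):

import time
import boto3

featurestore_runtime = boto3.client('sagemaker-featurestore-runtime')

# Push the latest feature values for one record. Repeated calls with the same
# record identifier keep only the newest event time in the online store.
featurestore_runtime.put_record(
    FeatureGroupName='customers-feature-group',  # placeholder name
    Record=[
        {'FeatureName': 'customer_id', 'ValueAsString': '573291'},
        {'FeatureName': 'city', 'ValueAsString': 'Seattle'},  # hypothetical feature
        {'FeatureName': 'EventTime', 'ValueAsString': str(round(time.time()))},
    ]
)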

Data Wrangler with Feature Store


Data Wrangler is a feature of Studio that provides an end-to-end solution to import, prepare, transform,
featurize, and analyze data. Data Wrangler enables you to engineer your features and ingest them into a
feature store.

In Studio, after interacting with Data Wrangler, choose the Export tab, choose Export Step, and then
choose Feature Store, as shown in the following screenshot. This exports a Jupyter notebook that has all
the source code in it to create a Feature Store feature group that adds your features from Data Wrangler
to an offline or online feature store.


After the feature group has been created, you can also select and join data across multiple feature
groups to create new engineered features in Data Wrangler and then export your data set to an S3
bucket.


For more information on how to export to Feature Store, see Export to SageMaker Feature Store.

Batch ingestion with Amazon SageMaker Feature Store Spark

Amazon SageMaker Feature Store Spark is a Spark connector that connects the Spark library to Feature
Store. Feature Store Spark simplifies data ingestion from Spark DataFrames to feature groups. Feature
Store supports batch data ingestion with Spark, using your existing ETL pipeline, on Amazon EMR, GIS,
an AWS Glue job, an Amazon SageMaker Processing job, or a SageMaker notebook.

Methods for installing and implementing batch data ingestion are provided for Python and Scala
developers. Python developers can use the open-source sagemaker-feature-store-pyspark Python
library for local development, installation on Amazon EMR, and for Jupyter Notebooks by following the
instructions in the Amazon SageMaker Feature Store Spark GitHub repository. Scala developers can use
the Feature Store Spark connector available in the Amazon SageMaker Feature Store Spark SDK Maven
central repository.

You can use the Spark connector to ingest data in the following ways, depending on if the online store,
offline store, or both are enabled.

1. Ingest by default – If the online store is enabled, the Spark connector first ingests your dataframe into
the online store using the PutRecord API. Only the record with the largest event time remains in the
online store. If the offline store is enabled, within 15 minutes Feature Store ingests your dataframe into
the offline store. For more information about how the online and offline stores work, see Feature Store
concepts (p. 1213).

You can accomplish this by not specifying target_stores in the .ingest_data(...) method.
2. Offline store direct ingestion – If the offline store is enabled, the Spark connector batch ingests your
dataframe directly into the offline store. Ingesting the dataframe directly into the offline store doesn't
update the online store.

You can accomplish this by setting target_stores=["OfflineStore"] in the .ingest_data(...) method.
3. Online store only – If the online store is enabled, the Spark connector ingests your dataframe into the
online store using the PutRecord API. Ingesting the dataframe directly into the online store doesn't
update the offline store.

You can accomplish this by setting target_stores=["OnlineStore"] in the .ingest_data(...) method.

For information about using the different ingestion methods, see Example implementations (p. 1236).
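
As an example of the second method, offline store direct ingestion with the connector looks roughly like the following sketch. It assumes a FeatureStoreManager and Spark DataFrame set up as in the example script later in this topic, and a placeholder feature group ARN; the parameter names follow the connector's GitHub documentation.

# Batch ingest the DataFrame directly into the offline store only.
feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn='arn:aws:sagemaker:us-east-1:111122223333:feature-group/your-feature-group',
    target_stores=["OfflineStore"]
)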

Topics
• Feature Store Spark installation (p. 1232)
• Retrieving the JAR for Feature Store Spark (p. 1235)
• Example implementations (p. 1236)

Feature Store Spark installation


Scala users

The Feature Store Spark SDK is available in the Amazon SageMaker Feature Store Spark SDK Maven
central repository for Scala users.

Requirements


• Spark >= 3.0.0 and <= 3.3.0
• iceberg-spark-runtime >= 0.14.0
• Scala >= 2.12.x
• Amazon EMR >= 6.1.0 (only if you are using Amazon EMR)

Declare the dependency in POM.xml

The Feature Store Spark connector has a dependency on the iceberg-spark-runtime library. You
must therefore add the corresponding version of the iceberg-spark-runtime library to your
dependencies if you're ingesting data into a feature group that you've auto-created with the Iceberg
table format. For example, if you're using Spark 3.1, you must declare the following in your project's
POM.xml:

<dependency>
    <groupId>software.amazon.sagemaker.featurestore</groupId>
    <artifactId>sagemaker-feature-store-spark-sdk_2.12</artifactId>
    <version>1.0.0</version>
</dependency>

<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-runtime-3.1_2.12</artifactId>
    <version>0.14.0</version>
</dependency>

Python users

The Feature Store Spark SDK is available in the open-source Amazon SageMaker Feature Store Spark
GitHub repository.

Requirements

• Spark >= 3.0.0 and <= 3.3.0
• Amazon EMR >= 6.1.0 (only if you are using Amazon EMR)
• Kernel = conda_python3

We recommend setting $SPARK_HOME to the directory where you have Spark installed. During
installation, Feature Store uploads the required JAR to SPARK_HOME so that the dependencies load
automatically. Spark must be able to start a JVM for this PySpark library to work.

Local installation

To find more info about the installation, enable verbose mode by appending --verbose to the
following installation command.

pip3 install sagemaker-feature-store-pyspark-3.1 --no-binary :all:

Installation on Amazon EMR

Create an Amazon EMR cluster with the release version 6.1.0 or later. Enable SSH to help you
troubleshoot any issues.

You can do one of the following to install the library:

• Create a custom step within Amazon EMR.
• Connect to your cluster using SSH and install the library from there.


Note
The following information uses Spark version 3.1, but you can specify any version that meets
the requirements.

export SPARK_HOME=/usr/lib/spark
sudo -E pip3 install sagemaker-feature-store-pyspark-3.1 --no-binary :all: --verbose

Note
If you want to install the dependent JARs automatically to SPARK_HOME, do not use the
bootstrap step.

Installation on a SageMaker notebook instance

Install a version of PySpark that's compatible with the Spark connector using the following commands:

!pip3 install pyspark==3.1.1
!pip3 install sagemaker-feature-store-pyspark-3.1 --no-binary :all:

If you're performing batch ingestion to the offline store, the dependencies aren't within the notebook
instance environment, so you must add them to the Spark session as shown in the following code.

from pyspark.sql import SparkSession
import feature_store_pyspark

extra_jars = ",".join(feature_store_pyspark.classpath_jars())

spark = SparkSession.builder \
    .config("spark.jars", extra_jars) \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.1,org.apache.hadoop:hadoop-common:3.2.1") \
    .getOrCreate()

Installation on notebooks with GIS


Important
You must use AWS Glue Version 2.0 or later.

Use the following information to help you install the PySpark connector in an AWS Glue Interactive
Session (GIS).

Amazon SageMaker Feature Store Spark requires a specific Spark connector JAR during the initialization
of the session to be uploaded to your Amazon S3 bucket. For more information on uploading the
required JAR to your S3 bucket, see Retrieving the JAR for Feature Store Spark (p. 1235).

After you’ve uploaded the JAR, you must provide the GIS sessions with the JAR using the following
command.

%extra_jars s3://<YOUR_BUCKET>/spark-connector-jars/sagemaker-feature-store-spark-sdk.jar

To install Feature Store Spark in the AWS Glue runtime, use the %additional_python_modules magic
command within the GIS notebook. AWS Glue runs pip install for the modules that you've specified under
%additional_python_modules.

%additional_python_modules sagemaker-feature-store-pyspark-3.1


Before you start the AWS Glue session, you must use both of the preceding magic commands.

Installation on an AWS Glue job


Important
You must use AWS Glue Version 2.0 or later.

To install the Spark connector on an AWS Glue job, use the --extra-jars argument to provide the
necessary JARs and --additional-python-modules to install the Spark connector as job parameters
when you create the AWS Glue job, as shown in the following example. For more information on
uploading the required JAR to your S3 bucket, see Retrieving the JAR for Feature Store Spark (p. 1235).

glue_client = boto3.client('glue', region_name=region)

response = glue_client.create_job(
    Name=pipeline_id,
    Description='Feature Store Compute Job',
    Role=glue_role_arn,
    ExecutionProperty={'MaxConcurrentRuns': max_concurrent_run},
    Command={
        'Name': 'glueetl',
        'ScriptLocation': script_location_uri,
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--TempDir': temp_dir_location_uri,
        '--additional-python-modules': 'sagemaker-feature-store-pyspark-3.1',
        '--extra-jars': "s3://<YOUR_BUCKET>/spark-connector-jars/sagemaker-feature-store-spark-sdk.jar",
        ...
    },
    MaxRetries=3,
    NumberOfWorkers=149,
    Timeout=2880,
    GlueVersion='3.0',
    WorkerType='G.2X'
)

Installation on an Amazon SageMaker Processing job

To use Feature Store Spark with Amazon SageMaker Processing jobs, bring your own image. For
more information about bringing your own image, see Bring your own SageMaker image (p. 169). Add the
installation step to a Dockerfile. After you've pushed the Docker image to an Amazon ECR repository,
you can use the PySparkProcessor to create the processing job (a sketch follows the Dockerfile below).
For more information about creating a processing job with the PySpark processor, see Data Processing
with Apache Spark (p. 1197).

The following is an example of adding an installation step to the Dockerfile.

FROM <ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/sagemaker-spark-processing:3.1-cpu-py38-v1.0

RUN /usr/bin/python3 -m pip install sagemaker-feature-store-pyspark-3.1 --no-binary :all: --verbose
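After the image is in Amazon ECR, you can run the ingestion script with the PySparkProcessor. The following is a minimal sketch; the image URI, role ARN, and instance settings are placeholders rather than values from this guide.

from sagemaker.spark.processing import PySparkProcessor

# Point image_uri at the image built from the preceding Dockerfile.
processor = PySparkProcessor(
    base_job_name="feature-store-batch-ingestion",
    image_uri="<ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/<YOUR_REPOSITORY>:latest",
    role="<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

# Submit the batch ingestion script from the example implementations below.
processor.run(submit_app="FeatureStoreBatchIngestion.py")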

Retrieving the JAR for Feature Store Spark


To retrieve the Feature Store Spark dependency JAR, you must install the Spark connector from the
Python Package Index (PyPI) repository using pip in any Python environment with network access. A
SageMaker Jupyter Notebook is an example of a Python environment with network access.


The following command installs the Spark connector.

!pip install sagemaker-feature-store-pyspark-3.1

After you've installed Feature Store Spark, you can retrieve the JAR location and upload the JAR to
Amazon S3.

The feature-store-pyspark-dependency-jars command provides the location of the necessary
dependency JAR that Feature Store Spark added. You can use the command to retrieve the JAR and
upload it to Amazon S3.

import boto3

jar_location = !feature-store-pyspark-dependency-jars
jar_location = jar_location[0]

s3_client = boto3.client("s3")
s3_client.upload_file(jar_location, "<YOUR_BUCKET>", "spark-connector-jars/sagemaker-feature-store-spark-sdk.jar")

Example implementations
Example Python script

FeatureStoreBatchIngestion.py

from pyspark.sql import SparkSession
from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager
import feature_store_pyspark

spark = SparkSession.builder.getOrCreate()

# Construct test DataFrame
columns = ["RecordIdentifier", "EventTime"]
data = [("1", "2021-03-02T12:20:12Z"), ("2", "2021-03-02T12:20:13Z"), ("3", "2021-03-02T12:20:14Z")]

df = spark.createDataFrame(data).toDF(*columns)

# Initialize FeatureStoreManager with a role ARN if your feature group is created by another account
feature_store_manager = FeatureStoreManager("arn:aws:iam::111122223333:role/role-arn")

# Load the feature definitions from the input schema. The feature definitions can be used to create a feature group.
feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

feature_group_arn = "arn:aws:sagemaker:<AWS_REGION>:<ACCOUNT_ID>:feature-group/<YOUR_FEATURE_GROUP_NAME>"

# Ingest by default. The connector leverages the PutRecord API to ingest your data in a stream.
# https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_PutRecord.html
feature_store_manager.ingest_data(input_data_frame=df, feature_group_arn=feature_group_arn)

# To select the target stores for ingestion, specify the target stores as a parameter.
# If OnlineStore is selected, the connector leverages the PutRecord API to ingest your data in a stream.
feature_store_manager.ingest_data(input_data_frame=df, feature_group_arn=feature_group_arn, target_stores=["OfflineStore", "OnlineStore"])

# If only OfflineStore is selected, the connector batch writes the data to the offline store directly.
feature_store_manager.ingest_data(input_data_frame=df, feature_group_arn=feature_group_arn, target_stores=["OfflineStore"])

# Retrieve the records that the Spark connector failed to ingest
failed_records_df = feature_store_manager.get_failed_stream_ingestion_data_frame()

Submit a Spark job with example Python script

The PySpark version requires an extra dependent JAR to be imported, so extra steps are needed to
run the Spark application.

If you did not specify SPARK_HOME during installation, then you have to load the required JARs into the
JVM when running spark-submit. feature-store-pyspark-dependency-jars is a Python script
installed by the Spark library that automatically fetches the path to all of the JARs for you.

spark-submit --jars `feature-store-pyspark-dependency-jars` FeatureStoreBatchIngestion.py

If you are running this application on Amazon EMR, we recommend that you run the application in
client mode, so that you do not need to distribute the dependent JARs to other task nodes. Add a step to
the Amazon EMR cluster with a Spark argument similar to the following:

spark-submit --deploy-mode client --master yarn s3://<PATH_TO_SCRIPT>/FeatureStoreBatchIngestion.py

Example Scala script

FeatureStoreBatchIngestion.scala

import software.amazon.sagemaker.featurestore.sparksdk.FeatureStoreManager
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object TestSparkApp {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().getOrCreate()

    // Construct test DataFrame
    val data = List(
      Row("1", "2021-07-01T12:20:12Z"),
      Row("2", "2021-07-02T12:20:13Z"),
      Row("3", "2021-07-03T12:20:14Z")
    )

    val schema = StructType(
      List(StructField("RecordIdentifier", StringType), StructField("EventTime", StringType))
    )

    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

    // Initialize FeatureStoreManager with a role ARN if your feature group is created by another account
    val featureStoreManager = new FeatureStoreManager("arn:aws:iam::111122223333:role/role-arn")

    // Load the feature definitions from the input schema. The feature definitions can be used to create a feature group.
    val featureDefinitions = featureStoreManager.loadFeatureDefinitionsFromSchema(df)

    val featureGroupArn = "arn:aws:sagemaker:<AWS_REGION>:<ACCOUNT_ID>:feature-group/<YOUR_FEATURE_GROUP_NAME>"

    // Ingest by default. The connector leverages the PutRecord API to ingest your data in a stream.
    // https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_PutRecord.html
    featureStoreManager.ingestData(df, featureGroupArn)

    // To select the target stores for ingestion, specify the target stores as a parameter.
    // If OnlineStore is selected, the connector leverages the PutRecord API to ingest your data in a stream.
    featureStoreManager.ingestData(df, featureGroupArn, List("OfflineStore", "OnlineStore"))

    // If only OfflineStore is selected, the connector batch writes the data to the offline store directly.
    featureStoreManager.ingestData(df, featureGroupArn, List("OfflineStore"))

    // Retrieve the records that the Spark connector failed to ingest
    val failedRecordsDf = featureStoreManager.getFailedStreamIngestionDataFrame()
  }
}

Submit a Spark job

Scala

You should be able to use Feature Store Spark as a normal dependency. No extra instruction is
needed to run the application on all platforms.

Add features to a feature group


You can use the Amazon SageMaker Feature Store API or Amazon SageMaker Studio to add features to
your feature group. You can think of a feature group as a data table and a feature as a column in the
table. When you add a feature to the feature group, you're effectively adding a column to the table.

The features that you've added don't have any data yet. To populate them, you can add new records to
the feature group or overwrite existing ones. You can think of a record as a row in the data table.

The following sections provide an overview of using the API and Studio to add features to a feature
group. With the API, you can also add or overwrite records after you've updated the feature group.

Update feature group using Studio


To add features to a feature group with Studio, do the following.

1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).


2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Feature Store.
8. Choose Feature group catalog.
9. Under Feature group name, choose a feature group.
10. Choose Add feature definitions.
11. Choose Add feature definition.
12. Specify a name in the Feature name field.
13. For Type, select the feature's data type.
14. Choose Add new feature definition.
15. (Optional) Choose Add new feature definition to add feature definitions.
16. Specify information for the additional features.
17. Choose Save changes.
18. Choose Confirm.

API
Use the UpdateFeatureGroup operation to add features to a feature group.

You can use the DescribeFeatureGroup operation to see if you've added the features successfully.

To add or overwrite records, use the PutRecord operation.

To see the updates that you've made to a record, use the GetRecord operation. To see the updates that
you've made to multiple records, use the BatchGetRecord operation. It can take up to five minutes for
the updates that you've made to appear.

You can use the example code in the following section to walk through adding features and records
using the AWS SDK for Python (Boto3).

Example code
The example code walks you through the following process:

1. Adding features to the feature group
2. Verifying that you've added them successfully
3. Adding a record to the feature group
4. Verifying that you've added it successfully

Step 1: Add features to a feature group


The following code uses the UpdateFeatureGroup operation to add new features to the feature group.
It assumes that you've set up Feature Store and created a feature group. For more information about
getting started, see Introduction to Feature Store notebook (p. 1216).

import boto3

sagemaker_client = boto3.client("sagemaker")

sagemaker_client.update_feature_group(
FeatureGroupName=feature_group_name,
FeatureAdditions=[
{"FeatureName": "new-feature-1", "FeatureType": "Integral"},
{"FeatureName": "new-feature-2", "FeatureType": "Fractional"},
{"FeatureName": "new-feature-3", "FeatureType": "String"}
]
)

The following code uses the DescribeFeatureGroup operation to check the status of the update. If
the LastUpdateStatus field is Successful, you've added the features successfully.

sagemaker_client.describe_feature_group(
FeatureGroupName=feature_group_name
)
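Instead of checking the response manually, you can poll until the update finishes. The following is a minimal sketch; it assumes the LastUpdateStatus structure in the DescribeFeatureGroup response, whose Status field moves from InProgress to Successful or Failed.

import time

# Poll the feature group until the update is no longer in progress
while True:
    response = sagemaker_client.describe_feature_group(
        FeatureGroupName=feature_group_name
    )
    status = response.get("LastUpdateStatus", {}).get("Status")
    if status != "InProgress":
        break
    time.sleep(15)

print(f"Last update status: {status}")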

Step 2: Add a new record to the feature group


The following code uses the PutRecord operation to add records to the feature group that you've
created.

record_identifier_value = 'new_record'

sagemaker_featurestore_runtime_client = boto3.client("sagemaker-featurestore-runtime")

sagemaker_featurestore_runtime_client.put_record(
FeatureGroupName=feature_group_name,
Record=[
{
'FeatureName': "record-identifier-feature-name",
'ValueAsString': record_identifier_value
},
{
'FeatureName': "event-time-feature",
'ValueAsString': "timestamp-that-feature-store-returns"
},
{
'FeatureName': "new-feature-1",
'ValueAsString': "value-as-string"
},
{
'FeatureName': "new-feature-2",
'ValueAsString': "value-as-string"
},
{
'FeatureName': "new-feature-3",
'ValueAsString': "value-as-string"
},
]
)

Use the GetRecord operation to see which records in your feature group don't have data for the
features that you've added. You can use the PutRecord operation to overwrite the records that don't
have data for the features that you've added.
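For example, the following minimal sketch retrieves the record that the preceding code added, using the same placeholder names; features without data are omitted from the returned Record list.

response = sagemaker_featurestore_runtime_client.get_record(
    FeatureGroupName=feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value
)

# Print each feature that has a value for this record
for feature in response["Record"]:
    print(feature["FeatureName"], feature["ValueAsString"])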


Find features in your feature groups


With Amazon SageMaker Feature Store, you can search for the features that you've created in your
feature groups. You can search through all of your features without needing to first select a feature
group. You can use the search functionality to quickly find the features that are relevant to your use case.

To search for features in your feature groups, the feature groups must be within the same AWS account
and Region.
Important
Use the latest version of Amazon SageMaker Studio to make sure that you're using the most
recent version of the search functionality. For information on updating Studio, see Shut down
and Update SageMaker Studio (p. 199).

If you're on a team, teammates looking for features to use in their models can search through all of the
features in all of your feature groups.

You can add searchable parameters and descriptions to make your features more discoverable. For more
information, see Adding searchable metadata to your features (p. 1248).

You can search for features using either Amazon SageMaker Studio or the Search operation in the
SageMaker API. The following table lists all of the searchable metadata and whether you can search for it
in Studio.

Searchable metadata API field name Searchable in Studio?

Feature name FeatureName Yes

Feature group name FeatureGroupName No

Description Description Yes

Parameters Parameters.key Yes

All Parameters AllParameters Yes

Feature type FeatureType No

Creation time CreationTime Yes

Last modified time LastModifiedTime No

The following sections show you how to search for your features.

Studio

Use the following procedure to search through all the features that you've created.

To search through your features, do the following.

1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.


7. Choose Feature Store.


8. Choose Feature Catalog.
9. Specify a text query with at least three characters to search for your features.
10. (Optional) Use advanced filters after you've specified a query. You can use filters to specify
parameters or date ranges in your search results. If you're searching for a parameter, specify
both its key and value. To find your features more easily, you can do the following:

• Specify time ranges.


• Deselect columns that you don't want to query.

SDK for Python (Boto3)

The example uses the Search operation in the AWS SDK for Python (Boto3) to run the search query.
For information about using other languages to submit a query, see the See Also section in the Amazon
SageMaker API Reference.

The following code shows different example search queries using the API.

# Return all features in your feature groups


sagemaker_client.search(
Resource="FeatureMetadata",
)

# Search for all features that belong to a feature group whose name contains the "ver" substring
sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
]
}
)

# Search for all features that belong to a feature group with the EXACT name "airport"
sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Equals',
'Value': 'airport'
},
]
}
)

# Search for all features that belong to a feature group whose name contains "ver"
# AND have a feature name that contains "wha"
# AND have a parameter (key or value) that contains "hea"

sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={

'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'FeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllParameters',
'Operator': 'Contains',
'Value': 'hea'
},
]
}
)

# Search for all features that belong to a feature group with substring "ver" in its name
# OR features that have a name that contains "wha"
# OR features that have a parameter (key or value) that contains "hea"

sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'FeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllParameters',
'Operator': 'Contains',
'Value': 'hea'
},
],
'Operator': 'Or'  # note that this is explicitly set to "Or"; the default is "And"
}
)

# Search for all features that belong to a feature group with substring "ver" in its name
# OR features that have a name that contains "wha"
# OR parameters with the value 'Sage' for the 'org' key

sagemaker_client.search(
Resource="FeatureMetadata",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{

'Name': 'FeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'Parameters.org',
'Operator': 'Contains',
'Value': 'Sage'
},
],
'Operator': 'Or'  # note that this is explicitly set to "Or"; the default is "And"
}
)
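The table above also lists creation time as searchable metadata. The following sketch, which isn't one of the guide's examples, filters features by the CreationTime field; the date is a placeholder.

# Search for all features created after a given date
sagemaker_client.search(
    Resource="FeatureMetadata",
    SearchExpression={
        'Filters': [
            {
                'Name': 'CreationTime',
                'Operator': 'GreaterThan',
                'Value': '2022-01-01T00:00:00Z'
            },
        ]
    }
)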

Find feature groups in your Feature Store


With Amazon SageMaker Feature Store, you can search for feature groups using either Amazon
SageMaker Studio or the Search operation. You can use the search functionality to quickly find the
feature groups that are relevant to the models you're creating and to your use case.
Note
The feature groups that you're searching for must be within the same AWS account and AWS
Region.
Important
Use the latest version of Amazon SageMaker Studio to make sure that you're using the most
recent version of the search functionality. For information on updating Studio, see Shut down
and Update SageMaker Studio (p. 199).

You can search for feature groups using either Amazon SageMaker Studio or the Search operation in the
SageMaker API. The following table lists all of the searchable metadata and whether you can search for it
in Studio.

Searchable metadata API field name Searchable in Studio?

Creation time CreationTime Yes

Description Description Yes

Event Time Feature Name EventTimeFeatureName No

Creation Failure Reason FailureReason No

Feature Definitions FeatureDefinitions No

Feature Group ARN FeatureGroupARN No

Feature Group Name FeatureGroupName Yes

Creation Status FeatureGroupStatus Yes

Last Update Status LastUpdateStatus No

Record Identifier Feature Name RecordIdentifierFeatureName Yes


Offline Store Configuration OfflineStoreConfig No

Offline Store Status OfflineStoreStatus Yes

Tags Tags.key Yes

All Tags AllTags Yes

The following sections show you how to search for your feature groups.

Studio

Use the following procedure to search through all the feature groups that you've created.

To search through your feature groups, do the following.

1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.
7. Choose Feature Store.
8. Under Feature Group Catalog, specify a text query with at least three characters to search for
your feature groups.
9. (Optional) Use advanced filters after you've specified a query. You can use filters to specify
parameters or date ranges in your search results. If you're searching for a parameter, specify
both its key and value. To find your features more easily, you can do the following:

• Specify time ranges.


• Deselect columns that you don't want to query.
• Select whether you're searching for feature groups in the:
• Offline store
• Online store
• Offline store and online store
• Specify feature groups with a specific status, such as whether they've been created.

SDK for Python (Boto3)

The example uses the Search operation in the AWS SDK for Python (Boto3) to run the search query.
For information about using other languages to submit a query, see the See Also section in the Amazon
SageMaker API Reference.

The following code shows different example search queries using the API.

# Return all feature groups


sagemaker_client.search(
Resource="FeatureGroups",
)

# Search for all feature groups with a name that contains the "ver" substring
sagemaker_client.search(

Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
]
}
)

# Search for all feature groups that have the EXACT name "airport"
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Equals',
'Value': 'airport'
},
]
}
)

# Search for all feature groups that contains the name "ver"
# AND have a record identifier feature name that contains "wha"
# AND have a tag (key or value) that contains "hea"
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'RecordIdentifierFeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllTags',
'Operator': 'Contains',
'Value': 'hea'
},
]
}
)

# Search for all feature groups with substring "ver" in its name
# OR feature groups that have a record identifier feature name that contains "wha"
# OR feature groups that have a tag (key or value) that contains "hea"
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'RecordIdentifierFeatureName',

'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'AllTags',
'Operator': 'Contains',
'Value': 'hea'
},
],
'Operator': 'Or'  # note that this is explicitly set to "Or"; the default is "And"
}
)

# Search for all feature groups with substring "ver" in its name
# OR feature groups that have a record identifier feature name that contains "wha"
# OR tags with the value 'Sage' for the 'org' key
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'FeatureGroupName',
'Operator': 'Contains',
'Value': 'ver'
},
{
'Name': 'RecordIdentifierFeatureName',
'Operator': 'Contains',
'Value': 'wha'
},
{
'Name': 'Tags.org',
'Operator': 'Contains',
'Value': 'Sage'
},
],
'Operator': 'Or'  # note that this is explicitly set to "Or"; the default is "And"
}
)

# Search for all offline only feature groups


sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'OnlineStoreConfig.EnableOnlineStore',
'Operator': 'NotEquals',
'Value': 'true'
},
{
'Name': 'OfflineStoreConfig.S3StorageConfig.S3Uri',
'Operator': 'Exists'
}
]
}
)

# Search for all online only feature groups


sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [

{
'Name': 'OnlineStoreConfig.EnableOnlineStore',
'Operator': 'Equals',
'Value': 'true'
},
{
'Name': 'OfflineStoreConfig.S3StorageConfig.S3Uri',
'Operator': 'NotExists'
}
]
}
)

# Search for all feature groups that are BOTH online and offline
sagemaker_client.search(
Resource="FeatureGroups",
SearchExpression={
'Filters': [
{
'Name': 'OnlineStoreConfig.EnableOnlineStore',
'Operator': 'Equals',
'Value': 'true'
},
{
'Name': 'OfflineStoreConfig.S3StorageConfig.S3Uri',
'Operator': 'Exists'
}
]
}
)
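The Search operation returns results in pages, so accounts with many feature groups require follow-up calls with NextToken. The following is a minimal sketch that uses the Boto3 paginator to handle the token for you.

# Page through every feature group in the account and Region
paginator = sagemaker_client.get_paginator('search')

for page in paginator.paginate(Resource="FeatureGroups"):
    for result in page.get("Results", []):
        print(result["FeatureGroup"]["FeatureGroupName"])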

Adding searchable metadata to your features


In Amazon SageMaker Feature Store, you can search through all of your features. To make your features
more discoverable, you can add metadata to them. You can add the following types of metadata:

• Description – A searchable description of the feature.


• Parameters – Searchable key-value pairs.

The description can have up to 255 characters.

For parameters, you must specify a key-value pair in your search. You can add up to 25 parameters.

To update the metadata of a feature, you can use either Amazon SageMaker Studio or the
UpdateFeatureMetadata operation.

Use the following procedure to update the metadata using Amazon SageMaker Studio.

To update the feature metadata with Studio, do the following.

1. Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain (p. 37).
2. Choose Studio.
3. Choose Launch app.
4. From the dropdown list, select Studio.
5. Choose the Home icon.
6. Choose Data.


7. Choose Feature Store.


8. Choose Feature catalog.
9. In the Feature name column, choose a feature name.
10. Choose Edit metadata.
11. In the Description field, add or update the description.
12. Under Parameters, specify a key-value pair for the parameter.
13. (Optional) Choose Add new parameter to add another parameter.
14. Choose Save changes.
15. Choose Confirm.

The following describes how you can use the UpdateFeatureMetadata operation for different
scenarios.

Add a list of parameters to a feature

To add a list of parameters to a feature, specify values for the following fields:

• FeatureGroupName
• FeatureName
• Parameters

The following example code uses the AWS SDK for Python (Boto3) to add two parameters.

sagemaker_client.update_feature_metadata(
FeatureGroupName="feature_group_name",
FeatureName="feature-name",
ParameterAdditions=[
{"Key": "example-key-0", "Value": "example-value-0"},
{"Key": "example-key-1", "Value": "example-value-1"},
]
)

Add a description to a feature

To add a description to a feature, specify values for the following fields:

• FeatureGroupName
• FeatureName
• Description

sagemaker_client.update_feature_metadata(
FeatureGroupName="feature-group-name",
FeatureName="feature-name",
Description="description"
)

Remove parameters for a feature

To remove all parameters for a feature, do the following.

Specify values for the following fields:


• FeatureGroupName
• FeatureName

Specify the keys for the parameters that you're removing under ParameterRemovals.

sagemaker_client.update_feature_metadata(
FeatureGroupName="feature_group_name",
FeatureName="feature-name",
ParameterRemovals=[
{"Key": "example-key-0"},
{"Key": "example-key-1"},
]
)

Remove the description for a feature

To remove the description for a feature, do the following.

Specify values for the following fields:

• FeatureGroupName
• FeatureName

Specify an empty string for Description.

sagemaker_client.update_feature_metadata(
FeatureGroupName="feature-group-name",
FeatureName="feature-name",
Description=""
)

After you've updated the metadata for a feature, you can use the DescribeFeatureMetadata
operation to see the updates that you've made.

The following code goes through an example workflow using the AWS SDK for Python (Boto3).

Example code
The example code does the following:

1. Sets up your SageMaker environment.
2. Creates a feature group.
3. Adds features to the group.
4. Adds metadata to the features.

Step 1: Setup
To start using Feature Store, create SageMaker, boto3 and Feature Store sessions. Then set up the
S3 bucket you want to use for your features. This is your offline store. The following code uses the
SageMaker default bucket and adds a custom prefix to it.


Note
The role that you use must have the following managed policies attached to it:
AmazonS3FullAccess and AmazonSageMakerFeatureStoreAccess.

# SageMaker Python SDK version 2.x is required


%pip install 'sagemaker>=2.0.0'
import sagemaker
import sys

import boto3
import pandas as pd
import numpy as np
import io
from sagemaker.session import Session
from sagemaker import get_execution_role
from botocore.exceptions import ClientError

prefix = 'sagemaker-featurestore-introduction'
role = get_execution_role()

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3_bucket_name = sagemaker_session.default_bucket()

# Create the SageMaker client that the following steps use
sagemaker_client = boto3.client("sagemaker", region_name=region)

Step 2: Create a feature group and Add Features

feature_group_name = "test-for-feature-metadata"
feature_definitions = [
{"FeatureName": "feature-1", "FeatureType": "String"},
{"FeatureName": "feature-2", "FeatureType": "String"},
{"FeatureName": "feature-3", "FeatureType": "String"},
{"FeatureName": "feature-4", "FeatureType": "String"},
{"FeatureName": "feature-5", "FeatureType": "String"}
]
try:
sagemaker_client.create_feature_group(
FeatureGroupName=feature_group_name,
RecordIdentifierFeatureName="feature-1",
EventTimeFeatureName="feature-2",
FeatureDefinitions=feature_definitions,
OnlineStoreConfig={"EnableOnlineStore": True}
)
except ClientError as e:
if e.response["Error"]["Code"] == "ResourceInUse":
pass
else:
raise e

Step 3: Add metadata


Before you add metadata, use the DescribeFeatureGroup operation to make sure that the status of
the feature group is Created.

sagemaker_client.describe_feature_group(
    FeatureGroupName=feature_group_name
)

Add a description to the feature.

sagemaker_client.update_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1",
Description="new description"
)

You can use the DescribeFeatureMetadata operation to see if you've successfully updated the
description for the feature.

sagemaker_client.describe_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1"
)

You can also use the UpdateFeatureMetadata operation to add parameters to the feature.

sagemaker_client.update_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1",
ParameterAdditions=[
{"Key": "team", "Value": "featurestore"},
{"Key": "org", "Value": "sagemaker"},
]
)

You can use the DescribeFeatureMetadata operation again to see if you have successfully added the
parameters.

sagemaker_client.describe_feature_metadata(
FeatureGroupName=feature_group_name,
FeatureName="feature-1"
)

Create a dataset from your feature groups


After a Feature Store feature group has been created in an offline store, you can choose to use the
following methods to get your data:

• Using the Amazon SageMaker Python SDK
• Running SQL queries in Amazon Athena

Important
Feature Store requires data to be registered in an AWS Glue data catalog. By default, Feature
Store automatically builds an AWS Glue data catalog when you create a feature group.


After you've created feature groups for your offline store and populated them with data, you can create a
dataset by running queries or using the SDK to join data stored in the offline store from different feature
groups. You can also join the feature groups to a single pandas dataframe. You can use Amazon Athena
to write and execute SQL queries.
Note
To make sure that your data is up to date, you can set up an AWS Glue crawler to run on a
schedule.
To set up an AWS Glue crawler, specify an IAM role that the crawler uses to access the offline
store's Amazon S3 buckets. For more information, see Create an IAM role.
For more information on how to use AWS Glue and Athena to build a training dataset for model
training and inference, see Create feature groups (p. 1215).

Using the Amazon SageMaker Python SDK to get your data from your feature groups

You can use the Feature Store APIs to create a dataset from your feature groups. Data scientists create
ML datasets for training by retrieving ML feature data from one or more feature groups in the offline
store. Use the create_dataset() function to create the dataset. You can use the SDK to do the
following:

• Create a dataset from multiple feature groups.


• Create a dataset from the feature groups and a pandas data frame.

By default, Feature Store doesn't include records that you've deleted from the dataset. It also doesn't
include duplicated records. A duplicated record has the same record ID and timestamp value in the event
time column.

Before you use the SDK to create a dataset, you must start a SageMaker session. Use the following code
to start the session.

import boto3
from sagemaker.session import Session
from sagemaker.feature_store.feature_store import FeatureStore

region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client(
service_name="sagemaker", region_name=region
)
featurestore_runtime = boto_session.client(
service_name="sagemaker-featurestore-runtime",region_name=region
)

feature_store_session = Session(
boto_session=boto_session,
sagemaker_client=sagemaker_client,
sagemaker_featurestore_runtime_client=featurestore_runtime,
)

feature_store = FeatureStore(feature_store_session)

The following code shows an example of creating a dataset from multiple feature groups. The
code snippet uses the example feature groups "base_fg_name", "first_fg_name", and
"second_fg_name", which might not exist or might not have the same schema within your Feature
Store. We recommend replacing these feature groups with feature groups that exist within your Feature
Store. For information on how to create a feature group, see Step 3: Create feature groups (p. 1219).


from sagemaker.feature_store.feature_group import FeatureGroup

s3_bucket_name = "offline-store-sdk-test"

base_fg_name = "base_fg_name"
base_fg = FeatureGroup(name=base_fg_name, sagemaker_session=feature_store_session)

first_fg_name = "first_fg_name"
first_fg = FeatureGroup(name=first_fg_name, sagemaker_session=feature_store_session)

second_fg_name = "second_fg_name"
second_fg = FeatureGroup(name=second_fg_name, sagemaker_session=feature_store_session)

feature_store = FeatureStore(feature_store_session)
builder = feature_store.create_dataset(
    base=base_fg,
    output_path=f"s3://{s3_bucket_name}",
).with_feature_group(first_fg
).with_feature_group(second_fg, "base_id", ["base_feature_1"])

The following code shows an example of creating a dataset from multiple feature groups and a pandas
dataframe.

import pandas as pd

base_data = [[1, 187512346.0, 123, 128],
             [2, 187512347.0, 168, 258],
             [3, 187512348.0, 125, 184],
             [1, 187512349.0, 195, 206]]

base_data_df = pd.DataFrame(
    base_data,
    columns=["base_id", "base_time", "base_feature_1", "base_feature_2"]
)

builder = feature_store.create_dataset(
base=base_data_df,
event_time_identifier_feature_name='base_time',
record_identifier_feature_name='base_id',
output_path=f"s3://{s3_bucket_name}"
).with_feature_group(first_fg
).with_feature_group(second_fg, "base_id", ["base_feature_1"])

The Feature Store APIs provide helper methods for the create_dataset function. You can use them to
do the following:

• Create a dataset from multiple feature groups.
• Create a dataset from multiple feature groups and a pandas dataframe.
• Create a dataset from a single feature group and a pandas dataframe.
• Create a dataset using a point-in-time accurate join, where joined records have event times no later
than the base record's event time.
• Create a dataset with the duplicated records, instead of following the default behavior of the function.
• Create a dataset with the deleted records, instead of following the default behavior of the function.
• Create a dataset for time periods that you specify.
• Save the dataset as a CSV file.
• Save the dataset as a pandas dataframe.

The base feature group is an important concept for joins. The base feature group is the feature group
that has the other feature groups or the pandas dataframe joined to it.


You can add the following optional methods to the create_dataset function to configure how you're
creating the dataset:

• with_feature_group – Performs an inner join between the base feature group and another feature
group using the record identifier and the target feature name in the base feature group. The following
provides information about the parameters that you specify:
• feature_group – The feature group that you're joining.
• target_feature_name_in_base – The name of the feature in the base feature group that you're
using as a key in the join. The record identifiers in the other feature groups are the other keys that
Feature Store uses in the join.
• included_feature_names – A list of strings representing the feature names of the base feature
group. You can use the field to specify the features that you want to include in the dataset.
• feature_name_in_target – Optional string representing the feature in the target feature group
that will be compared to the target feature in the base feature group.
• join_comparator – Optional JoinComparatorEnum representing the comparator used when
joining the target feature in the base feature group and the feature in the target feature group.
These JoinComparatorEnum values can be GREATER_THAN, GREATER_THAN_OR_EQUAL_TO,
LESS_THAN, LESS_THAN_OR_EQUAL_TO, NOT_EQUAL_TO, or EQUALS by default.
• join_type – Optional JoinTypeEnum representing the type of join between the base and target
feature groups. These JoinTypeEnum values can be LEFT_JOIN, RIGHT_JOIN, FULL_JOIN,
CROSS_JOIN, or INNER_JOIN by default. For a sketch that sets join_type and join_comparator
explicitly, see the example after the output methods below.
• with_event_time_range – Creates a dataset using the event time range that you specify.
• as_of – Creates a dataset up to a timestamp that you specify. For example, if you specify
datetime(2021, 11, 28, 23, 55, 59, 342380) as the value, creates a dataset up to
November 28th, 2021.
• point_in_time_accurate_join – Creates a dataset where all of the event time values of the base
feature group are less than all the event time values of the feature group or pandas dataframe that
you're joining.
• include_duplicated_records – Keeps duplicated values in the feature groups.
• include_deleted_records – Keeps deleted values in the feature groups.
• with_number_of_recent_records_by_record_identifier – An integer that you specify to
determine how many of the most recent records appear in the dataset.
• with_number_of_records_by_record_identifier – An integer that represents how many
records appear in the dataset.

After you've configured the dataset, you can specify the output using one of the following methods:

• to_csv_file – Saves the dataset as a CSV file.
• to_dataframe – Saves the dataset as a pandas dataframe.
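The following is a minimal sketch that sets join_type and join_comparator explicitly and saves the result as a pandas dataframe. It assumes the feature groups and s3_bucket_name defined earlier, and that JoinComparatorEnum and JoinTypeEnum are importable from the SDK's dataset_builder module.

from sagemaker.feature_store.dataset_builder import (
    JoinComparatorEnum,
    JoinTypeEnum,
)

# Left join first_fg onto the base feature group, matching on equality
builder = feature_store.create_dataset(
    base=base_fg,
    output_path=f"s3://{s3_bucket_name}",
).with_feature_group(
    feature_group=first_fg,
    target_feature_name_in_base="base_id",
    join_comparator=JoinComparatorEnum.EQUALS,
    join_type=JoinTypeEnum.LEFT_JOIN,
)

# to_dataframe returns the dataset and the SQL query that produced it
df, query = builder.to_dataframe()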

You can also limit how many records the dataset contains. The following code retrieves the first five
records from the query results.

fg1 = FeatureGroup("example-feature-group-1")
feature_store.create_dataset(
base=fg1,
output_path="s3://example-S3-path"
).with_number_of_records_from_query_results(5).to_csv_file()

You can also retrieve data from a specific time range, as shown in the following code:


fg1 = FeatureGroup("fg1")
feature_store.create_dataset(
    base=fg1,
    output_path="s3://example-S3-path"
).with_event_time_range(
    datetime(2020, 11, 28, 23, 55, 59, 342380),
    datetime(2021, 11, 28, 23, 55, 59, 342380)
).to_csv_file()  # example time range specified in datetime functions

You might want to join multiple feature groups to a pandas dataframe where the event time values of
the feature group happen no later than the event time of the data frame. Use the following code as a
template to help you perform the join.

fg1 = FeatureGroup("fg1")
fg2 = FeatureGroup("fg2")
events = [['2020-02-01T08:30:00Z', 6, 1],
          ['2020-02-02T10:15:30Z', 5, 2],
          ['2020-02-03T13:20:59Z', 1, 3],
          ['2021-01-01T00:00:00Z', 1, 4]]
df = pd.DataFrame(events, columns=['event_time', 'customer-id', 'title-id'])
feature_store.create_dataset(
    base=df,
    event_time_identifier_feature_name='event_time',
    record_identifier_feature_name='customer-id',
    output_path="s3://example-S3-path"
).with_feature_group(fg1, "customer-id"
).with_feature_group(fg2, "title-id"
).point_in_time_accurate_join(
).to_csv_file()

You can also retrieve data up to a specific point in time. The following code retrieves data up to the
timestamp that you specify in the as_of method.

fg1 = FeatureGroup("fg1")
feature_store.create_dataset(
base=fg1,
output_path="s3://example-s3-file-path"
).as_of(datetime(2021, 11, 28, 23, 55, 59, 342380)
).to_csv_file() # example datetime values

Sample Amazon Athena queries


You can write queries in Amazon Athena to create a dataset from your feature groups. You can also write
queries that create a dataset from feature groups and a single pandas dataframe.

Interactive Exploration

This query selects the first 1000 records.

SELECT *
FROM <FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>
LIMIT 1000

Latest snapshot without duplicates

This query selects the latest non-duplicate records.

SELECT *
FROM
    (SELECT *,
        row_number()
        OVER (PARTITION BY <RecordIdentifierFeatureName>
            ORDER BY <EventTimeFeatureName> DESC, Api_Invocation_Time DESC, write_time DESC) AS row_num
    FROM
    <FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>)
WHERE row_num = 1;

Latest snapshot without duplicates and deleted records in the offline store

This query filters out any deleted records and selects non-duplicate records from the offline store.

SELECT *
FROM
    (SELECT *,
        row_number()
        OVER (PARTITION BY <RecordIdentifierFeatureName>
            ORDER BY <EventTimeFeatureName> DESC, Api_Invocation_Time DESC, write_time DESC) AS row_num
    FROM
    <FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>)
WHERE row_num = 1 AND NOT is_deleted;

Time Travel without duplicates and deleted records in the offline store

This query filters out any deleted records and selects non-duplicate records from a particular point in
time.

SELECT *
FROM
    (SELECT *,
        row_number()
        OVER (PARTITION BY <RecordIdentifierFeatureName>
            ORDER BY <EventTimeFeatureName> DESC, Api_Invocation_Time DESC, write_time DESC) AS row_num
    FROM
    <FeatureGroup.DataCatalogConfig.DatabaseName>.<FeatureGroup.DataCatalogConfig.TableName>
    WHERE <EventTimeFeatureName> <= timestamp '<timestamp>')
    -- replace timestamp '<timestamp>' with just <timestamp> if EventTimeFeature is of type fractional
WHERE row_num = 1 AND NOT is_deleted;
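To run any of these queries from Python instead of the Athena console, you can use the Athena helper in the SageMaker Python SDK. The following is a minimal sketch; it assumes the feature_store_session and s3_bucket_name from the previous section and a feature group named fg1.

from sagemaker.feature_store.feature_group import FeatureGroup

fg = FeatureGroup(name="fg1", sagemaker_session=feature_store_session)

# athena_query returns a helper bound to the feature group's offline store table
query = fg.athena_query()

query_string = f'SELECT * FROM "{query.table_name}" LIMIT 1000'
query.run(query_string=query_string,
          output_location=f"s3://{s3_bucket_name}/query_results/")
query.wait()

df = query.as_dataframe()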

Cross-account offline store access


Amazon SageMaker Feature Store allows users to create a feature group in one account (Account A) and
configure it with an offline store using an Amazon S3 bucket in another account (Account B). This can be
set up using the steps in the following section.

Topics
• Step 1: Set up the offline store access role in Account A (p. 1258)
• Step 2: Set up an offline store Amazon S3 bucket in Account B (p. 1259)
• Step 3: Set up an offline store AWS KMS encryption key in Account A (p. 1259)


• Step 4: Create a feature group in Account A (p. 1261)

Step 1: Set up the offline store access role in Account A

First, set up a role for Amazon SageMaker Feature Store to write the data into the
offline store. The simplest way to accomplish this is to create a new role using the
AmazonSageMakerFeatureStoreAccess policy or to use an existing role that already has the
AmazonSageMakerFeatureStoreAccess policy attached. This document refers to this role's ARN as
Account-A-Offline-Feature-Store-Role-ARN.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetBucketAcl",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::*SageMaker*",
"arn:aws:s3:::*Sagemaker*",
"arn:aws:s3:::*sagemaker*"
]
}
]
}

The preceding code snippet shows the AmazonSageMakerFeatureStoreAccess policy. The Resource
section of the policy is scoped down by default to S3 buckets with names that contain SageMaker,
Sagemaker, or sagemaker. This means the offline store Amazon S3 bucket being used must follow
this naming convention. If this is not the case, or if you want to further scope down the resource, you
can copy the policy, customize the Resource section to be arn:aws:s3:::your-offline-store-bucket-name,
and then attach it to the role.

Additionally, this role must have AWS KMS permissions attached. At a minimum, it requires the
kms:GenerateDataKey permission to be able to write to the offline store using your customer
managed key. See Step 3 to learn about why a customer managed key is needed for the cross-account
scenario and how to set it up. The following example shows an inline policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"kms:GenerateDataKey"
],
"Resource": "arn:aws:kms:*:Account-A-Account-Id:key/*"
}
]
}

The Resource section of this policy is scoped to any key in Account A. To further scope this down, after
setting up the offline store KMS key in Step 3, return to this policy and replace the Resource value with
the key ARN.


Step 2: Set up an offline store Amazon S3 bucket in Account B

Create an Amazon S3 bucket in Account B. If you are using the default
AmazonSageMakerFeatureStoreAccess policy, the bucket name must include SageMaker,
Sagemaker, or sagemaker. Edit the bucket policy as shown in the following example to allow Account A
to read and write objects.

This document refers to the bucket with the following example bucket policy as
Account-B-Offline-Feature-Store-Bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3CrossAccountBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetBucketAcl"
            ],
            "Principal": {
                "AWS": [
                    "Account-A-Offline-Feature-Store-Role-ARN"
                ]
            },
            "Resource": [
                "arn:aws:s3:::offline-store-bucket-name/*",
                "arn:aws:s3:::offline-store-bucket-name"
            ]
        }
    ]
}

In the preceding policy, the principal is Account-A-Offline-Feature-Store-Role-ARN, which is
the role created in Account A in Step 1 and provided to Amazon SageMaker Feature Store to write to the
offline store. You can provide multiple role ARNs under Principal.

Step 3: Set up an offline store AWS KMS encryption key in Account A

Amazon SageMaker Feature Store ensures that server-side encryption is always enabled for Amazon
S3 objects in the offline store. For cross-account use cases, you must provide a customer managed key
so that you are in control of who can write to the offline store (in this case, Account-A-Offline-
Feature-Store-Role-ARN from Account A) and who can read from the offline store (in this case,
identities from Account B).

This document refers to the ARN of the key with the following example key policy as
Account-A-Offline-Feature-Store-KMS-Key-ARN.

{
"Version": "2012-10-17",
"Id": "key-consolepolicy-3",
"Statement": [
{
"Sid": "Enable IAM User Permissions",
"Effect": "Allow",

"Principal": {
"AWS": "arn:aws:iam::Account-A-Account-Id:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow access for Key Administrators",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::Account-A-Account-Id:role/Administrator",
]
},
"Action": [
"kms:Create*",
"kms:Describe*",
"kms:Enable*",
"kms:List*",
"kms:Put*",
"kms:Update*",
"kms:Revoke*",
"kms:Disable*",
"kms:Get*",
"kms:Delete*",
"kms:TagResource",
"kms:UntagResource",
"kms:ScheduleKeyDeletion",
"kms:CancelKeyDeletion"
],
"Resource": "*"
},
{
"Sid": "Allow Feature Store to get information about the customer managed key",
"Effect": "Allow",
"Principal": {
"Service": "sagemaker.amazonaws.com"
},
"Action": [
"kms:Describe*",
"kms:Get*",
"kms:List*"
],
"Resource": "*"
},
{
"Sid": "Allow use of the key",
"Effect": "Allow",
"Principal": {
"AWS": [
"*Account-A-Offline-Feature-Store-Role-ARN*",
"*arn:aws:iam::Account-B-Account-Id:root*"
]
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:DescribeKey",
"kms:CreateGrant",
"kms:RetireGrant",
"kms:ReEncryptFrom",
"kms:ReEncryptTo",
"kms:GenerateDataKey",
"kms:ListAliases",
"kms:ListGrants"
],

"Resource": "*",
}
]
}

Step 4: Create a feature group in Account A


Next, create the feature group in Account A with an offline store Amazon S3 bucket in
Account B. To do this, provide the following parameters for RoleArn,
OfflineStoreConfig.S3StorageConfig.KmsKeyId, and
OfflineStoreConfig.S3StorageConfig.S3Uri, respectively:

• Provide Account-A-Offline-Feature-Store-Role-ARN as the RoleArn.
• Provide Account-A-Offline-Feature-Store-KMS-Key-ARN for OfflineStoreConfig.S3StorageConfig.KmsKeyId.
• Provide Account-B-Offline-Feature-Store-Bucket for OfflineStoreConfig.S3StorageConfig.S3Uri.

Logging Feature Store operations by using AWS CloudTrail

Amazon SageMaker Feature Store is integrated with AWS CloudTrail, a service that provides a record of
actions taken by a user, role, or an AWS service in Feature Store. CloudTrail captures all of the API calls
for Feature Store listed on this page. The logged events include API calls from Feature Store resource
management and data operations. When you create a trail, you activate continuous delivery of CloudTrail
events from Feature Store to an Amazon S3 bucket. Using the information collected by CloudTrail, you
can determine the request that was made to Feature Store, the IP address from which the request was
made, who made the request, when it was made, and additional details.

To learn more about CloudTrail, see the AWS CloudTrail User Guide.

Management events
Management events capture operations performed on Feature Store resources in your AWS account. For
example, the log generated from the management events provides visibility if a user creates or deletes a
feature group. The following APIs log management events with Amazon SageMaker Feature Store.

• CreateFeatureGroup
• DeleteFeatureGroup
• DescribeFeatureGroup
• UpdateFeatureGroup

Amazon SageMaker API calls and management events are logged by default when you create the
account, as described in Log Amazon SageMaker API Calls with AWS CloudTrail (p. 3285). For more
information, see Logging management events for trails.

Data events
Data events capture data plane operations performed using the Feature Store resources in your AWS
account. For example, the log generated from the data events provides visibility if a user adds or deletes
a record within a feature group. The following APIs log data events with Amazon SageMaker Feature
Store.


• BatchGetRecord
• DeleteRecord
• GetRecord
• PutRecord

Data events are not logged by CloudTrail trails by default. To activate logging of data events, turn on
logging of data plane API activity in CloudTrail. For more information, see CloudTrail's Logging data
events for trails.
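The following is a minimal sketch of turning on data event logging with the AWS SDK for Python (Boto3); the trail name is a placeholder, and the advanced event selector scopes logging to Feature Store data events.

import boto3

cloudtrail_client = boto3.client("cloudtrail")

cloudtrail_client.put_event_selectors(
    TrailName="your-trail-name",
    AdvancedEventSelectors=[
        {
            "Name": "Log Feature Store data events",
            "FieldSelectors": [
                {"Field": "eventCategory", "Equals": ["Data"]},
                {"Field": "resources.type", "Equals": ["AWS::SageMaker::FeatureGroup"]}
            ]
        }
    ]
)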

The following is an example CloudTrail event for a PutRecord API call:

{
"eventVersion": "1.08",
"userIdentity": {
"type": "IAMUser",
"principalId": "USERPRINCIPALID",
"arn": "arn:aws:iam::123456789012:user/user",
"accountId": "123456789012",
"accessKeyId": "USERACCESSKEYID",
"userName": "your-user-name"
},
"eventTime": "2023-01-01T01:00:00Z",
"eventSource": "sagemaker.amazonaws.com",
"eventName": "PutRecord",
"awsRegion": "us-east-1",
"sourceIPAddress": "192.0.2.0",
"userAgent": "your-user-agent",
"requestParameters": {
"featureGroupName": "your-feature-group-name"
},
"responseElements": null,
"requestID": "request-id",
"eventID": "event-id",
"readOnly": false,
"resources": [
{
"accountId": "123456789012",
"type": "AWS::SageMaker::FeatureGroup",
"ARN": "arn:aws:sagemaker:us-east-1:123456789012:feature-group/your-feature-
group-name"
}
],
"eventType": "AwsApiCall",
"managementEvent": false,
"recipientAccountId": "123456789012",
"eventCategory": "Data",
"tlsDetails": {
...
}
}

Security and access control


Amazon SageMaker Feature Store enables you to create two types of stores: an online store or an offline
store. The online store is used for low-latency, real-time inference use cases, whereas the offline store is
used for training and batch inference use cases. When you create a feature group for online or offline
use, you can provide an AWS Key Management Service customer managed key to encrypt all your data at
rest. If you do not provide an AWS KMS key, we ensure that your data is encrypted on the server
side using an AWS owned AWS KMS key or an AWS managed AWS KMS key.


While creating a feature group, you can select the storage type and optionally provide an AWS KMS key
for encrypting data. Then you can call various APIs for data management, such as PutRecord,
GetRecord, and DeleteRecord.

Feature Store allows you to grant or deny access to individuals at the feature group-level and enables
cross-account access to Feature Store. For example, you can set up developer accounts to access the
offline store for model training and exploration that do not have write access to production accounts.
You can set up production accounts to access both online and offline stores. Feature Store uses unique
customer AWS KMS keys for offline and online store data at-rest encryption. Access control is enabled
through both API and AWS KMS key access. You can also create feature group-level access control.

For more information about customer managed keys, see customer managed keys. For more information
about AWS KMS, see AWS KMS.

Using AWS KMS permissions for Amazon SageMaker Feature Store

Encryption at rest protects Feature Store under an AWS KMS key. By default, it uses an AWS owned key
for the online store and an AWS managed key for the offline store. Feature Store supports an option to
encrypt your online or offline store under a customer managed key instead. You can select the customer
managed key for Feature Store when you create your online or offline store, and the keys can be
different for each store.

Feature Store supports only symmetric customer managed keys. You cannot use an asymmetric customer
managed key to encrypt your data in your online or offline store. For help determining whether a
customer managed key is symmetric or asymmetric, see Identifying symmetric and asymmetric customer
managed keys.

When you use a customer managed key, you can take advantage of the following features:

• You create and manage the customer managed key, including setting the key policies, IAM policies
and grants to control access to the customer managed key. You can enable and disable the customer
managed key, enable and disable automatic key rotation, and delete the customer managed key when
it is no longer in use.
• You can use a customer managed key with imported key material or a customer managed key in a
custom key store that you own and manage.
• You can audit the encryption and decryption of your online or offline store by examining the API calls
to AWS KMS in AWS CloudTrail logs.

You do not pay a monthly fee for AWS owned keys. Customer managed keys incur a charge for each API
call, and AWS Key Management Service quotas apply to each customer managed key.

Authorizing use of a customer managed key for your online store

If you use a customer managed key to protect your online store, the policies on that customer managed
key must give Feature Store permission to use it on your behalf. You have full control over the policies
and grants on a customer managed key.

Feature Store does not need additional authorization to use the default AWS owned KMS key to protect
your online or offline stores in your AWS account.

Customer managed key policy


When you select a customer managed key to protect your online store, Feature Store must have
permission to use the customer managed key on behalf of the principal who makes the selection. That
principal, a user or role, must have the permissions on the customer managed key that Feature Store
requires. You can provide these permissions in a key policy, an IAM policy, or a grant. At a minimum,
Feature Store requires the following permissions on a customer managed key:

• "kms:Encrypt", "kms:Decrypt", "kms:DescribeKey", "kms:CreateGrant", "kms:RetireGrant",


"kms:ReEncryptFrom", "kms:ReEncryptTo", "kms:GenerateDataKey", "kms:ListAliases", "kms:ListGrants",
"kms:RevokeGrant"

The following example key policy provides only the required permissions. The policy has the
following effects:

• Allows Feature Store to use the customer managed key in cryptographic operations and create grants,
but only when it is acting on behalf of principals in the account who have permission to use your
Feature Store. If the principals specified in the policy statement don't have permission to use your
Feature Store, the call fails, even when it comes from the Feature Store service.
• The kms:ViaService condition key allows the permissions only when the request comes from
FeatureStore on behalf of the principals listed in the policy statement. These principals can't call these
operations directly. The value for kms:ViaService should be sagemaker.*.amazonaws.com.
Note
The kms:ViaService condition key can only be used for the online store customer managed
AWS KMS key; it cannot be used for the offline store. If you add this condition to your
customer managed key and use the same AWS KMS key for both the online and offline
store, the CreateFeatureGroup API operation fails.
• Gives the customer managed key administrators read-only access to the customer managed key and
permission to revoke grants, including the grants that Feature Store uses to protect your data.

Before using an example key policy, replace the example principals with actual principals from your AWS
account.

{"Id": "key-policy-feature-store",
"Version":"2012-10-17",
"Statement": [
{"Sid" : "Allow access through Amazon SageMaker Feature Store for all principals in
the account that are authorized to use Amazon SageMaker Feature Store",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::111122223333:user/featurestore-user"},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:DescribeKey",
"kms:CreateGrant",
"kms:RetireGrant",
"kms:ReEncryptFrom",
"kms:ReEncryptTo",
"kms:GenerateDataKey",
"kms:ListAliases",
"kms:ListGrants"
],
"Resource": "*",
"Condition": {"StringLike": {"kms:ViaService" : "sagemaker.*.amazonaws.com"
}
}
},
{"Sid": "Allow administrators to view the customer managed key and revoke grants",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::111122223333:role/featurestore-admin"
},
"Action": [
"kms:Describe*",

1264
Amazon SageMaker Developer Guide
Using grants to authorize Feature Store

"kms:Get*",
"kms:List*",
"kms:RevokeGrant"
],
"Resource": "*"
},
{"Sid": "Enable IAM User Permissions",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::123456789:root"
},
"Action": "kms:*",
"Resource": "*"
}
]
}

Using grants to authorize Feature Store


In addition to key policies, Feature Store uses grants to set permissions on the customer managed key.
To view the grants on a customer managed key in your account, use the ListGrants operation. Feature
Store does not need grants, or any additional permissions, to use the AWS owned key to protect your
online store.
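
For example, you can list the grants on a key with the AWS SDK for Python (Boto3). The following is a
minimal sketch; the key ID shown is a placeholder for your own customer managed key.

import boto3

kms = boto3.client("kms")

# List the grants on a customer managed key. The key ID below is a
# placeholder; substitute the ID or ARN of your own key.
response = kms.list_grants(KeyId="1234abcd-12ab-34cd-56ef-1234567890ab")

for grant in response["Grants"]:
    print(grant["GrantId"], grant.get("GranteePrincipal"), grant["Operations"])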

Feature Store uses the grant permissions when it performs background system maintenance and
continuous data protection tasks.

Each grant is specific to an online store. If the account includes multiple stores encrypted under the
same customer managed key, there is a unique grant for each FeatureGroup that uses that key.

The key policy can also allow the account to revoke the grant on the customer managed key. However,
if you revoke the grant on an active encrypted online store, Feature Store won't be able to protect and
maintain the store.

Monitoring Feature Store interaction with AWS KMS


If you use a customer managed key to protect your online or offline store, you can use AWS CloudTrail
logs to track the requests that Feature Store sends to AWS KMS on your behalf.
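
As an illustration, the following Boto3 sketch looks up recent AWS KMS events recorded by CloudTrail;
the filter values are assumptions that you would adapt to your own account.

import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up recent AWS KMS API calls recorded by CloudTrail; inspect each
# event to see which requests Feature Store made on your behalf.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "kms.amazonaws.com"}
    ],
    MaxResults=50,
)

for event in events["Events"]:
    print(event["EventName"], event["EventTime"])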

Accessing data in your online store


The caller (either a user or role) of all data plane operations (PutRecord, GetRecord, DeleteRecord) must
have the following permission on the customer managed key:

"kms:Decrypt"

Authorizing use of a customer managed key for your offline store

The RoleArn that is passed as a parameter to CreateFeatureGroup must have the following permission on
the OfflineStore KmsKeyId:

"kms:GenerateDataKey"


Note
The key policy for the online store also works for the offline store, but only when the
kms:ViaService condition is not specified.
Important
You can specify an AWS KMS encryption key to encrypt the Amazon S3 location used for your
offline feature store when you create a feature group. If no AWS KMS encryption key is
specified, all data at rest is encrypted by default using an AWS KMS key. By defining your
bucket-level key for SSE, you can reduce AWS KMS request costs by up to 99 percent.

Quotas, naming rules and data types


Quota terminologies
• Read Request Unit (RRU): A measure of read throughput, where the number of RRUs per read request is
equal to the ceiling of the read record's size divided by 4 KB. The minimum RRU per request is 0.
• Write Request Unit (WRU): A measure of write throughput, where the number of WRUs per write request
is equal to the ceiling of the written record's size divided by 1 KB. The minimum WRU per
request is 1 (including delete operations).
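
The arithmetic is a simple ceiling division. The following Python sketch (not part of any SageMaker API)
illustrates how request units accrue under these definitions.

import math

def read_request_units(record_size_kb: float) -> int:
    # RRUs = ceiling of the record size divided by 4 KB.
    return math.ceil(record_size_kb / 4)

def write_request_units(record_size_kb: float) -> int:
    # WRUs = ceiling of the record size divided by 1 KB,
    # with a minimum of 1 (delete operations included).
    return max(1, math.ceil(record_size_kb / 1))

print(read_request_units(9))     # a 9 KB read consumes 3 RRUs
print(write_request_units(2.5))  # a 2.5 KB write consumes 3 WRUs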

Limits and quotas


Note
Soft limits can be increased based on your needs.

• Maximum number of feature groups per AWS account: Soft limit of 100.
• Maximum number of feature definitions per feature group: 2500.
• Maximum number of RRU per record identifier: 2400 RRU per second.
• Maximum number of WRU per record identifier: 500 WRU per second.
• Maximum Transactions per second (TPS) per API per AWS account: Soft limit of 10000 TPS per API
excluding the BatchGetRecord API call, which has a soft limit of 500 TPS.
• Maximum size of a record: 350KB.
• Maximum size of a record identifier: 2KB.
• Maximum size of a feature value: 350KB.
• Maximum number of concurrent feature group creation workflows: 4.
• BatchGetRecord API: Can contain as many as 100 records and can query up to 10 feature groups.
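
For instance, the following Boto3 sketch retrieves several records in one BatchGetRecord call, staying
within the 100-record and 10-feature-group limits; the feature group name and identifiers are
placeholders.

import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Retrieve several records from one feature group in a single request.
# One call can span up to 10 feature groups and 100 records in total.
response = featurestore_runtime.batch_get_record(
    Identifiers=[
        {
            "FeatureGroupName": "customer-features",
            "RecordIdentifiersValueAsString": ["573291", "109721", "828217"],
        }
    ]
)
print(response["Records"])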

For information about service quotas, see AWS service quotas. For information about requesting an
increase to a quota, see Requesting a quota increase.

Naming rules
• Reserved Words: The following are reserved words and cannot be used as feature names in feature
definitions: is_deleted, write_time, and api_invocation_time.

Data types
• String Feature Type: Strings are Unicode with UTF-8 binary encoding. The minimum length of a string
can be zero, the maximum length is constrained by the maximum size of a record.


• Fractional Feature Type: Fractional feature values must conform to a double precision floating point
number as defined by the IEEE 754 standard.
• Integral Feature Type: Feature Store supports integral values in the range of a 64-bit signed integer:
a minimum value of -2^63 and a maximum value of 2^63 - 1.
• Event Time Features: All feature groups have an event time feature with nanosecond precision. Any
event time with lower than nanosecond precision will lead to backwards incompatibility. The feature
can have a feature type of either String or Fractional.
• A string event time is accepted in ISO-8601 format, in UTC time, conforming to the pattern(s):
[yyyy-MM-dd'T'HH:mm:ssZ, yyyy-MM-dd'T'HH:mm:ss.SSSSSSSSSZ].
• A fractional event time value is accepted as seconds from the Unix epoch. Event times must be in the
range of [0000-01-01T00:00:00.000000000Z, 9999-12-31T23:59:59.999999999Z]. For feature
groups in the Iceberg table format, you can only use the String type for the event time.
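
To make the event time formats concrete, here is a hedged Boto3 sketch that writes a record with a
string event time; the feature group and feature names are placeholders.

import boto3
from datetime import datetime, timezone

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# ISO-8601 event time in UTC, matching the accepted string pattern.
event_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

featurestore_runtime.put_record(
    FeatureGroupName="customer-features",
    Record=[
        {"FeatureName": "customer_id", "ValueAsString": "573291"},
        {"FeatureName": "avg_purchase", "ValueAsString": "42.75"},
        {"FeatureName": "EventTime", "ValueAsString": event_time},
    ],
)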

Amazon SageMaker Feature Store offline store data format
Amazon SageMaker Feature Store supports the AWS Glue and Apache Iceberg table formats for the
offline store. You can choose the table format when you’re creating a new feature group. AWS Glue is the
default format.

Amazon SageMaker Feature Store offline store data is stored in an Amazon S3 bucket within your
account. When you call PutRecord, your data is buffered, batched, and written into Amazon S3 within
15 minutes. Feature Store only supports the Parquet file format. Specifically, when your data is written
to your offline store, the data can only be retrieved from your Amazon S3 bucket in Parquet format. Each
file can contain multiple Records.

For the Iceberg format, Feature Store saves the table’s metadata in the same Amazon S3 bucket that
you’re using to store the offline store data. You can find it under the metadata prefix.

Feature Store also exposes the OfflineStoreConfig.S3StorageConfig.ResolvedOutputS3Uri field, which is
returned by the DescribeFeatureGroup API call. This is the S3 path under which the files for the
specific feature group are written.
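
For example, you can retrieve this field with Boto3 as sketched below; the feature group name is a
placeholder.

import boto3

sagemaker = boto3.client("sagemaker")

# Look up the resolved S3 output location of a feature group's
# offline store.
response = sagemaker.describe_feature_group(FeatureGroupName="customer-features")
print(response["OfflineStoreConfig"]["S3StorageConfig"]["ResolvedOutputS3Uri"])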

The following additional fields are added by Feature Store to each Record when they persist in the offline
store:

• api_invocation_time – The timestamp when the service receives the PutRecord or DeleteRecord
call. If you use managed ingestion (for example, Data Wrangler), this is the timestamp when data was
written into the offline store.
• write_time – The timestamp when data was written into the offline store. It can be used for
constructing time-travel related queries.
• is_deleted – False by default. When DeleteRecord is called, a new record with the same
RecordIdentifierValue is inserted into the offline store with is_deleted set to True.

The following information shows the organization of a Parquet file using the AWS Glue format:

s3://DOC-EXAMPLE-BUCKET/example-prefix-name/111122223333/sagemaker/AWS Region/
offline-store/example-feature-group-account-id/data/year=year/month=month/day=day/
hour=hour/timestamp_of_latest_event_time_in_file_16-random-alphanumeric-digits.parquet

Records in the offline store are partitioned by event time into hourly partitions. You can’t configure the
partitioning scheme. The following shows an example of the output location of a Parquet file:


s3://DOC-EXAMPLE-BUCKET/example-prefix/111122223333/sagemaker/AWS Region/offline-
store/customer-purchase-history-patterns-1593511200/data/year=2020/month=06/day=30/
hour=00/20200630T064401Z_108934320012Az11.parquet

The following shows the organization of the data files saved in the Iceberg table format.

s3://DOC-EXAMPLE-BUCKET/example-prefix/account-id/sagemaker/AWS Region/offline-
store/feature-group-name-feature-group-creation-time/data/8-random-alphanumeric-
digits/event-time-feature-name_trunc=event-time-year-event-time-month-event-time-day/
timestamp-of-latest-event-time-in-file_16-random-alphanumeric-digits.parquet

Records in the offline store are partitioned by event time into daily partitions. You can’t configure the
partitioning scheme. The following shows an example of the output location of a Parquet file where the
event time feature name is EventTime:

s3://DOC-EXAMPLE-BUCKET/example-prefix/sagemaker/AWS Region/offline-
store/customer-purchase-history-patterns-1593511200/data/0aec19ca/
EventTime_trunc=2022-11-09/20221109T215231Z_yolTtpyuWbkaeGIl.parquet

The following shows the example location of a metadata file for data files saved in the Iceberg table
format.

s3://DOC-EXAMPLE-BUCKET/example-prefix/account-id/sagemaker/AWS Region/offline-
store/feature-group-name-feature-group-creation-time/metadata/
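
As an illustrative, hedged sketch, you can read the offline store's Parquet files directly from Amazon S3
with pandas. The URI below is a placeholder resolved from DescribeFeatureGroup, and the snippet
assumes the pyarrow and s3fs packages are installed.

import pandas as pd

# Placeholder for the ResolvedOutputS3Uri of your feature group.
s3_uri = (
    "s3://DOC-EXAMPLE-BUCKET/example-prefix/111122223333/sagemaker/"
    "us-east-1/offline-store/customer-features-1593511200/data/"
)

# Offline store files are Parquet; pandas can read the partitioned
# dataset directly from S3.
df = pd.read_parquet(s3_uri)
print(df[["write_time", "api_invocation_time", "is_deleted"]].head())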

Amazon SageMaker Feature Store notebook examples
To get started using Amazon SageMaker Feature Store, you can choose from a variety of example
Jupyter notebooks in the table below. If this is your first time using Feature Store, try the
Introduction to Feature Store notebook. To run any of these notebooks, you must attach this policy to
your IAM execution role: AmazonSageMakerFeatureStoreAccess.

See IAM Roles to access your role and attach this policy. For a walkthrough on how to view the policies
attached to a role and how to add a policy to your role, see Adding policies to your IAM role.

Feature Store sample notebooks


The following table outlines a variety of sample notebooks that address different use cases of Feature
Store.

Notebook Title: Introduction to Feature Store
Description: An introduction to key Feature Store capabilities, such as how to create and configure a
feature group and how to ingest data into an online or offline store.

Notebook Title: Fraud Detection with Feature Store
Description: An advanced example of how to train a fraud detection model by ingesting data into a
Feature Store, querying it to form a training dataset, and training a simple model for inference.

Notebook Title: Encrypt Data in your online or offline store using AWS KMS key
Description: An advanced example of how to encrypt and decrypt data in an online or offline store using
an AWS KMS key, and how to verify that your data is encrypted. Note that this notebook addresses
encryption at rest.

Notebook Title: Client-side Encryption with Feature Store using AWS Encryption SDK
Description: An advanced example of how to do client-side encryption with Feature Store using the AWS
Encryption SDK library, which encrypts your data prior to ingesting it into your online or offline store.

Notebook Title: How to securely store an image dataset in Feature Store with AWS KMS key?
Description: An advanced example that demonstrates how to securely store a dataset of images in your
Feature Store using an AWS KMS key for server-side encryption.

Notebook Title: Create a machine learning workflow from an Amazon SageMaker Ground Truth
classification labeling job to Feature Store
Description: A machine learning (ML) workflow that demonstrates how to feed the output of an image or
text classification labeling job from Amazon SageMaker Ground Truth to Feature Store.

For a comprehensive set of notebooks with examples for common workflows, see SageMaker Feature
Store Workshop.


Train Models
The training stage of the full machine learning (ML) lifecycle spans from accessing your training dataset
to generating a final model and selecting the best performing model for deployment. The following
sections provide an overview of available SageMaker training features and resources with in-depth
technical information for each.

The simplest training workflow in SageMaker


If you’re using SageMaker for the first time and want to find a quick ML solution to train a model on
your dataset, consider using a no-code or low-code solution such as SageMaker Canvas, SageMaker
JumpStart within SageMaker Studio, or SageMaker Autopilot.

For intermediate coding experiences, consider using a SageMaker Studio notebook or SageMaker
Notebook Instances. To get started, follow the instructions at the section called “Step 4: Train a
Model” (p. 94) of the SageMaker Getting Started guide. We recommend this for use cases in which you
create your own model and training script using an ML framework.

The following architecture diagram shows how SageMaker manages ML training jobs and provisions
Amazon EC2 instances on behalf of SageMaker users. As a SageMaker user, you can bring your own
training dataset and save it to Amazon S3. You can choose a model to train from the available SageMaker
built-in algorithms, or bring your own training script with a model built using a popular machine learning
framework.

Full view of the SageMaker Training workflow and features
The full journey of ML training involves tasks beyond data ingestion to ML models, training models on
compute instances, and obtaining model artifacts and outputs. You need to evaluate every phase before,
during, and after training to make sure your model is trained well enough to meet the target accuracy for
your objectives.

The following flow chart shows a high-level overview of your actions (in blue boxes) and available
SageMaker Training features (in light blue boxes) throughout the training phase of the ML lifecycle.

The following sections walk you through each phase of training depicted in the previous flow chart,
along with useful features offered by SageMaker throughout the three sub-stages of ML training.

Topics
• Before training (p. 1271)
• During training (p. 1273)
• After training (p. 1275)

Before training
There are a number of scenarios for setting up data resources and access that you need to consider
before training. Refer to the following diagram and the details of each before-training stage to get a
sense of what decisions you need to make.


• Prepare data: Before training, you must have finished data cleaning and feature engineering during
the data preparation stage. SageMaker has several labeling and feature engineering tools to help you.
See Label Data, Prepare and Analyze Datasets, Process Data, and Create, Store, and Share Features for
more information.
• Choose an algorithm or framework: Depending on how much customization you need, there are
different options for algorithms and frameworks.
• If you prefer a low-code implementation of a pre-built algorithm, use one of the built-in algorithms
offered by SageMaker. For more information, see Choose an Algorithm.
• If you need more flexibility to customize your model, run your training script using your preferred
frameworks and toolkits within SageMaker. For more information, see ML Frameworks and Toolkits.
• To extend pre-built SageMaker Docker images as the base image of your own container, see Use Pre-
built SageMaker Docker images.
• To bring your custom Docker container to SageMaker, see Adapting your own Docker container to
work with SageMaker. You need to install the sagemaker-training-toolkit in your container.
• Manage data storage: Understand mapping between the data storage (such as Amazon S3, Amazon
EFS, or Amazon FSx) and the training container that runs in the Amazon EC2 compute instance.


SageMaker helps map the storage paths and local paths in the training container. You can also
manually specify them. After mapping is done, consider using one of the data transmission modes:
File, Pipe, and FastFile mode. To learn how SageMaker maps storage paths, see Training Storage
Folders.
• Set up access to training data: Use Amazon SageMaker Domain, a Domain user profile, IAM, Amazon
VPC, and AWS KMS to meet the requirements of the most security-sensitive organizations.
• For account administration, see Amazon SageMaker Domain.
• For a complete reference about IAM policies and security, see Security in Amazon SageMaker.
• Stream your input data: SageMaker provides three data input modes: File, Pipe, and FastFile. The
default input mode is File mode, which loads the entire dataset when initializing the training job. To
learn about general best practices for streaming data from your data storage to the training container,
see Access Training Data.

With Pipe mode, you can also consider using an augmented manifest file to stream your data
directly from Amazon Simple Storage Service (Amazon S3) and train your model. Using Pipe mode
reduces disk space because Amazon Elastic Block Store only needs to store your final model artifacts,
rather than your full training dataset. For more information, see Provide Dataset Metadata to
Training Jobs with an Augmented Manifest File.
• Analyze your data for bias: Before training, you can use SageMaker Clarify to analyze your dataset and
model for bias against a disfavored group, so that you can check that your model is trained on an
unbiased dataset.
• Choose which SageMaker SDK to use: There are two ways to launch a training job in SageMaker:
using the high-level SageMaker Python SDK, or using the low-level SageMaker APIs through the SDK for
Python (Boto3) or the AWS CLI. The SageMaker Python SDK abstracts the low-level SageMaker API to
provide convenient tools. As mentioned in the section called “The simplest training workflow in
SageMaker” (p. 1270), you can also pursue no-code or minimal-code options using SageMaker Canvas,
SageMaker JumpStart within SageMaker Studio, or SageMaker Autopilot. A minimal Python SDK sketch
follows this list.
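
The following is a minimal, hedged sketch of launching a training job with the high-level SageMaker
Python SDK; the image URI, role ARN, and S3 paths are placeholders for values from your own account.

from sagemaker.estimator import Estimator

# Configure a training job. Replace the image, role, and S3 locations
# with values from your own account.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://DOC-EXAMPLE-BUCKET/training-output/",
)

# Start the job, streaming the "train" channel from Amazon S3.
estimator.fit({"train": "s3://DOC-EXAMPLE-BUCKET/training-data/"})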

During training
During training, you need to continuously improve training stability, speed, and efficiency while scaling
compute resources, optimizing cost, and, most importantly, maintaining model performance. Read on
for more information about the during-training stages and relevant SageMaker Training features.


• Set up infrastructure: Choose the right instance type and infrastructure management tools for your
use case. You can start from a small instance and scale up depending on your workload. For training
a model on a tabular dataset, start with the smallest CPU instance of the C4 or C5 instance families.
For training a large model for computer vision or natural language processing, start with the smallest
GPU instance of the P2, P3, G4dn or G5 instance families. You can also mix different instance types in
a cluster, or keep instances in warm pools using the following instance management tools offered by
SageMaker. You can also use persistent cache to reduce latency and billable time on iterative training
jobs, beyond the latency reduction from warm pools alone. To learn more, see the following topics.
• Train Using a Heterogeneous Cluster (p. 2105)
• Train Using SageMaker Managed Warm Pools (p. 2119)
• Using persistent cache (p. 2121)

To check the currently available quotas in your account, use the Service Quotas console.
To learn how to request a quota increase, see Supported Regions and Quotas. Also, to find
pricing information and available instance types by AWS Region, look up the tables on
the Amazon SageMaker Pricing page.
• Run a training job from local code: You can annotate your local code with a remote decorator to run
your code as a SageMaker training job from inside Amazon SageMaker Studio, an Amazon SageMaker
notebook, or your local integrated development environment. For more information, see Run
your local code as a SageMaker training job (p. 1565). A short sketch follows.
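
The following hedged sketch shows the remote-decorator pattern with the SageMaker Python SDK; the
instance type is illustrative.

from sagemaker.remote_function import remote

# Decorating a local function runs it as a SageMaker training job.
@remote(instance_type="ml.m5.xlarge")
def divide(x, y):
    return x / y

# Invoking the function launches the job and returns its result.
print(divide(10, 2))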
• Track training jobs: Monitor and track your training jobs using SageMaker Experiments, SageMaker
Debugger, or Amazon CloudWatch. You can watch the model performance in terms of accuracy
and convergence, and run comparative analysis of metrics between multiple training jobs by using
SageMaker Experiments. You can watch the compute resource utilization rate by using SageMaker
Debugger’s profiling tools or Amazon CloudWatch. To learn more, see the following topics.
• Manage Machine Learning with Amazon SageMaker Experiments
• Profile Training Jobs Using Amazon SageMaker Debugger
• Monitor and Analyze Using CloudWatch Metrics

Additionally, for deep learning tasks, use the Amazon SageMaker Debugger model debugging tools
and built-in rules to identify more complex issues in model convergence and weight update processes.


• Distributed training: If your training job is going into a stable stage without breaking due to
misconfiguration of the training infrastructure or out-of-memory issues, you might want to find
more options to scale your job and run over an extended period of time for days and even months.
When you’re ready to scale up, consider distributed training. SageMaker provides various options for
distributed computation from light ML workloads to heavy deep learning workloads.

For deep learning tasks that involve training very large models on very large datasets, consider using
one of the SageMaker distributed training strategies to scale up and achieve data parallelism, model
parallelism, or a combination of the two. You can also use SageMaker Training Compiler for compiling
and optimizing model graphs on GPU instances. These SageMaker features support deep learning
frameworks such as PyTorch, TensorFlow, and Hugging Face Transformers.
• Model hyperparameter tuning: Tune your model hyperparameters using Automatic Model Tuning
with SageMaker. SageMaker provides hyperparameter tuning methods such as grid search and
Bayesian search, and can launch parallel tuning jobs with early-stopping functionality for
non-improving training jobs. A minimal sketch follows.
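
A minimal, hedged sketch with the SageMaker Python SDK follows; it assumes an already-configured
estimator whose training job emits a metric named validation:accuracy (both are assumptions here, not
part of any specific algorithm).

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# "estimator" is assumed to be a configured SageMaker Estimator whose
# training job reports a metric called "validation:accuracy".
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(0.001, 0.1)},
    strategy="Bayesian",
    max_jobs=10,
    max_parallel_jobs=2,
    early_stopping_type="Auto",
)

tuner.fit({"train": "s3://DOC-EXAMPLE-BUCKET/training-data/"})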
• Checkpointing and cost saving with Spot instances: If training time is not a big concern, you might
consider optimizing model training costs with managed Spot instances. Note that you must activate
checkpointing for Spot training so that your job can resume after intermittent pauses due to Spot
instance replacements. You can also use the checkpointing functionality to back up your models in case
of unexpected training job termination. To learn more, see the following topics.
• Managed Spot Training
• Use Checkpoints
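
The following hedged Python SDK sketch enables managed Spot training with checkpointing; the image,
role, and S3 URIs are placeholders.

from sagemaker.estimator import Estimator

# Managed Spot training with checkpointing enabled.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,   # maximum training time, in seconds
    max_wait=7200,  # must be >= max_run when using Spot instances
    checkpoint_s3_uri="s3://DOC-EXAMPLE-BUCKET/checkpoints/",
)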

After training
After training, you obtain a final model artifact to use for model deployment and inference. There are
additional actions involved in the after-training phase as shown in the following diagram.


• Obtain baseline model: After you have the model artifact, you can set it as a baseline model. Consider
the following post-training actions and using SageMaker features before moving on to model
deployment to production.
• Examine model performance and check for bias: Use Amazon CloudWatch Metrics and SageMaker
Clarify post-training bias detection to detect any bias in incoming data and in the model over time,
measured against the baseline. You need to evaluate new data, and your model's predictions on that
data, regularly or in real time. Using these features, you can receive alerts about any acute changes or
anomalies, as well as gradual changes or drifts in data and model.
• You can also use the Incremental Training functionality of SageMaker to load and update your model
(or fine-tune) with an expanded dataset.
• You can register model training as a step in your SageMaker Pipeline or as part of other Workflow
features offered by SageMaker in order to orchestrate the full ML lifecycle.

Choose an Algorithm
Machine learning can help you accomplish empirical tasks that require some sort of inductive inference.
These tasks involve induction because they use data to train algorithms to make generalizable
inferences. This means that the algorithms can make statistically reliable predictions or decisions, or
complete other tasks, when applied to new data that was not used to train them.

To help you select the best algorithm for your task, we classify these tasks on various levels of
abstraction. At the highest level of abstraction, machine learning attempts to find patterns or
relationships between features or less structured items, such as text in a data set. Pattern recognition
techniques can be classified into distinct machine learning paradigms, each of which addresses specific
problem types. There are currently three basic paradigms for machine learning used to address various
problem types:

• Supervised learning (p. 1279)


• Unsupervised learning (p. 1280)
• Reinforcement learning (p. 1280)

The types of problems that each learning paradigm can address are identified by considering the
inferences (or predictions, decisions, or other tasks) you want to make from the type of data that you
have or could collect. Machine learning paradigms use algorithmic methods to address their various
problem types. The algorithms provide recipes for solving these problems.

However, many algorithms, such as neural networks, can be deployed with different learning paradigms
and on different types of problems. Multiple algorithms can also address a specific problem type. Some
algorithms are more generally applicable and others are quite specific for certain kinds of objectives and
data. So the mapping between machine learning algorithms and problem types is many-to-many. Also,
there are various implementation options available for algorithms.

The following sections provide guidance concerning implementation options, machine learning
paradigms, and algorithms appropriate for different problem types.

Topics
• Choose an algorithm implementation (p. 1277)
• Problem types for the basic machine learning paradigms (p. 1279)
• Use Amazon SageMaker Built-in Algorithms or Pre-trained Models (p. 1281)
• Use Reinforcement Learning with Amazon SageMaker (p. 1559)

Choose an algorithm implementation


After choosing an algorithm, you must decide which implementation of it you want to use. Amazon
SageMaker supports several implementation options that require increasing levels of effort.

• Pre-trained models require the least effort and are models ready to deploy or to fine-tune and deploy
using SageMaker JumpStart.
• Built-in algorithms require more effort and scale if the data set is large and significant resources are
needed to train and deploy the model.
• If there is no built-in solution that works, try developing one that uses pre-made images for
supported machine learning and deep learning frameworks, such as Scikit-Learn, TensorFlow, PyTorch,
MXNet, or Chainer.
• If you need to run custom packages or use any code which isn’t a part of a supported framework or
available via PyPi, then you need to build your own custom Docker image that is configured to install
the necessary packages or software. The custom image must also be pushed to an online repository
like the Amazon Elastic Container Registry.

Topics
• Use a built-in algorithm (p. 1278)
• Use script mode in a supported framework (p. 1278)
• Use a custom Docker image (p. 1279)


Algorithm implementation guidance

Implementation | Requires code | Pre-coded algorithms | Support for third-party packages | Support for custom code | Level of effort
Built-in | No | Yes | No | No | Low
Scikit-learn | Yes | Yes | PyPi only | Yes | Medium
Spark ML | Yes | Yes | PyPi only | Yes | Medium
XGBoost (open source) | Yes | Yes | PyPi only | Yes | Medium
TensorFlow | Yes | No | PyPi only | Yes | Medium-high
PyTorch | Yes | No | PyPi only | Yes | Medium-high
MXNet | Yes | No | PyPi only | Yes | Medium-high
Chainer | Yes | No | PyPi only | Yes | Medium-high
Custom image | Yes | No | Yes, from any source | Yes | High

Use a built-in algorithm


When choosing an algorithm for your type of problem and data, the easiest option is to use one of
Amazon SageMaker's built-in algorithms. These built-in algorithms come with two major benefits.

• The built-in algorithms require no coding to start running experiments. The only inputs you need to
provide are the data, hyperparameters, and compute resources. This allows you to run experiments
more quickly, with less overhead for tracking results and code changes.
• The built-in algorithms come with parallelization across multiple compute instances and GPU support
right out of the box for all applicable algorithms (some algorithms may not be included due to
inherent limitations). If you have a lot of data with which to train your model, most built-in algorithms
can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be
easier to use its corollary in SageMaker and input the hyperparameters you already know than to port
it over using script mode on a supported framework.

For more information on the built-in algorithms provided by SageMaker, see Use Amazon SageMaker
Built-in Algorithms or Pre-trained Models (p. 1281).

For important information about docker registry paths, data formats, recommended EC2 instance types,
and CloudWatch logs common to all of the built-in algorithms provided by SageMaker, see Common
Information About Built-in Algorithms (p. 1287).

Use script mode in a supported framework


If the algorithm you want to use for your model is not supported by a built-in choice and you are
comfortable coding your own solution, then you should consider using an Amazon SageMaker supported
framework. This is referred to as "script mode" because you write your custom code (script) in a text file
with a .py extension. As the table above indicates, SageMaker supports most of the popular machine
learning frameworks. These frameworks come preloaded with the corresponding framework and some
additional Python packages, such as Pandas and NumPy, so you can write your own code for training an
algorithm. These frameworks also allow you to install any Python package hosted on PyPi by including a
requirements.txt file with your training code, or to include your own code directories. R is also supported
natively in SageMaker notebook kernels. Some frameworks, like scikit-learn and Spark ML, have pre-
coded algorithms you can use easily, while other frameworks like TensorFlow and PyTorch may require
you to implement the algorithm yourself. The only limitation when using a supported framework image
is that you cannot import any software packages that are not hosted on PyPi or that are not already
included with the framework’s image.

For more information on the frameworks supported by SageMaker, see Use Machine Learning
Frameworks, Python, and R with Amazon SageMaker (p. 15).

Use a custom Docker image


Amazon SageMaker's built-in algorithms and supported frameworks should cover most use cases, but
there are times when you may need to use an algorithm from a package not included in any of the
supported frameworks. You might also have a pre-trained model, pickled or persisted somewhere, which
you need to deploy. SageMaker uses Docker images to host the training and serving of all models, so
you can supply your own custom Docker image if the package or software you need is not included in
a supported framework. This may be your own Python package or an algorithm coded in a language
like Stan or Julia. For these images you must also configure the training of the algorithm and serving
of the model properly in your Dockerfile. This requires intermediate knowledge of Docker and is not
recommended unless you are comfortable writing your own machine learning algorithm. Your Docker
image must be uploaded to an online repository, such as the Amazon Elastic Container Registry (ECR)
before you can train and serve your model properly.

For more information on custom Docker images in SageMaker, see Using Docker containers with
SageMaker (p. 2668).

Problem types for the basic machine learning paradigms
The following three sections describe the main problem types addressed by the three basic paradigms
for machine learning. For a list of the built-in algorithms that SageMaker provides to address these
problem types, see Use Amazon SageMaker Built-in Algorithms or Pre-trained Models (p. 1281).

Topics
• Supervised learning (p. 1279)
• Unsupervised learning (p. 1280)
• Reinforcement learning (p. 1280)

Supervised learning
If your data set consists of features or attributes (inputs) that contain target values (outputs), then you
have a supervised learning problem. If your target values are categorical (mathematically discrete),
then you have a classification problem. It is a standard practice to distinguish binary from multiclass
classification.

• Binary classification is a type of supervised learning that assigns an individual to one of two
predefined and mutually exclusive classes based on the individual's attributes. It is supervised because
the models are trained using examples in which the attributes are provided with correctly labeled
objects. A medical diagnosis for whether an individual has a disease or not based on the results of
diagnostic tests is an example of binary classification.
• Multiclass classification is a type of supervised learning that assigns an individual to one of several
classes based on the individual's attributes. It is supervised because the models are trained using
examples in which the attributes are provided with correctly labeled objects. An example is the
prediction of the topic most relevant to a text document. A document may be classified as being about
religion, politics, or finance, or as about one of several other predefined topic classes.

If the target values you are trying to predict are mathematically continuous, then you have a regression
problem. Regression estimates the values of a dependent target variable based on one or more other
variables or attributes that are correlated with it. An example is the prediction of house prices using
features like the number of bathrooms and bedrooms and the square footage of the house and garden.
Regression analysis can create a model that takes one or more of these features as an input and predicts
the price of a house.

For more information on the built-in supervised learning algorithms provided by SageMaker, see
Supervised Learning (p. 1285).

Unsupervised learning
If your data set consists of features or attributes (inputs) that do not contain labels or target values
(outputs), then you have an unsupervised learning problem. In this type of problem, the output must be
predicted based on the pattern discovered in the input data. The goal in unsupervised learning problems
is to discover patterns such as groupings within the data. There are a large variety of tasks or problem
types to which unsupervised learning can be applied. Principal component and cluster analyses are two
of the main methods commonly deployed for preprocessing data. Here is a short list of problem types
that can be addressed by unsupervised learning:

• Dimension reduction is typically part of a data exploration step used to determine the most relevant
features to use for model construction. The idea is to transform data from a high-dimensional,
sparsely populated space into a low-dimensional space that retains most significant properties of
the original data. This provides relief for the curse of dimensionality that can arise with sparsely
populated, high-dimensional data on which statistical analysis becomes problematic. It can also be
used to help understand data, reducing high-dimensional data to a lower dimensionality that can be
visualized.
• Cluster analysis is a class of techniques that are used to classify objects or cases into groups called
clusters. It attempts to find discrete groupings within data, where members of a group are as similar
as possible to one another and as different as possible from members of other groups. You define the
features or attributes that you want the algorithm to use to determine similarity, select a distance
function to measure similarity, and specify the number of clusters to use in the analysis.
• Anomaly detection is the identification of rare items, events, or observations in a data set which raise
suspicions because they differ significantly from the rest of the data. The identification of anomalous
items can be used, for example, to detect bank fraud or medical errors. Anomalies are also referred to
as outliers, novelties, noise, deviations, and exceptions.
• Density estimation is the construction of estimates of unobservable underlying probability density
functions based on observed data. A natural use of density estimates is for data exploration. Density
estimates can discover features such as skewness and multimodality in the data. The most basic form
of density estimation is a rescaled histogram.

SageMaker provides several built-in machine learning algorithms that you can use for these
unsupervised learning tasks. For more information on the built-in unsupervised algorithms provided by
SageMaker, see Unsupervised Learning (p. 1285).

Reinforcement learning
Reinforcement learning is a type of learning that is based on interaction with the environment. This
type of learning is used by an agent that must learn behavior through trial-and-error interactions with
a dynamic environment in which the goal is to maximize the long-term rewards that the agent receives
as a result of its actions. Rewards are maximized by trading off exploring actions that have uncertain
rewards with exploiting actions that have known rewards.


For more information on SageMaker's frameworks, toolkits, and environments for reinforcement
learning, see Use Reinforcement Learning with Amazon SageMaker (p. 1559).

Use Amazon SageMaker Built-in Algorithms or Pre-trained Models
Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution
templates to help data scientists and machine learning practitioners get started on training and
deploying machine learning models quickly. For someone who is new to SageMaker, choosing the
right algorithm for your particular use case can be a challenging task. The following table provides
a quick cheat sheet that shows how you can start with an example problem or use case and find an
appropriate built-in algorithm offered by SageMaker that is valid for that problem type. Additional
guidance organized by learning paradigms (supervised and unsupervised) and important data domains
(text and images) is provided in the sections following the table.

Table: Mapping use cases to built-in algorithms

Example problems and use cases: A few examples out of the 15 problem types that can be addressed by
the pre-trained models and pre-built solution templates provided by SageMaker JumpStart: Question
answering (a chatbot that outputs an answer for a given question); Text analysis (analyze texts from
models specific to an industry domain, such as finance).
Learning paradigm or domain: Pre-trained models and pre-built solution templates
Problem types: Image Classification, Tabular Classification, Tabular Regression, Text Classification,
Object Detection, Text Embedding, Question Answering, Sentence Pair Classification, Image Embedding,
Named Entity Recognition, Instance Segmentation, Text Generation, Text Summarization, Semantic
Segmentation, Machine Translation
Data input format: Image, Text, Tabular
Built-in algorithms: Popular models, including Mobilenet, YOLO, Faster R-CNN, BERT, LightGBM, and
CatBoost. For a list of available pre-trained models, see JumpStart Models. For a list of available
pre-built solution templates, see JumpStart Solutions.

Example problems and use cases: Predict if an item belongs to a category: an email spam filter
Learning paradigm or domain: Supervised Learning (p. 1285)
Problem types: Binary/multi-class classification
Data input format: Tabular
Built-in algorithms: AutoGluon-Tabular (p. 1301), CatBoost (p. 1308), Factorization Machines
Algorithm (p. 1317), K-Nearest Neighbors (k-NN) Algorithm (p. 1327), LightGBM (p. 1336), Linear Learner
Algorithm (p. 1345), TabTransformer (p. 1362), XGBoost Algorithm (p. 1369)

Example problems and use cases: Predict a numeric/continuous value: estimate the value of a house
Learning paradigm or domain: Supervised Learning (p. 1285)
Problem types: Regression
Data input format: Tabular
Built-in algorithms: AutoGluon-Tabular (p. 1301), CatBoost (p. 1308), Factorization Machines
Algorithm (p. 1317), K-Nearest Neighbors (k-NN) Algorithm (p. 1327), LightGBM (p. 1336), Linear Learner
Algorithm (p. 1345), TabTransformer (p. 1362), XGBoost Algorithm (p. 1369)

Example problems and use cases: Based on historical data for a behavior, predict future behavior:
predict sales of a new product based on previous sales data
Learning paradigm or domain: Supervised Learning (p. 1285)
Problem types: Time-series forecasting
Data input format: Tabular
Built-in algorithms: DeepAR Forecasting Algorithm (p. 1460)

Example problems and use cases: Improve the data embeddings of high-dimensional objects: identify
duplicate support tickets or find the correct routing based on similarity of text in the tickets
Learning paradigm or domain: Supervised Learning (p. 1285)
Problem types: Embeddings: convert high-dimensional objects into low-dimensional space
Data input format: Tabular
Built-in algorithms: Object2Vec Algorithm (p. 1421)

Example problems and use cases: Drop those columns from a dataset that have a weak relation with the
label/target variable: the color of a car when predicting its mileage
Learning paradigm or domain: Unsupervised Learning (p. 1285)
Problem types: Feature engineering: dimensionality reduction
Data input format: Tabular
Built-in algorithms: Principal Component Analysis (PCA) Algorithm (p. 1493)

Example problems and use cases: Detect abnormal behavior in an application: spot when an IoT sensor is
sending abnormal readings
Learning paradigm or domain: Unsupervised Learning (p. 1285)
Problem types: Anomaly detection
Data input format: Tabular
Built-in algorithms: Random Cut Forest (RCF) Algorithm (p. 1497)

Example problems and use cases: Protect your application from suspicious users: detect if an IP address
accessing a service might be from a bad actor
Learning paradigm or domain: Unsupervised Learning (p. 1285)
Problem types: IP anomaly detection
Data input format: Tabular
Built-in algorithms: IP Insights (p. 1476)

Example problems and use cases: Group similar objects/data together: find high-, medium-, and
low-spending customers from their transaction histories
Learning paradigm or domain: Unsupervised Learning (p. 1285)
Problem types: Clustering or grouping
Data input format: Tabular
Built-in algorithms: K-Means Algorithm (p. 1485)

Example problems and use cases: Organize a set of documents into topics (not known in advance): tag a
document as belonging to a medical category based on the terms used in the document
Learning paradigm or domain: Unsupervised Learning (p. 1285)
Problem types: Topic modeling
Data input format: Text
Built-in algorithms: Latent Dirichlet Allocation (LDA) Algorithm (p. 1409), Neural Topic Model (NTM)
Algorithm (p. 1415)

Example problems and use cases: Assign pre-defined categories to documents in a corpus: categorize
books in a library into academic disciplines
Learning paradigm or domain: Textual Analysis (p. 1286)
Problem types: Text classification
Data input format: Text
Built-in algorithms: BlazingText algorithm (p. 1399), Text Classification - TensorFlow (p. 1450)

Example problems and use cases: Convert text from one language to another: Spanish to English
Learning paradigm or domain: Textual Analysis (p. 1286)
Problem types: Machine translation
Data input format: Text
Built-in algorithms: Sequence-to-Sequence Algorithm (p. 1437)

Example problems and use cases: Summarize a long text corpus: an abstract for a research paper
Learning paradigm or domain: Textual Analysis (p. 1286)
Problem types: Text summarization
Data input format: Text
Built-in algorithms: Sequence-to-Sequence Algorithm (p. 1437)

Example problems and use cases: Convert audio files to text: transcribe call center conversations for
further analysis
Learning paradigm or domain: Textual Analysis (p. 1286)
Problem types: Speech-to-text
Data input format: Text
Built-in algorithms: Sequence-to-Sequence Algorithm (p. 1437)

Example problems and use cases: Label/tag an image based on the content of the image: alerts about
adult content in an image
Learning paradigm or domain: Image Processing (p. 1286)
Problem types: Image and multi-label classification
Data input format: Image
Built-in algorithms: Image Classification - MXNet (p. 1506)

Example problems and use cases: Classify something in an image using transfer learning
Learning paradigm or domain: Image Processing (p. 1286)
Problem types: Image classification
Data input format: Image
Built-in algorithms: Image Classification - TensorFlow (p. 1517)

Example problems and use cases: Detect people and objects in an image: police review a large photo
gallery for a missing person
Learning paradigm or domain: Image Processing (p. 1286)
Problem types: Object detection and classification
Data input format: Image
Built-in algorithms: Object Detection - MXNet (p. 1530), Object Detection - TensorFlow (p. 1541)

Example problems and use cases: Tag every pixel of an image individually with a category: self-driving
cars prepare to identify objects in their way
Learning paradigm or domain: Image Processing (p. 1286)
Problem types: Computer vision
Data input format: Image
Built-in algorithms: Semantic Segmentation Algorithm (p. 1549)
For important information about Docker registry paths, data formats, recommended Amazon EC2
instance types, and CloudWatch logs common to all of the built-in algorithms provided by SageMaker,
see Common Information About Built-in Algorithms (p. 1287).

The following sections provide additional guidance for the Amazon SageMaker built-in algorithms
grouped by the supervised and unsupervised learning paradigms to which they belong. For descriptions
of these learning paradigms and their associated problem types, see Choose an Algorithm (p. 1276).
Sections are also provided for the SageMaker built-in algorithms available to address two important
machine learning domains: textual analysis and image processing.


• Pre-trained Models and Solution Templates (p. 1285)


• Supervised Learning (p. 1285)
• Unsupervised Learning (p. 1285)
• Textual Analysis (p. 1286)
• Image Processing (p. 1286)

Pre-trained Models and Solution Templates


SageMaker JumpStart provides a wide range of pre-trained models, pre-built solution templates, and
examples for popular problem types that use the SageMaker SDK as well as Studio. For more information
about these models, solutions, and the example notebooks provided by SageMaker JumpStart, see
SageMaker JumpStart (p. 47).

Supervised Learning
Amazon SageMaker provides several built-in general purpose algorithms that can be used for either
classification or regression problems.

• AutoGluon-Tabular (p. 1301)—an open-source AutoML framework that succeeds by ensembling


models and stacking them in multiple layers.
• CatBoost (p. 1308)—an implementation of the gradient-boosted trees algorithm that introduces
ordered boosting and an innovative algorithm for processing categorical features.
• Factorization Machines Algorithm (p. 1317)—an extension of a linear model that is designed to
economically capture interactions between features within high-dimensional sparse datasets.
• K-Nearest Neighbors (k-NN) Algorithm (p. 1327)—a non-parametric method that uses the k nearest
labeled points to assign a label to a new data point for classification or a predicted target value from
the average of the k nearest points for regression.
• LightGBM (p. 1336)—an implementation of the gradient-boosted trees algorithm that adds two novel
techniques for improved efficiency and scalability: Gradient-based One-Side Sampling (GOSS) and
Exclusive Feature Bundling (EFB).
• Linear Learner Algorithm (p. 1345)—learns a linear function for regression or a linear threshold
function for classification.
• TabTransformer (p. 1362)—a novel deep tabular data modeling architecture built on self-attention-
based Transformers.
• XGBoost Algorithm (p. 1369)—an implementation of the gradient-boosted trees algorithm that
combines an ensemble of estimates from a set of simpler and weaker models.

Amazon SageMaker also provides several built-in supervised learning algorithms that are used for more
specialized tasks during feature engineering and forecasting from time series data.

• Object2Vec Algorithm (p. 1421)—a new highly customizable multi-purpose algorithm used for
feature engineering. It can learn low-dimensional dense embeddings of high-dimensional objects to
produce features that improve training efficiencies for downstream models. While this is a supervised
algorithm, as it requires labeled data for training, there are many scenarios in which the relationship
labels can be obtained purely from natural clusterings in data, without any explicit human annotation.
• DeepAR Forecasting Algorithm (p. 1460)—a supervised learning algorithm for forecasting scalar (one-
dimensional) time series using recurrent neural networks (RNN).

Unsupervised Learning
Amazon SageMaker provides several built-in algorithms that can be used for a variety of unsupervised
learning tasks such as clustering, dimension reduction, pattern recognition, and anomaly detection.


• Principal Component Analysis (PCA) Algorithm (p. 1493)—reduces the dimensionality (number of
features) within a dataset by projecting data points onto the first few principal components. The
objective is to retain as much information or variation as possible. For mathematicians, principal
components are eigenvectors of the data's covariance matrix.
• K-Means Algorithm (p. 1485)—finds discrete groupings within data, where members of a group are as
similar as possible to one another and as different as possible from members of other groups.
• IP Insights (p. 1476)—learns the usage patterns for IPv4 addresses. It is designed to capture
associations between IPv4 addresses and various entities, such as user IDs or account numbers.
• Random Cut Forest (RCF) Algorithm (p. 1497)—detects anomalous data points within a data set that
diverge from otherwise well-structured or patterned data.

Textual Analysis
SageMaker provides algorithms that are tailored to the analysis of textual documents used in natural
language processing, document classification or summarization, topic modeling or classification, and
language transcription or translation.

• BlazingText algorithm (p. 1399)—a highly optimized implementation of the Word2vec and text
classification algorithms that scale to large datasets easily. It is useful for many downstream natural
language processing (NLP) tasks.
• Sequence-to-Sequence Algorithm (p. 1437)—a supervised algorithm commonly used for neural
machine translation.
• Latent Dirichlet Allocation (LDA) Algorithm (p. 1409)—an algorithm suitable for determining topics in
a set of documents. It is an unsupervised algorithm, which means that it doesn't use example data with
answers during training.
• Neural Topic Model (NTM) Algorithm (p. 1415)—another unsupervised technique for determining
topics in a set of documents, using a neural network approach.
• Text Classification - TensorFlow (p. 1450)—a supervised algorithm that supports transfer learning with
available pretrained models for text classification.

Image Processing
SageMaker also provides image processing algorithms that are used for image classification, object
detection, and computer vision.

• Image Classification - MXNet (p. 1506)—uses example data with answers (referred to as a supervised
algorithm). Use this algorithm to classify images.
• Image Classification - TensorFlow (p. 1517)—uses pretrained TensorFlow Hub models to fine-tune for
specific tasks (referred to as a supervised algorithm). Use this algorithm to classify images.
• Semantic Segmentation Algorithm (p. 1549)—provides a fine-grained, pixel-level approach to
developing computer vision applications.
• Object Detection - MXNet (p. 1530)—detects and classifies objects in images using a single deep
neural network. It is a supervised learning algorithm that takes images as input and identifies all
instances of objects within the image scene.
• Object Detection - TensorFlow (p. 1541)—detects bounding boxes and object labels in an image. It is
a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow
models.

Topics
• Common Information About Built-in Algorithms (p. 1287)


• Built-in SageMaker Algorithms for Tabular Data (p. 1300)


• Built-in SageMaker Algorithms for Text Data (p. 1398)
• Built-in SageMaker Algorithms for Time-Series Data (p. 1460)
• Unsupervised Built-in SageMaker Algorithms (p. 1475)
• Built-in SageMaker Algorithms for Computer Vision (p. 1505)

Common Information About Built-in Algorithms


The following table lists parameters for each of the algorithms provided by Amazon SageMaker.

Algorithm name | Channel names | Training input mode | File type | Instance class | Parallelizable
AutoGluon-Tabular | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No
BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens) | CPU or GPU (single instance only) | No
CatBoost | training and (optionally) validation | File | CSV | CPU (single instance only) | No
DeepAR Forecasting | train and (optionally) test | File | JSON Lines or Parquet | CPU or GPU | Yes
Factorization Machines | train and (optionally) test | File or Pipe | recordIO-protobuf | CPU (GPU for dense data) | Yes
Image Classification - MXNet | train and validation, (optionally) train_lst, validation_lst, and model | File or Pipe | recordIO or image files (.jpg or .png) | GPU | Yes
Image Classification - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png) | CPU or GPU | Yes (only across multiple GPUs on a single instance)
IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes
K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No
K-Nearest-Neighbors (k-NN) | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | Yes
LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No
LightGBM | train/training and (optionally) validation | File | CSV | CPU | Yes
Linear Learner | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes
Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes
Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines | CPU or GPU (single instance only) | No
Object Detection - MXNet | train and validation, (optionally) train_annotation, validation_annotation, and model | File or Pipe | recordIO or image files (.jpg or .png) | GPU | Yes
Object Detection - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png) | GPU | Yes (only across multiple GPUs on a single instance)
PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes
Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes
Semantic Segmentation | train and validation, train_annotation, validation_annotation, and (optionally) label_map and model | File or Pipe | Image files | GPU (single instance only) | No
Seq2Seq Modeling | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No
TabTransformer | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No
Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance)
XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2) | train and (optionally) validation | File or Pipe | CSV, LibSVM, or Parquet | CPU (or GPU for 1.2-1) | Yes

Algorithms that are parallelizable can be deployed on multiple compute instances for distributed
training.

The following topics provide information about data formats, recommended Amazon EC2 instance types,
and CloudWatch logs common to all of the built-in algorithms provided by Amazon SageMaker.
Note
To look up the Docker image URIs of the built-in algorithms managed by SageMaker, see Docker
Registry Paths and Example Code.

Topics
• Common Data Formats for Built-in Algorithms (p. 1289)
• Instance Types for Built-in Algorithms (p. 1298)
• Logs for Built-in Algorithms (p. 1299)

Common Data Formats for Built-in Algorithms


The following topics explain the data formats for the algorithms provided by Amazon SageMaker.


Topics
• Common Data Formats for Training (p. 1290)
• Common Data Formats for Inference (p. 1293)

Common Data Formats for Training

To prepare for training, you can preprocess your data using a variety of AWS services, including AWS
Glue, Amazon EMR, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After
preprocessing, publish the data to an Amazon S3 bucket. For training, the data must go through a
series of conversions and transformations, including:

• Training data serialization (handled by you)


• Training data deserialization (handled by the algorithm)
• Training model serialization (handled by the algorithm)
• Trained model deserialization (optional, handled by you)

When you use Amazon SageMaker for training, upload all of the training data at once. If more data
is added to that Amazon S3 location later, you must make a new training call to construct a brand
new model.

Topics
• Content Types Supported by Built-In Algorithms (p. 1290)
• Using Pipe Mode (p. 1291)
• Using CSV Format (p. 1291)
• Using RecordIO Format (p. 1291)
• Trained Model Deserialization (p. 1293)

Content Types Supported by Built-In Algorithms

The following table lists some of the commonly supported ContentType values and the algorithms that
use them:

ContentTypes for Built-in Algorithms

• application/x-image: Object Detection Algorithm, Semantic Segmentation
• application/x-recordio: Object Detection Algorithm
• application/x-recordio-protobuf: Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence
• application/jsonlines: BlazingText, DeepAR
• image/jpeg: Object Detection Algorithm, Semantic Segmentation
• image/png: Object Detection Algorithm, Semantic Segmentation
• text/csv: IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost
• text/libsvm: XGBoost

For a summary of the parameters used by each algorithm, see the documentation for the individual
algorithms or this table.

Using Pipe Mode

In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3).
Streaming can provide faster start times for training jobs and better throughput. This is in contrast to
File mode, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses
disk space to store both your final model artifacts and your full training dataset. By streaming in your
data directly from Amazon S3 in Pipe mode, you reduce the size of Amazon Elastic Block Store volumes
of your training instances. Pipe mode needs only enough disk space to store your final model artifacts.
See the AlgorithmSpecification for additional details on the training input mode.
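
With the SageMaker Python SDK, a single channel can opt into Pipe mode through TrainingInput. The following is a minimal sketch, not part of this guide's example set; the S3 prefix and the estimator that would consume the channel are placeholders.

from sagemaker.inputs import TrainingInput

# A sketch: stream the train channel from S3 instead of copying it to the
# training instance volume. The S3 prefix below is a placeholder.
train_input = TrainingInput(
    "s3://my-bucket/train/",
    content_type="text/csv",
    input_mode="Pipe",
)

# A previously configured estimator would then consume the channel:
# estimator.fit({"train": train_input})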

Using CSV Format

Many Amazon SageMaker algorithms support training with data in CSV format. To use data in CSV
format for training, in the input data channel specification, specify text/csv as the ContentType.
Amazon SageMaker requires that a CSV file does not have a header record and that the target variable
is in the first column. To run unsupervised learning algorithms that don't have a target, specify the
number of label columns in the content type, for example 'content_type=text/csv;label_size=0'.
For a notebook example that uses CSV format, see Breast Cancer Prediction. For more information, see
Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker built-in algorithms.
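
As a sketch of the label_size attribute (the S3 prefix is a placeholder), the content type can be attached to a channel with the SageMaker Python SDK as follows:

from sagemaker.inputs import TrainingInput

# A sketch for an unsupervised algorithm such as k-means: the content type
# declares that the CSV has no label column. The S3 prefix is a placeholder.
train_input = TrainingInput(
    "s3://my-bucket/kmeans/train/",
    content_type="text/csv;label_size=0",
)

# A previously configured estimator would then consume the channel:
# estimator.fit({"train": train_input})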

Using RecordIO Format

In the protobuf recordIO format, SageMaker converts each observation in the dataset into a binary
representation as a set of 4-byte floats, then loads it in the protobuf values field. If you are using Python
for your data preparation, we strongly recommend that you use these existing transformations. However,
if you are using another language, the protobuf definition file below provides the schema that you use to
convert your data into SageMaker protobuf format.
Note
For an example that shows how to convert the commonly used numPy array into the protobuf
recordIO format, see An Introduction to Factorization Machines with MNIST .

syntax = "proto2";

package aialgs.data;

option java_package = "com.amazonaws.aialgorithms.proto";
option java_outer_classname = "RecordProtos";

// A sparse or dense rank-R tensor that stores data as floats (float32).
message Float32Tensor {
    // Each value in the vector. If keys is empty, this is treated as a
    // dense vector.
    repeated float values = 1 [packed = true];

    // If keys is not empty, the vector is treated as sparse, with
    // each key specifying the location of the value in the sparse vector.
    repeated uint64 keys = 2 [packed = true];

    // An optional shape that allows the vector to represent a matrix.
    // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
    // and keys[i] % 20 gives the column.
    // This also supports n-dimensional tensors.
    // Note: If the tensor is sparse, you must specify this value.
    repeated uint64 shape = 3 [packed = true];
}

// A sparse or dense rank-R tensor that stores data as doubles (float64).
message Float64Tensor {
    // Each value in the vector. If keys is empty, this is treated as a
    // dense vector.
    repeated double values = 1 [packed = true];

    // If keys is not empty, the vector is treated as sparse, with
    // each key specifying the location of the value in the sparse vector.
    repeated uint64 keys = 2 [packed = true];

    // An optional shape that allows the vector to represent a matrix.
    // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
    // and keys[i] % 20 gives the column.
    // This also supports n-dimensional tensors.
    // Note: If the tensor is sparse, you must specify this value.
    repeated uint64 shape = 3 [packed = true];
}

// A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
message Int32Tensor {
    // Each value in the vector. If keys is empty, this is treated as a
    // dense vector.
    repeated int32 values = 1 [packed = true];

    // If keys is not empty, the vector is treated as sparse, with
    // each key specifying the location of the value in the sparse vector.
    repeated uint64 keys = 2 [packed = true];

    // An optional shape that allows the vector to represent a matrix.
    // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
    // and keys[i] % 20 gives the column.
    // This also supports n-dimensional tensors.
    // Note: If the tensor is sparse, you must specify this value.
    repeated uint64 shape = 3 [packed = true];
}

// Support for storing binary data for parsing in other ways (such as JPEG/etc).
// This is an example of another type of value and may not immediately be supported.
message Bytes {
    repeated bytes value = 1;

    // If the content type of the data is known, stores it.
    // This allows for the possibility of using decoders for common formats
    // in the future.
    optional string content_type = 2;
}

message Value {
    oneof value {
        // The numbering assumes the possible use of:
        // - float16, float128
        // - int8, int16, int32
        Float32Tensor float32_tensor = 2;
        Float64Tensor float64_tensor = 3;
        Int32Tensor int32_tensor = 7;
        Bytes bytes = 9;
    }
}

message Record {
    // Map from the name of the feature to the value.
    //
    // For vectors and libsvm-like datasets,
    // a single feature with the name `values`
    // should be specified.
    map<string, Value> features = 1;

    // An optional set of labels for this record.
    // Similar to the features field above, the key used for
    // generic scalar / vector labels should be 'values'.
    map<string, Value> label = 2;

    // A unique identifier for this record in the dataset.
    //
    // Whilst not necessary, this allows better
    // debugging where there are data issues.
    //
    // This is not used by the algorithm directly.
    optional string uid = 3;

    // Textual metadata describing the record.
    //
    // This may include JSON-serialized information
    // about the source of the record.
    //
    // This is not used by the algorithm directly.
    optional string metadata = 4;

    // An optional serialized JSON object that allows per-record
    // hyper-parameters/configuration/other information to be set.
    //
    // The meaning/interpretation of this field is defined by
    // the algorithm author and may not be supported.
    //
    // This is used to pass additional inference configuration
    // when batch inference is used (e.g. types of scores to return).
    optional string configuration = 5;
}

After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker can access,
and pass that location as part of InputDataConfig in the create_training_job call.
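
For the common case of converting a NumPy array, the following sketch uses the write_numpy_to_dense_tensor helper from the SageMaker Python SDK and uploads the result to S3. The feature values, bucket, and key are placeholders.

import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

# A sketch: serialize a small dense matrix plus labels into recordIO-protobuf.
# The values, bucket, and key below are placeholders.
features = np.array([[1.5, 16.0, 14.0, 23.0]], dtype="float32")
labels = np.array([1.0], dtype="float32")

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

boto3.client("s3").upload_fileobj(buf, "my-bucket", "train/data.rio")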
Note
For all Amazon SageMaker algorithms, the ChannelName in InputDataConfig must be set to
train. Some algorithms also support validation or test input channels. These are typically
used to evaluate the model's performance by using a hold-out dataset. Hold-out datasets are
not used in the initial training but can be used to further tune the model.
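
As an illustration, the channel list passed to create_training_job might look like the following sketch. The S3 URIs are placeholders, and the validation channel applies only to algorithms that support it.

# A sketch of an InputDataConfig for create_training_job. The "train" channel
# name is required; "validation" is only valid where the algorithm supports it.
input_data_config = [
    {
        "ChannelName": "train",
        "ContentType": "application/x-recordio-protobuf",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    },
    {
        "ChannelName": "validation",
        "ContentType": "application/x-recordio-protobuf",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/validation/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    },
]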

Trained Model Deserialization


Amazon SageMaker models are stored as model.tar.gz in the S3 bucket specified in the OutputDataConfig
S3OutputPath parameter of the create_training_job call. The S3 bucket must be in the same
AWS Region as the notebook instance. You can specify most of these model artifacts when creating a
hosting model. You can also open and review them in your notebook instance. When model.tar.gz is
untarred, it contains model_algo-1, which is a serialized Apache MXNet object. For example, you can use
the following to load the k-means model into memory and view it:

import mxnet as mx
print(mx.ndarray.load('model_algo-1'))
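
Before loading, the artifact has to be fetched and extracted. A minimal sketch, assuming boto3 and placeholder bucket and key names (the key depends on your training job's output path):

import tarfile
import boto3

# A sketch: download the trained model artifact from the training job's
# S3OutputPath and extract it. Bucket and key are placeholders.
boto3.client("s3").download_file(
    "my-bucket", "output/kmeans-job/output/model.tar.gz", "model.tar.gz"
)
with tarfile.open("model.tar.gz", "r:gz") as t:
    t.extractall()  # for k-means, this produces model_algo-1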

Common Data Formats for Inference


Amazon SageMaker algorithms accept and produce several different MIME types for the HTTP payloads
used in retrieving online and mini-batch predictions. You can use various AWS services to transform
or preprocess records prior to running inference. At a minimum, you need to convert the data for the
following:

• Inference request serialization (handled by you)
• Inference request deserialization (handled by the algorithm)
• Inference response serialization (handled by the algorithm)
• Inference response deserialization (handled by you)

Topics
• Convert Data for Inference Request Serialization (p. 1294)
• Convert Data for Inference Response Deserialization (p. 1295)
• Common Request Formats for All Algorithms (p. 1296)
• Use Batch Transform with Built-in Algorithms (p. 1297)

Convert Data for Inference Request Serialization

Content type options for Amazon SageMaker algorithm inference requests include: text/csv,
application/json, and application/x-recordio-protobuf. Not every algorithm supports all of
these types, and some algorithms support additional types. XGBoost, for example, supports only
text/csv from this list, but also supports text/libsvm.

For text/csv, the value for the Body argument to invoke_endpoint should be a string with commas
separating the values for each feature. For example, a record for a model with four features might look
like 1.5,16.0,14,23.0. Any transformations performed on the training data should also be performed
on the data before obtaining inference. The order of the features matters and must remain unchanged.
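
For example, a single CSV record can be sent to a deployed endpoint with the boto3 runtime client. This is a sketch; the endpoint name is a placeholder, and the feature order must match the training data.

import boto3

# A sketch: send one CSV record to a deployed endpoint.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",   # placeholder endpoint name
    ContentType="text/csv",
    Body="1.5,16.0,14,23.0",
)
print(response["Body"].read().decode("utf-8"))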

application/json is significantly more flexible and provides multiple possible formats for developers
to use in their applications. At a high level, in JavaScript, the payload might look like the following:

let request = {
// Instances might contain multiple rows that predictions are sought for.
"instances": [
{
// Request and algorithm specific inference parameters.
"configuration": {},
// Data in the specific format required by the algorithm.
"data": {
"<field name>": dataElement
}
}
]
}

You have the following options for specifying the dataElement:

Protocol buffers equivalent

// Has the same format as the protocol buffers implementation described for training.
let dataElement = {
"keys": [],
"values": [],
"shape": []
}

Simple numeric vector

// An array containing numeric values is treated as an instance containing a
// single dense vector.
let dataElement = [1.5, 16.0, 14.0, 23.0]

// It will be converted to the following representation by the SDK.
let converted = {
    "features": {
        "values": dataElement
    }
}

For multiple records

let request = {
"instances": [
// First instance.
{
"features": [ 1.5, 16.0, 14.0, 23.0 ]
},
// Second instance.
{
"features": [ -2.0, 100.2, 15.2, 9.2 ]
}
]
}
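
In Python, the same multi-record payload can be built as a dictionary and sent with the application/json content type. A sketch, assuming boto3 and a placeholder endpoint name:

import json
import boto3

# A sketch: the multi-record JSON payload above, built in Python.
payload = {
    "instances": [
        {"features": [1.5, 16.0, 14.0, 23.0]},
        {"features": [-2.0, 100.2, 15.2, 9.2]},
    ]
}

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",   # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
predictions = json.loads(response["Body"].read())["predictions"]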

Convert Data for Inference Response Deserialization

Amazon SageMaker algorithms return JSON in several layouts. At a high level, the structure is:

let response = {
"predictions": [{
// Fields in the response object are defined on a per algorithm-basis.
}]
}

The fields that are included in predictions differ across algorithms. The following are examples of output
for the k-means algorithm.

Single-record inference

let response = {
"predictions": [{
"closest_cluster": 5,
"distance_to_cluster": 36.5
}]
}

Multi-record inference

let response = {
"predictions": [
// First instance prediction.
{
"closest_cluster": 5,
"distance_to_cluster": 36.5
},
// Second instance prediction.
{
"closest_cluster": 2,
"distance_to_cluster": 90.3
}
]
}

Multi-record inference with protobuf input


{
    "features": [],
    "label": {
        "closest_cluster": {
            "values": [ 5.0 ]    // for example, the record is assigned to cluster 5
        },
        "distance_to_cluster": {
            "values": [ 36.5 ]
        }
    },
    "uid": "abc123",
    "metadata": "{ \"created_at\": \"2017-06-03\" }"
}

SageMaker algorithms also support the JSONLINES format, where the per-record response content
is the same as that in JSON format. The multi-record structure is a concatenation of per-record response
objects separated by newline characters. The response content for the built-in KMeans algorithm for 2
input data points is:

{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}
{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}

While running batch transform, we recommend using the JSON Lines response type by setting the
Accept field in the CreateTransformJobRequest to application/jsonlines.
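
A response in this format can be parsed line by line. The following sketch uses the two KMeans records shown above as a literal body:

import json

# A sketch: parse a JSON Lines response body, one prediction object per line.
body = (
    '{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}\n'
    '{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}'
)
records = [json.loads(line) for line in body.splitlines() if line]
print(records[0]["closest_cluster"])  # 0.0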

Common Request Formats for All Algorithms


Most algorithms use several of the following inference request formats.

JSON Request Format


Content type: application/json

Dense format

let request = {
"instances": [
{
"features": [1.5, 16.0, 14.0, 23.0]
}
]
}

let request = {
"instances": [
{
"data": {
"features": {
"values": [ 1.5, 16.0, 14.0, 23.0]
}
}
}
]
}

Sparse format

{
    "instances": [
        {"data": {"features": {
            "keys": [26, 182, 232, 243, 431],
            "shape": [2000],
            "values": [1, 1, 1, 4, 1]
        }}},
        {"data": {"features": {
            "keys": [0, 182, 232, 243, 431],
            "shape": [2000],
            "values": [13, 1, 1, 4, 1]
        }}}
    ]
}

JSONLINES Request Format


Content type: application/jsonlines

Dense format

A single record in dense format can be represented as either:

{ "features": [1.5, 16.0, 14.0, 23.0] }

or:

{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } }

Sparse Format

A single record in sparse format is represented as:

{"data": {"features": { "keys": [26, 182, 232, 243, 431], "shape": [2000], "values": [1, 1,
1, 4, 1] } } }

Multiple records are represented as a concatenation of the above single-record representations,
separated by newline characters:

{"data": {"features": { "keys": [0, 1, 3], "shape": [4], "values": [1, 4, 1] } } }
{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } } }
{ "features": [1.5, 16.0, 14.0, 23.0] }

CSV Request Format


Content type: text/csv; label_size=0
Note
CSV support is not available for factorization machines.

RECORDIO Request Format


Content type: application/x-recordio-protobuf

Use Batch Transform with Built-in Algorithms


While running batch transform, we recommend using the JSONLINES response type instead of JSON,
if supported by the algorithm. To do this, set the Accept field in the
CreateTransformJobRequest to application/jsonlines.


When you create a transform job, the SplitType must be set according to the ContentType of
the input data. Similarly, depending on the Accept field in the CreateTransformJobRequest,
AssembleWith must be set accordingly. Use the following tables to help set these fields
appropriately:

ContentType and recommended SplitType:

• application/x-recordio-protobuf: RecordIO
• text/csv: Line
• application/jsonlines: Line
• application/json: None
• application/x-image: None
• image/*: None

Accept and recommended AssembleWith:

• application/x-recordio-protobuf: None
• application/json: None
• application/jsonlines: Line
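
For instance, a CSV input paired with a JSON Lines response gives the following boto3 sketch; the job name, model name, S3 paths, and instance choice are all placeholders.

import boto3

# A sketch of a CreateTransformJob request: SplitType matches the input
# ContentType and AssembleWith matches the Accept type, per the tables above.
sm = boto3.client("sagemaker")
sm.create_transform_job(
    TransformJobName="my-transform-job",     # placeholder
    ModelName="my-model",                     # placeholder
    TransformInput={
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://my-bucket/batch-in/"}
        },
        "ContentType": "text/csv",
        "SplitType": "Line",
    },
    TransformOutput={
        "S3OutputPath": "s3://my-bucket/batch-out/",
        "Accept": "application/jsonlines",
        "AssembleWith": "Line",
    },
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)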

For more information on response formats for specific algorithms, see the following:

• DeepAR Inference Formats (p. 1472)
• Factorization Machines Response Formats (p. 1326)
• IP Insights Inference Data Formats (p. 1483)
• K-Means Response Formats (p. 1492)
• k-NN Request and Response Formats (p. 1333)
• Linear learner response formats (p. 1360)
• NTM Response Formats (p. 1420)
• Data Formats for Object2Vec Inference (p. 1435)
• Encoder Embeddings for Object2Vec (p. 1436)
• PCA Response Formats (p. 1496)
• RCF Response Formats (p. 1503)

Instance Types for Built-in Algorithms


For training and hosting Amazon SageMaker algorithms, we recommend using the following Amazon
EC2 instance types:

• ml.m5.xlarge, ml.m5.4xlarge, and ml.m5.12xlarge
• ml.c5.xlarge, ml.c5.2xlarge, and ml.c5.8xlarge
• ml.p3.2xlarge, ml.p3.8xlarge, and ml.p3.16xlarge

Most Amazon SageMaker algorithms have been engineered to take advantage of GPU computing for
training. For most algorithm training, we support P2, P3, G4dn, and G5 GPU instances. Despite higher
per-instance costs, GPUs train more quickly, making them more cost effective. Exceptions are noted in
this guide.

The size and type of data can have a great effect on which hardware configuration is most effective.
When the same model is trained on a recurring basis, initial testing across a spectrum of instance types
can discover configurations that are more cost-effective in the long run. Additionally, algorithms that
train most efficiently on GPUs might not require GPUs for efficient inference. Experiment to determine
the most cost-effective solution. To get an automatic instance recommendation or conduct custom
load tests, use Amazon SageMaker Inference Recommender.

For more information on SageMaker hardware specifications, see Amazon SageMaker ML Instance Types.

Logs for Built-in Algorithms


Amazon SageMaker algorithms produce Amazon CloudWatch logs, which provide detailed information
on the training process. To see the logs, in the AWS Management Console, choose CloudWatch, choose
Logs, and then choose the /aws/sagemaker/TrainingJobs log group. Each training job has one log
stream per node on which it was trained. The log stream’s name begins with the value specified in the
TrainingJobName parameter when the job was created.
Note
If a job fails and logs do not appear in CloudWatch, it's likely that an error occurred before the
start of training. Reasons include specifying the wrong training image or S3 location.

The contents of logs vary by algorithms. However, you can typically find the following information:

• Confirmation of arguments provided at the beginning of the log
• Errors that occurred during training
• Measurement of an algorithm's accuracy or numerical performance
• Timings for the algorithm and any major stages within the algorithm

Common Errors
If a training job fails, some details about the failure are provided by the FailureReason return value in
the training job description, as follows:

import boto3

# job_name is the name of the training job to inspect
sage = boto3.client('sagemaker')
sage.describe_training_job(TrainingJobName=job_name)['FailureReason']

Others are reported only in the CloudWatch logs. Common errors include the following:

1. Misspecifying a hyperparameter or specifying a hyperparameter that is invalid for the algorithm.

From the CloudWatch Log

[10/16/2017 23:45:17 ERROR 139623806805824 train.py:48]
Additional properties are not allowed (u'mini_batch_siz' was
unexpected)

2. Specifying an invalid value for a hyperparameter.

FailureReason

AlgorithmError: u'abc' is not valid under any of the given
schemas\n\nFailed validating u'oneOf' in
schema[u'properties'][u'feature_dim']:\n {u'oneOf':
[{u'pattern': u'^([1-9][0-9]*)$', u'type': u'string'},\n
{u'minimum': 1, u'type': u'integer'}]}\


FailureReason

[10/16/2017 23:57:17 ERROR 140373086025536 train.py:48] u'abc'
is not valid under any of the given schemas

3. Inaccurate protobuf file format.

From the CloudWatch log

[10/17/2017 18:01:04 ERROR 140234860816192 train.py:48] cannot
copy sequence with size 785 to array axis with dimension 784

Built-in SageMaker Algorithms for Tabular Data


Amazon SageMaker provides built-in algorithms that are tailored to the analysis of tabular data.
The built-in SageMaker algorithms for tabular data can be used for either classification or regression
problems.

• AutoGluon-Tabular (p. 1301)—an open-source AutoML framework that succeeds by ensembling
models and stacking them in multiple layers.
• CatBoost (p. 1308)—an implementation of the gradient-boosted trees algorithm that introduces
ordered boosting and an innovative algorithm for processing categorical features.
• Factorization Machines Algorithm (p. 1317)—an extension of a linear model that is designed to
economically capture interactions between features within high-dimensional sparse datasets.
• K-Nearest Neighbors (k-NN) Algorithm (p. 1327)—a non-parametric method that uses the k nearest
labeled points to assign a label to a new data point for classification or a predicted target value from
the average of the k nearest points for regression.
• LightGBM (p. 1336)—an implementation of the gradient-boosted trees algorithm that adds two novel
techniques for improved efficiency and scalability: Gradient-based One-Side Sampling (GOSS) and
Exclusive Feature Bundling (EFB).
• Linear Learner Algorithm (p. 1345)—learns a linear function for regression or a linear threshold
function for classification.
• TabTransformer (p. 1362)—a novel deep tabular data modeling architecture built on self-attention-
based Transformers.
• XGBoost Algorithm (p. 1369)—an implementation of the gradient-boosted trees algorithm that
combines an ensemble of estimates from a set of simpler and weaker models.

• AutoGluon-Tabular: channels: training and (optionally) validation; input mode: File; file type: CSV; instance class: CPU or GPU (single instance only); parallelizable: No
• CatBoost: channels: training and (optionally) validation; input mode: File; file type: CSV; instance class: CPU (single instance only); parallelizable: No
• Factorization Machines: channels: train and (optionally) test; input mode: File or Pipe; file type: recordIO-protobuf; instance class: CPU (GPU for dense data); parallelizable: Yes
• K-Nearest-Neighbors (k-NN): channels: train and (optionally) test; input mode: File or Pipe; file type: recordIO-protobuf or CSV; instance class: CPU or GPU (single GPU device on one or more instances); parallelizable: Yes
• LightGBM: channels: training and (optionally) validation; input mode: File; file type: CSV; instance class: CPU (single instance only); parallelizable: No
• Linear Learner: channels: train and (optionally) validation, test, or both; input mode: File or Pipe; file type: recordIO-protobuf or CSV; instance class: CPU or GPU; parallelizable: Yes
• TabTransformer: channels: training and (optionally) validation; input mode: File; file type: CSV; instance class: CPU or GPU (single instance only); parallelizable: No
• XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2): channels: train and (optionally) validation; input mode: File or Pipe; file type: CSV, LibSVM, or Parquet; instance class: CPU (or GPU for 1.2-1); parallelizable: Yes

AutoGluon-Tabular
AutoGluon-Tabular is a popular open-source AutoML framework that trains highly accurate machine
learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily
focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple
models and stacking them in multiple layers.

How to use SageMaker AutoGluon-Tabular

You can use AutoGluon-Tabular as an Amazon SageMaker built-in algorithm. The following section
describes how to use AutoGluon-Tabular with the SageMaker Python SDK. For information on how to use
AutoGluon-Tabular from the Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).

• Use AutoGluon-Tabular as a built-in algorithm

Use the AutoGluon-Tabular built-in algorithm to build an AutoGluon-Tabular training container as
shown in the following code example. You can programmatically retrieve the AutoGluon-Tabular built-in
algorithm image URI using the SageMaker image_uris.retrieve API (or the get_image_uri API if
using Amazon SageMaker Python SDK version 2).

After specifying the AutoGluon-Tabular image URI, you can use the AutoGluon-Tabular container to
construct an estimator using the SageMaker Estimator API and initiate a training job. The AutoGluon-
Tabular built-in algorithm runs in script mode, but the training script is provided for you and there
is no need to replace it. If you have extensive experience using script mode to create a SageMaker
training job, then you can incorporate your own AutoGluon-Tabular training scripts.

import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

# Set up the session, execution role, and Region used below
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

train_model_id, train_model_version, train_scope = "autogluon-classification-ensemble", "*", "training"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)

train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/tabular_binary/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Retrieve the default hyperparameters for training the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["auto_stack"] = "True"
print(hyperparameters)

training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")

# Create a SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location
)

# Launch a SageMaker training job by passing the S3 path of the training data
tabular_estimator.fit(
    {
        "training": training_dataset_s3_path,
        "validation": validation_dataset_s3_path,
    }, logs=True, job_name=training_job_name
)

For more information about how to set up the AutoGluon-Tabular as a built-in algorithm, see the
following notebook examples. Any S3 bucket used in these examples must be in the same AWS Region
as the notebook instance used to run them.
• Tabular classification with Amazon SageMaker AutoGluon-Tabular algorithm
• Tabular regression with Amazon SageMaker AutoGluon-Tabular algorithm

Input and Output interface for the AutoGluon-Tabular algorithm

AutoGluon-Tabular operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.

The SageMaker implementation of AutoGluon-Tabular supports CSV for training and inference:

• For Training ContentType, valid inputs must be text/csv.


• For Inference ContentType, valid inputs must be text/csv.

Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record.
For CSV inference, the algorithm assumes that CSV input does not have the label column.

Input format for training data, validation data, and categorical features

Be mindful of how to format your training data for input to the AutoGluon-Tabular model. You must
provide the path to an Amazon S3 bucket that contains your training and validation data. You can also
include a list of categorical features. Use both the training and validation channels to provide your
input data. Alternatively, you can use only the training channel.

Use both the training and validation channels

You can provide your input data by way of two S3 paths, one for the training channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are
provided for the training or validation channels, the AutoGluon-Tabular algorithm concatenates
the files. The validation data is used to compute a validation score at the end of each boosting iteration.
Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a JSON
file for categorical features, your training channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.
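
As a sketch (the column indices and bucket are placeholders; the file name categorical_index.json is fixed by the algorithm), the file can be written and uploaded alongside the training data like this:

import json
import boto3

# A sketch: columns 1, 2, and 7 of the training CSV are categorical in this
# hypothetical dataset (column 0 is the target, so feature indices start at 1).
with open("categorical_index.json", "w") as f:
    json.dump({"cat_index_list": [1, 2, 7]}, f)

# Upload next to the training CSV files; bucket and prefix are placeholders.
boto3.client("s3").upload_file(
    "categorical_index.json", "my-bucket", "train-prefix/categorical_index.json"
)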


Use only the training channel:

You can alternatively provide your input data by way of a single S3 path for the training channel. This
S3 path should point to a directory with a subdirectory named training/ that contains one or more
CSV files. You can optionally include another subdirectory in the same location called validation/ that
also has one or more CSV files. If the validation data is not provided, then 20% of your training data is
randomly sampled to serve as the validation data. If your predictors include categorical features, you can
provide a JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.

SageMaker AutoGluon-Tabular uses the autogluon.tabular.TabularPredictor module to serialize
or deserialize the model, which can be used for saving or loading the model.

To use a model trained with SageMaker AutoGluon-Tabular with the AutoGluon framework

• Use the following Python code:

import tarfile
from autogluon.tabular import TabularPredictor

t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()

# model_file_path is the directory that contains the extracted model artifacts
model = TabularPredictor.load(model_file_path)

# Prediction with test data
# dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
pred = model.predict(dtest)

Amazon EC2 instance recommendation for the AutoGluon-Tabular algorithm


SageMaker AutoGluon-Tabular supports single-instance CPU and single-instance GPU training. Despite
higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage
of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker
AutoGluon-Tabular currently does not support multi-GPU training.

AutoGluon-Tabular sample notebooks


The following table outlines a variety of sample notebooks that address different use cases of Amazon
SageMaker AutoGluon-Tabular algorithm.

• Tabular classification with Amazon SageMaker AutoGluon-Tabular algorithm: This notebook demonstrates the use of the Amazon SageMaker AutoGluon-Tabular algorithm to train and host a tabular classification model.
• Tabular regression with Amazon SageMaker AutoGluon-Tabular algorithm: This notebook demonstrates the use of the Amazon SageMaker AutoGluon-Tabular algorithm to train and host a tabular regression model.

For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.

How AutoGluon-Tabular works

AutoGluon-Tabular performs advanced data processing, deep learning, and multi-layer model ensemble
methods. It automatically recognizes the data type in each column for robust data preprocessing,
including special handling of text fields.

AutoGluon fits various models ranging from off-the-shelf boosted trees to customized neural networks.
These models are ensembled in a novel way: models are stacked in multiple layers and trained in a layer-
wise manner that guarantees raw data can be translated into high-quality predictions within a given time
constraint. This process mitigates overfitting by splitting the data in various ways with careful tracking of
out-of-fold examples.

The AutoGluon-Tabular algorithm performs well in machine learning competitions because of its robust
handling of a variety of data types, relationships, and distributions. You can use AutoGluon-Tabular for
regression, classification (binary and multiclass), and ranking problems.

Refer to the following diagram illustrating how the multi-layer stacking strategy works.

For more information, see AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.

AutoGluon-Tabular hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker AutoGluon-Tabular algorithm. Users set these parameters to facilitate
the estimation of model parameters from data. The SageMaker AutoGluon-Tabular algorithm is an
implementation of the open-source AutoGluon-Tabular package.
Note
The default hyperparameters are based on example datasets in the AutoGluon-Tabular sample
notebooks (p. 1304).


By default, the SageMaker AutoGluon-Tabular algorithm automatically chooses an evaluation metric
based on the type of classification problem. The algorithm detects the type of classification problem
based on the number of labels in your data. For regression problems, the evaluation metric is root
mean squared error. For binary classification problems, the evaluation metric is area under the receiver
operating characteristic curve (AUC). For multiclass classification problems, the evaluation metric is
accuracy. You can use the eval_metric hyperparameter to change the default evaluation metric.
Refer to the following table for more information on AutoGluon-Tabular hyperparameters, including
descriptions, valid values, and default values.

eval_metric
    The evaluation metric for validation data. If eval_metric is set to the default "auto" value,
    then the algorithm automatically chooses an evaluation metric based on the type of classification
    problem:
    • "root_mean_squared_error" for regression
    • "roc_auc" for binary classification
    • "accuracy" for multi-class classification
    Valid values: string; refer to the AutoGluon documentation for valid values.
    Default value: "auto".

presets
    List of preset configurations for various arguments in fit().
    • "best_quality": high predictive accuracy, slower inference times and higher disk usage
    • "high_quality": high predictive accuracy and fast inference
    • "good_quality": good predictive accuracy and very fast inference
    • "medium_quality": medium predictive accuracy, very fast inference and training time
    • "optimize_for_deployment": delete unused models and remove training artifacts
    • "interpretable": fits only interpretable rule-based models from the imodels package
    For more details, see AutoGluon Predictors.
    Valid values: string, any of the following: "best_quality", "high_quality",
    "good_quality", "medium_quality", "optimize_for_deployment", or "interpretable".
    Default value: "medium_quality".

auto_stack
    Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost
    predictive accuracy. Set auto_stack to "True" if you are willing to tolerate longer training
    times in order to maximize predictive accuracy. This automatically sets the num_bag_folds and
    num_stack_levels arguments based on dataset properties.
    Valid values: string, "True" or "False".
    Default value: "False".

num_bag_folds
    Number of folds used for bagging of models. When num_bag_folds is equal to k, training time
    is roughly increased by a factor of k. Set num_bag_folds to 0 to deactivate bagging. Bagging is
    disabled by default, but we recommend using values between 5 and 10 to maximize predictive
    performance. Increasing num_bag_folds results in models with lower bias, but that are more
    prone to overfitting. A value of 1 is invalid for this parameter and raises a ValueError. Values
    greater than 10 may produce diminishing returns and can even harm overall results due to
    overfitting. To further improve predictions, avoid increasing num_bag_folds and instead increase
    num_bag_sets.
    Valid values: string, any integer between (and including) "0" and "10".
    Default value: "0".

num_bag_sets
    Number of repeats of kfold bagging to perform (values must be greater than or equal to 1). The
    total number of models trained during bagging is equal to num_bag_folds * num_bag_sets.
    This parameter defaults to 1 if time_limit is not specified. This parameter is disabled if
    num_bag_folds is not specified. Values greater than 1 result in superior predictive performance,
    especially on smaller problems and with stacking enabled.
    Valid values: integer, range: [1, 20].
    Default value: 1.

num_stack_levels
    Number of stacking levels to use in the stack ensemble. Roughly increases model training time by
    a factor of num_stack_levels + 1. Set this parameter to 0 to deactivate stack ensembling. This
    parameter is deactivated by default, but we recommend using values between 1 and 3 to maximize
    predictive performance. To prevent overfitting and a ValueError, num_bag_folds must be greater
    than or equal to 2.
    Valid values: float, range: [0, 3].
    Default value: 0.

refit_full
    Whether or not to retrain all models on all of the data (training and validation) after the normal
    training procedure. For more details, see AutoGluon Predictors.
    Valid values: string, "True" or "False".
    Default value: "False".

set_best_to_refit_full
    Whether or not to change the default model that the predictor uses for prediction. If
    set_best_to_refit_full is set to "True", the default model changes to the model that
    exhibited the highest validation score as a result of refitting (activated by refit_full). Only
    valid if refit_full is set.
    Valid values: string, "True" or "False".
    Default value: "False".

save_space
    Whether or not to reduce the memory and disk size of the predictor by deleting auxiliary model
    files that aren't needed for prediction on new data. This has no impact on inference accuracy. We
    recommend setting save_space to "True" if your only goal is to use the trained model for
    prediction. Certain advanced functionality may no longer be available if save_space is set to
    "True". Refer to the predictor.save_space() documentation for more details.
    Valid values: string, "True" or "False".
    Default value: "False".

verbosity
    The verbosity of print messages. Verbosity levels range from 0 to 4, with higher levels
    corresponding to more detailed print statements. A verbosity of 0 suppresses warnings.
    Valid values: integer, any of the following: 0, 1, 2, 3, or 4.
    Default value: 2.

Tuning an AutoGluon-Tabular model

Although AutoGluon-Tabular can be used with model tuning, its design can deliver good performance
using stacking and ensemble methods, meaning hyperparameter optimization is not necessary. Rather
than focusing on model tuning, AutoGluon-Tabular succeeds by stacking models in multiple layers and
training in a layer-wise manner.

For more information about AutoGluon-Tabular hyperparameters, see AutoGluon-Tabular
hyperparameters (p. 1305).

CatBoost
CatBoost is a popular and high-performance open-source implementation of the Gradient Boosting
Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately
predict a target variable by combining an ensemble of estimates from a set of simpler and weaker
models.

CatBoost introduces two critical algorithmic advances to GBDT:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm
2. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage
present in all currently existing implementations of gradient boosting algorithms.


How to use SageMaker CatBoost

You can use CatBoost as an Amazon SageMaker built-in algorithm. The following section describes how
to use CatBoost with the SageMaker Python SDK. For information on how to use CatBoost from the
Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).

• Use CatBoost as a built-in algorithm

Use the CatBoost built-in algorithm to build a CatBoost training container as shown in the following
code example. You can programmatically retrieve the CatBoost built-in algorithm image URI using the
SageMaker image_uris.retrieve API (or the get_image_uri API if using Amazon SageMaker
Python SDK version 2).

After specifying the CatBoost image URI, you can use the CatBoost container to construct an estimator
using the SageMaker Estimator API and initiate a training job. The CatBoost built-in algorithm runs in
script mode, but the training script is provided for you and there is no need to replace it. If you have
extensive experience using script mode to create a SageMaker training job, then you can incorporate
your own CatBoost training scripts.

import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

# Set up the session, execution role, and Region used below
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

train_model_id, train_model_version, train_scope = "catboost-classification-model", "*", "training"
training_instance_type = "ml.m5.xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)

train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/tabular_multiclass/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Retrieve the default hyperparameters for training the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["iterations"] = "500"
print(hyperparameters)

training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")

# Create a SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location
)

# Launch a SageMaker training job by passing the S3 path of the training data
tabular_estimator.fit(
    {
        "training": training_dataset_s3_path,
        "validation": validation_dataset_s3_path,
    }, logs=True, job_name=training_job_name
)

For more information about how to set up CatBoost as a built-in algorithm, see the following
notebook examples.
• Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm
• Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm

Input and Output interface for the CatBoost algorithm

Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.

The SageMaker implementation of CatBoost supports CSV for training and inference:

• For Training ContentType, valid inputs must be text/csv.


• For Inference ContentType, valid inputs must be text/csv.

Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record.
For CSV inference, the algorithm assumes that CSV input does not have the label column.

Input format for training data, validation data, and categorical features

Be mindful of how to format your training data for input to the CatBoost model. You must provide the
path to an Amazon S3 bucket that contains your training and validation data. You can also include a list
of categorical features. Use both the training and validation channels to provide your input data.
Alternatively, you can use only the training channel.


Use both the training and validation channels

You can provide your input data by way of two S3 paths, one for the training channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are
provided for the training or validation channels, the CatBoost algorithm concatenates the files.
The validation data is used to compute a validation score at the end of each boosting iteration. Early
stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a JSON
file for categorical features, your training channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.

Use only the training channel:

You can alternatively provide your input data by way of a single S3 path for the training channel. This
S3 path should point to a directory with a subdirectory named training/ that contains one or more
CSV files. You can optionally include another subdirectory in the same location called validation/ that
also has one or more CSV files. If the validation data is not provided, then 20% of your training data is
randomly sampled to serve as the validation data. If your predictors include categorical features, you can
provide a JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.

SageMaker CatBoost uses the catboost.CatBoostClassifier and
catboost.CatBoostRegressor modules to serialize or deserialize the model, which can be used for
saving or loading the model.

To use a model trained with SageMaker CatBoost with catboost

• Use the following Python code:

import os
import tarfile

from catboost import CatBoostClassifier

t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()

# model_file_path is the directory that contains the extracted model artifacts
file_path = os.path.join(model_file_path, "model")

model = CatBoostClassifier()
model.load_model(file_path)

# Prediction with test data
# dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
pred = model.predict(dtest)


Amazon EC2 instance recommendation for the CatBoost algorithm

SageMaker CatBoost currently only trains using CPUs. CatBoost is a memory-bound (as opposed to
compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice
than a compute-optimized instance (for example, C5). Further, we recommend that you have enough
total memory in selected instances to hold the training data.

CatBoost sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of Amazon
SageMaker CatBoost algorithm.

• Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm: This notebook demonstrates the use of the Amazon SageMaker CatBoost algorithm to train and host a tabular classification model.
• Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm: This notebook demonstrates the use of the Amazon SageMaker CatBoost algorithm to train and host a tabular regression model.

For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.

How CatBoost Works

CatBoost implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition
of two critical algorithmic advances:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm
2. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage
present in all currently existing implementations of gradient boosting algorithms.

The CatBoost algorithm performs well in machine learning competitions because of its robust handling
of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you
can fine-tune. You can use CatBoost for regression, classification (binary and multiclass), and ranking
problems.

For more information on gradient boosting, see How XGBoost Works (p. 1376). For in-depth details
about the ordered boosting and categorical feature processing techniques used in the CatBoost method,
see CatBoost: unbiased boosting with categorical features.

CatBoost hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker CatBoost algorithm. Users set these parameters to facilitate the estimation
of model parameters from data. The SageMaker CatBoost algorithm is an implementation of the open-
source CatBoost package.
Note
The default hyperparameters are based on example datasets in the CatBoost sample
notebooks (p. 1312).


By default, the SageMaker CatBoost algorithm automatically chooses an evaluation metric and loss
function based on the type of classification problem. The CatBoost algorithm detects the type of
classification problem based on the number of labels in your data. For regression problems, the
evaluation metric and loss function are both root mean squared error. For binary classification
problems, the evaluation metric is Area Under the Curve (AUC) and the loss function is log loss. For
multiclass classification problems, the evaluation metric and loss function are multiclass cross entropy.
You can use the eval_metric hyperparameter to change the default evaluation metric. Refer to the
following table for more information on CatBoost hyperparameters, including descriptions, valid values,
and default values.

Parameter Name: Description

iterations
    The maximum number of trees that can be built.
    Valid values: integer, range: positive integer.
    Default value: 500.

early_stopping_rounds
    Training stops if one metric of one validation data point does not improve in the last
    early_stopping_rounds rounds. If early_stopping_rounds is less than or equal to zero, this
    hyperparameter is ignored.
    Valid values: integer.
    Default value: 5.

eval_metric
    The evaluation metric for validation data. If eval_metric is set to the default "auto" value,
    then the algorithm automatically chooses an evaluation metric based on the type of classification
    problem:
    • "RMSE" for regression
    • "AUC" for binary classification
    • "MultiClass" for multiclass classification
    Valid values: string; refer to the CatBoost documentation for valid values.
    Default value: "auto".

learning_rate
    The rate at which the model weights are updated after working through each batch of training
    examples.
    Valid values: float, range: (0.0, 1.0).
    Default value: 0.009.

depth
    Depth of the tree.
    Valid values: integer, range: (1, 16).
    Default value: 6.

l2_leaf_reg
    Coefficient for the L2 regularization term of the cost function.
    Valid values: integer, range: positive integer.
    Default value: 3.

random_strength
    The amount of randomness to use for scoring splits when the tree structure is selected. Use this
    parameter to avoid overfitting the model.
    Valid values: float, range: positive float.
    Default value: 1.0.

max_leaves
    The maximum number of leaves in the resulting tree. Can only be used with the "Lossguide"
    growing policy.
    Valid values: integer, range: [2, 64].
    Default value: 31.

rsm
    Random subspace method. The percentage of features to use at each split selection, when features
    are selected over again at random.
    Valid values: float, range: (0.0, 1.0].
    Default value: 1.0.

sampling_frequency
    Frequency to sample weights and objects when building trees.
    Valid values: string, either "PerTreeLevel" or "PerTree".
    Default value: "PerTreeLevel".

min_data_in_leaf
    The minimum number of training samples in a leaf. CatBoost does not search for new splits in
    leaves with a sample count less than the specified value. Can only be used with the "Lossguide"
    and "Depthwise" growing policies.
    Valid values: integer, range: [1, ∞).
    Default value: 1.

bagging_temperature
    Defines the settings of the Bayesian bootstrap. Use the Bayesian bootstrap to assign random
    weights to objects. If bagging_temperature is set to 1.0, then the weights are sampled from an
    exponential distribution. If bagging_temperature is set to 0.0, then all weights are 1.0.
    Valid values: float, range: non-negative float.
    Default value: 1.0.

boosting_type
    The boosting scheme. "Auto" means that boosting_type is selected based on the processing unit
    type, the number of objects in the training dataset, and the selected learning mode.
    Valid values: string, any of the following: "Auto", "Ordered", or "Plain".
    Default value: "Auto".

scale_pos_weight
    The weight for the positive class in binary classification. The value is used as a multiplier for
    the weights of objects from the positive class.
    Valid values: float, range: positive float.
    Default value: 1.0.

max_bin
    The number of splits for numerical features. "Auto" means that max_bin is selected based on the
    processing unit type and other parameters. For details, see the CatBoost documentation.
    Valid values: string, either "Auto" or a string of an integer from "1" to "65535" inclusive.
    Default value: "Auto".

grow_policy
    The tree growing policy. Defines how to perform greedy tree construction.
    Valid values: string, any of the following: "SymmetricTree", "Depthwise", or "Lossguide".
    Default value: "SymmetricTree".

random_seed
    The random seed used for training.
    Valid values: integer, range: non-negative integer.
    Default value: 1.

thread_count
    The number of threads to use during training. If thread_count is -1, then the number of threads
    is equal to the number of processor cores. thread_count cannot be 0.
    Valid values: integer, either -1 or a positive integer.
    Default value: -1.

verbose
    The verbosity of print messages, with higher levels corresponding to more detailed print
    statements.
    Valid values: integer, range: positive integer.
    Default value: 1.
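
As with the other JumpStart-based built-in algorithms, you can retrieve these defaults programmatically
and override individual values before training. The following is a minimal sketch; the model ID
"catboost-classification-model" is an assumption by analogy with the LightGBM model ID shown later in
this guide, so verify the exact identifier in the JumpStart model catalog for your Region.

from sagemaker import hyperparameters

# Model ID is assumed by analogy with "lightgbm-classification-model";
# verify the exact identifier in the JumpStart model catalog.
model_id, model_version = "catboost-classification-model", "*"

# Retrieve the default hyperparameters, then override selected values
hps = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
hps["iterations"] = "500"   # maximum number of trees
hps["eval_metric"] = "F1"   # replace the default "auto" evaluation metric
print(hps)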

Tune a CatBoost model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. Model
tuning focuses on the following hyperparameters:

• A learning loss function to optimize during model training
• An evaluation metric that is used to evaluate model performance during validation
• A set of hyperparameters, and a range of values for each, to use when tuning the model automatically

Note
The learning loss function is automatically assigned based on the type of classification
task, which is determined by the number of unique integers in the label column. For more
information, see CatBoost hyperparameters (p. 1312).

Automatic model tuning searches your chosen hyperparameters to find the combination of values that
results in a model that optimizes the chosen evaluation metric.
Note
Automatic model tuning for CatBoost is only available from the Amazon SageMaker SDKs, not
from the SageMaker console.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Evaluation metrics computed by the CatBoost algorithm

The SageMaker CatBoost algorithm computes the following metrics to use for model validation. The
evaluation metric is automatically assigned based on the type of classification task, which is determined
by the number of unique integers in the label column.

Every metric in the following table uses the same regex pattern: "bestTest = ([0-9\\.]+)".

Metric Name: Description (Optimization Direction)

RMSE: root mean squared error (minimize)
MAE: mean absolute error (minimize)
MedianAbsoluteError: median absolute error (minimize)
R2: R2 score (maximize)
Logloss: binary cross entropy (minimize)
Precision: precision (maximize)
Recall: recall (maximize)
F1: F1 score (maximize)
AUC: AUC score (maximize)
MultiClass: multiclass cross entropy (minimize)
Accuracy: accuracy (maximize)
BalancedAccuracy: balanced accuracy (maximize)


Tunable CatBoost hyperparameters

Tune the CatBoost model with the following hyperparameters. The hyperparameters that have
the greatest effect on optimizing the CatBoost evaluation metrics are: learning_rate, depth,
l2_leaf_reg, and random_strength. For a list of all the CatBoost hyperparameters, see CatBoost
hyperparameters (p. 1312).

Parameter Name: Parameter Type, Recommended Ranges

learning_rate: ContinuousParameterRanges, MinValue: 0.001, MaxValue: 0.01
depth: IntegerParameterRanges, MinValue: 4, MaxValue: 10
l2_leaf_reg: IntegerParameterRanges, MinValue: 2, MaxValue: 10
random_strength: ContinuousParameterRanges, MinValue: 0, MaxValue: 10
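
The following is a minimal sketch of launching such a tuning job with the SageMaker Python SDK. It
assumes that tabular_estimator is a CatBoost Estimator constructed the same way as the LightGBM
example later in this guide, and that the train and validation S3 paths are already defined; the
objective metric name and regex follow the evaluation metrics table above.

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Ranges follow the recommended values in the preceding table
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.001, 0.01),
    "depth": IntegerParameter(4, 10),
    "l2_leaf_reg": IntegerParameter(2, 10),
    "random_strength": ContinuousParameter(0, 10),
}

tuner = HyperparameterTuner(
    estimator=tabular_estimator,  # a CatBoost Estimator, built as in the LightGBM example
    objective_metric_name="AUC",  # for a binary classification problem
    metric_definitions=[{"Name": "AUC", "Regex": "bestTest = ([0-9\\.]+)"}],
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": training_dataset_s3_path, "validation": validation_dataset_s3_path})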

Factorization Machines Algorithm


The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can
use for both classification and regression tasks. It is an extension of a linear model that is designed
to capture interactions between features within high dimensional sparse datasets economically. For
example, in a click prediction system, the Factorization Machines model can capture click rate patterns
observed when ads from a certain ad-category are placed on pages from a certain page-category.
Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such
as click prediction and item recommendation.
Note
The Amazon SageMaker implementation of the Factorization Machines algorithm considers only
pair-wise (2nd order) interactions between features.

Topics
• Input/Output Interface for the Factorization Machines Algorithm (p. 1317)
• EC2 Instance Recommendation for the Factorization Machines Algorithm (p. 1318)
• Factorization Machines Sample Notebooks (p. 1318)
• How Factorization Machines Work (p. 1318)
• Factorization Machines Hyperparameters (p. 1319)
• Tune a Factorization Machines Model (p. 1324)
• Factorization Machines Response Formats (p. 1326)

Input/Output Interface for the Factorization Machines Algorithm

The Factorization Machines algorithm can be run in either binary classification mode or regression
mode. In each mode, a dataset can be provided to the test channel along with the train channel dataset.
The scoring depends on the mode used. In regression mode, the testing dataset is scored using Root
Mean Square Error (RMSE). In binary classification mode, the test dataset is scored using Binary Cross
Entropy (log loss), Accuracy (at threshold = 0.5), and F1 Score (at threshold = 0.5).

For training, the Factorization Machines algorithm currently supports only the recordIO-protobuf
format with Float32 tensors. Because their use case is predominantly on sparse data, CSV is not a good
candidate. Both File and Pipe mode training are supported for recordIO-wrapped protobuf.


For inference, the Factorization Machines algorithm supports the application/json and
x-recordio-protobuf formats.

• For the binary classification problem, the algorithm predicts a score and a label. The label is a number
and can be either 0 or 1. The score is a number that indicates how strongly the algorithm believes that
the label should be 1. The algorithm computes score first and then derives the label from the score
value. If the score is greater than or equal to 0.5, the label is 1.
• For the regression problem, just a score is returned and it is the predicted value. For example, if
Factorization Machines is used to predict a movie rating, score is the predicted rating value.

Please see Factorization Machines Sample Notebooks (p. 1318) for more details on training and
inference file formats.
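
Because training requires recordIO-protobuf with Float32 tensors, you typically serialize your (often
sparse) feature matrix before uploading it to Amazon S3. The following is a minimal sketch using the
write_spmatrix_to_sparse_tensor helper from the SageMaker Python SDK; the data, bucket, and key names
are placeholders.

import io

import boto3
import numpy as np
import scipy.sparse as sp
import sagemaker.amazon.common as smac

# Toy sparse feature matrix and binary labels (placeholders for your data)
X = sp.random(100, 50, density=0.1, format="csr", dtype=np.float32)
y = np.random.randint(0, 2, size=100).astype(np.float32)

# Serialize to recordIO-wrapped protobuf with Float32 tensors
buf = io.BytesIO()
smac.write_spmatrix_to_sparse_tensor(buf, X, y)
buf.seek(0)

# Upload the serialized data for the train channel ("my-bucket" is a placeholder)
boto3.resource("s3").Object("my-bucket", "fm/train/data.rec").upload_fileobj(buf)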

EC2 Instance Recommendation for the Factorization Machines Algorithm

The Amazon SageMaker Factorization Machines algorithm is highly scalable and can train across
distributed instances. We recommend training and inference with CPU instances for both sparse and
dense datasets. In some circumstances, training with one or more GPUs on dense data might provide
some benefit. Training with GPUs is available only on dense data. Use CPU instances for sparse data. The
Factorization Machines algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

Factorization Machines Sample Notebooks

For a sample notebook that uses the SageMaker Factorization Machines algorithm to analyze the images
of handwritten digits from zero to nine in the MNIST dataset, see An Introduction to Factorization
Machines with MNIST. For instructions on how to create and access Jupyter notebook instances that you can
use to run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you
have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all
the SageMaker samples. Example notebooks that use Factorization Machines algorithm are located in the
Introduction to Amazon algorithms section. To open a notebook, click on its Use tab and select Create
copy.

How Factorization Machines Work

The prediction task for a Factorization Machines model is to estimate a function ŷ from a feature set xi to
a target domain. This domain is real-valued for regression and binary for classification. The Factorization
Machines model is supervised and so has a training dataset (xi, yi) available. The advantages this model
presents lie in the way it uses a factorized parametrization to capture the pairwise feature interactions. It
can be represented mathematically as follows:

\hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle v_i, v_j \rangle x_i x_j

The three terms in this equation correspond respectively to the three components of the model:

• The w0 term represents the global bias.
• The wi linear terms model the strength of the i-th variable.
• The <vi, vj> factorization terms model the pairwise interaction between the i-th and j-th variables.

The global bias and linear terms are the same as in a linear model. The pairwise feature interactions
are modeled in the third term as the inner product of the corresponding factors learned for each
feature. Learned factors can also be considered as embedding vectors for each feature. For example, in
a classification task, if a pair of features tends to co-occur more often in positive labeled samples, then
the inner product of their factors would be large. In other words, their embedding vectors would be close
to each other in cosine similarity. For more information about the Factorization Machines model, see
Factorization Machines.


For regression tasks, the model is trained by minimizing the squared error between the model prediction
ŷn and the target value yn. This is known as the square loss:

L_{square} = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2

For a classification task, the model is trained by minimizing the cross entropy loss, also known as the log
loss:

L_{log} = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log \sigma(\hat{y}_n) + (1 - y_n) \log(1 - \sigma(\hat{y}_n)) \right]

where σ is the logistic (sigmoid) function:

\sigma(z) = \frac{1}{1 + e^{-z}}

For more information about loss functions for classification, see Loss functions for classification.
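
To make the factorized form concrete, the following is a small NumPy sketch of the prediction function;
the names and dimensions are illustrative, not the internal SageMaker implementation.

import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machines prediction: bias + linear terms + pairwise interactions.

    x: (d,) feature vector; w0: scalar global bias; w: (d,) linear weights;
    V: (d, k) factor matrix whose row i is the embedding v_i of feature i.
    """
    # The pairwise term is computed in O(dk) time using the identity
    # sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f ((V^T x)_f^2 - ((V**2)^T x**2)_f)
    vx = V.T @ x
    pairwise = 0.5 * np.sum(vx**2 - (V**2).T @ (x**2))
    return w0 + w @ x + pairwise

rng = np.random.default_rng(0)
d, k = 6, 3
print(fm_predict(rng.random(d), 0.1, rng.random(d), rng.random((d, k))))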

Factorization Machines Hyperparameters

The following table contains the hyperparameters for the Factorization Machines algorithm. These
are parameters that are set by users to facilitate the estimation of model parameters from data.
The required hyperparameters that must be set are listed first, in alphabetical order. The optional
hyperparameters that can be set are listed next, also in alphabetical order.

Parameter Name: Description

feature_dim
    The dimension of the input feature space. This could be very high with sparse input.
    Required.
    Valid values: positive integer. Suggested value range: [10000, 10000000].

num_factors
    The dimensionality of factorization.
    Required.
    Valid values: positive integer. Suggested value range: [2, 1000]; 64 typically generates good
    outcomes and is a good starting point.

predictor_type
    The type of predictor.
    • binary_classifier: For binary classification tasks.
    • regressor: For regression tasks.
    Required.
    Valid values: string: binary_classifier or regressor.

bias_init_method
    The initialization method for the bias term:
    • normal: Initializes weights with random values sampled from a normal distribution with a mean
      of zero and standard deviation specified by bias_init_sigma.
    • uniform: Initializes weights with random values uniformly sampled from the range
      [-bias_init_scale, +bias_init_scale].
    • constant: Initializes the weights to the scalar value specified by bias_init_value.
    Optional.
    Valid values: uniform, normal, or constant.
    Default value: normal.

bias_init_scale
    Range for initialization of the bias term. Takes effect if bias_init_method is set to uniform.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: None.

bias_init_sigma
    The standard deviation for initialization of the bias term. Takes effect if bias_init_method is
    set to normal.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.01.

bias_init_value
    The initial value of the bias term. Takes effect if bias_init_method is set to constant.
    Optional.
    Valid values: float. Suggested value range: [1e-8, 512].
    Default value: None.

bias_lr
    The learning rate for the bias term.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.1.

bias_wd
    The weight decay for the bias term.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.01.

clip_gradient
    Gradient clipping optimizer parameter. Clips the gradient by projecting onto the interval
    [-clip_gradient, +clip_gradient].
    Optional.
    Valid values: float.
    Default value: None.

epochs
    The number of training epochs to run.
    Optional.
    Valid values: positive integer.
    Default value: 1.

eps
    Epsilon parameter to avoid division by 0.
    Optional.
    Valid values: float. Suggested value: small.
    Default value: None.

factors_init_method
    The initialization method for factorization terms:
    • normal: Initializes weights with random values sampled from a normal distribution with a mean
      of zero and standard deviation specified by factors_init_sigma.
    • uniform: Initializes weights with random values uniformly sampled from the range
      [-factors_init_scale, +factors_init_scale].
    • constant: Initializes the weights to the scalar value specified by factors_init_value.
    Optional.
    Valid values: uniform, normal, or constant.
    Default value: normal.

factors_init_scale
    The range for initialization of factorization terms. Takes effect if factors_init_method is set
    to uniform.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: None.

factors_init_sigma
    The standard deviation for initialization of factorization terms. Takes effect if
    factors_init_method is set to normal.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.001.

factors_init_value
    The initial value of factorization terms. Takes effect if factors_init_method is set to constant.
    Optional.
    Valid values: float. Suggested value range: [1e-8, 512].
    Default value: None.

factors_lr
    The learning rate for factorization terms.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.0001.

factors_wd
    The weight decay for factorization terms.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.00001.

linear_lr
    The learning rate for linear terms.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.001.

linear_init_method
    The initialization method for linear terms:
    • normal: Initializes weights with random values sampled from a normal distribution with a mean
      of zero and standard deviation specified by linear_init_sigma.
    • uniform: Initializes weights with random values uniformly sampled from the range
      [-linear_init_scale, +linear_init_scale].
    • constant: Initializes the weights to the scalar value specified by linear_init_value.
    Optional.
    Valid values: uniform, normal, or constant.
    Default value: normal.

linear_init_scale
    Range for initialization of linear terms. Takes effect if linear_init_method is set to uniform.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: None.

linear_init_sigma
    The standard deviation for initialization of linear terms. Takes effect if linear_init_method is
    set to normal.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.01.

linear_init_value
    The initial value of linear terms. Takes effect if linear_init_method is set to constant.
    Optional.
    Valid values: float. Suggested value range: [1e-8, 512].
    Default value: None.

linear_wd
    The weight decay for linear terms.
    Optional.
    Valid values: non-negative float. Suggested value range: [1e-8, 512].
    Default value: 0.001.

mini_batch_size
    The size of the mini-batch used for training.
    Optional.
    Valid values: positive integer.
    Default value: 1000.

rescale_grad
    Gradient rescaling optimizer parameter. If set, multiplies the gradient by rescale_grad before
    updating. Often chosen to be 1.0/batch_size.
    Optional.
    Valid values: float.
    Default value: None.
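
The following is a minimal sketch of setting several of these hyperparameters through the first-party
FactorizationMachines estimator in the SageMaker Python SDK; the role, instance type, and data
variables are placeholders.

import sagemaker
from sagemaker import FactorizationMachines

sess = sagemaker.Session()

fm = FactorizationMachines(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",        # CPU instances are recommended for FM
    num_factors=64,                      # suggested starting point from the table above
    predictor_type="binary_classifier",
    epochs=10,
    sagemaker_session=sess,
)

# record_set() converts NumPy arrays to recordIO-protobuf and stages them in S3.
# X_train (float32 feature matrix) and y_train (float32 labels) are placeholders.
# train_records = fm.record_set(X_train, labels=y_train, channel="train")
# fm.fit(train_records, mini_batch_size=1000)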

Tune a Factorization Machines Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the Factorization Machines Algorithm

The Factorization Machines algorithm has both binary classification and regression predictor types. The
predictor type determines which metric you can use for automatic model tuning. The algorithm reports a
test:rmse regressor metric, which is computed during training. When tuning the model for regression
tasks, choose this metric as the objective.

Metric Name: Description (Optimization Direction)

test:rmse: Root Mean Square Error (Minimize)

The Factorization Machines algorithm reports three binary classification metrics, which are computed
during training. When tuning the model for binary classification tasks, choose one of these as the
objective.

Metric Name: Description (Optimization Direction)

test:binary_classification_accuracy: Accuracy (Maximize)
test:binary_classification_cross_entropy: Cross Entropy (Minimize)
test:binary_f_beta: F-beta score (Maximize)


Tunable Factorization Machines Hyperparameters

You can tune the following hyperparameters for the Factorization Machines algorithm. The initialization
parameters that contain the terms bias, linear, and factorization depend on their initialization method.
There are three initialization methods: uniform, normal, and constant. These initialization methods
are not themselves tunable. The parameters that are tunable are dependent on this choice of the
initialization method. For example, if the initialization method is uniform, then only the scale
parameters are tunable. Specifically, if bias_init_method==uniform, then bias_init_scale,
linear_init_scale, and factors_init_scale are tunable. Similarly, if the initialization method is
normal, then only sigma parameters are tunable. If the initialization method is constant, then only
value parameters are tunable. These dependencies are listed in the following table.

Parameter Name: Parameter Type, Recommended Ranges (Dependency)

bias_init_scale: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==uniform)
bias_init_sigma: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==normal)
bias_init_value: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==constant)
bias_lr: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (no dependency)
bias_wd: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (no dependency)
epochs: IntegerParameterRange, MinValue: 1, MaxValue: 1000 (no dependency)
factors_init_scale: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==uniform)
factors_init_sigma: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==normal)
factors_init_value: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==constant)
factors_lr: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (no dependency)
factors_wd: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (no dependency)
linear_init_scale: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==uniform)
linear_init_sigma: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==normal)
linear_init_value: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (bias_init_method==constant)
linear_lr: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (no dependency)
linear_wd: ContinuousParameterRange, MinValue: 1e-8, MaxValue: 512 (no dependency)
mini_batch_size: IntegerParameterRange, MinValue: 100, MaxValue: 10000 (no dependency)

Factorization Machines Response Formats

JSON Response Format

Binary classification

let response = {
"predictions": [
{
"score": 0.4,
"predicted_label": 0
}
]
}

Regression

let response = {
"predictions": [
{
"score": 0.4
}
]
}

JSONLINES Response Format

Binary classification

{"score": 0.4, "predicted_label": 0}

Regression

{"score": 0.4}

RECORDIO Response Format

Binary classification

[
Record = {
features = {},
label = {
'score': {
keys: [],
values: [0.4] # float32
},
'predicted_label': {


keys: [],
values: [0.0] # float32
}
}
}
]

Regression

[
Record = {
features = {},
label = {
'score': {
keys: [],
values: [0.4] # float32
}
}
}
]

K-Nearest Neighbors (k-NN) Algorithm


The Amazon SageMaker k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a
non-parametric method for classification or regression. For classification problems, the algorithm
queries the k points that are closest to the sample point and returns the most frequent of their class
labels as the predicted label. For regression problems, the algorithm queries the k closest points to the
sample point and returns the average of their label values as the predicted value.

Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building.
Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction,
the algorithm decreases the feature dimension of the data to reduce the footprint of the k-NN model
in memory and inference latency. We provide two methods of dimension reduction methods: random
projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for
high-dimensional (d >1000) datasets to avoid the “curse of dimensionality” that troubles the statistical
analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is
to construct the index. The index enables efficient lookups of distances between points whose values or
class labels have not yet been determined and the k nearest points to use for inference.

Topics
• Input/Output Interface for the k-NN Algorithm (p. 1327)
• k-NN Sample Notebooks (p. 1328)
• How the k-NN Algorithm Works (p. 1328)
• EC2 Instance Recommendation for the k-NN Algorithm (p. 1329)
• k-NN Hyperparameters (p. 1329)
• Tune a k-NN Model (p. 1331)
• Data Formats for k-NN Training Input (p. 1332)
• k-NN Request and Response Formats (p. 1333)

Input/Output Interface for the k-NN Algorithm

SageMaker k-NN supports train and test data channels.

• Use a train channel for data that you want to sample and construct into the k-NN index.
• Use a test channel to emit scores in log files. Scores are listed as one line per mini-batch: accuracy for
  a classifier, mean squared error (MSE) for a regressor.


For training inputs, k-NN supports text/csv and application/x-recordio-protobuf data formats. For input
type text/csv, the first label_size columns are interpreted as the label vector for that row. You can
use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf
or as CSV.

For inference inputs, k-NN supports the application/json, application/x-recordio-protobuf, and text/csv
data formats. The text/csv format accepts a label_size and an encoding parameter. It assumes a
label_size of 0 and a UTF-8 encoding.

For inference outputs, k-NN supports the application/json and application/x-recordio-protobuf data
formats. These two data formats also support a verbose output mode. In verbose output mode, the API
provides the search results with the distances vector sorted from smallest to largest, and corresponding
elements in the labels vector.

For batch transform, k-NN supports the application/jsonlines data format for both input and
output. An example input is as follows:

content-type: application/jsonlines

{"features": [1.5, 16.0, 14.0, 23.0]}


{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}

An example output is as follows:

accept: application/jsonlines

{"predicted_label": 0.0}
{"predicted_label": 2.0}

For more information on input and output file formats, see Data Formats for k-NN Training
Input (p. 1332) for training, k-NN Request and Response Formats (p. 1333) for inference, and the k-NN
Sample Notebooks (p. 1328).

k-NN Sample Notebooks


For a sample notebook that uses the SageMaker k-nearest neighbor algorithm to predict wilderness
cover types from geological and forest service data, see the K-Nearest Neighbor Covertype notebook.

Use a Jupyter notebook instance to run the example in SageMaker. To learn how to create and open a
Jupyter notebook instance in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you
have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all
the SageMaker example notebooks. Find K-Nearest Neighbor notebooks in the Introduction to Amazon
algorithms section. To open a notebook, click on its Use tab and select Create copy.

How the k-NN Algorithm Works


Step 1: Sample
To specify the total number of data points to be sampled from the training dataset, use the
sample_size parameter. For example, if the initial dataset has 1,000 data points and sample_size is
set to 100 with a total of 2 instances, each worker samples 50 points, for a total collected set of 100 data
points. Sampling runs in linear time with respect to the number of data points.

Step 2: Perform Dimension Reduction


The current implementation of the k-NN algorithm has two methods of dimension reduction. You
specify the method in the dimension_reduction_type hyperparameter. The sign method specifies
a random projection, which uses a linear projection using a matrix of random signs, and the fjlt
method specifies a fast Johnson-Lindenstrauss transform, a method based on the Fourier transform.


Both methods preserve the L2 and inner product distances. The fjlt method should be used when the
target dimension is large and has better performance with CPU inference. The methods differ in their
computational complexity. The sign method requires O(ndk) time to reduce the dimension of a batch of
n points of dimension d into a target dimension k. The fjlt method requires O(nd log(d)) time, but the
constants involved are larger. Using dimension reduction introduces noise into the data and this noise
can reduce prediction accuracy.
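
To illustrate the sign method, the following is a toy NumPy sketch of a random-sign projection; the
dimensions are illustrative, and this is not the algorithm's internal implementation.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 2048, 256  # batch size, input dimension, target dimension

X = rng.standard_normal((n, d)).astype(np.float32)

# Random matrix of +/-1 signs; projecting a batch costs O(ndk) time
R = rng.choice([-1.0, 1.0], size=(d, k)).astype(np.float32)
X_reduced = (X @ R) / np.sqrt(k)  # scaling approximately preserves L2 distances

print(X_reduced.shape)  # (1000, 256)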

Step 3: Build an Index


During inference, the algorithm queries the index for the k-nearest-neighbors of a sample point. Based
on the references to the points, the algorithm makes the classification or regression prediction. It makes
the prediction based on the class labels or values provided. k-NN provides three different types of
indexes: a flat index, an inverted index, and an inverted index with product quantization. You specify the
type with the index_type parameter.

Serialize the Model


When the k-NN algorithm finishes training, it serializes three files to prepare for inference.

• model_algo-1: Contains the serialized index for computing the nearest neighbors.
• model_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing the predicted
label based on the query result from the index.
• model_algo-1.json: Contains the JSON-formatted model metadata which stores the k and
predictor_type hyper-parameters from training for inference along with other relevant state.

With the current implementation of k-NN, you can modify the metadata file to change the way
predictions are computed. For example, you can change k to 10 or change predictor_type to
regressor.

{
    "k": 5,
    "predictor_type": "classifier",
    "dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
    "normalize": false,
    "version": "1.0"
}
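
The following is a minimal sketch of making such an edit, assuming you have downloaded model.tar.gz
from the training job's S3 output path; the file names follow the list above, and you must repackage the
archive before creating a SageMaker model from the edited artifacts.

import json
import tarfile

# Extract the serialized model artifacts downloaded from the job's S3 output
with tarfile.open("model.tar.gz", "r:gz") as t:
    t.extractall()

# Change k without retraining by rewriting the metadata file
with open("model_algo-1.json") as f:
    meta = json.load(f)
meta["k"] = 10
with open("model_algo-1.json", "w") as f:
    json.dump(meta, f)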

EC2 Instance Recommendation for the k-NN Algorithm


We recommend training on a CPU instance (such as ml.m5.2xlarge) or on a GPU instance. The k-NN
algorithm supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

Inference requests from CPUs generally have a lower average latency than requests from GPUs because
there is a tax on CPU-to-GPU communication when you use GPU hardware. However, GPUs generally
have higher throughput for larger batches.

k-NN Hyperparameters

Parameter Name: Description

feature_dim
    The number of features in the input data.
    Required.
    Valid values: positive integer.

k
    The number of nearest neighbors.
    Required.
    Valid values: positive integer.

predictor_type
    The type of inference to use on the data labels.
    Required.
    Valid values: classifier for classification or regressor for regression.

sample_size
    The number of data points to be sampled from the training dataset.
    Required.
    Valid values: positive integer.

dimension_reduction_target
    The target dimension to reduce to.
    Required when you specify the dimension_reduction_type parameter.
    Valid values: positive integer greater than 0 and less than feature_dim.

dimension_reduction_type
    The type of dimension reduction method.
    Optional.
    Valid values: sign for random projection or fjlt for the fast Johnson-Lindenstrauss transform.
    Default value: no dimension reduction.

faiss_index_ivf_nlists
    The number of centroids to construct in the index when index_type is faiss.IVFFlat or faiss.IVFPQ.
    Optional.
    Valid values: positive integer.
    Default value: auto, which resolves to sqrt(sample_size).

faiss_index_pq_m
    The number of vector sub-components to construct in the index when index_type is set to
    faiss.IVFPQ.
    The Facebook AI Similarity Search (Faiss) library requires that the value of faiss_index_pq_m is
    a divisor of the data dimension. If faiss_index_pq_m is not a divisor of the data dimension, we
    increase the data dimension to the smallest integer divisible by faiss_index_pq_m. If no dimension
    reduction is applied, the algorithm adds a padding of zeros. If dimension reduction is applied, the
    algorithm increases the value of the dimension_reduction_target hyperparameter.
    Optional.
    Valid values: one of the following positive integers: 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 40,
    48, 56, 64, 96.

index_metric
    The metric to measure the distance between points when finding nearest neighbors. When training
    with index_type set to faiss.IVFPQ, the INNER_PRODUCT distance and COSINE similarity are not
    supported.
    Optional.
    Valid values: L2 for Euclidean distance, INNER_PRODUCT for inner-product distance, COSINE for
    cosine similarity.
    Default value: L2.

index_type
    The type of index.
    Optional.
    Valid values: faiss.Flat, faiss.IVFFlat, faiss.IVFPQ.
    Default value: faiss.Flat.

mini_batch_size
    The number of observations per mini-batch for the data iterator.
    Optional.
    Valid values: positive integer.
    Default value: 5000.

Tune a k-NN Model

The Amazon SageMaker k-nearest neighbors algorithm is a supervised algorithm. The algorithm
consumes a test data set and emits a metric about the accuracy for a classification task or about the
mean squared error for a regression task. These accuracy metrics compare the model predictions for
their respective task to the ground truth provided by the empirical test data. To find the best model that
reports the highest accuracy or lowest error on the test dataset, run a hyperparameter tuning job for k-
NN.

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective
metric appropriate for the prediction task of the algorithm. Automatic model tuning searches the
hyperparameters chosen to find the combination of values that result in the model that optimizes the
objective metric. The hyperparameters are used only to help estimate model parameters and are not
used by the trained model to make predictions.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the k-NN Algorithm

The k-nearest neighbors algorithm computes one of two metrics in the following table during training
depending on the type of task specified by the predictor_type hyper-parameter.

• classifier specifies a classification task and computes test:accuracy


• regressor specifies a regression task and computes test:mse.


Choose the predictor_type value appropriate for the type of task undertaken to calculate the
relevant objective metric when tuning a model.

Metric Name: Description (Optimization Direction)

test:accuracy (Maximize)
    When predictor_type is set to classifier, k-NN compares the predicted label, based on the
    average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel
    data. The accuracy reported ranges from 0.0 (0%) to 1.0 (100%).

test:mse (Minimize)
    When predictor_type is set to regressor, k-NN compares the predicted label, based on the
    average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel
    data. The mean squared error is computed by comparing the two labels.

Tunable k-NN Hyperparameters

Tune the Amazon SageMaker k-nearest neighbor model with the following hyperparameters.

Parameter Name: Parameter Type, Recommended Ranges

k: IntegerParameterRanges, MinValue: 1, MaxValue: 1024
sample_size: IntegerParameterRanges, MinValue: 256, MaxValue: 20000000
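
The following is a minimal sketch of such a tuning job, assuming knn is the Estimator from the earlier
sketch and that train and test channel data are staged in S3 (paths are placeholders). Because k-NN is a
first-party built-in algorithm, the metrics above are emitted automatically and no metric_definitions
regex is needed.

from sagemaker.tuner import HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=knn,  # the Estimator from the earlier sketch
    objective_metric_name="test:accuracy",
    hyperparameter_ranges={
        "k": IntegerParameter(1, 1024),
        "sample_size": IntegerParameter(256, 20000000),
    },
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://my-bucket/knn/train", "test": "s3://my-bucket/knn/test"})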

Data Formats for k-NN Training Input

All Amazon SageMaker built-in algorithms adhere to the common input training formats described
in Common Data Formats - Training. This topic contains a list of the available input formats for the
SageMaker k-nearest-neighbor algorithm.

CSV Data Format

content-type: text/csv; label_size=1

4,1.2,1.3,9.6,20.3

The first label_size columns are interpreted as the label vector for that row.

RECORDIO Data Format

content-type: application/x-recordio-protobuf

[
Record = {
features = {
'values': {
values: [1.2, 1.3, 9.6, 20.3] # float32


}
},
label = {
'values': {
values: [4] # float32
}
}
}
]

k-NN Request and Response Formats

All Amazon SageMaker built-in algorithms adhere to the common input inference format described
in Common Data Formats - Inference. This topic contains a list of the available output formats for the
SageMaker k-nearest-neighbor algorithm.

INPUT: CSV Request Format

content-type: text/csv

1.2,1.3,9.6,20.3

This format accepts a label_size or encoding parameter. It assumes a label_size of 0 and a UTF-8 encoding.

INPUT: JSON Request Format

content-type: application/json

{
    "instances": [
        {"data": {"features": {"values": [-3, -1, -4, 2]}}},
        {"features": [3.0, 0.1, 0.04, 0.002]}
    ]
}

INPUT: JSONLINES Request Format

content-type: application/jsonlines

{"features": [1.5, 16.0, 14.0, 23.0]}


{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}

INPUT: RECORDIO Request Format

content-type: application/x-recordio-protobuf

[
Record = {
features = {
'values': {
values: [-3, -1, -4, 2] # float32
}
},
label = {}
},
Record = {
features = {


'values': {
values: [3.0, 0.1, 0.04, 0.002] # float32
}
},
label = {}
},
]

OUTPUT: JSON Response Format

accept: application/json

{
"predictions": [
{"predicted_label": 0.0},
{"predicted_label": 2.0}
]
}

OUTPUT: JSONLINES Response Format

accept: application/jsonlines

{"predicted_label": 0.0}
{"predicted_label": 2.0}

OUTPUT: VERBOSE JSON Response Format

In verbose mode, the API provides the search results with the distances vector sorted from smallest to
largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/json; verbose=true

{
"predictions": [
{
"predicted_label": 0.0,
"distances": [3.11792408, 3.89746071, 6.32548437],
"labels": [0.0, 1.0, 0.0]
},
{
"predicted_label": 2.0,
"distances": [1.08470316, 3.04917915, 5.25393973],
"labels": [2.0, 2.0, 0.0]
}
]
}

OUTPUT: RECORDIO-PROTOBUF Response Format

content-type: application/x-recordio-protobuf

[
Record = {
features = {},
label = {
'predicted_label': {
values: [0.0] # float32
}


}
},
Record = {
features = {},
label = {
'predicted_label': {
values: [2.0] # float32
}
}
}
]

OUTPUT: VERBOSE RECORDIO-PROTOBUF Response Format

In verbose mode, the API provides the search results with the distances vector sorted from smallest to
largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/x-recordio-protobuf; verbose=true

[
Record = {
features = {},
label = {
'predicted_label': {
values: [0.0] # float32
},
'distances': {
values: [3.11792408, 3.89746071, 6.32548437] # float32
},
'labels': {
values: [0.0, 1.0, 0.0] # float32
}
}
},
Record = {
features = {},
label = {
'predicted_label': {
values: [2.0] # float32
},
'distances': {
values: [1.08470316, 3.04917915, 5.25393973] # float32
},
'labels': {
values: [2.0, 2.0, 0.0] # float32
}
}
}
]

SAMPLE OUTPUT for the k-NN Algorithm

For regressor tasks:

[06/08/2018 20:15:33 INFO 140026520049408] #test_score (algo-1) : ('mse', 0.013333333333333334)

For classifier tasks:

[06/08/2018 20:15:46 INFO 140285487171328] #test_score (algo-1) : ('accuracy', 0.98666666666666669)


LightGBM
LightGBM is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree
(GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target
variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM
uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT.

How to use SageMaker LightGBM

You can use LightGBM as an Amazon SageMaker built-in algorithm. The following section describes how
to use LightGBM with the SageMaker Python SDK. For information on how to use LightGBM from the
Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).

• Use LightGBM as a built-in algorithm

Use the LightGBM built-in algorithm to build a LightGBM training container as shown in the following
code example. You can automatically retrieve the LightGBM built-in algorithm image URI using the
SageMaker image_uris.retrieve API (or the get_image_uri API if using Amazon SageMaker
Python SDK version 1).

After specifying the LightGBM image URI, you can use the LightGBM container to construct an
estimator using the SageMaker Estimator API and initiate a training job. The LightGBM built-in
algorithm runs in script mode, but the training script is provided for you and there is no need to
replace it. If you have extensive experience using script mode to create a SageMaker training job, then
you can incorporate your own LightGBM training scripts.

from sagemaker import image_uris, model_uris, script_uris
import sagemaker
import boto3

# Session, Region, and execution role used throughout this example
sess = sagemaker.Session()
aws_region = boto3.Session().region_name
aws_role = sagemaker.get_execution_role()

train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
training_instance_type = "ml.m5.xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)

train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/tabular_multiclass/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

from sagemaker import hyperparameters

# Retrieve the default hyperparameters for training the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["num_boost_round"] = "500"
print(hyperparameters)

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")

# Create a SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,  # for distributed training, specify an instance_count greater than 1
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker training job by passing the S3 path of the training data
tabular_estimator.fit(
    {
        "train": training_dataset_s3_path,
        "validation": validation_dataset_s3_path,
    },
    logs=True,
    job_name=training_job_name,
)

For more information about how to set up LightGBM as a built-in algorithm, see the following
notebook examples.
• Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm
• Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm

Input and Output interface for the LightGBM algorithm


Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.

The SageMaker implementation of LightGBM supports CSV for training and inference:

• For Training ContentType, valid inputs must be text/csv.
• For Inference ContentType, valid inputs must be text/csv.

Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input
does not have the label column.
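
For example, the following is a brief sketch of writing training data in the expected layout with pandas;
the file names and the "label" column name are placeholders.

import pandas as pd

df = pd.read_csv("raw.csv")  # placeholder source file with a header row

# Move the target variable to the first column, then drop the header
cols = ["label"] + [c for c in df.columns if c != "label"]
df[cols].to_csv("train.csv", header=False, index=False)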

Input format for training data, validation data, and categorical features

Be mindful of how to format your training data for input to the LightGBM model. You must provide the
path to an Amazon S3 bucket that contains your training and validation data. You can also include a
list of categorical features. Use both the train and validation channels to provide your input data.
Alternatively, you can use only the train channel.
Note
Both train and training are valid channel names for LightGBM training.

Use both the train and validation channels

You can provide your input data by way of two S3 paths, one for the train channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files
are provided for the train or validation channels, the LightGBM algorithm concatenates the files.
The validation data is used to compute a validation score at the end of each boosting iteration. Early
stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a
JSON file for categorical features, your train channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.
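
For example, if the second and fourth columns (indices 1 and 3) of your training CSV hold categorical
features, the categorical_index.json file would contain the following (the indices shown are
illustrative):

{"cat_index_list": [1, 3]}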

Use only the train channel:

You can alternatively provide your input data by way of a single S3 path for the train channel. This S3
path should point to a directory with a subdirectory named train/ that contains one or more CSV files.
You can optionally include another subdirectory in the same location called validation/ that also has
one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly
sampled to serve as the validation data. If your predictors include categorical features, you can provide a
JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.

SageMaker LightGBM uses the Python joblib module to serialize and deserialize models when saving and
loading them.

To use a model trained with SageMaker LightGBM with the joblib module

• Use the following Python code:

import joblib
import tarfile

# Extract the trained model from the training job's model.tar.gz artifact
with tarfile.open("model.tar.gz", "r:gz") as t:
    t.extractall()

# model_file_path is the path of the extracted model file
model = joblib.load(model_file_path)

# prediction with test data
# dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
pred = model.predict(dtest)

Amazon EC2 instance recommendation for the LightGBM algorithm

SageMaker LightGBM currently supports single-instance and multi-instance CPU training. For multi-
instance CPU training (distributed training), specify an instance_count greater than 1 when you
define your Estimator. For more information on distributed training with LightGBM, see Amazon
SageMaker LightGBM Distributed training using Dask.

LightGBM is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute
instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5).
Further, we recommend that you have enough total memory in selected instances to hold the training
data.

LightGBM sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of the
Amazon SageMaker LightGBM algorithm.

• Tabular classification with Amazon SageMaker LightGBM and CatBoost algorithm: This notebook
  demonstrates the use of the Amazon SageMaker LightGBM algorithm to train and host a tabular
  classification model.
• Tabular regression with Amazon SageMaker LightGBM and CatBoost algorithm: This notebook
  demonstrates the use of the Amazon SageMaker LightGBM algorithm to train and host a tabular
  regression model.
• Amazon SageMaker LightGBM Distributed training using Dask: This notebook demonstrates distributed
  training with the Amazon SageMaker LightGBM algorithm using the Dask framework.

For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.

How LightGBM works

LightGBM implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the
addition of two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature
Bundling (EFB). These techniques are designed to significantly improve the efficiency and scalability of
GBDT.

The LightGBM algorithm performs well in machine learning competitions because of its robust handling
of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you
can fine-tune. You can use LightGBM for regression, classification (binary and multiclass), and ranking
problems.

For more information on gradient boosting, see How XGBoost Works (p. 1376). For in-depth details
about the additional GOSS and EFB techniques used in the LightGBM method, see LightGBM: A Highly
Efficient Gradient Boosting Decision Tree.


LightGBM hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker LightGBM algorithm. Users set these parameters to facilitate the estimation
of model parameters from data. The SageMaker LightGBM algorithm is an implementation of the open-
source LightGBM package.
Note
The default hyperparameters are based on example datasets in the LightGBM sample
notebooks (p. 1339).

By default, the SageMaker LightGBM algorithm automatically chooses an evaluation metric and
objective function based on the type of classification problem. The LightGBM algorithm detects the
type of classification problem based on the number of labels in your data. For regression problems,
the evaluation metric is root mean squared error and the objective function is L2 loss. For binary
classification problems, the evaluation metric and objective function are both binary cross entropy. For
multiclass classification problems, the evaluation metric is multiclass cross entropy and the objective
function is softmax. You can use the metric hyperparameter to change the default evaluation metric.
Refer to the following table for more information on LightGBM hyperparameters, including descriptions,
valid values, and default values.

Parameter Name Description

num_boost_round The maximum number of boosting iterations. Note: Internally,


LightGBM constructs num_class * num_boost_round trees for
multi-class classification problems.

Valid values: integer, range: Positive integer.

Default value: 100.

early_stopping_rounds The training will stop if one metric of one validation data point
does not improve in the last early_stopping_rounds round.
If early_stopping_rounds is less than or equal to zero, this
hyperparameter is ignored.

Valid values: integer.

Default value: 10.

metric The evaluation metric for validation data. If metric is set to the
default "auto" value, then the algorithm automatically chooses an
evaluation metric based on the type of classification problem:

• rmse for regression


• binary_logloss for binary classification
• multi_logloss for multi-class classification

Valid values: string, any of the following: ("auto", "rmse", "l1",


"l2", "huber", "fair", "binary_logloss", "binary_error",
"auc", "average_precision", "multi_logloss",
"multi_error", "auc_mu", or "cross_entropy").

Default value: "auto".

learning_rate The rate at which the model weights are updated after working
through each batch of training examples.

1340
Amazon SageMaker Developer Guide
Use Built-in Algorithms

Parameter Name Description


Valid values: float, range: (0.0, 1.0).

Default value: 0.1.

num_leaves The maximum number of leaves in one tree.

Valid values: integer, range: (1, 131072).

Default value: 64.

feature_fraction A subset of features to be selected on each iteration (tree). Must be less than 1.0.

Valid values: float, range: (0.0, 1.0).

Default value: 0.9.

bagging_fraction A subset of the data, similar to feature_fraction, but bagging_fraction randomly selects part of the data without resampling.

Valid values: float, range: (0.0, 1.0].

Default value: 0.9.

bagging_freq The frequency to perform bagging. At every bagging_freq iteration, LightGBM randomly selects a percentage of the data to use for the next bagging_freq iterations. This percentage is determined by the bagging_fraction hyperparameter. If bagging_freq is zero, then bagging is deactivated.

Valid values: integer, range: Non-negative integer.

Default value: 1.

max_depth The maximum depth for a tree model. This is used to deal with
overfitting when the amount of data is small. If max_depth is less
than or equal to zero, this means there is no limit for maximum
depth.

Valid values: integer.

Default value: 6.

min_data_in_leaf The minimum amount of data in one leaf. Can be used to deal with
overfitting.

Valid values: integer, range: Non-negative integer.

Default value: 3.

max_delta_step Used to limit the max output of tree leaves. If max_delta_step is less than or equal to 0, then there is no constraint. The final max output of leaves is learning_rate * max_delta_step.

Valid values: float.

Default value: 0.0.

lambda_l1 L1 regularization.

Valid values: float, range: Non-negative float.

Default value: 0.0.

lambda_l2 L2 regularization.

Valid values: float, range: Non-negative float.

Default value: 0.0.

boosting The boosting type.

Valid values: string, any of the following: ("gbdt", "rf", "dart", or "goss").

Default value: "gbdt".

min_gain_to_split The minimum gain to perform a split. Can be used to speed up training.

Valid values: float, range: Non-negative float.

Default value: 0.0.

scale_pos_weight The weight of the labels with positive class. Used only for binary
classification tasks. scale_pos_weight cannot be used if
is_unbalance is set to "True".

Valid values: float, range: Positive float.

Default value: 1.0.

tree_learner Tree learner type.

Valid values: string, any of the following: ("serial", "feature", "data", or "voting").

Default value: "serial".

feature_fraction_bynode Selects a subset of random features on each tree node. For example, if feature_fraction_bynode is 0.8, then 80% of features are selected. Can be used to deal with overfitting.

Valid values: float, range: (0.0, 1.0].

Default value: 1.0.

is_unbalance Set to "True" if training data is unbalanced. Used only for binary classification tasks. is_unbalance cannot be used with scale_pos_weight.

Valid values: string, either: ("True" or "False").

Default value: "False".

max_bin The maximum number of bins used to bucket feature values. A small number of bins may reduce training accuracy, but may improve generalization. Can be used to deal with overfitting.

Valid values: integer, range: (1, ∞).

Default value: 255.

tweedie_variance_power Controls the variance of the Tweedie distribution. Set this closer to
2.0 to shift toward a gamma distribution. Set this closer to 1.0 to
shift toward a Poisson distribution. Used only for regression tasks.

Valid values: float, range: [1.0, 2.0).

Default value: 1.5.

num_threads Number of parallel threads used to run LightGBM. Value 0 means the default number of threads in OpenMP.

Valid values: integer, range: Non-negative integer.

Default value: 0.

verbosity The verbosity of print messages. If verbosity is less than 0, then print messages show only fatal errors. If verbosity is set to 0, then print messages include errors and warnings. If verbosity is 1, then print messages show more information. A verbosity greater than 1 shows the most information and can be used for debugging.

Valid values: integer.

Default value: 1.

Tune a LightGBM model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. Model
tuning focuses on the following hyperparameters:
Note
The learning objective function is automatically assigned based on the type of classification
task, which is determined by the number of unique integers in the label column. For more
information, see LightGBM hyperparameters (p. 1340).

• A learning objective function to optimize during model training
• An evaluation metric that is used to evaluate model performance during validation
• A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your specified hyperparameters to find the combination of values that
results in a model that optimizes the chosen evaluation metric.
Note
Automatic model tuning for LightGBM is only available from the Amazon SageMaker SDKs, not
from the SageMaker console.


For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Evaluation metrics computed by the LightGBM algorithm

The SageMaker LightGBM algorithm computes the following metrics to use for model validation. The
evaluation metric is automatically assigned based on the type of classification task, which is determined
by the number of unique integers in the label column.

Metric Name Description Optimization Direction Regex Pattern

rmse root mean square error minimize "rmse: ([0-9\\.]+)"

l1 mean absolute error minimize "l1: ([0-9\\.]+)"

l2 mean squared error minimize "l2: ([0-9\\.]+)"

huber huber loss minimize "huber: ([0-9\\.]+)"

fair fair loss minimize "fair: ([0-9\\.]+)"

binary_logloss binary cross entropy minimize "binary_logloss: ([0-9\\.]+)"

binary_error binary error minimize "binary_error: ([0-9\\.]+)"

auc AUC maximize "auc: ([0-9\\.]+)"

average_precision average precision score maximize "average_precision: ([0-9\\.]+)"

multi_logloss multiclass cross entropy minimize "multi_logloss: ([0-9\\.]+)"

multi_error multiclass error score minimize "multi_error: ([0-9\\.]+)"

auc_mu AUC-mu maximize "auc_mu: ([0-9\\.]+)"

cross_entropy cross entropy minimize "cross_entropy: ([0-9\\.]+)"

Tunable LightGBM hyperparameters

Tune the LightGBM model with the following hyperparameters. The hyperparameters that have the
greatest effect on optimizing the LightGBM evaluation metrics are: learning_rate, num_leaves,
feature_fraction, bagging_fraction, bagging_freq, max_depth, and min_data_in_leaf. For
a list of all the LightGBM hyperparameters, see LightGBM hyperparameters (p. 1340).


Parameter Name Parameter Type Recommended Ranges

learning_rate ContinuousParameterRanges MinValue: 0.001, MaxValue: 0.01

num_leaves IntegerParameterRanges MinValue: 10, MaxValue: 100

feature_fraction ContinuousParameterRanges MinValue: 0.1, MaxValue: 1.0

bagging_fraction ContinuousParameterRanges MinValue: 0.1, MaxValue: 1.0

bagging_freq IntegerParameterRanges MinValue: 0, MaxValue: 10

max_depth IntegerParameterRanges MinValue: 15, MaxValue: 100

min_data_in_leaf IntegerParameterRanges MinValue: 10, MaxValue: 200
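
The following sketch shows how these recommended ranges translate to a tuning job with the SageMaker Python SDK. The LightGBM estimator, the S3 channel paths, and the choice of rmse as the objective metric are assumptions for illustration:

from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# lightgbm_estimator is assumed to be an Estimator configured with the
# LightGBM container, as described earlier.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.001, 0.01),
    "num_leaves": IntegerParameter(10, 100),
    "feature_fraction": ContinuousParameter(0.1, 1.0),
    "bagging_fraction": ContinuousParameter(0.1, 1.0),
    "bagging_freq": IntegerParameter(0, 10),
    "max_depth": IntegerParameter(15, 100),
    "min_data_in_leaf": IntegerParameter(10, 200),
}

tuner = HyperparameterTuner(
    estimator=lightgbm_estimator,
    objective_metric_name="rmse",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[{"Name": "rmse", "Regex": "rmse: ([0-9\\.]+)"}],
    max_jobs=10,
    max_parallel_jobs=2,
)

# train_s3_path and validation_s3_path are assumed S3 URIs for your data.
tuner.fit({"training": train_s3_path, "validation": validation_s3_path})
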

Linear Learner Algorithm


Linear models are supervised learning algorithms used for solving either classification or regression
problems. For input, you give the model labeled examples (x, y). x is a high-dimensional vector and y
is a numeric label. For binary classification problems, the label must be either 0 or 1. For multiclass
classification problems, the labels must be from 0 to num_classes - 1. For regression problems, y is
a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold
function, and maps a vector x to an approximation of the label y.

The Amazon SageMaker linear learner algorithm provides a solution for both classification and
regression problems. With the SageMaker algorithm, you can simultaneously explore different training
objectives and choose the best solution from a validation set. You can also explore a large number of
models and choose the best. The best model optimizes either of the following:

• Continuous objectives, such as mean square error, cross entropy loss, or absolute error.
• Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy.

Compared with methods that provide a solution for only continuous objectives, the SageMaker linear
learner algorithm provides a significant increase in speed over naive hyperparameter optimization
techniques. It is also more convenient.

The linear learner algorithm requires a data matrix, with rows representing the observations, and
columns representing the dimensions of the features. It also requires an additional column that contains
the labels that match the data points. At a minimum, Amazon SageMaker linear learner requires you to
specify input and output data locations, and objective type (classification or regression) as arguments.
The feature dimension is also required. For more information, see CreateTrainingJob. You can specify
additional parameters in the HyperParameters string map of the request body. These parameters
control the optimization procedure, or specifics of the objective function that you train on. For example,
the number of epochs, regularization, and loss type.
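
For example, the following sketch configures a minimal binary classification training job with the SageMaker Python SDK. The IAM role ARN, bucket names, and feature_dim value are placeholders:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Retrieve the built-in linear learner container for your Region.
container = image_uris.retrieve("linear-learner", session.boto_region_name)

ll_estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://amzn-s3-demo-bucket/linear-learner/output",  # placeholder
    sagemaker_session=session,
)

ll_estimator.set_hyperparameters(
    predictor_type="binary_classifier",
    feature_dim=13,  # number of feature columns in your data (placeholder)
    epochs=15,
)

# Launch the training job with a CSV dataset in the train channel.
ll_estimator.fit({"train": "s3://amzn-s3-demo-bucket/linear-learner/train/"})
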

If you're using Managed Spot Training, the linear learner algorithm supports using checkpoints to take a
snapshot of the state of the model.

Topics


• Input/Output interface for the linear learner algorithm (p. 1346)
• EC2 instance recommendation for the linear learner algorithm (p. 1346)
• Linear learner sample notebooks (p. 1347)
• How linear learner works (p. 1347)
• Linear learner hyperparameters (p. 1348)
• Tune a linear learner model (p. 1356)
• Linear learner response formats (p. 1360)

Input/Output interface for the linear learner algorithm

The Amazon SageMaker linear learner algorithm supports three data channels: train, validation
(optional), and test (optional). If you provide validation data, the S3DataDistributionType should
be FullyReplicated. The algorithm logs validation loss at every epoch, and uses a sample of the
validation data to calibrate and select the best model. If you don't provide validation data, the algorithm
uses a sample of the training data to calibrate and select the model. If you provide test data, the
algorithm logs include the test score for the final model.

For training, the linear learner algorithm supports both recordIO-wrapped protobuf and CSV
formats. For the application/x-recordio-protobuf input type, only Float32 tensors are
supported. For the text/csv input type, the first column is assumed to be the label, which is the target
variable for prediction. You can use either File mode or Pipe mode to train linear learner models on data
that is formatted as recordIO-wrapped-protobuf or as CSV.

For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. When you make predictions on new data, the format of the response depends on the type of model. For regression (predictor_type='regressor'), the score is the prediction produced by the model. For classification (predictor_type='binary_classifier' or predictor_type='multiclass_classifier'), the model returns a score and also a predicted_label. The predicted_label is the class predicted by the model and the score measures the strength of that prediction.

• For binary classification, predicted_label is 0 or 1, and score is a single floating point number
that indicates how strongly the algorithm believes that the label should be 1.
• For multiclass classification, the predicted_label will be an integer from 0 to num_classes-1, and score will be a list of one floating point number per class.

To interpret the score in classification problems, you have to consider the loss function used. If the loss hyperparameter value is logistic for binary classification or softmax_loss for multiclass classification, then the score can be interpreted as the probability of the corresponding class. These are the loss functions that the linear learner uses when the loss hyperparameter is set to its auto default value. But if the loss is set to hinge_loss, then the score cannot be interpreted as a probability. This is because hinge loss corresponds to a support vector classifier, which does not produce probability estimates.
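
For example, the following sketch invokes a hypothetical endpoint that serves a binary classifier trained with logistic loss, so the returned score can be read as the probability that the label is 1. The endpoint name and feature values are placeholders:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="linear-learner-endpoint",  # placeholder
    ContentType="text/csv",
    Body="1.5,2.3,0.4,7.1",  # one observation, features only
)

result = json.loads(response["Body"].read())
prediction = result["predictions"][0]

# With logistic loss, score is the probability that the label is 1.
print(prediction["predicted_label"], prediction["score"])
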

For more information on input and output file formats, see Linear learner response formats (p. 1360). For more information on inference formats, see the Linear learner sample notebooks (p. 1347).

EC2 instance recommendation for the linear learner algorithm

The linear learner algorithm supports both CPU and GPU instances for training and inference. For GPU,
the linear learner algorithm supports P2, P3, G4dn, and G5 GPU families.

During testing, we have not found substantial evidence that multi-GPU instances are faster than single-
GPU instances. Results can vary, depending on your specific use case.


Linear learner sample notebooks


The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker linear learner algorithm.

Notebook Title Description

An Introduction with the MNIST dataset Using the MNIST dataset, we train a binary
classifier to predict a single digit.

Predict Breast Cancer Using UCI's Breast Cancer dataset, we train a model to predict breast cancer.

How to Build a Multiclass Classifier? Using UCI's Covertype dataset, we demonstrate how to train a multiclass classifier.

How to Build a Machine Learning (ML) Pipeline for Inference? Using a Scikit-learn container, we demonstrate how to build an end-to-end ML pipeline.

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the SageMaker samples. The example notebooks that use the linear learner algorithm are located in the Introduction to Amazon algorithms section. To open a notebook, choose its Use tab and choose Create copy.

How linear learner works


There are three steps involved in the implementation of the linear learner algorithm: preprocess, train,
and validate.

Step 1: Preprocess
Normalization, or feature scaling, is an important preprocessing step for certain loss functions that
ensures the model being trained on a dataset does not become dominated by the weight of a single
feature. The Amazon SageMaker Linear Learner algorithm has a normalization option to assist with this
preprocessing step. If normalization is turned on, the algorithm first goes over a small sample of the data
to learn the mean value and standard deviation for each feature and for the label. Each of the features in
the full dataset is then shifted to have mean of zero and scaled to have a unit standard deviation.
Note
For best results, ensure your data is shuffled before training. Training with unshuffled data may
cause training to fail.

You can configure whether the linear learner algorithm normalizes the feature data and the labels using
the normalize_data and normalize_label hyperparameters, respectively. Normalization is enabled
by default for both features and labels for regression. Only the features can be normalized for binary
classification and this is the default behavior.

Step 2: Train
With the linear learner algorithm, you train with a distributed implementation of stochastic gradient
descent (SGD). You can control the optimization process by choosing the optimization algorithm. For
example, you can choose to use Adam, AdaGrad, stochastic gradient descent, or other optimization
algorithms. You also specify their hyperparameters, such as momentum, learning rate, and the learning
rate schedule. If you aren't sure which algorithm or hyperparameter value to use, choose a default that
works for the majority of datasets.

During training, you simultaneously optimize multiple models, each with slightly different objectives. For
example, you vary L1 or L2 regularization and try out different optimizer settings.


Step 3: Validate and set the threshold

When training multiple models in parallel, the models are evaluated against a validation set to select the optimal model once training is complete. For regression, the optimal model is the one that achieves the best loss on the validation set. For classification, a sample of the validation set is used to calibrate the classification threshold. The optimal model is the one that achieves the best binary classification selection criteria on the validation set. Examples of such criteria include the F1 measure, accuracy, and cross-entropy loss.
Note
If the algorithm is not provided a validation set, then evaluating and selecting the optimal model is not possible. To take advantage of parallel training and model selection, ensure you provide a validation set to the algorithm.

Linear learner hyperparameters

The following table contains the hyperparameters for the linear learner algorithm. These are parameters
that are set by users to facilitate the estimation of model parameters from data. The required
hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters
that can be set are listed next, also in alphabetical order. When a hyperparameter is set to auto, Amazon
SageMaker will automatically calculate and set the value of that hyperparameter.

Parameter Name Description

num_classes The number of classes for the response variable. The algorithm assumes
that classes are labeled 0, ..., num_classes - 1.

Required when predictor_type is multiclass_classifier. Otherwise, the algorithm ignores it.

Valid values: Integers from 3 to 1,000,000

predictor_type Specifies the type of target variable as a binary classification, multiclass classification, or regression.

Required

Valid values: binary_classifier, multiclass_classifier, or regressor

accuracy_top_k When computing the top-k accuracy metric for multiclass classification,
the value of k. If the model assigns one of the top-k scores to the true
label, an example is scored as correct.

Optional

Valid values: Positive integers

Default value: 3

balance_multiclass_weights Specifies whether to use class weights, which give each class equal importance in the loss function. Used only when the predictor_type is multiclass_classifier.

Optional

Valid values: true, false

Default value: false

beta_1 The exponential decay rate for first-moment estimates. Applies only when
the optimizer value is adam.

Optional

Valid values: auto or floating-point value between 0 and 1.0

Default value: auto

beta_2 The exponential decay rate for second-moment estimates. Applies only
when the optimizer value is adam.

Optional

Valid values: auto or floating-point number between 0 and 1.0

Default value: auto

bias_lr_mult Allows a different learning rate for the bias term. The actual learning rate
for the bias is learning_rate * bias_lr_mult.

Optional

Valid values: auto or positive floating-point number

Default value: auto

bias_wd_mult Allows different regularization for the bias term. The actual L2
regularization weight for the bias is wd * bias_wd_mult. By default, there
is no regularization on the bias term.

Optional

Valid values: auto or non-negative floating-point number

Default value: auto

binary_classifier_model_selection_criteria
When predictor_type is set to binary_classifier, the model
evaluation criteria for the validation dataset (or for the training dataset if
you don't provide a validation dataset). Criteria include:

• accuracy—The model with the highest accuracy.
• f_beta—The model with the highest F1 score. The default is F1.
• precision_at_target_recall—The model with the highest
precision at a given recall target.
• recall_at_target_precision—The model with the highest recall
at a given precision target.
• loss_function—The model with the lowest value of the loss function
used in training.

Optional

Valid values: accuracy, f_beta, precision_at_target_recall, recall_at_target_precision, or loss_function

Default value: accuracy

early_stopping_patience The number of epochs to wait before ending training if no improvement is made in the relevant metric. If you have provided a value for binary_classifier_model_selection_criteria, the metric is that value. Otherwise, the metric is the same as the value specified for the loss hyperparameter.

The metric is evaluated on the validation data. If you haven't provided validation data, the metric is always the same as the value specified for the loss hyperparameter and is evaluated on the training data. To disable early stopping, set early_stopping_patience to a value greater than the value specified for epochs.

Optional

Valid values: Positive integer

Default value: 3

early_stopping_tolerance The relative tolerance to measure an improvement in loss. If the ratio of the improvement in loss divided by the previous best loss is smaller than this value, early stopping considers the improvement to be zero.

Optional

Valid values: Positive floating-point number

Default value: 0.001

epochs The maximum number of passes over the training data.

Optional

Valid values: Positive integer

Default value: 15

f_beta The value of beta to use when calculating F score metrics for binary
or multiclass classification. Also used if the value specified for
binary_classifier_model_selection_criteria is f_beta.

Optional

Valid values: Positive floating-point numbers

Default value: 1.0

feature_dim The number of features in the input data.

Optional

Valid values: auto or positive integer

Default value: auto

huber_delta The parameter for Huber loss. During training and metric evaluation,
compute L2 loss for errors smaller than delta and L1 loss for errors larger
than delta.

Optional

Valid values: Positive floating-point number

Default value: 1.0

init_bias Initial weight for the bias term.

Optional

Valid values: Floating-point number

Default value: 0

init_method Sets the initial distribution function used for model weights. Functions
include:

• uniform—Uniformly distributed between (-scale, +scale)
• normal—Normal distribution with mean 0 and standard deviation init_sigma

Optional

Valid values: uniform or normal

Default value: uniform

init_scale Scales an initial uniform distribution for model weights. Applies only
when the init_method hyperparameter is set to uniform.

Optional

Valid values: Positive floating-point number

Default value: 0.07

init_sigma The initial standard deviation for the normal distribution. Applies only
when the init_method hyperparameter is set to normal.

Optional

Valid values: Positive floating-point number

Default value: 0.01

l1 The L1 regularization parameter. If you don't want to use L1 regularization, set the value to 0.

Optional

Valid values: auto or non-negative float

Default value: auto

learning_rate The step size used by the optimizer for parameter updates.

Optional

Valid values: auto or positive floating-point number

Default value: auto, whose value depends on the optimizer chosen.

loss Specifies the loss function.

The available loss functions and their default values depend on the value
of predictor_type:

• If the predictor_type is set to regressor, the available options are auto, squared_loss, absolute_loss, eps_insensitive_squared_loss, eps_insensitive_absolute_loss, quantile_loss, and huber_loss. The default value for auto is squared_loss.
• If the predictor_type is set to binary_classifier, the available options are auto, logistic, and hinge_loss. The default value for auto is logistic.
• If the predictor_type is set to multiclass_classifier, the available options are auto and softmax_loss. The default value for auto is softmax_loss.

Valid values: auto, logistic, squared_loss, absolute_loss, hinge_loss, eps_insensitive_squared_loss, eps_insensitive_absolute_loss, quantile_loss, or huber_loss

Optional

Default value: auto

loss_insensitivity The parameter for the epsilon-insensitive loss type. During training and
metric evaluation, any error smaller than this value is considered to be
zero.

Optional

Valid values: Positive floating-point number

Default value: 0.01

lr_scheduler_factor For every lr_scheduler_step hyperparameter, the learning rate decreases by this quantity. Applies only when the use_lr_scheduler hyperparameter is set to true.

Optional

Valid values: auto or positive floating-point number between 0 and 1

Default value: auto

lr_scheduler_minimum_lr The learning rate never decreases to a value lower than the value set for lr_scheduler_minimum_lr. Applies only when the use_lr_scheduler hyperparameter is set to true.

Optional

Valid values: auto or positive floating-point number

Default value: auto

lr_scheduler_step The number of steps between decreases of the learning rate. Applies only
when the use_lr_scheduler hyperparameter is set to true.

Optional

Valid values: auto or positive integer

Default value: auto

margin The margin for the hinge_loss function.

Optional

Valid values: Positive floating-point number

Default value: 1.0

mini_batch_size The number of observations per mini-batch for the data iterator.

Optional

Valid values: Positive integer

Default value: 1000

momentum The momentum of the sgd optimizer.

Optional

Valid values: auto or a floating-point number between 0 and 1.0

Default value: auto

normalize_data Normalizes the feature data before training. Data normalization shifts
the data for each feature to have a mean of zero and scales it to have unit
standard deviation.

Optional

Valid values: auto, true, or false

Default value: true

normalize_label Normalizes the label. Label normalization shifts the label to have a mean
of zero and scales it to have unit standard deviation.

The auto default value normalizes the label for regression problems but
does not for classification problems. If you set the normalize_label
hyperparameter to true for classification problems, the algorithm ignores
it.

Optional

Valid values: auto, true, or false

Default value: auto

num_calibration_samples The number of observations from the validation dataset to use for model calibration (when finding the best threshold).

Optional

Valid values: auto or positive integer

Default value: auto

num_models The number of models to train in parallel. For the default, auto, the
algorithm decides the number of parallel models to train. One model
is trained according to the given training parameter (regularization,
optimizer, loss), and the rest by close parameters.

Optional

Valid values: auto or positive integer

Default value: auto

num_point_for_scaler The number of data points to use for calculating normalization or unbiasing of terms.

Optional

Valid values: Positive integer

Default value: 10,000

optimizer The optimization algorithm to use.

Optional

Valid values:

• auto—The default value.
• sgd—Stochastic gradient descent.
• adam—Adaptive momentum estimation.
• adam—Adaptive momentum estimation.
• rmsprop—A gradient-based optimization technique that uses a moving
average of squared gradients to normalize the gradient.

Default value: auto. The default setting for auto is adam.

positive_example_weight_mult The weight assigned to positive examples when training a binary classifier. The weight of negative examples is fixed at 1. If you want the algorithm to choose a weight so that errors in classifying negative vs. positive examples have equal impact on training loss, specify balanced. If you want the algorithm to choose the weight that optimizes performance, specify auto.

Optional

Valid values: balanced, auto, or a positive floating-point number

Default value: 1.0

quantile The quantile for quantile loss. For quantile q, the model attempts to
produce predictions so that the value of true_label is greater than the
prediction with probability q.

Optional

Valid values: Floating-point number between 0 and 1

Default value: 0.5

target_precision The target precision. If binary_classifier_model_selection_criteria is recall_at_target_precision, then precision is held at this value while recall is maximized.

Optional

Valid values: Floating-point number between 0 and 1.0

Default value: 0.8

target_recall The target recall. If binary_classifier_model_selection_criteria is precision_at_target_recall, then recall is held at this value while precision is maximized.

Optional

Valid values: Floating-point number between 0 and 1.0

Default value: 0.8

unbias_data Unbiases the features before training so that the mean is 0. Data is unbiased by default because the use_bias hyperparameter defaults to true.

Optional

Valid values: auto, true, or false

Default value: auto

unbias_label Unbiases labels before training so that the mean is 0. Applies to regression only if the use_bias hyperparameter is set to true.

Optional

Valid values: auto, true, or false

Default value: auto

use_bias Specifies whether the model should include a bias term, which is the
intercept term in the linear equation.

Optional

Valid values: true or false

Default value: true

use_lr_scheduler Whether to use a scheduler for the learning rate. If you want to use a
scheduler, specify true.

Optional

Valid values: true or false

Default value: true

wd The weight decay parameter, also known as the L2 regularization parameter. If you don't want to use L2 regularization, set the value to 0.

Optional

Valid values: auto or non-negative floating-point number

Default value: auto

Tune a linear learner model


Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

The linear learner algorithm also has an internal mechanism for tuning hyperparameters separate
from the automatic model tuning feature described here. By default, the linear learner algorithm tunes
hyperparameters by training multiple models in parallel. When you use automatic model tuning, the
linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel
models, num_models, to 1. The algorithm ignores any value that you set for num_models.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics computed by the linear learner algorithm


The linear learner algorithm reports the metrics in the following table, which are computed during
training. Choose one of them as the objective metric. To avoid overfitting, we recommend tuning the
model against a validation metric instead of a training metric.


Metric Name Description Optimization Direction

test:absolute_loss The absolute loss of the final model on the test Minimize
dataset. This objective metric is only valid for
regression.

test:binary_classification_accuracy The accuracy of the final model on the test dataset. This objective metric is only valid for binary classification. Maximize

test:binary_f_beta The F-beta score of the final model on the test Maximize
dataset. By default, it is the F1 score, which
is the harmonic mean of precision and recall.
This objective metric is only valid for binary
classification.

test:dcg The discounted cumulative gain of the final model on the test dataset. This objective metric is only valid for multiclass classification. Maximize

test:macro_f_beta The F-beta score of the final model on the test Maximize
dataset. This objective metric is only valid for
multiclass classification.

test:macro_precision The precision score of the final model on the test dataset. This objective metric is only valid for multiclass classification. Maximize

test:macro_recall The recall score of the final model on the test Maximize
dataset. This objective metric is only valid for
multiclass classification.

test:mse The mean square error of the final model on the Minimize
test dataset. This objective metric is only valid for
regression.

test:multiclass_accuracy The accuracy of the final model on the test dataset. This objective metric is only valid for multiclass classification. Maximize

test:multiclass_top_k_accuracy The accuracy among the top k labels predicted on the test dataset. If you choose this metric as the objective, we recommend setting the value of k using the accuracy_top_k hyperparameter. This objective metric is only valid for multiclass classification. Maximize

test:objective_loss The mean value of the objective loss function on the test dataset after the model is trained. By default, the loss is logistic loss for binary classification and squared loss for regression. To set the loss to other types, use the loss hyperparameter. Minimize

test:precision The precision of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the binary_classifier_model_selection_criteria hyperparameter to precision_at_target_recall and setting the value for the target_recall hyperparameter. This objective metric is only valid for binary classification. Maximize

test:recall The recall of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision and setting the value for the target_precision hyperparameter. This objective metric is only valid for binary classification. Maximize

test:roc_auc_score The area under the receiver operating characteristic curve (ROC curve) of the final model on the test dataset. This objective metric is only valid for binary classification. Maximize

validation:absolute_loss The absolute loss of the final model on the validation dataset. This objective metric is only valid for regression. Minimize

validation:binary_classification_accuracy The accuracy of the final model on the validation dataset. This objective metric is only valid for binary classification. Maximize

validation:binary_f_beta The F-beta score of the final model on the validation dataset. By default, the F-beta score is the F1 score, which is the harmonic mean of the validation:precision and validation:recall metrics. This objective metric is only valid for binary classification. Maximize

validation:dcg The discounted cumulative gain of the final model on the validation dataset. This objective metric is only valid for multiclass classification. Maximize

validation:macro_f_beta The F-beta score of the final model on the validation dataset. This objective metric is only valid for multiclass classification. Maximize

validation:macro_precision The precision score of the final model on the validation dataset. This objective metric is only valid for multiclass classification. Maximize

validation:macro_recall The recall score of the final model on the validation dataset. This objective metric is only valid for multiclass classification. Maximize

validation:mse The mean square error of the final model on the Minimize
validation dataset. This objective metric is only
valid for regression.

validation:multiclass_accuracy The accuracy of the final model on the validation dataset. This objective metric is only valid for multiclass classification. Maximize

validation:multiclass_top_k_accuracy The accuracy among the top k labels predicted on the validation dataset. If you choose this metric as the objective, we recommend setting the value of k using the accuracy_top_k hyperparameter. This objective metric is only valid for multiclass classification. Maximize

validation:objective_loss The mean value of the objective loss function on the validation dataset every epoch. By default, the loss is logistic loss for binary classification and squared loss for regression. To set loss to other types, use the loss hyperparameter. Minimize

validation:precision The precision of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the binary_classifier_model_selection_criteria hyperparameter to precision_at_target_recall and setting the value for the target_recall hyperparameter. This objective metric is only valid for binary classification. Maximize

validation:recall The recall of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision and setting the value for the target_precision hyperparameter. This objective metric is only valid for binary classification. Maximize

validation:rmse The root mean square error of the final model Minimize
on the validation dataset. This objective metric is
only valid for regression.

validation:roc_auc_score The area under the receiver operating characteristic curve (ROC curve) of the final model on the validation dataset. This objective metric is only valid for binary classification. Maximize

Tuning linear learner hyperparameters


You can tune a linear learner model with the following hyperparameters.

Parameter Name Parameter Type Recommended Ranges

wd ContinuousParameterRanges MinValue: 1e-7, MaxValue: 1

l1 ContinuousParameterRanges MinValue: 1e-7, MaxValue: 1

learning_rate ContinuousParameterRanges MinValue: 1e-5, MaxValue: 1

mini_batch_size IntegerParameterRanges MinValue: 100, MaxValue: 5000

use_bias CategoricalParameterRanges [True, False]

positive_example_weight_mult ContinuousParameterRanges MinValue: 1e-5, MaxValue: 1e5
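
As a sketch, these ranges map to a HyperparameterTuner as follows; ll_estimator is assumed to be a linear learner Estimator configured as shown earlier, and the S3 paths are placeholders:

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

hyperparameter_ranges = {
    "wd": ContinuousParameter(1e-7, 1),
    "l1": ContinuousParameter(1e-7, 1),
    "learning_rate": ContinuousParameter(1e-5, 1),
    "mini_batch_size": IntegerParameter(100, 5000),
    "use_bias": CategoricalParameter(["true", "false"]),
}

# Built-in algorithms emit their metrics automatically, so no metric
# definitions are needed here.
tuner = HyperparameterTuner(
    estimator=ll_estimator,
    objective_metric_name="validation:objective_loss",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=3,
)

tuner.fit(
    {
        "train": "s3://amzn-s3-demo-bucket/train/",       # placeholder
        "validation": "s3://amzn-s3-demo-bucket/valid/",  # placeholder
    }
)
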

Linear learner response formats

JSON response formats


All Amazon SageMaker built-in algorithms adhere to the common input inference format described in
Common Data Formats - Inference. The following are the available output formats for the SageMaker
linear learner algorithm.

Binary Classification

let response = {
    "predictions": [
        {
            "score": 0.4,
            "predicted_label": 0
        }
    ]
}

Multiclass Classification

let response = {
    "predictions": [
        {
            "score": [0.1, 0.2, 0.4, 0.3],
            "predicted_label": 2
        }
    ]
}

Regression

let response = {
    "predictions": [
        {
            "score": 0.4
        }
    ]
}
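
To receive responses in this JSON format from the SageMaker Python SDK, you can deploy the trained estimator with a CSV serializer and a JSON deserializer. A minimal sketch, assuming ll_estimator is a trained linear learner Estimator:

from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer

predictor = ll_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=CSVSerializer(),       # send text/csv requests
    deserializer=JSONDeserializer(),  # parse application/json responses
)

result = predictor.predict("1.5,2.3,0.4,7.1")
print(result["predictions"][0]["score"])
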

JSONLINES response formats


Binary Classification


{"score": 0.4, "predicted_label": 0}

Multiclass Classification

{"score": [0.1, 0.2, 0.4, 0.3], "predicted_label": 2}

Regression

{"score": 0.4}

RECORDIO response formats

Binary Classification

[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]

Multiclass Classification

[
    Record = {
        "features": [],
        "label": {
            "score": {
                "values": [0.1, 0.2, 0.3, 0.4]
            },
            "predicted_label": {
                "values": [3]
            }
        },
        "uid": "abc123",
        "metadata": "{created_at: '2017-06-03'}"
    }
]

Regression

[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }
        }
    }
]

TabTransformer
TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The
TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers
transform the embeddings of categorical features into robust contextual embeddings to achieve higher
prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly
robust against both missing and noisy data features, and provide better interpretability.

How to use SageMaker TabTransformer

You can use TabTransformer as an Amazon SageMaker built-in algorithm. The following section
describes how to use TabTransformer with the SageMaker Python SDK. For information on how to use
TabTransformer from the Amazon SageMaker Studio UI, see SageMaker JumpStart (p. 47).

• Use TabTransformer as a built-in algorithm

Use the TabTransformer built-in algorithm to build a TabTransformer training container as shown in the following code example. You can automatically retrieve the TabTransformer built-in algorithm image URI using the SageMaker image_uris.retrieve API (or the get_image_uri API if using Amazon SageMaker Python SDK version 2).

After specifying the TabTransformer image URI, you can use the TabTransformer container to construct
an estimator using the SageMaker Estimator API and initiate a training job. The TabTransformer built-
in algorithm runs in script mode, but the training script is provided for you and there is no need to
replace it. If you have extensive experience using script mode to create a SageMaker training job, then
you can incorporate your own TabTransformer training scripts.

from sagemaker import image_uris, model_uris, script_uris

train_model_id, train_model_version, train_scope = (
    "pytorch-tabtransformerclassification-model",
    "*",
    "training",
)
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)

train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

# Sample training data is available in this bucket.
# aws_region is assumed to be defined earlier (for example, "us-west-2").
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/tabular_binary/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"

# sess is assumed to be a sagemaker.Session() created earlier.
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

from sagemaker import hyperparameters

# Retrieve the default hyperparameters for training the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["n_epochs"] = "50"
print(hyperparameters)

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")

# Create a SageMaker Estimator instance.
# aws_role is assumed to be the IAM execution role ARN defined earlier.
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker training job by passing the S3 path of the training data
tabular_estimator.fit(
    {
        "training": training_dataset_s3_path,
        "validation": validation_dataset_s3_path,
    },
    logs=True,
    job_name=training_job_name,
)

For more information about how to set up the TabTransformer as a built-in algorithm, see the
following notebook examples.
• Tabular classification with Amazon SageMaker TabTransformer algorithm
• Tabular regression with Amazon SageMaker TabTransformer algorithm

Input and Output interface for the TabTransformer algorithm

TabTransformer operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.

The SageMaker implementation of TabTransformer supports CSV for training and inference:

• For Training ContentType, valid inputs must be text/csv.
• For Inference ContentType, valid inputs must be text/csv.


Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record.
For CSV inference, the algorithm assumes that CSV input does not have the label column.
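
For example, the following sketch sends two observations to a hypothetical TabTransformer endpoint. The payload contains feature columns only, with no header row and no label column; the endpoint name and feature values are placeholders:

import boto3

runtime = boto3.client("sagemaker-runtime")

# Two observations, one per line, features only.
payload = "34,120.5,1,0\n52,98.2,0,1"

response = runtime.invoke_endpoint(
    EndpointName="tabtransformer-endpoint",  # placeholder
    ContentType="text/csv",
    Body=payload,
)

print(response["Body"].read().decode())
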

Input format for training data, validation data, and categorical features

Be mindful of how to format your training data for input to the TabTransformer model. You must
provide the path to an Amazon S3 bucket that contains your training and validation data. You can also
include a list of categorical features. Use both the training and validation channels to provide your
input data. Alternatively, you can use only the training channel.

Use both the training and validation channels

You can provide your input data by way of two S3 paths, one for the training channel and one for the
validation channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a
full S3 path pointing to one specific CSV file. The target variables should be in the first column of your
CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are
provided for the training or validation channels, the TabTransformer algorithm concatenates the
files. The validation data is used to compute a validation score at the end of each boosting iteration.
Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named
categorical_index.json in the same location as your training data file or files. If you provide a JSON
file for categorical features, your training channel must point to an S3 prefix and not a specific CSV
file. This file should contain a Python dictionary where the key is the string "cat_index_list" and
the value is a list of unique integers. Each integer in the value list should indicate the column index of
the corresponding categorical features in your training data CSV file. Each value should be a positive
integer (greater than zero because zero represents the target value), less than the Int32.MaxValue
(2147483647), and less than the total number of columns. There should only be one categorical index
JSON file.
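
For example, if columns 1, 3, and 7 of your training CSV hold categorical features (hypothetical indices; column 0 is the target), the file would contain {"cat_index_list": [1, 3, 7]}. A minimal sketch that writes it:

import json

# Hypothetical example: columns 1, 3, and 7 are categorical features.
categorical_index = {"cat_index_list": [1, 3, 7]}

with open("categorical_index.json", "w") as f:
    json.dump(categorical_index, f)
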

Use only the training channel

You can alternatively provide your input data by way of a single S3 path for the training channel. This
S3 path should point to a directory with a subdirectory named training/ that contains one or more
CSV files. You can optionally include another subdirectory in the same location called validation/ that
also has one or more CSV files. If the validation data is not provided, then 20% of your training data is
randomly sampled to serve as the validation data. If your predictors include categorical features, you can
provide a JSON file named categorical_index.json in the same location as your data subdirectories.
Note
For CSV training input mode, the total memory available to the algorithm (instance count
multiplied by the memory available in the InstanceType) must be able to hold the training
dataset.

Amazon EC2 instance recommendation for the TabTransformer algorithm

SageMaker TabTransformer supports single-instance CPU and single-instance GPU training. Despite
higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage
of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker
TabTransformer currently does not support multi-GPU training.

TabTransformer sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker TabTransformer algorithm.


Notebook Title Description

Tabular classification with Amazon SageMaker This notebook demonstrates the use of the
TabTransformer algorithm Amazon SageMaker TabTransformer algorithm to
train and host a tabular classification model.

Tabular regression with Amazon SageMaker This notebook demonstrates the use of the
TabTransformer algorithm Amazon SageMaker TabTransformer algorithm to
train and host a tabular regression model.

For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. To open a notebook, choose its Use tab and choose Create copy.

How TabTransformer works

TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The
TabTransformer is built upon self-attention-based Transformers. The Transformer layers transform the
embeddings of categorical features into robust contextual embeddings to achieve higher prediction
accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust
against both missing and noisy data features, and provide better interpretability.

TabTransformer performs well in machine learning competitions because of its robust handling of a
variety of data types, relationships, distributions, and the diversity of hyperparameters that you can
fine-tune. You can use TabTransformer for regression, classification (binary and multiclass), and ranking
problems.

The following diagram illustrates the TabTransformer architecture.


For more information, see TabTransformer: Tabular Data Modeling Using Contextual Embeddings.


TabTransformer hyperparameters
The following table contains the subset of hyperparameters that are required or most commonly
used for the Amazon SageMaker TabTransformer algorithm. Users set these parameters to facilitate
the estimation of model parameters from data. The SageMaker TabTransformer algorithm is an
implementation of the open-source TabTransformer package.
Note
The default hyperparameters are based on example datasets in the TabTransformer sample
notebooks (p. 1364).

The SageMaker TabTransformer algorithm automatically chooses an evaluation metric and objective
function based on the type of classification problem. The TabTransformer algorithm detects the type
of classification problem based on the number of labels in your data. For regression problems, the
evaluation metric is r square and the objective function is mean square error. For binary classification
problems, the evaluation metric and objective function are both binary cross entropy. For multiclass
classification problems, the evaluation metric and objective function are both multiclass cross entropy.
Note
The TabTransformer evaluation metric and objective functions are not currently available as
hyperparameters. Instead, the SageMaker TabTransformer built-in algorithm automatically
detects the type of classification task (regression, binary, or multiclass) based on the number of
unique integers in the label column and assigns an evaluation metric and objective function.

Parameter Name Description

n_epochs Number of epochs to train the deep neural network.

Valid values: integer, range: Positive integer.

Default value: 5.

patience Training stops if one metric on one validation dataset does not improve in the last patience rounds.

Valid values: integer, range: (2, 60).

Default value: 10.

learning_rate The rate at which the model weights are updated after working
through each batch of training examples.

Valid values: float, range: Positive floating point number.

Default value: 0.001.

batch_size The number of examples propagated through the network.

Valid values: integer, range: (1, 2048).

Default value: 256.

input_dim The dimension of the embeddings used to encode the categorical and/or continuous columns.

Valid values: string, any of the following: "16", "32", "64", "128",
"256", or "512".

Default value: "32".

n_blocks The number of Transformer encoder blocks.

Valid values: integer, range: (1, 12).

Default value: 4.

attn_dropout Dropout rate applied to the Multi-Head Attention layers.

Valid values: float, range: (0, 1).

Default value: 0.2.

mlp_dropout Dropout rate applied to the FeedForward network within the encoder layers and the final MLP layers on top of the Transformer encoders.

Valid values: float, range: (0, 1).

Default value: 0.1.

frac_shared_embed The fraction of embeddings shared by all the different categories for one particular column.

Valid values: float, range: (0, 1).

Default value: 0.25.

Tune a TabTransformer model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. Model
tuning focuses on the following hyperparameters:
Note
The learning objective function and evaluation metric are both automatically assigned based on
the type of classification task, which is determined by the number of unique integers in the label
column. For more information, see TabTransformer hyperparameters (p. 1367).

• A learning objective function to optimize during model training
• An evaluation metric that is used to evaluate model performance during validation
• A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your chosen hyperparameters to find the combination of values that
results in a model that optimizes the chosen evaluation metric.
Note
Automatic model tuning for TabTransformer is only available from the Amazon SageMaker
SDKs, not from the SageMaker console.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Evaluation metrics computed by the TabTransformer algorithm

The SageMaker TabTransformer algorithm computes the following metrics to use for model validation.
The evaluation metric is automatically assigned based on the type of classification task, which is
determined by the number of unique integers in the label column.


Metric Name Description Optimization Direction Regex Pattern

r2 r square (regression) maximize "metrics={'r2': (\\S+)}"

f1_score F1 score (binary classification) maximize "metrics={'f1': (\\S+)}"

accuracy_score accuracy (multiclass classification) maximize "metrics={'accuracy': (\\S+)}"

Tunable TabTransformer hyperparameters

Tune the TabTransformer model with the following hyperparameters. The hyperparameters that
have the greatest effect on optimizing the TabTransformer evaluation metrics are: learning_rate,
input_dim, n_blocks, attn_dropout, mlp_dropout, and frac_shared_embed. For a list of all the
TabTransformer hyperparameters, see TabTransformer hyperparameters (p. 1367).

Parameter Name       Parameter Type               Recommended Ranges

learning_rate        ContinuousParameterRanges    MinValue: 0.001, MaxValue: 0.01

input_dim            CategoricalParameterRanges   [16, 32, 64, 128, 256, 512]

n_blocks             IntegerParameterRanges       MinValue: 1, MaxValue: 12

attn_dropout         ContinuousParameterRanges    MinValue: 0.0, MaxValue: 0.8

mlp_dropout          ContinuousParameterRanges    MinValue: 0.0, MaxValue: 0.8

frac_shared_embed    ContinuousParameterRanges    MinValue: 0.0, MaxValue: 0.5
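
The following is a minimal sketch of wiring these recommended ranges into automatic model tuning
with the SageMaker Python SDK. It assumes that you have already constructed a TabTransformer
estimator (for example, through SageMaker JumpStart) and that train_input and validation_input
point to your S3 data channels; the job counts shown are illustrative only.

from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
    CategoricalParameter,
)

# ranges taken from the table above
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.001, 0.01),
    "input_dim": CategoricalParameter(["16", "32", "64", "128", "256", "512"]),
    "n_blocks": IntegerParameter(1, 12),
    "attn_dropout": ContinuousParameter(0.0, 0.8),
    "mlp_dropout": ContinuousParameter(0.0, 0.8),
    "frac_shared_embed": ContinuousParameter(0.0, 0.5),
}

tuner = HyperparameterTuner(
    estimator=estimator,                  # your TabTransformer estimator
    objective_metric_name="f1",           # name and regex from the metrics table above
    metric_definitions=[{"Name": "f1", "Regex": "metrics={'f1': (\\S+)}"}],
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": train_input, "validation": validation_input})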

XGBoost Algorithm
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation
of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that
attempts to accurately predict a target variable by combining an ensemble of estimates from a set of
simpler and weaker models. The XGBoost algorithm performs well in machine learning competitions
because it robustly handles a variety of data types, relationships, and distributions, and because of the
variety of hyperparameters that you can fine-tune. You can use XGBoost for regression, classification
(binary and multiclass), and ranking problems.

You can use the new release of the XGBoost algorithm either as an Amazon SageMaker built-in algorithm
or as a framework to run training scripts in your local environments. This implementation has a smaller
memory footprint, better logging, improved hyperparameter validation, and a larger set of metrics than
the original versions. It provides an XGBoost estimator that executes a training script in a
managed XGBoost environment. The current release of SageMaker XGBoost is based on the original
XGBoost versions 1.0, 1.2, 1.3, 1.5, and 1.7.


Supported versions

• Framework (open source) mode: 1.0-1, 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1
• Algorithm mode: 1.0-1, 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1

Warning
Due to required compute capacity, version 1.7-1 of SageMaker XGBoost is not compatible with
GPU instances from the P2 instance family for training or inference.
Important
When you retrieve the SageMaker XGBoost image URI, do not use :latest or :1 for the image
URI tag. You must specify one of the Supported versions (p. 1370) to choose the SageMaker-
managed XGBoost container with the native XGBoost package version that you want to use. To
find the package version migrated into the SageMaker XGBoost containers, see Docker Registry
Paths and Example Code, choose your AWS Region, and navigate to the XGBoost (algorithm)
section.
Warning
The XGBoost 0.90 versions are deprecated. Support for security updates and bug fixes for
XGBoost 0.90 is discontinued. We highly recommend that you upgrade to one of the newer
XGBoost versions.
Note
XGBoost v1.1 is not supported on SageMaker because XGBoost 1.1 has a broken capability to
run prediction when the test input has fewer features than the training data in LIBSVM inputs.
This capability has been restored in XGBoost v1.2. Consider using SageMaker XGBoost 1.2-2 or
later.

How to Use SageMaker XGBoost

With SageMaker, you can use XGBoost as a built-in algorithm or framework. By using XGBoost as a
framework, you have more flexibility and access to more advanced scenarios, such as k-fold cross-
validation, because you can customize your own training scripts. The following sections describe how to
use XGBoost with the SageMaker Python SDK. For information on how to use XGBoost from the Amazon
SageMaker Studio UI, see SageMaker JumpStart (p. 47).

• Use XGBoost as a framework

Use XGBoost as a framework to run your customized training scripts that can incorporate additional
data processing into your training jobs. The following code example shows how the SageMaker
Python SDK provides the XGBoost API as a framework, in the same way it provides other framework
APIs, such as TensorFlow, MXNet, and PyTorch.

import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
"max_depth":"5",
"eta":"0.2",
"gamma":"4",
"min_child_weight":"6",
"subsample":"0.7",
"verbosity":"1",
"objective":"reg:squarederror",
"num_round":"50"}


# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-framework'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-framework')

# construct a SageMaker XGBoost estimator
# specify the entry_point to your xgboost training script
estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py",
framework_version='1.7-1',
hyperparameters=hyperparameters,
role=sagemaker.get_execution_role(),
instance_count=1,
instance_type='ml.m5.2xlarge',
output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'),
content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

For an end-to-end example of using SageMaker XGBoost as a framework, see Regression with Amazon
SageMaker XGBoost.
• Use XGBoost as a built-in algorithm

Use the XGBoost built-in algorithm to build an XGBoost training container as shown in the following
code example. You can automatically retrieve the XGBoost built-in algorithm image URI using the
SageMaker image_uris.retrieve API (or the get_image_uri API if using Amazon SageMaker
Python SDK version 1). To verify that the image_uris.retrieve API finds the correct URI,
see Common parameters for built-in algorithms and look up xgboost from the full list of built-in
algorithm image URIs and available regions.

After specifying the XGBoost image URI, you can use the XGBoost container to construct an estimator
using the SageMaker Estimator API and initiate a training job. This XGBoost built-in algorithm mode
does not incorporate your own XGBoost training script and runs directly on the input datasets.
Important
When you retrieve the SageMaker XGBoost image URI, do not use :latest or :1 for the
image URI tag. You must specify one of the Supported versions (p. 1370) to choose the
SageMaker-managed XGBoost container with the native XGBoost package version that you
want to use. To find the package version migrated into the SageMaker XGBoost containers,
see Docker Registry Paths and Example Code, choose your AWS Region, and navigate to the
XGBoost (algorithm) section.

import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
"max_depth":"5",
"eta":"0.2",
"gamma":"4",
"min_child_weight":"6",
"subsample":"0.7",
"objective":"reg:squarederror",


"num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-built-in-algo')

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
hyperparameters=hyperparameters,
role=sagemaker.get_execution_role(),
instance_count=1,
instance_type='ml.m5.2xlarge',
volume_size=5, # 5 GB
output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'),
content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

For more information about how to set up the XGBoost as a built-in algorithm, see the following
notebook examples.
• Managed Spot Training for XGBoost
• Regression with Amazon SageMaker XGBoost (Parquet input)

Input/Output Interface for the XGBoost Algorithm

Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.

The SageMaker implementation of XGBoost supports the following data formats for training and
inference:

• text/libsvm (default)
• text/csv
• application/x-parquet
• application/x-recordio-protobuf

Note
There are a few considerations to be aware of regarding training and inference input:

• For training with columnar input, the algorithm assumes that the target variable (label) is the
first column. For inference, the algorithm assumes that the input has no label column.
• For CSV data, the input should not have a header record.
• For LIBSVM training, the algorithm assumes that subsequent columns after the label column
  contain the zero-based index value pairs for features. So each row has the format: <label>
  <index0>:<value0> <index1>:<value1> ...


• For information on instance types and distributed training, see EC2 Instance Recommendation
for the XGBoost Algorithm (p. 1373).

For CSV training input mode, the total memory available to the algorithm (Instance Count * the memory
available in the InstanceType) must be able to hold the training dataset. For libsvm training input
mode, it's not required, but we recommend it.

For v1.3-1 and later, SageMaker XGBoost saves the model in the XGBoost internal binary format, using
Booster.save_model. Previous versions use the Python pickle module to serialize/deserialize the
model.
Note
Be mindful of versions when using a SageMaker XGBoost model in open source XGBoost.
Versions 1.3-1 and later use the XGBoost internal binary format, while previous versions use the
Python pickle module.

To use a model trained with SageMaker XGBoost v1.3-1 or later in open source XGBoost

• Use the following Python code:

import xgboost as xgb

# model_file_path points to the model file extracted from the model.tar.gz
# artifact; dtest is an xgboost.DMatrix built from your test data
xgb_model = xgb.Booster()
xgb_model.load_model(model_file_path)
xgb_model.predict(dtest)

To use a model trained with previous versions of SageMaker XGBoost in open source XGBoost

• Use the following Python code:

import pickle as pkl
import tarfile

# extract the model artifact downloaded from S3
t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()

# model_file_path points to the extracted pickled model file
model = pkl.load(open(model_file_path, 'rb'))

# prediction with test data (dtest is your test dataset)
pred = model.predict(dtest)

Use instance weight support to differentiate the importance of labeled data points

• SageMaker XGBoost allows customers to differentiate the importance of labeled data points
  by assigning each instance a weight value. For text/libsvm input, customers can assign weight
  values to data instances by attaching them after the labels. For example, label:weight
  idx_0:val_0 idx_1:val_1.... For text/csv input, customers need to turn on the csv_weights
  flag in the parameters and attach weight values in the column after labels. For example:
  label,weight,val_0,val_1,...
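
The following is a minimal sketch of the CSV case, assuming the estimator and hyperparameters from
the built-in algorithm example above and a training channel whose content type is text/csv. The
csv_weights flag tells XGBoost to read the second CSV column as per-instance weights; the sample
row shown is illustrative only.

# enable instance weights for CSV input
hyperparameters["csv_weights"] = "1"
estimator.set_hyperparameters(**hyperparameters)

# each row of the training CSV is then: label,weight,feature_0,feature_1,...
# for example, "1,0.5,3.2,7.1" trains on this instance with half the default weight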

EC2 Instance Recommendation for the XGBoost Algorithm

SageMaker XGBoost supports CPU and GPU training and inference. Instance recommendations depend
on training and inference needs, as well as the version of the XGBoost algorithm. Choose one of the
following options for more information:


• CPU training (p. 1374)
• GPU training (p. 1374)
• Distributed CPU training (p. 1374)
• Distributed GPU training (p. 1374)
• Inference (p. 1375)

Training

The SageMaker XGBoost algorithm supports CPU and GPU training.

CPU training

SageMaker XGBoost 1.0-1 or earlier only trains using CPUs. It is a memory-bound (as opposed to
compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice
than a compute-optimized instance (for example, C4). Further, we recommend that you have enough
total memory in selected instances to hold the training data. Although it supports the use of disk space
to handle data that does not fit into main memory (the out-of-core feature available with the libsvm
input mode), writing cache files onto disk slows the algorithm processing time.

GPU training

SageMaker XGBoost version 1.2-2 or later supports GPU training. Despite higher per-instance costs,
GPUs train more quickly, making them more cost effective.

SageMaker XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.

SageMaker XGBoost version 1.7-1 or later supports P3, G4dn, and G5 GPU instance families. Note that
due to compute capacity requirements, version 1.7-1 or later does not support the P2 instance family.

To take advantage of GPU training, specify the instance type as one of the GPU instances (for example,
P3) and set the tree_method hyperparameter to gpu_hist in your existing XGBoost script.
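
The following is a minimal sketch, assuming the built-in algorithm estimator and hyperparameters
from the earlier example; the instance type shown is illustrative only.

# enable the GPU histogram tree method and request a GPU instance
hyperparameters["tree_method"] = "gpu_hist"

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.p3.2xlarge', # GPU instance
                                          output_path=output_path)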

Distributed training

SageMaker XGBoost supports CPU and GPU instances for distributed training.

Distributed CPU training

To run CPU training on multiple instances, set the instance_count parameter for the estimator to a
value greater than one. The input data must be divided among the total number of instances.

Divide input data across instances

Divide the input data using the following steps:

1. Break the input data down into smaller files. The number of files should be at least equal to the
number of instances used for distributed training. Using multiple smaller files as opposed to one
large file also decreases the data download time for the training job.
2. When creating your TrainingInput, set the distribution parameter to ShardedByS3Key, as
   shown in the sketch after this list. This parameter ensures that each instance gets approximately
   1/n of the number of files in S3 if there are n instances specified in the training job.
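
The following is a minimal sketch of step 2, assuming the bucket and prefix variables from the
earlier examples, an estimator constructed with instance_count greater than one, and CSV training
data already split into multiple files under the train/ prefix.

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type="text/csv",
    distribution="ShardedByS3Key",  # each instance receives roughly 1/n of the files
)

estimator.fit({"train": train_input})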

Distributed GPU training

You can use distributed training with either single-GPU or multi-GPU instances.


Distributed training with single-GPU instances

SageMaker XGBoost versions 1.2-2 through 1.3-1 only support single-GPU instance training. This means
that even if you select a multi-GPU instance, only one GPU is used per instance.

If you use XGBoost versions 1.2-2 through 1.3-1, or if you do not need to use multi-GPU instances, then
you must divide your input data between the total number of instances. For more information, see Divide
input data across instances (p. 1374).
Note
Versions 1.2-2 through 1.3-1 of SageMaker XGBoost only use one GPU per instance even if you
choose a multi-GPU instance.

Distributed training with multi-GPU instances

Starting with version 1.5-1, SageMaker XGBoost offers distributed GPU training with Dask. With Dask you
can utilize all GPUs when using one or more multi-GPU instances. Dask also works when using single-
GPU instances.

Train with Dask using the following steps:

1. Either omit the distribution parameter in your TrainingInput or set it to FullyReplicated.
2. When defining your hyperparameters, set use_dask_gpu_training to "true".

Important
Distributed training with Dask only supports CSV and Parquet input formats. If you use other
data formats such as LIBSVM or PROTOBUF, the training job fails.
For Parquet data, ensure that the column names are saved as strings. Columns that have names
of other data types will fail to load.
Important
Distributed training with Dask does not support pipe mode. If pipe mode is specified, the
training job fails.

There are a few considerations to be aware of when training SageMaker XGBoost with Dask. Be sure to
split your data into smaller files. Dask reads each Parquet file as a partition. There is a Dask worker for
every GPU, so the number of files should be greater than the total number of GPUs (instance count *
number of GPUs per instance). Having a very large number of files can also degrade performance. For
more information, see Dask Best Practices.
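
The following is a minimal sketch of Dask-based distributed GPU training, assuming the
xgboost_container (version 1.5-1 or later), hyperparameters, role, bucket, prefix, and output_path
from the earlier examples, and CSV data already split into multiple files. The instance type and counts
are illustrative only.

# enable Dask GPU training (versions 1.5-1 and later)
hyperparameters["use_dask_gpu_training"] = "true"

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=2,
                                          instance_type='ml.g5.12xlarge', # multi-GPU instance
                                          output_path=output_path)

# Dask training requires FullyReplicated input (the default) and CSV or Parquet data
train_input = TrainingInput("s3://{}/{}/train/".format(bucket, prefix),
                            content_type="text/csv",
                            distribution="FullyReplicated")

estimator.fit({'train': train_input})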

Variations in output

The specified tree_method hyperparameter determines the algorithm that is used for XGBoost training.
The tree methods approx, hist, and gpu_hist are all approximate methods that use sketching for
quantile calculation. For more information, see Tree Methods in the XGBoost documentation. Sketching
is an approximate algorithm. Therefore, you can expect variations in the model depending on factors
such as the number of workers chosen for distributed training. The significance of the variation is data-
dependent.

Inference

SageMaker XGBoost supports CPU and GPU instances for inference. For information about the instance
types for inference, see Amazon SageMaker ML Instance Types.

XGBoost Sample Notebooks

The following table outlines a variety of sample notebooks that address different use cases of Amazon
SageMaker XGBoost algorithm.


Notebook Title and Description

How to Create a Custom XGBoost container?
    This notebook shows you how to build a custom XGBoost container with Amazon SageMaker Batch
    Transform.

Regression with XGBoost using Parquet
    This notebook shows you how to use the Abalone dataset in Parquet to train an XGBoost model.

How to Train and Host a Multiclass Classification Model?
    This notebook shows how to use the MNIST dataset to train and host a multiclass classification
    model.

How to train a Model for Customer Churn Prediction?
    This notebook shows you how to train a model to predict mobile customer departure in an effort
    to identify unhappy customers.

An Introduction to Amazon SageMaker Managed Spot infrastructure for XGBoost Training
    This notebook shows you how to use Spot Instances for training with an XGBoost container.

How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs?
    This notebook shows you how to use Amazon SageMaker Debugger to monitor training jobs to
    detect inconsistencies using built-in debugging rules.

How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs in Real-Time?
    This notebook shows you how to use the MNIST dataset and Amazon SageMaker Debugger to
    perform real-time analysis of XGBoost training jobs while training jobs are running.

For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created
a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all of the
SageMaker samples. The example notebooks that use the XGBoost algorithm are located in the
Introduction to Amazon algorithms section. To open a notebook, choose its Use tab and choose
Create copy.

How XGBoost Works

XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target
variable by combining the estimates of a set of simpler, weaker models.

When using gradient boosting for regression, the weak learners are regression trees, and each regression
tree maps an input data point to one of its leaves, which contains a continuous score. XGBoost minimizes a
regularized (L1 and L2) objective function that combines a convex loss function (based on the difference
between the predicted and target outputs) and a penalty term for model complexity (in other words, the
regression tree functions). The training proceeds iteratively, adding new trees that predict the residuals
or errors of prior trees that are then combined with previous trees to make the final prediction. It's called
gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new
models.

The following is a brief illustration of how gradient tree boosting works.
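
The following is a minimal numeric sketch (not SageMaker-specific) of the idea: for squared-error
loss, the ensemble starts from a constant prediction, each new tree is fit to the residual errors of
the current ensemble, and its predictions are added back with a learning-rate scaling. The residual
values for the hypothetical tree are illustrative only.

import numpy as np

y = np.array([3.0, 5.0, 8.0])        # targets
f0 = np.full_like(y, y.mean())       # initial prediction: the mean
residuals = y - f0                   # errors the next tree must predict

# suppose a small regression tree learns these residuals approximately
tree1_pred = np.array([-2.0, -0.5, 2.5])

eta = 0.3                            # learning rate (step size shrinkage)
f1 = f0 + eta * tree1_pred           # updated ensemble prediction

print(residuals)                     # [-2.333... -0.333...  2.666...]
print(f1)                            # [4.733... 5.183... 6.083...]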


For more detail on XGBoost, see:

• XGBoost: A Scalable Tree Boosting System
• Gradient Tree Boosting
• Introduction to Boosted Trees

XGBoost Hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used
for the Amazon SageMaker XGBoost algorithm. These are parameters that are set by users to facilitate
the estimation of model parameters from data. The required hyperparameters that must be set are
listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in
alphabetical order. The SageMaker XGBoost algorithm is an implementation of the open-source DMLC
XGBoost package. For details about the full set of hyperparameters that can be configured for this
version of XGBoost, see XGBoost Parameters.

Parameter Name Description

num_class The number of classes.

Required if objective is set to multi:softmax or multi:softprob.

Valid values: Integer.

num_round The number of rounds to run the training.

Required

Valid values: Integer.

alpha L1 regularization term on weights. Increasing this value makes models more conservative.

Optional

Valid values: Float.

Default value: 0

base_score The initial prediction score of all instances, global bias.

Optional

Valid values: Float.

Default value: 0.5

booster Which booster to use. The gbtree and dart values use a tree-
based model, while gblinear uses a linear function.

Optional

Valid values: String. One of "gbtree", "gblinear", or "dart".

Default value: "gbtree"

colsample_bylevel Subsample ratio of columns for each split, in each level.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

colsample_bynode Subsample ratio of columns from each node.

Optional

Valid values: Float. Range: (0,1].

Default value: 1

colsample_bytree Subsample ratio of columns when constructing each tree.

Optional

Valid values: Float. Range: [0,1].

Default value: 1


csv_weights When this flag is enabled, XGBoost differentiates the importance of instances for csv
input by taking the second column (the column after labels) in training data as the instance weights.

Optional

Valid values: 0 or 1

Default value: 0

deterministic_histogram When this flag is enabled, XGBoost builds histograms on the GPU
deterministically. Used only if tree_method is set to gpu_hist.

For a full list of valid inputs, refer to XGBoost Parameters.

Optional

Valid values: String. Range: "true" or "false".

Default value: "true"

early_stopping_rounds The model trains until the validation score stops improving. Validation error
needs to decrease at least every early_stopping_rounds to continue training. SageMaker hosting
uses the best model for inference.

Optional

Valid values: Integer.

Default value: -

eta Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can
directly get the weights of new features. The eta parameter actually shrinks the feature weights to
make the boosting process more conservative.

Optional

Valid values: Float. Range: [0,1].

Default value: 0.3

eval_metric Evaluation metrics for validation data. A default metric is assigned according to the
objective:

• rmse: for regression
• error: for classification
• map: for ranking

For a list of valid inputs, see XGBoost Learning Task Parameters.

Optional

Valid values: String.

Default value: Default according to objective.


gamma Minimum loss reduction required to make a further partition on a leaf node of the tree. The
larger the value, the more conservative the algorithm is.

Optional

Valid values: Float. Range: [0,∞).

Default value: 0

grow_policy Controls the way that new nodes are added to the tree. Currently
supported only if tree_method is set to hist.

Optional

Valid values: String. Either "depthwise" or "lossguide".

Default value: "depthwise"

interaction_constraints Specify groups of variables that are allowed to interact.

Optional

Valid values: Nested list of integers. Each integer represents a feature, and each nested list contains
features that are allowed to interact, e.g., [[1,2], [3,4,5]].

Default value: None

lambda L2 regularization term on weights. Increasing this value makes models more conservative.

Optional

Valid values: Float.

Default value: 1

lambda_bias L2 regularization term on bias.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0

max_bin Maximum number of discrete bins to bucket continuous features. Used only if tree_method
is set to hist.

Optional

Valid values: Integer.

Default value: 256


max_delta_step Maximum delta step allowed for each tree's weight estimation.
When a positive integer is used, it helps make the update more
conservative. The preferred option is to use it in logistic regression.
Set it to 1-10 to help control the update.

Optional

Valid values: Integer. Range: [0,∞).

Default value: 0

max_depth Maximum depth of a tree. Increasing this value makes the model
more complex and likely to be overfit. 0 indicates no limit. A limit is
required when grow_policy is set to depthwise.

Optional

Valid values: Integer. Range: [0,∞)

Default value: 6

max_leaves Maximum number of nodes to be added. Relevant only if grow_policy is set to
lossguide.

Optional

Valid values: Integer.

Default value: 0

min_child_weight Minimum sum of instance weight (hessian) needed in a child. If the tree partition
step results in a leaf node with the sum of instance weight less than min_child_weight, the
building process gives up further partitioning. In linear regression models, this simply corresponds
to a minimum number of instances needed in each node. The larger the value, the more conservative
the algorithm is.

Optional

Valid values: Float. Range: [0,∞).

Default value: 1

monotone_constraints Specifies monotonicity constraints on any feature.

Optional

Valid values: Tuple of integers. Valid integers: -1 (decreasing constraint), 0 (no constraint),
1 (increasing constraint).

E.g., (0, 1): No constraint on the first predictor, and an increasing constraint on the second.
(-1, 1): Decreasing constraint on the first predictor, and an increasing constraint on the second.

Default value: (0, 0)


normalize_type Type of normalization algorithm.

Optional

Valid values: Either tree or forest.

Default value: tree

nthread Number of parallel threads used to run xgboost.

Optional

Valid values: Integer.

Default value: Maximum number of threads.

objective Specifies the learning task and the corresponding learning objective. Examples:
reg:logistic, multi:softmax, reg:squarederror. For a full list of valid inputs, refer to
XGBoost Learning Task Parameters.

Optional

Valid values: String

Default value: "reg:squarederror"

one_drop When this flag is enabled, at least one tree is always dropped
during the dropout.

Optional

Valid values: 0 or 1

Default value: 0

process_type The type of boosting process to run.

Optional

Valid values: String. Either "default" or "update".

Default value: "default"

rate_drop The dropout rate that specifies the fraction of previous trees to
drop during the dropout.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0.0


refresh_leaf This is a parameter of the 'refresh' updater plug-in. When set to true (1), tree leaves
and tree node stats are updated. When set to false (0), only tree node stats are updated.

Optional

Valid values: 0/1

Default value: 1

sample_type Type of sampling algorithm.

Optional

Valid values: Either uniform or weighted.

Default value: uniform

scale_pos_weight Controls the balance of positive and negative weights. It's useful
for unbalanced classes. A typical value to consider: sum(negative
cases) / sum(positive cases).

Optional

Valid values: float

Default value: 1

seed Random number seed.

Optional

Valid values: integer

Default value: 0

single_precision_histogram When this flag is enabled, XGBoost uses single precision to build
histograms instead of double precision. Used only if tree_method
is set to hist or gpu_hist.

For a full list of valid inputs, please refer to XGBoost Parameters.

Optional

Valid values: String. Range: "true" or "false"

Default value: "false"

sketch_eps Used only for the approximate greedy algorithm. This translates into O(1 / sketch_eps)
number of bins. Compared to directly selecting the number of bins, this comes with a theoretical
guarantee of sketch accuracy.

Optional

Valid values: Float, Range: [0, 1].

Default value: 0.03


skip_drop Probability of skipping the dropout procedure during a boosting iteration.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0.0

subsample Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly
collects half of the data instances to grow trees. This prevents overfitting.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

tree_method The tree construction algorithm used in XGBoost.

Optional

Valid values: One of auto, exact, approx, hist, or gpu_hist.

Default value: auto

tweedie_variance_power Parameter that controls the variance of the Tweedie distribution.

Optional

Valid values: Float. Range: (1, 2).

Default value: 1.5

updater A comma-separated string that defines the sequence of tree updaters to run. This provides a
modular way to construct and to modify the trees.

For a full list of valid inputs, refer to XGBoost Parameters.

Optional

Valid values: comma-separated string.

Default value: grow_colmaker, prune

use_dask_gpu_training Set use_dask_gpu_training to "true" if you want to run distributed GPU
training with Dask. Dask GPU training is only supported for versions 1.5-1 and later. Do not set this
value to "true" for versions preceding 1.5-1. For more information, see Distributed GPU
training (p. 1374).

Optional

Valid values: String. Range: "true" or "false"

Default value: "false"


verbosity Verbosity of printing messages.

Optional

Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug).

Default value: 1

Tune an XGBoost Model


Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. You
choose three types of hyperparameters:

• a learning objective function to optimize during model training
• an eval_metric to use to evaluate model performance during validation
• a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes.
Automatic model tuning searches the chosen hyperparameters to find the combination of values that
results in the model that optimizes the evaluation metric.
Note
Automatic model tuning for XGBoost 0.90 is only available from the Amazon SageMaker SDKs,
not from the SageMaker console.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Evaluation Metrics Computed by the XGBoost Algorithm

The XGBoost algorithm computes the following metrics to use for model validation. When tuning the
model, choose one of these metrics to evaluate the model. For a full list of valid eval_metric values,
refer to XGBoost Learning Task Parameters.

Metric Name            Description                                                                        Optimization Direction

validation:accuracy    Classification rate, calculated as #(right)/#(all cases).                          Maximize
validation:auc         Area under the curve.                                                              Maximize
validation:error       Binary classification error rate, calculated as #(wrong cases)/#(all cases).       Minimize
validation:f1          Indicator of classification accuracy, calculated as the harmonic mean of           Maximize
                       precision and recall.
validation:logloss     Negative log-likelihood.                                                           Minimize
validation:mae         Mean absolute error.                                                               Minimize
validation:map         Mean average precision.                                                            Maximize
validation:merror      Multiclass classification error rate, calculated as #(wrong cases)/#(all cases).   Minimize
validation:mlogloss    Negative log-likelihood for multiclass classification.                             Minimize
validation:mse         Mean squared error.                                                                Minimize
validation:ndcg        Normalized Discounted Cumulative Gain.                                             Maximize
validation:rmse        Root mean square error.                                                            Minimize

Tunable XGBoost Hyperparameters

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the
greatest effect on optimizing the XGBoost evaluation metrics are: alpha, min_child_weight,
subsample, eta, and num_round.

Parameter Name       Parameter Type               Recommended Ranges

alpha                ContinuousParameterRanges    MinValue: 0, MaxValue: 1000
colsample_bylevel    ContinuousParameterRanges    MinValue: 0.1, MaxValue: 1
colsample_bynode     ContinuousParameterRanges    MinValue: 0.1, MaxValue: 1
colsample_bytree     ContinuousParameterRanges    MinValue: 0.5, MaxValue: 1
eta                  ContinuousParameterRanges    MinValue: 0.1, MaxValue: 0.5
gamma                ContinuousParameterRanges    MinValue: 0, MaxValue: 5
lambda               ContinuousParameterRanges    MinValue: 0, MaxValue: 1000
max_delta_step       IntegerParameterRanges       [0, 10]
max_depth            IntegerParameterRanges       [0, 10]
min_child_weight     ContinuousParameterRanges    MinValue: 0, MaxValue: 120
num_round            IntegerParameterRanges       [1, 4000]
subsample            ContinuousParameterRanges    MinValue: 0.5, MaxValue: 1
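
The following is a minimal sketch, assuming the built-in algorithm estimator, train_input, and
validation_input from the earlier examples, a regression objective (so that validation:rmse is
emitted), and illustrative job counts.

from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,
    # built-in algorithms emit validation metrics automatically
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "alpha": ContinuousParameter(0, 1000),
        "min_child_weight": ContinuousParameter(0, 120),
        "subsample": ContinuousParameter(0.5, 1),
        "eta": ContinuousParameter(0.1, 0.5),
        "num_round": IntegerParameter(1, 4000),
    },
    max_jobs=20,
    max_parallel_jobs=3,
)

tuner.fit({"train": train_input, "validation": validation_input})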

Deprecated Versions of XGBoost and their Upgrades

This topic contains documentation for previous versions of Amazon SageMaker XGBoost that are
still available but deprecated. It also provides instructions on how to upgrade deprecated versions of
XGBoost, when possible, to more current versions.


Topics
• Upgrade XGBoost Version 0.90 to Version 1.5 (p. 1387)
• XGBoost Version 0.72 (p. 1388)

Upgrade XGBoost Version 0.90 to Version 1.5

If you are using the SageMaker Python SDK, to upgrade existing XGBoost 0.90 jobs to version 1.5, you
must have version 2.x of the SDK installed and change the XGBoost version and framework_version
parameters to 1.5-1. If you are using Boto3, you need to update the Docker image, and a few
hyperparameters and learning objectives.

Topics
• Upgrade SageMaker Python SDK Version 1.x to Version 2.x (p. 1387)
• Change the image tag to 1.5-1 (p. 1387)
• Change Docker Image for Boto3 (p. 1388)
• Update Hyperparameters and Learning Objectives (p. 1388)

Upgrade SageMaker Python SDK Version 1.x to Version 2.x

If you are still using version 1.x of the SageMaker Python SDK, you must upgrade to version 2.x of the
SageMaker Python SDK. For information on the latest version of the SageMaker Python SDK, see Use
Version 2.x of the SageMaker Python SDK. To install the latest version, run:

python -m pip install --upgrade sagemaker

Change the image tag to 1.5-1

If you are using the SageMaker Python SDK and using the XGBoost built-in algorithm, change the
version parameter in image_uris.retrieve.

from sagemaker import image_uris

xgboost_container = image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
hyperparameters=hyperparameters,
role=sagemaker.get_execution_role(),
instance_count=1,
instance_type='ml.m5.2xlarge',
volume_size=5, # 5 GB
output_path=output_path)

If you are using the SageMaker Python SDK and using XGBoost as a framework to run your customized
training scripts, change the framework_version parameter in the XGBoost API.

estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py",


framework_version='1.5-1',
hyperparameters=hyperparameters,
role=sagemaker.get_execution_role(),
instance_count=1,
instance_type='ml.m5.2xlarge',
output_path=output_path)

sagemaker.session.s3_input in SageMaker Python SDK version 1.x has been renamed to
sagemaker.inputs.TrainingInput. You must use sagemaker.inputs.TrainingInput as in the
following example.


content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'),
content_type=content_type)

For the full list of SageMaker Python SDK version 2.x changes, see Use Version 2.x of the SageMaker
Python SDK.

Change Docker Image for Boto3

If you are using Boto3 to train or deploy your model, change the docker image tag (1, 0.72, 0.90-1 or
0.90-2) to 1.5-1.

{
    "AlgorithmSpecification": {
        "TrainingImage": "746614075791.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:1.5-1"
    }
    ...
}

If you are using the SageMaker Python SDK to retrieve the registry path, change the version parameter
in image_uris.retrieve.

from sagemaker import image_uris

image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")

Update Hyperparameters and Learning Objectives

The silent parameter has been deprecated and is no longer available in XGBoost 1.5 and later versions.
Use verbosity instead. Similarly, the reg:linear learning objective has been deprecated in favor of
reg:squarederror. Use reg:squarederror instead.

hyperparameters = {
"verbosity": "2",
"objective": "reg:squarederror",
"num_round": "50",
...
}

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
hyperparameters=hyperparameters,
...)

XGBoost Version 0.72

Important
XGBoost 0.72 is deprecated by Amazon SageMaker. You can still use this old version of
XGBoost (as a built-in algorithm) by pulling its image URI as shown in the following code
sample. For XGBoost, the image URI ending with :1 is for the old version.

SageMaker Python SDK v1

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

1388
Amazon SageMaker Developer Guide
Use Built-in Algorithms

xgb_image_uri = get_image_uri(boto3.Session().region_name, "xgboost", repo_version="1")

SageMaker Python SDK v2

import boto3
from sagemaker import image_uris

xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")

If you want to use newer versions, you have to explicitly specify the image URI tags (see
Supported versions (p. 1370)).

This previous release of the Amazon SageMaker XGBoost algorithm is based on the 0.72 release.
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of
the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that
attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker
models. XGBoost has done remarkably well in machine learning competitions because it robustly
handles a variety of data types, relationships, and distributions, and because of the large number of
hyperparameters that can be tweaked and tuned for improved fits. This flexibility makes XGBoost a solid
choice for problems in regression, classification (binary and multiclass), and ranking.

Customers should consider using the new release of XGBoost Algorithm (p. 1369). They can use it as a
SageMaker built-in algorithm or as a framework to run scripts in their local environments, as they
would typically do with, for example, a TensorFlow deep learning framework. The new implementation
has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded
set of metrics. The earlier implementation of XGBoost remains available to customers if they need to
postpone migrating to the new version. But this previous implementation will remain tied to the 0.72
release of XGBoost.

Input/Output Interface for the XGBoost Release 0.72

Gradient boosting operates on tabular data, with the rows representing observations, one column
representing the target variable or label, and the remaining columns representing features.

The SageMaker implementation of XGBoost supports CSV and libsvm formats for training and inference:

• For Training ContentType, valid inputs are text/libsvm (default) or text/csv.
• For Inference ContentType, valid inputs are text/libsvm or (the default) text/csv.

Note
For CSV training, the algorithm assumes that the target variable is in the first column and that
the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input
does not have the label column.
For libsvm training, the algorithm assumes that the label is in the first column. Subsequent
columns contain the zero-based index value pairs for features. So each row has the format:
<label> <index0>:<value0> <index1>:<value1> ... Inference requests for libsvm may or may not
have labels in the libsvm format.

This differs from other SageMaker algorithms, which use the protobuf training input format to maintain
greater consistency with standard XGBoost data formats.

For CSV training input mode, the total memory available to the algorithm (Instance Count * the memory
available in the InstanceType) must be able to hold the training dataset. For libsvm training input
mode, it's not required, but we recommend it.


SageMaker XGBoost uses the Python pickle module to serialize and deserialize the model, so you can
use the pickle module to save and load the model.

To use a model trained with SageMaker XGBoost in open source XGBoost

• Use the following Python code:

import pickle as pkl
import tarfile
import xgboost

# extract the model artifact downloaded from S3
t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()

# model_file_path points to the extracted pickled model file
model = pkl.load(open(model_file_path, 'rb'))

# prediction with test data (dtest is your test dataset)
pred = model.predict(dtest)

Use instance weight support to differentiate the importance of labeled data points

• SageMaker XGBoost allows customers to differentiate the importance of labeled data points
  by assigning each instance a weight value. For text/libsvm input, customers can assign weight
  values to data instances by attaching them after the labels. For example, label:weight
  idx_0:val_0 idx_1:val_1.... For text/csv input, customers need to turn on the csv_weights
  flag in the parameters and attach weight values in the column after labels. For example:
  label,weight,val_0,val_1,...

EC2 Instance Recommendation for the XGBoost Release 0.72

SageMaker XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-
bound) algorithm. So, a general-purpose compute instance (for example, M4) is a better choice than
a compute-optimized instance (for example, C4). Further, we recommend that you have enough total
memory in selected instances to hold the training data. Although it supports the use of disk space to
handle data that does not fit into main memory (the out-of-core feature available with the libsvm input
mode), writing cache files onto disk slows the algorithm processing time.

XGBoost Release 0.72 Sample Notebooks

For a sample notebook that shows how to use the latest version of SageMaker XGBoost as a built-
in algorithm to train and host a regression model, see Regression with Amazon SageMaker XGBoost
algorithm. To use the 0.72 version of XGBoost, you need to change the version in the sample code to
0.72. For instructions on how to create and access Jupyter notebook instances that you can use to run
the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you have created a
notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. The example notebooks that use the XGBoost algorithms are located in the Introduction to
Amazon algorithms section. To open a notebook, click on its Use tab and select Create copy.

XGBoost Release 0.72 Hyperparameters

The following table contains the hyperparameters for the XGBoost algorithm. These are parameters
that are set by users to facilitate the estimation of model parameters from data. The required
hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters
that can be set are listed next, also in alphabetical order. The SageMaker XGBoost algorithm is an
implementation of the open-source XGBoost package. Currently SageMaker supports version 0.72. For
more detail about hyperparameter configuration for this version of XGBoost, see XGBoost Parameters.


Parameter Name Description

num_class The number of classes.

Required if objective is set to multi:softmax or multi:softprob.

Valid values: integer

num_round The number of rounds to run the training.

Required

Valid values: integer

alpha L1 regularization term on weights. Increasing this value makes models more conservative.

Optional

Valid values: float

Default value: 0

base_score The initial prediction score of all instances, global bias.

Optional

Valid values: float

Default value: 0.5

booster Which booster to use. The gbtree and dart values use a tree-
based model, while gblinear uses a linear function.

Optional

Valid values: String. One of gbtree, gblinear, or dart.

Default value: gbtree

colsample_bylevel Subsample ratio of columns for each split, in each level.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

colsample_bytree Subsample ratio of columns when constructing each tree.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

csv_weights When this flag is enabled, XGBoost differentiates the importance of instances for csv
input by taking the second column (the column after labels) in training data as the instance weights.

Optional

Valid values: 0 or 1

Default value: 0

early_stopping_rounds The model trains until the validation score stops improving. Validation error
needs to decrease at least every early_stopping_rounds to continue training. SageMaker hosting
uses the best model for inference.
hosting uses the best model for inference.

Optional

Valid values: integer

Default value: -

eta Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can
directly get the weights of new features. The eta parameter actually shrinks the feature weights to
make the boosting process more conservative.

Optional

Valid values: Float. Range: [0,1].

Default value: 0.3

eval_metric Evaluation metrics for validation data. A default metric is assigned according to the
objective:

• rmse: for regression
• error: for classification
• map: for ranking

For a list of valid inputs, see XGBoost Parameters.

Optional

Valid values: string

Default value: Default according to objective.

gamma Minimum loss reduction required to make a further partition on a leaf node of the tree. The
larger the value, the more conservative the algorithm is.

Optional

Valid values: Float. Range: [0,∞).

Default value: 0


grow_policy Controls the way that new nodes are added to the tree. Currently
supported only if tree_method is set to hist.

Optional

Valid values: String. Either depthwise or lossguide.

Default value: depthwise

lambda L2 regularization term on weights. Increasing this value makes models more conservative.

Optional

Valid values: float

Default value: 1

lambda_bias L2 regularization term on bias.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0

max_bin Maximum number of discrete bins to bucket continuous features. Used only if tree_method
is set to hist.

Optional

Valid values: integer

Default value: 256

max_delta_step Maximum delta step allowed for each tree's weight estimation.
When a positive integer is used, it helps make the update more
conservative. The preferred option is to use it in logistic regression.
Set it to 1-10 to help control the update.

Optional

Valid values: Integer. Range: [0,∞).

Default value: 0

max_depth Maximum depth of a tree. Increasing this value makes the model
more complex and likely to be overfit. 0 indicates no limit. A limit is
required when grow_policy is set to depthwise.

Optional

Valid values: Integer. Range: [0,∞)

Default value: 6


max_leaves Maximum number of nodes to be added. Relevant only if grow_policy is set to
lossguide.

Optional

Valid values: integer

Default value: 0

min_child_weight Minimum sum of instance weight (hessian) needed in a child. If the tree partition
step results in a leaf node with the sum of instance weight less than min_child_weight, the
building process gives up further partitioning. In linear regression models, this simply corresponds
to a minimum number of instances needed in each node. The larger the value, the more conservative
the algorithm is.

Optional

Valid values: Float. Range: [0,∞).

Default value: 1

normalize_type Type of normalization algorithm.

Optional

Valid values: Either tree or forest.

Default value: tree

nthread Number of parallel threads used to run xgboost.

Optional

Valid values: integer

Default value: Maximum number of threads.

objective Specifies the learning task and the corresponding learning objective. Examples:
reg:logistic, multi:softmax. For a full list of valid inputs, refer to XGBoost Parameters.

Optional

Valid values: string

Default value: reg:linear

one_drop When this flag is enabled, at least one tree is always dropped
during the dropout.

Optional

Valid values: 0 or 1

Default value: 0


process_type The type of boosting process to run.

Optional

Valid values: String. Either default or update.

Default value: default

rate_drop The dropout rate that specifies the fraction of previous trees to
drop during the dropout.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0.0

refresh_leaf This is a parameter of the 'refresh' updater plug-in. When set to true (1), tree leaves
and tree node stats are updated. When set to false (0), only tree node stats are updated.

Optional

Valid values: 0/1

Default value: 1

sample_type Type of sampling algorithm.

Optional

Valid values: Either uniform or weighted.

Default value: uniform

scale_pos_weight Controls the balance of positive and negative weights. It's useful
for unbalanced classes. A typical value to consider: sum(negative
cases) / sum(positive cases).

Optional

Valid values: float

Default value: 1

seed Random number seed.

Optional

Valid values: integer

Default value: 0


silent 0 means print running messages, 1 means silent mode.

Optional

Valid values: 0 or 1

Default value: 0

sketch_eps Used only for the approximate greedy algorithm. This translates into O(1 / sketch_eps)
number of bins. Compared to directly selecting the number of bins, this comes with a theoretical
guarantee of sketch accuracy.

Optional

Valid values: Float, Range: [0, 1].

Default value: 0.03

skip_drop Probability of skipping the dropout procedure during a boosting iteration.

Optional

Valid values: Float. Range: [0.0, 1.0].

Default value: 0.0

subsample Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly
collects half of the data instances to grow trees. This prevents overfitting.

Optional

Valid values: Float. Range: [0,1].

Default value: 1

tree_method The tree construction algorithm used in XGBoost.

Optional

Valid values: One of auto, exact, approx, or hist.

Default value: auto

tweedie_variance_power Parameter that controls the variance of the Tweedie distribution.

Optional

Valid values: Float. Range: (1, 2).

Default value: 1.5


updater A comma-separated string that defines the sequence of tree updaters to run. This provides a
modular way to construct and to modify the trees.

For a full list of valid inputs, refer to XGBoost Parameters.

Optional

Valid values: comma-separated string.

Default value: grow_colmaker, prune

Tune an XGBoost Release 0.72 Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your training and validation datasets. You
choose three types of hyperparameters:

• a learning objective function to optimize during model training
• an eval_metric to use to evaluate model performance during validation
• a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes.
Automatic model tuning searches the chosen hyperparameters to find the combination of values that
results in the model that optimizes the evaluation metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the XGBoost Release 0.72 Algorithm

The XGBoost algorithm based on version 0.72 computes the following nine metrics to use for model
validation. When tuning the model, choose one of these metrics to evaluate the model. For a full list of
valid eval_metric values, refer to XGBoost Learning Task Parameters.

Metric Name            Description                                                                        Optimization Direction

validation:auc         Area under the curve.                                                              Maximize
validation:error       Binary classification error rate, calculated as #(wrong cases)/#(all cases).       Minimize
validation:logloss     Negative log-likelihood.                                                           Minimize
validation:mae         Mean absolute error.                                                               Minimize
validation:map         Mean average precision.                                                            Maximize
validation:merror      Multiclass classification error rate, calculated as #(wrong cases)/#(all cases).   Minimize
validation:mlogloss    Negative log-likelihood for multiclass classification.                             Minimize
validation:ndcg        Normalized Discounted Cumulative Gain.                                             Maximize
validation:rmse        Root mean square error.                                                            Minimize

Tunable XGBoost Release 0.72 Hyperparameters

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the
greatest effect on optimizing the XGBoost evaluation metrics are: alpha, min_child_weight,
subsample, eta, and num_round.

Parameter Name       Parameter Type               Recommended Ranges

alpha                ContinuousParameterRanges    MinValue: 0, MaxValue: 1000
colsample_bylevel    ContinuousParameterRanges    MinValue: 0.1, MaxValue: 1
colsample_bytree     ContinuousParameterRanges    MinValue: 0.5, MaxValue: 1
eta                  ContinuousParameterRanges    MinValue: 0.1, MaxValue: 0.5
gamma                ContinuousParameterRanges    MinValue: 0, MaxValue: 5
lambda               ContinuousParameterRanges    MinValue: 0, MaxValue: 1000
max_delta_step       IntegerParameterRanges       [0, 10]
max_depth            IntegerParameterRanges       [0, 10]
min_child_weight     ContinuousParameterRanges    MinValue: 0, MaxValue: 120
num_round            IntegerParameterRanges       [1, 4000]
subsample            ContinuousParameterRanges    MinValue: 0.5, MaxValue: 1

Built-in SageMaker Algorithms for Text Data


SageMaker provides algorithms that are tailored to the analysis of textual documents used in natural
language processing, document classification or summarization, topic modeling or classification, and
language transcription or translation.

• BlazingText algorithm (p. 1399)—a highly optimized implementation of the Word2vec and text
classification algorithms that scale to large datasets easily. It is useful for many downstream natural
language processing (NLP) tasks.
• Latent Dirichlet Allocation (LDA) Algorithm (p. 1409)—an algorithm suitable for determining topics in
a set of documents. It is an unsupervised algorithm, which means that it doesn't use example data with
answers during training.
• Neural Topic Model (NTM) Algorithm (p. 1415)—another unsupervised technique for determining
topics in a set of documents, using a neural network approach.


• Object2Vec Algorithm (p. 1421)—a general-purpose neural embedding algorithm that can be used for
recommendation systems, document classification, and sentence embeddings.
• Sequence-to-Sequence Algorithm (p. 1437)—a supervised algorithm commonly used for neural
machine translation.
• Text Classification - TensorFlow (p. 1450)—a supervised algorithm that supports transfer learning with
available pretrained models for text classification.

BlazingText
  Channel name: train
  Training input mode: File or Pipe
  File type: Text file (one sentence per line with space-separated tokens)
  Instance class: GPU (single instance only) or CPU
  Parallelizable: No

LDA
  Channel name: train and (optionally) test
  Training input mode: File or Pipe
  File type: recordIO-protobuf or CSV
  Instance class: CPU (single instance only)
  Parallelizable: No

Neural Topic Model
  Channel name: train and (optionally) validation, test, or both
  Training input mode: File or Pipe
  File type: recordIO-protobuf or CSV
  Instance class: GPU or CPU
  Parallelizable: Yes

Object2Vec
  Channel name: train and (optionally) validation, test, or both
  Training input mode: File
  File type: JSON Lines
  Instance class: GPU or CPU (single instance only)
  Parallelizable: No

Seq2Seq Modeling
  Channel name: train, validation, and vocab
  Training input mode: File
  File type: recordIO-protobuf
  Instance class: GPU (single instance only)
  Parallelizable: No

Text Classification - TensorFlow
  Channel name: training and validation
  Training input mode: File
  File type: CSV
  Instance class: CPU or GPU
  Parallelizable: Yes (only across multiple GPUs on a single instance)

BlazingText algorithm
The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the
Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream
natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine
translation, etc. Text classification is an important task for applications that perform web searches,
information retrieval, ranking, and document classification.

The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector
representation of a word is called a word embedding. Words that are semantically similar correspond to
vectors that are close together. That way, word embeddings capture the semantic relationships between
words.


Many natural language processing (NLP) applications learn word embeddings by training on large
collections of documents. These pretrained vector representations provide information about semantics
and word distributions that typically improves the generalizability of other models that are later trained
on a more limited amount of data. Most implementations of the Word2vec algorithm are not optimized
for multi-core CPU architectures. This makes it difficult to scale to large datasets.

With the BlazingText algorithm, you can scale to large datasets easily. Similar to Word2vec, it
provides the Skip-gram and continuous bag-of-words (CBOW) training architectures. BlazingText's
implementation of the supervised multi-class, multi-label text classification algorithm extends the
fastText text classifier to use GPU acceleration with custom CUDA kernels. You can train a model on
more than a billion words in a couple of minutes using a multi-core CPU or a GPU. And, you achieve
performance on par with the state-of-the-art deep learning text classification algorithms.

The BlazingText algorithm is not parallelizable. For more information on parameters related to training,
see Docker Registry Paths for SageMaker Built-in Algorithms.

The SageMaker BlazingText algorithm provides the following features:

• Accelerated training of the fastText text classifier on multi-core CPUs or a GPU and Word2Vec on GPUs
using highly optimized CUDA kernels. For more information, see BlazingText: Scaling and Accelerating
Word2Vec using Multiple GPUs.
• Enriched Word Vectors with Subword Information by learning vector representations for character n-
grams. This approach enables BlazingText to generate meaningful vectors for out-of-vocabulary (OOV)
words by representing their vectors as the sum of the character n-gram (subword) vectors.
• A batch_skipgram mode for the Word2Vec algorithm that allows faster training and distributed
computation across multiple CPU nodes. The batch_skipgram mode does mini-batching using the
Negative Sample Sharing strategy to convert level-1 BLAS operations into level-3 BLAS operations.
This efficiently leverages the multiply-add instructions of modern architectures. For more information,
see Parallelizing Word2Vec in Shared and Distributed Memory.

To summarize, the following modes are supported by BlazingText on different instance types:

Modes                                       Word2Vec (Unsupervised Learning)    Text Classification (Supervised Learning)

Single CPU instance                         cbow, Skip-gram, Batch Skip-gram    supervised
Single GPU instance (with 1 or more GPUs)   cbow, Skip-gram                     supervised with one GPU
Multiple CPU instances                      Batch Skip-gram                     None

For more information about the mathematics behind BlazingText, see BlazingText: Scaling and
Accelerating Word2Vec using Multiple GPUs.

Topics
• Input/Output Interface for the BlazingText Algorithm (p. 1401)
• EC2 Instance Recommendation for the BlazingText Algorithm (p. 1403)
• BlazingText Sample Notebooks (p. 1404)
• BlazingText Hyperparameters (p. 1404)


• Tune a BlazingText Model (p. 1408)

Input/Output Interface for the BlazingText Algorithm


The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line
in the file should contain a single sentence. If you need to train on multiple text files, concatenate them
into one file and upload the file in the respective channel.

Training and Validation Data Format


Training and Validation Data Format for the Word2Vec Algorithm
For Word2Vec training, upload the file under the train channel. No other channels are supported. The file
should contain a training sentence per line.

Training and Validation Data Format for the Text Classification Algorithm
For supervised mode, you can train with file mode or with the augmented manifest text format.

Train with File Mode


For supervised mode, the training/validation file should contain a training sentence per line along
with the labels. Labels are words that are prefixed by the string __label__. Here is an example of a
training/validation file:

__label__4 linux ready for prime time , intel says , despite all the linux hype , the
open-source movement has yet to make a huge splash in the desktop market . that may be
about to change , thanks to chipmaking giant intel corp .

__label__2 bowled by the slower one again , kolkata , november 14 the past caught up with
sourav ganguly as the indian skippers return to international cricket was short lived .

Note
The order of labels within the sentence doesn't matter.

Upload the training file under the train channel, and optionally upload the validation file under the
validation channel.

Train with Augmented Manifest Text Format


Supervised mode for CPU instances also supports the augmented manifest format, which enables you
to do training in pipe mode without needing to create RecordIO files. To use this format, generate an S3
manifest file that contains the list of sentences and their corresponding labels.
The manifest file format should be in JSON Lines format in which each line represents one sample. The
sentences are specified using the source tag and the label can be specified using the label tag. Both
source and label tags should be provided under the AttributeNames parameter value as specified in
the request.

{"source":"linux ready for prime time , intel says , despite all the linux hype",
"label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with
sourav ganguly", "label":2}

Multi-label training is also supported by specifying a JSON array of labels.

{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":
[1, 3]}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with
sourav ganguly", "label": [2, 4, 5]}


For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).
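
For example, the following is a minimal sketch of generating an augmented manifest file in Python; the
sentences and label IDs are illustrative:

import json

# (sentence, labels) pairs; the text and label IDs are illustrative.
samples = [
    ("linux ready for prime time , intel says", [1]),
    ("bowled by the slower one again , kolkata", [2, 4]),
]

with open("train.manifest", "w") as f:
    for sentence, labels in samples:
        # Use a bare integer for single-label data and a JSON array otherwise.
        label = labels[0] if len(labels) == 1 else labels
        f.write(json.dumps({"source": sentence, "label": label}) + "\n")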

Model Artifacts and Inference

Model Artifacts for the Word2Vec Algorithm


For Word2Vec training, the model artifacts consist of vectors.txt, which contains the word-to-vector
mapping, and vectors.bin, a binary used by BlazingText for hosting, inference, or both. vectors.txt stores
the vectors in a format that is compatible with other tools like Gensim and Spacy. For example, a Gensim
user can run the following commands to load the vectors.txt file:

from gensim.models import KeyedVectors


word_vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
word_vectors.doesnt_match("breakfast cereal dinner lunch".split())

If the evaluation parameter is set to True, an additional file, eval.json, is created. This file contains the
similarity evaluation results (using Spearman's rank correlation coefficients) on the WS-353 dataset. It
also reports the number of words from the WS-353 dataset that are not present in the training corpus.

For inference requests, the model accepts a JSON file containing a list of strings and returns a list of
vectors. If the word is not found in vocabulary, inference returns a vector of zeros. If subwords is set to
True during training, the model is able to generate vectors for out-of-vocabulary (OOV) words.

Sample JSON Request


Mime-type: application/json

{
"instances": ["word1", "word2", "word3"]
}

Model Artifacts for the Text Classification Algorithm


Training with supervised outputs creates a model.bin file that can be consumed by BlazingText hosting.
For inference, the BlazingText model accepts a JSON file containing a list of sentences and returns a list
of corresponding predicted labels and probability scores. Each sentence is expected to be a string with
space-separated tokens, words, or both.

Sample JSON Request


Mime-type: application/json

{
"instances": ["the movie was excellent", "i did not like the plot ."]
}

By default, the server returns only one prediction, the one with the highest probability. For retrieving the
top k predictions, you can set k in the configuration, as follows:

{
"instances": ["the movie was excellent", "i did not like the plot ."],
"configuration": {"k": 2}
}
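
For orientation, a minimal sketch of sending this top-k request to a deployed endpoint with boto3
follows; the endpoint name is a placeholder:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "instances": ["the movie was excellent", "i did not like the plot ."],
    "configuration": {"k": 2},  # request the top-2 predictions
}

response = runtime.invoke_endpoint(
    EndpointName="blazingtext-classifier",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

print(json.loads(response["Body"].read()))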

For BlazingText, the content-type and accept parameters must be equal. For batch transform, they
both need to be application/jsonlines. If they differ, the Accept field is ignored. The format for
input follows:


content-type: application/jsonlines

{"source": "source_0"}
{"source": "source_1"}

If you need to pass the value of k for top-k predictions, you can do it in the following way:

{"source": "source_0", "k": 2}
{"source": "source_1", "k": 3}

The format for output follows:

accept: application/jsonlines

{"prob": [prob_1], "label": ["__label__1"]}


{"prob": [prob_1], "label": ["__label__1"]}

If you have passed the value of k to be more than 1, then response will be in this format:

{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}


{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}

For both supervised (text classification) and unsupervised (Word2Vec) modes, the binaries (*.bin)
produced by BlazingText can be cross-consumed by fastText and vice versa. You can use binaries
produced by BlazingText with fastText. Likewise, you can host model binaries created with fastText
using BlazingText.

Here is an example of how to use a model generated with BlazingText with fastText:

# Download the model artifact from S3
aws s3 cp s3://<YOUR_S3_BUCKET>/<PREFIX>/model.tar.gz model.tar.gz

# Unzip the model archive
tar -xzf model.tar.gz

# Use the extracted model binary with fastText
fasttext predict ./model.bin test.txt

However, the binaries are only supported when training on CPU and single GPU; training on multi-GPU
will not produce binaries.

For more details on dataset formats and model hosting, see the example notebooks Text Classification
with the BlazingText Algorithm, FastText Models, and Generating Subword Embeddings with the
Word2Vec Algorithm.

EC2 Instance Recommendation for the BlazingText Algorithm


For cbow and skipgram modes, BlazingText supports single CPU and single GPU instances. Both
of these modes support learning of subwords embeddings. To achieve the highest speed without
compromising accuracy, we recommend that you use an ml.p3.2xlarge instance.

For batch_skipgram mode, BlazingText supports single or multiple CPU instances. When training on
multiple instances, set the value of the S3DataDistributionType field of the S3DataSource object
that you pass to CreateTrainingJob to FullyReplicated. BlazingText takes care of distributing
data across machines.

For the supervised text classification mode, a C5 instance is recommended if the training dataset is less
than 2 GB. For larger datasets, use an instance with a single GPU. BlazingText supports P2, P3, G4dn, and
G5 instances for training and inference.
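
As a concrete starting point, the following is a minimal sketch of launching a supervised BlazingText
training job with the SageMaker Python SDK (v2); the role placeholder, bucket paths, and hyperparameter
values are assumptions you would replace:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Resolve the BlazingText container image for the current region.
image = image_uris.retrieve("blazingtext", session.boto_region_name)

bt_estimator = Estimator(
    image_uri=image,
    role="<YOUR_SAGEMAKER_EXECUTION_ROLE>",   # placeholder IAM role ARN
    instance_count=1,
    instance_type="ml.c5.4xlarge",            # C5 suits datasets under 2 GB
    sagemaker_session=session,
)

# Supervised text classification mode; other hyperparameters keep defaults.
bt_estimator.set_hyperparameters(mode="supervised", epochs=10, min_count=2)

bt_estimator.fit({
    "train": "s3://<YOUR_S3_BUCKET>/<PREFIX>/train",
    "validation": "s3://<YOUR_S3_BUCKET>/<PREFIX>/validation",
})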


BlazingText Sample Notebooks


For a sample notebook that uses the SageMaker BlazingText algorithm to train and deploy supervised
binary and multiclass classification models, see Blazing Text classification on the DBPedia dataset.
For instructions for creating and accessing Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After creating and opening
a notebook instance, choose the SageMaker Examples tab to see a list of all the SageMaker examples.
The example notebooks that use the BlazingText algorithm are located in the Introduction to Amazon
algorithms section. To open a notebook, choose its Use tab, then choose Create copy.

BlazingText Hyperparameters
When you start a training job with a CreateTrainingJob request, you specify a training algorithm.
You can also specify algorithm-specific hyperparameters as string-to-string maps. The hyperparameters
for the BlazingText algorithm depend on which mode you use: Word2Vec (unsupervised) and Text
Classification (supervised).

Word2Vec Hyperparameters
The following table lists the hyperparameters for the BlazingText Word2Vec training algorithm provided
by Amazon SageMaker.

Parameter Name Description

mode The Word2vec architecture used for training.

Required

Valid values: batch_skipgram, skipgram, or cbow

batch_size The size of each batch when mode is set to batch_skipgram. Set
to a number between 10 and 20.

Optional

Valid values: Positive integer

Default value: 11

buckets The number of hash buckets to use for subwords.

Optional

Valid values: positive integer

Default value: 2000000

epochs The number of complete passes through the training data.

Optional

Valid values: Positive integer

Default value: 5

evaluation Whether the trained model is evaluated using the WordSimilarity-353 Test.

Optional

Valid values: (Boolean) True or False

Default value: True

learning_rate The step size used for parameter updates.

Optional

Valid values: Positive float

Default value: 0.05

min_char The minimum number of characters to use for subwords/character n-grams.

Optional

Valid values: positive integer

Default value: 3

min_count Words that appear less than min_count times are discarded.

Optional

Valid values: Non-negative integer

Default value: 5

max_char The maximum number of characters to use for subwords/character n-grams.

Optional

Valid values: positive integer

Default value: 6

negative_samples The number of negative samples for the negative sample sharing
strategy.

Optional

Valid values: Positive integer

Default value: 5

sampling_threshold The threshold for the occurrence of words. Words that appear with
higher frequency in the training data are randomly down-sampled.

Optional

Valid values: Positive fraction. The recommended range is (0, 1e-3]

Default value: 0.0001


subwords Whether to learn subword embeddings or not.

Optional

Valid values: (Boolean) True or False

Default value: False

vector_dim The dimension of the word vectors that the algorithm learns.

Optional

Valid values: Positive integer

Default value: 100

window_size The size of the context window. The context window is the number
of words surrounding the target word used for training.

Optional

Valid values: Positive integer

Default value: 5

Text Classification Hyperparameters

The following table lists the hyperparameters for the Text Classification training algorithm provided by
Amazon SageMaker.
Note
Although some of the parameters are common between the Text Classification and Word2Vec
modes, they might have different meanings depending on the context.

Parameter Name Description

mode The training mode.

Required

Valid values: supervised

buckets The number of hash buckets to use for word n-grams.

Optional

Valid values: Positive integer

Default value: 2000000

early_stopping Whether to stop training if validation accuracy doesn't improve after a patience
number of epochs. Note that a validation channel is required if early stopping is used.

Optional

Valid values: (Boolean) True or False



Default value: False

epochs The maximum number of complete passes through the training data.

Optional

Valid values: Positive integer

Default value: 5

learning_rate The step size used for parameter updates.

Optional

Valid values: Positive float

Default value: 0.05

min_count Words that appear less than min_count times are discarded.

Optional

Valid values: Non-negative integer

Default value: 5

min_epochs The minimum number of epochs to train before early stopping logic
is invoked.

Optional

Valid values: Positive integer

Default value: 5

patience The number of epochs to wait before applying early stopping when no progress is made on
the validation set. Used only when early_stopping is True.

Optional

Valid values: Positive integer

Default value: 4

vector_dim The dimension of the embedding layer.

Optional

Valid values: Positive integer

Default value: 100


word_ngrams The number of word n-gram features to use.

Optional

Valid values: Positive integer

Default value: 2

Tune a BlazingText Model


Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the BlazingText Algorithm


The BlazingText Word2Vec algorithm (skipgram, cbow, and batch_skipgram modes) reports on a
single metric during training: train:mean_rho. This metric is computed on WS-353 word similarity
datasets. When tuning the hyperparameter values for the Word2Vec algorithm, use this metric as the
objective.

The BlazingText Text Classification algorithm (supervised mode) also reports on a single metric during
training: validation:accuracy. When tuning the hyperparameter values for the text classification
algorithm, use this metric as the objective.

Metric Name           Description                                                                                 Optimization Direction

train:mean_rho        The mean rho (Spearman's rank correlation coefficient) on WS-353 word similarity datasets   Maximize
validation:accuracy   The classification accuracy on the user-specified validation dataset                        Maximize

Tunable BlazingText Hyperparameters


Tunable Hyperparameters for the Word2Vec Algorithm
Tune an Amazon SageMaker BlazingText Word2Vec model with the following hyperparameters.
The hyperparameters that have the greatest impact on Word2Vec objective metrics are: mode,
learning_rate, window_size, vector_dim, and negative_samples.

Parameter Name       Parameter Type              Recommended Ranges or Values

batch_size           IntegerParameterRange       [8-32]
epochs               IntegerParameterRange       [5-15]
learning_rate        ContinuousParameterRange    MinValue: 0.005, MaxValue: 0.01
min_count            IntegerParameterRange       [0-100]
mode                 CategoricalParameterRange   ['batch_skipgram', 'skipgram', 'cbow']
negative_samples     IntegerParameterRange       [5-25]
sampling_threshold   ContinuousParameterRange    MinValue: 0.0001, MaxValue: 0.001
vector_dim           IntegerParameterRange       [32-300]
window_size          IntegerParameterRange       [1-10]

Tunable Hyperparameters for the Text Classification Algorithm


Tune an Amazon SageMaker BlazingText text classification model with the following hyperparameters.

Parameter Name   Parameter Type             Recommended Ranges or Values

buckets          IntegerParameterRange      [1000000-10000000]
epochs           IntegerParameterRange      [5-15]
learning_rate    ContinuousParameterRange   MinValue: 0.005, MaxValue: 0.01
min_count        IntegerParameterRange      [0-100]
vector_dim       IntegerParameterRange      [32-300]
word_ngrams      IntegerParameterRange      [1-3]

Latent Dirichlet Allocation (LDA) Algorithm


The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning
algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most
commonly used to discover a user-specified number of topics shared by documents within a text corpus.
Here each observation is a document, the features are the presence (or occurrence count) of each word,
and the categories are the topics. Since the method is unsupervised, the topics are not specified up front,
and are not guaranteed to align with how a human may naturally categorize documents. The topics are
learned as a probability distribution over the words that occur in each document. Each document, in turn,
is described as a mixture of topics.

The exact content of two documents with similar topic mixtures will not be the same. But overall, you
would expect these documents to more frequently use a shared subset of words, than when compared
with a document from a different topic mixture. This allows LDA to discover these word groups and use
them to form topics. As an extremely simple example, given a set of documents where the only words
that occur within them are: eat, sleep, play, meow, and bark, LDA might produce topics like the following:

Topic     eat    sleep    play    meow    bark

Topic 1   0.1    0.3      0.2     0.4     0.0
Topic 2   0.2    0.1      0.4     0.0     0.3

You can infer that documents that are more likely to fall into Topic 1 are about cats (who are more likely
to meow and sleep), and documents that fall into Topic 2 are about dogs (who prefer to play and bark).
These topics can be found even though the words dog and cat never appear in any of the texts.

Topics
• Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM) (p. 1410)
• Input/Output Interface for the LDA Algorithm (p. 1410)
• EC2 Instance Recommendation for the LDA Algorithm (p. 1411)
• LDA Sample Notebooks (p. 1411)
• How LDA Works (p. 1411)
• LDA Hyperparameters (p. 1413)
• Tune an LDA Model (p. 1414)

Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)

Topic models are commonly used to produce topics from corpuses that (1) coherently encapsulate
semantic meaning and (2) describe documents well. As such, topic models aim to minimize perplexity
and maximize topic coherence.

Perplexity is an intrinsic language modeling evaluation metric that measures the inverse of the
geometric mean per-word likelihood in your test data. A lower perplexity score indicates better
generalization performance. Research has shown that the likelihood computed per word often does
not align with human judgment and can be entirely uncorrelated; topic coherence was introduced to
address this. Each inferred topic from your model consists of words, and topic coherence is computed
for the top N words of that particular topic. It is often defined as the average or median of the pairwise
word-similarity scores of the words in that topic (for example, Pointwise Mutual Information (PMI)). A
promising model generates coherent topics, that is, topics with high topic coherence scores.

While the objective is to train a topic model that minimizes perplexity and maximizes topic coherence,
there is often a tradeoff between the two with both LDA and NTM. Recent research by Amazon (Ding et
al., 2018) has shown that NTM is promising for achieving high topic coherence, but LDA trained with
collapsed Gibbs sampling achieves better perplexity. From a practical standpoint regarding hardware and
compute power, SageMaker NTM is more flexible than LDA and can scale better, because NTM can run
on CPUs and GPUs and can be parallelized across multiple GPU instances, whereas LDA supports only
single-instance CPU training.

Topics
• Input/Output Interface for the LDA Algorithm (p. 1410)
• EC2 Instance Recommendation for the LDA Algorithm (p. 1411)
• LDA Sample Notebooks (p. 1411)
• How LDA Works (p. 1411)
• LDA Hyperparameters (p. 1413)
• Tune an LDA Model (p. 1414)

Input/Output Interface for the LDA Algorithm

LDA expects data to be provided on the train channel, and optionally supports a test channel, which
is scored by the final model. LDA supports both recordIO-wrapped-protobuf (dense and sparse)


and CSV file formats. For CSV, the data must be dense and have dimension equal to number of records *
vocabulary size. LDA can be trained in File or Pipe mode when using recordIO-wrapped protobuf, but only
in File mode for the CSV format.

For inference, text/csv, application/json, and application/x-recordio-protobuf content
types are supported. Sparse data can also be passed for application/json and application/x-
recordio-protobuf. LDA inference returns application/json or application/x-recordio-
protobuf predictions, which include the topic_mixture vector for each observation.

Please see the LDA Sample Notebooks (p. 1411) for more detail on training and inference formats.
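
For example, the following is a minimal sketch of converting a dense document-word count matrix into
the recordIO-wrapped protobuf format with the SageMaker Python SDK helper; the toy matrix is
illustrative, and uploading the buffer to S3 is left out:

import io

import numpy as np
import sagemaker.amazon.common as smac

# counts[i, j] = number of times word j occurs in document i (toy 2x4 corpus)
counts = np.array([[1.0, 0.0, 2.0, 0.0],
                   [0.0, 3.0, 0.0, 1.0]], dtype=np.float32)

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, counts)  # recordIO-wrapped protobuf
buf.seek(0)
# `buf` can now be uploaded to S3 and used as the train channel.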

EC2 Instance Recommendation for the LDA Algorithm

LDA currently only supports single-instance CPU training. CPU instances are recommended for hosting/
inference.

LDA Sample Notebooks

For a sample notebook that shows how to train the SageMaker Latent Dirichlet Allocation algorithm
on a dataset and then how to deploy the trained model to perform inferences about the topic mixtures
in input documents, see the An Introduction to SageMaker LDA. For instructions how to create and
access Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon
SageMaker Notebook Instances (p. 204). Once you have created a notebook instance and opened it,
select the SageMaker Examples tab to see a list of all the SageMaker samples. The topic modeling
example notebooks that use the LDA algorithm are located in the Introduction to Amazon algorithms
section. To open a notebook, choose its Use tab, then choose Create copy.

How LDA Works

Amazon SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of
observations as a mixture of different categories. These categories are themselves a probability
distribution over the features. LDA is a generative probability model, which means it attempts to
provide a model for the distribution of outputs and inputs based on latent variables. This is opposed to
discriminative models, which attempt to learn how inputs map to outputs.

You can use LDA for a variety of tasks, from clustering customers based on product purchases to
automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in
text corpuses. Observations are referred to as documents. The feature set is referred to as vocabulary. A
feature is referred to as a word. And the resulting categories are referred to as topics.
Note
Lemmatization significantly increases algorithm performance and accuracy. Consider pre-
processing any input text data.

An LDA model is defined by two parameters:

• α—A prior estimate on topic probability (in other words, the average frequency that each topic within
a given document occurs).
• β—a collection of k topics where each topic is given a probability distribution over the vocabulary used
in a document corpus, also called a "topic-word distribution."

LDA is a "bag-of-words" model, which means that the order of words does not matter. LDA is a
generative model where each document is generated word-by-word by choosing a topic mixture θ ∼
Dirichlet(α).

For each word in the document:

• Choose a topic z ∼ Multinomial(θ)
• Choose the corresponding topic-word distribution β_z
• Draw a word w ∼ Multinomial(β_z)

When training the model, the goal is to find parameters α and β, which maximize the probability that the
text corpus is generated by the model.
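
To make the generative process concrete, here is a minimal NumPy sketch that follows these steps. The
vocabulary, prior, and topic-word distributions reuse the toy cat/dog example above and are illustrative,
not learned values:

import numpy as np

rng = np.random.default_rng(0)

vocab = ["eat", "sleep", "play", "meow", "bark"]
alpha = np.array([0.5, 0.5])           # Dirichlet prior over k=2 topics
beta = np.array([                      # topic-word distributions (rows sum to 1)
    [0.1, 0.3, 0.2, 0.4, 0.0],         # Topic 1 ("cats")
    [0.2, 0.1, 0.4, 0.0, 0.3],         # Topic 2 ("dogs")
])

theta = rng.dirichlet(alpha)           # per-document topic mixture
document = []
for _ in range(10):                    # generate a 10-word document
    z = rng.choice(len(alpha), p=theta)        # choose a topic
    w = rng.choice(len(vocab), p=beta[z])      # draw a word from that topic
    document.append(vocab[w])

print(document)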

The most popular methods for estimating the LDA model use Gibbs sampling or Expectation
Maximization (EM) techniques. The Amazon SageMaker LDA uses tensor spectral decomposition. This
provides several advantages:

• Theoretical guarantees on results. The standard EM-method is guaranteed to converge only to local
optima, which are often of poor quality.
• Embarrassingly parallelizable. The work can be trivially divided over input documents in both training
and inference. The EM-method and Gibbs Sampling approaches can be parallelized, but not as easily.
• Fast. Although the EM-method has a low iteration cost, it is prone to slow convergence rates. Gibbs
Sampling is also subject to slow convergence rates and requires a large number of samples.

At a high-level, the tensor decomposition algorithm follows this process:

1. The goal is to calculate the spectral decomposition of a V x V x V tensor, which summarizes the
moments of the documents in our corpus. V is vocabulary size (in other words, the number of distinct
words in all of the documents). The spectral components of this tensor are the LDA parameters α and
β, which maximize the overall likelihood of the document corpus. However, because vocabulary size
tends to be large, this V x V x V tensor is prohibitively large to store in memory.
2. Instead, it uses a V x V moment matrix, which is the two-dimensional analog of the tensor from step
1, to find a whitening matrix of dimension V x k. This matrix can be used to convert the V x V moment
matrix into a k x k identity matrix. k is the number of topics in the model.
3. This same whitening matrix can then be used to find a smaller k x k x k tensor. When spectrally
decomposed, this tensor has components that have a simple relationship with the components of the
V x V x V tensor.
4. Alternating Least Squares is used to decompose the smaller k x k x k tensor. This provides a substantial
improvement in memory consumption and speed. The parameters α and β can be found by
“unwhitening” these outputs in the spectral decomposition.

After the LDA model’s parameters have been found, you can find the topic mixtures for each document.
You use stochastic gradient descent to maximize the likelihood function of observing a given topic
mixture corresponding to these data.

Topic quality can be improved by increasing the number of topics to look for in training and then
filtering out poor quality ones. This is in fact done automatically in SageMaker LDA: 25% more topics
are computed and only the ones with largest associated Dirichlet priors are returned. To perform further
topic filtering and analysis, you can increase the topic count and modify the resulting LDA model as
follows:

> import mxnet as mx
> alpha, beta = mx.ndarray.load('model.tar.gz')
> # modify alpha and beta
> mx.nd.save('new_model.tar.gz', [new_alpha, new_beta])
> # upload to S3 and create a new SageMaker model using the console

For more information about algorithms for LDA and the SageMaker implementation, see the following:

• Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor
Decompositions for Learning Latent Variable Models, Journal of Machine Learning Research, 15:2773–
2832, 2014.


• David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine
Learning Research, 3(Jan):993–1022, 2003.
• Thomas L Griffiths and Mark Steyvers. Finding Scientific Topics. Proceedings of the National Academy
of Sciences, 101(suppl 1):5228–5235, 2004.
• Tamara G Kolda and Brett W Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–
500, 2009.

LDA Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific hyperparameters as string-to-string maps. The following table lists the hyperparameters
for the LDA training algorithm provided by Amazon SageMaker. For more information, see How LDA
Works (p. 1411).

Parameter Name Description

num_topics The number of topics for LDA to find within the data.

Required

Valid values: positive integer

feature_dim The size of the vocabulary of the input document corpus.

Required

Valid values: positive integer

mini_batch_size The total number of documents in the input document corpus.

Required

Valid values: positive integer

alpha0 Initial guess for the concentration parameter: the sum of the
elements of the Dirichlet prior. Small values are more likely to
generate sparse topic mixtures and large values (greater than 1.0)
produce more uniform mixtures.

Optional

Valid values: Positive float

Default value: 1.0

max_restarts The number of restarts to perform during the Alternating Least Squares (ALS) spectral
decomposition phase of the algorithm. Can be used to find better quality local minima at the expense of
additional computation, but typically should not be adjusted.

Optional

Valid values: Positive integer

Default value: 10

max_iterations The maximum number of iterations to perform during the ALS phase of the algorithm.
Can be used to find better quality minima at the expense of additional computation, but typically
should not be adjusted.

Optional

Valid values: Positive integer

Default value: 1000

tol Target error tolerance for the ALS phase of the algorithm. Can be
used to find better quality minima at the expense of additional
computation, but typically should not be adjusted.

Optional

Valid values: Positive float

Default value: 1e-8
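
For orientation, a minimal sketch of setting the required LDA hyperparameters on an estimator with the
SageMaker Python SDK (v2) follows; the role placeholder, instance type, and corpus statistics are
assumptions you would replace:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("lda", session.boto_region_name)

lda_estimator = Estimator(
    image_uri=image,
    role="<YOUR_SAGEMAKER_EXECUTION_ROLE>",  # placeholder IAM role ARN
    instance_count=1,                        # LDA trains on a single CPU instance
    instance_type="ml.c5.2xlarge",
    sagemaker_session=session,
)

lda_estimator.set_hyperparameters(
    num_topics=20,          # number of topics to find
    feature_dim=5000,       # vocabulary size of the corpus
    mini_batch_size=10000,  # total number of documents in the corpus
    alpha0=1.0,             # Dirichlet concentration parameter
)

lda_estimator.fit({"train": "s3://<YOUR_S3_BUCKET>/<PREFIX>/train"})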

Tune an LDA Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

LDA is an unsupervised topic modeling algorithm that attempts to describe a set of observations
(documents) as a mixture of different categories (topics). The “per-word log-likelihood” (PWLL) metric
measures the likelihood that a learned set of topics (an LDA model) accurately describes a test document
dataset. Larger values of PWLL indicate that the test data is more likely to be described by the LDA
model.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the LDA Algorithm

The LDA algorithm reports on a single metric during training: test:pwll. When tuning a model, choose
this metric as the objective metric.

Metric Name   Description                                                                                        Optimization Direction

test:pwll     Per-word log-likelihood on the test dataset. The likelihood that the test dataset is accurately
              described by the learned LDA model.                                                                Maximize

Tunable LDA Hyperparameters

You can tune the following hyperparameters for the LDA algorithm. Both hyperparameters, alpha0 and
num_topics, can affect the LDA objective metric (test:pwll). If you don't already know the optimal
values for these hyperparameters, which maximize per-word log-likelihood and produce an accurate LDA
model, automatic model tuning can help find them.


Parameter Name   Parameter Type              Recommended Ranges

alpha0           ContinuousParameterRanges   MinValue: 0.1, MaxValue: 10
num_topics       IntegerParameterRanges      MinValue: 1, MaxValue: 150

Neural Topic Model (NTM) Algorithm


Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of
documents into topics that contain word groupings based on their statistical distribution. Documents
that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely
to share a topic on "transportation" for example. Topic modeling can be used to classify or summarize
documents based on the topics detected or to retrieve information or recommend content based on
topic similarities. The topics from documents that NTM learns are characterized as a latent representation
because the topics are inferred from the observed word distributions in the corpus. The semantics of
topics are usually inferred by examining the top ranking words they contain. Because the method is
unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the
topics are not guaranteed to align with how a human might naturally categorize documents.

Topic modeling provides a way to visualize the contents of a large document corpus in terms of the
learned topics. Documents relevant to each topic might be indexed or searched for based on their soft
topic labels. The latent representations of documents might also be used to find similar documents in
the topic space. You can also use the latent representations of documents that the topic model learns for
input to another supervised algorithm such as a document classifier. Because the latent representations
of documents are expected to capture the semantics of the underlying documents, algorithms based in
part on these representations are expected to perform better than those based on lexical features alone.

Although you can use both the Amazon SageMaker NTM and LDA algorithms for topic modeling, they
are distinct algorithms and can be expected to produce different results on the same input data.

For more information on the mathematics behind NTM, see Neural Variational Inference for Text
Processing.

Topics
• Input/Output Interface for the NTM Algorithm (p. 1415)
• EC2 Instance Recommendation for the NTM Algorithm (p. 1416)
• NTM Sample Notebooks (p. 1416)
• NTM Hyperparameters (p. 1416)
• Tune an NTM Model (p. 1419)
• NTM Response Formats (p. 1420)

Input/Output Interface for the NTM Algorithm


Amazon SageMaker Neural Topic Model supports four data channels: train, validation, test, and auxiliary.
The validation, test, and auxiliary data channels are optional. If you specify any of these optional
channels, set the value of the S3DataDistributionType parameter for them to FullyReplicated.
If you provide validation data, the loss on this data is logged at every epoch, and the model stops
training as soon as it detects that the validation loss is not improving. If you don't provide validation
data, the algorithm stops early based on the training data, but this can be less efficient. If you provide
test data, the algorithm reports the test loss from the final model.

The train, validation, and test data channels for NTM support both recordIO-wrapped-protobuf
(dense and sparse) and CSV file formats. For CSV format, each row must be represented densely with


zero counts for words not present in the corresponding document, and have dimension equal to:
(number of records) * (vocabulary size). You can use either File mode or Pipe mode to train models on
data that is formatted as recordIO-wrapped-protobuf or as CSV. The auxiliary channel is used to
supply a text file that contains vocabulary. By supplying the vocabulary file, users are able to see the top
words for each of the topics printed in the log instead of their integer IDs. Having the vocabulary file also
allows NTM to compute the Word Embedding Topic Coherence (WETC) scores, a new metric displayed in
the log that captures similarity among the top words in each topic effectively. The ContentType for the
auxiliary channel is text/plain, with each line containing a single word, in the order corresponding to
the integer IDs provided in the data. The vocabulary file must be named vocab.txt and currently only
UTF-8 encoding is supported.
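
For example, a minimal sketch of writing a vocab.txt file for the auxiliary channel follows; the
vocabulary below is illustrative, and line order must match the integer IDs in your training data:

# Order matters: line i must correspond to integer ID i in the training data.
vocabulary = ["bike", "car", "train", "mileage", "speed"]  # illustrative

with open("vocab.txt", "w", encoding="utf-8") as f:  # UTF-8 is required
    for word in vocabulary:
        f.write(word + "\n")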

For inference, text/csv, application/json, application/jsonlines, and application/x-
recordio-protobuf content types are supported. Sparse data can also be passed for application/
json and application/x-recordio-protobuf. NTM inference returns application/json or
application/x-recordio-protobuf predictions, which include the topic_weights vector for each
observation.

See the blog post and the companion notebook for more details on using the auxiliary channel and the
WETC scores. For more information on how to compute the WETC score, see Coherence-Aware Neural
Topic Modeling. We used the pairwise WETC described in this paper for the Amazon SageMaker Neural
Topic Model.

For more information on input and output file formats, see NTM Response Formats (p. 1420) for
inference and the NTM Sample Notebooks (p. 1416).

EC2 Instance Recommendation for the NTM Algorithm

NTM training supports both GPU and CPU instance types. We recommend GPU instances, but for certain
workloads, CPU instances may result in lower training costs. CPU instances should be sufficient for
inference. NTM supports the P2, P3, G4dn, and G5 GPU instance families for training and inference.

NTM Sample Notebooks

For a sample notebook that uses the SageMaker NTM algorithm to uncover topics in documents from a
synthetic data source where the topic distributions are known, see the Introduction to Basic Functionality
of NTM. For instructions how to create and access Jupyter notebook instances that you can use to
run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you have
created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the
SageMaker samples. The topic modeling example notebooks using the NTM algorithms are located in the
Introduction to Amazon algorithms section. To open a notebook, click on its Use tab and select Create
copy.

NTM Hyperparameters

Parameter Name Description

feature_dim The vocabulary size of the dataset.

Required

Valid values: Positive integer (min: 1, max: 1,000,000)

num_topics The number of required topics.

Required

Valid values: Positive integer (min: 2, max: 1000)

batch_norm Whether to use batch normalization during training.



Optional

Valid values: true or false

Default value: false

clip_gradient The maximum magnitude for each gradient component.

Optional

Valid values: Float (min: 1e-3)

Default value: Infinity

encoder_layers The number of layers in the encoder and the output size of each
layer. When set to auto, the algorithm uses two layers of sizes 3 x
num_topics and 2 x num_topics respectively.

Optional

Valid values: Comma-separated list of positive integers or auto

Default value: auto

encoder_layers_activation The activation function to use in the encoder layers.

Optional

Valid values:

• sigmoid: Sigmoid function
• tanh: Hyperbolic tangent
• relu: Rectified linear unit

Default value: sigmoid

epochs The maximum number of passes over the training data.

Optional

Valid values: Positive integer (min: 1)

Default value: 50

learning_rate The learning rate for the optimizer.

Optional

Valid values: Float (min: 1e-6, max: 1.0)

Default value: 0.001


mini_batch_size The number of examples in each mini batch.

Optional

Valid values: Positive integer (min: 1, max: 10000)

Default value: 256

num_patience_epochs The number of successive epochs over which the early stopping criterion is
evaluated. Early stopping is triggered when the change in the loss function drops below the specified
tolerance within the last num_patience_epochs number of epochs. To disable early stopping, set
num_patience_epochs to a value larger than epochs.

Optional

Valid values: Positive integer (min: 1)

Default value: 3

optimizer The optimizer to use for training.

Optional

Valid values:

• sgd: Stochastic gradient descent
• adam: Adaptive momentum estimation
• adagrad: Adaptive gradient algorithm
• adadelta: An adaptive learning rate algorithm
• rmsprop: Root mean square propagation

Default value: adadelta

rescale_gradient The rescale factor for gradient.

Optional

Valid values: float (min: 1e-3, max: 1.0)

Default value: 1.0

sub_sample The fraction of the training data to sample for training per epoch.

Optional

Valid values: Float (min: 0.0, max: 1.0)

Default value: 1.0


tolerance The maximum relative change in the loss function. Early stopping is
triggered when change in the loss function drops below this value
within the last num_patience_epochs number of epochs.

Optional

Valid values: Float (min: 1e-6, max: 0.1)

Default value: 0.001

weight_decay The weight decay coefficient. Adds L2 regularization.

Optional

Valid values: Float (min: 0.0, max: 1.0)

Default value: 0.0

Tune an NTM Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

Amazon SageMaker NTM is an unsupervised learning algorithm that learns latent representations of
large collections of discrete data, such as a corpus of documents. Latent representations use inferred
variables that are not directly measured to model the observations in a dataset. Automatic model tuning
on NTM helps you find the model that minimizes loss over the training or validation data. Training loss
measures how well the model fits the training data. Validation loss measures how well the model can
generalize to data that it is not trained on. Low training loss indicates that a model is a good fit to the
training data. Low validation loss indicates that a model has not overfit the training data and so should
be able to model documents successfully on which is has not been trained. Usually, it's preferable to have
both losses be small. However, minimizing training loss too much might result in overfitting and increase
validation loss, which would reduce the generality of the model.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the NTM Algorithm

The NTM algorithm reports a single metric that is computed during training: validation:total_loss.
The total loss is the sum of the reconstruction loss and Kullback-Leibler divergence. When tuning
hyperparameter values, choose this metric as the objective.

Metric Name             Description                        Optimization Direction

validation:total_loss   Total loss on the validation set   Minimize

Tunable NTM Hyperparameters

You can tune the following hyperparameters for the NTM algorithm. Usually setting low
mini_batch_size and small learning_rate values results in lower validation losses, although it


might take longer to train. Low validation losses don't necessarily produce more coherent topics as
interpreted by humans. The effect of other hyperparameters on training and validation loss can vary
from dataset to dataset. To see which values are compatible, see NTM Hyperparameters (p. 1416).

Parameter Name              Parameter Type               Recommended Ranges

encoder_layers_activation   CategoricalParameterRanges   ['sigmoid', 'tanh', 'relu']
learning_rate               ContinuousParameterRange     MinValue: 1e-4, MaxValue: 0.1
mini_batch_size             IntegerParameterRanges       MinValue: 16, MaxValue: 2048
optimizer                   CategoricalParameterRanges   ['sgd', 'adam', 'adadelta']
rescale_gradient            ContinuousParameterRange     MinValue: 0.1, MaxValue: 1.0
weight_decay                ContinuousParameterRange     MinValue: 0.0, MaxValue: 1.0

NTM Response Formats

All Amazon SageMaker built-in algorithms adhere to the common input inference format described
in Common Data Formats - Inference. This topic contains a list of the available output formats for the
SageMaker NTM algorithm.

JSON Response Format

{
"predictions": [
{"topic_weights": [0.02, 0.1, 0,...]},
{"topic_weights": [0.25, 0.067, 0,...]}
]
}

JSONLINES Response Format

{"topic_weights": [0.02, 0.1, 0,...]}


{"topic_weights": [0.25, 0.067, 0,...]}

RECORDIO Response Format

[
Record = {
features = {},
label = {
'topic_weights': {
keys: [],
values: [0.25, 0.067, 0, ...] # float32
}
}
},
Record = {
features = {},
label = {
'topic_weights': {


keys: [],
values: [0.25, 0.067, 0, ...] # float32
}
}
}
]

Object2Vec Algorithm
The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm
that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional
objects. The embeddings are learned in a way that preserves the semantics of the relationship between
pairs of objects in the original space in the embedding space. You can use the learned embeddings to
efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in
low-dimensional space, for example. You can also use the embeddings as features of the corresponding
objects in downstream supervised tasks, such as classification or regression.

Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in
the SageMaker BlazingText algorithm (p. 1399). For a blog post that discusses how to apply Object2Vec
to some practical use cases, see Introduction to Amazon SageMaker Object2Vec.

Topics
• I/O Interface for the Object2Vec Algorithm (p. 1421)
• EC2 Instance Recommendation for the Object2Vec Algorithm (p. 1422)
• Object2Vec Sample Notebooks (p. 1422)
• How Object2Vec Works (p. 1422)
• Object2Vec Hyperparameters (p. 1424)
• Tune an Object2Vec Model (p. 1433)
• Data Formats for Object2Vec Training (p. 1435)
• Data Formats for Object2Vec Inference (p. 1435)
• Encoder Embeddings for Object2Vec (p. 1436)

I/O Interface for the Object2Vec Algorithm

You can use Object2Vec on many input data types, including the following examples.

Input Data Type Example

Sentence-sentence pairs "A soccer game with multiple males playing." and "Some men are
playing a sport."

Labels-sequence pairs The genre tags of the movie "Titanic", such as "Romance" and
"Drama", and its short description: "James Cameron's Titanic is
an epic, action-packed romance set against the ill-fated maiden
voyage of the R.M.S. Titanic. She was the most luxurious liner of her
era, a ship of dreams, which ultimately carried over 1,500 people to
their death in the ice cold waters of the North Atlantic in the early
hours of April 15, 1912."

Customer-customer pairs The customer ID of Jane and customer ID of Jackie.

Product-product pairs The product ID of football and product ID of basketball.

Item review user-item pairs A user's ID and the items she has bought, such as apple, pear, and
orange.


To transform the input data into the supported formats, you must preprocess it. Currently, Object2Vec
natively supports two types of input:

• A discrete token, which is represented as a list of a single integer-id. For example, [10].
• A sequence of discrete tokens, which is represented as a list of integer-ids. For example,
[0,12,10,13].

The object in each pair can be asymmetric. For example, the pairs can be (token, sequence) or (token,
token) or (sequence, sequence). For token inputs, the algorithm supports simple embeddings as
compatible encoders. For sequences of token vectors, the algorithm supports the following as encoders:

• Average-pooled embeddings
• Hierarchical convolutional neural networks (CNNs)
• Multi-layered bidirectional long short-term memory networks (BiLSTMs)

The input label for each pair can be one of the following:

• A categorical label that expresses the relationship between the objects in the pair
• A score that expresses the strength of the similarity between the two objects

For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For
ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss
function. Specify these loss functions with the output_layer hyperparameter when you create the
model training job.
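
For orientation, a training record in the JSON Lines format described in Data Formats for Object2Vec
Training (p. 1435) pairs two integer-id lists (in0 and in1) with a label; the token IDs and label values
below are illustrative:

{"label": 1, "in0": [6, 17, 606, 19], "in1": [16, 21, 13, 45]}
{"label": 0, "in0": [22, 1016, 32, 13], "in1": [22, 32, 13, 25]}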

EC2 Instance Recommendation for the Object2Vec Algorithm

The type of Amazon Elastic Compute Cloud (Amazon EC2) instance that you use depends on whether you
are training or running inference.

When training a model using the Object2Vec algorithm on a CPU, start with an ml.m5.2xlarge instance.
For training on a GPU, start with an ml.p2.xlarge instance. If the training takes too long on this instance,
you can use a larger instance. Currently, the Object2Vec algorithm can train only on a single machine.
However, it does offer support for multiple GPUs. Object2Vec supports P2, P3, G4dn, and G5 GPU
instance families for training and inference.

For inference with a trained Object2Vec model that has a deep neural network, we recommend using
an ml.p3.2xlarge GPU instance. Because GPU memory is scarce, you can specify the
INFERENCE_PREFERRED_MODE environment variable to choose whether the classification/regression
inference network (see GPU optimization: Classification or Regression (p. 1435)) or the encoder
embeddings inference network (see GPU optimization: Encoder Embeddings (p. 1436)) is loaded into
the GPU.

Object2Vec Sample Notebooks

• Using Object2Vec to Encode Sentences into Fixed Length Embeddings


• Using Object2Vec to learn document embeddings

Note
To run the notebooks on a notebook instance, see Example Notebooks (p. 220). To run the
notebooks on Studio, see Create or Open an Amazon SageMaker Studio Notebook (p. 148).

How Object2Vec Works

When using the Amazon SageMaker Object2Vec algorithm, you follow the standard workflow: process
the data, train the model, and produce inferences.


Topics
• Step 1: Process Data (p. 1423)
• Step 2: Train a Model (p. 1423)
• Step 3: Produce Inferences (p. 1424)

Step 1: Process Data

During preprocessing, convert the data to the JSON Lines text file format specified in Data Formats
for Object2Vec Training (p. 1435). To get the highest accuracy during training, also randomly shuffle
the data before feeding it into the model. How you generate random permutations depends on the
language. For Python, you can use np.random.shuffle; on Unix, shuf.
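
For example, a minimal sketch of shuffling a JSON Lines training file with NumPy before uploading it;
the file names are illustrative:

import numpy as np

# Read all training records, shuffle them in place, and write them back out.
with open("train.jsonl") as f:           # hypothetical input file
    lines = f.readlines()

np.random.shuffle(lines)                 # random permutation of the records

with open("train_shuffled.jsonl", "w") as f:
    f.writelines(lines)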

Step 2: Train a Model

The SageMaker Object2Vec algorithm has the following main components:

• Two input channels – The input channels take a pair of objects of the same or different types as
inputs, and pass them to independent and customizable encoders.
• Two encoders – The two encoders, enc0 and enc1, convert each object into a fixed-length embedding
vector. The encoded embeddings of the objects in the pair are then passed into a comparator.
• A comparator – The comparator compares the embeddings in different ways and outputs scores that
indicate the strength of the relationship between the paired objects. For example, in the output score
for a sentence pair, 1 indicates a strong relationship between the sentences, and 0 represents a weak
relationship.

During training, the algorithm accepts pairs of objects and their relationship labels or scores as inputs.
The objects in each pair can be of different types, as described earlier. If the inputs to both encoders are
composed of the same token-level units, you can use a shared token embedding layer by setting the
tied_token_embedding_weight hyperparameter to True when you create the training job. This is
possible, for example, when comparing sentences that both have word token-level units. To generate
negative samples at a specified rate, set the negative_sampling_rate hyperparameter to the desired
ratio of negative to positive samples. This hyperparameter expedites learning how to discriminate
between the positive samples observed in the training data and the negative samples that are not likely
to be observed.

Pairs of objects are passed through independent, customizable encoders that are compatible with the
input types of corresponding objects. The encoders convert each object in a pair into a fixed-length
embedding vector of equal length. The pair of vectors is passed to a comparator operator, which
assembles the vectors into a single vector using the value specified in the comparator_list
hyperparameter. The assembled vector then passes through a multilayer perceptron (MLP) layer, which
produces an output that the loss function compares with the labels that you provided. This comparison
evaluates the strength of the relationship between the objects in the pair as predicted by the model. The
following figure shows this workflow.


Architecture of the Object2Vec Algorithm from Data Inputs to Scores

Step 3: Produce Inferences

After the model is trained, you can use the trained encoder to preprocess input objects or to perform two
types of inference:

• To convert singleton input objects into fixed-length embeddings using the corresponding encoder
• To predict the relationship label or score between a pair of input objects

The inference server automatically figures out which of the types is requested based on the input data.
To get the embeddings as output, provide only one input. To predict the relationship label or score,
provide both inputs in the pair.
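
The following sketch sends both request types to a hosted endpoint; the endpoint name is a placeholder, and the request formats are described in Data Formats for Object2Vec Inference (p. 1435).

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# A pair input returns the relationship label or score.
pair_payload = {"instances": [{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}]}

# A singleton input returns the embedding from the corresponding encoder.
single_payload = {"instances": [{"in0": [774, 14, 21, 206]}]}

for payload in (pair_payload, single_payload):
    response = runtime.invoke_endpoint(
        EndpointName="my-object2vec-endpoint",  # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    print(response["Body"].read().decode())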

Object2Vec Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the
Object2Vec training algorithm.

Parameter Name Description

enc0_max_seq_len The maximum sequence length for the enc0 encoder.

Required

Valid values: 1 ≤ integer ≤ 5000

enc0_vocab_size The vocabulary size of enc0 tokens.

Required

Valid values: 2 ≤ integer ≤ 3000000


bucket_width The allowed difference between data sequence lengths when bucketing is enabled. To enable bucketing, specify a non-zero value for this parameter.

Optional

Valid values: 0 ≤ integer ≤ 100

Default value: 0 (no bucketing)

comparator_list A list used to customize the way in which two embeddings are
compared. The Object2Vec comparator operator layer takes the
encodings from both encoders as inputs and outputs a single
vector. This vector is a concatenation of subvectors. The string
values passed to the comparator_list and the order in which
they are passed determine how these subvectors are assembled. For
example, if comparator_list="hadamard, concat", then the
comparator operator constructs the vector by concatenating the
Hadamard product of two encodings and the concatenation of two
encodings. If, on the other hand, comparator_list="hadamard",
then the comparator operator constructs the vector as only the
Hadamard product of the two encodings.

Optional

Valid values: A string that contains any combination of the names of the three binary operators: hadamard, concat, or abs_diff. The Object2Vec algorithm currently requires that the two vector encodings have the same dimension. These operators produce the subvectors as follows:

• hadamard: Constructs a vector as the Hadamard (element-wise) product of two encodings.
• concat: Constructs a vector as the concatenation of two encodings.
• abs_diff: Constructs a vector as the absolute difference between two encodings.

Default value: "hadamard, concat, abs_diff"

dropout The dropout probability for network layers. Dropout is a form of regularization used in neural networks that reduces overfitting by trimming codependent neurons.

Optional

Valid values: 0.0 ≤ float ≤ 1.0

Default value: 0.0


early_stopping_patience The number of consecutive epochs without improvement allowed before early stopping is applied. Improvement is defined by the early_stopping_tolerance hyperparameter.

Optional

Valid values: 1 ≤ integer ≤ 5

Default value: 3

early_stopping_tolerance The reduction in the loss function that an algorithm must achieve between consecutive epochs to avoid early stopping after the number of consecutive epochs specified in the early_stopping_patience hyperparameter concludes.

Optional

Valid values: 0.000001 ≤ float ≤ 0.1

Default value: 0.01

enc_dim The dimension of the output of the embedding layer.

Optional

Valid values: 4 ≤ integer ≤ 10000

Default value: 4096

enc0_network The network model for the enc0 encoder.

Optional

Valid values: hcnn, bilstm, or pooled_embedding

• hcnn: A hierarchical convolutional neural network.
• bilstm: A bidirectional long short-term memory network (biLSTM), in which the signal propagates backward and forward in time. This is an appropriate recurrent neural network (RNN) architecture for sequential learning tasks.
• pooled_embedding: Averages the embeddings of all of the tokens in the input.

Default value: hcnn

enc0_cnn_filter_width The filter width of the convolutional neural network (CNN) enc0
encoder.

Conditional

Valid values: 1 ≤ integer ≤ 9

Default value: 3


enc0_freeze_pretrained_embedding Whether to freeze enc0 pretrained embedding weights.

Conditional

Valid values: True or False

Default value: True

enc0_layers The number of layers in the enc0 encoder.

Conditional

Valid values: auto or 1 ≤ integer ≤ 4

• For hcnn, auto means 4.
• For bilstm, auto means 1.
• For pooled_embedding, auto ignores the number of layers.

Default value: auto

enc0_pretrained_embedding_file The filename of the pretrained enc0 token embedding file in the auxiliary data channel.

Conditional

Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9\.\_]

Default value: "" (empty string)

enc0_token_embedding_dim The output dimension of the enc0 token embedding layer.

Conditional

Valid values: 2 ≤ integer ≤ 1000

Default value: 300

enc0_vocab_file The vocabulary file for mapping pretrained enc0 token embedding
vectors to numerical vocabulary IDs.

Conditional

Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9\.\_]

Default value: "" (empty string)


enc1_network The network model for the enc1 encoder. If you want the enc1
encoder to use the same network model as enc0, including the
hyperparameter values, set the value to enc0.
Note
Even when the enc0 and enc1 encoder networks have symmetric architectures, you can't share parameter values for these networks.

Optional

Valid values: enc0, hcnn, bilstm, or pooled_embedding

• enc0: The network model for the enc0 encoder.
• hcnn: A hierarchical convolutional neural network.
• bilstm: A bidirectional LSTM, in which the signal propagates backward and forward in time. This is an appropriate recurrent neural network (RNN) architecture for sequential learning tasks.
• pooled_embedding: Averages the embeddings of all of the tokens in the input.

Default value: enc0

enc1_cnn_filter_width The filter width of the CNN enc1 encoder.

Conditional

Valid values: 1 ≤ integer ≤ 9

Default value: 3

enc1_freeze_pretrained_embedding Whether to freeze enc1 pretrained embedding weights.

Conditional

Valid values: True or False

Default value: True

enc1_layers The number of layers in the enc1 encoder.

Conditional

Valid values: auto or 1 ≤ integer ≤ 4

• For hcnn, auto means 4.
• For bilstm, auto means 1.
• For pooled_embedding, auto ignores the number of layers.

Default value: auto


enc1_max_seq_len The maximum sequence length for the enc1 encoder.

Conditional

Valid values: 1 ≤ integer ≤ 5000

enc1_pretrained_embedding_file The name of the enc1 pretrained token embedding file in the auxiliary data channel.

Conditional

Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9\.\_]

Default value: "" (empty string)

enc1_token_embedding_dim The output dimension of the enc1 token embedding layer.

Conditional

Valid values: 2 ≤ integer ≤ 1000

Default value: 300

enc1_vocab_file The vocabulary file for mapping pretrained enc1 token embeddings
to vocabulary IDs.

Conditional

Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9\.\_]

Default value: "" (empty string)

enc1_vocab_size The vocabulary size of enc1 tokens.

Conditional

Valid values: 2 ≤ integer ≤ 3000000

epochs The number of epochs to run for training.

Optional

Valid values: 1 ≤ integer ≤ 100

Default value: 30

learning_rate The learning rate for training.

Optional

Valid values: 1.0E-6 ≤ float ≤ 1.0

Default value: 0.0004


mini_batch_size The batch size that the dataset is split into for an optimizer
during training.

Optional

Valid values: 1 ≤ integer ≤ 10000

Default value: 32

mlp_activation The type of activation function for the multilayer perceptron (MLP)
layer.

Optional

Valid values: tanh, relu, or linear

• tanh: Hyperbolic tangent
• relu: Rectified linear unit (ReLU)
• linear: Linear function

Default value: linear

mlp_dim The dimension of the output from MLP layers.

Optional

Valid values: 2 ≤ integer ≤ 10000

Default value: 512

mlp_layers The number of MLP layers in the network.

Optional

Valid values: 0 ≤ integer ≤ 10

Default value: 2


negative_sampling_rate The ratio of negative samples, generated to assist in training the algorithm, to positive samples that are provided by users.
Negative samples represent data that is unlikely to occur in reality
and are labelled negatively for training. They facilitate training
a model to discriminate between the positive samples observed
and the negative samples that are not. To specify the ratio of
negative samples to positive samples used for training, set the
value to a positive integer. For example, if you train the algorithm
on input data in which all of the samples are positive and set
negative_sampling_rate to 2, the Object2Vec algorithm
internally generates two negative samples per positive sample. If
you don't want to generate or use negative samples during training,
set the value to 0.

Optional

Valid values: 0 ≤ integer

Default value: 0 (off)

num_classes The number of classes for classification training. Amazon SageMaker ignores this hyperparameter for regression problems.

Optional

Valid values: 2 ≤ integer ≤ 30

Default value: 2

optimizer The optimizer type.

Optional

Valid values: adadelta, adagrad, adam, sgd, or rmsprop.

• adadelta: A per-dimension learning rate method for gradient descent
• adagrad: The adaptive gradient algorithm
• adam: The adaptive moment estimation algorithm
• sgd: Stochastic gradient descent
• rmsprop: Root mean square propagation

Default value: adam


output_layer The type of output layer where you specify that the task is
regression or classification.

Optional

Valid values: softmax or mean_squared_error

• softmax: The Softmax function used for classification.
• mean_squared_error: The MSE used for regression.

Default value: softmax

tied_token_embedding_weight Whether to use a shared embedding layer for both encoders. If the inputs to both encoders use the same token-level units, use a shared token embedding layer. For example, for a collection of documents, if one encoder encodes sentences and another encodes whole documents, you can use a shared token embedding layer. That's because both sentences and documents are composed of word tokens from the same vocabulary.

Optional

Valid values: True or False

Default value: False

token_embedding_storage_type The mode of gradient update used during training. When dense mode is used, the optimizer calculates the full gradient matrix for the token embedding layer even if most rows of the gradient are zero-valued. When sparse mode is used, the optimizer only stores rows of the gradient that are actually being used in the mini-batch. If you want the algorithm to perform lazy gradient updates, which calculate the gradients only in the non-zero rows and which speed up training, specify row_sparse. Setting the value to row_sparse constrains the values available for other hyperparameters, as follows:

• The optimizer hyperparameter must be set to adam, adagrad, or sgd. Otherwise, the algorithm throws a CustomerValueError.
• The algorithm automatically disables bucketing, setting the bucket_width hyperparameter to 0.

Optional

Valid values: dense or row_sparse

Default value: dense


weight_decay The weight decay parameter used for optimization.

Optional

Valid values: 0 ≤ float ≤ 10000

Default value: 0 (no decay)

Tune an Object2Vec Model


Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. For the objective metric, you
use one of the metrics that the algorithm computes. Automatic model tuning searches the chosen
hyperparameters to find the combination of values that result in the model that optimizes the objective
metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
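
The following is a minimal tuning sketch for Object2Vec; it assumes the o2v Estimator from the earlier example, and the S3 paths, ranges, and job counts are placeholders chosen from the tables that follow.

from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=o2v,
    objective_metric_name="validation:cross_entropy",
    objective_type="Minimize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-6, 1.0),
        "mlp_layers": IntegerParameter(1, 4),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({
    "train": "s3://my-bucket/object2vec/train",            # placeholder paths
    "validation": "s3://my-bucket/object2vec/validation",
})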

Metrics Computed by the Object2Vec Algorithm


The Object2Vec algorithm has both classification and regression metrics. The output_layer type
determines which metric you can use for automatic model tuning.

Regressor Metrics Computed by the Object2Vec Algorithm


The algorithm reports a mean squared error regressor metric, which is computed during testing and
validation. When tuning the model for regression tasks, choose this metric as the objective.

Metric Name Description Optimization Direction

test:mean_squared_error The mean squared error Minimize

validation:mean_squared_error The mean squared error Minimize

Classification Metrics Computed by the Object2Vec Algorithm


The Object2Vec algorithm reports accuracy and cross-entropy classification metrics, which are computed
during test and validation. When tuning the model for classification tasks, choose one of these as the
objective.

Metric Name Description Optimization Direction

test:accuracy Accuracy Maximize

test:cross_entropy Cross-entropy Minimize

validation:accuracy Accuracy Maximize

validation:cross_entropy Cross-entropy Minimize

Tunable Object2Vec Hyperparameters


You can tune the following hyperparameters for the Object2Vec algorithm.


Hyperparameter Name Hyperparameter Type Recommended Ranges and Values

dropout ContinuousParameterRange MinValue: 0.0, MaxValue: 1.0

early_stopping_patience IntegerParameterRange MinValue: 1, MaxValue: 5

early_stopping_tolerance ContinuousParameterRange MinValue: 0.001, MaxValue: 0.1

enc_dim IntegerParameterRange MinValue: 4, MaxValue: 4096

enc0_cnn_filter_width IntegerParameterRange MinValue: 1, MaxValue: 5

enc0_layers IntegerParameterRange MinValue: 1, MaxValue: 4

enc0_token_embedding_dim IntegerParameterRange MinValue: 5, MaxValue: 300

enc1_cnn_filter_width IntegerParameterRange MinValue: 1, MaxValue: 5

enc1_layers IntegerParameterRange MinValue: 1, MaxValue: 4

enc1_token_embedding_dim IntegerParameterRange MinValue: 5, MaxValue: 300

epochs IntegerParameterRange MinValue: 4, MaxValue: 20

learning_rate ContinuousParameterRange MinValue: 1e-6, MaxValue: 1.0

mini_batch_size IntegerParameterRange MinValue: 1, MaxValue: 8192

mlp_activation CategoricalParameterRanges [tanh, relu, linear]

mlp_dim IntegerParameterRange MinValue: 16, MaxValue: 1024

mlp_layers IntegerParameterRange MinValue: 1, MaxValue: 4

optimizer CategoricalParameterRanges [adagrad, adam, rmsprop, sgd, adadelta]

weight_decay ContinuousParameterRange MinValue: 0.0, MaxValue: 1.0


Data Formats for Object2Vec Training


Input: JSON Lines Request Format
Content-type: application/jsonlines

{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15,
69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9,
107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}

The “in0” and “in1” fields are the inputs for encoder0 and encoder1, respectively. The same format is valid for
both classification and regression problems. For regression, the "label" field can accept real-valued
inputs.

Data Formats for Object2Vec Inference


GPU optimization: Classification or Regression
Because GPU memory is scarce, you can set the INFERENCE_PREFERRED_MODE environment variable
to choose whether the classification/regression or the Output: Encoder Embeddings (p. 1437)
inference network is loaded into the GPU. If the majority of your inference is for classification or
regression, specify INFERENCE_PREFERRED_MODE=classification. The following is a Batch
Transform example that uses four ml.p3.2xlarge instances and optimizes for classification/regression
inference:

transformer = o2v.transformer(
    instance_count=4,
    instance_type="ml.p3.2xlarge",
    max_concurrent_transforms=2,
    max_payload=1,  # 1 MB
    strategy="MultiRecord",
    env={"INFERENCE_PREFERRED_MODE": "classification"},  # only useful with GPU
    output_path=output_s3_path,
)

Input: Classification or Regression Request Format


Content-type: application/json

{
"instances" : [
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821,
4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]},
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4],
"in1": [22, 32, 13, 25, 1016, 573, 3252, 4]},
{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
]
}

Content-type: application/jsonlines

{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4],
"in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1":
[22, 32, 13, 25, 1016, 573, 3252, 4]}
{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}

For classification problems, the length of the scores vector corresponds to num_classes. For regression
problems, the length is 1.


Output: Classification or Regression Response Format


Accept: application/json

{
"predictions": [
{
"scores": [
0.6533935070037842,
0.07582679390907288,
0.2707797586917877
]
},
{
"scores": [
0.026291321963071823,
0.6577019095420837,
0.31600672006607056
]
}
]
}

Accept: application/jsonlines

{"scores":[0.195667684078216,0.395351558923721,0.408980727195739]}
{"scores":[0.251988261938095,0.258233487606048,0.489778339862823]}
{"scores":[0.280087798833847,0.368331134319305,0.351581096649169]}

In both the classification and regression formats, the scores apply to individual labels.

Encoder Embeddings for Object2Vec

GPU optimization: Encoder Embeddings


An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

Because GPU memory is scarce, you can set the INFERENCE_PREFERRED_MODE environment variable
to choose whether the Inference Formats: Scoring (p. 1435) or the encoder embedding inference
network is loaded into the GPU. If the majority of your inference is for encoder embeddings, specify
INFERENCE_PREFERRED_MODE=embedding. The following is a Batch Transform example that uses four
ml.p3.2xlarge instances and optimizes for encoder embedding inference:

transformer = o2v.transformer(
    instance_count=4,
    instance_type="ml.p3.2xlarge",
    max_concurrent_transforms=2,
    max_payload=1,  # 1 MB
    strategy="MultiRecord",
    env={"INFERENCE_PREFERRED_MODE": "embedding"},  # only useful with GPU
    output_path=output_s3_path,
)

Input: Encoder Embeddings


Content-type: application/json; infer_max_seqlens=<FWD-LENGTH>,<BCK-LENGTH>

Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] and define the maximum
sequence lengths for the forward and backward encoder.

{
"instances" : [


{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821,
4]},
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]},
{"in0": [774, 14, 21, 206]}
]
}

Content-type: application/jsonlines; infer_max_seqlens=<FWD-LENGTH>,<BCK-LENGTH>

Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] and define the maximum
sequence lengths for the forward and backward encoder.

{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]}
{"in0": [774, 14, 21, 206]}

In both of these formats, you specify only one input type: “in0” or “in1.” The inference service then
invokes the corresponding encoder and outputs the embeddings for each of the instances.

Output: Encoder Embeddings


Content-type: application/json

{
"predictions": [
{"embeddings":
[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.0036375711
{"embeddings":
[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133
]
}

Content-type: application/jsonlines

{"embeddings":
[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.0036375711
{"embeddings":
[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133

The vector length of the embeddings output by the inference service is equal to the value of one of
the following hyperparameters that you specify at training time: enc0_token_embedding_dim,
enc1_token_embedding_dim, or enc_dim.

Sequence-to-Sequence Algorithm
Amazon SageMaker Sequence to Sequence is a supervised learning algorithm where the input is a
sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.
Example applications include: machine translation (input a sentence from one language and predict what
that sentence would be in another language), text summarization (input a longer string of words and
predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output
sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural
networks that show a significant performance boost over previous methodologies. Amazon SageMaker
seq2seq uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with
attention as encoder-decoder architectures.

Topics
• Input/Output Interface for the Sequence-to-Sequence Algorithm (p. 1438)
• EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm (p. 1439)
• Sequence-to-Sequence Sample Notebooks (p. 1439)


• How Sequence-to-Sequence Works (p. 1439)


• Sequence-to-Sequence Hyperparameters (p. 1440)
• Tune a Sequence-to-Sequence Model (p. 1448)

Input/Output Interface for the Sequence-to-Sequence Algorithm


Training

SageMaker seq2seq expects data in RecordIO-Protobuf format. However, the tokens are expected as
integers, not as floating points, as is usually the case.

A script to convert data from tokenized text files to the protobuf format is included in the seq2seq
example notebook. In general, it packs the data into 32-bit integer tensors and generates the necessary
vocabulary files, which are needed for metric calculation and inference.

After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three
channels:

• train: It should contain the training data (for example, the train.rec file generated by the
preprocessing script).
• validation: It should contain the validation data (for example, the val.rec file generated by the
preprocessing script).
• vocab: It should contain two vocabulary files (vocab.src.json and vocab.trg.json)

If the algorithm doesn't find data in any of these three channels, training results in an error.
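
The following is a sketch of wiring up these channels with the SageMaker Python SDK; the instance type and S3 prefixes are placeholders.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Retrieve the seq2seq container image for the current AWS Region.
container = image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq_estimator = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)

# The three required channels; training fails if any of them is missing.
data_channels = {
    "train": TrainingInput("s3://my-bucket/seq2seq/train.rec"),
    "validation": TrainingInput("s3://my-bucket/seq2seq/val.rec"),
    "vocab": TrainingInput("s3://my-bucket/seq2seq/vocab/"),
}

seq2seq_estimator.fit(data_channels)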

Inference

For hosted endpoints, inference supports two data formats. To perform inference using space separated
text tokens, use the application/json format. Otherwise, use the recordio-protobuf format to
work with the integer encoded data. Both modes support batching of input data. application/json
format also allows you to visualize the attention matrix.

• application/json: Expects the input in JSON format and returns the output in JSON format. Both
content and accept types should be application/json. Each sequence is expected to be a string
with whitespace separated tokens. This format is recommended when the number of source sequences
in the batch is small. It also supports the following additional configuration options:

configuration: {attention_matrix: true}: Returns the attention matrix for the particular input
sequence.
• application/x-recordio-protobuf: Expects the input in recordio-protobuf format and
returns the output in recordio-protobuf format. Both content and accept types should be
application/x-recordio-protobuf. For this format, the source sequences must be converted
into a list of integers for subsequent protobuf encoding. This format is recommended for bulk
inference.

For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON
Lines format and returns the output in JSON Lines format. Both content and accept types should be
application/jsonlines. The format for input is as follows:

content-type: application/jsonlines

{"source": "source_sequence_0"}
{"source": "source_sequence_1"}

The format for response is as follows:


accept: application/jsonlines

{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}

For additional details on how to serialize and deserialize the inputs and outputs to specific formats for
inference, see the Sequence-to-Sequence Sample Notebooks (p. 1439).

EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm


The Amazon SageMaker seq2seq algorithm only supports GPU instance types and can only train on a
single machine. However, you can use instances with multiple GPUs. The seq2seq algorithm supports
the P2, P3, G4dn, and G5 GPU instance families.

Sequence-to-Sequence Sample Notebooks


For a sample notebook that shows how to use the SageMaker Sequence to Sequence algorithm to train
an English-German translation model, see Machine Translation English-German Example Using SageMaker
Seq2Seq. For instructions on how to create and access Jupyter notebook instances that you can use to
run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you have
created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the
SageMaker samples. The Seq2Seq example notebook is located in the Introduction to Amazon algorithms
section. To open a notebook, click on its Use tab and select Create copy.

How Sequence-to-Sequence Works


Typically, a neural network for sequence-to-sequence modeling consists of a few layers, including:

• An embedding layer. In this layer, the input matrix, which consists of input tokens encoded in a sparse
way (for example, one-hot encoded), is mapped to a dense feature layer. This is required because a high-
dimensional feature vector is more capable of encoding information regarding a particular token (a word
for text corpora) than a simple one-hot-encoded vector. It is also a standard practice to initialize this
embedding layer with a pretrained word vector like FastText or GloVe, or to initialize it randomly and
learn the parameters during training.
• An encoder layer. After the input tokens are mapped into a high-dimensional feature space,
the sequence is passed through an encoder layer to compress all the information from the input
embedding layer (of the entire sequence) into a fixed-length feature vector. Typically, an encoder is
made of RNN-type networks like long short-term memory (LSTM) or gated recurrent units (GRU).
(Colah's blog explains LSTM in great detail.)
• A decoder layer. The decoder layer takes this encoded feature vector and produces the output
sequence of tokens. This layer is also usually built with RNN architectures (LSTM and GRU).

The whole model is trained jointly to maximize the probability of the target sequence given the source
sequence. This model was first introduced by Sutskever et al. in 2014.

Attention mechanism. The disadvantage of an encoder-decoder framework is that model performance
decreases as the length of the source sequence increases, because of the limit on how much
information the fixed-length encoded feature vector can contain. To tackle this problem, in 2015,
Bahdanau et al. proposed the attention mechanism. In an attention mechanism, the decoder tries to find
the location in the encoder sequence where the most important information could be located and uses
that information and previously decoded words to predict the next token in the sequence.

For more details, see the whitepaper Effective Approaches to Attention-based Neural Machine
Translation by Luong et al., which explains and simplifies calculations for various attention mechanisms.
Additionally, the whitepaper Google's Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation by Wu, et al. describes Google's architecture for machine translation,
which uses skip connections between encoder and decoder layers.


Sequence-to-Sequence Hyperparameters

Parameter Name Description

batch_size Mini batch size for gradient descent.

Optional

Valid values: positive integer

Default value: 64

beam_size Length of the beam for beam search. Used during training
for computing bleu and used during inference.

Optional

Valid values: positive integer

Default value: 5

bleu_sample_size Number of instances to pick from validation dataset to decode and compute bleu score during training. Set to -1 to use full validation set (if bleu is chosen as optimized_metric).

Optional

Valid values: integer

Default value: 0

bucket_width Returns (source,target) buckets up to (max_seq_len_source, max_seq_len_target). The longer side of the data uses steps of bucket_width while the shorter side uses steps scaled down by the average target/source length ratio. If one side reaches its maximum length before the other, the width of the extra buckets on that side is fixed to that side of max_len.

Optional

Valid values: positive integer

Default value: 10

bucketing_enabled Set to false to disable bucketing and unroll to maximum length.

Optional

Valid values: true or false

Default value: true

checkpoint_frequency_num_batches Checkpoint and evaluate every x batches. This checkpointing hyperparameter is passed to SageMaker's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped.

Optional

Valid values: positive integer

Default value: 1000

checkpoint_threshold Maximum number of checkpoints the model is allowed to not improve in optimized_metric on the validation dataset before training is stopped. This checkpointing hyperparameter is passed to SageMaker's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped.

Optional

Valid values: positive integer

Default value: 3

clip_gradient Clip absolute gradient values greater than this. Set to a negative value to disable.

Optional

Valid values: float

Default value: 1

cnn_activation_type The cnn activation type to be used.

Optional

Valid values: String. One of glu, relu, softrelu, sigmoid, or tanh.

Default value: glu

cnn_hidden_dropout Dropout probability for dropout between convolutional layers.

Optional

Valid values: Float. Range in [0,1].

Default value: 0


cnn_kernel_width_decoder Kernel width for the cnn decoder.

Optional

Valid values: positive integer

Default value: 5

cnn_kernel_width_encoder Kernel width for the cnn encoder.

Optional

Valid values: positive integer

Default value: 3

cnn_num_hidden Number of cnn hidden units for encoder and decoder.

Optional

Valid values: positive integer

Default value: 512

decoder_type Decoder type.

Optional

Valid values: String. Either rnn or cnn.

Default value: rnn

embed_dropout_source Dropout probability for source side embeddings.

Optional

Valid values: Float. Range in [0,1].

Default value: 0

embed_dropout_target Dropout probability for target side embeddings.

Optional

Valid values: Float. Range in [0,1].

Default value: 0

encoder_type Encoder type. The rnn architecture is based on the attention mechanism by Bahdanau et al. and the cnn architecture is based on Gehring et al.

Optional

Valid values: String. Either rnn or cnn.

Default value: rnn


fixed_rate_lr_half_life Half life for learning rate in terms of number of checkpoints for fixed_rate_* schedulers.

Optional

Valid values: positive integer

Default value: 10

learning_rate Initial learning rate.

Optional

Valid values: float

Default value: 0.0003

loss_type Loss function for training.

Optional

Valid values: String. cross-entropy

Default value: cross-entropy

lr_scheduler_type Learning rate scheduler type. plateau_reduce means reduce the learning rate whenever optimized_metric on validation_accuracy plateaus. inv_t is inverse time decay: learning_rate/(1+decay_rate*t).

Optional

Valid values: String. One of plateau_reduce, fixed_rate_inv_t, or fixed_rate_inv_sqrt_t.

Default value: plateau_reduce

max_num_batches Maximum number of updates/batches to process. -1 for infinite.

Optional

Valid values: integer

Default value: -1

max_num_epochs Maximum number of epochs to pass through training data before fitting is stopped. If this parameter is passed, training continues until this number of epochs even if validation accuracy is not improving. Ignored if not passed.

Optional

Valid values: Positive integer and less than or equal to max_num_epochs.

Default value: none


max_seq_len_source Maximum length for the source sequence. Sequences longer than this length are truncated to this length.

Optional

Valid values: positive integer

Default value: 100

max_seq_len_target Maximum length for the target sequence. Sequences longer than this length are truncated to this length.

Optional

Valid values: positive integer

Default value: 100

min_num_epochs Minimum number of epochs the training must run before it is stopped via early_stopping conditions.

Optional

Valid values: positive integer

Default value: 0

momentum Momentum constant used for sgd. Don't pass this parameter if you are using adam or rmsprop.

Optional

Valid values: float

Default value: none

num_embed_source Embedding size for source tokens.

Optional

Valid values: positive integer

Default value: 512

num_embed_target Embedding size for target tokens.

Optional

Valid values: positive integer

Default value: 512


num_layers_decoder Number of layers for Decoder rnn or cnn.

Optional

Valid values: positive integer

Default value: 1

num_layers_encoder Number of layers for Encoder rnn or cnn.

Optional

Valid values: positive integer

Default value: 1

optimized_metric Metrics to optimize with early stopping.

Optional

Valid values: String. One of perplexity, accuracy, or bleu.

Default value: perplexity

optimizer_type Optimizer to choose from.

Optional

Valid values: String. One of adam, sgd, or rmsprop.

Default value: adam

plateau_reduce_lr_factor Factor to multiply learning rate with (for plateau_reduce).

Optional

Valid values: float

Default value: 0.5

plateau_reduce_lr_threshold For plateau_reduce scheduler, multiply learning rate with reduce factor if optimized_metric didn't improve for this many checkpoints.

Optional

Valid values: positive integer

Default value: 3


rnn_attention_in_upper_layers Pass the attention to upper layers of rnn, like Google NMT
paper. Only applicable if more than one layer is used.

Optional

Valid values: boolean (true or false)

Default value: true

rnn_attention_num_hidden Number of hidden units for attention layers. Defaults to rnn_num_hidden.

Optional

Valid values: positive integer

Default value: rnn_num_hidden

rnn_attention_type Attention model for encoders. mlp refers to concat and bilinear refers to general from the Luong et al. paper.

Optional

Valid values: String. One of dot, fixed, mlp, or bilinear.

Default value: mlp

rnn_cell_type Specific type of rnn architecture.

Optional

Valid values: String. Either lstm or gru.

Default value: lstm

rnn_decoder_state_init How to initialize rnn decoder states from encoders.

Optional

Valid values: String. One of last, avg, or zero.

Default value: last

rnn_first_residual_layer First rnn layer to have a residual connection; only applicable if the number of layers in the encoder or decoder is more than 1.

Optional

Valid values: positive integer

Default value: 2


rnn_num_hidden The number of rnn hidden units for encoder and decoder. This must be a multiple of 2 because the algorithm uses bidirectional long short-term memory (LSTM) by default.

Optional

Valid values: positive even integer

Default value: 1024

rnn_residual_connections Add residual connection to stacked rnn. The number of layers should be more than 1.

Optional

Valid values: boolean (true or false)

Default value: false

rnn_decoder_hidden_dropout Dropout probability for hidden state that combines the context with the rnn hidden state in the decoder.

Optional

Valid values: Float. Range in [0,1].

Default value: 0

training_metric Metric to track during training on the validation data.

Optional

Valid values: String. Either perplexity or accuracy.

Default value: perplexity

weight_decay Weight decay constant.

Optional

Valid values: float

Default value: 0

weight_init_scale Weight initialization scale (for uniform and xavier initialization).

Optional

Valid values: float

Default value: 2.34


weight_init_type Type of weight initialization.

Optional

Valid values: String. Either uniform or xavier.

Default value: xavier

xavier_factor_type Xavier factor type.

Optional

Valid values: String. One of in, out, or avg.

Default value: in

Tune a Sequence-to-Sequence Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).
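
The following is a minimal tuning sketch for seq2seq; it assumes the seq2seq_estimator and data_channels from the earlier sketch, and the ranges and job counts are placeholders drawn from the tables that follow.

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

tuner = HyperparameterTuner(
    estimator=seq2seq_estimator,
    objective_metric_name="validation:bleu",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.00005, 0.2),
        "optimizer_type": CategoricalParameter(["adam", "sgd", "rmsprop"]),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit(data_channels)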

Metrics Computed by the Sequence-to-Sequence Algorithm

The sequence to sequence algorithm reports three metrics that are computed during training. Choose
one of them as an objective to optimize when tuning the hyperparameter values.

Metric Name Description Optimization Direction

validation:accuracy Accuracy computed on the validation dataset. Maximize

validation:bleu BLEU score computed on the validation dataset. Maximize

Because BLEU computation is expensive, you can choose to compute BLEU on a random subsample of the validation dataset to speed up the overall training process. Use the bleu_sample_size parameter to specify the subsample.

validation:perplexity Perplexity is a loss function computed on the validation dataset. Perplexity measures the cross-entropy between an empirical sample and the distribution predicted by a model, and so provides a measure of how well a model predicts the sample values. Models that are good at predicting a sample have a low perplexity. Minimize

Tunable Sequence-to-Sequence Hyperparameters

You can tune the following hyperparameters for the SageMaker Sequence to Sequence algorithm.
The hyperparameters that have the greatest impact on sequence to sequence objective metrics are: batch_size, optimizer_type, learning_rate, num_layers_encoder, and num_layers_decoder.

Parameter Name Parameter Type Recommended Ranges

num_layers_encoder IntegerParameterRange [1-10]

num_layers_decoder IntegerParameterRange [1-10]

batch_size CategoricalParameterRange [16,32,64,128,256,512,1024,2048]

optimizer_type CategoricalParameterRange ['adam', 'sgd', 'rmsprop']

weight_init_type CategoricalParameterRange ['xavier', 'uniform']

weight_init_scale ContinuousParameterRange For the xavier type: MinValue: 2.0, MaxValue: 3.0; for the uniform type: MinValue: -1.0, MaxValue: 1.0

learning_rate ContinuousParameterRange MinValue: 0.00005, MaxValue: 0.2

weight_decay ContinuousParameterRange MinValue: 0.0, MaxValue: 0.1

momentum ContinuousParameterRange MinValue: 0.5, MaxValue: 0.9

clip_gradient ContinuousParameterRange MinValue: 1.0, MaxValue: 5.0

rnn_num_hidden CategoricalParameterRange Applicable only to recurrent neural networks (RNNs). [128,256,512,1024,2048]

cnn_num_hidden CategoricalParameterRange Applicable only to convolutional neural networks (CNNs). [128,256,512,1024,2048]

num_embed_source IntegerParameterRange [256-512]

num_embed_target IntegerParameterRange [256-512]

embed_dropout_source ContinuousParameterRange MinValue: 0.0, MaxValue: 0.5

embed_dropout_target ContinuousParameterRange MinValue: 0.0, MaxValue: 0.5

rnn_decoder_hidden_dropout ContinuousParameterRange MinValue: 0.0, MaxValue: 0.5

cnn_hidden_dropout ContinuousParameterRange MinValue: 0.0, MaxValue: 0.5


lr_scheduler_type CategoricalParameterRange ['plateau_reduce', 'fixed_rate_inv_t', 'fixed_rate_inv_sqrt_t']

plateau_reduce_lr_factor ContinuousParameterRange MinValue: 0.1, MaxValue: 0.5

plateau_reduce_lr_threshold IntegerParameterRange [1-5]

fixed_rate_lr_half_life IntegerParameterRange [10-30]

Text Classification - TensorFlow


The Amazon SageMaker Text Classification - TensorFlow algorithm is a supervised learning algorithm
that supports transfer learning with many pretrained models from the TensorFlow Hub. Use transfer
learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount
of text data is not available. The text classification algorithm takes a text string as input and outputs a
probability for each of the class labels. Training datasets must be in CSV format.

Topics
• How to use the SageMaker Text Classification - TensorFlow algorithm (p. 1450)
• Input and output interface for the Text Classification - TensorFlow algorithm (p. 1451)
• Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm (p. 1452)
• Text Classification - TensorFlow sample notebooks (p. 1453)
• How Text Classification - TensorFlow Works (p. 1453)
• TensorFlow Hub Models (p. 1453)
• Text Classification - TensorFlow Hyperparameters (p. 1456)
• Tune a Text Classification - TensorFlow model (p. 1459)

How to use the SageMaker Text Classification - TensorFlow algorithm

You can use Text Classification - TensorFlow as an Amazon SageMaker built-in algorithm. The following
section describes how to use Text Classification - TensorFlow with the SageMaker Python SDK. For
information on how to use Text Classification - TensorFlow from the Amazon SageMaker Studio UI, see
SageMaker JumpStart (p. 47).

The Text Classification - TensorFlow algorithm supports transfer learning using any of the compatible
pretrained TensorFlow models. For a list of all available pretrained models, see TensorFlow Hub
Models (p. 1453). Every pretrained model has a unique model_id. The following example uses BERT
Base Uncased (model_id: tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2) to fine-tune
on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and
stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated
model training artifacts to construct a SageMaker Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the
hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and
their default values with hyperparameters.retrieve_default. For more information, see Text
Classification - TensorFlow Hyperparameters (p. 1456). Use these values to construct a SageMaker
Estimator.
Note
Default hyperparameter values are different for different models. For example, for larger
models, the default batch size is smaller.


This example uses the SST2 dataset, which contains positive and negative movie reviews. We pre-
downloaded the dataset and made it available with Amazon S3. To fine-tune your model, call .fit using
the Amazon S3 location of your training dataset. Any S3 bucket used in a notebook must be in the same
AWS Region as the notebook instance that accesses it.

import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Session, Region, and execution role used by the training job
sess = sagemaker.Session()
aws_region = sess.boto_region_name
aws_role = sagemaker.get_execution_role()

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST2/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create an Estimator instance
tf_tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a training job
tf_tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)

For more information about how to use the SageMaker Text Classification - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to JumpStart - Text Classification notebook.

Input and output interface for the Text Classification - TensorFlow algorithm
Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset made
up of text sentences with any number of classes. The pretrained model attaches a classification layer to


the Text Embedding model and initializes the layer parameters to random values. The output dimension
of the classification layer is determined based on the number of classes detected in the input data.

Be mindful of how to format your training data for input to the Text Classification - TensorFlow model.

• Training data input format: A directory containing a data.csv file. Each row of the first column
should have integer class labels between 0 and the number of classes. Each row of the second column
should have the corresponding text data.

The following is an example of an input CSV file. Note that the file should not have any
header. The file should be hosted in an Amazon S3 bucket with a path similar to the following:
s3://bucket_name/input_directory/. Note that the trailing / is required.

| | |
|---|---|
|0 |hide new secretions from the parental units|
|0 |contains no wit , only labored gags|
|1 |that loves its characters and communicates something rather beautiful about human
nature|
|...|...|
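
The following sketch produces such a headerless two-column CSV with pandas; the file name is a placeholder and the rows mirror the example above.

import pandas as pd

# Column 0: integer class label; column 1: the corresponding text.
df = pd.DataFrame(
    {
        "label": [0, 0, 1],
        "text": [
            "hide new secretions from the parental units",
            "contains no wit , only labored gags",
            "that loves its characters and communicates something rather beautiful about human nature",
        ],
    }
)

# Write without a header or index, then upload the file to
# s3://bucket_name/input_directory/data.csv
df.to_csv("data.csv", header=False, index=False)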

Incremental training
You can seed the training of a new model with artifacts from a model that you trained previously with
SageMaker. Incremental training saves training time when you want to train a new model with the same
or similar data.
Note
You can only seed a SageMaker Text Classification - TensorFlow model with another Text
Classification - TensorFlow model trained in SageMaker.

You can use any dataset for incremental training, as long as the set of classes remains the same. The
incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained
model, you start with an existing fine-tuned model.

For more information on using incremental training with the SageMaker Text Classification - TensorFlow
algorithm, see the Introduction to JumpStart - Text Classification sample notebook.
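
The following sketch illustrates the idea: it reuses the image, script, and settings from the fine-tuning example above, but points model_uri at the artifacts of a previous training job. The S3 path is a placeholder.

# Start from an existing fine-tuned model instead of the pretrained one.
incremental_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri="s3://my-bucket/previous-job/output/model.tar.gz",  # placeholder
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

incremental_estimator.fit({"training": training_dataset_s3_path}, logs=True)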

Inference with the Text Classification - TensorFlow algorithm


You can host the fine-tuned model that results from your TensorFlow Text Classification training for
inference. Any raw text formats for inference must be content type application/x-text.

Running inference results in probability values, class labels for all classes, and the predicted label
corresponding to the class index with the highest probability encoded in JSON format. The Text
Classification - TensorFlow model processes a single string per request and outputs only one line. The
following is an example of a JSON format response:

accept: application/json;verbose

{"probabilities": [prob_0, prob_1, prob_2, ...],


"labels": [label_0, label_1, label_2, ...],
"predicted_label": predicted_label}

If accept is set to application/json, then the model only outputs probabilities.
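
The following sketch invokes a hosted endpoint with a raw text string; the endpoint name is a placeholder.

import boto3

runtime = boto3.client("sagemaker-runtime")

# Send one raw text string per request.
response = runtime.invoke_endpoint(
    EndpointName="my-text-classifier-endpoint",  # placeholder endpoint name
    ContentType="application/x-text",
    Accept="application/json;verbose",
    Body="that loves its characters and communicates something rather beautiful",
)

print(response["Body"].read().decode())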

Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm
The Text Classification - TensorFlow algorithm supports all CPU and GPU instances for training,
including:


• ml.p2.xlarge
• ml.p2.16xlarge
• ml.p3.2xlarge
• ml.p3.16xlarge
• ml.g4dn.xlarge
• ml.g4dn.16xlarge
• ml.g5.xlarge
• ml.g5.48xlarge

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such
as M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference. For a comprehensive list of
SageMaker training and inference instances across AWS Regions, see Amazon SageMaker Pricing.

Text Classification - TensorFlow sample notebooks

For more information about how to use the SageMaker Text Classification - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to JumpStart - Text Classification notebook.

For instructions how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.

How Text Classification - TensorFlow Works

The Text Classification - TensorFlow algorithm takes text and classifies it into one of the output class
labels. Deep learning networks such as BERT are highly accurate for text classification. There are also
deep learning networks that are trained on large text datasets, such as TextNet, which has more than 11
million texts with about 11,000 categories. After a network is trained with TextNet data, you can then
fine-tune the network on a dataset with a particular focus to perform more specific text classification
tasks. The Amazon SageMaker Text Classification - TensorFlow algorithm supports transfer learning on
many pretrained models that are available in the TensorFlow Hub.

According to the number of class labels in your training data, a text classification layer is attached to
the pretrained TensorFlow model of your choice. The classification layer consists of a dropout layer,
a dense layer, and a fully connected layer with 2-norm regularization, and is initialized with random
weights. You can change the hyperparameter values for the dropout rate of the dropout layer and the L2
regularization factor for the dense layer.

You can fine-tune either the entire network (including the pretrained model) or only the top
classification layer on new training data. With this method of transfer learning, training with smaller
datasets is possible.

TensorFlow Hub Models

The following pretrained models are available to use for transfer learning with the Text Classification -
TensorFlow algorithm.

The following models vary significantly in size, number of model parameters, training time, and
inference latency for any given dataset. The best model for your use case depends on the complexity
of your fine-tuning dataset and any requirements that you have on training time, inference latency, or
model accuracy.

Model Name | model_id | Source

BERT Base Uncased | tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2 | TensorFlow Hub link
BERT Base Cased | tensorflow-tc-bert-en-cased-L-12-H-768-A-12-2 | TensorFlow Hub link
BERT Base Multilingual Cased | tensorflow-tc-bert-multi-cased-L-12-H-768-A-12-2 | TensorFlow Hub link
Small BERT L-2_H-128_A-2 | tensorflow-tc-small-bert-bert-en-uncased-L-2-H-128-A-2 | TensorFlow Hub link
Small BERT L-2_H-256_A-4 | tensorflow-tc-small-bert-bert-en-uncased-L-2-H-256-A-4 | TensorFlow Hub link
Small BERT L-2_H-512_A-8 | tensorflow-tc-small-bert-bert-en-uncased-L-2-H-512-A-8 | TensorFlow Hub link
Small BERT L-2_H-768_A-12 | tensorflow-tc-small-bert-bert-en-uncased-L-2-H-768-A-12 | TensorFlow Hub link
Small BERT L-4_H-128_A-2 | tensorflow-tc-small-bert-bert-en-uncased-L-4-H-128-A-2 | TensorFlow Hub link
Small BERT L-4_H-256_A-4 | tensorflow-tc-small-bert-bert-en-uncased-L-4-H-256-A-4 | TensorFlow Hub link
Small BERT L-4_H-512_A-8 | tensorflow-tc-small-bert-bert-en-uncased-L-4-H-512-A-8 | TensorFlow Hub link
Small BERT L-4_H-768_A-12 | tensorflow-tc-small-bert-bert-en-uncased-L-4-H-768-A-12 | TensorFlow Hub link
Small BERT L-6_H-128_A-2 | tensorflow-tc-small-bert-bert-en-uncased-L-6-H-128-A-2 | TensorFlow Hub link
Small BERT L-6_H-256_A-4 | tensorflow-tc-small-bert-bert-en-uncased-L-6-H-256-A-4 | TensorFlow Hub link
Small BERT L-6_H-512_A-8 | tensorflow-tc-small-bert-bert-en-uncased-L-6-H-512-A-8 | TensorFlow Hub link
Small BERT L-6_H-768_A-12 | tensorflow-tc-small-bert-bert-en-uncased-L-6-H-768-A-12 | TensorFlow Hub link
Small BERT L-8_H-128_A-2 | tensorflow-tc-small-bert-bert-en-uncased-L-8-H-128-A-2 | TensorFlow Hub link
Small BERT L-8_H-256_A-4 | tensorflow-tc-small-bert-bert-en-uncased-L-8-H-256-A-4 | TensorFlow Hub link
Small BERT L-8_H-512_A-8 | tensorflow-tc-small-bert-bert-en-uncased-L-8-H-512-A-8 | TensorFlow Hub link
Small BERT L-8_H-768_A-12 | tensorflow-tc-small-bert-bert-en-uncased-L-8-H-768-A-12 | TensorFlow Hub link
Small BERT L-10_H-128_A-2 | tensorflow-tc-small-bert-bert-en-uncased-L-10-H-128-A-2 | TensorFlow Hub link
Small BERT L-10_H-256_A-4 | tensorflow-tc-small-bert-bert-en-uncased-L-10-H-256-A-4 | TensorFlow Hub link
Small BERT L-10_H-512_A-8 | tensorflow-tc-small-bert-bert-en-uncased-L-10-H-512-A-8 | TensorFlow Hub link
Small BERT L-10_H-768_A-12 | tensorflow-tc-small-bert-bert-en-uncased-L-10-H-768-A-12 | TensorFlow Hub link
Small BERT L-12_H-128_A-2 | tensorflow-tc-small-bert-bert-en-uncased-L-12-H-128-A-2 | TensorFlow Hub link
Small BERT L-12_H-256_A-4 | tensorflow-tc-small-bert-bert-en-uncased-L-12-H-256-A-4 | TensorFlow Hub link
Small BERT L-12_H-512_A-8 | tensorflow-tc-small-bert-bert-en-uncased-L-12-H-512-A-8 | TensorFlow Hub link
Small BERT L-12_H-768_A-12 | tensorflow-tc-small-bert-bert-en-uncased-L-12-H-768-A-12 | TensorFlow Hub link
BERT Large Uncased | tensorflow-tc-bert-en-uncased-L-24-H-1024-A-16-2 | TensorFlow Hub link
BERT Large Cased | tensorflow-tc-bert-en-cased-L-24-H-1024-A-16-2 | TensorFlow Hub link
BERT Large Uncased Whole Word Masking | tensorflow-tc-bert-en-wwm-uncased-L-24-H-1024-A-16-2 | TensorFlow Hub link
BERT Large Cased Whole Word Masking | tensorflow-tc-bert-en-wwm-cased-L-24-H-1024-A-16-2 | TensorFlow Hub link
ALBERT Base | tensorflow-tc-albert-en-base | TensorFlow Hub link
ELECTRA Small++ | tensorflow-tc-electra-small-1 | TensorFlow Hub link
ELECTRA Base | tensorflow-tc-electra-base-1 | TensorFlow Hub link
BERT Base Wikipedia and BooksCorpus | tensorflow-tc-experts-bert-wiki-books-1 | TensorFlow Hub link
BERT Base MEDLINE/PubMed | tensorflow-tc-experts-bert-pubmed-1 | TensorFlow Hub link
Talking Heads Base | tensorflow-tc-talking-heads-base | TensorFlow Hub link
Talking Heads Large | tensorflow-tc-talking-heads-large | TensorFlow Hub link

Text Classification - TensorFlow Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Text Classification -
TensorFlow algorithm. See Tune a Text Classification - TensorFlow model (p. 1459) for information on
hyperparameter tuning.

Parameter Name Description

batch_size The batch size for training. For training on instances with multiple
GPUs, this batch size is used across the GPUs.

Valid values: positive integer.

Default value: 32.

beta_1 The beta1 for the "adam" and "adamw" optimizers. Represents the
exponential decay rate for the first moment estimates. Ignored for
other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.9.

beta_2 The beta2 for the "adam" and "adamw" optimizers. Represents the
exponential decay rate for the second moment estimates. Ignored
for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.999.

dropout_rate The dropout rate for the dropout layer in the top classification
layer. Used only when reinitialize_top_layer is set to
"True".

Valid values: float, range: [0.0, 1.0].

Default value: 0.2

early_stopping Set to "True" to use early stopping logic during training. If "False", early
stopping is not used.

Valid values: string, either: ("True" or "False").

Default value: "False".

early_stopping_min_delta The minimum change needed to qualify as an improvement. An absolute
change less than the value of early_stopping_min_delta does not qualify as improvement. Used
only when early_stopping is set to "True".

Valid values: float, range: [0.0, 1.0].

Default value: 0.0.

early_stopping_patience The number of epochs to continue training with no improvement. Used
only when early_stopping is set to "True".

Valid values: positive integer.

Default value: 5.

epochs The number of training epochs.

Valid values: positive integer.

Default value: 10.

epsilon The epsilon for "adam", "rmsprop", "adadelta", and "adagrad" optimizers. Usually set
to a small value to avoid division by 0. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 1e-7.


initial_accumulator_value The starting value for the accumulators, or the per-parameter
momentum values, for the "adagrad" optimizer. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.0001.

learning_rate The optimizer learning rate.

Valid values: float, range: [0.0, 1.0].

Default value: 0.001.

momentum The momentum for the "sgd" and "nesterov" optimizers. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.9.

optimizer The optimizer type. For more information, see Optimizers in the
TensorFlow documentation.

Valid values: string, any of the following: ("adamw", "adam", "sgd", "nesterov", "rmsprop",
"adagrad", "adadelta").

Default value: "adam".

regularizers_l2 The L2 regularization factor for the dense layer in the classification
layer. Used only when reinitialize_top_layer is set to
"True".

Valid values: float, range: [0.0, 1.0].

Default value: 0.0001.

reinitialize_top_layer If set to "Auto", the top classification layer parameters are re-initialized
during fine-tuning. For incremental training, top classification layer parameters are not re-initialized
unless set to "True".

Valid values: string, any of the following: ("Auto", "True", or "False").

Default value: "Auto".

rho The discounting factor for the gradient of the "adadelta" and
"rmsprop" optimizers. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.95.


train_only_on_top_layer If "True", only the top classification layer parameters are fine-tuned. If
"False", all model parameters are fine-tuned.

Valid values: string, either: ("True" or "False").

Default value: "False".

validation_split_ratio The fraction of training data to randomly split to create validation data.
Only used if validation data is not provided through the validation channel.

Valid values: float, range: [0.0, 1.0].

Default value: 0.2.

warmup_steps_fraction The fraction of the total number of gradient update steps, where
the learning rate increases from 0 to the initial learning rate as a
warm up. Only used with the adamw optimizer.

Valid values: float, range: [0.0, 1.0].

Default value: 0.1.

Tune a Text Classification - TensorFlow model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics computed by the Text Classification - TensorFlow algorithm

Refer to the following chart to find which metrics are computed by the Text Classification - TensorFlow
algorithm.

Metric Name | Description | Optimization Direction | Regex Pattern
validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | val_accuracy=([0-9\.]+)

Tunable Text Classification - TensorFlow hyperparameters

Tune a text classification model with the following hyperparameters. The hyperparameters that have
the greatest impact on text classification objective metrics are: batch_size, learning_rate, and
optimizer. Tune the optimizer-related hyperparameters, such as momentum, regularizers_l2,
beta_1, beta_2, and eps based on the selected optimizer. For example, use beta_1 and beta_2
only when adamw or adam is the optimizer.


For more information about which hyperparameters are used for each optimizer, see Text
Classification - TensorFlow Hyperparameters (p. 1456). A sketch of launching a tuning job over these
ranges follows the table.

Parameter Name | Parameter Type | Recommended Ranges
batch_size | IntegerParameterRanges | MinValue: 4, MaxValue: 128
beta_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999
beta_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999
eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0
learning_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5
momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999
optimizer | CategoricalParameterRanges | ['adamw', 'adam', 'sgd', 'rmsprop', 'nesterov', 'adagrad', 'adadelta']
regularizers_l2 | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999
train_only_on_top_layer | CategoricalParameterRanges | ['True', 'False']
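The following is a minimal sketch of such a tuning job with the SageMaker Python SDK. It assumes that estimator is a configured Text Classification - TensorFlow estimator (for example, the JumpStartEstimator sketched earlier); the S3 path and job limits are illustrative, and whether metric_definitions must be supplied depends on the estimator that you use.

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# `estimator` is assumed to be a configured Text Classification - TensorFlow
# estimator, such as the JumpStartEstimator sketched earlier in this section.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "val_accuracy=([0-9\\.]+)"}
    ],
    hyperparameter_ranges={
        "batch_size": IntegerParameter(4, 128),
        "learning_rate": ContinuousParameter(1e-6, 0.5),
        "optimizer": CategoricalParameter(["adamw", "adam", "sgd"]),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

# Hypothetical S3 path; replace with your own training data location.
tuner.fit({"training": "s3://amzn-s3-demo-bucket/text-classification/training/"})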

Built-in SageMaker Algorithms for Time-Series Data


SageMaker provides algorithms that are tailored to the analysis of time-series data for forecasting
product demand, server loads, webpage requests, and more.

• DeepAR Forecasting Algorithm (p. 1460)—a supervised learning algorithm for forecasting scalar (one-
dimensional) time series using recurrent neural networks (RNN).

Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable
DeepAR Forecasting | train and (optionally) test | File | JSON Lines or Parquet | GPU or CPU | Yes

DeepAR Forecasting Algorithm


The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting
scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting
methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS),
fit a single model to each individual time series. They then use that model to extrapolate the time series
into the future.


In many applications, however, you have many similar time series across a set of cross-sectional units.
For example, you might have time series groupings for demand for different products, server loads, and
requests for webpages. For this type of application, you can benefit from training a single model jointly
over all of the time series. DeepAR takes this approach. When your dataset contains hundreds of related
time series, DeepAR outperforms the standard ARIMA and ETS methods. You can also use the trained
model to generate forecasts for new time series that are similar to the ones it has been trained on.

The training input for the DeepAR algorithm is one or, preferably, more target time series that have
been generated by the same process or similar processes. Based on this input dataset, the algorithm
trains a model that learns an approximation of this process/processes and uses it to predict how the
target time series evolves. Each target time series can be optionally associated with a vector of static
(time-independent) categorical features provided by the cat field and a vector of dynamic (time-
dependent) time series provided by the dynamic_feat field. SageMaker trains the DeepAR model by
randomly sampling training examples from each target time series in the training dataset. Each training
example consists of a pair of adjacent context and prediction windows with fixed predefined lengths. To
control how far in the past the network can see, use the context_length hyperparameter. To control
how far in the future predictions can be made, use the prediction_length hyperparameter. For more
information, see How the DeepAR Algorithm Works (p. 1465).

Topics
• Input/Output Interface for the DeepAR Algorithm (p. 1461)
• Best Practices for Using the DeepAR Algorithm (p. 1463)
• EC2 Instance Recommendations for the DeepAR Algorithm (p. 1464)
• DeepAR Sample Notebooks (p. 1464)
• How the DeepAR Algorithm Works (p. 1465)
• DeepAR Hyperparameters (p. 1467)
• Tune a DeepAR Model (p. 1471)
• DeepAR Inference Formats (p. 1472)

Input/Output Interface for the DeepAR Algorithm

DeepAR supports two data channels. The required train channel describes the training dataset. The
optional test channel describes a dataset that the algorithm uses to evaluate model accuracy after
training. You can provide training and test datasets in JSON Lines format. Files can be in gzip or Parquet
file format.

When specifying the paths for the training and test data, you can specify a single file or a directory that
contains multiple files, which can be stored in subdirectories. If you specify a directory, DeepAR uses all
files in the directory as inputs for the corresponding channel, except those that start with a period (.) and
those named _SUCCESS. This ensures that you can directly use output folders produced by Spark jobs as
input channels for your DeepAR training jobs.

By default, the DeepAR model determines the input format from the file extension (.json, .json.gz,
or .parquet) in the specified input path. If the path does not end in one of these extensions, you must
explicitly specify the format in the SDK for Python. Use the content_type parameter of the s3_input
class.
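As an illustrative sketch, the following launches a DeepAR training job with the SageMaker Python SDK; recent SDK versions expose s3_input as sagemaker.inputs.TrainingInput. The IAM role ARN, bucket paths, and hyperparameter values are placeholders to adapt to your own setup.

import sagemaker
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Resolve the DeepAR container image for the current Region.
container = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = sagemaker.estimator.Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.c4.2xlarge",
    output_path="s3://amzn-s3-demo-bucket/deepar/output/",
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    time_freq="H",
    context_length="24",
    prediction_length="24",
    epochs="100",
)

# Set the content type explicitly because the S3 prefix has no file extension.
train_input = TrainingInput(
    "s3://amzn-s3-demo-bucket/deepar/train/", content_type="json"
)
estimator.fit({"train": train_input})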

The records in your input files should contain the following fields:

• start—A string with the format YYYY-MM-DD HH:MM:SS. The start timestamp can't contain time
zone information.
• target—An array of floating-point values or integers that represent the time series. You can encode
missing values as null literals, or as "NaN" strings in JSON, or as nan floating-point values in Parquet.
• dynamic_feat (optional)—An array of arrays of floating-point values or integers that represents the
vector of custom feature time series (dynamic features). If you set this field, all records must have the

same number of inner arrays (the same number of feature time series). In addition, each inner array
must be the same length as the associated target value plus prediction_length. Missing values
are not supported in the features. For example, if the target time series represents the demand of
different products, an associated dynamic_feat might be a boolean time series that indicates whether
a promotion was applied (1) to the particular product or not (0):

{"start": ..., "target": [1, 5, 10, 2], "dynamic_feat": [[0, 1, 1, 0]]}

• cat (optional)—An array of categorical features that can be used to encode the groups that
the record belongs to. Categorical features must be encoded as a 0-based sequence of positive
integers. For example, the categorical domain {R, G, B} can be encoded as {0, 1, 2}. All values
from each categorical domain must be represented in the training dataset. That's because the
DeepAR algorithm can forecast only for categories that have been observed during training.
And, each categorical feature is embedded in a low-dimensional space whose dimensionality is
controlled by the embedding_dimension hyperparameter. For more information, see DeepAR
Hyperparameters (p. 1467).

If you use a JSON file, it must be in JSON Lines format. For example:

{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1],
"dynamic_feat": [[1.1, 1.2, 0.5, ...]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat":
[[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat":
[[1.3, 0.4]]}

In this example, each time series has two associated categorical features and one feature time series.

For Parquet, you use the same three fields as columns. In addition, "start" can be the datetime type.
You can compress Parquet files using gzip (gzip) or the Snappy compression library (snappy).
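For example, here is a minimal sketch of producing a JSON Lines training file from Python; the file name and values are illustrative.

import json

# Write two toy time series in the JSON Lines format expected by the
# train channel. All values are illustrative.
series = [
    {"start": "2009-11-01 00:00:00", "target": [4.3, None, 5.1, 6.0], "cat": [0, 1]},
    {"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, 2.2], "cat": [2, 3]},
]

with open("train.json", "w") as f:
    for record in series:
        # json.dumps encodes Python None as the null literal, which DeepAR
        # interprets as a missing target value.
        f.write(json.dumps(record) + "\n")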

If the algorithm is trained without cat and dynamic_feat fields, it learns a "global" model, that is,
a model that is agnostic to the specific identity of the target time series at inference time and is
conditioned only on its shape.

If the model is conditioned on the cat and dynamic_feat feature data provided for each time series,
the prediction will probably be influenced by the character of time series with the corresponding cat
features. For example, if the target time series represents the demand of clothing items, you can
associate a two-dimensional cat vector that encodes the type of item (e.g. 0 = shoes, 1 = dress) in the
first component and the color of an item (e.g. 0 = red, 1 = blue) in the second component. A sample input
would look as follows:

{ "start": ..., "target": ..., "cat": [0, 0], ... } # red shoes
{ "start": ..., "target": ..., "cat": [1, 1], ... } # blue dress

At inference time, you can request predictions for targets with cat values that are combinations of the
cat values observed in the training data, for example:

{ "start": ..., "target": ..., "cat": [0, 1], ... } # blue shoes
{ "start": ..., "target": ..., "cat": [1, 0], ... } # red dress

The following guidelines apply to training data:

• The start time and length of the time series can differ. For example, in marketing, products often enter
a retail catalog at different dates, so their start dates naturally differ. But all series must have the same
frequency, number of categorical features, and number of dynamic features.


• Shuffle the training file with respect to the position of the time series in the file. In other words, the
time series should occur in random order in the file.
• Make sure to set the start field correctly. The algorithm uses the start timestamp to derive the
internal features.
• If you use categorical features (cat), all time series must have the same number of categorical
features. If the dataset contains the cat field, the algorithm uses it and extracts the cardinality of the
groups from the dataset. By default, cardinality is "auto". If the dataset contains the cat field,
but you don't want to use it, you can disable it by setting cardinality to "". If a model was trained
using a cat feature, you must include it for inference.
• If your dataset contains the dynamic_feat field, the algorithm uses it automatically. All time series
have to have the same number of feature time series. The time points in each of the feature time series
correspond one-to-one to the time points in the target. In addition, the entry in the dynamic_feat
field should have the same length as the target. If the dataset contains the dynamic_feat field, but
you don't want to use it, disable it by setting num_dynamic_feat to "". If the model was trained
with the dynamic_feat field, you must provide this field for inference. In addition, each of the
features has to have the length of the provided target plus the prediction_length. In other words,
you must provide feature values for the future time points that you want to predict.

If you specify optional test channel data, the DeepAR algorithm evaluates the trained model with
different accuracy metrics. The algorithm calculates the root mean square error (RMSE) over the test
data as follows:

RMSE = \sqrt{ \frac{1}{nT} \sum_{i,t} (y_{i,t} - \hat{y}_{i,t})^2 }

y_{i,t} is the true value of time series i at the time t. \hat{y}_{i,t} is the mean prediction. The sum is over
all n time series in the test set and over the last T time points for each time series, where T corresponds
to the forecast horizon. You specify the length of the forecast horizon by setting the
prediction_length hyperparameter. For more information, see DeepAR Hyperparameters (p. 1467).

In addition, the algorithm evaluates the accuracy of the forecast distribution using weighted quantile
loss. For a quantile τ in the range [0, 1], the weighted quantile loss is defined as follows:

wQuantileLoss[\tau] = 2 \frac{ \sum_{i,t} [ \tau \max(y_{i,t} - q^{(\tau)}_{i,t}, 0) + (1-\tau) \max(q^{(\tau)}_{i,t} - y_{i,t}, 0) ] }{ \sum_{i,t} |y_{i,t}| }

q^{(τ)}_{i,t} is the τ-quantile of the distribution that the model predicts. To specify which quantiles to
calculate loss for, set the test_quantiles hyperparameter. In addition to these, the average of
the prescribed quantile losses is reported as part of the training logs. For information, see DeepAR
Hyperparameters (p. 1467).

For inference, DeepAR accepts JSON format and the following fields:

• "instances", which includes one or more time series in JSON Lines format
• A name of "configuration", which includes parameters for generating the forecast

For more information, see DeepAR Inference Formats (p. 1472).

Best Practices for Using the DeepAR Algorithm

When preparing your time series data, follow these best practices to achieve the best results:


• Except for when splitting your dataset for training and testing, always provide the entire time
series for training, testing, and when calling the model for inference. Regardless of how you set
context_length, don't break up the time series or provide only a part of it. The model uses data
points further back than the value set in context_length for the lagged values feature.
• When tuning a DeepAR model, you can split the dataset to create a training dataset and a test dataset.
In a typical evaluation, you would test the model on the same time series used for training, but
on the future prediction_length time points that follow immediately after the last time point
visible during training. You can create training and test datasets that satisfy this criteria by using the
entire dataset (the full length of all time series that are available) as a test set and removing the last
prediction_length points from each time series for training. During training, the model doesn't see
the target values for time points on which it is evaluated during testing. During testing, the algorithm
withholds the last prediction_length points of each time series in the test set and generates a
prediction. Then it compares the forecast with the withheld values. You can create more complex
evaluations by repeating time series multiple times in the test set, but cutting them at different
endpoints. With this approach, accuracy metrics are averaged over multiple forecasts from different
time points. For more information, see Tune a DeepAR Model (p. 1471). A minimal sketch of this split
appears after this list.
• Avoid using very large values (>400) for the prediction_length because it makes the model slow
and less accurate. If you want to forecast further into the future, consider aggregating your data at a
lower frequency. For example, use 5min instead of 1min.
• Because lags are used, a model can look further back in the time series than the value specified for
context_length. Therefore, you don't need to set this parameter to a large value. We recommend
starting with the value that you used for prediction_length.
• We recommend training a DeepAR model on as many time series as are available. Although a DeepAR
model trained on a single time series might work well, standard forecasting algorithms, such as ARIMA
or ETS, might provide more accurate results. The DeepAR algorithm starts to outperform the standard
methods when your dataset contains hundreds of related time series. Currently, DeepAR requires that
the total number of observations available across all training time series is at least 300.
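The following sketch illustrates the training/test split described in the second item of this list, under the assumption that the data is already in the JSON Lines record structure shown earlier. All values are illustrative.

# The full series form the test set; the last prediction_length points are
# removed from each series to form the training set.
prediction_length = 6

test_set = [
    {"start": "2009-11-01 00:00:00",
     "target": [4.3, 5.1, 6.0, 5.5, 7.1, 6.8, 7.0, 8.2, 7.7, 9.0, 8.4, 9.9]},
]

train_set = [
    {**series, "target": series["target"][:-prediction_length]}
    for series in test_set
]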

EC2 Instance Recommendations for the DeepAR Algorithm

You can train DeepAR on both GPU and CPU instances and in both single and multi-machine settings.
We recommend starting with a single CPU instance (for example, ml.c4.2xlarge or ml.c4.4xlarge), and
switching to GPU instances and multiple machines only when necessary. Using GPUs and multiple
machines improves throughput only for larger models (with many cells per layer and many layers) and
for large mini-batch sizes (for example, greater than 512).

For inference, DeepAR supports only CPU instances.

Specifying large values for context_length, prediction_length, num_cells, num_layers, or
mini_batch_size can create models that are too large for small instances. In this case, use a larger
instance type or reduce the values for these parameters. This problem also frequently occurs when
running hyperparameter tuning jobs. In that case, use an instance type large enough for the model
tuning job and consider limiting the upper values of the critical parameters to avoid job failures.

DeepAR Sample Notebooks

For a sample notebook that shows how to prepare a time series dataset for training the SageMaker
DeepAR algorithm and how to deploy the trained model for performing inferences, see Time series
forecasting with DeepAR - Synthetic data as well as DeepAR demo on electricity dataset, which illustrates
the advanced features of DeepAR on a real-world dataset. For instructions on creating and accessing
Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon SageMaker
Notebook Instances (p. 204). After creating and opening a notebook instance, choose the SageMaker
Examples tab to see a list of all of the SageMaker examples. To open a notebook, choose its Use tab, and
choose Create copy.


How the DeepAR Algorithm Works

During training, DeepAR accepts a training dataset and an optional test dataset. It uses the test dataset
to evaluate the trained model. In general, the datasets don't have to contain the same set of time series.
You can use a model trained on a given training set to generate forecasts for the future of the time series
in the training set, and for other time series. Both the training and the test datasets consist of one or,
preferably, more target time series. Each target time series can optionally be associated with a vector
of feature time series and a vector of categorical features. For more information, see Input/Output
Interface for the DeepAR Algorithm (p. 1461).

For example, the following is an element of a training set indexed by i which consists of a target time
series, Zi,t, and two associated feature time series, Xi,1,t and Xi,2,t:

The target time series might contain missing values, which are represented by line breaks in the time
series. DeepAR supports only feature time series that are known in the future. This allows you to run
"what if?" scenarios. What happens, for example, if I change the price of a product in some way?

Each target time series can also be associated with a number of categorical features. You can use these
features to encode which groupings a time series belongs to. Categorical features allow the model to
learn typical behavior for groups, which it can use to increase model accuracy. DeepAR implements this
by learning an embedding vector for each group that captures the common properties of all time series
in the group.

How Feature Time Series Work in the DeepAR Algorithm

To facilitate learning time-dependent patterns, such as spikes during weekends, DeepAR automatically
creates feature time series based on the frequency of the target time series. It uses these derived feature
time series with the custom feature time series that you provide during training and inference. The
following figure shows two of these derived time series features: ui,1,t represents the hour of the day and
ui,2,t the day of the week.


The DeepAR algorithm automatically generates these feature time series. The following table lists the
derived features for the supported basic time frequencies.

Frequency of the Time Series | Derived Features
Minute | minute-of-hour, hour-of-day, day-of-week, day-of-month, day-of-year
Hour | hour-of-day, day-of-week, day-of-month, day-of-year
Day | day-of-week, day-of-month, day-of-year
Week | day-of-month, week-of-year
Month | month-of-year

DeepAR trains a model by randomly sampling several training examples from each of the time series
in the training dataset. Each training example consists of a pair of adjacent context and prediction
windows with fixed predefined lengths. The context_length hyperparameter controls how far in the
past the network can see, and the prediction_length hyperparameter controls how far in the future
predictions can be made. During training, the algorithm ignores training set elements containing time
series that are shorter than a specified prediction length. The following figure represents five samples
with context lengths of 12 hours and prediction lengths of 6 hours drawn from element i. For brevity,
we've omitted the feature time series xi,1,t and ui,2,t.


To capture seasonality patterns, DeepAR also automatically feeds lagged values from the target time
series. In the example with hourly frequency, for each time index, t = T, the model exposes the zi,t values,
which occurred approximately one, two, and three days in the past.

For inference, the trained model takes as input target time series, which might or might not have been
used during training, and forecasts a probability distribution for the next prediction_length values.
Because DeepAR is trained on the entire dataset, the forecast takes into account patterns learned from
similar time series.

For information on the mathematics behind DeepAR, see DeepAR: Probabilistic Forecasting with
Autoregressive Recurrent Networks.

DeepAR Hyperparameters

Parameter Name Description

context_length The number of time-points that the model gets to see before
making the prediction. The value for this parameter should be
about the same as the prediction_length. The model also
receives lagged inputs from the target, so context_length can be
much smaller than typical seasonalities. For example, a daily time
series can have yearly seasonality. The model automatically includes
a lag of one year, so the context length can be shorter than a year.
The lag values that the model picks depend on the frequency of the
time series. For example, lag values for daily frequency are previous
week, 2 weeks, 3 weeks, 4 weeks, and year.

Required

Valid values: Positive integer

epochs The maximum number of passes over the training data. The
optimal value depends on your data size and learning rate. See also
early_stopping_patience. Typical values range from 10 to
1000.

Required

Valid values: Positive integer

prediction_length The number of time-steps that the model is trained to predict, also called the
forecast horizon. The trained model always generates forecasts with this length. It can't generate
longer forecasts. The prediction_length is fixed when a model is trained and it cannot be changed
later.

Required

Valid values: Positive integer

time_freq The granularity of the time series in the dataset. Use time_freq to
select appropriate date features and lags. The model supports the
following basic frequencies. It also supports multiples of these basic
frequencies. For example, 5min specifies a frequency of 5 minutes.

• M: monthly
• W: weekly
• D: daily
• H: hourly
• min: every minute

Required

Valid values: An integer followed by M, W, D, H, or min. For example, 5min.

cardinality When using the categorical features (cat), cardinality is an array specifying the
number of categories (groups) per categorical feature. Set this to auto to infer the cardinality from
the data. The auto mode also works when no categorical features are used in the dataset. This is the
recommended setting for the parameter.

Set cardinality to ignore to force DeepAR to not use categorical features, even if they are present
in the data.

To perform additional data validation, it is possible to explicitly set this parameter to the actual value.
For example, if two categorical features are provided where the first has 2 and the other has 3 possible
values, set this to [2, 3].

For more information on how to use categorical features, see the data section on the main
documentation page of DeepAR.

Optional

Valid values: auto, ignore, array of positive integers, or empty string

Default value: auto

dropout_rate The dropout rate to use during training. The model uses zoneout
regularization. For each iteration, a random subset of hidden
neurons are not updated. Typical values are less than 0.2.

Optional

Valid values: float

Default value: 0.1

early_stopping_patience If this parameter is set, training stops when no progress is made within
the specified number of epochs. The model that has the lowest loss is returned as the final model.

Optional

Valid values: integer

embedding_dimension Size of embedding vector learned per categorical feature (same value is used
for all categorical features).

The DeepAR model can learn group-level time series patterns when
a categorical grouping feature is provided. To do this, the model
learns an embedding vector of size embedding_dimension for
each group, capturing the common properties of all time series in
the group. A larger embedding_dimension allows the model to
capture more complex patterns. However, because increasing the
embedding_dimension increases the number of parameters in
the model, more training data is required to accurately learn these
parameters. Typical values for this parameter are between 10-100.

Optional

Valid values: positive integer

Default value: 10

learning_rate The learning rate used in training. Typical values range from 1e-4 to
1e-1.

Optional

Valid values: float

Default value: 1e-3

likelihood The model generates a probabilistic forecast, and can provide quantiles of the
distribution and return samples. Depending on your data, select an appropriate likelihood (noise
model) that is used for uncertainty estimates. The following likelihoods can be selected:

• gaussian: Use for real-valued data.
• beta: Use for real-valued targets between 0 and 1 inclusive.
• negative-binomial: Use for count data (non-negative integers).
• student-T: An alternative for real-valued data that works well for bursty data.
• deterministic-L1: A loss function that does not estimate uncertainty and only learns a point forecast.

Optional

Valid values: One of gaussian, beta, negative-binomial, student-T, or deterministic-L1.

Default value: student-T

mini_batch_size The size of mini-batches used during training. Typical values range
from 32 to 512.

Optional

Valid values: positive integer

Default value: 128

num_cells The number of cells to use in each hidden layer of the RNN. Typical
values range from 30 to 100.

Optional

Valid values: positive integer

Default value: 40

num_dynamic_feat The number of dynamic_feat provided in the data. Set this to auto to infer the
number of dynamic features from the data. The auto mode also works when no dynamic features are
used in the dataset. This is the recommended setting for the parameter.

To force DeepAR to not use dynamic features, even if they are present in the data, set
num_dynamic_feat to ignore.

To perform additional data validation, it is possible to explicitly set this parameter to the actual integer
value. For example, if two dynamic features are provided, set this to 2.

Optional

Valid values: auto, ignore, positive integer, or empty string

Default value: auto

num_eval_samples The number of samples that are used per time-series when
calculating test accuracy metrics. This parameter does not have any
influence on the training or the final model. In particular, the model
can be queried with a different number of samples. This parameter
only affects the reported accuracy scores on the test channel after
training. Smaller values result in faster evaluation, but then the
evaluation scores are typically worse and more uncertain. When
evaluating with higher quantiles, for example 0.95, it may be
important to increase the number of evaluation samples.

Optional

Valid values: integer

Default value: 100

num_layers The number of hidden layers in the RNN. Typical values range from
1 to 4.

Optional

Valid values: positive integer

Default value: 2

test_quantiles Quantiles for which to calculate quantile loss on the test channel.

Optional

Valid values: array of floats

Default value: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

Tune a DeepAR Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the DeepAR Algorithm

The DeepAR algorithm reports three metrics, which are computed during training. When tuning a model,
choose one of these as the objective. For the objective, use either the forecast accuracy on a provided
test channel (recommended) or the training loss. For recommendations for the training/test split for the
DeepAR algorithm, see Best Practices for Using the DeepAR Algorithm (p. 1463).

Metric Name | Description | Optimization Direction
test:RMSE | The root mean square error between the forecast and the actual target computed on the test set. | Minimize
test:mean_wQuantileLoss | The average overall quantile losses computed on the test set. To control which quantiles are used, set the test_quantiles hyperparameter. | Minimize
train:final_loss | The training negative log-likelihood loss averaged over the last training epoch for the model. | Minimize

Tunable Hyperparameters for the DeepAR Algorithm

Tune a DeepAR model with the following hyperparameters. The hyperparameters that have the greatest
impact, listed in order from the most to least impactful, on DeepAR objective metrics are: epochs,
context_length, mini_batch_size, learning_rate, and num_cells.

Parameter Name | Parameter Type | Recommended Ranges
epochs | IntegerParameterRanges | MinValue: 1, MaxValue: 1000
context_length | IntegerParameterRanges | MinValue: 1, MaxValue: 200
mini_batch_size | IntegerParameterRanges | MinValue: 32, MaxValue: 1028
learning_rate | ContinuousParameterRange | MinValue: 1e-5, MaxValue: 1e-1
num_cells | IntegerParameterRanges | MinValue: 30, MaxValue: 200
num_layers | IntegerParameterRanges | MinValue: 1, MaxValue: 8
dropout_rate | ContinuousParameterRange | MinValue: 0.00, MaxValue: 0.2
embedding_dimension | IntegerParameterRanges | MinValue: 1, MaxValue: 50

DeepAR Inference Formats

DeepAR JSON Request Formats

Query a trained model by using the model's endpoint. The endpoint takes the following JSON request
format.

In the request, the instances field corresponds to the time series that should be forecast by the model.

If the model was trained with categories, you must provide a cat for each instance. If the model was
trained without the cat field, it should be omitted.

If the model was trained with a custom feature time series (dynamic_feat), you have to provide the
same number of dynamic_feat values for each instance. Each of them should have a length given by
length(target) + prediction_length, where the last prediction_length values correspond to
the time points in the future that will be predicted. If the model was trained without custom feature
time series, the field should not be included in the request.

{
"instances": [
{
"start": "2009-11-01 00:00:00",
"target": [4.0, 10.0, "NaN", 100.0, 113.0],
"cat": [0, 1],
"dynamic_feat": [[1.0, 1.1, 2.1, 0.5, 3.1, 4.1, 1.2, 5.0, ...]]
},
{
"start": "2012-01-30",
"target": [1.0],
"cat": [2, 1],
"dynamic_feat": [[2.0, 3.1, 4.5, 1.5, 1.8, 3.2, 0.1, 3.0, ...]]
},
{
"start": "1999-01-30",
"target": [2.0, 1.0],
"cat": [1, 3],
"dynamic_feat": [[1.0, 0.1, -2.5, 0.3, 2.0, -1.2, -0.1, -3.0, ...]]
}
],
"configuration": {
"num_samples": 50,
"output_types": ["mean", "quantiles", "samples"],
"quantiles": ["0.5", "0.9"]
}
}

The configuration field is optional. configuration.num_samples sets the number of sample
paths that the model generates to estimate the mean and quantiles. configuration.output_types
describes the information that will be returned in the response. Valid values are "mean",
"quantiles", and "samples". If you specify "quantiles", each of the quantile values in
configuration.quantiles is returned as a time series. If you specify "samples", the model also
returns the raw samples used to calculate the other outputs.
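As a minimal sketch, the following sends such a request with Boto3. The endpoint name and data are illustrative, and the example omits cat and dynamic_feat, which is valid only if the model was trained without them.

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

request = {
    "instances": [
        {"start": "2009-11-01 00:00:00", "target": [4.0, 10.0, 100.0, 113.0]}
    ],
    "configuration": {
        "num_samples": 50,
        "output_types": ["mean", "quantiles"],
        "quantiles": ["0.5", "0.9"],
    },
}

# Hypothetical endpoint name; replace with your deployed DeepAR endpoint.
response = runtime.invoke_endpoint(
    EndpointName="deepar-endpoint",
    ContentType="application/json",
    Body=json.dumps(request).encode("utf-8"),
)

predictions = json.loads(response["Body"].read())["predictions"]
print(predictions[0]["quantiles"]["0.9"])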

DeepAR JSON Response Formats

The following is the format of a response, where [...] are arrays of numbers:

{
"predictions": [
{
"quantiles": {
"0.9": [...],
"0.5": [...]
},
"samples": [...],
"mean": [...]
},
{
"quantiles": {
"0.9": [...],
"0.5": [...]
},
"samples": [...],
"mean": [...]
},
{
"quantiles": {
"0.9": [...],

1473
Amazon SageMaker Developer Guide
Use Built-in Algorithms

"0.5": [...]
},
"samples": [...],
"mean": [...]
}
]
}

DeepAR has a response timeout of 60 seconds. When passing multiple time series in a single request,
the forecasts are generated sequentially. Because the forecast for each time series typically takes about
300 to 1000 milliseconds or longer, depending on the model size, passing too many time series in a
single request can cause time outs. It's better to send fewer time series per request and send more
requests. Because the DeepAR algorithm uses multiple workers per instance, you can achieve much
higher throughput by sending multiple requests in parallel.

By default, DeepAR uses one worker per CPU for inference, if there is sufficient memory per CPU. If the
model is large and there isn't enough memory to run a model on each CPU, the number of workers is
reduced. The number of workers used for inference can be overridden using the environment variable
MODEL_SERVER_WORKERS (for example, by setting MODEL_SERVER_WORKERS=1) when calling the
SageMaker CreateModel API.
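A minimal sketch of this request pattern follows, assuming a deployed endpoint named deepar-endpoint; the series data is illustrative.

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")

def forecast(series):
    # One time series per request keeps each call well under the
    # 60-second response timeout.
    body = json.dumps({"instances": [series]})
    response = runtime.invoke_endpoint(
        EndpointName="deepar-endpoint",
        ContentType="application/json",
        Body=body.encode("utf-8"),
    )
    return json.loads(response["Body"].read())["predictions"][0]

all_series = [
    {"start": "2009-11-01 00:00:00", "target": [4.0, 10.0, 100.0]},
    {"start": "2012-01-30 00:00:00", "target": [1.0, 2.0, 3.0]},
]

# Parallel requests take advantage of the multiple inference workers
# that run on each endpoint instance.
with ThreadPoolExecutor(max_workers=4) as pool:
    forecasts = list(pool.map(forecast, all_series))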

Batch Transform with the DeepAR Algorithm


DeepAR forecasting supports getting inferences by using batch transform from data using the JSON
Lines format. In this format, each record is represented on a single line as a JSON object, and lines
are separated by newline characters. The format is identical to the JSON Lines format used for model
training. For information, see Input/Output Interface for the DeepAR Algorithm (p. 1461). For example:

{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1],
"dynamic_feat": [[1.1, 1.2, 0.5, ..]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat":
[[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat":
[[1.3, 0.4]]}

Note
When creating the transformation job with CreateTransformJob, set the BatchStrategy
value to SingleRecord and set the SplitType value in the TransformInput configuration
to Line, as the default values currently cause runtime failures.

Similar to the hosted endpoint inference request format, the cat and the dynamic_feat fields for each
instance are required if both of the following are true:

• The model is trained on a dataset that contained both the cat and the dynamic_feat fields.
• The corresponding cardinality and num_dynamic_feat values used in the training job are not set
to "".

Unlike hosted endpoint inference, the configuration field is set once for the entire batch
inference job using an environment variable named DEEPAR_INFERENCE_CONFIG. The
value of DEEPAR_INFERENCE_CONFIG can be passed in the Environment map of the
CreateTransformJob API request. If DEEPAR_INFERENCE_CONFIG is missing in the container
environment, the inference container uses the following default:

{
"num_samples": 100,
"output_types": ["mean", "quantiles"],
"quantiles": ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
}

The output is also in JSON Lines format, with one line per prediction, in an order identical to the instance
order in the corresponding input file. Predictions are encoded as objects identical to the ones returned by
responses in online inference mode. For example:

{ "quantiles": { "0.1": [...], "0.2": [...] }, "samples": [...], "mean": [...] }

Note that in the TransformOutput configuration of the SageMaker CreateTransformJob request,
clients must explicitly set the AssembleWith value to Line, as the default value None concatenates
all JSON objects on the same line.

For example, here is a SageMaker CreateTransformJob request for a DeepAR job with a custom
DEEPAR_INFERENCE_CONFIG:

{
"BatchStrategy": "SingleRecord",
"Environment": {
"DEEPAR_INFERENCE_CONFIG" : "{ \"num_samples\": 200, \"output_types\": [\"mean\"] }",
...
},
"TransformInput": {
"SplitType": "Line",
...
},
"TransformOutput": {
"AssembleWith": "Line",
...
},
...
}
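The equivalent job can be sketched with the SageMaker Python SDK's Transformer class; the model name and bucket paths are assumptions to replace with your own.

from sagemaker.transformer import Transformer

# Assumes a DeepAR model named "deepar-model" already exists in SageMaker.
transformer = Transformer(
    model_name="deepar-model",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="SingleRecord",
    assemble_with="Line",
    output_path="s3://amzn-s3-demo-bucket/deepar/transform-output/",
    env={
        "DEEPAR_INFERENCE_CONFIG": '{"num_samples": 200, "output_types": ["mean"]}'
    },
)

transformer.transform(
    data="s3://amzn-s3-demo-bucket/deepar/transform-input/test.json",
    content_type="application/jsonlines",
    split_type="Line",
)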

Unsupervised Built-in SageMaker Algorithms


Amazon SageMaker provides several built-in algorithms that can be used for a variety of unsupervised
learning tasks such as clustering, dimension reduction, pattern recognition, and anomaly detection.

• IP Insights (p. 1476)—learns the usage patterns for IPv4 addresses. It is designed to capture
associations between IPv4 addresses and various entities, such as user IDs or account numbers.
• K-Means Algorithm (p. 1485)—finds discrete groupings within data, where members of a group are as
similar as possible to one another and as different as possible from members of other groups.
• Principal Component Analysis (PCA) Algorithm (p. 1493)—reduces the dimensionality (number of
features) within a dataset by projecting data points onto the first few principal components. The
objective is to retain as much information or variation as possible. For mathematicians, principal
components are eigenvectors of the data's covariance matrix.
• Random Cut Forest (RCF) Algorithm (p. 1497)—detects anomalous data points within a data set that
diverge from otherwise well-structured or patterned data.

Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable
IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes
K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No
PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes
Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes

IP Insights
Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage patterns for
IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such
as user IDs or account numbers. You can use it to identify a user attempting to log into a web service
from an anomalous IP address, for example. Or you can use it to identify an account that is attempting
to create computing resources from an unusual IP address. Trained IP Insights models can be hosted at
an endpoint for making real-time predictions or used for processing batch transforms.

SageMaker IP Insights ingests historical data as (entity, IPv4 address) pairs and learns the IP usage
patterns of each entity. When queried with an (entity, IPv4 address) event, a SageMaker IP Insights
model returns a score that infers how anomalous the pattern of the event is. For example, when a user
attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might
decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the
IP Insights score into another machine learning model. For example, you can combine the IP Insights
score with other features to rank the findings of another security system, such as those from Amazon
GuardDuty.

The SageMaker IP Insights algorithm can also learn vector representations of IP addresses, known as
embeddings. You can use vector-encoded embeddings as features in downstream machine learning tasks
that use the information observed in the IP addresses. For example, you can use them in tasks such as
measuring similarities between IP addresses in clustering and visualization tasks.

Topics
• Input/Output Interface for the IP Insights Algorithm (p. 1476)
• EC2 Instance Recommendation for the IP Insights Algorithm (p. 1477)
• IP Insights Sample Notebooks (p. 1478)
• How IP Insights Works (p. 1478)
• IP Insights Hyperparameters (p. 1479)
• Tune an IP Insights Model (p. 1481)
• IP Insights Data Formats (p. 1483)

Input/Output Interface for the IP Insights Algorithm


Training and Validation

The SageMaker IP Insights algorithm supports training and validation data channels. It uses the optional
validation channel to compute an area-under-curve (AUC) score on a predefined negative sampling
strategy. The AUC metric validates how well the model discriminates between positive and negative
samples. Training and validation data content types need to be in text/csv format. The first column
of the CSV data is an opaque string that provides a unique identifier for the entity. The second column
is an IPv4 address in decimal-dot notation. IP Insights currently supports only File mode. For more
information and some examples, see IP Insights Training Data Formats (p. 1483).
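For example, a training file could be produced with a sketch like the following; the entity IDs and addresses are illustrative.

import csv

# A minimal sketch of the two-column CSV layout that the train channel
# expects: an opaque entity identifier followed by an IPv4 address.
# No header row is written.
pairs = [
    ("user_1", "192.0.2.1"),
    ("user_1", "192.0.2.44"),
    ("user_2", "198.51.100.7"),
]

with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(pairs)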

Inference

For inference, IP Insights supports text/csv, application/json, and application/jsonlines


data content types. For more information about the common data formats for inference provided by
SageMaker, see Common Data Formats for Inference (p. 1293). IP Insights inference returns output
formatted as either application/json or application/jsonlines. Each record in the output data
contains the corresponding dot_product (or compatibility score) for each input data point. For more
information and some examples, see IP Insights Inference Data Formats (p. 1483).
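As a minimal sketch, assuming a deployed endpoint with the hypothetical name ipinsights-endpoint:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="ipinsights-endpoint",
    ContentType="text/csv",
    Accept="application/json",
    Body=b"user_1,192.0.2.44\nuser_2,198.51.100.7\n",
)

# Each prediction carries a dot_product (compatibility) score; higher values
# indicate pairings that look more like the patterns seen in training.
print(json.loads(response["Body"].read()))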

EC2 Instance Recommendation for the IP Insights Algorithm


The SageMaker IP Insights algorithm can run on both GPU and CPU instances. For training jobs, we
recommend using GPU instances. However, for certain workloads with large training datasets, distributed
CPU instances might reduce training costs. For inference, we recommend using CPU instances. IP Insights
supports P2, P3, G4dn, and G5 GPU families.

GPU Instances for the IP Insights Algorithm


IP Insights supports all available GPUs. If you need to speed up training, we recommend starting with
a single GPU instance, such as ml.p3.2xlarge, and then moving to a multi-GPU environment, such as
ml.p3.8xlarge and ml.p3.16xlarge. Multiple GPUs automatically divide the mini-batches of training
data across themselves. If you switch from a single GPU to multiple GPUs, the mini_batch_size is
divided equally into the number of GPUs used. You may want to increase the value of the
mini_batch_size to compensate for this.

CPU Instances for the IP Insights Algorithm


The type of CPU instance that we recommend depends largely on the instance's available memory
and the model size. The model size is determined by two hyperparameters: vector_dim and
num_entity_vectors. The maximum supported model size is 8 GB. The following table lists typical
EC2 instance types that you would deploy based on these input parameters for various model sizes.
In the following table, the values for vector_dim in the first column range from 32 to 2048, and the
values for num_entity_vectors in the first row range from 10,000 to 50,000,000. A dash indicates a
combination that exceeds the 8 GB maximum supported model size.

vector_dim \ num_entity_vectors | 10,000 | 50,000 | 100,000 | 500,000 | 1,000,000 | 5,000,000 | 10,000,000 | 50,000,000
32 | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.2xlarge | ml.m5.4xlarge
64 | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.2xlarge | -
128 | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.4xlarge | -
256 | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge | - | -
512 | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | - | - | -
1024 | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge | - | - | -
2048 | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.xlarge | - | - | - | -

The values for the mini_batch_size, num_ip_encoder_layers, random_negative_sampling_rate,
and shuffled_negative_sampling_rate hyperparameters also affect the amount of memory
required. If these values are large, you might need to use a larger instance type than normal.


IP Insights Sample Notebooks


For a sample notebook that shows how to train the SageMaker IP Insights algorithm and perform
inferences with it, see An Introduction to the SageMaker IP Insights Algorithm. For instructions on how
to create and access Jupyter notebook instances that you can use to run the example in SageMaker,
see Amazon SageMaker Notebook Instances (p. 204). After creating a notebook instance, choose the
SageMaker Examples tab to see a list of all the SageMaker examples. To open a notebook, choose its
Use tab and choose Create copy.

How IP Insights Works


Amazon SageMaker IP Insights is an unsupervised algorithm that consumes observed data in the form
of (entity, IPv4 address) pairs that associate entities with IP addresses. IP Insights determines how likely
it is that an entity would use a particular IP address by learning latent vector representations for both
entities and IP addresses. The distance between these two representations can then serve as the proxy
for how likely this association is.

The IP Insights algorithm uses a neural network to learn the latent vector representations for entities
and IP addresses. Entities are first hashed to a large but fixed hash space and then encoded by a
simple embedding layer. Character strings such as user names or account IDs can be fed directly into
IP Insights as they appear in log files. You don't need to preprocess the data for entity identifiers. You
can provide entities as an arbitrary string value during both training and inference. The hash size should
be configured with a value that is high enough to ensure that the number of collisions, which occur
when distinct entities are mapped to the same latent vector, remains insignificant. For more information
about how to select appropriate hash sizes, see Feature Hashing for Large Scale Multitask Learning. For
representing IP addresses, on the other hand, IP Insights uses a specially designed encoder network to
uniquely represent each possible IPv4 address by exploiting the prefix structure of IP addresses.

During training, IP Insights automatically generates negative samples by randomly pairing entities and
IP addresses. These negative samples represent data that is less likely to occur in reality. The model
is trained to discriminate between positive samples that are observed in the training data and these
generated negative samples. More specifically, the model is trained to minimize the cross entropy, also
known as the log loss, defined as follows:

$$ -\sum_{n} \left[ y_n \log p_n + (1 - y_n) \log(1 - p_n) \right] $$

Here, y_n is the label that indicates whether the sample is from the real distribution governing the
observed data (y_n = 1) or from the distribution generating negative samples (y_n = 0), and p_n is the
probability that the sample is from the real distribution, as predicted by the model.

Generating negative samples is an important process that is used to achieve an accurate model of
the observed data. If negative samples are extremely unlikely, for example, if all of the IP addresses
in negative samples are 10.0.0.0, then the model trivially learns to distinguish negative samples and
fails to accurately characterize the actual observed dataset. To keep negative samples more realistic,
IP Insights generates negative samples both by randomly generating IP addresses and randomly
picking IP addresses from training data. You can configure the type of negative sampling and the
rates at which negative samples are generated with the random_negative_sampling_rate and
shuffled_negative_sampling_rate hyperparameters.

Given the nth (entity, IP address) pair, the IP Insights model outputs a score, S_n, that indicates how
compatible the entity is with the IP address. This score corresponds to the log odds ratio of the pair
coming from the real distribution as compared to coming from the negative distribution. It is defined
as follows:

$$ S_n = \log \frac{p(\text{pair}_n \mid \text{real})}{p(\text{pair}_n \mid \text{negative})} $$

The score is essentially a measure of the similarity between the vector representations of the nth entity
and IP address. It can be interpreted as how much more likely it would be to observe this event in reality


than in a randomly generated dataset. During training, the algorithm uses this score to calculate an
estimate of the probability of a sample coming from the real distribution, p_n, to use in the cross
entropy minimization, where:

$$ p_n = \sigma(S_n) = \frac{1}{1 + e^{-S_n}} $$
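To make the relationship between the score and the training loss concrete, the following sketch
computes the compatibility score as a dot product of hypothetical embeddings, converts it to a
probability with the sigmoid, and evaluates the log loss. The embedding values here are illustrative
placeholders, not output of the algorithm.

import numpy as np

def log_loss(y, p):
    """Binary cross entropy between labels y and predicted probabilities p."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical learned embeddings for two (entity, IP address) pairs
entity_emb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
ip_emb = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])

# Compatibility score S_n: dot product of each entity/IP embedding pair
scores = np.sum(entity_emb * ip_emb, axis=1)  # e.g. [0.0, 2.0]

# Probability that each pair comes from the real distribution: sigmoid(S_n)
p = 1.0 / (1.0 + np.exp(-scores))

# Labels: 1 = observed (positive) sample, 0 = generated negative sample
y = np.array([0.0, 1.0])
print(scores, p, log_loss(y, p))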

IP Insights Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm. You can also specify
algorithm-specific hyperparameters as string-to-string maps. The following table lists the
hyperparameters for the Amazon SageMaker IP Insights algorithm.

Parameter Name Description

num_entity_vectors The number of entity vector representations (entity embedding
vectors) to train. Each entity in the training set is randomly assigned
to one of these vectors using a hash function. Because of hash
collisions, it might be possible to have multiple entities assigned
to the same vector. This would cause the same vector to represent
multiple entities. This generally has a negligible effect on model
performance, as long as the collision rate is not too severe. To keep
the collision rate low, set this value as high as possible. However,
the model size, and, therefore, the memory requirement, for both
training and inference, scales linearly with this hyperparameter. We
recommend that you set this value to twice the number of unique
entity identifiers.

Required

Valid values: 1 ≤ positive integer ≤ 250,000,000

vector_dim The size of embedding vectors to represent entities and IP
addresses. The larger the value, the more information that
can be encoded using these representations. In practice,
model size scales linearly with this parameter and limits
how large the dimension can be. In addition, using vector
representations that are too large can cause the model to
overfit, especially for small training datasets. Overfitting
occurs when a model doesn't learn any pattern in the
data but effectively memorizes the training data and,
therefore, cannot generalize well and performs poorly
during inference. The recommended value is 128.

Required

Valid values: 4 ≤ positive integer ≤ 4096

batch_metrics_publish_interval The interval (every X batches) at which the Apache MXNet
Speedometer function prints the training speed of the
network (samples/second).

Optional

Valid values: positive integer ≥ 1

Default value: 1,000



epochs The number of passes over the training data. The optimal
value depends on your data size and learning rate. Typical
values range from 5 to 100.

Optional

Valid values: positive integer ≥ 1

Default value: 10

learning_rate The learning rate for the optimizer. IP Insights uses a
gradient-descent-based Adam optimizer. The learning
rate effectively controls the step size to update model
parameters at each iteration. Too large a learning rate can
cause the model to diverge because the training is likely
to overshoot a minimum. On the other hand, too small a
learning rate slows down convergence. Typical values range
from 1e-4 to 1e-1.

Optional

Valid values: 1e-6 ≤ float ≤ 10.0

Default value: 0.001

mini_batch_size The number of examples in each mini batch. The
training procedure processes data in mini batches.
The optimal value depends on the number of unique
account identifiers in the dataset. In general, the larger
the mini_batch_size, the faster the training and the
greater the number of possible shuffled-negative-sample
combinations. However, with a large mini_batch_size,
the training is more likely to converge to a poor local
minimum and perform relatively worse for inference.

Optional

Valid values: 1 ≤ positive integer ≤ 500,000

Default value: 10,000

num_ip_encoder_layers The number of fully connected layers used to encode the
IP address embedding. The larger the number of layers, the
greater the model's capacity to capture patterns among
IP addresses. However, using a large number of layers
increases the chance of overfitting.

Optional

Valid values: 0 ≤ positive integer ≤ 100

Default value: 1



random_negative_sampling_rate The number of random negative samples, R, to generate
per input example. The training procedure relies on
negative samples to prevent the vector representations
of the model from collapsing to a single point. Random
negative sampling generates R random IP addresses
for each input account in the mini batch. The sum of
the random_negative_sampling_rate (R) and
shuffled_negative_sampling_rate (S) must be in the
interval: 1 ≤ R + S ≤ 500.

Optional

Valid values: 0 ≤ positive integer ≤ 500

Default value: 1

shuffled_negative_sampling_rate The number of shuffled negative samples, S, to generate
per input example. In some cases, it helps to use more
realistic negative samples that are randomly picked
from the training data itself. This kind of negative
sampling is achieved by shuffling the data within a
mini batch. Shuffled negative sampling generates
S negative IP addresses by shuffling the IP address
and account pairings within a mini batch. The sum
of the random_negative_sampling_rate (R) and
shuffled_negative_sampling_rate (S) must be in the
interval: 1 ≤ R + S ≤ 500.

Optional

Valid values: 0 ≤ positive integer ≤ 500

Default value: 1

weight_decay The weight decay coefficient. This parameter adds an L2
regularization factor that is required to prevent the model
from overfitting the training data.

Optional

Valid values: 0.0 ≤ float ≤ 10.0

Default value: 0.00001
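As a concrete illustration, the following sketch configures an IP Insights training job with the
SageMaker Python SDK and sets several of the hyperparameters above. The IAM role, bucket, and data
path are placeholders that you would replace with your own.

import sagemaker
from sagemaker import image_uris

session = sagemaker.Session()

# Retrieve the IP Insights built-in algorithm container for the current region
container = image_uris.retrieve("ipinsights", session.boto_region_name, "1")

ip_insights = sagemaker.estimator.Estimator(
    container,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)

# Hyperparameters described in the preceding table
ip_insights.set_hyperparameters(
    num_entity_vectors=20000,  # roughly twice the number of unique entities
    vector_dim=128,
    epochs=10,
    learning_rate=0.001,
    mini_batch_size=10000,
)

train_data = sagemaker.inputs.TrainingInput(
    s3_data="s3://DOC-EXAMPLE-BUCKET/ipinsights/train",  # placeholder path
    content_type="text/csv",
)
ip_insights.fit({"train": train_data})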

Tune an IP Insights Model

Automatic model tuning, also called hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).


Metrics Computed by the IP Insights Algorithm

The Amazon SageMaker IP Insights algorithm is an unsupervised learning algorithm that learns
associations between IP addresses and entities. The algorithm trains a discriminator model, which
learns to separate observed data points (positive samples) from randomly generated data points
(negative samples). Automatic model tuning on IP Insights helps you find the model that can most
accurately distinguish between unlabeled validation data and automatically generated negative samples.
The model accuracy on the validation dataset is measured by the area under the receiver operating
characteristic curve. This validation:discriminator_auc metric can take values between 0.0 and
1.0, where 1.0 indicates perfect accuracy.

The IP Insights algorithm computes a validation:discriminator_auc metric during validation, the
value of which is used as the objective function to optimize for hyperparameter tuning.

| Metric Name | Description | Optimization Direction |
| validation:discriminator_auc | Area under the receiver operating characteristic curve on the validation dataset. The validation dataset is not labeled. Area Under the Curve (AUC) is a metric that describes the model's ability to discriminate validation data points from randomly generated data points. | Maximize |

Tunable IP Insights Hyperparameters

You can tune the following hyperparameters for the SageMaker IP Insights algorithm.

| Parameter Name | Parameter Type | Recommended Ranges |
| epochs | IntegerParameterRanges | MinValue: 1, MaxValue: 100 |
| learning_rate | ContinuousParameterRanges | MinValue: 1e-4, MaxValue: 0.1 |
| mini_batch_size | IntegerParameterRanges | MinValue: 100, MaxValue: 50000 |
| num_entity_vectors | IntegerParameterRanges | MinValue: 10000, MaxValue: 1000000 |
| num_ip_encoder_layers | IntegerParameterRanges | MinValue: 1, MaxValue: 10 |
| random_negative_sampling_rate | IntegerParameterRanges | MinValue: 0, MaxValue: 10 |
| shuffled_negative_sampling_rate | IntegerParameterRanges | MinValue: 0, MaxValue: 10 |
| vector_dim | IntegerParameterRanges | MinValue: 8, MaxValue: 256 |
| weight_decay | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 1.0 |
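The following sketch shows one way to launch such a tuning job with the SageMaker Python SDK's
HyperparameterTuner, using validation:discriminator_auc as the objective. The estimator
(ip_insights) and the train_data and validation_data channel inputs are assumed to be set up as in
the earlier training example; adjust the ranges to your dataset.

from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Search ranges drawn from the recommended ranges in the preceding table
hyperparameter_ranges = {
    "vector_dim": IntegerParameter(8, 256),
    "learning_rate": ContinuousParameter(1e-4, 0.1),
    "mini_batch_size": IntegerParameter(100, 50000),
}

tuner = HyperparameterTuner(
    estimator=ip_insights,  # an IP Insights estimator, as in the training example
    objective_metric_name="validation:discriminator_auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

# The validation channel supplies the data used to compute the objective metric
tuner.fit({"train": train_data, "validation": validation_data})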


IP Insights Data Formats

This section provides examples of the available input and output data formats used by the IP Insights
algorithm during training and inference.

Topics
• IP Insights Training Data Formats (p. 1483)
• IP Insights Inference Data Formats (p. 1483)

IP Insights Training Data Formats

The following are the available data input formats for the IP Insights algorithm. Amazon SageMaker
built-in algorithms adhere to the common input training format described in Common Data Formats for
Training (p. 1290). However, the SageMaker IP Insights algorithm currently supports only the CSV data
input format.

IP Insights Training Data Input Formats

INPUT: CSV

The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's
unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot
notation.

content-type: text/csv

entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2

IP Insights Inference Data Formats

The following are the available input and output formats for the IP Insights algorithm. Amazon
SageMaker built-in algorithms adhere to the common input inference format described in Common
Data Formats for Inference (p. 1293). However, the SageMaker IP Insights algorithm does not currently
support RecordIO format.

IP Insights Input Request Formats

INPUT: CSV Format

The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's
unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot
notation.

content-type: text/csv

entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2

INPUT: JSON Format

JSON data can be provided in different formats. IP Insights follows the common SageMaker formats. For
more information about inference formats, see Common Data Formats for Inference (p. 1293).

content-type: application/json


{
"instances": [
{"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},
{"features": ["entity_id_2", "10.10.1.2"]}
]
}

INPUT: JSONLINES Format

The JSON Lines content type is useful for running batch transform jobs. For more information on
SageMaker inference formats, see Common Data Formats for Inference (p. 1293). For more information
on running batch transform jobs, see Use Batch Transform (p. 2421).

content-type: application/jsonlines

{"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},


{"features": ["entity_id_2", "10.10.1.2"]}]

IP Insights Output Response Formats

OUTPUT: JSON Response Format

The default output of the SageMaker IP Insights algorithm is the dot_product between the input
entity and IP address. The dot_product signifies how compatible the model considers the entity and IP
address. The dot_product is unbounded. To make predictions about whether an event is anomalous,
you need to set a threshold based on the distribution of scores that you observe. For information about
how to use the dot_product for anomaly detection, see An Introduction to the SageMaker IP Insights
Algorithm.

accept: application/json

{
"predictions": [
{"dot_product": 0.0},
{"dot_product": 2.0}
]
}

Advanced users can access the model's learned entity and IP embeddings by providing the additional
content-type parameter verbose=True to the Accept header. You can use the entity_embedding
and ip_embedding for debugging, visualizing, and understanding the model. Additionally, you can use
these embeddings in other machine learning techniques, such as classification or clustering.

accept: application/json;verbose=True

{
"predictions": [
{
"dot_product": 0.0,
"entity_embedding": [1.0, 0.0, 0.0],
"ip_embedding": [0.0, 1.0, 0.0]
},
{
"dot_product": 2.0,
"entity_embedding": [1.0, 0.0, 1.0],
"ip_embedding": [1.0, 0.0, 1.0]
}
]
}


OUTPUT: JSONLINES Response Format

accept: application/jsonlines

{"dot_product": 0.0}
{"dot_product": 2.0}

accept: application/jsonlines; verbose=True

{"dot_product": 0.0, "entity_embedding": [1.0, 0.0, 0.0], "ip_embedding": [0.0, 1.0, 0.0]}
{"dot_product": 2.0, "entity_embedding": [1.0, 0.0, 1.0], "ip_embedding": [1.0, 0.0, 1.0]}

K-Means Algorithm
K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where
members of a group are as similar as possible to one another and as different as possible from members
of other groups. You define the attributes that you want the algorithm to use to determine similarity.

Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared
with the original version of the algorithm, the version used by Amazon SageMaker is more accurate.
Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To
do this, the version used by Amazon SageMaker streams mini-batches (small, random subsets) of the
training data. For more information about mini-batch k-means, see Web-scale k-means Clustering.

The k-means algorithm expects tabular data, where rows represent the observations that you want to
cluster, and the columns represent attributes of the observations. The n attributes in each row represent
a point in n-dimensional space. The Euclidean distance between these points represents the similarity
of the corresponding observations. The algorithm groups observations with similar attribute values (the
points corresponding to these observations are closer together). For more information about how k-
means works in Amazon SageMaker, see How K-Means Clustering Works (p. 1486).

Topics
• Input/Output Interface for the K-Means Algorithm (p. 1485)
• EC2 Instance Recommendation for the K-Means Algorithm (p. 1486)
• K-Means Sample Notebooks (p. 1486)
• How K-Means Clustering Works (p. 1486)
• K-Means Hyperparameters (p. 1489)
• Tune a K-Means Model (p. 1491)
• K-Means Response Formats (p. 1492)

Input/Output Interface for the K-Means Algorithm

For training, the k-means algorithm expects data to be provided in the train channel (recommended
S3DataDistributionType=ShardedByS3Key), with an optional test channel (recommended
S3DataDistributionType=FullyReplicated) to score the data on. Both recordIO-wrapped-
protobuf and CSV formats are supported for training. You can use either File mode or Pipe mode to
train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.

For inference, text/csv, application/json, and application/x-recordio-protobuf are
supported. k-means returns a closest_cluster label and the distance_to_cluster for each
observation.

For more information on input and output file formats, see K-Means Response Formats (p. 1492) for
inference and the K-Means Sample Notebooks (p. 1486). The k-means algorithm does not support


multiple instance learning, in which the training set consists of labeled “bags”, each of which is a
collection of unlabeled instances.

EC2 Instance Recommendation for the K-Means Algorithm

We recommend training k-means on CPU instances. You can train on GPU instances, but should limit
GPU training to single-GPU instances (such as ml.g4dn.xlarge) because only one GPU is used per
instance. The k-means algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

K-Means Sample Notebooks

For a sample notebook that uses the SageMaker K-means algorithm to segment the population of
counties in the United States by attributes identified using principal component analysis, see Analyze
US census data for population segmentation using Amazon SageMaker. For instructions on how to create
and access Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon
SageMaker Notebook Instances (p. 204). Once you have created a notebook instance and opened it,
select the SageMaker Examples tab to see a list of all the SageMaker samples. To open a notebook, click
on its Use tab and select Create copy.

How K-Means Clustering Works

K-means is an algorithm that trains a model that groups similar objects together. The k-means algorithm
accomplishes this by mapping each observation in the input dataset to a point in the n-dimensional
space (where n is the number of attributes of the observation). For example, your dataset might contain
observations of temperature and humidity in a particular location, which are mapped to points (t, h) in 2-
dimensional space.

Note
Clustering algorithms are unsupervised. In unsupervised learning, labels that might be
associated with the objects in the training dataset aren't used.

In k-means clustering, each cluster has a center. During model training, the k-means algorithm uses the
distance of the point that corresponds to each observation in the dataset to the cluster centers as the
basis for clustering. You choose the number of clusters (k) to create.

For example, suppose that you want to create a model to recognize handwritten digits and you choose
the MNIST dataset for training. The dataset provides thousands of images of handwritten digits (0
through 9). In this example, you might choose to create 10 clusters, one for each digit (0, 1, …, 9). As
part of model training, the k-means algorithm groups the input images into 10 clusters.

Each image in the MNIST dataset is a 28x28-pixel image, with a total of 784 pixels. Each image
corresponds to a point in a 784-dimensional space, similar to a point in a 2-dimensional space (x,y). To
find a cluster to which a point belongs, the k-means algorithm finds the distance of that point from all of
the cluster centers. It then chooses the cluster with the closest center as the cluster to which the image
belongs.
Note
Amazon SageMaker uses a customized version of the algorithm where, instead of specifying that
the algorithm create k clusters, you might choose to improve model accuracy by specifying extra
cluster centers (K = k*x). However, the algorithm ultimately reduces these to k clusters.

In SageMaker, you specify the number of clusters when creating a training job. For more information, see
CreateTrainingJob. In the request body, you add the HyperParameters string map to specify the k
and extra_center_factor strings.

The following is a summary of how k-means works for model training in SageMaker:

1. It determines the initial K cluster centers.


Note
In the following topics, K clusters refer to k * x, where you specify k and x when creating a
model training job.
2. It iterates over input training data and recalculates cluster centers.
3. It reduces resulting clusters to k (if the data scientist specified the creation of k*x clusters in the
request).

The following sections also explain some of the parameters that a data scientist might specify to
configure a model training job as part of the HyperParameters string map.

Topics
• Step 1: Determine the Initial Cluster Centers (p. 1487)
• Step 2: Iterate over the Training Dataset and Calculate Cluster Centers (p. 1488)
• Step 3: Reduce the Clusters from K to k (p. 1488)

Step 1: Determine the Initial Cluster Centers

When using k-means in SageMaker, the initial cluster centers are chosen from the observations in a
small, randomly sampled batch. Choose one of the following strategies to determine how these initial
cluster centers are selected:

• The random approach—Randomly choose K observations in your input dataset as cluster centers. For
example, you might choose the points in the 784-dimensional space that correspond to any 10 images
in the MNIST training dataset as your cluster centers.
• The k-means++ approach, which works as follows:
1. Start with one cluster and determine its center. You randomly select an observation from your
training dataset and use the point corresponding to the observation as the cluster center. For
example, in the MNIST dataset, randomly choose a handwritten digit image. Then choose the point
in the 784-dimensional space that corresponds to the image as your cluster center. This is cluster
center 1.
2. Determine the center for cluster 2. From the remaining observations in the training dataset, pick
an observation at random. Choose one that is different than the one you previously selected. This
observation corresponds to a point that is far away from cluster center 1. Using the MNIST dataset
as an example, you do the following:
• For each of the remaining images, find the distance of the corresponding point from cluster
center 1. Square the distance and assign a probability that is proportional to the square of the
distance. That way, an image that is different from the one that you previously selected has a
higher probability of getting selected as cluster center 2.
• Choose one of the images randomly, based on probabilities assigned in the previous step. The
point that corresponds to the image is cluster center 2.
3. Repeat Step 2 to find cluster center 3. This time, find the distances of the remaining images from
cluster center 2.
4. Repeat the process until you have the K cluster centers.

To train a model in SageMaker, you create a training job. In the request, you provide configuration
information by specifying the following HyperParameters string maps:

• To specify the number of clusters to create, add the k string.


• For greater accuracy, add the optional extra_center_factor string.
• To specify the strategy that you want to use to determine the initial cluster centers, add the
init_method string and set its value to random or k-means++.


For more information about the SageMaker k-means estimator, see K-means in the Amazon SageMaker
Python SDK documentation.
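For reference, a minimal sketch of configuring these settings with the SDK's KMeans estimator might
look like the following. The IAM role and output path are placeholders, and the toy training data stands
in for your own dataset.

import numpy as np
from sagemaker import KMeans

kmeans = KMeans(
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.xlarge",
    k=10,                    # number of clusters to create
    init_method="kmeans++",  # strategy for choosing the initial cluster centers
    output_path="s3://DOC-EXAMPLE-BUCKET/kmeans/output",  # placeholder path
)

# Toy training data: record_set converts a float32 numpy array into the
# protobuf recordIO format and uploads it to S3 for training
train_set = np.random.rand(1000, 2).astype("float32")
kmeans.fit(kmeans.record_set(train_set))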

You now have an initial set of cluster centers.

Step 2: Iterate over the Training Dataset and Calculate Cluster Centers

The cluster centers that you created in the preceding step are mostly random, with some consideration
for the training dataset. In this step, you use the training dataset to move these centers toward the true
cluster centers. The algorithm iterates over the training dataset, and recalculates the K cluster centers.

1. Read a mini-batch of observations (a small, randomly chosen subset of all records) from the training
dataset and do the following.
Note
When creating a model training job, you specify the batch size in the mini_batch_size
string in the HyperParameters string map.

a. Assign all of the observations in the mini-batch to one of the clusters with the closest cluster
center.
b. Calculate the number of observations assigned to each cluster. Then, calculate the proportion of
new points assigned per cluster.

For example, consider the following clusters:

Cluster c1 = 100 previously assigned points. You added 25 points from the mini-batch in this
step.

Cluster c2 = 150 previously assigned points. You added 40 points from the mini-batch in this
step.

Cluster c3 = 450 previously assigned points. You added 5 points from the mini-batch in this
step.

Calculate the proportion of new points assigned to each of clusters as follows:

p1 = proportion of points assigned to c1 = 25/(100+25)
p2 = proportion of points assigned to c2 = 40/(150+40)
p3 = proportion of points assigned to c3 = 5/(450+5)

c. Compute the center of the new points added to each cluster:

d1 = center of the new points added to cluster 1
d2 = center of the new points added to cluster 2
d3 = center of the new points added to cluster 3

d. Compute the weighted average to find the updated cluster centers as follows:

Center of cluster 1 = ((1 - p1) * center of cluster 1) + (p1 * d1)
Center of cluster 2 = ((1 - p2) * center of cluster 2) + (p2 * d2)
Center of cluster 3 = ((1 - p3) * center of cluster 3) + (p3 * d3)

2. Read the next mini-batch, and repeat Step 1 to recalculate the cluster centers.
For more information about mini-batch k-means, see Web-Scale k-means Clustering. A simplified sketch of the Step 1d update follows.
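The following is a minimal, single-machine sketch of the weighted-average update from Step 1d,
assuming numpy arrays for the centers and the mini-batch. It illustrates the update rule only (no
k-means++ initialization, weight decay, or extra centers) and is not the SageMaker implementation.

import numpy as np

def minibatch_update(centers, counts, batch):
    """One mini-batch k-means update: assign points, then move centers."""
    # Assign each observation to the closest cluster center
    distances = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
    assignments = np.argmin(distances, axis=1)

    for j in range(len(centers)):
        new_points = batch[assignments == j]
        if len(new_points) == 0:
            continue
        # Proportion of the cluster's points contributed by this mini-batch
        p = len(new_points) / (counts[j] + len(new_points))
        d = new_points.mean(axis=0)  # center of the new points
        centers[j] = (1 - p) * centers[j] + p * d
        counts[j] += len(new_points)
    return centers, counts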

Step 3: Reduce the Clusters from K to k

If the algorithm created K clusters (K = k*x, where x is greater than 1), then it reduces the K clusters to
k clusters. (For more information, see extra_center_factor in the preceding discussion.) It does this


by applying Lloyd's method with kmeans++ initialization to the K cluster centers. For more information
about Lloyd's method, see k-means clustering.

K-Means Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm that you want to use. You
can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists
the hyperparameters for the k-means training algorithm provided by Amazon SageMaker. For more
information about how k-means clustering works, see How K-Means Clustering Works (p. 1486).

Parameter Name Description

feature_dim The number of features in the input data.

Required

Valid values: Positive integer

k The number of required clusters.

Required

Valid values: Positive integer

epochs The number of passes done over the training data.

Optional

Valid values: Positive integer

Default value: 1

eval_metrics A JSON list of metric types used to report a score for the model.
Allowed values are msd for Means Square Deviation and ssd
for Sum of Square Distance. If test data is provided, the score is
reported for each of the metrics requested.

Optional

Valid values: Either [\"msd\"] or [\"ssd\"] or [\"msd\", \"ssd\"]

Default value: [\"msd\"]

extra_center_factor The algorithm creates K centers = num_clusters *
extra_center_factor as it runs and reduces the number of
centers from K to k when finalizing the model.

Optional

Valid values: Either a positive integer or auto.

Default value: auto

half_life_time_size Used to determine the weight given to an observation when
computing a cluster mean. This weight decays exponentially as
more points are observed. When a point is first observed, it is
assigned a weight of 1 when computing the cluster mean. The
decay constant for the exponential decay function is chosen so that
after observing half_life_time_size points, its weight is 1/2. If
set to 0, there is no decay.

Optional

Valid values: Non-negative integer

Default value: 0

init_method Method by which the algorithm chooses the initial cluster centers.
The standard k-means approach chooses them at random. An
alternative k-means++ method chooses the first cluster center at
random. Then it spreads out the position of the remaining initial
clusters by weighting the selection of centers with a probability
distribution that is proportional to the square of the distance of the
remaining data points from existing centers.

Optional

Valid values: Either random or kmeans++.

Default value: random

local_lloyd_init_method The initialization method for Lloyd's expectation-maximization (EM)
procedure used to build the final model containing k centers.

Optional

Valid values: Either random or kmeans++.

Default value: kmeans++

local_lloyd_max_iter The maximum number of iterations for Lloyd's expectation-
maximization (EM) procedure used to build the final model
containing k centers.

Optional

Valid values: Positive integer

Default value: 300

local_lloyd_num_trials The number of times Lloyd's expectation-maximization (EM)
procedure is run when building the final model containing k
centers. The run with the least loss is kept.

Optional

Valid values: Either a positive integer or auto.

Default value: auto



local_lloyd_tol The tolerance for change in loss for early stopping of Lloyd's
expectation-maximization (EM) procedure used to build the final
model containing k centers.

Optional

Valid values: Float. Range in [0, 1].

Default value: 0.0001

mini_batch_size The number of observations per mini-batch for the data iterator.

Optional

Valid values: Positive integer

Default value: 5000

Tune a K-Means Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker k-means algorithm is an unsupervised algorithm that groups data into clusters
whose members are as similar as possible. Because it is unsupervised, it doesn't use a validation dataset
that hyperparameters can be optimized against. But it does take a test dataset and emits metrics that
depend on the squared distance between the data points and the final cluster centroids at the end of
each training run. To find the model that reports the tightest clusters on the test dataset, you can use a
hyperparameter tuning job. The clusters optimize the similarity of their members.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the K-Means Algorithm

The k-means algorithm computes the following metrics during training. When tuning a model, choose
one of these metrics as the objective metric.

| Metric Name | Description | Optimization Direction |
| test:msd | Mean squared distances between each record in the test set and the closest center of the model. | Minimize |
| test:ssd | Sum of the squared distances between each record in the test set and the closest center of the model. | Minimize |

Tunable K-Means Hyperparameters

Tune the Amazon SageMaker k-means model with the following hyperparameters. The
hyperparameters that have the greatest impact on k-means objective metrics are: mini_batch_size,
extra_center_factor, and init_method. Tuning the hyperparameter epochs generally results in
minor improvements.

| Parameter Name | Parameter Type | Recommended Ranges |
| epochs | IntegerParameterRanges | MinValue: 1, MaxValue: 10 |
| extra_center_factor | IntegerParameterRanges | MinValue: 4, MaxValue: 10 |
| init_method | CategoricalParameterRanges | ['kmeans++', 'random'] |
| mini_batch_size | IntegerParameterRanges | MinValue: 3000, MaxValue: 15000 |

K-Means Response Formats

All SageMaker built-in algorithms adhere to the common input inference format described in Common
Data Formats - Inference. This topic contains a list of the available output formats for the SageMaker k-
means algorithm.

JSON Response Format

{
"predictions": [
{
"closest_cluster": 1.0,
"distance_to_cluster": 3.0
},
{
"closest_cluster": 2.0,
"distance_to_cluster": 5.0
},

....
]
}

JSONLINES Response Format

{"closest_cluster": 1.0, "distance_to_cluster": 3.0}


{"closest_cluster": 2.0, "distance_to_cluster": 5.0}

RECORDIO Response Format

[
Record = {
features = {},
label = {
'closest_cluster': {
keys: [],
values: [1.0, 2.0] # float32
},
'distance_to_cluster': {
keys: [],
values: [3.0, 5.0] # float32


},
}
}
]

CSV Response Format

The first value in each line corresponds to closest_cluster.

The second value in each line corresponds to distance_to_cluster.

1.0,3.0
2.0,5.0

Principal Component Analysis (PCA) Algorithm


PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number
of features) within a dataset while still retaining as much information as possible. This is done by
finding a new set of features called components, which are composites of the original features that are
uncorrelated with one another. They are also constrained so that the first component accounts for the
largest possible variability in the data, the second component the second most variability, and so on.

In Amazon SageMaker, PCA operates in two modes, depending on the scenario:

• regular: For datasets with sparse data and a moderate number of observations and features.
• randomized: For datasets with both a large number of observations and features. This mode uses an
approximation algorithm.

PCA uses tabular data.

The rows represent observations you want to embed in a lower dimensional space. The columns
represent features that you want to find a reduced approximation for. The algorithm calculates the
covariance matrix (or an approximation thereof in a distributed manner), and then performs the singular
value decomposition on this summary to produce the principal components.

Topics
• Input/Output Interface for the PCA Algorithm (p. 1493)
• EC2 Instance Recommendation for the PCA Algorithm (p. 1494)
• PCA Sample Notebooks (p. 1494)
• How PCA Works (p. 1494)
• PCA Hyperparameters (p. 1495)
• PCA Response Formats (p. 1496)

Input/Output Interface for the PCA Algorithm

For training, PCA expects data provided in the train channel, and optionally supports a test channel,
which is scored by the final algorithm. Both recordIO-wrapped-protobuf and CSV
formats are supported for training. You can use either File mode or Pipe mode to train models on data
that is formatted as recordIO-wrapped-protobuf or as CSV.

For inference, PCA supports text/csv, application/json, and application/x-recordio-
protobuf. Results are returned in either application/json or application/x-recordio-
protobuf format with a vector of "projections."


For more information on input and output file formats, see PCA Response Formats (p. 1496) for
inference and the PCA Sample Notebooks (p. 1494).

EC2 Instance Recommendation for the PCA Algorithm

PCA supports CPU and GPU instances for training and inference. Which instance type is most performant
depends heavily on the specifics of the input data. For GPU instances, PCA supports P2, P3, G4dn, and
G5.

PCA Sample Notebooks

For a sample notebook that shows how to use the SageMaker Principal Component Analysis algorithm to
analyze the images of handwritten digits from zero to nine in the MNIST dataset, see An Introduction to
PCA with MNIST. For instructions how to create and access Jupyter notebook instances that you can use
to run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you have
created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all the
SageMaker samples. The topic modeling example notebooks using the NTM algorithms are located in the
Introduction to Amazon algorithms section. To open a notebook, click on its Use tab and select Create
copy.

How PCA Works

Principal Component Analysis (PCA) is a learning algorithm that reduces the dimensionality (number of
features) within a dataset while still retaining as much information as possible.

PCA reduces dimensionality by finding a new set of features called components, which are composites of
the original features, but are uncorrelated with one another. The first component accounts for the largest
possible variability in the data, the second component the second most variability, and so on.

It is an unsupervised dimensionality reduction algorithm. In unsupervised learning, labels that might be
associated with the objects in the training dataset aren't used.

Given the input of a matrix with rows $x_1, \ldots, x_n$, each of dimension 1 * d, the data is partitioned
into mini-batches of rows and distributed among the training nodes (workers). Each worker then
computes a summary of its data. The summaries of the different workers are then unified into a single
solution at the end of the computation.

Modes

The Amazon SageMaker PCA algorithm uses either of two modes to calculate these summaries,
depending on the situation:

• regular: for datasets with sparse data and a moderate number of observations and features.
• randomized: for datasets with both a large number of observations and features. This mode uses an
approximation algorithm.

As the algorithm's last step, it performs the singular value decomposition on the unified solution, from
which the principal components are then derived.

Mode 1: Regular

The workers jointly compute both $\sum_i x_i$ and $\sum_i x_i^T x_i$.

Note
Because the $x_i$ are 1 * d row vectors, $x_i^T x_i$ is a d * d matrix (not a scalar). Using row
vectors within the code allows us to obtain efficient caching.

The covariance matrix is computed as $\frac{1}{n} \sum_i x_i^T x_i - \mu^T \mu$, where
$\mu = \frac{1}{n} \sum_i x_i$ is the mean row, and its top num_components singular vectors form
the model.

Note
If subtract_mean is False, we avoid computing and subtracting $\mu$.

Use this algorithm when the dimension d of the vectors is small enough so that a d * d matrix can fit in
memory.
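The regular-mode computation can be sketched in a few lines of numpy. This is a simplified,
single-machine illustration of the summary described above, not the distributed SageMaker
implementation.

import numpy as np

def pca_regular(X, num_components, subtract_mean=True):
    """Exact PCA from the d x d covariance summary of the rows of X."""
    n, d = X.shape
    gram = X.T @ X / n                       # (1/n) * sum of x_i^T x_i
    if subtract_mean:
        mu = X.mean(axis=0, keepdims=True)   # mean row
        gram = gram - mu.T @ mu              # covariance matrix
    # Top singular vectors of the covariance are the principal components
    _, _, vt = np.linalg.svd(gram)
    return vt[:num_components]

components = pca_regular(np.random.randn(1000, 20).astype(np.float32), 5)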

Mode 2: Randomized

When the number of features in the input dataset is large, we use a method to approximate
the covariance metric. For every mini-batch of dimension b * d, we randomly initialize a
(num_components + extra_components) * b matrix that we multiply by each mini-batch
to create a (num_components + extra_components) * d matrix. The sum of these matrices
is computed by the workers, and the servers perform SVD on the final (num_components +
extra_components) * d matrix. The top right num_components singular vectors of it are the
approximation of the top singular vectors of the input matrix.

Let $\ell$ = num_components + extra_components. Given a mini-batch $X$ of dimension b * d, the
worker draws a random matrix $\Omega$ of dimension $\ell$ * b. Depending on whether the
environment uses a GPU or CPU and the dimension size, the matrix is either a random sign matrix
where each entry is ±1 or an FJLT (fast Johnson Lindenstrauss transform; for information, see FJLT
Transforms and the follow-up papers). The worker then computes $\Omega X$ and maintains
$B = \sum_t \Omega_t X_t$. The worker also maintains $h$, the sum of the columns of
$\Omega_1, \ldots, \Omega_T$ (T being the total number of mini-batches), and $s$, the sum of all input
rows. After processing the entire shard of data, the worker sends the server B, h, s, and n (the number of
input rows).

Denote the different inputs to the server as $B^i, h^i, s^i, n^i$. The server computes B, h, s, n, the sums
of the respective inputs. It then computes $C = \frac{1}{n} B - \frac{1}{n^2} h s$, which mean-centers
the sketch, and finds its singular value decomposition. The top-
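A minimal single-machine sketch of this randomized sketching idea follows, using a random sign matrix
per mini-batch and the mean-centering correction as reconstructed above. It illustrates the approach
only; the actual algorithm distributes this work across workers and may use an FJLT instead of a sign
matrix.

import numpy as np

def pca_randomized(batches, d, num_components, extra_components=10):
    ell = num_components + extra_components
    B = np.zeros((ell, d))  # sum of sketched mini-batches
    h = np.zeros(ell)       # sum of the columns of the random matrices
    s = np.zeros(d)         # sum of all input rows
    n = 0
    for X in batches:       # each X has shape (b, d)
        b = X.shape[0]
        omega = np.random.choice([-1.0, 1.0], size=(ell, b))  # random sign matrix
        B += omega @ X
        h += omega.sum(axis=1)
        s += X.sum(axis=0)
        n += b
    # Mean-center the sketch (as reconstructed above), then take its SVD
    C = B / n - np.outer(h, s) / n**2
    _, _, vt = np.linalg.svd(C, full_matrices=False)
    return vt[:num_components]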

PCA Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific HyperParameters as string-to-string maps. The following table lists the hyperparameters for the
PCA training algorithm provided by Amazon SageMaker. For more information about how PCA works, see
How PCA Works (p. 1494).

Parameter Name Description

feature_dim Input dimension.

Required

Valid values: positive integer

mini_batch_size Number of rows in a mini-batch.

Required

Valid values: positive integer

num_components The number of principal components to compute.

Required

Valid values: positive integer



algorithm_mode Mode for computing the principal components.

Optional

Valid values: regular or randomized

Default value: regular

extra_components As the value increases, the solution becomes more accurate but the
runtime and memory consumption increase linearly. The default,
-1, means the maximum of 10 and num_components. Valid for
randomized mode only.

Optional

Valid values: Non-negative integer or -1

Default value: -1

subtract_mean Indicates whether the data should be unbiased both during training
and at inference.

Optional

Valid values: One of true or false

Default value: true

PCA Response Formats

All Amazon SageMaker built-in algorithms adhere to the common input inference format described
in Common Data Formats - Inference. This topic contains a list of the available output formats for the
SageMaker PCA algorithm.

JSON Response Format

Accept—application/json

{
"projections": [
{
"projection": [1.0, 2.0, 3.0, 4.0, 5.0]
},
{
"projection": [6.0, 7.0, 8.0, 9.0, 0.0]
},
....
]
}

JSONLINES Response Format

Accept—application/jsonlines

{ "projection": [1.0, 2.0, 3.0, 4.0, 5.0] }


{ "projection": [6.0, 7.0, 8.0, 9.0, 0.0] }


RECORDIO Response Format


Accept—application/x-recordio-protobuf

[
Record = {
features = {},
label = {
'projection': {
keys: [],
values: [1.0, 2.0, 3.0, 4.0, 5.0]
}
}
},
Record = {
features = {},
label = {
'projection': {
keys: [],
values: [6.0, 7.0, 8.0, 9.0, 0.0]
}
}
}
]

Random Cut Forest (RCF) Algorithm


Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous
data points within a data set. These are observations which diverge from otherwise well-structured or
patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or
unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily
distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase
the complexity of a machine learning task since the "regular" data can often be described with a simple
model.

With each data point, RCF associates an anomaly score. Low score values indicate that the data point
is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of
"low" and "high" depend on the application but common practice suggests that scores beyond three
standard deviations from the mean score are considered anomalous.
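A common way to apply this three-standard-deviation rule to scores returned by the model is a few
lines of numpy. The scores array here is synthetic, standing in for the anomaly scores that an RCF
endpoint would return.

import numpy as np

# Anomaly scores returned by the RCF model; synthetic values for illustration
scores = np.random.normal(loc=1.0, scale=0.1, size=1000)
scores[42] = 2.5  # a hypothetical anomalous point with an unusually high score

# Flag points whose score is more than three standard deviations above the mean
threshold = scores.mean() + 3 * scores.std()
anomalies = np.where(scores > threshold)[0]
print(f"threshold={threshold:.3f}, anomalous indices={anomalies}")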

While there are many applications of anomaly detection algorithms to one-dimensional time series data
such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-
dimensional input. Amazon SageMaker RCF scales well with respect to number of features, data set size,
and number of instances.

Topics
• Input/Output Interface for the RCF Algorithm (p. 1497)
• Instance Recommendations for the RCF Algorithm (p. 1498)
• RCF Sample Notebooks (p. 1498)
• How RCF Works (p. 1499)
• RCF Hyperparameters (p. 1501)
• Tune an RCF Model (p. 1502)
• RCF Response Formats (p. 1503)

Input/Output Interface for the RCF Algorithm


Amazon SageMaker Random Cut Forest supports the train and test data channels. The optional test
channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and


test data content types can be either application/x-recordio-protobuf or text/csv formats.
For the test data, when using text/csv format, the content must be specified as
text/csv;label_size=1, where the first column of each row represents the anomaly label: "1" for an
anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to
train RCF models on data that is formatted as recordIO-wrapped-protobuf or as CSV.

The train channel only supports S3DataDistributionType=ShardedByS3Key and the test channel
only supports S3DataDistributionType=FullyReplicated. The following example specifies the S3
distribution type for the train channel using the Amazon SageMaker Python SDK.
Note
The sagemaker.inputs.s3_input method was renamed to
sagemaker.inputs.TrainingInput in SageMaker Python SDK v2.

import sagemaker

# Specify Random Cut Forest training job information and hyperparameters
rcf = sagemaker.estimator.Estimator(...)

# Explicitly specify the "ShardedByS3Key" distribution type
train_data = sagemaker.inputs.TrainingInput(
    s3_data=s3_training_data_location,
    content_type='text/csv;label_size=0',
    distribution='ShardedByS3Key')

# Run the training job on input data stored in S3
rcf.fit({'train': train_data})

To avoid common errors around execution roles, ensure that you have the execution roles required,
AmazonSageMakerFullAccess and AmazonEC2ContainerRegistryFullAccess. To avoid common
errors around your image not existing or its permissions being incorrect, ensure that your ECR image is
not larger than the allocated disk space on the training instance. To avoid this, run your training job on
an instance that has sufficient disk space. In addition, if your ECR image is from a different AWS account's
Elastic Container Registry (ECR) repository, and you do not set repository permissions to grant access, this
will result in an error. See the ECR repository permissions for more information on setting a repository
policy statement.

See the S3DataSource for more information on customizing the S3 data source attributes. Finally, in
order to take advantage of multi-instance training the training data must be partitioned into at least as
many files as instances.

For inference, RCF supports application/x-recordio-protobuf, text/csv, and application/
json input data content types. See the Common Data Formats for Built-in Algorithms (p. 1289)
documentation for more information. RCF inference returns application/x-recordio-protobuf
or application/json formatted output. Each record in these output data contains the corresponding
anomaly scores for each input data point. See Common Data Formats--Inference for more information.

For more information on input and output file formats, see RCF Response Formats (p. 1503) for
inference and the RCF Sample Notebooks (p. 1498).

Instance Recommendations for the RCF Algorithm


For training, we recommend the ml.m4, ml.c4, and ml.c5 instance families. For inference, we
recommend the ml.c5.xlarge instance type in particular, for maximum performance as well as
minimized cost per hour of usage. Although the algorithm could technically run on GPU instance types,
it does not take advantage of GPU hardware.

RCF Sample Notebooks


For an example of how to train an RCF model and perform inferences with it, see the An Introduction to
SageMaker Random Cut Forests notebook. For instructions on how to create and access Jupyter notebook

instances that you can use to run the example in SageMaker, see Amazon SageMaker Notebook
Instances (p. 204). Once you have created a notebook instance and opened it, select the SageMaker
Examples tab to see a list of all the SageMaker samples. To open a notebook, click on its Use tab and
select Create copy.

How RCF Works


Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous
data points within a dataset. These are observations which diverge from otherwise well-structured or
patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or
unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily
distinguishable from the "regular" data. Including these anomalies in a dataset can drastically increase
the complexity of a machine learning task since the "regular" data can often be described with a simple
model.

The main idea behind the RCF algorithm is to create a forest of trees where each tree is obtained using
a partition of a sample of the training data. For example, a random sample of the input data is first
determined. The random sample is then partitioned according to the number of trees in the forest. Each
tree is given such a partition and organizes that subset of points into a k-d tree. The anomaly score
assigned to a data point by the tree is defined as the expected change in complexity of the tree as a
result of adding that point to the tree, which, in approximation, is inversely proportional to the resulting
depth of the point in the tree. The random cut forest assigns an anomaly score by computing the
average score from each constituent tree and scaling the result with respect to the sample size. The RCF
algorithm is based on the one described in reference [1].

Sample Data Randomly


The first step in the RCF algorithm is to obtain a random sample of the training data. In particular,
suppose we want a sample of size $k$ from $n$ total data points. If the training data is small enough,
the entire dataset can be used, and we could randomly draw $k$ elements from this set. However,
frequently the training data is too large to fit all at once, and this approach isn't feasible. Instead, we use
a technique called reservoir sampling.

Reservoir sampling is an algorithm for efficiently drawing random samples from a dataset
$S = \{ s_1, s_2, \ldots, s_n \}$ where the elements in the dataset can only be observed one at a time
or in batches. In fact, reservoir sampling works even when $n$ is not known a priori. If only one sample
is requested, such as when $k = 1$, the algorithm is as follows:

Algorithm: Reservoir Sampling

• Input: dataset or data stream $S = \{ s_1, s_2, \ldots \}$
• Initialize the random sample $X = s_1$
• For each observed sample $s_i$, $i \geq 2$:
  • Pick a uniform random number $\xi \in (0, 1)$
  • If $\xi < 1/i$:
    • Set $X = s_i$
• Return $X$

This algorithm selects a random sample such that $P(X = s_i) = 1/n$ for all $i$. When $k > 1$ the
algorithm is more complicated. Additionally, a distinction must be made between random sampling that
is with and without replacement. RCF performs an augmented reservoir sampling without replacement
on the training data, based on the algorithms described in [2]. A sketch of reservoir sampling follows.
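The single-sample case can be written directly from the pseudocode above. The following sketch shows
the standard generalization to k > 1 samples without replacement; RCF's augmented variant from [2]
differs in its details.

import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = random.randrange(i)  # uniform integer in [0, i)
            if j < k:                # keep item with probability k/i
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))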

Train an RCF Model and Produce Inferences


The next step in RCF is to construct a random cut forest using the random sample of data. First, the
sample is partitioned into a number of equal-sized partitions equal to the number of trees in the forest.


Then, each partition is sent to an individual tree. The tree recursively organizes its partition into a binary
tree by partitioning the data domain into bounding boxes.

This procedure is best illustrated with an example. Suppose a tree is given the following two-dimensional
dataset. The corresponding tree is initialized to the root node:

[Figure: A two-dimensional dataset where the majority of data lies in a cluster (blue) except for one
anomalous data point (orange). The tree is initialized with a root node.]

The RCF algorithm organizes these data in a tree by first computing a bounding box of the data,
selecting a random dimension (giving more weight to dimensions with higher "variance"), and then
randomly determining the position of a hyperplane "cut" through that dimension. The two resulting
subspaces define their own sub tree. In this example, the cut happens to separate a lone point from the
remainder of the sample. The first level of the resulting binary tree consists of two nodes, one which will
consist of the subtree of points to the left of the initial cut and the other representing the single point on
the right.


[Figure: A random cut partitioning the two-dimensional dataset. An anomalous data point is more likely
to lie isolated in a bounding box at a smaller tree depth than other points.]

Bounding boxes are then computed for the left and right halves of the data and the process is repeated
until every leaf of the tree represents a single data point from the sample. Note that if the lone point
is sufficiently far away then it is more likely that a random cut would result in point isolation. This
observation provides the intuition that tree depth is, loosely speaking, inversely proportional to the
anomaly score.

When performing inference using a trained RCF model the final anomaly score is reported as the average
across scores reported by each tree. Note that it is often the case that the new data point does not
already reside in the tree. To determine the score associated with the new point the data point is inserted
into the given tree and the tree is efficiently (and temporarily) reassembled in a manner equivalent
to the training process described above. That is, the resulting tree is as if the input data point were
a member of the sample used to construct the tree in the first place. The reported score is inversely
proportional to the depth of the input point within the tree.

Choose Hyperparameters

The primary hyperparameters used to tune the RCF model are num_trees and
num_samples_per_tree. Increasing num_trees has the effect of reducing the noise observed in
anomaly scores, since the final score is the average of the scores reported by each tree. While the optimal
value is application-dependent, we recommend using 100 trees to begin with as a balance between score
noise and model complexity. Note that inference time is proportional to the number of trees. Although
training time is also affected, it is dominated by the reservoir sampling algorithm described above.

The parameter num_samples_per_tree is related to the expected density of anomalies in the dataset.
In particular, num_samples_per_tree should be chosen such that 1/num_samples_per_tree
approximates the ratio of anomalous data to normal data. For example, if 256 samples are used in each
tree then we expect our data to contain anomalies 1/256 or approximately 0.4% of the time. Again, an
optimal value for this hyperparameter is dependent on the application.

References

1. Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. "Robust random cut forest based
anomaly detection on streams." In International Conference on Machine Learning, pp. 2712-2721.
2016.
2. Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, and Al Geist. "Reservoir-based random
sampling with replacement from data stream." In Proceedings of the 2004 SIAM International
Conference on Data Mining, pp. 492-496. Society for Industrial and Applied Mathematics, 2004.

RCF Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm. You can also specify algorithm-
specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the
Amazon SageMaker RCF algorithm. For more information, including recommendations on how to choose
hyperparameters, see How RCF Works (p. 1499).

Parameter Name Description

feature_dim The number of features in the data set. (If you use the Random Cut Forest
estimator, this value is calculated for you and need not be specified.)

Required

Valid values: Positive integer (min: 1, max: 10000)


eval_metrics A list of metrics used to score a labeled test data set. The following
metrics can be selected for output:

• accuracy - returns fraction of correct predictions.
• precision_recall_fscore - returns the positive and negative
precision, recall, and F1-scores.

Optional

Valid values: a list with possible values taken from accuracy or
precision_recall_fscore.

Default value: Both accuracy and precision_recall_fscore are
calculated.

num_samples_per_tree Number of random samples given to each tree from the training data set.

Optional

Valid values: Positive integer (min: 1, max: 2048)

Default value: 256

num_trees Number of trees in the forest.

Optional

Valid values: Positive integer (min: 50, max: 1000)

Default value: 100
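As a sketch of how these string-to-string maps look in practice, the following uses the SageMaker Python SDK generic Estimator, which serializes the hyperparameters into the CreateTrainingJob request; the output bucket is hypothetical, and the role is assumed to come from the SageMaker environment.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Built-in RCF container image for the current Region
container = image_uris.retrieve("randomcutforest", region)

estimator = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),      # assumes a SageMaker environment
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/rcf-output",  # hypothetical bucket
    sagemaker_session=session,
)

# These values are sent in the CreateTrainingJob request as strings.
estimator.set_hyperparameters(
    feature_dim=10, num_trees=100, num_samples_per_tree=256
)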

Tune an RCF Model

Automatic model tuning, also known as hyperparameter tuning or hyperparameter optimization, finds
the best version of a model by running many jobs that test a range of hyperparameters on your dataset.
You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose
the objective metric from the metrics that the algorithm computes. Automatic model tuning searches
the hyperparameters chosen to find the combination of values that result in the model that optimizes
the objective metric.

The Amazon SageMaker RCF algorithm is an unsupervised anomaly-detection algorithm that requires
a labeled test dataset for hyperparameter optimization. RCF calculates anomaly scores for test data
points and then labels the data points as anomalous if their scores are beyond three standard deviations
from the mean score. This is known as the three-sigma limit heuristic. The F1-score is based on the
difference between calculated labels and actual labels. The hyperparameter tuning job finds the model
that maximizes that score. The success of hyperparameter optimization depends on the applicability of
the three-sigma limit heuristic to the test dataset.
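As an illustration of the heuristic, the following sketch labels points whose scores exceed three standard deviations above the mean score. The one-sided (high-score) cutoff is an assumption consistent with higher scores indicating anomalies.

import numpy as np

def label_anomalies(scores):
    # Label a point anomalous if its score is more than three standard
    # deviations above the mean score (the three-sigma limit heuristic).
    scores = np.asarray(scores, dtype=float)
    cutoff = scores.mean() + 3.0 * scores.std()
    return scores > cutoff

# Example: the final point stands out and is labeled anomalous.
print(label_anomalies(np.r_[np.ones(20), 10.0]))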

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the RCF Algorithm

The RCF algorithm computes the following metric during training. When tuning the model, choose this
metric as the objective metric.


Metric Name: test:f1
Description: F1-score on the test dataset, based on the difference between calculated labels and actual labels
Optimization Direction: Maximize

Tunable RCF Hyperparameters

You can tune a RCF model with the following hyperparameters.

Parameter Name          Parameter Type            Recommended Ranges

num_samples_per_tree    IntegerParameterRanges    MinValue: 1, MaxValue: 2048

num_trees               IntegerParameterRanges    MinValue: 50, MaxValue: 1000
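The following sketch launches such a tuning job with the SageMaker Python SDK. The rcf estimator is assumed to be configured as in the earlier examples, the S3 paths are hypothetical, and the test channel must contain labeled data.

from sagemaker.tuner import HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=rcf,                       # assumed: a configured RCF estimator
    objective_metric_name="test:f1",
    objective_type="Maximize",
    hyperparameter_ranges={
        "num_samples_per_tree": IntegerParameter(1, 2048),
        "num_trees": IntegerParameter(50, 1000),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/rcf/train", "test": "s3://my-bucket/rcf/test"})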

RCF Response Formats

All Amazon SageMaker built-in algorithms adhere to the common input inference format described in
Common Data Formats - Inference. Note that SageMaker Random Cut Forest supports both dense and
sparse JSON and RecordIO formats. This topic contains a list of the available output formats for the
SageMaker RCF algorithm.

JSON Response Format

ACCEPT: application/json.

"scores": [

{"score": 0.02},

{"score": 0.25}

JSONLINES Response Format

ACCEPT: application/jsonlines.

{"score": 0.02},

1503
Amazon SageMaker Developer Guide
Use Built-in Algorithms

{"score": 0.25}

RECORDIO Response Format

ACCEPT: application/x-recordio-protobuf.

[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.25]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.23]  # float32
            }
        }
    }
]

Built-in SageMaker Algorithms for Computer Vision


SageMaker provides image processing algorithms that are used for image classification, object detection,
and semantic segmentation.

• Image Classification - MXNet (p. 1506)—uses example data with answers (referred to as a supervised
algorithm). Use this algorithm to classify images.
• Image Classification - TensorFlow (p. 1517)—uses pretrained TensorFlow Hub models to fine-tune for
specific tasks (referred to as a supervised algorithm). Use this algorithm to classify images.
• Object Detection - MXNet (p. 1530)—detects and classifies objects in images using a single deep
neural network. It is a supervised learning algorithm that takes images as input and identifies all
instances of objects within the image scene.
• Object Detection - TensorFlow (p. 1541)—detects bounding boxes and object labels in an image. It is
a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow
models.
• Semantic Segmentation Algorithm (p. 1549)—provides a fine-grained, pixel-level approach to
developing computer vision applications.

Algorithm: Image Classification - MXNet
    Channel name: train and validation, (optionally) train_lst, validation_lst, and model
    Training input mode: File or Pipe
    File type: recordIO or image files (.jpg or .png)
    Instance class: GPU
    Parallelizable: Yes

Algorithm: Image Classification - TensorFlow
    Channel name: training and validation
    Training input mode: File
    File type: image files (.jpg, .jpeg, or .png)
    Instance class: CPU or GPU
    Parallelizable: Yes (only across multiple GPUs on a single instance)

Algorithm: Object Detection - MXNet
    Channel name: train and validation, (optionally) train_annotation, validation_annotation, and model
    Training input mode: File or Pipe
    File type: recordIO or image files (.jpg or .png)
    Instance class: GPU
    Parallelizable: Yes

Algorithm: Object Detection - TensorFlow
    Channel name: training and validation
    Training input mode: File
    File type: image files (.jpg, .jpeg, or .png)
    Instance class: GPU
    Parallelizable: Yes (only across multiple GPUs on a single instance)

Algorithm: Semantic Segmentation
    Channel name: train and validation, train_annotation, validation_annotation, and (optionally) label_map and model
    Training input mode: File or Pipe
    File type: Image files
    Instance class: GPU (single instance only)
    Parallelizable: No
Image Classification - MXNet


The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports
multi-label classification. It takes an image as input and outputs one or more labels assigned to that
image. It uses a convolutional neural network that can be trained from scratch or trained using transfer
learning when a large number of training images are not available.

The recommended input format for the Amazon SageMaker image classification algorithms is Apache
MXNet RecordIO. However, you can also use raw images in .jpg or .png format. Refer to this discussion
for a broad overview of efficient data preparation and loading for machine learning systems.
Note
To maintain better interoperability with existing deep learning frameworks, this differs from the
protobuf data formats commonly used by other Amazon SageMaker algorithms.

For more information on convolutional networks, see:

• Deep residual learning for image recognition Kaiming He, et al., 2016 IEEE Conference on Computer
Vision and Pattern Recognition
• ImageNet image database
• Image classification with Gluon-CV and MXNet

Topics
• Input/Output Interface for the Image Classification Algorithm (p. 1507)
• EC2 Instance Recommendation for the Image Classification Algorithm (p. 1509)
• Image Classification Sample Notebooks (p. 1509)
• How Image Classification Works (p. 1509)
• Image Classification Hyperparameters (p. 1510)
• Tune an Image Classification Model (p. 1516)


Input/Output Interface for the Image Classification Algorithm

The SageMaker Image Classification algorithm supports both RecordIO (application/x-recordio)
and image (image/png, image/jpeg, and application/x-image) content types for training in
file mode, and supports the RecordIO (application/x-recordio) content type for training in pipe
mode. However, you can also train in pipe mode using the image files (image/png, image/jpeg, and
application/x-image), without creating RecordIO files, by using the augmented manifest format.

Distributed training is supported for file mode and pipe mode. When using the RecordIO content type in
pipe mode, you must set the S3DataDistributionType of the S3DataSource to FullyReplicated.
The algorithm supports a fully replicated model where your data is copied onto each machine.

The algorithm supports image/png, image/jpeg, and application/x-image for inference.

Train with RecordIO Format

If you use the RecordIO format for training, specify both train and validation channels as values for
the InputDataConfig parameter of the CreateTrainingJob request. Specify one RecordIO (.rec)
file in the train channel and one RecordIO file in the validation channel. Set the content type for
both channels to application/x-recordio.

Train with Image Format

If you use the Image format for training, specify train, validation, train_lst,
and validation_lst channels as values for the InputDataConfig parameter of the
CreateTrainingJob request. Specify the individual image data (.jpg or .png files) for the train
and validation channels. Specify one .lst file in each of the train_lst and validation_lst
channels. Set the content type for all four channels to application/x-image.
Note
SageMaker reads the training and validation data separately from different channels, so you
must store the training and validation data in different folders.

A .lst file is a tab-separated file with three columns that contains a list of image files. The first column
specifies the image index, the second column specifies the class label index for the image, and the third
column specifies the relative path of the image file. The image index in the first column must be unique
across all of the images. The set of class label indices are numbered successively and the numbering
should start with 0. For example, 0 for the cat class, 1 for the dog class, and so on for additional classes.

The following is an example of a .lst file:

5 1 your_image_directory/train_img_dog1.jpg
1000 0 your_image_directory/train_img_cat1.jpg
22 1 your_image_directory/train_img_dog2.jpg

For example, if your training images are stored in s3://<your_bucket>/train/class_dog,
s3://<your_bucket>/train/class_cat, and so on, specify the path for your train channel as
s3://<your_bucket>/train, which is the top-level directory for your data. In the .lst file, specify the
relative path for an individual file named train_image_dog1.jpg in the class_dog class directory as
class_dog/train_image_dog1.jpg. You can also store all your image files under one subdirectory
inside the train directory. In that case, use that subdirectory for the relative path. For example,
s3://<your_bucket>/train/your_image_directory.
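The following sketch generates such a .lst file for a class-per-folder layout. The local directory names and file extensions are assumptions matching the examples above; label indices are assigned alphabetically, starting from 0.

import os

def write_lst(root_dir, lst_path):
    # Class subdirectories sorted alphabetically define label indices 0, 1, ...
    classes = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    index = 0
    with open(lst_path, "w") as f:
        for label, cls in enumerate(classes):
            for name in sorted(os.listdir(os.path.join(root_dir, cls))):
                if name.lower().endswith((".jpg", ".png")):
                    # Tab-separated columns: unique image index, class label
                    # index, relative path of the image file.
                    f.write(f"{index}\t{label}\t{cls}/{name}\n")
                    index += 1

write_lst("train", "train.lst")  # hypothetical local directory layout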

Train with Augmented Manifest Image Format

The augmented manifest format enables you to do training in Pipe mode using image files without
needing to create RecordIO files. You need to specify both train and validation channels as values for
the InputDataConfig parameter of the CreateTrainingJob request. While using the format,
an S3 manifest file needs to be generated that contains the list of images and their corresponding


annotations. The manifest file format should be in JSON Lines format in which each line represents one
sample. The images are specified using the 'source-ref' tag that points to the S3 location of the
image. The annotations are provided under the "AttributeNames" parameter value as specified in the
CreateTrainingJob request. It can also contain additional metadata under the metadata tag, but
these are ignored by the algorithm. In the following example, the "AttributeNames" are contained
in the list of image and annotation references ["source-ref", "class"]. The corresponding label
value is "0" for the first image and “1” for the second image:

{"source-ref":"s3://image/filename1.jpg", "class":"0"}
{"source-ref":"s3://image/filename2.jpg", "class":"1", "class-metadata": {"class-name":
"cat", "type" : "groundtruth/image-classification"}}

The order of "AttributeNames" in the input files matters when training the ImageClassification
algorithm. It accepts piped data in a specific order, with image first, followed by label. So the
"AttributeNames" in this example are provided with "source-ref" first, followed by "class".
When using the ImageClassification algorithm with Augmented Manifest, the value of the
RecordWrapperType parameter must be "RecordIO".

Multi-label training is also supported by specifying a JSON array of values. The num_classes
hyperparameter must be set to match the total number of classes. There are two valid label formats:
multi-hot and class-id.

In the multi-hot format, each label is a multi-hot encoded vector of all classes, where each class takes
the value of 0 or 1. In the following example, there are three classes. The first image is labeled with
classes 0 and 2, while the second image is labeled with class 2 only:

{"image-ref": "s3://mybucket/sample01/image1.jpg", "class": "[1, 0, 1]"}


{"image-ref": "s3://mybucket/sample02/image2.jpg", "class": "[0, 0, 1]"}

In the class-id format, each label is a list of the class ids, from [0, num_classes), which apply to the data
point. The previous example would instead look like this:

{"image-ref": "s3://mybucket/sample01/image1.jpg", "class": "[0, 2]"}


{"image-ref": "s3://mybucket/sample02/image2.jpg", "class": "[2]"}

The multi-hot format is the default, but can be explicitly set in the content type with the label-format
parameter: "application/x-recordio; label-format=multi-hot". The class-id format, which
is the format output by Ground Truth, must be set explicitly: "application/x-recordio; label-
format=class-id".

For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).
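A Pipe-mode augmented manifest channel might be configured as in the following sketch with the SageMaker Python SDK. The manifest location is hypothetical, and the attribute names match the example records above (image reference first, then label).

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train.manifest",   # hypothetical manifest location
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "class"],   # image reference first, then label
    content_type="application/x-recordio",
    record_wrapping="RecordIO",
    input_mode="Pipe",
)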

Incremental Training

You can also seed the training of a new model with the artifacts from a model that you trained
previously with SageMaker. Incremental training saves training time when you want to train a new model
with the same or similar data. SageMaker image classification models can be seeded only with another
built-in image classification model trained in SageMaker.

To use a pretrained model, in the CreateTrainingJob request, specify the ChannelName as "model" in
the InputDataConfig parameter. Set the ContentType for the model channel to application/x-
sagemaker-model. The input hyperparameters of both the new model and the pretrained model that
you upload to the model channel must have the same settings for the num_layers, image_shape and
num_classes input parameters. These parameters define the network architecture. For the pretrained
model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker. You can use
either RecordIO or image formats for input data.
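The following sketch shows how the model channel might be configured. The S3 paths are hypothetical, and the estimator is assumed to be an image classification Estimator whose num_layers, image_shape, and num_classes match the seed model.

from sagemaker.inputs import TrainingInput

channels = {
    "train": TrainingInput("s3://my-bucket/train", content_type="application/x-recordio"),
    "validation": TrainingInput("s3://my-bucket/validation", content_type="application/x-recordio"),
    "model": TrainingInput(
        "s3://my-bucket/previous-job/output/model.tar.gz",
        content_type="application/x-sagemaker-model",
    ),
}
estimator.fit(channels)  # `estimator` is an assumed image classification Estimator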


For a sample notebook that shows how to use incremental training with the SageMaker image
classification algorithm, see the End-to-End Incremental Training Image Classification Example. For more
information on incremental training and for instructions on how to use it, see Incremental Training in
Amazon SageMaker (p. 2113).

Inference with the Image Classification Algorithm

The generated models can be hosted for inference and support encoded .jpg and .png image formats
as image/png, image/jpeg, and application/x-image content-type. The input image is resized
automatically. The output is the probability values for all classes encoded in JSON format, or in JSON
Lines text format for batch transform. The image classification model processes a single image per
request and so outputs only one line in the JSON or JSON Lines format. The following is an example of a
response in JSON Lines format:

accept: application/jsonlines

{"prediction": [prob_0, prob_1, prob_2, prob_3, ...]}

For more details on training and inference, see the image classification sample notebook instances
referenced in the introduction.
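As an illustration, the following sketch sends a .jpg image to a hosted endpoint; the endpoint name and image file are hypothetical.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("test.jpg", "rb") as f:    # hypothetical test image
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="image-classification-endpoint",   # hypothetical endpoint name
    ContentType="application/x-image",
    Body=payload,
)
# The response body holds per-class probabilities in JSON format.
print(json.loads(response["Body"].read()))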

EC2 Instance Recommendation for the Image Classification Algorithm

For image classification, we support P2, P3, G4dn, and G5 instances. We recommend using GPU instances
with more memory for training with large batch sizes. You can also run the algorithm on multi-GPU and
multi-machine settings for distributed training. Both CPU (such as C4) and GPU (P2, P3, G4dn, or G5)
instances can be used for inference.

Image Classification Sample Notebooks

For a sample notebook that uses the SageMaker image classification algorithm to train a model on the
caltech-256 dataset and then to deploy it to perform inferences, see the End-to-End Multiclass Image
Classification Example. For instructions on how to create and access Jupyter notebook instances that you
can use to run the example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). Once you
have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all
the SageMaker samples. The example image classification notebooks are located in the Introduction to
Amazon algorithms section. To open a notebook, click on its Use tab and select Create copy.

How Image Classification Works

The image classification algorithm takes an image as input and classifies it into one of the output
categories. Deep learning has revolutionized the image classification domain and has achieved great
performance. Various deep learning networks such as ResNet, DenseNet, Inception, and so on, have
been developed to be highly accurate for image classification. At the same time, there have been efforts
to collect labeled image data that are essential for training these networks. ImageNet is one such
large dataset that has more than 11 million images with about 11,000 categories. Once a network is
trained with ImageNet data, it can then be used to generalize with other datasets as well, by simple re-
adjustment or fine-tuning. In this transfer learning approach, a network is initialized with weights (in
this example, trained on ImageNet), which can be later fine-tuned for an image classification task in a
different dataset.

Image classification in Amazon SageMaker can be run in two modes: full training and transfer learning.
In full training mode, the network is initialized with random weights and trained on user data from
scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top
fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new
data. In this mode, training can be achieved even with a smaller dataset. This is because the network is
already trained and therefore can be used in cases without sufficient training data.


Image Classification Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Image Classification
algorithm. See Tune an Image Classification Model (p. 1516) for information on image classification
hyperparameter tuning.

Parameter Name Description

num_classes Number of output classes. This parameter defines the dimensions
of the network output and is typically set to the number of classes
in the dataset.

Besides multi-class classification, multi-label classification is
supported too. Please refer to Input/Output Interface for the Image
Classification Algorithm (p. 1507) for details on how to work with
multi-label classification with augmented manifest files.

Required

Valid values: positive integer

num_training_samples Number of training examples in the input dataset.

If there is a mismatch between this value and the number
of samples in the training set, then the behavior of the
lr_scheduler_step parameter is undefined and distributed
training accuracy might be affected.

Required

Valid values: positive integer

augmentation_type Data augmentation type. The input images can be augmented in
multiple ways as specified below.

• crop: Randomly crop the image and flip the image horizontally
• crop_color: In addition to ‘crop’, three random values in
the range [-36, 36], [-50, 50], and [-50, 50] are added to the
corresponding Hue-Saturation-Lightness channels respectively
• crop_color_transform: In addition to crop_color, random
transformations, including rotation, shear, and aspect ratio
variations are applied to the image. The maximum angle of
rotation is 10 degrees, the maximum shear ratio is 0.1, and the
maximum aspect changing ratio is 0.25.

Optional

Valid values: crop, crop_color, or crop_color_transform.

Default value: no default value

beta_1 The beta1 for adam, that is the exponential decay rate for the first
moment estimates.

Optional

Valid values: float. Range in [0, 1].

Default value: 0.9

beta_2 The beta2 for adam, that is the exponential decay rate for the
second moment estimates.

Optional

Valid values: float. Range in [0, 1].

Default value: 0.999

checkpoint_frequency Period to store model parameters (in number of epochs).

Note that all checkpoint files are saved as part of the final model
file "model.tar.gz" and uploaded to S3 to the specified model
location. This increases the size of the model file proportionally to
the number of checkpoints saved during training.

Optional

Valid values: positive integer no greater than epochs.

Default value: no default value (Save checkpoint at the epoch that
has the best validation accuracy)

early_stopping True to use early stopping logic during training. False not to use
it.

Optional

Valid values: True or False

Default value: False

early_stopping_min_epochs The minimum number of epochs that must be run before
the early stopping logic can be invoked. It is used only when
early_stopping = True.

Optional

Valid values: positive integer

Default value: 10

early_stopping_patience The number of epochs to wait before ending training if no
improvement is made in the relevant metric. It is used only when
early_stopping = True.

Optional

Valid values: positive integer

Default value: 5

early_stopping_tolerance Relative tolerance to measure an improvement in the accuracy
validation metric. If the ratio of the improvement in accuracy
divided by the previous best accuracy is smaller than the
early_stopping_tolerance value set, early stopping considers
there is no improvement. It is used only when early_stopping =
True.

Optional

Valid values: 0 ≤ float ≤ 1

Default value: 0.0

epochs Number of training epochs.

Optional

Valid values: positive integer

Default value: 30

eps The epsilon for adam and rmsprop. It is usually set to a small value
to avoid division by 0.

Optional

Valid values: float. Range in [0, 1].

Default value: 1e-8

gamma The gamma for rmsprop, the decay factor for the moving average
of the squared gradient.

Optional

Valid values: float. Range in [0, 1].

Default value: 0.9

image_shape The input image dimensions, which is the same size as the input
layer of the network. The format is defined as 'num_channels,
height, width'. The image dimension can take on any value as the
network can handle varied dimensions of the input. However,
there may be memory constraints if a larger image dimension is
used. Pretrained models can use only a fixed 224 x 224 image size.
Typical image dimensions for image classification are '3,224,224'.
This is similar to the ImageNet dataset.

For training, if any input image is smaller than this parameter
in any dimension, training fails. If an image is larger, a portion
of the image is cropped, with the cropped area specified by this
parameter. If hyperparameter augmentation_type is set, random
crop is taken; otherwise, central crop is taken.

At inference, input images are resized to the image_shape that
was used during training. Aspect ratio is not preserved, and images
are not cropped.

Optional

Valid values: string

Default value: ‘3,224,224’

kv_store Weight update synchronization mode during distributed training.
The weight updates can be updated either synchronously or
asynchronously across machines. Synchronous updates typically
provide better accuracy than asynchronous updates but can be
slower. See distributed training in MXNet for more details.

This parameter is not applicable to single machine training.

• dist_sync: The gradients are synchronized after every batch
with all the workers. With dist_sync, batch-size now means
the batch size used on each machine. So if there are n machines
and we use batch size b, then dist_sync behaves like local with
batch size n*b.
• dist_async: Performs asynchronous updates. The weights are
updated whenever gradients are received from any machine
and the weight updates are atomic. However, the order is not
guaranteed.

Optional

Valid values: dist_sync or dist_async

Default value: no default value

learning_rate Initial learning rate.

Optional

Valid values: float. Range in [0, 1].

Default value: 0.1

lr_scheduler_factor The ratio to reduce learning rate used in conjunction with the
lr_scheduler_step parameter, defined as lr_new = lr_old *
lr_scheduler_factor.

Optional

Valid values: float. Range in [0, 1].

Default value: 0.1

lr_scheduler_step The epochs at which to reduce the learning rate. As explained
in the lr_scheduler_factor parameter, the learning rate
is reduced by lr_scheduler_factor at these epochs. For
example, if the value is set to "10, 20", then the learning rate is
reduced by lr_scheduler_factor after the 10th epoch and again
by lr_scheduler_factor after the 20th epoch. The epochs are
delimited by ",".

Optional

Valid values: string

Default value: no default value

mini_batch_size The batch size for training. In a single-machine multi-GPU setting,
each GPU handles mini_batch_size/num_gpu training samples.
For the multi-machine training in dist_sync mode, the actual batch
size is mini_batch_size*number of machines. See MXNet docs
for more details.

Optional

Valid values: positive integer

Default value: 32

momentum The momentum for sgd and nag, ignored for other optimizers.

Optional

Valid values: float. Range in [0, 1].

Default value: 0.9

multi_label Flag to use for multi-label classification where each sample can
be assigned multiple labels. Average accuracy across all classes is
logged.

Optional

Valid values: 0 or 1

Default value: 0

num_layers Number of layers for the network. For data with large image size
(for example, 224x224 - like ImageNet), we suggest selecting the
number of layers from the set [18, 34, 50, 101, 152, 200]. For data
with small image size (for example, 28x28 - like CIFAR), we suggest
selecting the number of layers from the set [20, 32, 44, 56, 110].
The number of layers in each set is based on the ResNet paper. For
transfer learning, the number of layers defines the architecture of
base network and hence can only be selected from the set [18, 34,
50, 101, 152, 200].

Optional

Valid values: positive integer in [18, 34, 50, 101, 152, 200] or [20,
32, 44, 56, 110]

Default value: 152

optimizer The optimizer type. For more details of the parameters for the
optimizers, please refer to MXNet's API.

Optional

Valid values: One of sgd, adam, rmsprop, or nag.

• sgd: Stochastic gradient descent
• adam: Adaptive momentum estimation
• rmsprop: Root mean square propagation
• nag: Nesterov accelerated gradient

Default value: sgd

precision_dtype The precision of the weights used for training. The algorithm can
use either single precision (float32) or half precision (float16)
for the weights. Using half-precision for weights results in reduced
memory consumption.

Optional

Valid values: float32 or float16

Default value: float32

resize The number of pixels in the shortest side of an image after resizing
it for training. If the parameter is not set, then the training data is
used without resizing. The parameter should be larger than both
the width and height components of image_shape to prevent
training failure.

Required when using image content types

Optional when using the RecordIO content type

Valid values: positive integer

Default value: no default value

top_k Reports the top-k accuracy during training. This parameter has to
be greater than 1, since the top-1 training accuracy is the same as
the regular training accuracy that has already been reported.

Optional

Valid values: positive integer larger than 1.

Default value: no default value

use_pretrained_model Flag to use pre-trained model for training. If set to 1, then the
pretrained model with the corresponding number of layers is
loaded and used for training. Only the top FC layer are reinitialized
with random weights. Otherwise, the network is trained from
scratch.

Optional

Valid values: 0 or 1

Default value: 0

use_weighted_loss Flag to use weighted cross-entropy loss for multi-label classification
(used only when multi_label = 1), where the weights are
calculated based on the distribution of classes.

Optional

Valid values: 0 or 1

Default value: 0

weight_decay The coefficient weight decay for sgd and nag, ignored for other
optimizers.

Optional

Valid values: float. Range in [0, 1].

Default value: 0.0001

Tune an Image Classification Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the Image Classification Algorithm

The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is
computed during training. When tuning the model, choose this metric as the objective metric.


Metric Name: validation:accuracy
Description: The ratio of the number of correct predictions to the total number of predictions made
Optimization Direction: Maximize

Tunable Image Classification Hyperparameters

Tune an image classification model with the following hyperparameters. The hyperparameters that have
the greatest impact on image classification objective metrics are: mini_batch_size, learning_rate,
and optimizer. Tune the optimizer-related hyperparameters, such as momentum, weight_decay,
beta_1, beta_2, eps, and gamma, based on the selected optimizer. For example, use beta_1 and
beta_2 only when adam is the optimizer.

For more information about which hyperparameters are used in each optimizer, see Image Classification
Hyperparameters (p. 1510).

Parameter Name     Parameter Type                Recommended Ranges

beta_1             ContinuousParameterRanges     MinValue: 1e-6, MaxValue: 0.999

beta_2             ContinuousParameterRanges     MinValue: 1e-6, MaxValue: 0.999

eps                ContinuousParameterRanges     MinValue: 1e-8, MaxValue: 1.0

gamma              ContinuousParameterRanges     MinValue: 1e-8, MaxValue: 0.999

learning_rate      ContinuousParameterRanges     MinValue: 1e-6, MaxValue: 0.5

mini_batch_size    IntegerParameterRanges        MinValue: 8, MaxValue: 512

momentum           ContinuousParameterRanges     MinValue: 0.0, MaxValue: 0.999

optimizer          CategoricalParameterRanges    ['sgd', 'adam', 'rmsprop', 'nag']

weight_decay       ContinuousParameterRanges     MinValue: 0.0, MaxValue: 0.999

Image Classification - TensorFlow


The Amazon SageMaker Image Classification - TensorFlow algorithm is a supervised learning algorithm
that supports transfer learning with many pretrained models from the TensorFlow Hub. Use transfer
learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount
of image data is not available. The image classification algorithm takes an image as input and outputs a
probability for each provided class label. Training datasets must consist of images in .jpg, .jpeg, or .png
format.

Topics
• How to use the SageMaker Image Classification - TensorFlow algorithm (p. 1518)


• Input and output interface for the Image Classification - TensorFlow algorithm (p. 1519)
• Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm (p. 1520)
• Image Classification - TensorFlow sample notebooks (p. 1521)
• How Image Classification - TensorFlow Works (p. 1521)
• TensorFlow Hub Models (p. 1521)
• Image Classification - TensorFlow Hyperparameters (p. 1526)
• Tune an Image Classification - TensorFlow model (p. 1529)

How to use the SageMaker Image Classification - TensorFlow algorithm

You can use Image Classification - TensorFlow as an Amazon SageMaker built-in algorithm. The following
section describes how to use Image Classification - TensorFlow with the SageMaker Python SDK. For
information on how to use Image Classification - TensorFlow from the Amazon SageMaker Studio UI, see
SageMaker JumpStart (p. 47).

The Image Classification - TensorFlow algorithm supports transfer learning using any of the compatible
pretrained TensorFlow Hub models. For a list of all available pretrained models, see TensorFlow
Hub Models (p. 1521). Every pretrained model has a unique model_id. The following example
uses MobileNet V2 1.00 224 (model_id: tensorflow-ic-imagenet-mobilenet-v2-100-224-
classification-4) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded
from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network
isolation. Use these pre-generated model training artifacts to construct a SageMaker Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the
hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and
their default values with hyperparameters.retrieve_default. For more information, see Image
Classification - TensorFlow Hyperparameters (p. 1526). Use these values to construct a SageMaker
Estimator.
Note
Default hyperparameter values are different for different models. For larger models, the default
batch size is smaller and the train_only_top_layer hyperparameter is set to "True".

This example uses the tf_flowers dataset, which contains five classes of flower images. We pre-
downloaded the dataset from TensorFlow under the Apache 2.0 license and made it available with
Amazon S3. To fine-tune your model, call .fit using the Amazon S3 location of your training dataset.

import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# The session, Region, and execution role are assumed to come from the
# SageMaker environment in which this example runs.
sess = sagemaker.Session()
aws_region = sess.boto_region_name
aws_role = sagemaker.get_execution_role()

model_id, model_version = "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# The sample training data is available in the following S3 bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/tf_flowers/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ic-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create SageMaker Estimator instance
tf_ic_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Use the S3 path of the training data to launch a SageMaker training job
tf_ic_estimator.fit({"training": training_dataset_s3_path}, logs=True)
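After fine-tuning completes, the estimator can be deployed for inference. The following sketch mirrors the pattern used in the JumpStart example notebooks; the inference instance type is an assumption.

inference_instance_type = "ml.m5.xlarge"   # assumed inference instance type

# Retrieve the inference image and script for the same model
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="inference",
    instance_type=inference_instance_type,
)
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

predictor = tf_ic_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
)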

Input and output interface for the Image Classification - TensorFlow algorithm

Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset with
any number of image classes. Be mindful of how to format your training data for input to the Image
Classification - TensorFlow model.

• Training data input format: Your training data should be a directory with as many subdirectories as
the number of classes. Each subdirectory should contain images belonging to that class in .jpg, .jpeg,
or .png format.

The following is an example of an input directory structure. This example dataset has two
classes: roses and dandelion. The image files in each class folder can have any name. The
input directory should be hosted in an Amazon S3 bucket with a path similar to the following:
s3://bucket_name/input_directory/. Note that the trailing / is required.

input_directory
|--roses
|--abc.jpg
|--def.jpg
|--dandelion
|--ghi.jpg
|--jkl.jpg

Trained models output label mapping files that map class folder names to the indices in the list of
output class probabilities. This mapping is in alphabetical order. For example, in the preceding example,
the dandelion class is index 0 and the roses class is index 1.

After training, you have a fine-tuned model that you can further train using incremental training
or deploy for inference. The Image Classification - TensorFlow algorithm automatically adds a pre-
processing and post-processing signature to the fine-tuned model so that it can take in images as input


and return class probabilities. The file mapping class indices to class labels is saved along with the
models.

Incremental training
You can seed the training of a new model with artifacts from a model that you trained previously with
SageMaker. Incremental training saves training time when you want to train a new model with the same
or similar data.
Note
You can only seed a SageMaker Image Classification - TensorFlow model with another Image
Classification - TensorFlow model trained in SageMaker.

You can use any dataset for incremental training, as long as the set of classes remains the same. The
incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained
model, you start with an existing fine-tuned model. For an example of incremental training with the
SageMaker Image Classification - TensorFlow algorithm, see the Introduction to SageMaker TensorFlow -
Image Classification sample notebook.

Inference with the Image Classification - TensorFlow algorithm


You can host the fine-tuned model that results from your TensorFlow Image Classification training
for inference. Any input image for inference must be in .jpg, .jpeg, or .png format and be content
type application/x-image. The Image Classification - TensorFlow algorithm resizes input images
automatically.

Running inference results in probability values, class labels for all classes, and the predicted label
corresponding to the class index with the highest probability encoded in JSON format. The Image
Classification - TensorFlow model processes a single image per request and outputs only one line. The
following is an example of a JSON format response:

accept: application/json;verbose

{"probabilities": [prob_0, prob_1, prob_2, ...],


"labels": [label_0, label_1, label_2, ...],
"predicted_label": predicted_label}

If accept is set to application/json, then the model only outputs probabilities. For more
information on training and inference with the Image Classification - TensorFlow algorithm, see the
Introduction to SageMaker TensorFlow - Image Classification sample notebook.
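As an illustration, the following sketch requests the verbose response from a hosted endpoint; the endpoint name and image file are hypothetical.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("flower.jpg", "rb") as f:   # hypothetical test image
    payload = f.read()

# Accept "application/json;verbose" to get probabilities, labels, and the
# predicted label; plain "application/json" returns probabilities only.
response = runtime.invoke_endpoint(
    EndpointName="tf-ic-endpoint",    # hypothetical endpoint name
    ContentType="application/x-image",
    Accept="application/json;verbose",
    Body=payload,
)
print(json.loads(response["Body"].read())["predicted_label"])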

Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm
The Image Classification - TensorFlow algorithm supports all CPU and GPU instances for training,
including:

• ml.p2.xlarge
• ml.p2.16xlarge
• ml.p3.2xlarge
• ml.p3.16xlarge
• ml.g4dn.xlarge
• ml.g4dn.16xlarge
• ml.g5.xlarge
• ml.g5.48xlarge

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as
M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference.


Image Classification - TensorFlow sample notebooks

For more information about how to use the SageMaker Image Classification - TensorFlow algorithm
for transfer learning on a custom dataset, see the Introduction to SageMaker TensorFlow - Image
Classification notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.

How Image Classification - TensorFlow Works

The Image Classification - TensorFlow algorithm takes an image as input and classifies it into one of
the output class labels. Various deep learning networks such as MobileNet, ResNet, Inception, and
EfficientNet are highly accurate for image classification. There are also deep learning networks that
are trained on large image datasets, such as ImageNet, which has over 11 million images and almost
11,000 classes. After a network is trained with ImageNet data, you can then fine-tune the network on
a dataset with a particular focus to perform more specific classification tasks. The Amazon SageMaker
Image Classification - TensorFlow algorithm supports transfer learning on many pretrained models that
are available in the TensorFlow Hub.

According to the number of class labels in your training data, a classification layer is attached to the
pretrained TensorFlow Hub model of your choice. The classification layer consists of a dropout layer, a
dense layer, and a fully-connected layer with 2-norm regularizer that is initialized with random weights.
The model has hyperparameters for the dropout rate of the dropout layer and the L2 regularization
factor for the dense layer. You can then fine-tune either the entire network (including the pretrained
model) or only the top classification layer on new training data. With this method of transfer learning,
training with smaller datasets is possible.

TensorFlow Hub Models

The following pretrained models are available to use for transfer learning with the Image Classification -
TensorFlow algorithm.

The following models vary significantly in size, number of model parameters, training time, and
inference latency for any given dataset. The best model for your use case depends on the complexity
of your fine-tuning dataset and any requirements that you have on training time, inference latency, or
model accuracy.

The following list gives each model name with its model_id. The source for each model is its corresponding page on the TensorFlow Hub.

• MobileNet V2 1.00 224: tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4
• MobileNet V2 0.75 224: tensorflow-ic-imagenet-mobilenet-v2-075-224-classification-4
• MobileNet V2 0.50 224: tensorflow-ic-imagenet-mobilenet-v2-050-224-classification-4
• MobileNet V2 0.35 224: tensorflow-ic-imagenet-mobilenet-v2-035-224-classification-4
• MobileNet V2 1.40 224: tensorflow-ic-imagenet-mobilenet-v2-140-224-classification-4
• MobileNet V2 1.30 224: tensorflow-ic-imagenet-mobilenet-v2-130-224-classification-4
• MobileNet V2: tensorflow-ic-tf2-preview-mobilenet-v2-classification-4
• Inception V3: tensorflow-ic-imagenet-inception-v3-classification-4
• Inception V2: tensorflow-ic-imagenet-inception-v2-classification-4
• Inception V1: tensorflow-ic-imagenet-inception-v1-classification-4
• Inception V3 Preview: tensorflow-ic-tf2-preview-inception-v3-classification-4
• Inception ResNet V2: tensorflow-ic-imagenet-inception-resnet-v2-classification-4
• ResNet V2 50: tensorflow-ic-imagenet-resnet-v2-50-classification-4
• ResNet V2 101: tensorflow-ic-imagenet-resnet-v2-101-classification-4
• ResNet V2 152: tensorflow-ic-imagenet-resnet-v2-152-classification-4
• ResNet V1 50: tensorflow-ic-imagenet-resnet-v1-50-classification-4
• ResNet V1 101: tensorflow-ic-imagenet-resnet-v1-101-classification-4
• ResNet V1 152: tensorflow-ic-imagenet-resnet-v1-152-classification-4
• ResNet 50: tensorflow-ic-imagenet-resnet-50-classification-4
• EfficientNet B0: tensorflow-ic-efficientnet-b0-classification-1
• EfficientNet B1: tensorflow-ic-efficientnet-b1-classification-1
• EfficientNet B2: tensorflow-ic-efficientnet-b2-classification-1
• EfficientNet B3: tensorflow-ic-efficientnet-b3-classification-1
• EfficientNet B4: tensorflow-ic-efficientnet-b4-classification-1
• EfficientNet B5: tensorflow-ic-efficientnet-b5-classification-1
• EfficientNet B6: tensorflow-ic-efficientnet-b6-classification-1
• EfficientNet B7: tensorflow-ic-efficientnet-b7-classification-1
• EfficientNet B0 Lite: tensorflow-ic-efficientnet-lite0-classification-2
• EfficientNet B1 Lite: tensorflow-ic-efficientnet-lite1-classification-2
• EfficientNet B2 Lite: tensorflow-ic-efficientnet-lite2-classification-2
• EfficientNet B3 Lite: tensorflow-ic-efficientnet-lite3-classification-2
• EfficientNet B4 Lite: tensorflow-ic-efficientnet-lite4-classification-2
• MobileNet V1 1.00 224: tensorflow-ic-imagenet-mobilenet-v1-100-224-classification-4
• MobileNet V1 1.00 192: tensorflow-ic-imagenet-mobilenet-v1-100-192-classification-4
• MobileNet V1 1.00 160: tensorflow-ic-imagenet-mobilenet-v1-100-160-classification-4
• MobileNet V1 1.00 128: tensorflow-ic-imagenet-mobilenet-v1-100-128-classification-4
• MobileNet V1 0.75 224: tensorflow-ic-imagenet-mobilenet-v1-075-224-classification-4
• MobileNet V1 0.75 192: tensorflow-ic-imagenet-mobilenet-v1-075-192-classification-4
• MobileNet V1 0.75 160: tensorflow-ic-imagenet-mobilenet-v1-075-160-classification-4
• MobileNet V1 0.75 128: tensorflow-ic-imagenet-mobilenet-v1-075-128-classification-4
• MobileNet V1 0.50 224: tensorflow-ic-imagenet-mobilenet-v1-050-224-classification-4
• MobileNet V1 0.50 192: tensorflow-ic-imagenet-mobilenet-v1-050-192-classification-4
• MobileNet V1 0.50 160: tensorflow-ic-imagenet-mobilenet-v1-050-160-classification-4
• MobileNet V1 0.50 128: tensorflow-ic-imagenet-mobilenet-v1-050-128-classification-4
• MobileNet V1 0.25 224: tensorflow-ic-imagenet-mobilenet-v1-025-224-classification-4
• MobileNet V1 0.25 192: tensorflow-ic-imagenet-mobilenet-v1-025-192-classification-4
• MobileNet V1 0.25 160: tensorflow-ic-imagenet-mobilenet-v1-025-160-classification-4
• MobileNet V1 0.25 128: tensorflow-ic-imagenet-mobilenet-v1-025-128-classification-4
• BiT-S R50x1: tensorflow-ic-bit-s-r50x1-ilsvrc2012-classification-1
• BiT-S R50x3: tensorflow-ic-bit-s-r50x3-ilsvrc2012-classification-1
• BiT-S R101x1: tensorflow-ic-bit-s-r101x1-ilsvrc2012-classification-1
• BiT-S R101x3: tensorflow-ic-bit-s-r101x3-ilsvrc2012-classification-1
• BiT-M R50x1: tensorflow-ic-bit-m-r50x1-ilsvrc2012-classification-1
• BiT-M R50x3: tensorflow-ic-bit-m-r50x3-ilsvrc2012-classification-1
• BiT-M R101x1: tensorflow-ic-bit-m-r101x1-ilsvrc2012-classification-1
• BiT-M R101x3: tensorflow-ic-bit-m-r101x3-ilsvrc2012-classification-1
• BiT-M R50x1 ImageNet-21k: tensorflow-ic-bit-m-r50x1-imagenet21k-classification-1
• BiT-M R50x3 ImageNet-21k: tensorflow-ic-bit-m-r50x3-imagenet21k-classification-1
• BiT-M R101x1 ImageNet-21k: tensorflow-ic-bit-m-r101x1-imagenet21k-classification-1
• BiT-M R101x3 ImageNet-21k: tensorflow-ic-bit-m-r101x3-imagenet21k-classification-1


Image Classification - TensorFlow Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Image Classification -
TensorFlow algorithm. See Tune an Image Classification - TensorFlow model (p. 1529) for information on
hyperparameter tuning.

Parameter Name Description

augmentation Set to "True" to apply augmentation_random_flip,


augmentation_random_rotation, and
augmentation_random_zoom to the training data.

Valid values: string, either: ("True" or "False").

Default value: "False".

augmentation_random_flip Indicates which flip mode to use for data augmentation when
augmentation is set to "True". For more information, see
RandomFlip in the TensorFlow documentation.

Valid values: string, any of the following:


("horizontal_and_vertical", "vertical", or "None").

Default value: "horizontal_and_vertical".

Indicates how much rotation to use for data augmentation when


augmentation_random_rotation
augmentation is set to "True". Values represent a fraction of
2π. Positive values rotate counterclockwise while negative values
rotate clockwise. 0 means no rotation. For more information, see
RandomRotation in the TensorFlow documentation.

Valid values: float, range: [-1.0, 1.0].

Default value: 0.2.

augmentation_random_zoom Indicates how much vertical zoom to use for data augmentation
when augmentation is set to "True". Positive values zoom
out while negative values zoom in. 0 means no zoom. For more
information, see RandomZoom in the TensorFlow documentation.

Valid values: float, range: [-1.0, 1.0].

Default value: 0.1.

batch_size The batch size for training. For training on instances with multiple
GPUs, this batch size is used across the GPUs.

Valid values: positive integer.

Default value: 32.

beta_1 The beta1 for the "adam" optimizer. Represents the exponential
decay rate for the first moment estimates. Ignored for other
optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.9.

beta_2 The beta2 for the "adam" optimizer. Represents the exponential
decay rate for the second moment estimates. Ignored for other
optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.999.

binary_mode When binary_mode is set to "True", the model returns a single
probability number for the positive class and can use additional
eval_metric options. Use only for binary classification problems.

Valid values: string, either: ("True" or "False").

Default value: "False".

dropout_rate The dropout rate for the dropout layer in the top classification
layer.

Valid values: float, range: [0.0, 1.0].

Default value: 0.2.

early_stopping Set to "True" to use early stopping logic during training. If
"False", early stopping is not used.

Valid values: string, either: ("True" or "False").

Default value: "False".

early_stopping_min_delta The minimum change needed to qualify as an
improvement. An absolute change less than the value of
early_stopping_min_delta does not qualify as improvement.
Used only when early_stopping is set to "True".

Valid values: float, range: [0.0, 1.0].

Default value: 0.0.

early_stopping_patience The number of epochs to continue training with no improvement.
Used only when early_stopping is set to "True".

Valid values: positive integer.

Default value: 5.

epochs The number of training epochs.

Valid values: positive integer.

Default value: 3.

epsilon The epsilon for "adam", "rmsprop", "adadelta", and
"adagrad" optimizers. Usually set to a small value to avoid
division by 0. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 1e-7.

eval_metric If binary_mode is set to "False", eval_metric can only be
"accuracy". If binary_mode is "True", select any of the valid
values. For more information, see Metrics in the TensorFlow
documentation.

Valid values: string, any of the following: ("accuracy",
"precision", "recall", "auc", or "prc").

Default value: "accuracy".

image_resize_interpolation Indicates interpolation method used when resizing images.
For more information, see image.resize in the TensorFlow
documentation.

Valid values: string, any of the following: ("bilinear",
"nearest", "bicubic", "area", "lanczos3", "lanczos5",
"gaussian", or "mitchellcubic").

Default value: "bilinear".

initial_accumulator_value The starting value for the accumulators, or the per-parameter
momentum values, for the "adagrad" optimizer. Ignored for other
optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.0001.

label_smoothing Indicates how much to relax the confidence on label values.
For example, if label_smoothing is 0.1, then non-target
labels are 0.1/num_classes and target labels are
0.9+0.1/num_classes.

Valid values: float, range: [0.0, 1.0].

Default value: 0.1.

learning_rate The optimizer learning rate.

Valid values: float, range: [0.0, 1.0].

Default value: 0.001.

momentum The momentum for "sgd", "nesterov", and "rmsprop"
optimizers. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.9.

optimizer The optimizer type. For more information, see Optimizers in the
TensorFlow documentation.

Valid values: string, any of the following: ("adam", "sgd",
"nesterov", "rmsprop", "adagrad", or "adadelta").

Default value: "adam".

regularizers_l2 The L2 regularization factor for the dense layer in the classification
layer.

Valid values: float, range: [0.0, 1.0].

Default value: 0.0001.

reinitialize_top_layer If set to "Auto", the top classification layer parameters are
re-initialized during fine-tuning. For incremental training, top
classification layer parameters are not re-initialized unless set to
"True".

Valid values: string, any of the following: ("Auto", "True", or
"False").

Default value: "Auto".

rho The discounting factor for the gradient of the "adadelta" and
"rmsprop" optimizers. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.95.

train_only_top_layer If "True", only the top classification layer parameters are fine-tuned. If "False", all model parameters are fine-tuned.

Valid values: string, either: ("True" or "False").

Default value: "False".

Tune an Image Classification - TensorFlow model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics computed by the Image Classification - TensorFlow algorithm

The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is
computed during training. When tuning the model, choose this metric as the objective metric.


Metric Name Description Optimization Direction

validation:accuracy The ratio of the number of correct predictions to the total number of predictions made. Maximize

Tunable Image Classification - TensorFlow hyperparameters

Tune an image classification model with the following hyperparameters. The hyperparameters that
have the greatest impact on image classification objective metrics are: batch_size, learning_rate,
and optimizer. Tune the optimizer-related hyperparameters, such as momentum, regularizers_l2,
beta_1, beta_2, and eps based on the selected optimizer. For example, use beta_1 and beta_2
only when adam is the optimizer.

For more information about which hyperparameters are used for each optimizer, see Image
Classification - TensorFlow Hyperparameters (p. 1526).

Parameter Name Parameter Type Recommended Ranges

batch_size IntegerParameterRanges MinValue: 8, MaxValue: 512

beta_1 ContinuousParameterRanges MinValue: 1e-6, MaxValue: 0.999

beta_2 ContinuousParameterRanges MinValue: 1e-6, MaxValue: 0.999

eps ContinuousParameterRanges MinValue: 1e-8, MaxValue: 1.0

learning_rate ContinuousParameterRanges MinValue: 1e-6, MaxValue: 0.5

momentum ContinuousParameterRanges MinValue: 0.0, MaxValue: 0.999

optimizer CategoricalParameterRanges ['sgd', 'adam', 'rmsprop', 'nesterov', 'adagrad', 'adadelta']

regularizers_l2 ContinuousParameterRanges MinValue: 0.0, MaxValue: 0.999

train_only_top_layer CategoricalParameterRanges ['True', 'False']
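
As an illustration of wiring these ranges into a tuning job with the SageMaker Python SDK, the following is a minimal sketch; it assumes an Estimator named tf_ic_estimator already configured for this algorithm, and the metric regex and range choices are illustrative rather than authoritative.

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Hypothetical: tf_ic_estimator is an Estimator already set up for the
# Image Classification - TensorFlow algorithm, as in the JumpStart examples.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-6, 0.5),
    "batch_size": IntegerParameter(8, 512),
    "optimizer": CategoricalParameter(["adam", "sgd", "rmsprop"]),
}

tuner = HyperparameterTuner(
    estimator=tf_ic_estimator,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges=hyperparameter_ranges,
    # metric_definitions tells SageMaker how to parse the objective metric from
    # the training logs; the regex below is illustrative, so check the
    # algorithm's emitted logs for the exact pattern.
    metric_definitions=[{"Name": "validation:accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"}],
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)

# Launch the tuning job against your training data channel
tuner.fit({"training": "s3://your_bucket/your_training_data/"})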

Object Detection - MXNet


The Amazon SageMaker Object Detection - MXNet algorithm detects and classifies objects in images
using a single deep neural network. It is a supervised learning algorithm that takes images as input and
identifies all instances of objects within the image scene. The object is categorized into one of the classes
in a specified collection with a confidence score that it belongs to the class. Its location and scale in the
image are indicated by a rectangular bounding box. It uses the Single Shot multibox Detector (SSD)
framework and supports two base networks: VGG and ResNet. The network can be trained from scratch,
or trained with models that have been pre-trained on the ImageNet dataset.

Topics


• Input/Output Interface for the Object Detection Algorithm (p. 1531)


• EC2 Instance Recommendation for the Object Detection Algorithm (p. 1533)
• Object Detection Sample Notebooks (p. 1533)
• How Object Detection Works (p. 1534)
• Object Detection Hyperparameters (p. 1534)
• Tune an Object Detection Model (p. 1539)
• Object Detection Request and Response Formats (p. 1540)

Input/Output Interface for the Object Detection Algorithm

The SageMaker Object Detection algorithm supports both RecordIO (application/x-recordio) and
image (image/png, image/jpeg, and application/x-image) content types for training in file mode
and supports RecordIO (application/x-recordio) for training in pipe mode. However, you can also
train in pipe mode using the image files (image/png, image/jpeg, and application/x-image),
without creating RecordIO files, by using the augmented manifest format. The recommended input
format for the Amazon SageMaker object detection algorithms is Apache MXNet RecordIO. However, you
can also use raw images in .jpg or .png format. The algorithm supports only application/x-image for
inference.
Note
To maintain better interoperability with existing deep learning frameworks, this differs from the
protobuf data formats commonly used by other Amazon SageMaker algorithms.

See the Object Detection Sample Notebooks (p. 1533) for more details on data formats.

Train with the RecordIO Format

If you use the RecordIO format for training, specify both train and validation channels as values for the
InputDataConfig parameter of the CreateTrainingJob request. Specify one RecordIO (.rec) file in
the train channel and one RecordIO file in the validation channel. Set the content type for both channels
to application/x-recordio. An example of how to generate a RecordIO file can be found in the object detection sample notebook. You can also use tools from MXNet's GluonCV to generate RecordIO files
for popular datasets like the PASCAL Visual Object Classes and Common Objects in Context (COCO).
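
As a sketch of how these two channels might be wired up with the SageMaker Python SDK, consider the following; the bucket, file names, and the od_estimator variable are assumptions for illustration.

from sagemaker.inputs import TrainingInput

# Hypothetical S3 locations of the RecordIO files
train_input = TrainingInput(
    "s3://your_bucket/train/train.rec",
    content_type="application/x-recordio",
)
validation_input = TrainingInput(
    "s3://your_bucket/validation/val.rec",
    content_type="application/x-recordio",
)

# od_estimator is assumed to be an Estimator configured with the
# Object Detection algorithm's training image
od_estimator.fit({"train": train_input, "validation": validation_input})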

Train with the Image Format

If you use the image format for training, specify train, validation, train_annotation,
and validation_annotation channels as values for the InputDataConfig parameter of the CreateTrainingJob request. Specify the individual image data (.jpg or .png) files for the train and
validation channels. For annotation data, you can use the JSON format. Specify the corresponding .json
files in the train_annotation and validation_annotation channels. Set the content type for all
four channels to image/png or image/jpeg based on the image type. You can also use the content
type application/x-image when your dataset contains both .jpg and .png images. The following is an
example of a .json file.

{
    "file": "your_image_directory/sample_image1.jpg",
    "image_size": [
        {
            "width": 500,
            "height": 400,
            "depth": 3
        }
    ],
    "annotations": [
        {
            "class_id": 0,
            "left": 111,
            "top": 134,
            "width": 61,
            "height": 128
        },
        {
            "class_id": 0,
            "left": 161,
            "top": 250,
            "width": 79,
            "height": 143
        },
        {
            "class_id": 1,
            "left": 101,
            "top": 185,
            "width": 42,
            "height": 130
        }
    ],
    "categories": [
        {
            "class_id": 0,
            "name": "dog"
        },
        {
            "class_id": 1,
            "name": "cat"
        }
    ]
}

Each image needs a .json file for annotation, and the .json file should have the same name as the
corresponding image. The name of the above .json file should be "sample_image1.json". There are four
properties in the annotation .json file. The property "file" specifies the relative path of the image file.
For example, if your training images and corresponding .json files are stored in s3://your_bucket/
train/sample_image and s3://your_bucket/train_annotation, specify the path for your train and
train_annotation channels as s3://your_bucket/train and s3://your_bucket/train_annotation,
respectively.

In the .json file, the relative path for an image named sample_image1.jpg should be sample_image/
sample_image1.jpg. The "image_size" property specifies the overall image dimensions. The
SageMaker object detection algorithm currently only supports 3-channel images. The "annotations"
property specifies the categories and bounding boxes for objects within the image. Each object is
annotated by a "class_id" index and by four bounding box coordinates ("left", "top", "width",
"height"). The "left" (x-coordinate) and "top" (y-coordinate) values represent the upper-left corner
of the bounding box. The "width" (x-coordinate) and "height" (y-coordinate) values represent the
dimensions of the bounding box. The origin (0, 0) is the upper-left corner of the entire image. If you
have multiple objects within one image, all the annotations should be included in a single .json file. The
"categories" property stores the mapping between the class index and class name. The class indices
should be numbered successively and the numbering should start with 0. The "categories" property is
optional for the annotation .json file.
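
To make the layout concrete, the following is a minimal sketch that writes one such annotation .json with Python's standard library; the file names and box values are illustrative.

import json

annotation = {
    # Relative path of the image under the train channel
    "file": "sample_image/sample_image1.jpg",
    "image_size": [{"width": 500, "height": 400, "depth": 3}],
    "annotations": [
        # One entry per object: upper-left corner (left, top) plus box width and height
        {"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128},
        {"class_id": 1, "left": 101, "top": 185, "width": 42, "height": 130},
    ],
    # Optional mapping between class indices and class names
    "categories": [
        {"class_id": 0, "name": "dog"},
        {"class_id": 1, "name": "cat"},
    ],
}

# The annotation file must share its base name with the image it describes
with open("sample_image1.json", "w") as f:
    json.dump(annotation, f, indent=2)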

Train with Augmented Manifest Image Format

The augmented manifest format enables you to do training in pipe mode using image files without
needing to create RecordIO files. You need to specify both train and validation channels as values for
the InputDataConfig parameter of the CreateTrainingJob request. While using the format,
an S3 manifest file needs to be generated that contains the list of images and their corresponding
annotations. The manifest file format should be in JSON Lines format in which each line represents one
sample. The images are specified using the 'source-ref' tag that points to the S3 location of the
image. The annotations are provided under the "AttributeNames" parameter value as specified in the


CreateTrainingJob request. It can also contain additional metadata under the metadata tag, but
these are ignored by the algorithm. In the following example, the "AttributeNames" are contained in the list ["source-ref", "bounding-box"]:

{"source-ref": "s3://your_bucket/image1.jpg", "bounding-box":{"image_size":[{ "width":


500, "height": 400, "depth":3}], "annotations":[{"class_id": 0, "left": 111, "top":
134, "width": 61, "height": 128}, {"class_id": 5, "left": 161, "top": 250, "width": 80,
"height": 50}]}, "bounding-box-metadata":{"class-map":{"0": "dog", "5": "horse"}, "type":
"groundtruth/object-detection"}}
{"source-ref": "s3://your_bucket/image2.jpg", "bounding-box":{"image_size":[{ "width":
400, "height": 300, "depth":3}], "annotations":[{"class_id": 1, "left": 100, "top": 120,
"width": 43, "height": 78}]}, "bounding-box-metadata":{"class-map":{"1": "cat"}, "type":
"groundtruth/object-detection"}}

The order of "AttributeNames" in the input files matters when training the Object Detection
algorithm. It accepts piped data in a specific order, with image first, followed by annotations. So the
"AttributeNames" in this example are provided with "source-ref" first, followed by "bounding-box".
When using Object Detection with Augmented Manifest, the value of parameter RecordWrapperType
must be set as "RecordIO".

For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).
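
A minimal sketch of generating such a manifest with Python's standard library follows; the records and S3 paths are illustrative, and in practice you would build them from your own labels.

import json

records = [
    {
        "source-ref": "s3://your_bucket/image1.jpg",
        "bounding-box": {
            "image_size": [{"width": 500, "height": 400, "depth": 3}],
            "annotations": [
                {"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128}
            ],
        },
    },
]

# JSON Lines: one complete JSON object per line, with no enclosing array
with open("train.manifest", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")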

Incremental Training

You can also seed the training of a new model with the artifacts from a model that you trained
previously with SageMaker. Incremental training saves training time when you want to train a new model
with the same or similar data. SageMaker object detection models can be seeded only with another built-
in object detection model trained in SageMaker.

To use a pretrained model, in the CreateTrainingJob request, specify the ChannelName as "model"
in the InputDataConfig parameter. Set the ContentType for the model channel to application/x-sagemaker-model. The input hyperparameters of both the new model and the pretrained model
that you upload to the model channel must have the same settings for the base_network and
num_classes input parameters. These parameters define the network architecture. For the pretrained
model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker. You can use
either RecordIO or image formats for input data.

For more information on incremental training and for instructions on how to use it, see Incremental
Training in Amazon SageMaker (p. 2113).
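
The following is a minimal sketch of this setup with the SageMaker Python SDK; the S3 paths and the od_estimator variable are assumptions, and the seed artifacts must come from a previous SageMaker object detection training job.

from sagemaker.inputs import TrainingInput

inputs = {
    "train": TrainingInput(
        "s3://your_bucket/train/train.rec", content_type="application/x-recordio"
    ),
    "validation": TrainingInput(
        "s3://your_bucket/validation/val.rec", content_type="application/x-recordio"
    ),
    # The model channel seeds training with previously trained artifacts
    "model": TrainingInput(
        "s3://your_bucket/previous-job/output/model.tar.gz",
        content_type="application/x-sagemaker-model",
    ),
}

# od_estimator must use the same base_network and num_classes
# hyperparameters as the seed model
od_estimator.fit(inputs)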

EC2 Instance Recommendation for the Object Detection Algorithm

The object detection algorithm supports P2, P3, G4dn, and G5 GPU instance families. We recommend
using GPU instances with more memory for training with large batch sizes. You can run the object
detection algorithm on multi-GPU and multi-machine settings for distributed training.

You can use both CPU (such as C5 and M5) and GPU (such as P3 and G4dn) instances for inference.

Object Detection Sample Notebooks

For a sample notebook that shows how to use the SageMaker Object Detection algorithm to train and host a model on the Caltech Birds (CUB 200 2011) dataset using the Single Shot multibox Detector algorithm, see Amazon SageMaker Object Detection for Bird Species. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon SageMaker Notebook
Instances (p. 204). Once you have created a notebook instance and opened it, select the SageMaker
Examples tab to see a list of all the SageMaker samples. The object detection example notebook using


the Object Detection algorithm is located in the Introduction to Amazon Algorithms section. To open a
notebook, choose its Use tab and choose Create copy.

How Object Detection Works

The object detection algorithm identifies and locates all instances of objects in an image from a known
collection of object categories. The algorithm takes an image as input and outputs the category that
the object belongs to, along with a confidence score that it belongs to the category. The algorithm
also predicts the object's location and scale with a rectangular bounding box. Amazon SageMaker
Object Detection uses the Single Shot multibox Detector (SSD) algorithm that takes a convolutional
neural network (CNN) pretrained for a classification task as the base network. SSD uses the output of
intermediate layers as features for detection.

Various CNNs such as VGG and ResNet have achieved great performance on the image classification task.
Object detection in Amazon SageMaker supports both VGG-16 and ResNet-50 as a base network for SSD.
The algorithm can be trained in full training mode or in transfer learning mode. In full training mode, the
base network is initialized with random weights and then trained on user data. In transfer learning mode,
the base network weights are loaded from pretrained models.

The object detection algorithm uses standard data augmentation operations, such as flip, rescale, and
jitter, on the fly internally to help avoid overfitting.

Object Detection Hyperparameters

In the CreateTrainingJob request, you specify the training algorithm that you want to use. You
can also specify algorithm-specific hyperparameters that are used to help estimate the parameters of
the model from a training dataset. The following table lists the hyperparameters provided by Amazon
SageMaker for training the object detection algorithm. For more information about how object detection training
works, see How Object Detection Works (p. 1534).

Parameter Name Description

num_classes The number of output classes. This parameter defines the dimensions of the network output and is typically set to the number of classes in the dataset.

Required

Valid values: positive integer

num_training_samples The number of training examples in the input dataset.


Note
If there is a mismatch between this value and the number
of samples in the training set, then the behavior of the
lr_scheduler_step parameter will be undefined and
distributed training accuracy may be affected.

Required

Valid values: positive integer

base_network The base network architecture to use.

Optional

Valid values: 'vgg-16' or 'resnet-50'

Default value: 'vgg-16'


early_stopping True to use early stopping logic during training. False not to use
it.

Optional

Valid values: True or False

Default value: False

early_stopping_min_epochs The minimum number of epochs that must be run before the early stopping logic can be invoked. It is used only when early_stopping = True.

Optional

Valid values: positive integer

Default value: 10

early_stopping_patience The number of epochs to wait before ending training if no improvement, as defined by the early_stopping_tolerance hyperparameter, is made in the relevant metric. It is used only when early_stopping = True.

Optional

Valid values: positive integer

Default value: 5

early_stopping_tolerance The tolerance value that the relative improvement in validation:mAP, the mean average precision (mAP), is required to exceed to avoid early stopping. If the ratio of the change in the mAP divided by the previous best mAP is smaller than the early_stopping_tolerance value set, early stopping considers that there is no improvement. It is used only when early_stopping = True.

Optional

Valid values: 0 ≤ float ≤ 1

Default value: 0.0

image_shape The image size for input images. We rescale the input image to a
square image with this size. We recommend using 300 and 512 for
better performance.

Optional

Valid values: positive integer ≥300

Default: 300


epochs The number of training epochs.

Optional

Valid values: positive integer

Default: 30

freeze_layer_pattern The regular expression (regex) for freezing layers in the base
network. For example, if we set freeze_layer_pattern =
"^(conv1_|conv2_).*", then any layers with a name that
contains "conv1_" or "conv2_" are frozen, which means that
the weights for these layers are not updated during training. The
layer names can be found in the network symbol files vgg16-symbol.json and resnet-50-symbol.json. Freezing a layer means that its weights cannot be modified further. This can reduce training
time significantly in exchange for modest losses in accuracy. This
technique is commonly used in transfer learning where the lower
layers in the base network do not need to be retrained.

Optional

Valid values: string

Default: No layers frozen.

kv_store The weight update synchronization mode used for distributed training. The weights can be updated either synchronously or asynchronously across machines. Synchronous updates typically provide better accuracy than asynchronous updates but can be slower. See the Distributed Training MXNet tutorial for details.
Note
This parameter is not applicable to single machine training.

Optional

Valid values: 'dist_sync' or 'dist_async'

• 'dist_sync': The gradients are synchronized after every batch with all the workers. With 'dist_sync', batch-size now means the batch size used on each machine. So if there are n machines and we use batch size b, then dist_sync behaves like a single machine with batch size n*b.
• 'dist_async': Performs asynchronous updates. The weights
are updated whenever gradients are received from any machine
and the weight updates are atomic. However, the order is not
guaranteed.

Default: -


label_width The force padding label width used to sync across training and
validation data. For example, if one image in the data contains at
most 10 objects, and each object's annotation is specified with 5
numbers, [class_id, left, top, width, height], then the label_width
should be no smaller than (10*5 + header information length). The
header information length is usually 2. We recommend using a
slightly larger label_width for the training, such as 60 for this
example.

Optional

Valid values: Positive integer large enough to accommodate the largest annotation information length in the data.

Default: 350

learning_rate The initial learning rate.

Optional

Valid values: float in (0, 1]

Default: 0.001

lr_scheduler_factor The ratio to reduce learning rate. Used in conjunction with the
lr_scheduler_step parameter defined as lr_new = lr_old *
lr_scheduler_factor.

Optional

Valid values: float in (0, 1)

Default: 0.1

lr_scheduler_step The epochs at which to reduce the learning rate. The learning rate is
reduced by lr_scheduler_factor at epochs listed in a comma-
delimited string: "epoch1, epoch2, ...". For example, if the value is
set to "10, 20" and the lr_scheduler_factor is set to 1/2, then
the learning rate is halved after 10th epoch and then halved again
after 20th epoch.

Optional

Valid values: string

Default: empty string


mini_batch_size The batch size for training. In a single-machine multi-GPU setting, each GPU handles mini_batch_size/num_gpu training samples. For multi-machine training in dist_sync mode, the actual batch size is mini_batch_size*number of machines. A large mini_batch_size usually leads to faster training, but it may cause an out-of-memory problem. The memory usage is related to mini_batch_size, image_shape, and base_network architecture. For example, on a single p3.2xlarge instance, the largest mini_batch_size without an out-of-memory error is 32 with the base_network set to "resnet-50" and an image_shape of 300. With the same instance, you can use 64 as the mini_batch_size with the base network vgg-16 and an image_shape of 300.

Optional

Valid values: positive integer

Default: 32

momentum The momentum for sgd. Ignored for other optimizers.

Optional

Valid values: float in (0, 1]

Default: 0.9

nms_threshold The non-maximum suppression threshold.

Optional

Valid values: float in (0, 1]

Default: 0.45

optimizer The optimizer types. For details on optimizer values, see MXNet's
API.

Optional

Valid values: ['sgd', 'adam', 'rmsprop', 'adadelta']

Default: 'sgd'

overlap_threshold The evaluation overlap threshold.

Optional

Valid values: float in (0, 1]

Default: 0.5


use_pretrained_model Indicates whether to use a pre-trained model for training. If set to 1, then the pre-trained model with corresponding architecture is loaded and used for training. Otherwise, the network is trained from scratch.

Optional

Valid values: 0 or 1

Default: 1

weight_decay The weight decay coefficient for sgd and rmsprop. Ignored for
other optimizers.

Optional

Valid values: float in (0, 1)

Default: 0.0005

Tune an Object Detection Model


Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics Computed by the Object Detection Algorithm


The object detection algorithm reports on a single metric during training: validation:mAP. When
tuning a model, choose this metric as the objective metric.

Metric Name Description Optimization Direction

validation:mAP Mean Average Precision (mAP) computed on the validation set. Maximize

Tunable Object Detection Hyperparameters


Tune the Amazon SageMaker object detection model with the following hyperparameters. The
hyperparameters that have the greatest impact on the object detection objective metric are:
mini_batch_size, learning_rate, and optimizer.

Parameter Name Parameter Type Recommended Ranges

learning_rate ContinuousParameterRange MinValue: 1e-6, MaxValue: 0.5

mini_batch_size IntegerParameterRanges MinValue: 8, MaxValue: 64

momentum ContinuousParameterRange MinValue: 0.0, MaxValue: 0.999

optimizer CategoricalParameterRanges ['sgd', 'adam', 'rmsprop', 'adadelta']

weight_decay ContinuousParameterRange MinValue: 0.0, MaxValue: 0.999

Object Detection Request and Response Formats

Request Format
Query a trained model by using the model's endpoint. The endpoint takes .jpg and .png image formats
with image/jpeg and image/png content-types.

Response Formats
The response is the class index with a confidence score and bounding box coordinates for all objects
within the image, encoded in JSON format. The following is an example response:

{"prediction":[
[4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507,
0.9345266819000244],
[0.0, 0.73376623392105103, 0.5714187026023865, 0.40427327156066895, 0.827075183391571,
0.9712159633636475],
[4.0, 0.32643985450267792, 0.3677481412887573, 0.034883320331573486, 0.6318609714508057,
0.5967587828636169],
[8.0, 0.22552496790885925, 0.6152569651603699, 0.5722782611846924, 0.882301390171051,
0.8985623121261597],
[3.0, 0.42260299175977707, 0.019305512309074402, 0.08386176824569702,
0.39093565940856934, 0.9574796557426453]
]}

Each row in this .json file contains an array that represents a detected object. Each of these object
arrays consists of a list of six numbers. The first number is the predicted class label. The second
number is the associated confidence score for the detection. The last four numbers represent the
bounding box coordinates [xmin, ymin, xmax, ymax]. These output bounding box corner indices
are normalized by the overall image size. Note that this encoding is different than that use by the
input .json format. For example, in the first entry of the detection result, 0.3088374733924866 is the
left coordinate (x-coordinate of upper-left corner) of the bounding box as a ratio of the overall image
width, 0.07030484080314636 is the top coordinate (y-coordinate of upper-left corner) of the bounding
box as a ratio of the overall image height, 0.7110607028007507 is the right coordinate (x-coordinate of
lower-right corner) of the bounding box as a ratio of the overall image width, and 0.9345266819000244
is the bottom coordinate (y-coordinate of lower-right corner) of the bounding box as a ratio of the
overall image height.

To avoid unreliable detection results, you might want to filter out the detection results with low
confidence scores. In the object detection sample notebook, we provide examples of scripts that use a
threshold to remove low confidence detections and to plot bounding boxes on the original images.
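
As an illustration, the following sketch filters a response by confidence and converts the normalized coordinates back to pixels; the threshold, image dimensions, and sample payload are illustrative.

import json

# response_body stands in for the JSON payload returned by the endpoint
response_body = '{"prediction": [[4.0, 0.86, 0.31, 0.07, 0.71, 0.93]]}'
image_width, image_height = 640, 480  # dimensions of the query image
threshold = 0.5  # discard detections below this confidence

for label_id, score, xmin, ymin, xmax, ymax in json.loads(response_body)["prediction"]:
    if score < threshold:
        continue
    # Scale the normalized corner coordinates back to pixel values
    left, top = xmin * image_width, ymin * image_height
    right, bottom = xmax * image_width, ymax * image_height
    print(f"class {int(label_id)} ({score:.2f}): ({left:.0f}, {top:.0f}) to ({right:.0f}, {bottom:.0f})")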

For batch transform, the response is in JSON format, where the format is identical to the JSON format described above. The detection results for each image are represented as a JSON file. For example:

{"prediction": [[label_id, confidence_score, xmin, ymin, xmax, ymax], [label_id, confidence_score, xmin, ymin, xmax, ymax]]}


For more details on training and inference, see the Object Detection Sample Notebooks (p. 1533).

OUTPUT: JSON Response Format


accept: application/json;annotation=1

{
    "image_size": [
        {
            "width": 500,
            "height": 400,
            "depth": 3
        }
    ],
    "annotations": [
        {
            "class_id": 0,
            "score": 0.943,
            "left": 111,
            "top": 134,
            "width": 61,
            "height": 128
        },
        {
            "class_id": 0,
            "score": 0.0013,
            "left": 161,
            "top": 250,
            "width": 79,
            "height": 143
        },
        {
            "class_id": 1,
            "score": 0.0133,
            "left": 101,
            "top": 185,
            "width": 42,
            "height": 130
        }
    ]
}

Object Detection - TensorFlow


The Amazon SageMaker Object Detection - TensorFlow algorithm is a supervised learning algorithm
that supports transfer learning with many pretrained models from the TensorFlow Model Garden. Use
transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a
large amount of image data is not available. The object detection algorithm takes an image as input
and outputs a list of bounding boxes. Training datasets must consist of images in .jpg, .jpeg, or .png
format.

Topics
• How to use the SageMaker Object Detection - TensorFlow algorithm (p. 1542)
• Input and output interface for the Object Detection - TensorFlow algorithm (p. 1543)
• Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm (p. 1544)
• Object Detection - TensorFlow sample notebooks (p. 1544)
• How Object Detection - TensorFlow Works (p. 1544)
• TensorFlow Models (p. 1545)
• Object Detection - TensorFlow Hyperparameters (p. 1546)
• Tune an Object Detection - TensorFlow model (p. 1548)


How to use the SageMaker Object Detection - TensorFlow algorithm


You can use Object Detection - TensorFlow as an Amazon SageMaker built-in algorithm. The following
section describes how to use Object Detection - TensorFlow with the SageMaker Python SDK. For
information on how to use Object Detection - TensorFlow from the Amazon SageMaker Studio UI, see
SageMaker JumpStart (p. 47).

The Object Detection - TensorFlow algorithm supports transfer learning using any of the compatible
pretrained TensorFlow models. For a list of all available pretrained models, see TensorFlow
Models (p. 1545). Every pretrained model has a unique model_id. The following example uses
ResNet50 (model_id: tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8) to fine-
tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and
stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated
model training artifacts to construct a SageMaker Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the
hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters
and their default values with hyperparameters.retrieve_default. For more information, see
Object Detection - TensorFlow Hyperparameters (p. 1546). Use these values to construct a SageMaker
Estimator.
Note
Default hyperparameter values are different for different models. For example, for larger
models, the default number of epochs is smaller.

This example uses the PennFudanPed dataset, which contains images of pedestrians in the street. We pre-downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call .fit
using the Amazon S3 location of your training dataset.

import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Set up the SageMaker session, Region, and execution role used below.
# get_execution_role assumes a SageMaker environment; outside a notebook,
# supply an explicit role ARN instead.
sess = sagemaker.Session()
aws_region = sess.boto_region_name
aws_role = sagemaker.get_execution_role()

model_id, model_version = "tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8", "*"

training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/PennFudanPed_COCO_format/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-od-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create an Estimator instance
tf_od_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a training job
tf_od_estimator.fit({"training": training_dataset_s3_path}, logs=True)

For more information about how to use the SageMaker Object Detection - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to SageMaker TensorFlow - Object Detection
notebook.

Input and output interface for the Object Detection - TensorFlow algorithm

Each of the pretrained models listed in TensorFlow Models can be fine-tuned to any dataset with
any number of image classes. Be mindful of how to format your training data for input to the Object
Detection - TensorFlow model.

• Training data input format: Your training data should be a directory with an images subdirectory and
an annotations.json file.

The following is an example of an input directory structure. The input directory should be hosted in an
Amazon S3 bucket with a path similar to the following: s3://bucket_name/input_directory/.
Note that the trailing / is required.

input_directory
|--images
|--abc.png
|--def.png
|--annotations.json

The annotations.json file should contain information for bounding boxes and their class labels in
the form of a dictionary with "images" and "annotations" keys. The value for the "images" key should
be a list of dictionaries. There should be one dictionary for each image with the following information:
{"file_name": image_name, "height": height, "width": width, "id": image_id}. The
value for the "annotations" key should also be a list of dictionaries. There should be one dictionary
for each bounding box with the following information: {"image_id": image_id, "bbox": [xmin,
ymin, xmax, ymax], "category_id": bbox_label}.

After training, a label mapping file and trained model are saved to your Amazon S3 bucket.
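
As an illustration of this layout, the following sketch assembles a minimal annotations.json with Python's standard library; the image names, sizes, and boxes are illustrative.

import json

annotations_file = {
    "images": [
        # One dictionary per image in the images subdirectory
        {"file_name": "abc.png", "height": 400, "width": 500, "id": 0},
        {"file_name": "def.png", "height": 400, "width": 500, "id": 1},
    ],
    "annotations": [
        # One dictionary per bounding box, tied to an image by image_id
        {"image_id": 0, "bbox": [111, 134, 172, 262], "category_id": 0},
        {"image_id": 1, "bbox": [101, 185, 143, 315], "category_id": 1},
    ],
}

with open("annotations.json", "w") as f:
    json.dump(annotations_file, f, indent=2)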

Incremental training

You can seed the training of a new model with artifacts from a model that you trained previously with
SageMaker. Incremental training saves training time when you want to train a new model with the same
or similar data.
Note
You can only seed a SageMaker Object Detection - TensorFlow model with another Object
Detection - TensorFlow model trained in SageMaker.


You can use any dataset for incremental training, as long as the set of classes remains the same. The
incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained
model, you start with an existing fine-tuned model. For more information about how to use incremental
training with the SageMaker Object Detection - TensorFlow algorithm, see the Introduction to SageMaker
TensorFlow - Object Detection notebook.

Inference with the Object Detection - TensorFlow algorithm

You can host the fine-tuned model that results from your TensorFlow Object Detection training for
inference. Any input image for inference must be in .jpg, .jpeg, or .png format and be content
type application/x-image. The Object Detection - TensorFlow algorithm resizes input images
automatically.

Running inference results in bounding boxes, predicted classes, and the scores of each prediction
encoded in JSON format. The Object Detection - TensorFlow model processes a single image per request
and outputs only one line. The following is an example of a JSON format response:

accept: application/json;verbose

{"normalized_boxes":[[xmin1, xmax1, ymin1, ymax1],....],


"classes":[classidx1, class_idx2,...],
"scores":[score_1, score_2,...],
"labels": [label1, label2, ...],
"tensorflow_model_output":<original output of the model>}

If accept is set to application/json, then the model only outputs normalized boxes, classes, and
scores.
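
A sketch of querying a deployed endpoint and reading this response with boto3 might look like the following; the endpoint name and image path are assumptions.

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("test_image.jpg", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="your-od-tensorflow-endpoint",  # hypothetical endpoint name
    ContentType="application/x-image",
    Accept="application/json;verbose",
    Body=payload,
)

result = json.loads(response["Body"].read())
for box, label, score in zip(result["normalized_boxes"], result["labels"], result["scores"]):
    print(f"{label}: {score:.3f} at {box}")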

Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm

The Object Detection - TensorFlow algorithm supports all GPU instances for training, including:

• ml.p2.xlarge
• ml.p2.16xlarge
• ml.p3.2xlarge
• ml.p3.16xlarge

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such
as M5) and GPU (P2 or P3) instances can be used for inference. For a comprehensive list of SageMaker
training and inference instances across AWS Regions, see Amazon SageMaker Pricing.

Object Detection - TensorFlow sample notebooks

For more information about how to use the SageMaker Object Detection - TensorFlow algorithm for
transfer learning on a custom dataset, see the Introduction to SageMaker TensorFlow - Object Detection
notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Amazon SageMaker Notebook Instances (p. 204). After you have created a
notebook instance and opened it, select the SageMaker Examples tab to see a list of all the SageMaker
samples. To open a notebook, choose its Use tab and choose Create copy.

How Object Detection - TensorFlow Works

The Object Detection - TensorFlow algorithm takes an image as input and predicts bounding boxes and
object labels. Various deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet
are highly accurate for object detection. There are also deep learning networks that are trained on large


image datasets, such as Common Objects in Context (COCO), which has 328,000 images. After a network
is trained with COCO data, you can then fine-tune the network on a dataset with a particular focus to
perform more specific object detection tasks. The Amazon SageMaker Object Detection - TensorFlow
algorithm supports transfer learning on many pretrained models that are available in the TensorFlow
Model Garden.

According to the number of class labels in your training data, an object detection layer is attached to the
pretrained TensorFlow model of your choice. You can then fine-tune either the entire network (including
the pretrained model) or only the top classification layer on new training data. With this method of
transfer learning, training with smaller datasets is possible.

TensorFlow Models

The following pretrained models are available to use for transfer learning with the Object Detection -
TensorFlow algorithm.

The following models vary significantly in size, number of model parameters, training time, and
inference latency for any given dataset. The best model for your use case depends on the complexity
of your fine-tuning dataset and any requirements that you have on training time, inference latency, or
model accuracy.

Model Name model_id Source

ResNet50 V1 FPN 640 tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8 TensorFlow Model Garden link

EfficientDet D0 512 tensorflow-od1-ssd-efficientdet-d0-512x512-coco17-tpu-8 TensorFlow Model Garden link

EfficientDet D1 640 tensorflow-od1-ssd-efficientdet-d1-640x640-coco17-tpu-8 TensorFlow Model Garden link

EfficientDet D2 768 tensorflow-od1-ssd-efficientdet-d2-768x768-coco17-tpu-8 TensorFlow Model Garden link

EfficientDet D3 896 tensorflow-od1-ssd-efficientdet-d3-896x896-coco17-tpu-32 TensorFlow Model Garden link

MobileNet V1 FPN 640 tensorflow-od1-ssd-mobilenet-v1-fpn-640x640-coco17-tpu-8 TensorFlow Model Garden link

MobileNet V2 FPNLite 320 tensorflow-od1-ssd-mobilenet-v2-fpnlite-320x320-coco17-tpu-8 TensorFlow Model Garden link

MobileNet V2 FPNLite 640 tensorflow-od1-ssd-mobilenet-v2-fpnlite-640x640-coco17-tpu-8 TensorFlow Model Garden link

ResNet50 V1 FPN 1024 tensorflow-od1-ssd-resnet50-v1-fpn-1024x1024-coco17-tpu-8 TensorFlow Model Garden link

ResNet101 V1 FPN 640 tensorflow-od1-ssd-resnet101-v1-fpn-640x640-coco17-tpu-8 TensorFlow Model Garden link

ResNet101 V1 FPN 1024 tensorflow-od1-ssd-resnet101-v1-fpn-1024x1024-coco17-tpu-8 TensorFlow Model Garden link

ResNet152 V1 FPN 640 tensorflow-od1-ssd-resnet152-v1-fpn-640x640-coco17-tpu-8 TensorFlow Model Garden link

ResNet152 V1 FPN 1024 tensorflow-od1-ssd-resnet152-v1-fpn-1024x1024-coco17-tpu-8 TensorFlow Model Garden link

Object Detection - TensorFlow Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The
following hyperparameters are supported by the Amazon SageMaker built-in Object Detection -
TensorFlow algorithm. See Tune an Object Detection - TensorFlow model (p. 1548) for information on
hyperparameter tuning.

Parameter Name Description

batch_size The batch size for training.

Valid values: positive integer.

Default value: 3.

beta_1 The beta1 for the "adam" optimizer. Represents the exponential
decay rate for the first moment estimates. Ignored for other
optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.9.

beta_2 The beta2 for the "adam" optimizer. Represents the exponential
decay rate for the second moment estimates. Ignored for other
optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.999.

early_stopping Set to "True" to use early stopping logic during training. If "False", early stopping is not used.

Valid values: string, either: ("True" or "False").

Default value: "False".

early_stopping_min_delta The minimum change needed to qualify as an improvement. An absolute change less than the value of early_stopping_min_delta does not qualify as improvement. Used only when early_stopping is set to "True".

Valid values: float, range: [0.0, 1.0].

Default value: 0.0.

early_stopping_patience The number of epochs to continue training with no improvement. Used only when early_stopping is set to "True".

Valid values: positive integer.

Default value: 5.

epochs The number of training epochs.

Valid values: positive integer.

Default value: 5 for smaller models, 1 for larger models.

epsilon The epsilon for "adam", "rmsprop", "adadelta", and "adagrad" optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 1e-7.

initial_accumulator_value The starting value for the accumulators, or the per-parameter momentum values, for the "adagrad" optimizer. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.1.

learning_rate The optimizer learning rate.

Valid values: float, range: [0.0, 1.0].

Default value: 0.001.

momentum The momentum for the "sgd" and "nesterov" optimizers. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.9.


optimizer The optimizer type. For more information, see Optimizers in the
TensorFlow documentation.

Valid values: string, any of the following: ("adam", "sgd", "nesterov", "rmsprop", "adagrad", or "adadelta").

Default value: "adam".

reinitialize_top_layer If set to "Auto", the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to "True".

Valid values: string, any of the following: ("Auto", "True", or "False").

Default value: "Auto".

rho The discounting factor for the gradient of the "adadelta" and
"rmsprop" optimizers. Ignored for other optimizers.

Valid values: float, range: [0.0, 1.0].

Default value: 0.95.

train_only_on_top_layer If "True", only the top classification layer parameters are fine-tuned. If "False", all model parameters are fine-tuned.

Valid values: string, either: ("True" or "False").

Default value: "False".

Tune an Object Detection - TensorFlow model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see Perform Automatic Model Tuning with
SageMaker (p. 1612).

Metrics computed by the Object Detection - TensorFlow algorithm

Refer to the following chart to find which metrics are computed by the Object Detection - TensorFlow
algorithm.

Metric Name Description Optimization Direction Regex Pattern

validation:localization_loss The localization loss for box prediction. Minimize Val_localization=([0-9\.]+)


Tunable Object Detection - TensorFlow hyperparameters


Tune an object detection model with the following hyperparameters. The hyperparameters that have
the greatest impact on object detection objective metrics are: batch_size, learning_rate, and
optimizer. Tune the optimizer-related hyperparameters, such as momentum, regularizers_l2,
beta_1, beta_2, and eps based on the selected optimizer. For example, use beta_1 and beta_2
only when adam is the optimizer.

For more information about which hyperparameters are used for each optimizer, see Object Detection
- TensorFlow Hyperparameters (p. 1546).

Parameter Name Parameter Type Recommended Ranges

batch_size IntegerParameterRanges MinValue: 8, MaxValue: 512

beta_1 ContinuousParameterRanges MinValue: 1e-6, MaxValue: 0.999

beta_2 ContinuousParameterRanges MinValue: 1e-6, MaxValue: 0.999

eps ContinuousParameterRanges MinValue: 1e-8, MaxValue: 1.0

learning_rate ContinuousParameterRanges MinValue: 1e-6, MaxValue: 0.5

momentum ContinuousParameterRanges MinValue: 0.0, MaxValue: 0.999

optimizer CategoricalParameterRanges ['sgd', 'adam', 'rmsprop', 'nesterov', 'adagrad', 'adadelta']

regularizers_l2 ContinuousParameterRanges MinValue: 0.0, MaxValue: 0.999

train_only_on_top_layer CategoricalParameterRanges ['True', 'False']

initial_accumulator_value ContinuousParameterRanges MinValue: 0.0, MaxValue: 0.999

Semantic Segmentation Algorithm


The SageMaker semantic segmentation algorithm provides a fine-grained, pixel-level approach to
developing computer vision applications. It tags every pixel in an image with a class label from a
predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an
increasing number of computer vision applications, such as self-driving vehicles, medical imaging
diagnostics, and robot sensing.

For comparison, the SageMaker Image Classification - MXNet (p. 1506) is a supervised learning algorithm
that analyzes only whole images, classifying them into one of multiple output categories. The Object
Detection - MXNet (p. 1530) is a supervised learning algorithm that detects and classifies all instances of
an object in an image. It indicates the location and scale of each object in the image with a rectangular
bounding box.

Because the semantic segmentation algorithm classifies every pixel in an image, it also provides
information about the shapes of the objects contained in the image. The segmentation output is


represented as a grayscale image, called a segmentation mask. A segmentation mask is a grayscale image
with the same shape as the input image.

The SageMaker semantic segmentation algorithm is built using the MXNet Gluon framework and
the Gluon CV toolkit. It provides you with a choice of three built-in algorithms to train a deep neural
network. You can use the Fully-Convolutional Network (FCN) algorithm, Pyramid Scene Parsing (PSP)
algorithm, or DeepLabV3.

Each of the three algorithms has two distinct components:

• The backbone (or encoder)—A network that produces reliable activation maps of features.
• The decoder—A network that constructs the segmentation mask from the encoded activation maps.

You also have a choice of backbones for the FCN, PSP, and DeepLabV3 algorithms: ResNet50 or
ResNet101. These backbones include pretrained artifacts that were originally trained on the ImageNet
classification task. You can fine-tune these backbones for segmentation using your own data. Or, you
can initialize and train these networks from scratch using only your own data. The decoders are never
pretrained.

To deploy the trained model for inference, use the SageMaker hosting service. During inference, you can
request the segmentation mask either as a PNG image or as a set of probabilities for each class for each
pixel. You can use these masks as part of a larger pipeline that includes additional downstream image
processing or other applications.

Topics
• Semantic Segmentation Sample Notebooks (p. 1550)
• Input/Output Interface for the Semantic Segmentation Algorithm (p. 1550)
• EC2 Instance Recommendation for the Semantic Segmentation Algorithm (p. 1553)
• Semantic Segmentation Hyperparameters (p. 1553)
• Tuning a Semantic Segmentation Model (p. 1558)

Semantic Segmentation Sample Notebooks

For a sample Jupyter notebook that uses the SageMaker semantic segmentation algorithm to train a
model and deploy it to perform inferences, see the Semantic Segmentation Example. For instructions on
how to create and access Jupyter notebook instances that you can use to run the example in SageMaker,
see Amazon SageMaker Notebook Instances (p. 204).

To see a list of all of the SageMaker samples, create and open a notebook instance, and choose
the SageMaker Examples tab. The example semantic segmentation notebooks are located under
Introduction to Amazon algorithms. To open a notebook, choose its Use tab, and choose Create copy.

Input/Output Interface for the Semantic Segmentation Algorithm

SageMaker semantic segmentation expects the customer's training dataset to be on Amazon Simple
Storage Service (Amazon S3). Once trained, it produces the resulting model artifacts on Amazon S3. The
input interface format for the SageMaker semantic segmentation is similar to that of most standardized
semantic segmentation benchmarking datasets. The dataset in Amazon S3 is expected to be presented
in two channels, one for train and one for validation using four directories, two for images and two
for annotations. Annotations are expected to be uncompressed PNG images. The dataset might also have
a label map that describes how the annotation mappings are established. If not, the algorithm uses a
default. It also supports the augmented manifest image format (application/x-image) for training
in Pipe input mode straight from Amazon S3. For inference, an endpoint accepts images with an image/jpeg content type.


How Training Works


The training data is split into four directories: train, train_annotation, validation, and
validation_annotation. There is a channel for each of these directories. The dataset is also expected to have one label_map.json file per channel for train_annotation and validation_annotation respectively. If you don't provide these JSON files, SageMaker provides a default label map.

The dataset specifying these files should look similar to the following example:

s3://bucket_name
|
|- train
|
| - 0000.jpg
| - coffee.jpg
|- validation
|
| - 00a0.jpg
| - banana.jpg
|- train_annotation
|
| - 0000.png
| - coffee.png
|- validation_annotation
|
| - 00a0.png
| - banana.png
|- label_map
| - train_label_map.json
| - validation_label_map.json

Every JPG image in the train and validation directories has a corresponding PNG label image with
the same name in the train_annotation and validation_annotation directories. This naming
convention helps the algorithm to associate a label with its corresponding image during training. The
train, train_annotation, validation, and validation_annotation channels are mandatory.
The annotations are single-channel PNG images. The format works as long as the metadata (modes) in
the image helps the algorithm read the annotation images into a single-channel 8-bit unsigned integer.
For more information on our support for modes, see the Python Image Library documentation. We
recommend using the 8-bit pixel, true color P mode.

With these modes, each encoded pixel value is a simple 8-bit integer. To get from these pixel values to actual labels, the algorithm uses one mapping file per channel, called the label map. The label map maps the values in the image to actual label indices. In the default label map, which is provided if you don't supply one, the pixel value in an annotation matrix (image) directly indexes the label. These images can be grayscale PNG files or 8-bit indexed PNG files. The label map file for the
unscaled default case is the following:

{
"scale": "1"
}

To provide some contrast for viewing, some annotation software scales the label images by a constant
amount. To support this, the SageMaker semantic segmentation algorithm provides a rescaling option
to scale down the values to actual label values. When scaling down doesn’t convert the value to an
appropriate integer, the algorithm defaults to the greatest integer less than or equal to the scaled value.
The following code shows how to set the scale value to rescale the label values:

{
"scale": "3"
}


The following example shows how this "scale" value is used to rescale the encoded_label values of
the input annotation image when they are mapped to the mapped_label values to be used in training.
The label values in the input annotation image are 0, 3, 6, with scale 3, so they are mapped to 0, 1, 2 for
training:

encoded_label = [0, 3, 6]
mapped_label = [0, 1, 2]

In some cases, you might need to specify a particular color mapping for each class. Use the map option
in the label mapping as shown in the following example of a label_map file:

{
"map": {
"0": 5,
"1": 0,
"2": 2
}
}

The label mapping for this example is:

encoded_label = [0, 5, 2]
mapped_label = [1, 0, 2]

With label mappings, you can use different annotation systems and annotation software to obtain data
without a lot of preprocessing. You can provide one label map per channel. The files for a label map in
the label_map channel must follow the naming conventions for the four directory structure. If you
don't provide a label map, the algorithm assumes a scale of 1 (the default).
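
To make the scale mapping concrete, the following sketch applies a scale-style label map to an annotation image with NumPy and Pillow; the file name is illustrative, and both libraries are assumed to be available.

import numpy as np
from PIL import Image

scale = 3  # from a label_map file such as {"scale": "3"}

# Read the single-channel annotation PNG as 8-bit unsigned integers
encoded = np.array(Image.open("0000.png"), dtype=np.uint8)

# Integer division maps encoded values such as [0, 3, 6] to labels [0, 1, 2];
# values that don't divide evenly fall back to the floor of the scaled value
mapped = encoded // scale

print(np.unique(encoded), "->", np.unique(mapped))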

Training with the Augmented Manifest Format


The augmented manifest format enables you to do training in Pipe mode using image files without
needing to create RecordIO files. The augmented manifest file contains data objects and should be in
JSON Lines format, as described in the CreateTrainingJob request. Each line in the manifest is an
entry containing the Amazon S3 URI for the image and the URI for the annotation image.

Each JSON object in the manifest file must contain a source-ref key. The source-ref key
should contain the value of the Amazon S3 URI to the image. The labels are provided under the
AttributeNames parameter value as specified in the CreateTrainingJob request. It can also contain
additional metadata under the metadata tag, but these are ignored by the algorithm. In the example
below, the AttributeNames are contained in the list of image and annotation references ["source-ref", "city-streets-ref"]. These names must have -ref appended to them. When using the Semantic Segmentation algorithm with Augmented Manifest, the value of the RecordWrapperType parameter must be "RecordIO" and the value of the ContentType parameter must be application/x-recordio.

{"source-ref": "S3 bucket location", "city-streets-ref": "S3 bucket location", "city-


streets-metadata": {"job-name": "label-city-streets", }}

For more information on augmented manifest files, see Provide Dataset Metadata to Training Jobs with
an Augmented Manifest File (p. 2138).

Incremental Training
You can also seed the training of a new model with a model that you trained previously using SageMaker.
This incremental training saves training time when you want to train a new model with the same or
similar data. Currently, incremental training is supported only for models trained with the built-in
SageMaker Semantic Segmentation algorithm.


To use your own pre-trained model, specify the ChannelName as "model" in the InputDataConfig for
the CreateTrainingJob request. Set the ContentType for the model channel to application/x-
sagemaker-model. The backbone, algorithm, crop_size, and num_classes input parameters that
define the network architecture must be consistently specified in the input hyperparameters of the new
model and the pre-trained model that you upload to the model channel. For the pretrained model file,
you can use the compressed (.tar.gz) artifacts from SageMaker outputs. You can only use Image formats
for input data. For more information on incremental training and for instructions on how to use it, see
Incremental Training in Amazon SageMaker (p. 2113).
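
As a sketch of how this might look with the SageMaker Python SDK, the Estimator class accepts a model_uri and model_channel_name; the role ARN, S3 URI, and instance type below are placeholders.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("semantic-segmentation", session.boto_region_name)

# Placeholders: replace the role ARN and S3 URI with your own values.
estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    model_uri="s3://my-bucket/previous-job/output/model.tar.gz",  # pre-trained artifacts
    model_channel_name="model",
    sagemaker_session=session,
)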

Produce Inferences
To query a trained model that is deployed to an endpoint, you need to provide an image and an
AcceptType that denotes the type of output required. The endpoint takes JPEG images with an
image/jpeg content type. If you request an AcceptType of image/png, the algorithm outputs a PNG
file with a segmentation mask in the same format as the labels themselves. If you request an accept
type of application/x-recordio-protobuf, the algorithm returns class probabilities encoded in
recordio-protobuf format. The latter format outputs a 3D tensor whose third dimension is the same
size as the number of classes, and each element denotes the probability of a class label for each pixel.
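
For example, you can query the endpoint with the AWS SDK for Python (Boto3); the endpoint name and image path in this sketch are placeholders.

import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholders: use your own endpoint name and input image.
with open("street_scene.jpg", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="my-semantic-segmentation-endpoint",
    ContentType="image/jpeg",
    Accept="image/png",  # or "application/x-recordio-protobuf" for class probabilities
    Body=payload,
)
mask_png = response["Body"].read()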

EC2 Instance Recommendation for the Semantic Segmentation Algorithm


The SageMaker semantic segmentation algorithm only supports GPU instances for training, and we
recommend using GPU instances with more memory for training with large batch sizes. The algorithm
can be trained using P2, P3, G4dn, or G5 instances in single machine configurations.

For inference, you can use either CPU instances (such as C5 and M5) or GPU instances (such as P3 and
G4dn), or both. For information about the instance types that provide varying combinations of CPU, GPU,
memory, and networking capacity for inference, see Amazon SageMaker ML Instance Types.

Semantic Segmentation Hyperparameters


The following tables list the hyperparameters supported by the Amazon SageMaker semantic
segmentation algorithm for network architecture, data inputs, and training. You specify Semantic
Segmentation for training in the AlgorithmName of the CreateTrainingJob request.
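
As a point of reference, the following sketch shows one way to set these hyperparameters through the SageMaker Python SDK; the role ARN, region, and values shown are placeholders.

from sagemaker import image_uris
from sagemaker.estimator import Estimator

image = image_uris.retrieve("semantic-segmentation", "us-west-2")

# Placeholders: replace the role ARN and hyperparameter values with your own.
estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)
estimator.set_hyperparameters(
    backbone="resnet-50",
    algorithm="fcn",
    num_classes=2,
    num_training_samples=1000,
    epochs=10,
    learning_rate=0.001,
)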

Network Architecture Hyperparameters

Parameter Name Description

backbone The backbone to use for the algorithm's encoder component.

Optional

Valid values: resnet-50, resnet-101

Default value: resnet-50

use_pretrained_model Whether a pretrained model is to be used for the backbone.

Optional

Valid values: True, False

Default value: True

algorithm The algorithm to use for semantic segmentation.

Optional

Valid values:

• fcn: Fully-Convolutional Network (FCN) algorithm
• psp: Pyramid Scene Parsing (PSP) algorithm
• deeplab: DeepLab V3 algorithm

Default value: fcn

Data Hyperparameters

Parameter Name Description

num_classes The number of classes to segment.

Required

Valid values: 2 ≤ positive integer ≤ 254

num_training_samples The number of samples in the training data. The algorithm uses this value
to set up the learning rate scheduler.

Required

Valid values: positive integer

base_size Defines how images are rescaled before cropping. Images are rescaled
such that the long side length is set to base_size multiplied by a
random number from 0.5 to 2.0, and the short side is computed to
preserve the aspect ratio.

Optional

Valid values: positive integer > 16

Default value: 520

crop_size The image size for input during training. We randomly rescale the input
image based on base_size, and then take a random square crop with
side length equal to crop_size. The crop_size will be automatically
rounded up to multiples of 8.

Optional

Valid values: positive integer > 16

Default value: 240

Training Hyperparameters

Parameter Name Description

early_stopping Whether to use early stopping logic during training.

Optional

Valid values: True, False

Default value: False

early_stopping_min_epochs The minimum number of epochs that must be run.

Optional

Valid values: integer

Default value: 5

early_stopping_patience The number of epochs that meet the tolerance for lower performance before the algorithm enforces an early stop.

Optional

Valid values: integer

Default value: 4

early_stopping_tolerance If the relative improvement of the score of the training job, the mIOU, is smaller than this value, early stopping considers the epoch as not improved. This is used only when early_stopping = True.

Optional

Valid values: 0 ≤ float ≤ 1

Default value: 0.0

epochs The number of epochs with which to train.

Optional

Valid values: positive integer

Default value: 10

gamma1 The decay factor for the moving average of the squared gradient for
rmsprop. Used only for rmsprop.

Optional

Valid values: 0 ≤ float ≤ 1

Default value: 0.9

gamma2 The momentum factor for rmsprop.

Optional

Valid values: 0 ≤ float ≤ 1

Default value: 0.9


learning_rate The initial learning rate.

Optional

Valid values: 0 < float ≤ 1

Default value: 0.001

lr_scheduler The shape of the learning rate schedule that controls its decrease over
time.

Optional

Valid values:

• step: A stepwise decay, where the learning rate is reduced (multiplied) by the lr_scheduler_factor after epochs specified by lr_scheduler_step.
• poly: A smooth decay using a polynomial function.
• cosine: A smooth decay using a cosine function.

Default value: poly

lr_scheduler_factor If lr_scheduler is set to step, the ratio by which to reduce (multiply) the learning_rate after each of the epochs specified by lr_scheduler_step. Otherwise, ignored.

Optional

Valid values: 0 ≤ float ≤ 1

Default value: 0.1

lr_scheduler_step A comma-delimited list of the epochs after which the learning_rate is reduced (multiplied) by an lr_scheduler_factor. For example, if the value is set to "10, 20", then the learning_rate is reduced by lr_scheduler_factor after the 10th epoch and again by this factor after the 20th epoch.

Conditionally Required if lr_scheduler is set to step. Otherwise, ignored.

Valid values: string

Default value: (No default, as the value is required when used.)

mini_batch_size The batch size for training. Using a large mini_batch_size usually
results in faster training, but it might cause you to run out of memory.
Memory usage is affected by the values of the mini_batch_size and
image_shape parameters, and the backbone architecture.

Optional

Valid values: positive integer

Default value: 16


momentum The momentum for the sgd optimizer. When you use other optimizers,
the semantic segmentation algorithm ignores this parameter.

Optional

Valid values: 0 < float ≤ 1

Default value: 0.9

optimizer The type of optimizer. For more information about an optimizer, choose
the appropriate link:

• adam: Adaptive momentum estimation
• adagrad: Adaptive gradient descent
• nag: Nesterov accelerated gradient
• rmsprop: Root mean square propagation
• sgd: Stochastic gradient descent

Optional

Valid values: adam, adagrad, nag, rmsprop, sgd

Default value: sgd

syncbn If set to True, the batch normalization mean and variance are computed
over all the samples processed across the GPUs.

Optional

Valid values: True, False

Default value: False

validation_mini_batch_size The batch size for validation. A large mini_batch_size usually results in faster training, but it might cause you to run out of memory. Memory usage is affected by the values of the mini_batch_size and image_shape parameters, and the backbone architecture.

• To score the validation on the entire image without cropping the images, set this parameter to 1. Use this option if you want to measure performance on the entire image as a whole.
  Note
  Setting the validation_mini_batch_size parameter to 1 causes the algorithm to create a new network model for every image. This might slow validation and training.
• To crop images to the size specified in the crop_size parameter, even during evaluation, set this parameter to a value greater than 1.

Optional

Valid values: positive integer

Default value: 16


weight_decay The weight decay coefficient for the sgd optimizer. When you use other
optimizers, the algorithm ignores this parameter.

Optional

Valid values: 0 < float < 1

Default value: 0.0001

Tuning a Semantic Segmentation Model

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by
running many jobs that test a range of hyperparameters on your dataset. You choose the tunable
hyperparameters, a range of values for each, and an objective metric. You choose the objective metric
from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters
chosen to find the combination of values that result in the model that optimizes the objective metric.

Metrics Computed by the Semantic Segmentation Algorithm

The semantic segmentation algorithm reports two validation metrics. When tuning hyperparameter
values, choose one of these metrics as the objective.

Metric Name Description Optimization Direction

validation:mIOU The area of the intersection of the predicted segmentation and the ground truth divided by the area of their union for images in the validation set. Also known as the Jaccard Index. Maximize

validation:pixel_accuracy The percentage of pixels that are correctly classified in images from the validation set. Maximize

Tunable Semantic Segmentation Hyperparameters

You can tune the following hyperparameters for the semantic segmentation algorithm.

Parameter Name Parameter Type Recommended Ranges

learning_rate ContinuousParameterRange MinValue: 1e-4, MaxValue: 1e-1

mini_batch_size IntegerParameterRanges MinValue: 1, MaxValue: 128

momentum ContinuousParameterRange MinValue: 0.9, MaxValue: 0.999

optimizer CategoricalParameterRanges ['sgd', 'adam', 'adadelta']

weight_decay ContinuousParameterRange MinValue: 1e-5, MaxValue: 1e-3
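
A minimal tuning sketch with the SageMaker Python SDK follows; it assumes an estimator configured for the built-in algorithm as shown earlier, and the channel URIs, job limits, and categorical values are placeholders.

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# `estimator` is assumed to be a semantic segmentation estimator as shown
# earlier; the ranges follow the recommendations in the preceding table.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:mIOU",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "mini_batch_size": IntegerParameter(1, 128),
        "momentum": ContinuousParameter(0.9, 0.999),
        "optimizer": CategoricalParameter(["sgd", "adam"]),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

# Placeholder URIs for the four channels expected by the built-in algorithm.
tuner.fit({
    "train": "s3://my-bucket/train",
    "validation": "s3://my-bucket/validation",
    "train_annotation": "s3://my-bucket/train_annotation",
    "validation_annotation": "s3://my-bucket/validation_annotation",
})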


Use Reinforcement Learning with Amazon SageMaker


Reinforcement learning (RL) combines fields such as computer science, neuroscience, and psychology
to determine how to map situations to actions to maximize a numerical reward signal. This notion of a
reward signal in RL stems from neuroscience research into how the human brain makes decisions about
which actions maximize reward and minimize punishment. In most situations, humans are not given
explicit instructions on which actions to take, but instead must learn both which actions yield the most
immediate rewards, and how those actions influence future situations and consequences.

The problem of RL is formalized using Markov decision processes (MDPs) that originate from dynamical
systems theory. MDPs aim to capture high-level details of a real problem that a learning agent
encounters over some period of time in attempting to achieve some ultimate goal. The learning agent
should be able to determine the current state of its environment and identify possible actions that
affect the learning agent’s current state. Furthermore, the learning agent’s goals should correlate
strongly to the state of the environment. A solution to a problem formulated in this way is known as a
reinforcement learning method.

What are the differences between reinforcement, supervised,


and unsupervised learning paradigms?
Machine learning can be divided into three distinct learning paradigms: supervised, unsupervised, and
reinforcement.

In supervised learning, an external supervisor provides a training set of labeled examples. Each example
contains information about a situation, belongs to a category, and has a label identifying the category
to which it belongs. The goal of supervised learning is to generalize in order to predict correctly in
situations that are not present in the training data.

In contrast, RL deals with interactive problems, making it infeasible to gather all possible examples of
situations with correct labels that an agent might encounter. This type of learning is most promising
when an agent is able to accurately learn from its own experience and adjust accordingly.

In unsupervised learning, an agent learns by uncovering structure within unlabeled data. While an RL
agent might benefit from uncovering structure based on its experiences, the sole purpose of RL is to
maximize a reward signal.

Topics
• Why is Reinforcement Learning Important? (p. 1559)
• Markov Decision Process (MDP) (p. 1560)
• Key Features of Amazon SageMaker RL (p. 1560)
• Reinforcement Learning Sample Notebooks (p. 1562)
• Sample RL Workflow Using Amazon SageMaker RL (p. 1562)
• RL Environments in Amazon SageMaker (p. 1563)
• Distributed Training with Amazon SageMaker RL (p. 1565)
• Hyperparameter Tuning with Amazon SageMaker RL (p. 1565)

Why is Reinforcement Learning Important?


RL is well-suited for solving large, complex problems, such as supply chain management, HVAC systems,
industrial robotics, game artificial intelligence, dialog systems, and autonomous vehicles. Because RL
models learn by a continuous process of receiving rewards and punishments for every action taken
by the agent, it is possible to train systems to make decisions under uncertainty and in dynamic
environments.


Markov Decision Process (MDP)


RL is based on models called Markov Decision Processes (MDPs). An MDP consists of a series of time
steps. Each time step consists of the following:

Environment

Defines the space in which the RL model operates. This can be either a real-world environment or
a simulator. For example, if you train a physical autonomous vehicle on a physical road, that would
be a real-world environment. If you train a computer program that models an autonomous vehicle
driving on a road, that would be a simulator.
State

Specifies all information about the environment and past steps that is relevant to the future. For
example, in an RL model in which a robot can move in any direction at any time step, the position
of the robot at the current time step is the state, because if we know where the robot is, it isn't
necessary to know the steps it took to get there.
Action

What the agent does. For example, the robot takes a step forward.
Reward

A number that represents the value of the state that resulted from the last action that the agent
took. For example, if the goal is for a robot to find treasure, the reward for finding treasure might be
5, and the reward for not finding treasure might be 0. The RL model attempts to find a strategy that
optimizes the cumulative reward over the long term. This strategy is called a policy.
Observation

Information about the state of the environment that is available to the agent at each step. This
might be the entire state, or it might be just a part of the state. For example, the agent in a chess-
playing model would be able to observe the entire state of the board at any step, but a robot in a
maze might only be able to observe a small portion of the maze that it currently occupies.

Typically, training in RL consists of many episodes. An episode consists of all of the time steps in an MDP
from the initial state until the environment reaches the terminal state.

Key Features of Amazon SageMaker RL


To train RL models in SageMaker RL, use the following components:

• A deep learning (DL) framework. Currently, SageMaker supports RL in TensorFlow and Apache MXNet.
• An RL toolkit. An RL toolkit manages the interaction between the agent and the environment and
provides a wide selection of state-of-the-art RL algorithms. SageMaker supports the Intel Coach and
Ray RLlib toolkits. For information about Intel Coach, see https://nervanasystems.github.io/coach/.
For information about Ray RLlib, see https://ray.readthedocs.io/en/latest/rllib.html.
• An RL environment. You can use custom environments, open-source environments, or commercial
environments. For information, see RL Environments in Amazon SageMaker (p. 1563).

The following diagram shows the RL components that are supported in SageMaker RL.


Reinforcement Learning Sample Notebooks


The following table outlines a variety of sample notebooks that address different use cases of Amazon
SageMaker reinforcement learning.

Notebook Title Description

How to Train Batch RL Policies? This notebook shows how to use batch RL to train
a new policy from an offline dataset.

How to Solve the Cart-pole Balancing Problem? This notebook shows how to solve the cart-pole
balancing problem with RL.

How to Solve the Knapsack Problem? This notebook shows how to use RL to solve the
knapsack problem, and how SageMaker Managed
Spot Training can be used to run training at a
lower cost.

How to Solve the Mountain Car Problem? This notebook shows how to solve the mountain
car control problem with RL.

Sample RL Workflow Using Amazon SageMaker RL


The following example describes the steps for developing RL models using Amazon SageMaker RL.

For complete code examples, see the sample notebooks at
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement-learning.

1. Formulate the RL problem—First, formulate the business problem into an RL problem. For example,
auto scaling enables services to dynamically increase or decrease capacity depending on conditions
that you define. Currently, this requires setting up alarms, scaling policies, thresholds, and other
manual steps. To solve this with RL, we define the components of the Markov Decision Process:

a. Objective—Scale instance capacity so that it matches the desired load profile.


b. Environment—A custom environment that includes the load profile. It generates a simulated
load with daily and weekly variations and occasional spikes. The simulated system has a delay
between when new resources are requested and when they become available for serving
requests.
c. State—The current load, number of failed jobs, and number of active machines.
d. Action—Remove, add, or keep the same number of instances.
e. Reward—A positive reward for successful transactions and a high penalty for failing
transactions beyond a specified threshold.
2. Define the RL environment—The RL environment can be the real world where the RL agent
interacts or a simulation of the real world. You can connect open source and custom environments
developed using Gym interfaces and commercial simulation environments such as MATLAB and
Simulink.
3. Define the presets—The presets configure the RL training jobs and define the hyperparameters for
the RL algorithms.
4. Write the training code—Write training code as a Python script and pass the script to a SageMaker
training job. In your training code, import the environment files and the preset files, and then define
the main() function.
5. Train the RL Model—Use the SageMaker RLEstimator in the Amazon SageMaker Python SDK to
start an RL training job. If you are using local mode, the training job runs on the notebook instance.


When you use SageMaker for training, you can select GPU or CPU instances. Store the output from
the training job in a local directory if you train in local mode, or on Amazon S3 if you use SageMaker
training.

The RLEstimator requires the following information as parameters.

a. The source directory where the environment, presets, and training code are uploaded.
b. The path to the training script.
c. The RL toolkit and deep learning framework you want to use. This automatically resolves to the
Amazon ECR path for the RL container.
d. The training parameters, such as the instance count, job name, and S3 path for output.
e. Metric definitions that you want to capture in your logs. These can also be visualized in
CloudWatch and in SageMaker notebooks.
6. Visualize training metrics and output—After a training job that uses an RL model completes, you
can view the metrics that you defined for the training jobs in CloudWatch. You can also plot the metrics in
a notebook by using the Amazon SageMaker Python SDK analytics library. Visualizing metrics helps
you understand how the performance of the model, as measured by the reward, improves over time.
Note
If you train in local mode, you can't visualize metrics in CloudWatch.
7. Evaluate the model—Checkpointed data from the previously trained models can be passed on
for evaluation and inference in the checkpoint channel. In local mode, use the local directory. In
SageMaker training mode, you need to upload the data to S3 first.
8. Deploy RL models—Finally, deploy the trained model on an endpoint hosted on SageMaker
containers or on an edge device by using AWS IoT Greengrass.

For more information on RL with SageMaker, see Using RL with the SageMaker Python SDK.
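
A minimal sketch of step 5 with the RLEstimator follows; the entry point script, source directory, toolkit version, role ARN, and output path are placeholders.

from sagemaker.rl import RLEstimator, RLFramework, RLToolkit

# Placeholders: script name, source directory, toolkit version, role, and S3 path.
estimator = RLEstimator(
    entry_point="train-coach.py",
    source_dir="src",
    toolkit=RLToolkit.COACH,
    toolkit_version="0.11.1",
    framework=RLFramework.TENSORFLOW,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/rl-output",
)
estimator.fit()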

RL Environments in Amazon SageMaker


Amazon SageMaker RL uses environments to mimic real-world scenarios. Given the current state of the
environment and an action taken by the agent or agents, the simulator processes the impact of the
action, and returns the next state and a reward. Simulators are useful in cases where it is not safe to
train an agent in the real world (for example, flying a drone) or if the RL algorithm takes a long time to
converge (for example, when playing chess).

The following diagram shows an example of the interactions with a simulator for a car racing game.


The simulation environment consists of an agent and a simulator. Here, a convolutional neural network
(CNN) consumes images from the simulator and generates actions to control the game controller. With
multiple simulations, this environment generates training data of the form state_t, action, state_t+1,
and reward_t+1. Defining the reward is not trivial and impacts the RL model quality. We want to
provide a few examples of reward functions, but would like to make it user-configurable.

Topics
• Use OpenAI Gym Interface for Environments in SageMaker RL (p. 1564)
• Use Open-Source Environments (p. 1565)
• Use Commercial Environments (p. 1565)

Use OpenAI Gym Interface for Environments in SageMaker RL


To use OpenAI Gym environments in SageMaker RL, use the following API elements. For more
information about OpenAI Gym, see https://gym.openai.com/docs/.

• env.action_space—Defines the actions the agent can take, specifies whether each action is
continuous or discrete, and specifies the minimum and maximum if the action is continuous.
• env.observation_space—Defines the observations the agent receives from the environment, as
well as minimum and maximum for continuous observations.
• env.reset()—Initializes a training episode. The reset() function returns the initial state of the
environment, and the agent uses the initial state to take its first action. The action is then sent to
step() repeatedly until the episode reaches a terminal state. When step() returns done = True,
the episode ends. The RL toolkit re-initializes the environment by calling reset().
• step()—Takes the agent action as input and outputs the next state of the environment, the reward,
whether the episode has terminated, and an info dictionary to communicate debugging information.
It is the responsibility of the environment to validate the inputs.
• env.render()—Used for environments that have visualization. The RL toolkit calls this function to
capture visualizations of the environment after each call to the step() function.
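
To illustrate the interface, the following is a minimal custom environment sketch using the classic Gym API described above; it assumes the gym package is installed, and the corridor dynamics, spaces, and reward are illustrative.

import gym
import numpy as np
from gym import spaces


class SimpleCorridorEnv(gym.Env):
    """Illustrative environment: the agent walks right until it reaches the end."""

    def __init__(self, length=10):
        self.length = length
        self.position = 0
        self.action_space = spaces.Discrete(2)  # 0 = stay, 1 = step right
        self.observation_space = spaces.Box(
            low=0.0, high=float(length), shape=(1,), dtype=np.float32
        )

    def reset(self):
        # Initializes an episode and returns the initial state.
        self.position = 0
        return np.array([self.position], dtype=np.float32)

    def step(self, action):
        # Returns the next state, the reward, whether the episode has
        # terminated, and an info dictionary for debugging.
        self.position += int(action)
        done = self.position >= self.length
        reward = 1.0 if done else 0.0
        return np.array([self.position], dtype=np.float32), reward, done, {}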


Use Open-Source Environments


You can use open-source environments, such as EnergyPlus and RoboSchool, in SageMaker RL by
building your own container. For more information about EnergyPlus, see https://energyplus.net/.
For more information about RoboSchool, see https://github.com/openai/roboschool. The HVAC and
RoboSchool examples in the SageMaker examples repository show how to build a custom container to
use with SageMaker RL.
Use Commercial Environments


You can use commercial environments, such as MATLAB and Simulink, in SageMaker RL by building your
own container. You need to manage your own licenses.

Distributed Training with Amazon SageMaker RL


Amazon SageMaker RL supports multi-core and multi-instance distributed training. Depending on your
use case, training and/or environment rollout can be distributed. For example, SageMaker RL works for
the following distributed scenarios:

• Single training instance and multiple rollout instances of the same instance type. For an example, see
the Neural Network Compression example in the SageMaker examples repository.
• Single trainer instance and multiple rollout instances, where different instance types are used for
training and rollouts. For an example, see the AWS DeepRacer / AWS RoboMaker example in the
SageMaker examples repository.
• Single trainer instance that uses multiple cores for rollout. For an example, see the Roboschool
example in the SageMaker examples repository. This is useful if the simulation environment is light-
weight and can run on a single thread.
• Multiple instances for training and rollouts. For an example, see the Roboschool example in the
SageMaker examples repository.

Hyperparameter Tuning with Amazon SageMaker RL


You can run a hyperparameter tuning job to optimize hyperparameters for Amazon SageMaker RL. The
Roboschool example in the sample notebooks in the SageMaker examples repository shows how you
can do this with RL Coach. The launcher script shows how you can abstract parameters from the Coach
preset file and optimize them.

Run your local code as a SageMaker training job


You can run your local machine learning (ML) Python code as a large single-node Amazon SageMaker
training job or as multiple parallel jobs. You can do this by annotating your code with an @remote
decorator, as shown in the following code example. Distributed training (across multiple instances) is
not supported with remote functions.

@remote(**settings)
def divide(x, y):
    return x / y

The SageMaker Python SDK will automatically translate your existing workspace environment and any
associated data processing code and datasets into a SageMaker training job that runs on the SageMaker
training platform. You can also activate a persistent cache feature, which will further reduce job start
latency by caching previously downloaded dependency packages. This reduction in job latency is greater


than the reduction in latency from using SageMaker managed warm pools alone. For more information,
see Using persistent cache (p. 2121).
Note
Distributed training jobs are not supported by remote functions.

The following sections show how to annotate your local ML code with an @remote decorator and tailor
your experience for your use case. This includes customizing your environment and integrating with
SageMaker Experiments.

Topics
• Set up your environment (p. 1566)
• Invoking a function (p. 1572)
• Configuration file (p. 1576)
• Customize your runtime environment (p. 1577)
• Container image compatibility (p. 1578)
• Logging parameters and metrics with Amazon SageMaker Experiments (p. 1581)
• Using modular code with the @remote decorator (p. 1584)
• Private repository for runtime dependencies (p. 1585)
• Example notebooks (p. 1586)

Set up your environment


Choose one of the following three options to set up your environment.

Run your code from Amazon SageMaker Studio


You can annotate and run your local ML code from SageMaker Studio by creating a SageMaker notebook
and attaching any image available in SageMaker Studio. The following instructions help you
create a SageMaker Notebook, install the SageMaker Python SDK, and annotate your code with the
decorator.

1. Create a SageMaker Notebook and attach an image in SageMaker Studio as follows:


a. Follow the instructions in Launch Amazon SageMaker Studio in the Amazon SageMaker Developer
Guide.
b. Select Studio from the left navigation pane. This opens a new window.
c. In the Get Started dialog box, select a user profile from the down arrow. This opens a new window.
d. Select Open Studio.
e. Select Open Launcher from the main working area. This opens a new page.
f. Select Create notebook from the main working area.
g. Select Base Python 3.0 from the down arrow next to Image in the Change environment dialog
box.

The @remote decorator automatically detects the image attached to the SageMaker Studio
notebook and uses it to run the SageMaker training job. If image_uri is specified either as an
argument in the decorator or in the configuration file, then the value specified in image_uri will
be used instead of the detected image.

For more information about how to create a notebook in SageMaker Studio, see the Create a
Notebook from the File Menu section in Create or Open an Amazon SageMaker Studio Notebook.

For a list of available images, see Supported Docker images.


2. Install the SageMaker Python SDK.


To annotate your code with the @remote function inside a SageMaker Studio Notebook, you must
have the SageMaker Python SDK installed. Install the SageMaker Python SDK, as shown in the
following code example.

!pip install sagemaker

3. Use @remote decorator to run functions in a SageMaker training job.

To run your local ML code, first create a dependencies file to instruct SageMaker where to locate your
local code. To do so, follow these steps:
a. From the SageMaker Studio Launcher main working area, in Utilities and files, choose Text file.
This opens a new tab with a text file called untitled.txt.

For more information about the SageMaker Studio user interface (UI), see Amazon SageMaker
Studio UI Overview.
b. Rename untitled.txt to requirements.txt.
c. Add all the dependencies required for the code along with the SageMaker library to
requirements.txt.

A minimal code example for requirements.txt for the example divide function is provided in
the following section, as follows.

sagemaker

d. Run your code with the remote decorator by passing the dependencies file, as follows.

from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
def divide(x, y):
    return x / y

divide(2, 3.0)

For additional code examples, see the sample notebook quick_start.ipynb.

If you’re already running a SageMaker Studio notebook, and you install the Python SDK as
instructed in 2. Install the SageMaker Python SDK, you must restart your kernel. For more
information, see Use the SageMaker Studio Notebook Toolbar in the Amazon SageMaker Developer
Guide.

Run your code from an Amazon SageMaker notebook


You can annotate your local ML code from a SageMaker notebook instance. The following instructions
show how to create a notebook instance with a custom kernel, install the SageMaker Python SDK, and
annotate your code with the decorator.

1. Create a notebook instance with a custom conda kernel.

You can annotate your local ML code with an @remote decorator to use inside of a SageMaker training
job. First you must create and customize a SageMaker notebook instance to use a kernel with Python
version 3.7 or higher, up to 3.10.x. To do so, follow these steps:
a. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
b. In the left navigation panel, choose Notebook to expand its options.
c. Choose Notebook Instances from the expanded options.

d. Choose the Create Notebook Instance button. This opens a new page.
e. For Notebook instance name, enter a name with a maximum of 63 characters and no spaces. Valid
characters: A-Z, a-z, 0-9, and .:+=@_%- (hyphen).
f. In the Notebook instance settings dialog box, expand the right arrow next to Additional
Configuration.
g. Under Lifecycle configuration - optional, expand the down arrow and select Create a new
lifecycle configuration. This opens a new dialog box.
h. Under Name, enter a name for your configuration setting.
i. In the Scripts dialog box, in the Start notebook tab, replace the existing contents of the text box
with the following script.

#!/bin/bash

set -e

sudo -u ec2-user -i <<'EOF'

unset SUDO_UID
WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda/
source "$WORKING_DIR/miniconda/bin/activate"
for env in $WORKING_DIR/miniconda/envs/*; do
    BASENAME=$(basename "$env")
    source activate "$BASENAME"
    python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
done
EOF

echo "Restarting the Jupyter server.."

# restart command is dependent on current running Amazon Linux and JupyterLab
CURR_VERSION_AL=$(cat /etc/system-release)
CURR_VERSION_JS=$(jupyter --version)

if [[ $CURR_VERSION_JS == *$"jupyter_core : 4.9.1"* ]] && [[ $CURR_VERSION_AL == *$" release 2018"* ]]; then
    sudo initctl restart jupyter-server --no-wait
else
    sudo systemctl --no-block restart jupyter-server.service
fi

j. In the Scripts dialog box, in the Create notebook tab, replace the existing contents of the text box
with the following script.

#!/bin/bash

set -e

sudo -u ec2-user -i <<'EOF'

unset SUDO_UID
# Install a separate conda installation via Miniconda
WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
mkdir -p "$WORKING_DIR"
wget https://repo.anaconda.com/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh -O "$WORKING_DIR/miniconda.sh"
bash "$WORKING_DIR/miniconda.sh" -b -u -p "$WORKING_DIR/miniconda"
rm -rf "$WORKING_DIR/miniconda.sh"
# Create a custom conda environment
source "$WORKING_DIR/miniconda/bin/activate"
KERNEL_NAME="custom_python310"
PYTHON="3.10"
conda create --yes --name "$KERNEL_NAME" python="$PYTHON" pip
conda activate "$KERNEL_NAME"
pip install --quiet ipykernel

# Customize these lines as necessary to install the required packages
EOF

k. Choose the Create configuration button on the bottom right of the window.
l. Choose the Create notebook instance button on the bottom right of the window.
m.Wait for the notebook instance Status to change from Pending to InService.
2. Create a Jupyter notebook in the notebook instance.

The following instructions show how to create a Jupyter notebook using Python 3.10 in your newly
created SageMaker instance.
a. After the notebook instance Status from the previous step is InService, do the following:
i. Select Open Jupyter under Actions in the row containing your newly created notebook instance
Name. This opens a new Jupyter server.
b. In the Jupyter server, select New from the top right menu.
c. From the down arrow, select conda_custom_python310. This creates a new Jupyter notebook that
uses a Python 3.10 kernel. This new Jupyter notebook can now be used similarly to a local Jupyter
notebook.
3. Install the SageMaker Python SDK.

After your virtual environment is running, install the SageMaker Python SDK by using the following
code example.

!pip install sagemaker

4. Use an @remote decorator to run functions in a SageMaker training job.

When you annotate your local ML code with an @remote decorator inside the SageMaker notebook,
SageMaker training will automatically interpret the function of your code and run it as a SageMaker
training job. Set up your notebook by doing the following:
a. Select the kernel name in the notebook menu from the SageMaker notebook instance that you
created in step 1, Create a SageMaker Notebook instance with a custom kernel.

For more information, see Change an Image or a Kernel.


b. From the down arrow, choose a custom conda kernel that uses a version of Python that is 3.7
or higher.

As an example, selecting conda_custom_python310 chooses the kernel for Python 3.10.


c. Choose Select.
d. Wait for the kernel’s status to show as idle, which indicates that the kernel has started.
e. In the Jupyter Server Home, select New from the top right menu.
f. Next to the down arrow, select Text file. This creates a new text file called untitled.txt.
g. Rename untitled.txt to requirements.txt and add any dependencies required for the code
along with sagemaker.
h. Run your code with the remote decorator by passing the dependencies file as shown below.

from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
def divide(x, y):
    return x / y

divide(2, 3.0)

See the sample notebook quick_start.ipynb for additional code examples.

Run your code from within your local IDE


You can annotate your local ML code with an @remote decorator inside your preferred local IDE. The
following steps show the necessary prerequisites, how to install the Python SDK, and how to annotate
your code with the @remote decorator.

1. Install prerequisites by setting up the AWS Command Line Interface (AWS CLI) and creating a role, as
follows:
• Onboard to a SageMaker domain following the instructions in the AWS CLI Prerequisites section of
Set Up Amazon SageMaker Prerequisites.
• Create an IAM role following the Create execution role section of SageMaker Roles.
2. Create a virtual environment by using either PyCharm or conda and using Python version 3.7 or
higher, up to 3.10.x.
• Set up a virtual environment using PyCharm as follows:
a. Select File from the main menu.
b. Choose New Project.
c. Choose Conda from the down arrow under New environment using.
d. In the field for Python version, use the down arrow to select a version of Python that is 3.7 or
higher, up to 3.10.x, from the list.

• If you have Anaconda installed, you can set up a virtual environment using conda, as follows:
• Open an Anaconda prompt terminal interface.
• Create and activate a new conda environment using a Python version of 3.7 or higher, up to
3.10.x. The following code example shows how to create a conda environment using Python
version 3.10.

conda create -n sagemaker_jobs_quick_start python=3.10 pip


conda activate sagemaker_jobs_quick_start

3. Install the SageMaker Python SDK.

To package your code from your preferred IDE, you must have a virtual environment set up using
Python 3.7 or higher, up to 3.10.x. You also need a compatible container image. Install the SageMaker
Python SDK using the following code example.

pip install sagemaker

4. Wrap your code inside the @remote decorator. The SageMaker Python SDK will automatically
interpret the function of your code and run it as a SageMaker training job. The following code
examples show how to import the necessary libraries, set up a SageMaker session, and annotate a
function with the @remote decorator.

You can run your code by either providing the dependencies needed directly, or by using dependencies
from the active conda environment.
• To provide the dependencies directly, do the following:
• Create a requirements.txt file in the working directory that the code resides in.
• Add all of the dependencies required for the code along with the SageMaker library. The
following section provides a minimal code example for requirements.txt for the example
divide function.

sagemaker

• Run your code with the @remote decorator by passing the dependencies file. In the following
code example, replace The IAM role name with an AWS Identity and Access Management (IAM)
role ARN that you would like SageMaker to use to run your job.

import boto3
import sagemaker
from sagemaker.remote_function import remote

sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
settings = dict(
    sagemaker_session=sm_session,
    role=<The IAM role name>,
    instance_type="ml.m5.xlarge",
    dependencies='./requirements.txt'
)

@remote(**settings)
def divide(x, y):
    return x / y

if __name__ == "__main__":
    print(divide(2, 3.0))

• To use dependencies from the active conda environment, use the value auto_capture for the
dependencies parameter, as shown in the following.

import boto3
import sagemaker
from sagemaker.remote_function import remote

sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
settings = dict(
    sagemaker_session=sm_session,
    role=<The IAM role name>,
    instance_type="ml.m5.xlarge",
    dependencies="auto_capture"
)

@remote(**settings)
def divide(x, y):
    return x / y

if __name__ == "__main__":
    print(divide(2, 3.0))

Note
You can also implement the previous code inside a Jupyter notebook. PyCharm
Professional Edition supports Jupyter natively. For more guidance, see Jupyter notebook
support in PyCharm's documentation.

Invoking a function
To invoke a function inside the @remote decorator, use either of the following methods:

• Use an @remote decorator to invoke a function (p. 1572).


• Use the RemoteExecutor API to invoke a function (p. 1575).

If you use the @remote decorator method to invoke a function, the training job will wait for the function
to complete before starting a new task. However, if you use the RemoteExecutor API, you can run more
than one job in parallel. The following sections show both ways of invoking a function.

Use an @remote decorator to invoke a function


You can use the @remote decorator to annotate a function. SageMaker will transform the code inside
the decorator into a SageMaker training job. The training job will then invoke the function inside the
decorator and wait for the job to complete. The following code example shows how to import the
required libraries, start a SageMaker instance, and annotate a matrix multiplication with the @remote
decorator.

from sagemaker.remote_function import remote

import numpy as np

@remote(instance_type="ml.m5.large")
def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])

assert (matrix_multiply(a, b) == np.array([1,2])).all()

The decorator accepts its configuration as keyword arguments and has approximately the following signature.

def remote(_func=None, **kwargs):
    ...

When you invoke a decorated function, the SageMaker Python SDK loads any exception raised by the
remote job into local memory. In the following code example, the first call to the divide function completes
successfully and the result is loaded into local memory. In the second call to the divide function, the code
raises an error and this error is loaded into local memory.

from sagemaker.remote_function import remote

import pytest

@remote()
def divide(a, b):
    return a/b

# the underlying job is completed successfully
# and the function return is loaded
assert divide(10, 5) == 2

# the underlying job fails with "AlgorithmError"
# and the function exception is loaded into local memory
with pytest.raises(ZeroDivisionError):
    divide(10, 0)

Note
The decorated function is run as a remote job. If the thread is interrupted, the underlying job
will not be stopped.

How to change the value of a local variable


The decorated function runs on a remote machine. Changing a non-local variable or input arguments
inside a decorated function will not change the local value.

In the following code example, values are appended to a list and added to a dict inside a decorated
function. The local copies do not change when the decorated function is invoked.

a = []

@remote
def func():
    a.append(1)

# when func is invoked, a in the local memory is not modified
func()
func()

# a stays as []

a = {}

@remote
def func(a):
    # append new values to the input dictionary
    a["key-2"] = "value-2"

a = {"key": "value"}
func(a)

# a stays as {"key": "value"}

To change the value of a local variable declared inside of a decorator function, return the variable from
the function. The following code example shows that the value of a local variable is changed when it is
returned from the function.


a = {"key-1": "value-1"}

@remote
def func(a):
a["key-2"] = "value-2"
return a

a = func(a)

-> {"key-1": "value-1", "key-2": "value-2"}

Input and output for the remote function


Function arguments, return values, and exceptions are converted into a byte stream by following
Python's pickle protocol. There is no explicit limit on the types of function arguments or return values.
You can pickle any of the following items:

• Built-in Python objects, dicts, lists, tuples, or primitive data types


• Numpy arrays
• Pandas Dataframes
• Scikit-learn datasets and estimators
• PyTorch models
• Tensorflow models
• XGBoost Boosters

The following Python objects cannot be serialized with Pickle:

• Python file objects


• Data loader objects, including tensorflow.data.DataSet and xgboost.DMatrix objects
• Ctype objects containing pointers

The following code example uses xgboost.DMatrix and is expected to fail.

import pandas
import xgboost as xgb
from sklearn.model_selection import train_test_split
from xgboost import DMatrix

@remote()
def train(dtrain, params):
    return xgb.train(params, dtrain)

df = pandas.read_csv("./data.csv")
train_data, test_data = train_test_split(df, test_size=0.3)
dtrain = DMatrix(train_data)

booster = train(dtrain, {}) # raises SerializationError

As a recommended practice, DMatrix objects should be loaded during training time instead of using
them as an input data object to the remote function.

To rectify the previous code example, pass the pandas dataframe or numpy arrays directly to the train
function by using this code example.

@remote
def train(df, params):
    dtrain = DMatrix(df)
    return xgb.train(params, dtrain)


For data sets that are too large to fit into memory, use the specialized data loader provided by your
framework in the function. The following code shows an example of the tensorflow data loader.

@remote()
def train(data_path: str, params):
    import tensorflow as tf
    import tensorflow_io as tfio

    dataset = tf.data.TextLineDataset(tf.data.Dataset.list_files(f"{data_path}/*.txt"))
    ...

train("s3://my_bucket/data", {})

Use the RemoteExecutor API to invoke a function


You can use the RemoteExecutor API to invoke a function. SageMaker Python SDK will transform the
code inside the RemoteExecutor call into a SageMaker training job. The training job will then invoke
the function as an asynchronous operation and return a future. If you use the RemoteExecutor API,
you can run more than one training job in parallel. For more information about futures in Python, see
Futures.

The following code example shows how to import the required libraries, define a function, start a
SageMaker instance, and use the API to submit a request to run 2 jobs in parallel.

import numpy as np

from sagemaker.remote_function import RemoteExecutor

def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])

with RemoteExecutor(max_parallel_job=2, instance_type="ml.m5.large") as e:
    future = e.submit(matrix_multiply, a, b)

assert (future.result() == np.array([1,2])).all()

The RemoteExecutor class is an implementation of the concurrent.futures.Executor abstract class.

The following code example shows how to define a function and call it using the RemoteExecutor API.
In this example, the RemoteExecutor submits 4 jobs in total, but only 2 in parallel. The last two jobs
reuse the clusters with minimal overhead.

from sagemaker.remote_function.client import RemoteExecutor

def divide(a, b):
    return a/b

with RemoteExecutor(max_parallel_job=2, keep_alive_period_in_seconds=60) as e:
    futures = [e.submit(divide, a, 2) for a in [3, 5, 7, 9]]

    for future in futures:
        print(future.result())

The max_parallel_job parameter only serves as a rate-limiting mechanism without optimizing
compute resource allocation. In the previous code example, RemoteExecutor doesn’t reserve
compute resources for the two parallel jobs before any jobs are submitted. For more information about
max_parallel_job or other parameters for the @remote decorator, see Remote function classes and
methods specification.

Future class for the RemoteExecutor API


The Future class is a public class that represents the return value of the function run by the training job
when it is invoked asynchronously. The Future class implements the concurrent.futures.Future class. It
can be used to perform operations on the underlying job and to load data into memory.

Configuration file
The Amazon SageMaker Python SDK supports setting of default values for AWS infrastructure primitive
types. After administrators configure these defaults, they are automatically passed when SageMaker
Python SDK calls supported APIs. The arguments for the decorator function can be placed inside
configuration files so that you can separate settings that are related to the infrastructure from
the code base. For more information about parameters and arguments for the remote function and
methods, see Remote function classes and methods specification.

You can set infrastructure settings for the network configuration, IAM roles, Amazon S3 folder for input,
output data, and tags inside the configuration file. The configuration file can be used when invoking a
function using either the @remote decorator or the RemoteExecutor API.

An example configuration file that defines the dependencies, resources, and other arguments follows.
This example configuration file is used to invoke a function that is initiated either using the @remote
decorator or the RemoteExecutor API.

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: 'path/to/requirements.txt'
        EnableInterContainerTrafficEncryption: true
        EnvironmentVariables: {'EnvVarKey': 'EnvVarValue'}
        ImageUri: '366666666666.dkr.ecr.us-west-2.amazonaws.com/my-image:latest'
        IncludeLocalWorkDir: true
        InstanceType: 'ml.m5.large'
        JobCondaEnvironment: 'your_conda_env'
        PreExecutionCommands:
          - 'command_1'
          - 'command_2'
        PreExecutionScript: 'path/to/script.sh'
        RoleArn: 'arn:aws:iam::366666666666:role/MyRole'
        S3KmsKeyId: 'yourkmskeyid'
        S3RootUri: 's3://my-bucket/my-project'
        VpcConfig:
          SecurityGroupIds:
            - 'sg123'
          Subnets:
            - 'subnet-1234'
        Tags: [{'Key': 'yourTagKey', 'Value': 'yourTagValue'}]
        VolumeKmsKeyId: 'yourkmskeyid'

The @remote decorator and RemoteExecutor will look for Dependencies in the following
configuration files:

• An admin-defined configuration file.


• A user-defined configuration file.


The default locations for these configuration files depend on, and are relative to, your environment. The
following code example returns the default location of your admin and user configuration files. These
commands must be run in the same environment where you're using the SageMaker Python SDK.

import os
from platformdirs import site_config_dir, user_config_dir

# Prints the location of the admin config file
print(os.path.join(site_config_dir("sagemaker"), "config.yaml"))

# Prints the location of the user config file
print(os.path.join(user_config_dir("sagemaker"), "config.yaml"))

You can override the default locations of these files by setting the
SAGEMAKER_ADMIN_CONFIG_OVERRIDE and SAGEMAKER_USER_CONFIG_OVERRIDE environment
variables for the admin-defined and user-defined configuration file paths, respectively.

If a key exists in both the admin-defined and user-defined configuration files, the value in the user-
defined file will be used.
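
For example, a sketch of overriding the user configuration file location follows; the path is a placeholder, and the variable should be set before the SDK reads its configuration.

import os

# Placeholder path; set before the SageMaker Python SDK loads its configuration.
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = "/home/me/sagemaker/config.yaml"

import sagemaker  # the SDK now resolves the user configuration from the path above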

Customize your runtime environment


You can customize your runtime environment to use your preferred local integrated development
environments (IDEs), SageMaker notebooks, or SageMaker Studio notebooks to write your ML code.
SageMaker will help package and submit your functions and their dependencies as a SageMaker training
job. This allows you to access the capacity of the SageMaker training platform to run your training jobs.

Both the remote decorator and the RemoteExecutor methods to invoke a function allow users to
define and customize their runtime environment. You can use either a requirements.txt file or a
conda environment YAML file.

To customize a runtime environment using both a conda environment YAML file and a
requirements.txt file, refer to the following code example.

# specify a conda environment inside a yaml file
@remote(instance_type="ml.m5.large",
        image_uri = "my_base_python:latest",
        dependencies = "./environment.yml")
def matrix_multiply(a, b):
    return np.matmul(a, b)

# use a requirements.txt file to import dependencies
@remote(instance_type="ml.m5.large",
        image_uri = "my_base_python:latest",
        dependencies = './requirements.txt')
def matrix_multiply(a, b):
    return np.matmul(a, b)
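
For reference, a minimal environment.yml for the preceding example might look like the following; the environment name and packages are illustrative.

# Illustrative conda environment file; pin the packages your function needs.
name: sagemaker_example
dependencies:
  - python=3.10
  - pip
  - pip:
      - sagemaker
      - numpy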

Alternatively, you can set dependencies to auto_capture to let the SageMaker Python SDK
capture the installed dependencies in the active conda environment. The following are required for
auto_capture to work reliably:

• You must have an active conda environment. We recommend not using the base conda environment
for remote jobs so that you can reduce potential dependency conflicts. Not using the base conda
environment also allows for faster environment setup in the remote job.
• You must not have any dependencies installed using pip with a value for the --extra-index-url
parameter.
• You must not have any dependency conflicts between packages installed with conda and packages
installed with pip in the local development environment.


• Your local development environment must not contain operating system-specific dependencies that
are not compatible with Linux.

If auto_capture does not work, we recommend that you pass in your dependencies as a
requirements.txt or conda environment.yaml file, as described in the first code example in this section.

Container image compatibility


The following table shows a list of SageMaker training images that are compatible with the @remote
decorator.

Name Python Version Image URI - CPU Image URI - GPU

Data Science 3.7 (py37)
  CPU and GPU: For SageMaker Studio Notebooks only. The Python SDK automatically selects the image URI when the image is used as a SageMaker Studio Notebook kernel image.

Data Science 2.0 3.8 (py38)
  CPU and GPU: For SageMaker Studio Notebooks only. The Python SDK automatically selects the image URI when the image is used as a SageMaker Studio Notebook kernel image.

Data Science 3.0 3.10 (py310)
  CPU and GPU: For SageMaker Studio Notebooks only. The Python SDK automatically selects the image URI when the image is used as a SageMaker Studio Notebook kernel image.

Base Python 2.0 3.8 (py38)
  CPU: The Python SDK selects this image when it detects that the development environment is using the Python 3.8 runtime. Otherwise, the Python SDK automatically selects this image when it is used as a SageMaker Studio Notebook kernel image.
  GPU: For SageMaker Studio Notebooks only. The Python SDK automatically selects the image URI when the image is used as a SageMaker Studio Notebook kernel image.

Base Python 3.0 3.10 (py310)
  CPU: The Python SDK selects this image when it detects that the development environment is using the Python 3.10 runtime. Otherwise, the Python SDK automatically selects this image when it is used as a SageMaker Studio Notebook kernel image.
  GPU: For SageMaker Studio Notebooks only. The Python SDK automatically selects the image URI when the image is used as a SageMaker Studio Notebook kernel image.

DLC-TensorFlow 2.12.0 for SageMaker training 3.10 (py310)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.12.0-cpu-py310-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker

DLC-TensorFlow 2.11.0 for SageMaker training 3.9 (py39)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-cpu-py39-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker

DLC-TensorFlow 2.10.1 for SageMaker training 3.9 (py39)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.1-cpu-py39-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.1-gpu-py39-cu112-ubuntu20.04-sagemaker

DLC-TensorFlow 2.9.2 for SageMaker training 3.9 (py39)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.2-cpu-py39-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.2-gpu-py39-cu112-ubuntu20.04-sagemaker

DLC-TensorFlow 2.8.3 for SageMaker training 3.9 (py39)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.8.3-cpu-py39-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.8.3-gpu-py39-cu112-ubuntu20.04-sagemaker

DLC-PyTorch 2.0.0 for SageMaker training 3.10 (py310)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-cpu-py310-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker

DLC-PyTorch 1.13.1 for SageMaker training 3.9 (py39)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-cpu-py39-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker

DLC-PyTorch 1.12.1 for SageMaker training 3.8 (py38)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-cpu-py38-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker

DLC-PyTorch 1.11.0 for SageMaker training 3.8 (py38)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-cpu-py38-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker

DLC-MXNet 1.9.0 for SageMaker training 3.8 (py38)
  CPU: 763104351884.dkr.ecr.<region>.amazonaws.com/mxnet-training:1.9.0-cpu-py38-ubuntu20.04-sagemaker
  GPU: 763104351884.dkr.ecr.<region>.amazonaws.com/mxnet-training:1.9.0-gpu-py38-cu112-ubuntu20.04-sagemaker

Note
To run jobs locally using AWS Deep Learning Containers (DLC) images, use the image URIs
found in the DLC documentation. The DLC images do not support the auto_capture value for
dependencies.

You can also run remote functions with your custom images. For compatibility with remote functions,
custom images should be built with Python version 3.7.x-3.10.x. The following is a minimal Dockerfile
example showing you how to use a Docker image with Python 3.10.

FROM python:3.10

#... Rest of the Dockerfile

To create a conda environment in your image and use it to run jobs, set the environment
variable SAGEMAKER_JOB_CONDA_ENV to the conda environment name. If your image has the
SAGEMAKER_JOB_CONDA_ENV value set, the remote function cannot create a new conda environment
during the training job runtime. Refer to the following Dockerfile example that uses a conda
environment with Python version 3.10.

FROM continuumio/miniconda3:4.12.0

ENV SHELL=/bin/bash \
    CONDA_DIR=/opt/conda \
    SAGEMAKER_JOB_CONDA_ENV=sagemaker-job-env

RUN conda create -n $SAGEMAKER_JOB_CONDA_ENV -y \
    && conda install -n $SAGEMAKER_JOB_CONDA_ENV python=3.10 -y \
    && conda clean --all -f -y

For SageMaker to use mamba to manage your Python virtual environment in the container image, install
the mamba toolkit from miniforge. To use mamba, add the following code example to your Dockerfile.
SageMaker then detects that mamba is available at runtime and uses it instead of conda.

# Mamba Installation
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda" \
    && /opt/conda/bin/conda init bash

Using a custom conda channel on an Amazon S3 bucket is not compatible with mamba when using a
remote function. If you choose to use mamba, make sure you are not using a custom conda channel on
Amazon S3. For more information, see the Prerequisites section under Custom conda repository using
Amazon S3.


The following is a complete Dockerfile example showing how to create a compatible Docker image.

FROM python:3.10

RUN apt-get update -y \
    # Needed for awscli to work
    # See: https://github.com/aws/aws-cli/issues/1957#issuecomment-687455928
    && apt-get install -y groff unzip curl \
    && pip install --upgrade \
        'boto3>1.0,<2' \
        'awscli>1.0,<2' \
        'ipykernel>6.0.0,<7.0.0' \
    # Use ipykernel with the --sys-prefix flag so that the absolute path to the
    # Python interpreter is recorded in
    # /usr/local/share/jupyter/kernels/python3/kernel.json
    && python -m ipykernel install --sys-prefix

# Install Mamba
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda" \
    && /opt/conda/bin/conda init bash

#cleanup
RUN apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf ${HOME}/.cache/pip \
&& rm Mambaforge-Linux-x86_64.sh

ENV SHELL=/bin/bash \
PATH=$PATH:/opt/conda/bin

The resulting image from running the previous Dockerfile example can also be used as a SageMaker
Studio kernel image.

Logging parameters and metrics with Amazon SageMaker Experiments
This guide shows how to log parameters and metrics with Amazon SageMaker Experiments. A SageMaker
experiment consists of runs, and each run consists of all the inputs, parameters, configurations, and
results for a single iteration of model training.

You can log parameters and metrics from a remote function using either the @remote decorator or the
RemoteExecutor API.

To log parameters and metrics from a remote function, choose one of the following methods:

• Instantiate a SageMaker experiment run inside a remote function using Run from the SageMaker
Experiments library. For more information, see Create an Amazon SageMaker Experiment.
• Use the load_run function inside a remote function from the SageMaker Experiments library. This will
load a Run instance that is declared outside of the remote function.

The following sections show how to create and track lineage with SageMaker experiment runs by using
the previously listed methods. The sections also describe cases that are not supported by SageMaker
training.


Use the @remote decorator to integrate with SageMaker Experiments

You can either instantiate an experiment in SageMaker, or load a current SageMaker experiment from
inside a remote function. The following sections show you how to use either method.

Create an experiment with SageMaker Experiments

You can create an experiment run in a SageMaker experiment. To do this, pass your experiment name,
run name, and other parameters into your remote function.

The following code example passes in the name of your experiment, the name of the run, and the
parameters to log during each run. The parameters param_1 and param_2 are logged over time inside
a training loop. Common parameters may include batch size or epochs. In this example, the metrics
metric_a and metric_b are logged for a run over time inside a training loop. Other common metrics
may include accuracy or loss.

from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run

# Define your remote function
@remote
def train(value_1, value_2, exp_name, run_name):
    ...
    # Creates the experiment
    with Run(
        experiment_name=exp_name,
        run_name=run_name,
    ) as run:
        ...
        # Define values for the parameters to log
        run.log_parameter("param_1", value_1)
        run.log_parameter("param_2", value_2)
        ...
        # Define metrics to log
        run.log_metric("metric_a", 0.5)
        run.log_metric("metric_b", 0.1)

# Invoke your remote function
train(1.0, 2.0, "my-exp-name", "my-run-name")

Load current SageMaker Experiments with a job initiated by the @remote decorator

Use the load_run() function from the SageMaker Experiments library to load the current run object
from the run context. You can also use the load_run() function within your remote function to load
the run object that was initialized locally by the with statement, as shown in the following code
example.

from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run, load_run

# Define your remote function
@remote
def train(value_1, value_2):
    ...
    with load_run() as run:
        run.log_metric("metric_a", value_1)
        run.log_metric("metric_b", value_2)

# Invoke your remote function
with Run(
    experiment_name="my-exp-name",
    run_name="my-run-name",
) as run:
    train(0.5, 1.0)

Load a current experiment run within a job initiated with the RemoteExecutor API

You can also load a current SageMaker experiment run if your jobs were initiated with the
RemoteExecutor API. The following code example shows how to use the RemoteExecutor API with the
SageMaker Experiments load_run function to load a current SageMaker experiment run and capture
metrics in the job submitted by RemoteExecutor.

from sagemaker.remote_function import RemoteExecutor
from sagemaker.experiments.run import Run, load_run

def square(x):
    with load_run() as run:
        result = x * x
        run.log_metric("result", result)
        return result

with RemoteExecutor(
    max_parallel_jobs=2,
    instance_type="ml.m5.large"
) as e:
    with Run(
        experiment_name="my-exp-name",
        run_name="my-run-name",
    ):
        future_1 = e.submit(square, 2)

Unsupported uses for SageMaker Experiments while annotating your code with an @remote decorator

SageMaker does not support passing a Run type object to an @remote function or using global Run
objects. The following examples show code that will throw a SerializationError.

The following code example attempts to pass a Run type object to an @remote decorator, and it
generates an error.

@remote
def func(run: Run):
    run.log_metric("metric_a", 1.0)

with Run(...) as run:
    func(run)  # ---> SerializationError caused by NotImplementedError

The following code example attempts to use a global run object instantiated outside of the remote
function. In the code example, the train() function is defined inside the with Run context,
referencing a global run object from within. When train() is called, it generates an error.

with Run(...) as run:

    @remote
    def train(metric_1, value_1, metric_2, value_2):
        run.log_parameter(metric_1, value_1)
        run.log_parameter(metric_2, value_2)

    train("p1", 1.0, "p2", 0.5)  # ---> SerializationError caused by NotImplementedError

Using modular code with the @remote decorator


You can organize your code into modules for ease of workspace management during development
and still use the @remote function to invoke a function. You can also replicate the local modules
from your development environment to the remote job environment. To do so, set the parameter
include_local_workdir to True, as shown in the following code example.

@remote(
    include_local_workdir=True,
)

Note
The @remote decorator and parameter must appear in the main file, rather than in any of the
dependent files.

When include_local_workdir is set to True, SageMaker will package all of the Python scripts while
maintaining the directory structure in the process's current directory. It will also make the dependencies
available in the job's working directory.

As an example, consider the following scenario where a Python script to process the MNIST dataset is
divided into a main.py script and a dependent pytorch_mnist.py script, which is called by main.py.
In this scenario, the main.py script contains code to import the dependency, as follows.

from mnist_impl.pytorch_mnist import ...

The main.py file must also contain the @remote decorator, and it must set the
include_local_workdir parameter to True.

Best practices in structuring your working directory.


We recommend the following best practices while using the @remote decorator in your modular code.

• Put the @remote decorator in a file that resides at the root level directory of the workspace.
• Structure the local modules at the root level.

The following example shows a directory structure that results in inconsistent behavior when it is
used to annotate your code with an @remote decorator.

In this example structure, the main.py script that contains the @remote decorator is not located at the
root level directory of the workspace. While running python entrypoint/main.py locally may succeed,
the remote job will not run successfully because the dependent modules are not packaged relative to
main.py. Therefore, the following structure is NOT recommended; a recommended layout is shown after
it.

.
├── config.yaml
├── entrypoint
│   ├── data
│   └── main.py <----------------- @remote used here
├── mnist_impl
│   ├── __pycache__
│   │   └── pytorch_mnist.cpython-310.pyc
│   └── pytorch_mnist.py <-------- dependency of main.py
└── requirements.txt
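
By contrast, the following layout follows both best practices: the main.py script that contains the
@remote decorator, and the mnist_impl module it depends on, both sit at the root level of the
workspace. This layout is illustrative; other root-level structures also work.

.
├── main.py <----------------- @remote used here
├── mnist_impl
│   └── pytorch_mnist.py <-------- dependency of main.py
├── requirements.txt
└── config.yaml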

Private repository for runtime dependencies


You can use pre-execution commands or a script to configure a dependency manager like pip or conda
in your job environment. To achieve network isolation, use either of these options to redirect your
dependency managers to access your private repositories and run remote functions within a VPC. The
pre-execution commands or script will run before your remote function runs. You can define them with
the @remote decorator, the RemoteExecutor API, or within a configuration file.

The following sections show you how to access a private Python Package Index (PyPI) repository
managed with AWS CodeArtifact. The sections also show how to access a custom conda channel hosted
on Amazon Simple Storage Service (Amazon S3).

How to use a custom PyPI repository managed with AWS CodeArtifact

To use CodeArtifact to manage a custom PyPI repository, the following prerequisites are required:

• Your private PyPI repository should already have been created. You can utilize AWS CodeArtifact
to create and manage your private package repositories. To learn more about CodeArtifact, see the
CodeArtifact User Guide.
• Your VPC should have access to your CodeArtifact repository. To allow a connection from your VPC to
your CodeArtifact repository, you must do the following:
• Create VPC endpoints for CodeArtifact.
• Create an Amazon S3 gateway endpoint for your VPC, which allows CodeArtifact to store package
assets.

The following pre-execution command example shows how to configure pip in the SageMaker training
job to point to your CodeArtifact repository. For more information, see Configure and use pip with
CodeArtifact.

# Use a requirements.txt file to import dependencies
import numpy as np
from sagemaker.remote_function import remote

@remote(
    instance_type="ml.m5.large",
    image_uri="my_base_python:latest",
    dependencies='./requirements.txt',
    pre_execution_commands=[
        "aws codeartifact login --tool pip --domain my-org --domain-owner "
        "<000000000000> --repository my-codeartifact-python-repo --endpoint-url "
        "https://vpce-xxxxx.api.codeartifact.us-east-1.vpce.amazonaws.com"
    ]
)
def matrix_multiply(a, b):
    return np.matmul(a, b)

How to use a custom conda channel hosted on Amazon S3


To use Amazon S3 to manage a custom conda repository, the following prerequisites are required:

• Your private conda channel must already be set up in your Amazon S3 bucket, and all dependent
packages must be indexed and uploaded to your Amazon S3 bucket. For instructions on how to index
your conda packages, see Creating custom channels.


• Your VPC should have access to the Amazon S3 bucket. For more information, see Endpoints for
Amazon S3.
• The base conda environment in your job image should have boto3 installed. To check your
environment, enter the following in your Anaconda prompt to check that boto3 appears in the
resulting generated list.

conda list -n base

• Your job image should be installed with conda, not mamba. To check your environment, ensure that
  the output of the previous command does not include mamba.

The following pre-execution commands example shows how to configure conda in the SageMaker
training job to point to your private channel on Amazon S3. The pre-execution commands remove the
defaults channel and add your custom channels to the .condarc conda configuration file.

# Specify your dependencies inside a conda yaml file
import numpy as np
from sagemaker.remote_function import remote

@remote(
    instance_type="ml.m5.large",
    image_uri="my_base_python:latest",
    dependencies="./environment.yml",
    pre_execution_commands=[
        "conda config --remove channels 'defaults'",
        "conda config --add channels 's3://my_bucket/my-conda-repository/conda-forge/'",
        "conda config --add channels 's3://my_bucket/my-conda-repository/main/'"
    ]
)
def matrix_multiply(a, b):
    return np.matmul(a, b)

Example notebooks
You can transform training code in an existing workspace environment, along with any associated data
processing code and datasets, into a SageMaker training job. The following notebooks show you how
to customize your environment, job settings, and more for an image classification problem, using the
XGBoost algorithm and Hugging Face.

The quick_start notebook contains the following code examples:

• How to customize your job settings with a configuration file.
• How to invoke Python functions as jobs, asynchronously.
• How to customize the job runtime environment by bringing in additional dependencies.
• How to use local dependencies with the @remote function method.

The following notebooks provide additional code examples for different ML problem types and
implementations.

• To see code examples that use the @remote decorator for an image classification problem, open the
  pytorch_mnist.ipynb notebook. This classification problem recognizes handwritten digits using the
  Modified National Institute of Standards and Technology (MNIST) sample dataset.
• To see code examples for using the @remote decorator for the previous image classification problem
  with a script, see the PyTorch MNIST sample script, train.py.
• To see how the XGBoost algorithm is implemented with an @remote decorator, open the
  xgboost_abalone.ipynb notebook.
• To see how Hugging Face is integrated with an @remote decorator, open the huggingface.ipynb
  notebook.


Manage Machine Learning with Amazon SageMaker Experiments
Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you create, manage,
analyze, and compare your machine learning experiments.

Experimentation in machine learning

Machine learning is an iterative process. You need to experiment with multiple combinations of data,
algorithms, and parameters, all while observing the impact of incremental changes on model accuracy.
Over time, this iterative experimentation can result in thousands of model training runs and model
versions. This makes it hard to track the best performing models and their input configurations. It’s
also difficult to compare active experiments with past experiments to identify opportunities for further
incremental improvements. Use SageMaker Experiments to organize, view, analyze, and compare
iterative ML experimentation to gain comparative insights and track your best performing models.

Manage ML experimentation with SageMaker Experiments

SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of
your iterations as runs. You can assign, group, and organize these runs into experiments. SageMaker
Experiments is integrated with Amazon SageMaker Studio, providing a visual interface to browse
your active and past experiments, compare runs on key performance metrics, and identify the best
performing models. SageMaker Experiments tracks all of the steps and artifacts that went into creating
a model, and you can quickly revisit the origins of a model when you are troubleshooting issues in
production, or auditing your models for compliance verifications.

Use SageMaker Experiments to view, manage, analyze, and compare both custom experiments that you
programmatically create and experiments automatically created from SageMaker jobs.

Supported AWS Regions


SageMaker Experiments is generally available in all AWS commercial Regions where Amazon SageMaker
Studio is available, except the China Regions.

Topics
• Create an Amazon SageMaker Experiment (p. 1587)
• View, search, and compare experiment runs (p. 1592)
• SageMaker integrations (p. 1596)
• Example notebooks for Amazon SageMaker Experiments (p. 1598)
• Monitor experiment training metrics with AWS CloudTrail (p. 1599)
• Clean Up Amazon SageMaker Experiment Resources (p. 1600)
• Additional supported SDK (p. 1602)
• Experiments FAQs (p. 1605)
• Search Using the Amazon SageMaker Console and API (p. 1607)

Create an Amazon SageMaker Experiment


Create an Amazon SageMaker experiment to track your machine learning (ML) workflows with a few
lines of code from your preferred development environment. You can then browse your experiments,


create visualizations for analysis, and find the best performing model. You can also integrate SageMaker
Experiments into your SageMaker training script using the SageMaker Python SDK.

Overview
The following components make up the building blocks of an experiment in Amazon SageMaker.

• experiment: An experiment is a collection of runs. When you initialize a run in your training loop, you
include the name of the experiment that the run belongs to. Experiment names must be unique within
your AWS account.
• Run: A run consists of all the inputs, parameters, configurations, and results for one iteration of
model training. Initialize an experiment run for tracking a training job with Run.init().
Note
We recommend that you initialize a Run object in a Jupyter Notebook, and create the
SageMaker job for your experiment within the context of this Run object initialization. To refer
to this Run object in script mode, use the load_run() method. For examples, see Example
notebooks for Amazon SageMaker Experiments (p. 1598).
Note
The SageMaker Python SDK automatically turns experiment names and run names to
lowercase.
• load_run: To run your experiments in script mode, refer to an initialized Run object with
load_run(). If an experiment for a run exists, load_run returns the experiment context. Generally,
you use load_run with no arguments to track metrics, parameters, and artifacts within a SageMaker
training or processing job script.

# Load run from a local script, passing experiment and run names
with load_run(experiment_name=experiment_name, run_name=run_name) as run:
    run.log_parameter("param1", "value1")

# Load run within a training or processing job (automated context sharing)
with load_run() as run:
    run.log_parameter("param1", "value1")

• log_parameter: Log parameters for a run, such as batch size or epochs, over time in a training loop
with run.log_parameter(). log_parameter records a single name-value pair in a run. You can
use run.log_parameters() to log multiple parameters. If called multiple times within a run for a
parameter of the same name, log_parameter overwrites any previous value. The name must be a
string and the value must be either a string, integer, or float.

# Log a single parameter
run.log_parameter("param1", "value1")

# Log multiple parameters
run.log_parameters({
    "param2": "value2",
    "param3": "value3"
})

• log_metric: Log metrics for a run, such as accuracy or loss, over time in a training loop with
run.log_metric(). log_metric records a name-value pair where the name is a string and the
value is an integer or float. To declare the frequency of logging over the course of the run, define a
step value. You can then visualize these metrics in the Studio Experiments UI. For more information,
see View, search, and compare experiment runs (p. 1592).

# Log a metric over the course of a run
run.log_metric(name="Final_loss", value=finalloss)

# Log a metric over the course of a run at each epoch
run.log_metric(name="test:loss", value=loss, step=epoch)

• log_artifact: Log any input or output artifacts related to a run with run.log_artifact(). Log
artifacts such as S3 URIs, datasets, models, and more for your experiment to help you keep track of
artifacts across multiple runs. is_output is True by default. To record the artifact as an input artifact
instead of an output artifact, set is_output to False.

# Track a string value as an input or output artifact
run.log_artifact(name="training_data", value="data.csv", is_output=False)

• log_file: Log any input or output files related to a run, such as training or test data, and store them
in Amazon S3 with run.log_file(). is_output is True by default. To record the file as an input
artifact instead of an output artifact, set is_output to False.

# Upload a local file to S3 and track it as an input or output artifact
run.log_file("training_data.csv", name="training_data", is_output=False)

For more information on initializing a Run object, see Experiments in the SageMaker Python SDK
documentation. For information on visualizing logged experiment data and automatic logging, see View,
search, and compare experiment runs (p. 1592).

Create an experiment with the SageMaker Python SDK


The following section demonstrates how to create an Amazon SageMaker Experiment using the
SageMaker Python SDK. This example uses the Run class to track a Keras model in a notebook
environment. The Keras Callback class provides a method on_epoch_end which emits metrics at the
end of each epoch. First, define a Callback class.

class ExperimentCallback(keras.callbacks.Callback):
    """Log metrics to the experiment run at the end of each epoch."""

    def __init__(self, run, model, x_test, y_test):
        """Save params in constructor"""
        self.run = run
        self.model = model
        self.x_test = x_test
        self.y_test = y_test

    def on_epoch_end(self, epoch, logs=None):
        keys = list(logs.keys())
        for key in keys:
            self.run.log_metric(name=key, value=logs[key], step=epoch)
            print("Epoch: {}\n{} -> {}".format(epoch, key, logs[key]))

Next, train the Keras model in a notebook environment and track it as an experiment.
Note
This example carries out jobs sequentially. To run SageMaker jobs asynchronously, you may need
to increase your resource limit.

from sagemaker.experiments import Run

# The run name is an optional argument to Run()
with Run(experiment_name='my-experiment') as run:

    # Define values for the parameters to log
    run.log_parameter("batch_size", batch_size)
    run.log_parameter("epochs", epochs)
    run.log_parameter("dropout", 0.5)

    # Define input artifacts
    run.log_file('datasets/input_train.npy', is_output=False)
    run.log_file('datasets/input_test.npy', is_output=False)
    run.log_file('datasets/input_train_labels.npy', is_output=False)
    run.log_file('datasets/input_test_labels.npy', is_output=False)

    # Train locally
    model.fit(
        x_train,
        y_train,
        batch_size=batch_size,
        epochs=epochs,
        validation_split=0.1,
        callbacks=[ExperimentCallback(run, model, x_test, y_test)]
    )

    score = model.evaluate(x_test, y_test, verbose=0)
    print("Test loss:", score[0])
    print("Test accuracy:", score[1])

    # Define metrics to log
    run.log_metric(name="Final Test Loss", value=score[0])
    run.log_metric(name="Final Test Accuracy", value=score[1])

For more code samples and example notebooks, see Example notebooks for Amazon SageMaker
Experiments (p. 1598).

Create an experiment using SageMaker script mode

You can use SageMaker script mode to write your own code to train a model and track it as an
experiment. When creating an experiment with script mode, use load_run().

# Make sure that you have the latest version of the SageMaker Python SDK
import os
os.system("pip install -U sagemaker")

# Import additional requirements
import boto3
from sagemaker.session import Session
from sagemaker.experiments.run import load_run

# Define training script
if __name__ == "__main__":
    session = Session(boto3.session.Session(region_name=args.region))
    with load_run(sagemaker_session=session) as run:
        # Define values for the parameters to log
        run.log_parameters({
            "batch_size": batch_size,
            "epochs": epochs,
            "dropout": 0.5
        })
        # Define input artifacts
        run.log_file('datasets/input_train.npy', is_output=False)
        run.log_file('datasets/input_test.npy', is_output=False)
        run.log_file('datasets/input_train_labels.npy', is_output=False)
        run.log_file('datasets/input_test_labels.npy', is_output=False)

        # Train the model
        model.fit(
            x_train,
            y_train,
            batch_size=batch_size,
            epochs=epochs,
            validation_split=0.1,
            callbacks=[ExperimentCallback(run, model, x_test, y_test)]
        )

        score = model.evaluate(x_test, y_test, verbose=0)
        print("Test loss:", score[0])
        print("Test accuracy:", score[1])

        # Define metrics to log
        run.log_metric(name="Final Test Loss", value=score[0])
        run.log_metric(name="Final Test Accuracy", value=score[1])

For more code samples and example notebooks on using Amazon SageMaker Experiments in SageMaker
script mode, see Track experiments for SageMaker training jobs using script mode (p. 1599).

For more information on script mode, see Use script mode in a supported framework. You can also
define custom metrics in script mode by specifying a name and regular expression for each metric that a
tuning job monitors. See Use a custom algorithm for training for more information.
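
As a minimal sketch of that custom-metric syntax, the following pairs each metric name with a
regular expression applied to the job's log output. The image URI, role, and regex patterns here are
placeholders, not values from this guide.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",   # placeholder
    role="<execution-role-arn>",        # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Each definition maps a metric name to a regex that extracts its value
    # from the training job logs.
    metric_definitions=[
        {"Name": "test:loss", "Regex": "Test loss: ([0-9\\.]+)"},
        {"Name": "test:accuracy", "Regex": "Test accuracy: ([0-9\\.]+)"},
    ],
)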

View your experiment in Studio


To view the experiment in Studio, in the left sidebar, choose Experiments.

Select the name of the experiment to view all associated runs. It might take a moment for the list to
refresh and display a new experiment or experiment run. You can click Refresh to update the page. Your
experiment list should look similar to the following:

To view the runs that make up your experiment, select the experiment name. For more information, see
View, search, and compare experiment runs (p. 1592).

View unassigned runs


All SageMaker jobs, including training jobs, processing jobs, and transform jobs, correspond to runs
and create Run objects by default. If you launch these jobs without explicitly associating them with an

1591
Amazon SageMaker Developer Guide
View, search, and compare experiment runs

experiment, the resulting runs are unassigned and can be viewed in the Unassigned runs section of the
Studio Experiments UI.

To clean up the resources you created, see Clean Up Amazon SageMaker Experiment Resources (p. 1600).

View, search, and compare experiment runs


An Amazon SageMaker experiment consists of multiple run groups with a related objective. A run group
consists of one or more runs, such as a data preprocessing job and a training job.

You use the experiments browser to display a list of these entities. You can filter the list by entity
name, type, and tags. For an overview of the Studio user interface, see Amazon SageMaker Studio UI
Overview (p. 129).

Topics
• View experiments and runs (p. 1592)
• Compare and analyze runs (p. 1594)

View experiments and runs


Amazon SageMaker Studio provides an experiments browser that you can use to view lists of
experiments and runs. You can choose one of these entities to view detailed information about the entity
or choose multiple entities for comparison.

To view experiments and runs

1. To view the experiment in Studio, in the left sidebar, choose Experiments.

Select the name of the experiment to view all associated runs. You can search experiments by typing
directly into the Search bar or filtering for experiment type. You can also choose which columns to
display in your experiment or run list.

It might take a moment for the list to refresh and display a new experiment or experiment run. You
can click Refresh to update the page. Your experiment list should look similar to the following:


2. In the experiments list, double-click an experiment to display a list of the runs in the experiment.
Note
Experiment runs that are automatically created by SageMaker jobs and containers are
visible in the Experiments Studio UI by default. To hide runs created by SageMaker jobs for
a given experiment, choose the settings icon and toggle Show jobs.

3. Double-click a run to display information about a specific run.

In the Overview pane, choose any of the following headings to see available information about each
run:

• Metrics – Metrics that are logged during a run.
• Charts – Build your own charts to compare runs.
• Output artifacts – Any resulting artifacts of the experiment run and the artifact locations in
  Amazon S3.
• Bias reports – Pre-training or post-training bias reports generated using SageMaker Clarify.
• Explainability – Explainability reports generated using SageMaker Clarify.
• Debugs – A list of Debugger rules and any issues found.

Compare and analyze runs


To analyze experiment runs, select the experiment of your choice in the Amazon SageMaker Studio
Experiments UI and then select the runs that you want to compare. You must select between 1 and 20
runs. After you have your runs selected, choose Analyze in the upper right-hand corner.

To compare experiment runs:

1. After navigating to the experiment of your choice, select all the runs that you want to compare. You
   must choose between 1 and 20 runs to analyze.
2. Choose Analyze in the upper right-hand corner.
3. Visualize the comparative metrics of multiple experiment runs in a histogram, line chart, scatter
plot, or bar chart. To add a chart, choose Add Chart, select values for your chart axes, and choose
Create.

You can update, download, or delete existing charts.


Log charts
Logging charts and visualizations is available for classification models. You can log a confusion matrix,
receiver operating characteristics, or precision and recall graphs.

Log and visualize metrics with the following Python SDK methods:

• log_confusion_matrix: Records a confusion matrix artifact that you can view in the Charts section
of the Run Overview in Studio.
• log_roc_curve: Records a receiver operating characteristic artifact that you can view in the Charts
section of the Run Overview in Studio.
• log_precision_recall: Records a precision recall graph that you can view in the Charts section of
the Run Overview in Studio.
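
As a minimal sketch, these methods can be called on a run in the same way as the scalar logging
methods above. The label and score arrays below are hypothetical, and the exact signatures may vary
by SDK version.

from sagemaker.experiments import Run

y_true = [0, 1, 1, 0, 1]               # hypothetical ground-truth labels
y_pred = [0, 1, 0, 0, 1]               # hypothetical predicted labels
y_scores = [0.2, 0.9, 0.4, 0.1, 0.8]   # hypothetical predicted probabilities

with Run(experiment_name="my-exp-name") as run:
    run.log_confusion_matrix(y_true, y_pred, title="Confusion matrix")
    run.log_roc_curve(y_true, y_scores, title="ROC curve")
    run.log_precision_recall(y_true, y_scores, title="Precision-recall")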

An automatically logged precision recall record creates a chart similar to the following:


SageMaker integrations
Amazon SageMaker Experiments is integrated with a number of SageMaker features. Certain SageMaker
jobs automatically create experiments. You can view and manage SageMaker Clarify bias reports or
SageMaker Debugger output tensors for specific experiment runs directly in the Studio Experiments UI.

• Automatic experiment creation (p. 1596)


• Bias and explainability reports (p. 1597)
• Debugging (p. 1598)

Automatic experiment creation


Amazon SageMaker automatically creates experiments when running Autopilot jobs, hyperparameter
optimization (HPO) jobs, or Pipeline executions. You can view these experiments in the Studio
Experiments UI.

Autopilot
Amazon SageMaker Experiments is integrated with Amazon SageMaker Autopilot. When you perform an
Autopilot job, SageMaker Experiments creates an experiment for that job as well as runs for each of the
different combinations of the available run components, parameters, and artifacts. You can find these
runs in the SageMaker Experiments UI by filtering for the run type Autopilot. For more information, see
Automate model development with Amazon SageMaker Autopilot.

HPO
Amazon SageMaker Experiments is integrated with HPO jobs. An HPO job automatically creates Amazon
SageMaker experiments, runs, and components for each training job that it completes. You can find
these runs in the SageMaker Experiments UI by filtering for the run type HPO. For more information, see
Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model.


Pipelines
Amazon SageMaker Model Building Pipelines is closely integrated with Amazon SageMaker Experiments.
By default, when SageMaker Pipelines creates and executes a pipeline, experiments, runs, and
components are created if they do not already exist. You can find these runs in the SageMaker
Experiments UI by filtering for the run type Pipelines. For more information, see Amazon SageMaker
Experiments Integration.

Bias and explainability reports


Manage SageMaker Clarify bias and explainability reports for experiment runs directly through Studio.
To view reports, find and select the name of the experiment run of your choice in Studio. Choose Bias
reports to see any Clarify bias reports associated with the experiment run.

Choose Explanations to see any Clarify explainability reports associated with the experiment run.

You can generate pre-training or post-training bias reports that analyze bias in datasets or model
predictions using labels and bias metrics with SageMaker Clarify. You can also use SageMaker Clarify to
generate explainability reports that document model behavior for global or local data samples. For more
information, see Amazon SageMaker Clarify Bias Detection and Model Explainability.


Debugging
You can debug model training progress with Amazon SageMaker Debugger and view debug output
tensors in the Studio Experiments UI. Choose the name of the run associated with the Debugger report
and choose Debugger.

Then, choose the training job name to view the associated Amazon SageMaker Debugger dashboard.

For more information, see Debug Training Jobs Using Amazon SageMaker Debugger.

Example notebooks for Amazon SageMaker Experiments
The following tutorials demonstrate how to track runs for various model training experiments. You can
view the resulting experiments in Studio after running the notebooks. To clean up the resources created
by a notebook, see Clean Up Amazon SageMaker Experiment Resources (p. 1600). For a tutorial that
showcases additional features of Studio, see Amazon SageMaker Studio Tour (p. 147).


Track experiments in a notebook environment


To learn more about tracking experiments in a notebook environment, see the following example
notebooks:

• Track an experiment while training a Keras model locally
• Track an experiment while training a Pytorch model locally or in your notebook

Track bias and explainability for your experiments with SageMaker Clarify
For a step-by-step guide on tracking bias and explainability for your experiments, see the following
example notebook:

• Fairness and Explainability with SageMaker Clarify

Track experiments for SageMaker training jobs using script mode
For more information about tracking experiments for SageMaker training jobs, see the following
example notebooks:

• Run a SageMaker Experiment with Pytorch Distributed Data Parallel - MNIST Handwritten Digits
Classification
• Track an experiment while training a Pytorch model with a SageMaker Training Job
• Train a TensorFlow model with a SageMaker training job and track it using SageMaker Experiments

Monitor experiment training metrics with AWS CloudTrail
The training metrics for Amazon SageMaker Experiments are integrated with AWS CloudTrail, a service
that provides a record of actions taken by a user, role, or an AWS service. CloudTrail captures all API calls
for BatchPutMetrics as events. SageMaker automatically calls BatchPutMetrics when you create an
experiment run using the SageMaker SDK for Python. AWS CloudTrail captures data related to calls for
resource type AWS::SageMaker::ExperimentTrialComponent.
Note
In the Studio Experiments UI, trials are referred to as run groups and trial components are
referred to as runs.

When you create an experiment run, you can also configure the continuous delivery of CloudTrail
events to an Amazon S3 bucket. Use CloudTrail to monitor all ingested training metrics for an
experiment run, including information such as the metric name, the training step of the recorded
metric, the timestamp, and the metric value. CloudTrail events also include the experiment
run ARN, the ID of the account that created the run, and the resource type, which should be
AWS::SageMaker::ExperimentTrialComponent.

To monitor BatchPutMetrics API calls as CloudTrail events, you must first set up the logging of data
plane API activity in CloudTrail. See Logging data events for trails for more information. For granular
control over which API calls you want to selectively log and pay for, you can filter CloudTrail events
by resource type. Specify AWS::SageMaker::ExperimentTrialComponent as a resource type

1599
Amazon SageMaker Developer Guide
Clean up experiment resources

to monitor calls to the BatchPutMetrics API. For more information, see DataResource in the AWS
CloudTrail API reference. To learn more about CloudTrail, see the AWS CloudTrail User Guide.

For an in-depth explanation of how Amazon SageMaker works with AWS CloudTrail, see Log Amazon
SageMaker API Calls with AWS CloudTrail (p. 3285).
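
As a sketch, the selective logging described above can be configured with Boto3 advanced event
selectors; the trail name here is a placeholder.

import boto3

cloudtrail = boto3.client("cloudtrail")

# Log data events only for SageMaker experiment trial components
# (shown as runs in the Studio Experiments UI).
cloudtrail.put_event_selectors(
    TrailName="my-trail",  # placeholder trail name
    AdvancedEventSelectors=[
        {
            "Name": "SageMaker experiment metric data events",
            "FieldSelectors": [
                {"Field": "eventCategory", "Equals": ["Data"]},
                {"Field": "resources.type",
                 "Equals": ["AWS::SageMaker::ExperimentTrialComponent"]},
            ],
        }
    ],
)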

The following is an example CloudTrail event for a training metric in an experiment run:

{
    ...
    "eventTime": "2022-12-14T21:53:41Z",
    "eventSource": "metrics-sagemaker.amazonaws.com",
    "eventName": "BatchPutMetrics",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "192.0.2.0",
    "userAgent": "aws-cli/2.7.25 Python/3.9.11 Linux/5.4.214-134.408.amzn2int.x86_64 exe/x86_64.amzn.2 prompt/off command/sm-metrics.batch-put-metrics",
    "requestParameters": {
        "trialComponentName": "trial-component-name",
        "metricData": [
            {
                "metricName": "foo",
                "timestamp": 1670366870000,
                "step": 101,
                "value": 0.9
            }
        ]
    },
    ...
    "resources": [
        {
            "accountId": "abcdef01234567890",
            "type": "AWS::SageMaker::ExperimentTrialComponent",
            "ARN": "arn:aws:sagemaker:us-east-1:1234567890abcdef0:experiment-trial-component/trial-component-name"
        }
    ],
    ...
}

Clean Up Amazon SageMaker Experiment Resources


To avoid incurring unnecessary charges, delete the Amazon SageMaker Experiment resources you no
longer need. You can't delete Experiment resources through the SageMaker Management Console or
the Amazon SageMaker Studio UI. This topic shows you how to clean up these resources using the
SageMaker Python SDK, Boto3, and the Experiments SDK.

Topics
• Clean Up Using the SageMaker Python SDK (Recommended) (p. 1600)
• Clean Up Using the Python SDK (Boto3) (p. 1601)
• Clean Up Using the Experiments SDK (p. 1601)

Clean Up Using the SageMaker Python SDK (Recommended)


To clean up using the SageMaker Python SDK

from sagemaker.experiments.experiment import Experiment

exp = Experiment.load(experiment_name=experiment_name, sagemaker_session=sm_session)
exp._delete_all(action="--force")

Clean Up Using the Python SDK (Boto3)


To clean up using Boto 3

import boto3
sm = boto3.Session().client('sagemaker')

Define cleanup_boto3

import time

def cleanup_boto3(experiment_name):
    trials = sm.list_trials(ExperimentName=experiment_name)['TrialSummaries']
    print('TrialNames:')
    for trial in trials:
        trial_name = trial['TrialName']
        print(f"\n{trial_name}")

        components_in_trial = sm.list_trial_components(TrialName=trial_name)
        print('\tTrialComponentNames:')
        for component in components_in_trial['TrialComponentSummaries']:
            component_name = component['TrialComponentName']
            print(f"\t{component_name}")
            sm.disassociate_trial_component(TrialComponentName=component_name,
                                            TrialName=trial_name)
            try:
                # comment out to keep trial components
                sm.delete_trial_component(TrialComponentName=component_name)
            except:
                # component is associated with another trial
                continue
            # to prevent throttling
            time.sleep(.5)
        sm.delete_trial(TrialName=trial_name)
    sm.delete_experiment(ExperimentName=experiment_name)
    print(f"\nExperiment {experiment_name} deleted")

Call cleanup_boto3

# Use experiment name not display name


experiment_name = "experiment-name"
cleanup_boto3(experiment_name)

Clean Up Using the Experiments SDK


To clean up using the Experiments SDK

import sys
!{sys.executable} -m pip install sagemaker-experiments

import time

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

Define cleanup_sme_sdk

def cleanup_sme_sdk(experiment):
    for trial_summary in experiment.list_trials():
        trial = Trial.load(trial_name=trial_summary.trial_name)
        for trial_component_summary in trial.list_trial_components():
            tc = TrialComponent.load(
                trial_component_name=trial_component_summary.trial_component_name)
            trial.remove_trial_component(tc)
            try:
                # comment out to keep trial components
                tc.delete()
            except:
                # tc is associated with another trial
                continue
            # to prevent throttling
            time.sleep(.5)
        trial.delete()
    experiment_name = experiment.experiment_name
    experiment.delete()
    print(f"\nExperiment {experiment_name} deleted")

Call cleanup_sme_sdk

experiment_to_cleanup = Experiment.load(
    # Use experiment name not display name
    experiment_name="experiment-name")

cleanup_sme_sdk(experiment_to_cleanup)

Additional supported SDK


Important
As of v2.123.0, SageMaker Experiments is now fully integrated with the SageMaker Python
SDK and you no longer need to use the separate SageMaker Experiments SDK. We recommend
creating an experiment with sagemaker.experiments.run rather than the following
smexperiments module.

The following section describes how to create a SageMaker Experiment with the SageMaker Experiments
SDK.

Create an Amazon SageMaker Experiment with the SageMaker Experiments SDK
Create an Amazon SageMaker experiment to track your SageMaker training, processing, and transform
jobs.

The following procedure shows you how to create a SageMaker experiment for a SageMaker training,
processing, or transform job. Steps labeled as (Studio) describe how to view the experiment in Amazon
SageMaker Studio. You don't have to run the experiment in Studio to view the experiment in Studio.

1. Import the sys module to install the SDKs.

import sys


2. (Optional) The Amazon SageMaker Python SDK comes preinstalled in SageMaker Studio. If you plan
to run your code outside Studio, install the SageMaker Python SDK.

!{sys.executable} -m pip install sagemaker

3. Install the SageMaker Experiments Python SDK.

!{sys.executable} -m pip install sagemaker-experiments

4. Import modules.

import time
from time import strftime

import sagemaker

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

5. Get the execution role and create the SageMaker session.

role = sagemaker.get_execution_role()
sm_sess = sagemaker.session.Session()

6. Create a SageMaker experiment. The experiment name must be unique in your account.
Note
The tags parameter is optional. You can search for the tag using Studio, the SageMaker
console, and the SDK. Tags can also be applied to trials and trial components.

create_date = strftime("%Y-%m-%d-%H-%M-%S")
demo_experiment = Experiment.create(experiment_name="DEMO-{}".format(create_date),
                                    description="Demo experiment",
                                    tags=[{'Key': 'demo-experiments', 'Value': 'demo1'}])

7. (Studio) To view the experiment in SageMaker Studio, in the left sidebar, choose the Experiments.

After the code runs, the experiment list contains the new experiment. It might take a moment for
the list to refresh and display the experiment. The filter on the experiment tag is also displayed.
Only experiments that have a matching tag are displayed. Your list should look similar to the
following:


8. Create a trial for the experiment. The trial name must be unique in your account.

demo_trial = Trial.create(trial_name="DEMO-{}".format(create_date),
                          experiment_name=demo_experiment.experiment_name,
                          tags=[{'Key': 'demo-trials', 'Value': 'demo1'}])

9. Create a trial component as part of the trial. The trial component is the SageMaker job.

Add the ExperimentConfig parameter to the appropriate method. The SageMaker jobs listed in the
following table are supported.

Job          SageMaker Python SDK method    Boto3 method

Training     Estimator.fit                  CreateTrainingJob
Processing   Processor.run                  CreateProcessingJob
Transform    Transformer.transform          CreateTransformJob

The following examples are for a training job. The Tags parameter adds a tag to the trial
component. ExperimentName isn't specified because the trial was associated with the experiment
when the trial was created in an earlier step.

Using the SageMaker Python SDK

estimator = sagemaker.estimator.Estimator(
    ...,
    sagemaker_session=sm_sess,
    tags=[{'Key': 'demo-jobs', 'Value': 'demo2'}])

estimator.fit(
    ...,
    experiment_config={
        # "ExperimentName"
        "TrialName": demo_trial.trial_name,
        "TrialComponentDisplayName": "TrainingJob",
    })

Using Boto3

create_training_job(
    ...,
    ExperimentConfig={
        # "ExperimentName"
        "TrialName": demo_trial.trial_name,
        "TrialComponentDisplayName": "TrainingJob",
    },
    Tags=[{'Key': 'demo-jobs', 'Value': 'demo2'}])

10. (Studio) In the experiment list, double-click the experiment to display a list of the trials in the
experiment. In the Studio UI, trials are referred to as run groups and trial components are referred to
as runs. Your list should look similar to the following:

11. (Studio) To view information about the experiment, trial, and job (trial component), see View, search,
and compare experiment runs (p. 1592).

To clean up the resources you created, see Clean Up Amazon SageMaker Experiment Resources (p. 1600).

Experiments FAQs
Refer to the following FAQ items for answers to commonly asked questions about SageMaker
Experiments.

Q. What is the recommended method to create an experiment?


A: Experiments are a collection of runs aimed at finding the best model to solve a problem. To initialize
a run within an experiment, use the SageMaker Python SDK Run class. For more examples, see Create an
Amazon SageMaker Experiment (p. 1587).


Q. Can I create an experiment using SageMaker script mode?

Yes. You can create experiments using SageMaker script mode. In the Jupyter notebook or Python
file you are using to define your estimator, initialize a run using the Run class. Within the run, launch
an estimator with your custom entry point script. Within that entry point script, use the load_run
method to load the run you defined in the notebook and log your metrics. For in-depth examples, see
Track experiments for SageMaker training jobs using script mode (p. 1599).

Q. What SageMaker jobs automatically create experiments?

SageMaker Hyperparameter Optimization (HPO) jobs (also known as tuning jobs) automatically create
experiments to track all the training jobs launched during a hyperparameter search. All other SageMaker
jobs create unassigned runs unless launched from within an experiment.

Q. What kind of SageMaker jobs can I create an experiment for?


You can use SageMaker Experiments to track metrics from training jobs, processing jobs, and transform
jobs.

Q. Why do I see experiments and runs in the Experiments Studio UI that I did not
create using the SageMaker Python SDK?
Experiment runs that are automatically created by SageMaker jobs and containers are visible in the
Experiments Studio UI by default. To hide runs created by SageMaker jobs for a given experiment,
choose the settings icon and toggle Show jobs.

Q. Is the SageMaker Experiments SDK still supported?


Yes, the SageMaker Experiments SDK is still supported. However, as of v2.123.0, SageMaker Experiments
is fully integrated with the SageMaker Python SDK. We recommend using the SageMaker Python
SDK to create experiments and runs. For more information, see Create an Amazon SageMaker
Experiment (p. 1587).

Q. Can I use distributed training with my experiments?


A: Yes. However, metrics for distributed training can be logged only at the epoch level. Be sure that you
only log metrics generated by the leader node, as shown in the following example:

...
if rank == 0:
    test_loss, correct, target, pred = test(model, test_loader, device, tracker)
    logger.info(
        "Test Average loss: {:.4f}, Test Accuracy: {:.0f}%;\n".format(
            test_loss, test_accuracy)
    )
    run.log_metric(name="train_loss", value=loss.item(), step=epoch)
    run.log_metric(name="test_loss", value=test_loss, step=epoch)
    run.log_metric(name="test_accuracy", value=test_accuracy, step=epoch)
...

For more information, see the Run a SageMaker Experiment with Pytorch Distributed Data Parallel -
MNIST Handwritten Digits Classification example notebook.

Q. What are unassigned runs?


A: All jobs in SageMaker (training jobs, processing jobs, transform jobs) correspond to runs. When
launching these jobs, TrialComponents are created by default. TrialComponents map directly to
runs. If these jobs are launched without being explicitly associated with an experiment or run, they are
created as unassigned runs.

Q. Do I need to pass the experiment run context to the training script when
running a SageMaker training job?
A: Yes. You need to load the run context into the training script, along with the SageMaker session
information.

import boto3
from sagemaker.session import Session
from sagemaker.experiments.run import load_run

session = Session(boto3.session.Session(region_name=args.region))

with load_run(sagemaker_session=session) as run:
    run.log_parameters(
        {"num_train_samples": len(train_set.data), "num_test_samples": len(test_set.data)}
    )

Q. How do I add a new run to an experiment analysis?


A: If you already created a comparison for your experiment and want to add a new run to analyze, select
all the runs from your previous analysis as well as the new run and choose Analyze. If you don’t see your
new run in the resulting analysis page, then refresh the Studio browser. Note that refreshing your Studio
browser may impact your other open tabs.

Search Using the Amazon SageMaker Console and API
Developing a machine learning model typically requires extensive experimenting with different datasets,
algorithms, and hyperparameter values. To manage up to thousands of machine learning model
experiments, use the search capabilities in SageMaker.

You can use SageMaker search to:

• Organize, find, and evaluate training jobs using properties, hyperparameters, performance metrics, or
any metadata.
• Find the best performing model by reviewing training job and model metrics, such as training loss or
validation accuracy.
• Trace a model's lineage to the training job and its related resources, such as the training datasets.

This topic covers searching from the SageMaker console and the SageMaker API.

Topics
• Organize, Find, and Evaluate Training Jobs (Console) (p. 1607)
• Find and Evaluate Training Jobs (API) (p. 1609)
• Verify the Datasets Used by Your Training Jobs (p. 1611)
• Trace Model Lineage (p. 1611)

Organize, Find, and Evaluate Training Jobs (Console)


To organize training jobs, assign one or more tags to them.


To find a specific training job, model, or resource, use model tracking to search on keywords assigned to
any searchable items. Searchable items include training jobs, models, hyperparameters, metadata, tags,
and URLs. To refine your tracking results, you can search using multiple criteria.

To choose the best model for deployment, evaluate how all models performed against one or more
metrics. You can use model tracking results to list, sort, and evaluate the performance of the models in
your experiments.

Topics
• Use Tags to Track Training Jobs (Console) (p. 1608)
• Find Training Jobs (Console) (p. 1608)
• Evaluate Models (Console) (p. 1609)

Use Tags to Track Training Jobs (Console)


To group training jobs, create tags with descriptive keys and a value. For example, create tag keys for:
project, owner, customer, and industry.

Add tags to training jobs (console)

1. Open the Amazon SageMaker console.


2. In the navigation pane, choose Training jobs and Create training job.
3. Scroll to the bottom of the page and enter a key and value for the tag.

4. To add another tag, choose Add tag, and add another key-value pair.
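You can also tag existing training jobs programmatically. The following is a minimal sketch that uses
the AddTags operation in the AWS SDK for Python (Boto3); the training job name and tag values are
placeholders.

import boto3

smclient = boto3.client('sagemaker')

# Look up the ARN of an existing training job (the name is a placeholder).
job = smclient.describe_training_job(TrainingJobName='my-training-job')

# Attach descriptive key-value tags to the training job.
smclient.add_tags(
    ResourceArn=job['TrainingJobArn'],
    Tags=[
        {'Key': 'project', 'Value': 'customer-churn'},
        {'Key': 'owner', 'Value': 'data-science-team'},
    ],
)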

Find Training Jobs (Console)


You can search for training jobs using a variety of job attributes. Note that some search parameters
appear only if you have created a training job with that attribute. For example, Tags appears only if you
have added a tag for a training job.

To find training jobs (console)

1. Open the Amazon SageMaker console.


2. In the navigation pane, choose Search.
3. Add Parameters.

a. In the search box, enter a parameter and choose a parameter type, for example
TrainingJobName.
b. Choose a conditional operation. For numeric values, use operators such as equals, less
than, or greater than. For text-based values, use operators such as equals or contains.
c. Enter a value for the parameter.


4. (Optional) To refine your search, add additional search criteria. Choose Add row and enter the
parameter values.
5. Choose Search.

Evaluate Models (Console)


To evaluate a model's performance, review its metadata, hyperparameters, and metrics. To highlight
metrics, adjust the view to show only metrics and important hyperparameters.

To evaluate a model (console)

1. Open the Amazon SageMaker console.


2. In the navigation pane, choose Search and search for training jobs by specifying relevant
parameters. The results are displayed in a table.

3. Open the preferences window by choosing the settings icon in the search results table.
4. To show or hide a hyperparameter or metric, turn it on or off by choosing Hyperparameter or Metric.
5. Make necessary changes, then choose Update view.
6. After viewing metrics and important hyperparameters, you can compare and contrast the results.
Then, you can choose the best model to host or investigate the models that are performing poorly.

Find and Evaluate Training Jobs (API)


To find and evaluate training jobs, or to get suggestions for searchable items used in experiments,
you can use the Search API.

Topics
• Find Training Jobs (API) (p. 1610)
• Evaluate Models (API) (p. 1609)
• Get Suggestions for a Search (API) (p. 1610)


Find Training Jobs (API)


To find training jobs, define a search parameter dictionary such as search_params in the following
example. Then call the search function on a SageMaker client (smclient) created with the AWS SDK for
Python (Boto3).

The following example shows how to use the Search API to find training jobs.

import boto3

search_params = {
    "MaxResults": 10,
    "Resource": "TrainingJob",
    "SearchExpression": {
        "Filters": [{
            "Name": "Tags.Project",
            "Operator": "Equals",
            "Value": "Project_Binary_Classifier"
        }]
    },
    "SortBy": "Metrics.train:binary_classification_accuracy",
    "SortOrder": "Descending"
}

smclient = boto3.client(service_name='sagemaker')
results = smclient.search(**search_params)

Evaluate Models (API)


To evaluate models, run a search as described in Find Training Jobs (API) (p. 1610), review model metrics,
and then use pandas to build a table of the results for display.

The following example shows how to evaluate models and to display the results in a table.

import pandas

headers = ["Training Job Name", "Training Job Status", "Batch Size",
           "Binary Classification Accuracy"]
rows = []
for result in results['Results']:
    trainingJob = result['TrainingJob']
    metrics = trainingJob['FinalMetricDataList']
    # Look up the final value of the objective metric for this training job.
    accuracy = metrics[[x['MetricName'] for x in metrics]
                       .index('train:binary_classification_accuracy')]['Value']
    rows.append([trainingJob['TrainingJobName'],
                 trainingJob['TrainingJobStatus'],
                 trainingJob['HyperParameters']['mini_batch_size'],
                 accuracy])

df = pandas.DataFrame(data=rows, columns=headers)

from IPython.display import display, HTML
display(HTML(df.to_html()))

Get Suggestions for a Search (API)


To get suggestions for a search, use the GetSearchSuggestions API.

The following example for AWS SDK for Python (Boto3) is a get_search_suggestions request for
items containing linear.

search_suggestion_params = {
    "Resource": "TrainingJob",
    "SuggestionQuery": {
        "PropertyNameQuery": {
            "PropertyNameHint": "linear"
        }
    }
}
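To send this request with the AWS SDK for Python (Boto3), pass the parameters to the
get_search_suggestions method of a SageMaker client, as in the following sketch.

import boto3

smclient = boto3.client('sagemaker')

response = smclient.get_search_suggestions(**search_suggestion_params)

# Print each suggested searchable property name from the response.
for suggestion in response['PropertyNameSuggestions']:
    print(suggestion['PropertyName'])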

The following is an example response for a get_search_suggestions request.

{
    'PropertyNameSuggestions': [
        {'PropertyName': 'hyperparameters.linear_init_method'},
        {'PropertyName': 'hyperparameters.linear_init_value'},
        {'PropertyName': 'hyperparameters.linear_init_sigma'},
        {'PropertyName': 'hyperparameters.linear_lr'},
        {'PropertyName': 'hyperparameters.linear_wd'}
    ]
}

After getting search suggestions, you can use one of the property names in a search.

Verify the Datasets Used by Your Training Jobs


You can use model tracking capability to verify which datasets were used in training, where holdout
datasets were used, and other details about training jobs. For example, use model tracking capability to
verify that a specific dataset was used in a training job for an audit or to verify compliance.

To check whether a specific dataset was used in a training job, you search for the URL to its location in
Amazon Simple Storage Service (Amazon S3). Model tracking capability returns the training jobs that
used the dataset that you specify. If your search doesn't return the dataset (the result is empty), the
dataset wasn't used in a training job. An empty result confirms, for example, that a holdout dataset
wasn't used.
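The following is a minimal sketch of such an audit query using the Search API; the dataset URL is a
placeholder, and the filter uses the searchable property for the Amazon S3 location of a training job's
input data.

import boto3

smclient = boto3.client('sagemaker')

# Search for training jobs whose input data channel references the dataset URL.
# The S3 URL below is a placeholder.
search_params = {
    "MaxResults": 10,
    "Resource": "TrainingJob",
    "SearchExpression": {
        "Filters": [{
            "Name": "InputDataConfig.DataSource.S3DataSource.S3Uri",
            "Operator": "Contains",
            "Value": "s3://my-bucket/datasets/holdout"
        }]
    },
}

results = smclient.search(**search_params)

# An empty list confirms that no training job used this dataset.
print([r['TrainingJob']['TrainingJobName'] for r in results['Results']])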

Trace Model Lineage


You can use model tracking capability to get information about the lineage of training jobs and the
model resources that were used for them, including the dataset, algorithm, hyperparameters, and
metrics. For example, if you find that the performance of a hosted model has declined, you can review its
training job and the resources it used to determine what's causing the problem.

Topics
• Trace Model Lineage (Console) (p. 1611)
• Trace Model Lineage (API) (p. 1611)

Trace Model Lineage (Console)


To trace a model's lineage (console)

1. Open the Amazon SageMaker console.


2. In the navigation pane, choose Endpoints, and choose the relevant endpoint.
3. Scroll to the Endpoint configuration settings section. This section lists all of the model versions
deployed at the endpoint, with a hyperlink to the training job that created each.

Trace Model Lineage (API)


To trace a model's lineage, get the model's name, then use it to search for training jobs.


The following example shows how to trace a model's lineage using the API.

# Get the name of the model deployed at the endpoint
endpoint_config = smclient.describe_endpoint_config(EndpointConfigName=endpointName)
model_name = endpoint_config['ProductionVariants'][0]['ModelName']

# Describe the model to find the location of its artifacts in Amazon S3
model = smclient.describe_model(ModelName=model_name)

# Search for the training job by the location of model artifacts in Amazon S3
search_params = {
    "MaxResults": 1,
    "Resource": "TrainingJob",
    "SearchExpression": {
        "Filters": [
            {
                "Name": "ModelArtifacts.S3ModelArtifacts",
                "Operator": "Equals",
                "Value": model['PrimaryContainer']['ModelDataUrl']
            }
        ]
    },
}
results = smclient.search(**search_params)

After finding the training job, you can review the resources used to train the model.

Perform Automatic Model Tuning with SageMaker


Amazon SageMaker automatic model tuning (AMT), also known as hyperparameter tuning, finds the best
version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm
and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create
the best-performing model, as measured by a metric that you choose.

For example, suppose that you want to solve a binary classification problem on a marketing dataset.
Your goal is to maximize the area under the curve (AUC) metric of the algorithm by training an XGBoost
Algorithm (p. 1369) model. You want to find the values for the eta, alpha, min_child_weight,
and max_depth hyperparameters that will train the best model. Specify a range of values for these
hyperparameters. Then, SageMaker hyperparameter tuning searches within these ranges to find a
combination of values that trains a model with the highest AUC. To conserve resources or meet a
specific model quality expectation, you can also set up completion criteria to stop tuning after the
criteria have been met.

You can use SageMaker AMT with built-in algorithms, custom algorithms, or SageMaker pre-built
containers for machine learning frameworks.

SageMaker AMT can use an Amazon EC2 Spot instance to optimize costs when running training jobs. For
more information, see Managed Spot Training in Amazon SageMaker (p. 2117).

Before you start using hyperparameter tuning, you should have a well-defined machine learning
problem, including the following:

• A dataset
• An understanding of the type of algorithm that you need to train
• A clear understanding of how you measure success

Prepare your dataset and algorithm so that they work in SageMaker and successfully run a training job
at least once. For information about setting up and running a training job, see Get Started with Amazon
SageMaker (p. 35).


Topics
• How Hyperparameter Tuning Works (p. 1613)
• Define metrics and environment variables (p. 1615)
• Define Hyperparameter Ranges (p. 1617)
• Track and set completion criteria for your tuning job (p. 1620)
• Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model (p. 1623)
• Example: Hyperparameter Tuning Job (p. 1628)
• Stop Training Jobs Early (p. 1640)
• Run a Warm Start Hyperparameter Tuning Job (p. 1641)
• Resource Limits for Automatic Model Tuning (p. 1645)
• Best Practices for Hyperparameter Tuning (p. 1647)

How Hyperparameter Tuning Works


When you build complex machine learning systems like deep learning neural networks, exploring all of
the possible combinations is impractical. Hyperparameter tuning can accelerate your productivity by
trying many variations of a model. It looks for the best model automatically by focusing on the most
promising combinations of hyperparameter values within the ranges that you specify. To get good
results, you must choose the right ranges to explore.

Use the API reference guide to understand how to interact with hyperparameter tuning. The examples on
this page can be found in the HyperParameterTuningJobConfig and HyperbandStrategyConfig APIs.
Note
Because the algorithm itself is stochastic, it’s possible that the hyperparameter tuning model
will fail to converge on the best answer. This can occur even if the best possible combination of
values is within the ranges that you choose.

Grid Search
When using grid search, hyperparameter tuning chooses combinations of values from the range of
categorical values that you specify when you create the job. Only categorical parameters are supported
when using the grid search strategy. You do not need to specify the MaxNumberOfTrainingJobs. The
number of training jobs created by the tuning job will be automatically calculated to be the total number
of distinct categorical combinations possible. If specified, the value of MaxNumberOfTrainingJobs
should equal the total number of distinct categorical combinations possible.

Random Search
When using random search, hyperparameter tuning chooses a random combination of values from
within the ranges that you specify for hyperparameters for each training job it launches. Because the
choice of hyperparameter values doesn't depend on the results of previous training jobs, you can run the
maximum number of concurrent training jobs without affecting the performance of the tuning.

For an example notebook that uses random search, see the Random search and hyperparameter scaling
with SageMaker XGBoost and Automatic Model Tuning notebook.

Bayesian Optimization
Bayesian optimization treats hyperparameter tuning like a regression problem. Given a set of input
features (the hyperparameters), hyperparameter tuning optimizes a model for the metric that
you choose. To solve a regression problem, hyperparameter tuning makes guesses about which
hyperparameter combinations are likely to get the best results, and runs training jobs to test these


values. After testing a set of hyperparameter values, hyperparameter tuning uses regression to choose
the next set of hyperparameter values to test.

Hyperparameter tuning uses an Amazon SageMaker implementation of Bayesian optimization.

When choosing the best hyperparameters for the next training job, hyperparameter tuning
considers everything that it knows about this problem so far. Sometimes it chooses a combination
of hyperparameter values close to the combination that resulted in the best previous training job to
incrementally improve performance. This allows hyperparameter tuning to exploit the best known
results. Other times, it chooses a set of hyperparameter values far removed from those it has tried. This
allows it to explore the range of hyperparameter values to try to find new areas that are not yet well
understood. The explore/exploit trade-off is common in many machine learning problems.

For more information about Bayesian optimization, see the following:

Basic Topics on Bayesian Optimization

• A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User
Modeling and Hierarchical Reinforcement Learning
• Practical Bayesian Optimization of Machine Learning Algorithms
• Taking the Human Out of the Loop: A Review of Bayesian Optimization

Speeding up Bayesian Optimization

• Google Vizier: A Service for Black-Box Optimization


• Learning Curve Prediction with Bayesian Neural Networks
• Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of
learning curves

Advanced Modeling and Transfer Learning

• Scalable Hyperparameter Transfer Learning


• Bayesian Optimization with Tree-structured Dependencies
• Bayesian Optimization with Robust Bayesian Neural Networks
• Scalable Bayesian Optimization Using Deep Neural Networks
• Input Warping for Bayesian Optimization of Non-stationary Functions

Hyperband
Hyperband is a multi-fidelity based tuning strategy that dynamically reallocates resources. Hyperband
uses both intermediate and final results of training jobs to re-allocate epochs to well-utilized
hyperparameter configurations and automatically stops those that underperform. It also seamlessly
scales to using many parallel training jobs. These features can significantly speed up hyperparameter
tuning over random search and Bayesian optimization strategies.

Hyperband should only be used to tune iterative algorithms that publish results at different resource
levels. For example, Hyperband can be used to tune a neural network for image classification which
publishes accuracy metrics after every epoch.

For more information about Hyperband, see the following links:

• Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization


• Massively Parallel Hyperparameter Tuning
• BOHB: Robust and Efficient Hyperparameter Optimization at Scale


• Model-based Asynchronous Hyperparameter and Neural Architecture Search

Hyperband with early stopping


Training jobs can be stopped early when they are unlikely to improve the objective metric of the
hyperparameter tuning job. This can help reduce compute time and avoid overfitting your model.
Hyperband uses an advanced internal mechanism to apply early stopping. Thus, the parameter
TrainingJobEarlyStoppingType in the HyperParameterTuningJobConfig API must be set to
OFF when using the Hyperband internal early stopping feature.
Note
Hyperparameter tuning might not improve your model. It is an advanced tool for building
machine learning solutions. As such, it should be considered part of the scientific development process.
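The following is a minimal sketch of a tuning job configuration that uses the Hyperband strategy; the
resource levels, objective metric, and limits are illustrative values, not recommendations.

tuning_job_config = {
    "Strategy": "Hyperband",
    # Hyperband applies its own early stopping, so this flag must be OFF.
    "TrainingJobEarlyStoppingType": "OFF",
    "StrategyConfig": {
        "HyperbandStrategyConfig": {
            "MinResource": 1,   # fewest epochs a configuration receives (illustrative)
            "MaxResource": 20   # most epochs a configuration receives (illustrative)
        }
    },
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:accuracy"
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 50,
        "MaxParallelTrainingJobs": 5
    },
    "ParameterRanges": {
        # Hyperparameter ranges as described in Define Hyperparameter Ranges
    }
}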

Define metrics and environment variables


A tuning job optimizes hyperparameters for training jobs that it launches by using a metric to evaluate
performance. This guide shows how to define metrics so that you can use a custom algorithm for
training, or use a built-in algorithm from Amazon SageMaker. This guide also shows how to specify
environment variables during an Automatic model tuning (AMT) job.

Define metrics
Amazon SageMaker hyperparameter tuning parses your machine learning algorithm's stdout and
stderr streams to find metrics, such as loss or validation-accuracy. The metrics show how well the
model is performing on the dataset.

The following sections describe how to use two types of algorithms for training: built-in and custom.

Use a built-in algorithm for training


If you use one of the SageMaker built-in algorithms, metrics are already defined for you. In addition,
built-in algorithms automatically send metrics to hyperparameter tuning for optimization. These metrics
are also written to Amazon CloudWatch logs. For more information, see Log Amazon SageMaker Events
with Amazon CloudWatch.

For the objective metric for the tuning job, choose one of the metrics that the built-in algorithm emits.
For a list of available metrics, see the model tuning section for the appropriate algorithm in Use Amazon
SageMaker Built-in Algorithms or Pre-trained Models.

You can choose up to 40 metrics to monitor in your tuning job. Select one of those metrics to be the
objective metric. The hyperparameter tuning job returns the training job that performed the best
against the objective metric.
Note
Hyperparameter tuning automatically sends an additional hyperparameter
_tuning_objective_metric to pass your objective metric to the tuning job for use during
training.

Use a custom algorithm for training


This section shows how to define your own metrics to use your own custom algorithm for training.
When doing so, make sure that your algorithm writes at least one metric to stderr or stdout.
Hyperparameter tuning parses these streams to find algorithm metrics that show how well the model is
performing on the dataset.

You can define custom metrics by specifying a name and regular expression for each metric that your
tuning job monitors. Then, pass these metric definitions to the CreateHyperParameterTuningJob
API in the TrainingJobDefinition parameter, in the MetricDefinitions field of
AlgorithmSpecification.

The following shows sample output from a log written to stderr or stdout by a training algorithm.

GAN_loss=0.138318; Scaled_reg=2.654134; disc:[-0.017371,0.102429] real 93.3% gen 0.0% disc-combined=0.000000; disc_train_loss=1.374587; Loss = 16.020744; Iteration 0 took 0.704s; Elapsed=0s

The following example shows metric definitions that use regular expressions (regex) to search the
sample log output and capture the numeric values of four different metrics.

[
    {
        "Name": "ganloss",
        "Regex": "GAN_loss=(.*?);"
    },
    {
        "Name": "disc-combined",
        "Regex": "disc-combined=(.*?);"
    },
    {
        "Name": "discloss",
        "Regex": "disc_train_loss=(.*?);"
    },
    {
        "Name": "loss",
        "Regex": "Loss = (.*?);"
    }
]

In regular expressions, parentheses () are used to group parts of the regular expression together.

• For the loss metric that is defined in the code example, the expression (.*?); captures any
character between the exact text "Loss = " and the first semicolon (;) character.
• The character . instructs the regular expression to match any character.
• The character * means to match zero or more characters.
• The characters *? together make the match non-greedy, so the expression captures only up to the
first instance of the ; character.

The loss metric defined in the code sample will capture the value 16.020744 from the text
Loss = 16.020744; in the sample output.
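To check a metric regular expression against your own log format before launching a tuning job, you
can apply it with Python's re module. The following sketch uses the sample log line and the metric
definitions from the example above.

import re

# The sample log line from the training algorithm shown above.
log_line = (
    "GAN_loss=0.138318; Scaled_reg=2.654134; disc:[-0.017371,0.102429] "
    "real 93.3% gen 0.0% disc-combined=0.000000; disc_train_loss=1.374587; "
    "Loss = 16.020744; Iteration 0 took 0.704s; Elapsed=0s"
)

metric_definitions = [
    {"Name": "ganloss", "Regex": "GAN_loss=(.*?);"},
    {"Name": "disc-combined", "Regex": "disc-combined=(.*?);"},
    {"Name": "discloss", "Regex": "disc_train_loss=(.*?);"},
    {"Name": "loss", "Regex": "Loss = (.*?);"},
]

# Each regex captures the numeric value between the metric label and the
# first semicolon, which is how hyperparameter tuning parses the streams.
for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        print(definition["Name"], "=", float(match.group(1)))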

Choose one of the metrics that you define as the objective metric for the tuning job. If you are using
the SageMaker API, specify the value of the name key in the HyperParameterTuningJobObjective
field of the HyperParameterTuningJobConfig parameter that you send to the
CreateHyperParameterTuningJob operation.

Specify environment variables


SageMaker AMT optimizes hyperparameters within a tuning job to find the best parameters for model
performance. You can use environment variables to configure your tuning job to change its behavior. You
can also use environment variables that you used during training inside your tuning job.

If you want to use an environment variable from your tuning job or specify a new environment variable,
input a string value for Environment within the SageMaker HyperParameterTrainingJobDefinition API.
Pass this training job definition to the CreateHyperParameterTuningJob API.

For example, the environment variable SM_LOG_LEVEL can be set to the following values to tailor the
output from a Python container.


NOTSET=0
DEBUG=10
INFO=20
WARN=30
ERROR=40
CRITICAL=50

As an example, to set the log level to 10 to debug your container logs, set the environment variable
inside the HyperParameterTrainingJobDefinition, as follows.

{
    "HyperParameterTuningJobConfig": {
        ...
    },
    "TrainingJobDefinition": {
        ...,
        "Environment": {
            "SM_LOG_LEVEL": "10"
        },
        ...
    },
    ...
}

Define Hyperparameter Ranges


This guide shows how to use SageMaker APIs to define hyperparameter ranges. It also provides a list of
hyperparameter scaling types that you can use.

Choosing hyperparameters and ranges significantly affects the performance of your tuning job.
Hyperparameter tuning finds the best hyperparameter values for your model by searching over a
range of values that you specify for each tunable hyperparameter. You can also specify up to 100
static hyperparameters that do not change over the course of the tuning job. You can use up to 100
hyperparameters in total (static + tunable). For guidance on choosing hyperparameters and ranges, see
Best Practices for Hyperparameter Tuning (p. 1647). You can also use autotune to find optimal tuning
job settings. For more information, see the following Autotune section.
Note
SageMaker Automatic Model Tuning (AMT) may add additional hyperparameter(s) that
contribute to the limit of 100 total hyperparameters. Currently, to pass your objective metric
to the tuning job for use during training, SageMaker adds _tuning_objective_metric
automatically.

Static hyperparameters
Use static hyperparameters for values that you don't want the tuning job to change. For example, you
can use AMT to tune your model using param1 (a tunable parameter) and param2 (a static parameter).
If you do, then use a search space for param1 that lies between two values, and pass param2 as a static
hyperparameter, as follows.

param1: ["range_min","range_max"]
param2: "static_value"

Static hyperparameters have the following structure:

"StaticHyperParameters": {

1617
Amazon SageMaker Developer Guide
Define Hyperparameter Ranges

"objective" : "reg:squarederror",
"dropout_rate": "0.3"
}

You can use the Amazon SageMaker API to specify key-value pairs in the StaticHyperParameters
field of the HyperParameterTrainingJobDefinition parameter that you pass to the
CreateHyperParameterTuningJob operation.

Dynamic hyperparameters
You can use the SageMaker API to define hyperparameter ranges. Specify the names of hyperparameters
and ranges of values in the ParameterRanges field of the HyperParameterTuningJobConfig
parameter that you pass to the CreateHyperParameterTuningJob operation.

The ParameterRanges field has three subfields: categorical, integer, and continuous. You can define up
to 30 total (categorical + integer + continuous) tunable hyperparameters to search over.
Note
Each categorical hyperparameter can have at most 30 different values.

Dynamic hyperparameters have the following structure:

"ParameterRanges": {
"CategoricalParameterRanges": [
{
"Name": "tree_method",
"Values": ["auto", "exact", "approx", "hist"]
}
],
"ContinuousParameterRanges": [
{
"Name": "eta",
"MaxValue" : "0.5",
"MinValue": "0",
"ScalingType": "Auto"
}
],
"IntegerParameterRanges": [
{
"Name": "max_depth",
"MaxValue": "10",
"MinValue": "1",
"ScalingType": "Auto"
}
]
}

If you create a tuning job with a Grid strategy, you can only specify categorical values. You don't
need to provide MaxNumberOfTrainingJobs; this value is inferred from the total number of
configurations that can be produced from your categorical parameters. If specified, the value of
MaxNumberOfTrainingJobs must equal the total number of distinct categorical combinations
possible.

Autotune
To save the time and resources otherwise spent searching for hyperparameter ranges, resource limits,
or objective metrics, autotune can automatically guess optimal values for some tuning job fields. Use
autotune to find optimal values for the following fields:

• ParameterRanges – The names and ranges of hyperparameters that a tuning job can optimize.


• ResourceLimits – The maximum resources to be used in a tuning job. These resources can include the
maximum number of training jobs, maximum runtime of a tuning job, and the maximum number of
training jobs that can be run at the same time.
• TrainingJobEarlyStoppingType – A flag that stops a training job if a job is not significantly improving
against an objective metric. Defaults to enabled. For more information, see Stop Training Jobs
Early (p. 1640).
• RetryStrategy – The number of times to retry a training job. Non-zero values for RetryStrategy can
increase the likelihood that your job will complete successfully.
• Strategy – Specifies how hyperparameter tuning chooses the combinations of hyperparameter values
to use for the training job that it launches.
• ConvergenceDetected – A flag to indicate that Automatic Model Tuning (AMT) has detected model
convergence.

To use autotune, do the following:

1. Specify the hyperparameter and an example value in the AutoParameters field of the
ParameterRanges API.
2. Enable autotune.

AMT will determine if your hyperparameters and example values are eligible for autotune.
Hyperparameters that can be used in autotune are automatically assigned to the appropriate
parameter range type. Then, AMT uses ValueHint to select an optimal range for you. You can use the
DescribeHyperParameterTuningJob API to view these ranges.

The following example shows you how to configure a tuning job that uses autotune. In the configuration
example, the hyperparameter max_depth has ValueHint containing an example value of 4.

config = {
    'Autotune': {'Mode': 'Enabled'},
    'HyperParameterTuningJobName': 'my-autotune-job',
    'HyperParameterTuningJobConfig': {
        'HyperParameterTuningJobObjective': {'Type': 'Minimize', 'MetricName': 'validation:rmse'},
        'ResourceLimits': {'MaxNumberOfTrainingJobs': 5, 'MaxParallelTrainingJobs': 1},
        'ParameterRanges': {
            'AutoParameters': [
                {'Name': 'max_depth', 'ValueHint': '4'}
            ]
        }
    },
    'TrainingJobDefinition': {
        ...
    }
}

Continuing the previous example, a tuning job is created after the previous configuration is included
in a call to the CreateHyperParameterTuningJob API. Then, autotune converts the max_depth
hyperparameter in AutoParameters to the hyperparameter IntegerParameterRanges. The
following response from a DescribeHyperParameterTuningJob API call shows that the optimal
IntegerParameterRanges for max_depth are between 2 and 8.

{
    'HyperParameterTuningJobName': 'my_job',
    'HyperParameterTuningJobConfig': {
        'ParameterRanges': {
            'IntegerParameterRanges': [
                {'Name': 'max_depth', 'MinValue': '2', 'MaxValue': '8'},
            ]
        }
    },
    'TrainingJobDefinition': {
        ...
    },
    'Autotune': {'Mode': 'Enabled'}
}

Hyperparameter scaling types


For integer and continuous hyperparameter ranges, you can choose the scale that you want
hyperparameter tuning to use to search the range of values. To do so, specify a value for the
ScalingType field of the hyperparameter range. You can choose from the following hyperparameter
scaling types:

Auto

SageMaker hyperparameter tuning chooses the best scale for the hyperparameter.
Linear

Hyperparameter tuning searches the values in the hyperparameter range by using a linear scale.
Typically, you choose this if the range of all values from the lowest to the highest is relatively small
(within one order of magnitude). Uniformly searching values from the range provides a reasonable
exploration of the entire range.
Logarithmic

Hyperparameter tuning searches the values in the hyperparameter range by using a logarithmic
scale.

Logarithmic scaling works only for ranges that have values greater than 0.

Choose logarithmic scaling when you're searching a range that spans several orders of magnitude.

For example, if you're tuning a linear learner model (see Tune a linear learner model (p. 1345)), and
you specify a range of values between .0001 and 1.0 for the learning_rate hyperparameter, consider
the following:
Searching uniformly on a logarithmic scale gives you a better sample of the entire range than
searching on a linear scale would. This is because searching on a linear scale would, on average,
devote 90 percent of your training budget to only the values between .1 and 1.0. As a result, that
leaves only 10 percent of your training budget for the values between .0001 and .1.
ReverseLogarithmic

Hyperparameter tuning searches the values in the hyperparameter range by using a reverse
logarithmic scale. Reverse logarithmic scaling is supported only for continuous hyperparameter
ranges. It is not supported for integer hyperparameter ranges.

Choose reverse logarithmic scaling when you are searching a range that is highly sensitive to small
changes that are very close to 1.

Reverse logarithmic scaling works only for ranges that are entirely within the range 0<=x<1.0.

For an example notebook that uses hyperparameter scaling, see these Amazon SageMaker
hyperparameter examples on GitHub.
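For example, a continuous range for the learning_rate hyperparameter discussed above could request
logarithmic scaling as follows; the parameter name and bounds are illustrative.

"ContinuousParameterRanges": [
    {
        "Name": "learning_rate",
        "MinValue": "0.0001",
        "MaxValue": "1.0",
        "ScalingType": "Logarithmic"
    }
]

With this setting, each order of magnitude between .0001 and 1.0 receives a comparable share of the
training budget.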

Track and set completion criteria for your tuning job


You can use completion criteria to instruct Automatic model tuning (AMT) to stop your tuning job
if certain conditions are met. With these conditions, you can set a minimum model performance or


maximum number of training jobs that don't improve when evaluated against the objective metric. You
can also track the progress of your tuning job and decide to let it continue or to stop it manually. This
guide shows you how to set completion criteria, check the progress of your tuning job, and stop it
manually.

Set completion criteria for your tuning job


During hyperparameter optimization, a tuning job will launch several training jobs inside a loop. The
tuning job will do the following.

• Check your training jobs for completion and update statistics accordingly.
• Decide what combination of hyperparameters to evaluate next.

AMT will continuously check the training jobs that were launched from your tuning job to update
statistics. These statistics include tuning job runtime and best training job. Then, AMT determines
whether it should stop the job according to your completion criteria. You can also check these statistics
and stop your job manually. For more information about stopping a job manually, see the Stopping your
tuning job manually (p. 1623) section.

As an example, if your tuning job meets your objective, you can stop tuning early to conserve resources
or ensure model quality. AMT checks your job performance against your completion criteria and stops
the tuning job if any have been met.

You can specify the following kinds of completion criteria:

• MaxNumberOfTrainingJobs – The maximum number of training jobs to be run before tuning is
stopped.
• MaxNumberOfTrainingJobsNotImproving – The maximum number of training jobs that do
not improve performance against the objective metric from the current best training job. As an
example, suppose that the best training job returned an objective metric with an accuracy of 90%,
and MaxNumberOfTrainingJobsNotImproving is set to 10. In this example, tuning will stop after 10
training jobs fail to return an accuracy higher than 90%.
• MaxRuntimeInSeconds – The upper limit of wall clock time in seconds of how long a tuning job can
run.
• TargetObjectiveMetricValue – The value of the objective metric against which the tuning job is
evaluated. Once this value is met, AMT stops the tuning job.
• CompleteOnConvergence – A flag to stop tuning after an internal algorithm determines that the
tuning job is unlikely to improve more than 1% over the objective metric from the best training job.

Selecting completion criteria


You can choose one or multiple completion criteria to stop your hyperparameter tuning job after a
condition has been met. The following instructions show you how to select completion criteria and how
to decide which is the most appropriate for your use case.

• Use MaxNumberOfTrainingJobs in the ResourceLimits API to set an upper limit for the number of
training jobs that can be run before your tuning job is stopped. Start with a large number and adjust it
based on model performance against your tuning job objective. Most users input values of around 50
or more training jobs to find an optimal hyperparameter configuration. Users looking for higher levels
of model performance will use 200 or more training jobs.
• Use MaxNumberOfTrainingJobsNotImproving in the BestObjectiveNotImproving API field to stop
training if model performance fails to improve after a specified number of jobs. Model performance
is evaluated against an objective function. After the MaxNumberOfTrainingJobsNotImproving
is met, AMT will stop the tuning job. Tuning jobs tend to make the most progress in the
beginning of the job. Improving model performance against an objective function will


require a larger number of training jobs towards the end of tuning. Select a value for
MaxNumberOfTrainingJobsNotImproving by checking the performance of similar training jobs
against your objective metric.
• Use MaxRuntimeInSeconds in the ResourceLimits API to set an upper limit for the amount of wall
clock time that the tuning job may take. Use this field to meet a deadline by which the tuning job must
complete or to limit compute resources.

To get an estimated total compute time in seconds for a tuning job, use the following formula:

Estimated max compute time in seconds = MaxRuntimeInSeconds * MaxParallelTrainingJobs * MaxInstancesPerTrainingJob

Note
The actual duration of a tuning job may deviate slightly from the value specified in this field.
• Use TargetObjectiveMetricValue in the TuningJobCompletionCriteria API to stop your tuning
job. You stop the tuning job after any training job that is launched by the tuning job reaches this
objective metric value. Use this field if your use case depends on reaching a specific performance level,
rather than spending compute resources to find the best possible model.
• Use CompleteOnConvergence in the TuningJobCompletionCriteria API to stop a tuning job after
AMT has detected that the tuning job has converged, and is unlikely to make further significant
progress. Use this field when it is not clear what values for any of the other completion criteria should
be used. AMT determines convergence based on an algorithm developed and tested on a wide range of
diverse benchmarks. A tuning job is defined to have converged when none of the training jobs return
significant improvement (1% or less). Improvement is measured against the objective metric returned
by the highest performing job, so far.

Combining different completion criteria


You can also combine any of the different completion criteria in the same tuning job. AMT will stop
the tuning job when any one of the completion criteria is met. For example, if you want to tune your
model until it meets an objective metric, but don't want to keep tuning if your job has converged, use the
following guidance.

• Specify TargetObjectiveMetricValue in the TuningJobCompletionCriteria API to set a target
objective metric value to reach.
• Set CompleteOnConvergence to Enabled to stop a tuning job if AMT has determined that model
performance is unlikely to improve.
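A minimal sketch of this combination in the HyperParameterTuningJobConfig follows; the target value
of 0.95 is illustrative.

"HyperParameterTuningJobConfig": {
    ...,
    "TuningJobCompletionCriteria": {
        "TargetObjectiveMetricValue": 0.95,
        "ConvergenceDetected": {"CompleteOnConvergence": "Enabled"}
    },
    ...
}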

Track tuning job progress


You can use the DescribeHyperParameterTuningJob API to track the progress of your tuning job at
any time while it is running. You don't have to specify completion criteria to obtain tracking information
for your tuning job. Use the following fields to obtain statistics about your tuning job.

• BestTrainingJob – An object that describes the best training job obtained so far, evaluated against your
objective metric. Use this field to check your current model performance and the value of the objective
metric of this best training job.
• ObjectiveStatusCounters – An object that specifies the total number of training jobs completed in a
tuning job. To estimate average duration of a tuning job, use ObjectiveStatusCounters and the
total runtime of a tuning job. You can use the average duration to estimate how much longer your
tuning job will run.
• ConsumedResources – The total resources, such as RuntimeInSeconds, consumed by your tuning
job. Compare ConsumedResources, found in the DescribeHyperParameterTuningJob API, against
BestTrainingJob in the same API. You can also compare ConsumedResources against the
response from the ListTrainingJobsForHyperParameterTuningJob API to assess whether your tuning
job is making satisfactory progress given the resources being consumed.
• TuningJobCompletionDetails – Tuning job completion information that includes the following:
• The timestamp of when convergence is detected if the job has converged.
• The number of training jobs that have not improved model performance. Model performance is
evaluated against the objective metric from the best training job.

Use the tuning job completion criteria to assess how likely your tuning job is to improve your model
performance. Model performance is evaluated against the best objective metric if it ran to completion.
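For example, assuming a Boto3 SageMaker client and a tuning job name (a placeholder below), you can
poll these fields while the job runs.

import boto3

smclient = boto3.client('sagemaker')

response = smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='my-tuning-job'   # placeholder name
)

# Best training job found so far, evaluated against the objective metric.
best = response.get('BestTrainingJob', {})
print('Best job:', best.get('TrainingJobName'))
print('Best objective metric:', best.get('FinalHyperParameterTuningJobObjectiveMetric'))

# Counts of training jobs and total resources consumed so far.
print('Objective status counters:', response.get('ObjectiveStatusCounters'))
print('Consumed resources:', response.get('ConsumedResources'))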

Stopping your tuning job manually


You can determine if you should let the tuning job run until it completes or if you should stop the
tuning job manually. To determine this, use the information returned by the
DescribeHyperParameterTuningJob API, as shown in the previous Track tuning job progress
section. As an example, if your model performance does not improve after several training jobs
complete, you may choose to stop the tuning job. Model performance is evaluated against the best
objective metric.

To stop the tuning job manually, use the StopHyperParameterTuningJob API and provide the name of
the tuning job to be stopped.
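For example, with the Boto3 SageMaker client used above (the job name is a placeholder):

smclient.stop_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='my-tuning-job'
)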

Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model
To create a new hyperparameter optimization (HPO) job with Amazon SageMaker that tunes multiple
algorithms, you must provide job settings that apply to all of the algorithms to be tested and a training
definition for each of these algorithms. You must also specify the resources you want to use for the
tuning job.

• The job settings to configure include warm starting, early stopping, and the tuning strategy. Warm
starting and early stopping are available only when tuning a single algorithm.
• The training job definition to specify the name, algorithm source, objective metric, and the range
of values, when required, to configure the set of hyperparameter values for each training job. It
configures the channels for data inputs, data output locations, and any checkpoint storage locations
for each training job. The definition also configures the resources to deploy for each training job,
including instance types and counts, managed spot training, and stopping conditions.
• The tuning job resources to deploy, including the maximum number of training jobs that the
hyperparameter tuning job can run concurrently and the maximum total number of training jobs that
it can run.

Get Started
You can create a new hyperparameter tuning job, clone a job, or add or edit tags for a job from the
console. You can also use the search feature to find jobs by their name, creation time, or status.
Alternatively, you can also create hyperparameter tuning jobs with the SageMaker API.

• In the console: To create a new job, open the Amazon SageMaker console at
https://console.aws.amazon.com/sagemaker/, choose Hyperparameter tuning jobs from the Training
menu, and then choose Create hyperparameter tuning job. Then follow the configuration steps to
create a training job for each algorithm that you want to use. These steps are documented in the
Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console) (p. 1624)
topic.


Note
When you start the configuration steps, note that the warm start and early stopping features
are not available to use with multi-algorithm HPO. If you want to use these features, you can
only tune a single algorithm at a time.
• With the API: For instructions on using the SageMaker API to create a hyperparameter tuning job,
see Example: Hyperparameter Tuning Job. When you call CreateHyperParameterTuningJob
to tune multiple algorithms, you must provide a list of training definitions using
TrainingJobDefinitions instead of specifying a single TrainingJobDefinition. Choose just
one of these definition types depending on the number of algorithms being tuned.

Topics
• Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console) (p. 1624)
• Manage Hyperparameter Tuning and Training Jobs (p. 1627)

Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console)
To create a new hyperparameter optimization (HPO) tuning job for one or more algorithms, you need to
define the settings for the tuning job, create training job definitions for each algorithm being tuned, and
configure the resources for the tuning job.

Topics
• Define job settings (p. 1624)
• Create Training Job Definitions (p. 1625)
• Configure Tuning Job Resources (p. 1627)
• Review and Create HPO Tuning Job (p. 1627)

Define job settings


Your tuning job settings are applied across all of the algorithms in the HPO tuning job. Warm start and
early stopping are available only when tuning a single algorithm. After you define the job settings you
will create individual training definitions for each algorithm or variation you want to tune.

Warm Start

If you cloned this job, you can choose to use the results from a previous tuning job to improve the
performance of this new tuning job. This is the warm start feature and it is only available when tuning
a single algorithm. When you choose this option, you can choose up to five previous hyperparameter
tuning jobs to use. Alternatively, you can use transfer learning to add additional data to the parent
tuning job. When you select this option, you choose one previous tuning job as the parent.
Note
Warm start is compatible only with tuning jobs created after October 1, 2018. For more
information, see Run a warm start job.

Early Stopping

To reduce compute time and avoid overfitting your model, training jobs can be stopped early when they
are unlikely to improve the current best objective metric of the hyperparameter tuning job. Like warm
start, this feature is only available when tuning a single algorithm. This is an automatic feature without


configuration options, and it’s disabled by default. For more information on how early stopping works,
the algorithms that support it, and how to use it with your own algorithms, see Stop Training Jobs Early.

Tuning Strategy

Tuning strategy can be random, Bayesian, or Hyperband. These selections specify how automatic
tuning algorithms search over the specified hyperparameter ranges (selected in a later step). Random
search chooses random combinations of values from the specified ranges and can be run sequentially
or in parallel. Bayesian optimization chooses values based on what is likely to get the best result
given what is known about the history of previous selections. Hyperband uses a multi-fidelity strategy
that dynamically allocates resources towards well-utilized jobs and automatically stops those that
underperform. The new configuration that starts after stopping other configurations is chosen randomly.
Hyperband can only be used with iterative algorithms. For more information about search strategies, see
How Hyperparameter Tuning Works.
Note
Hyperband uses an advanced internal mechanism to apply early stopping. Thus, the parameter
TrainingJobEarlyStoppingType in the HyperParameterTuningJobConfig API must be
set to OFF when using Hyperband's internal early stopping feature.

Tags

You enter tags as key-value pairs to assign metadata to tuning jobs to help you manage them. Values
are not required; you can use just the key. To see the keys associated with a job, choose the Tags tab
on the details page for the tuning job. For more information about using tags for tuning jobs, see
Manage Hyperparameter Tuning and Training Jobs (p. 1627).

Create Training Job Definitions


To create a training job definition, you need to configure the algorithm and parameters, define the data
input and output, and configure resources. You must provide at least one TrainingJobDefinition
for each HPO tuning job. Each training definition specifies the configuration for an algorithm. To create
several definitions for your training job, you can clone a job definition. Cloning a job can save time
because it copies all of the job settings, including data channels and S3 storage locations for output
artifacts. You can then edit the cloned definition to make just the changes needed to configure the
algorithm options.

Topics
• Configure algorithm and parameters (p. 1625)
• Define Data Input and Output (p. 1626)
• Configure Training Job Resources (p. 1627)
• Add or Clone a Training Job (p. 1627)

Configure algorithm and parameters

Each training job definition for a tuning job requires a name, permission to access services, and the
specification of algorithm options, an objective metric, and the range of values, when required, to
configure the set of hyperparameter values for each training job.

Name

Provide a unique name for the training definition.

Permissions

Amazon SageMaker requires permissions to call other services on your behalf. Choose an IAM role or let
AWS create a role that has the AmazonSageMakerFullAccess IAM policy attached.


Optional Security Settings

The network isolation setting prevents the container from making any outbound network calls. This is
required for AWS Marketplace machine learning offerings.

You can also choose to use a private VPC.


Note
Inter-container encryption is only available when creating job definitions from the API.

Algorithm Options

You can choose one of the built-in algorithms, your own algorithm, your own container with an
algorithm, or you can subscribe to an algorithm from AWS Marketplace.

• If you choose a built-in algorithm, it has the ECR image information pre-populated.
• If you choose your own container, you must specify the ECR image information. You can select the
input mode for the algorithm as file or pipe.
• If you plan to supply your data using a .CSV file from Amazon S3, select the file input mode.

Metrics

When you choose a built-in algorithm, metrics are provided for you. If you choose your own algorithm,
you need to define your metrics. You can define up to 20 metrics for your tuning job to monitor, one
of which must be chosen as the objective metric. For more information on how to define a metric for a
tuning job, see Define metrics (p. 1615).

Objective Metric

To find the best training job, set an objective metric and whether to maximize or minimize it. After the
training job is complete, you can view the tuning job detail page for a summary of the best training job
found using this objective metric.

Hyperparameter Configuration

When you choose a built-in algorithm, the default values for its hyperparameters are set for you, using
ranges that are optimized for the algorithm being tuned. You can change these values as you see fit. For
example, instead of a range, you can set a fixed value for a hyperparameter by setting the parameter’s
type to static. Each algorithm has different required and optional parameters. For more information,
see Best Practices for Hyperparameter Tuning and Define Hyperparameter Ranges.

Define Data Input and Output

Each training job definition for a tuning job must configure the channels for data inputs, data output
locations, and optionally any checkpoint storage locations for each training job.

Input Data Configuration

Input data is defined by channels, each with their own source location (Amazon S3 or Amazon Elastic
File System), compression, and format options. You can define up to 20 channels of input sources. If the
algorithm you chose supports multiple input channels, you can specify those too. For example, when
using the XGBoost churn prediction notebook, you could add two channels: train and validation.

Checkpoint Configuration

Checkpoints are periodically generated during training. You must choose an Amazon S3 location for
the checkpoints to be saved. Checkpoints are used in metrics reporting, and are also used to resume
managed spot training jobs. For more information, see Use Checkpoints in Amazon SageMaker (p. 2142).


Output Data Configuration

You must define an Amazon S3 location for the artifacts of the training job to be stored. You have the
option of adding encryption to the output using an AWS Key Management Service (AWS KMS) key.

Configure Training Job Resources

Each training job definition for a tuning job must configure the resources to deploy, including instance
types and counts, managed spot training, and stopping conditions.

Resource Configuration

Each training definition can have a different resource configuration. You choose the instance type and
number of nodes.

Managed spot training

You can save compute costs for jobs if you have flexibility in start and end times by allowing SageMaker
to use spare capacity to run jobs. For more information, see Managed Spot Training in Amazon
SageMaker (p. 2117).

Stopping condition

The stopping condition specifies the maximum duration allowed per training job.

Add or Clone a Training Job

Once you have created a training job definition for a tuning job, you are returned to the Training
Job Definition(s) panel, where you can create additional training job definitions to train additional
algorithms. You can select Add training job definition and work through the steps to define a
training job again, or choose Clone from the Action menu to replicate an existing training job definition
and then edit it for the new algorithm. The clone option can save time because it copies all of the job's
settings, including the data channels and S3 storage locations. For more information on cloning, see
Manage Hyperparameter Tuning and Training Jobs (p. 1627).

Configure Tuning Job Resources


Resource Limits

You can specify the maximum number of training jobs that a hyperparameter tuning job can run
concurrently (10 at most) and the maximum number of training jobs that the hyperparameter tuning
job can run in total (500 at most). The number of parallel jobs should not exceed the number of nodes
you have requested across all of your training definitions. The total number of jobs can't exceed the
number of jobs that your definitions are expected to run.
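In the CreateHyperParameterTuningJob API, these limits correspond to the ResourceLimits field of
HyperParameterTuningJobConfig. The following sketch uses the console maximums noted above;
choose values that fit your own budget.

"ResourceLimits": {
    "MaxNumberOfTrainingJobs": 500,
    "MaxParallelTrainingJobs": 10
}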

Review and Create HPO Tuning Job


Review the job settings, the training job definition(s), and resource limits. Then select Create
hyperparameter tuning job.

Manage Hyperparameter Tuning and Training Jobs


A tuning job can contain many training jobs, and creating and managing these jobs and their definitions
can become a complex and onerous task. SageMaker provides tools to help facilitate the management
of these jobs. Tuning jobs you have run can be accessed from the Amazon SageMaker console at
https://console.aws.amazon.com/sagemaker/. Select Hyperparameter tuning jobs from the Training
menu to see the list. This page is also where you start the procedure to create a new tuning job by
selecting Create hyperparameter tuning job.


To see the training jobs run as part of a tuning job, select one of the hyperparameter tuning jobs from
the list. The tabs on the tuning job page allow you to inspect the training jobs, their definitions, the tags
and configuration used for the tuning job, and the best training job found during tuning. You can select
the best training job, or any of the other training jobs that belong to the tuning job, to see all of their
settings. From here you can create a model that uses the hyperparameter values found by a training job
by selecting Create Model, or you can clone the training job by selecting Clone.

Cloning

You can save time by cloning a training job that belongs to a hyperparameter tuning job. Cloning copies
all of the job's settings, including data channels and S3 storage locations for output artifacts. You can do
this for training jobs you have already run from the tuning job page, as just described, or when you are
creating additional training job definitions while creating a hyperparameter tuning job, as described in
the Add or Clone a Training Job (p. 1627) step of that procedure.

Tagging

Automatic Model Tuning launches multiple training jobs within a single parent tuning job to discover the
ideal weighting of model hyperparameters. Tags can be added to the parent tuning job as described in
the Define job settings (p. 1624) section, and these tags are then propagated to the individual training
jobs underneath. Customers can use these tags for purposes such as cost allocation or access control. To
add tags using the SageMaker SDK, use the AddTags API. For more information about using tagging for
AWS resources, see Tagging AWS resources.

Example: Hyperparameter Tuning Job


This example shows how to create a new notebook for configuring and launching a hyperparameter
tuning job. The tuning job uses the XGBoost Algorithm (p. 1369) to train a model to predict whether a
customer will enroll for a term deposit at a bank after being contacted by phone.

You use the low-level SDK for Python (Boto3) to configure and launch the hyperparameter tuning job,
and the AWS Management Console to monitor the status of hyperparameter tuning jobs. You can also
use the high-level Amazon SageMaker Python SDK to configure, run, monitor, and analyze
hyperparameter tuning jobs. For more information, see https://github.com/aws/sagemaker-python-sdk.

Prerequisites
To run the code in this example, you need:

• An AWS account and an administrator user (p. 36)


• An Amazon S3 bucket for storing your training dataset and the model artifacts created during training
• A running SageMaker notebook instance (p. 88)

Topics
• Create a Notebook Instance (p. 1629)
• Get the Amazon SageMaker Boto 3 Client (p. 1629)
• Get the SageMaker Execution Role (p. 1629)
• Specify a S3 Bucket to Upload Training Datasets and Store Output Data (p. 1630)
• Download, Prepare, and Upload Training Data (p. 1630)
• Configure and Launch a Hyperparameter Tuning Job (p. 1631)
• Clean up (p. 1639)


Create a Notebook Instance


Create a Jupyter notebook that contains a pre-installed environment with the default Anaconda
installation and Python3.

To create a Jupyter notebook

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Open a running notebook instance by choosing Open next to its name. The Jupyter notebook server
page appears:

3. To create a notebook, choose Files, New, and conda_python3.


4. Name the notebook.

Next Step
Get the Amazon SageMaker Boto 3 Client (p. 1629)

Get the Amazon SageMaker Boto 3 Client


Import the Amazon SageMaker Python SDK, the AWS SDK for Python (Boto3), and other Python libraries.
In a new Jupyter notebook, paste the following code into the first cell:

import sagemaker
import boto3

import numpy as np   # For performing matrix operations and numerical processing
import pandas as pd  # For manipulating tabular data
from time import gmtime, strftime
import os

region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')

The preceding code cell defines region and smclient objects that you will use to call the built-in
XGBoost algorithm and set up the SageMaker hyperparameter tuning job.

Next Step
Get the SageMaker Execution Role (p. 1629)

Get the SageMaker Execution Role


Get the execution role for the notebook instance. This is the IAM role that you created for your notebook
instance. You pass the role to the tuning job.


from sagemaker import get_execution_role

role = get_execution_role()
print(role)

Next Step
Specify an S3 Bucket to Upload Training Datasets and Store Output Data (p. 1630)

Specify an S3 Bucket to Upload Training Datasets and Store Output Data

Set up an S3 bucket to upload training datasets and save training output data.

To use a default S3 bucket

Use the following code to specify the default S3 bucket allocated for your SageMaker session. prefix is
the path within the bucket where SageMaker stores the data for the current training job.

sess = sagemaker.Session()
bucket = sess.default_bucket() # Set a default S3 bucket
prefix = 'DEMO-automatic-model-tuning-xgboost-dm'

(Optional) To use a specific S3 bucket

If you want to use a specific S3 bucket, use the following code and replace the string with the exact name
of the S3 bucket. The name of the bucket must contain sagemaker, and be globally unique. The bucket
must be in the same AWS Region as the notebook instance that you use for this example.

bucket = "sagemaker-your-preferred-s3-bucket"

sess = sagemaker.Session(
    default_bucket=bucket
)

Note
The name of the bucket doesn't need to contain sagemaker if the IAM role that you use to run
the hyperparameter tuning job has a policy that gives the S3FullAccess permission.

Next Step
Download, Prepare, and Upload Training Data (p. 1630)

Download, Prepare, and Upload Training Data


For this example, you use a training dataset of information about bank customers that includes
the customer's job, marital status, and how they were contacted during the bank's direct marketing
campaign. To use a dataset for a hyperparameter tuning job, you download it, transform the data, and
then upload it to an Amazon S3 bucket.

For more information about the dataset and the data transformation that the example performs, see the
hpo_xgboost_direct_marketing_sagemaker_APIs notebook in the Hyperparameter Tuning section of the
SageMaker Examples tab in your notebook instance.

Download and Explore the Training Dataset


To download and explore the dataset, run the following code in your notebook:


!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500) # Make sure we can see all of the columns
pd.set_option('display.max_rows', 5) # Keep the output on one page
data

Prepare and Upload Data


Before creating the hyperparameter tuning job, prepare the data and upload it to an S3 bucket where
the hyperparameter tuning job can access it.

Run the following code in your notebook:

# Indicator variable to capture when pdays takes a value of 999
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)

# Indicator for individuals not actively employed
data['not_working'] = np.where(np.in1d(data['job'], ['student', 'retired', 'unemployed']), 1, 0)

# Convert categorical variables to sets of indicators
model_data = pd.get_dummies(data)
model_data

model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx',
                              'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))])

pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)],
          axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)],
          axis=1).to_csv('validation.csv', index=False, header=False)
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)],
          axis=1).to_csv('test.csv', index=False, header=False)

boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

Next Step
Configure and Launch a Hyperparameter Tuning Job (p. 1631)

Configure and Launch a Hyperparameter Tuning Job


A hyperparameter is a high-level parameter that influences the learning process during model
training. To get the best model predictions, you can optimize a hyperparameter configuration or set
hyperparameter values. The process of finding an optimal configuration is called hyperparameter tuning.
To configure and launch a hyperparameter tuning job, complete the steps in these guides.

Topics
• Settings for the hyperparameter tuning job (p. 1632)
• Configure the training jobs (p. 1633)
• Name and launch the hyperparameter tuning job (p. 1635)
• Monitor the Progress of a Hyperparameter Tuning Job (p. 1636)
• View the Status of the Training Jobs (p. 1638)


• View the Best Training Job (p. 1639)

Settings for the hyperparameter tuning job


To specify settings for the hyperparameter tuning job, define a JSON object when you create the tuning
job. Pass this JSON object as the value of the HyperParameterTuningJobConfig parameter to the
CreateHyperParameterTuningJob API.

In this JSON object, you specify the following:

• HyperParameterTuningJobObjective – The objective metric used to evaluate the performance of
the training jobs launched by the hyperparameter tuning job.
• ParameterRanges – The range of values that a tunable hyperparameter can use during optimization.
For more information, see Define Hyperparameter Ranges (p. 1617).
• RandomSeed – A value used to initialize a pseudo-random number generator. Setting a random seed
allows the hyperparameter tuning search strategies to produce more consistent configurations for
the same tuning job (optional).
• ResourceLimits – The maximum number of training jobs and parallel training jobs that the
hyperparameter tuning job can use.

Note
If you use your own algorithm for hyperparameter tuning, rather than a SageMaker built-
in algorithm, you must define metrics for your algorithm. For more information, see Define
metrics (p. 1615).

The following code example shows how to configure a hyperparameter tuning job using the
built-in XGBoost algorithm. The code example shows how to define ranges for the eta, alpha,
min_child_weight, and max_depth hyperparameters. For more information about these and other
hyperparameters, see XGBoost Parameters.

In this code example, the objective metric for the hyperparameter tuning job finds the hyperparameter
configuration that maximizes validation:auc. SageMaker built-in algorithms automatically write the
objective metric to CloudWatch Logs. The following code example also shows how to set a RandomSeed.

tuning_job_config = {
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [
            {
                "MaxValue": "1",
                "MinValue": "0",
                "Name": "eta"
            },
            {
                "MaxValue": "2",
                "MinValue": "0",
                "Name": "alpha"
            },
            {
                "MaxValue": "10",
                "MinValue": "1",
                "Name": "min_child_weight"
            }
        ],
        "IntegerParameterRanges": [
            {
                "MaxValue": "10",
                "MinValue": "1",
                "Name": "max_depth"
            }
        ]
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:auc",
        "Type": "Maximize"
    },
    "RandomSeed": 123
}

Configure the training jobs


The hyperparameter tuning job will launch training jobs to find an optimal configuration
of hyperparameters. These training jobs should be configured using the SageMaker
CreateHyperParameterTuningJob API.

To configure the training jobs, define a JSON object and pass it as the value of the
TrainingJobDefinition parameter inside CreateHyperParameterTuningJob.

In this JSON object, you can specify the following:

• AlgorithmSpecification – The registry path of the Docker image containing the training
algorithm and related metadata. To specify an algorithm, you can use your own custom built
algorithm inside a Docker container or a SageMaker built-in algorithm (required).
• InputDataConfig – The input configuration, including the ChannelName, ContentType, and data
source for your training and test data (required).
• OutputDataConfig – The storage location for the algorithm's output. Specify the S3 bucket where
you want to store the output of the training jobs.
• RoleArn – The Amazon Resource Name (ARN) of an AWS Identity and Access Management (IAM) role
that SageMaker uses to perform tasks. Tasks include reading input data, downloading a Docker image,
writing model artifacts to an S3 bucket, writing logs to Amazon CloudWatch Logs, and writing metrics
to Amazon CloudWatch (required).
• StoppingCondition – The maximum runtime in seconds that a training job can run before being
stopped. This value should be greater than the time needed to train your model (required).
• MetricDefinitions – The name and regular expression that defines any metrics that the training
jobs emit. Define metrics only when you use a custom training algorithm. The example in the following
code uses a built-in algorithm, which already has metrics defined. For information about defining
metrics, see Define metrics (p. 1615) (optional).
• TrainingImage – The Docker container image that specifies the training algorithm (optional).
• StaticHyperParameters – The names and values of hyperparameters that are not tuned in the
tuning job (optional).

The following code example sets static values for the eval_metric, num_round, objective,
rate_drop, and tweedie_variance_power parameters of the XGBoost Algorithm (p. 1369) built-in
algorithm.


SageMaker Python SDK v1

from sagemaker.amazon.amazon_estimator import get_image_uri

training_image = get_image_uri(region, 'xgboost', repo_version='1.0-1')

s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_validation = 's3://{}/{}/validation/'.format(bucket, prefix)

training_job_definition = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train
                }
            }
        },
        {
            "ChannelName": "validation",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation
                }
            }
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
        "eval_metric": "auc",
        "num_round": "100",
        "objective": "binary:logistic",
        "rate_drop": "0.3",
        "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 43200
    }
}

SageMaker Python SDK v2

training_image = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1')

s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_validation = 's3://{}/{}/validation/'.format(bucket, prefix)

training_job_definition = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train
                }
            }
        },
        {
            "ChannelName": "validation",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation
                }
            }
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
        "eval_metric": "auc",
        "num_round": "100",
        "objective": "binary:logistic",
        "rate_drop": "0.3",
        "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 43200
    }
}

Name and launch the hyperparameter tuning job


After you configure the hyperparameter tuning job, you can launch it by calling the
CreateHyperParameterTuningJob API. The following code example uses the tuning_job_config
and training_job_definition objects that were defined in the previous two code examples to create a
hyperparameter tuning job.

tuning_job_name = "MyTuningJob"

1635
Amazon SageMaker Developer Guide
Example: Hyperparameter Tuning Job

smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
HyperParameterTuningJobConfig =
tuning_job_config,
TrainingJobDefinition = training_job_definition)

Monitor the Progress of a Hyperparameter Tuning Job


To monitor the progress of a hyperparameter tuning job and the training jobs that it launches, use the
Amazon SageMaker console.

Topics
• View the Status of the Hyperparameter Tuning Job (p. 1636)

View the Status of the Hyperparameter Tuning Job

To view the status of the hyperparameter tuning job

1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.


2. Choose Hyperparameter tuning jobs.


3. In the list of hyperparameter tuning jobs, check the status of the hyperparameter tuning job you
launched. A tuning job can be:

• Completed—The hyperparameter tuning job successfully completed.


• InProgress—The hyperparameter tuning job is in progress. One or more training jobs are still
running.
• Failed—The hyperparameter tuning job failed.
• Stopped—The hyperparameter tuning job was manually stopped before it completed. All training
jobs that the hyperparameter tuning job launched are stopped.
• Stopping—The hyperparameter tuning job is in the process of stopping.
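
You can also check the status programmatically. The following is a minimal sketch that reuses the
smclient and tuning_job_name objects defined earlier in this example:

tuning_job = smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

# One of: Completed, InProgress, Failed, Stopped, Stopping
print(tuning_job['HyperParameterTuningJobStatus'])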

View the Status of the Training Jobs

To view the status of the training jobs that the hyperparameter tuning job launched

1. In the list of hyperparameter tuning jobs, choose the job that you launched.
2. Choose Training jobs.

3. View the status of each training job. To see more details about a job, choose it in the list of training
jobs. To view a summary of the status of all of the training jobs that the hyperparameter tuning job
launched, see Training job status counter.

A training job can be:

• Completed—The training job successfully completed.


• InProgress—The training job is in progress.
• Stopped—The training job was manually stopped before it completed.
• Failed (Retryable)—The training job failed, but can be retried. A failed training job can be
retried only if it failed because an internal service error occurred.
• Failed (Non-retryable)—The training job failed and can't be retried. A failed training job
can't be retried when a client error occurs.


Note
Hyperparameter tuning jobs can be stopped and the underlying resources deleted, but the
jobs themselves cannot be deleted.

View the Best Training Job


A hyperparameter tuning job uses the objective metric that each training job returns to evaluate training
jobs. While the hyperparameter tuning job is in progress, the best training job is the one that has
returned the best objective metric so far. After the hyperparameter tuning job is complete, the best
training job is the one that returned the best objective metric.

To view the best training job, choose Best training job.

To deploy the best training job as a model that you can host at a SageMaker endpoint, choose Create
model.
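
You can also retrieve the best training job programmatically. The following is a minimal sketch that
reuses the smclient and tuning_job_name objects defined earlier in this example:

tuning_job = smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)
best = tuning_job['BestTrainingJob']

# Name of the best training job and the hyperparameter values it used
print(best['TrainingJobName'])
print(best['TunedHyperParameters'])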

Next Step

Clean up (p. 1639)

Clean up
To avoid incurring unnecessary charges, when you are done with the example, use the AWS Management
Console to delete the resources that you created for it.
Note
If you plan to explore other examples, you might want to keep some of these resources, such as
your notebook instance, S3 bucket, and IAM role.

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/ and delete the
notebook instance. Stop the instance before deleting it.
2. Open the Amazon S3 console at https://console.aws.amazon.com/s3/ and delete the bucket that you
created to store model artifacts and the training dataset.


3. Open the IAM console at https://console.aws.amazon.com/iam/ and delete the IAM role. If you
created permission policies, you can delete them, too.
4. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/ and delete
all of the log groups that have names starting with /aws/sagemaker/.

Stop Training Jobs Early


A hyperparameter tuning job can stop the training jobs that it launches early when they are not
improving significantly as measured by the objective metric. Stopping training jobs early can help reduce
compute time and helps you avoid overfitting your model. To configure a hyperparameter tuning job to
stop training jobs early, do one of the following:

• If you are using the AWS SDK for Python (Boto3), set the TrainingJobEarlyStoppingType field of
the HyperParameterTuningJobConfig object that you use to configure the tuning job to AUTO.
• If you are using the Amazon SageMaker Python SDK, set the early_stopping_type parameter of
the HyperParameterTuner object to Auto.
• In the Amazon SageMaker console, in the Create hyperparameter tuning job workflow, under Early
stopping, choose Auto.
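
For example, the following is a minimal sketch of the Boto3 option; it assumes the
tuning_job_config object defined earlier in this chapter:

# Early stopping is off by default; set the field before calling
# CreateHyperParameterTuningJob
tuning_job_config["TrainingJobEarlyStoppingType"] = "Auto"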

For a sample notebook that demonstrates how to use early stopping, see
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/image_classification_early_stopping/hpo_image_classification_early_stopping.ipynb
or open the hpo_image_classification_early_stopping.ipynb notebook in the Hyperparameter
Tuning section of the SageMaker Examples tab in a notebook instance. For information about using
sample notebooks in a notebook instance, see Example Notebooks (p. 220).

How Early Stopping Works


When you enable early stopping for a hyperparameter tuning job, SageMaker evaluates each training job
the hyperparameter tuning job launches as follows:

• After each epoch of training, get the value of the objective metric.
• Compute the running average of the objective metric for all previous training jobs up to the same
epoch, and then compute the median of all of the running averages.
• If the value of the objective metric for the current training job is worse (higher when minimizing
or lower when maximizing the objective metric) than the median value of running averages of the
objective metric for previous training jobs up to the same epoch, SageMaker stops the current training
job.
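
The following is an illustrative sketch of this median-stopping check, not SageMaker's internal
implementation; prior_jobs_metrics is a hypothetical list of per-epoch objective values, one list
per previously launched training job:

import numpy as np

def should_stop(current_value, prior_jobs_metrics, epoch, minimize=True):
    # Running average of each previous job's objective metric up to this epoch
    running_avgs = [np.mean(job[:epoch + 1])
                    for job in prior_jobs_metrics if len(job) > epoch]
    if not running_avgs:
        return False
    median = np.median(running_avgs)
    # "Worse" is higher when minimizing and lower when maximizing
    return current_value > median if minimize else current_value < median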

Algorithms That Support Early Stopping


To support early stopping, an algorithm must emit objective metrics for each epoch. The following built-
in SageMaker algorithms support early stopping:

• LightGBM (p. 1336)


• CatBoost (p. 1308)
• AutoGluon-Tabular (p. 1301)
• TabTransformer (p. 1362)
• Linear Learner Algorithm (p. 1345)—Supported only if you use objective_loss as the objective
metric.


• XGBoost Algorithm (p. 1369)


• Image Classification - MXNet (p. 1506)
• Object Detection - MXNet (p. 1530)
• Sequence-to-Sequence Algorithm (p. 1437)
• IP Insights (p. 1476)

Note
This list of built-in algorithms that support early stopping is current as of December 13, 2018.
Other built-in algorithms might support early stopping in the future. If an algorithm emits a
metric that can be used as an objective metric for a hyperparameter tuning job (preferably a
validation metric), then it supports early stopping.

To use early stopping with your own algorithm, you must write your algorithm so that it emits the
value of the objective metric after each epoch. The following list shows how you can do that in different
frameworks:

TensorFlow

Use the tf.keras.callbacks.ProgbarLogger class. For information, see the
tf.keras.callbacks.ProgbarLogger API.

MXNet

Use the mxnet.callback.LogValidationMetricsCallback class. For information, see the
mxnet.callback APIs.

Chainer

Extend Chainer by using the extensions.Evaluator class. For information, see the
chainer.training.extensions.Evaluator API.
PyTorch and Spark

There is no high-level support. You must explicitly write your training code so that it computes
objective metrics and writes them to logs after each epoch.
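
For example, the following is a minimal PyTorch-style sketch; train_one_epoch and validate are
hypothetical helpers, and the printed line is what a MetricDefinitions regular expression could
capture from the training logs:

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)    # Hypothetical training step
    val_loss = validate(model, val_loader)  # Hypothetical validation step
    # Emit the objective metric once per epoch so early stopping can use it
    print(f"epoch={epoch}; validation-loss={val_loss:.4f}")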

Run a Warm Start Hyperparameter Tuning Job


Use warm start to start a hyperparameter tuning job using one or more previous tuning jobs as a starting
point. The results of previous tuning jobs are used to inform which combinations of hyperparameters
to search over in the new tuning job. Hyperparameter tuning uses either Bayesian or random search to
choose combinations of hyperparameter values from ranges that you specify. For more information, see
How Hyperparameter Tuning Works (p. 1613). Using information from previous hyperparameter tuning
jobs can help increase the performance of the new hyperparameter tuning job by making the search for
the best combination of hyperparameters more efficient.
Note
Warm start tuning jobs typically take longer to start than standard hyperparameter tuning
jobs, because the results from the parent jobs have to be loaded before the job can start. The
increased time depends on the total number of training jobs launched by the parent jobs.

Reasons to consider warm start include the following:

• To gradually increase the number of training jobs over several tuning jobs based on results after each
iteration.
• To tune a model using new data that you received.


• To change hyperparameter ranges that you used in a previous tuning job, change static
hyperparameters to tunable, or change tunable hyperparameters to static values.
• To resume a previous hyperparameter tuning job that you stopped early or that stopped unexpectedly.

Topics
• Types of Warm Start Tuning Jobs (p. 1642)
• Warm Start Tuning Restrictions (p. 1642)
• Warm Start Tuning Sample Notebook (p. 1643)
• Create a Warm Start Tuning Job (p. 1643)

Types of Warm Start Tuning Jobs


There are two different types of warm start tuning jobs:

IDENTICAL_DATA_AND_ALGORITHM

The new hyperparameter tuning job uses the same input data and training image as the parent
tuning jobs. You can change the hyperparameter ranges to search and the maximum number of
training jobs that the hyperparameter tuning job launches. You can also change hyperparameters
from tunable to static, and from static to tunable, but the total number of static plus tunable
hyperparameters must remain the same as it is in all parent jobs. You cannot use a new version of
the training algorithm, unless the changes in the new version do not affect the algorithm itself. For
example, changes that improve logging or adding support for a different data format are allowed.

Use identical data and algorithm when you use the same training data as you used in a previous
hyperparameter tuning job, but you want to increase the total number of training jobs or change
ranges or values of hyperparameters.

When you run a warm start tuning job of type IDENTICAL_DATA_AND_ALGORITHM, there
is an additional field in the response to DescribeHyperParameterTuningJob named
OverallBestTrainingJob. The value of this field is the TrainingJobSummary for the training job
with the best objective metric value of all training jobs launched by this tuning job and all parent
jobs specified for the warm start tuning job.
TRANSFER_LEARNING

The new hyperparameter tuning job can include input data, hyperparameter ranges, maximum
number of concurrent training jobs, and maximum number of training jobs that are different than
those of its parent hyperparameter tuning jobs. You can also change hyperparameters from tunable
to static, and from static to tunable, but the total number of static plus tunable hyperparameters
must remain the same as it is in all parent jobs. The training algorithm image can also be a different
version from the version used in the parent hyperparameter tuning job. When you use transfer
learning, changes in the dataset or the algorithm that significantly affect the value of the objective
metric might reduce the usefulness of using warm start tuning.

Warm Start Tuning Restrictions


The following restrictions apply to all warm start tuning jobs:

• A tuning job can have a maximum of 5 parent jobs, and all parent jobs must be in a terminal state
(Completed, Stopped, or Failed) before you start the new tuning job.
• The objective metric used in the new tuning job must be the same as the objective metric used in the
parent jobs.


• The total number of static plus tunable hyperparameters must remain the same between parent
jobs and the new tuning job. Because of this, if you think you might want to use a hyperparameter
as tunable in a future warm start tuning job, you should add it as a static hyperparameter when you
create a tuning job.
• The type of each hyperparameter (continuous, integer, categorical) must not change between parent
jobs and the new tuning job.
• The number of total changes from tunable hyperparameters in the parent jobs to static
hyperparameters in the new tuning job, plus the number of changes in the values of static
hyperparameters cannot be more than 10. For example, if the parent job has a tunable categorical
hyperparameter with the possible values red and blue, and you change that hyperparameter to
static in the new tuning job, that counts as 2 changes against the allowed total of 10. If the same
hyperparameter had a static value of red in the parent job, and you change the static value to blue in
the new tuning job, it also counts as 2 changes.
• Warm start tuning is not recursive. For example, if you create MyTuningJob3 as a warm start tuning
job with MyTuningJob2 as a parent job, and MyTuningJob2 is itself a warm start tuning job with
a parent job MyTuningJob1, the information that was learned when running MyTuningJob1 is not
used for MyTuningJob3. If you want to use the information from MyTuningJob1, you must explicitly
add it as a parent for MyTuningJob3.
• The training jobs launched by every parent job in a warm start tuning job count against the 500
maximum training jobs for a tuning job.
• Hyperparameter tuning jobs created before October 1, 2018 cannot be used as parent jobs for warm
start tuning jobs.

Warm Start Tuning Sample Notebook


For a sample notebook that shows how to use warm start tuning, see
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/image_classification_warmstart/hpo_image_classification_warmstart.ipynb.
For instructions on how to create and access Jupyter notebook instances that you can use to run the
example in SageMaker, see Example Notebooks (p. 220). Once you
have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all
the SageMaker samples. The warm start tuning example notebook is located in the Hyperparameter
tuning section, and is named hpo_image_classification_warmstart.ipynb. To open a notebook,
click on its Use tab and select Create copy.

Create a Warm Start Tuning Job


You can use either the low-level AWS SDK for Python (Boto 3) or the high-level SageMaker Python SDK
to create a warm start tuning job.

Topics
• Create a Warm Start Tuning Job (Low-level SageMaker API for Python (Boto 3)) (p. 1643)
• Create a Warm Start Tuning Job (SageMaker Python SDK) (p. 1644)

Create a Warm Start Tuning Job (Low-level SageMaker API for Python (Boto 3))
To use warm start tuning, you specify the values of a HyperParameterTuningJobWarmStartConfig
object, and pass that as the WarmStartConfig field in a call to CreateHyperParameterTuningJob.

The following code shows how to create a HyperParameterTuningJobWarmStartConfig object and
pass it to a CreateHyperParameterTuningJob call by using the low-level SageMaker API for Python
(Boto 3).

Create the HyperParameterTuningJobWarmStartConfig object:


warm_start_config = {
    "ParentHyperParameterTuningJobs": [
        {"HyperParameterTuningJobName": "MyParentTuningJob"}
    ],
    "WarmStartType": "IdenticalDataAndAlgorithm"
}

Create the warm start tuning job:

smclient = boto3.Session().client('sagemaker')

smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='MyWarmStartTuningJob',
    HyperParameterTuningJobConfig=tuning_job_config,    # See notebook for tuning configuration
    TrainingJobDefinition=training_job_definition,      # See notebook for job definition
    WarmStartConfig=warm_start_config)

Create a Warm Start Tuning Job (SageMaker Python SDK)


To use the Amazon SageMaker Python SDK to run a warm start tuning job, you:

• Specify the parent jobs and the warm start type by using a WarmStartConfig object.
• Pass the WarmStartConfig object as the value of the warm_start_config argument of a
HyperparameterTuner object.
• Call the fit method of the HyperparameterTuner object.

For more information about using the Amazon SageMaker Python SDK for hyperparameter tuning, see
https://github.com/aws/sagemaker-python-sdk#sagemaker-automatic-model-tuning.

This example uses an estimator that uses the Image Classification - MXNet (p. 1506) algorithm for
training. The following code sets the hyperparameter ranges that the warm start tuning job searches
within to find the best combination of values. For information about setting hyperparameter ranges, see
Define Hyperparameter Ranges (p. 1617).

from sagemaker.tuner import ContinuousParameter  # Import needed for the range types

hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.0, 0.1),
                         'momentum': ContinuousParameter(0.0, 0.99)}

The following code configures the warm start tuning job by creating a WarmStartConfig object.

from sagemaker.tuner import WarmStartConfig, WarmStartTypes

parent_tuning_job_name = "MyParentTuningJob"
warm_start_config = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={parent_tuning_job_name})

Now set the values for static hyperparameters, which are hyperparameters that keep the same
value for every training job that the warm start tuning job launches. In the following code,
imageclassification is an estimator that was created previously.

imageclassification.set_hyperparameters(num_layers=18,
                                        image_shape='3,224,224',
                                        num_classes=257,
                                        num_training_samples=15420,
                                        mini_batch_size=128,
                                        epochs=30,
                                        optimizer='sgd',
                                        top_k='2',
                                        precision_dtype='float32',
                                        augmentation_type='crop')

Now create the HyperparameterTuner object and pass the WarmStartConfig object that you
previously created as the warm_start_config argument.

from sagemaker.tuner import HyperparameterTuner  # Import needed for the tuner class

tuner_warm_start = HyperparameterTuner(imageclassification,
                                       'validation:accuracy',
                                       hyperparameter_ranges,
                                       objective_type='Maximize',
                                       max_jobs=10,
                                       max_parallel_jobs=2,
                                       base_tuning_job_name='warmstart',
                                       warm_start_config=warm_start_config)

Finally, call the fit method of the HyperparameterTuner object to launch the warm start tuning job.

tuner_warm_start.fit(
    {'train': s3_input_train, 'validation': s3_input_validation},
    include_cls_metadata=False)

Resource Limits for Automatic Model Tuning


SageMaker sets the following default limits for resources used by automatic model tuning:

Resource | Regions | Default limit | Can be increased to
Number of parallel (concurrent) hyperparameter tuning jobs | All | 100 | N/A
Number of hyperparameters that can be searched* | All | 30 | N/A
Number of metrics defined per hyperparameter tuning job | All | 20 | N/A
Number of parallel training jobs per hyperparameter tuning job | All | 10 | 100
[Bayesian optimization] Number of training jobs per hyperparameter tuning job | All | 750 | N/A
[Random search] Number of training jobs per hyperparameter tuning job | All | 750 | 10000
[Hyperband] Number of training jobs per hyperparameter tuning job | All | 750 | N/A
[Grid] Number of training jobs per hyperparameter tuning job, either specified explicitly or inferred from the search space | All | 750 | N/A
Maximum run time for a hyperparameter tuning job | All | 30 days | N/A

* Each categorical hyperparameter can have at most 30 different values.

Resource limit example


When you plan hyperparameter tuning jobs, you also have to take into account the limits on training
resources. For information about the default resource limits for SageMaker training jobs, see SageMaker
Limits. Every concurrent training instance used by your hyperparameter tuning jobs counts against
the total number of training instances allowed. For example, if you run 10 concurrent
hyperparameter tuning jobs, each of those hyperparameter tuning jobs runs 100 total training jobs
and 20 concurrent training jobs. Each of those training jobs runs on one ml.m4.xlarge instance. The
following limits apply:

• Number of concurrent hyperparameter tuning jobs: You don't need to increase the limit, because 10
tuning jobs is below the limit of 100.
• Number of training jobs per hyperparameter tuning job: You don't need to increase the limit, because
100 training jobs is below the limit of 750.
• Number of concurrent training jobs per hyperparameter tuning job: You need to request a limit
increase to 20, because the default limit is 10.
• SageMaker training ml.m4.xlarge instances: You need to request a limit increase to 200, because you
have 10 hyperparameter tuning jobs, each of which is running 20 concurrent training jobs. The default
limit is 20 instances.
• SageMaker training total instance count: You need to request a limit increase to 200, because you have
10 hyperparameter tuning jobs, each of which is running 20 concurrent training jobs. The default limit
is 20 instances.

To request a quota increase:

1. Open the AWS Support Center page, sign in if necessary, and then choose Create case.
2. On the Create case page, choose Service limit increase.
3. On the Case details panel, select SageMaker Automatic Model Tuning [Hyperparameter
Optimization] for the Limit type.
4. On the Requests panel for Request 1, select the Region, the resource Limit to increase and the
New Limit value you are requesting. Select Add another request if you have additional requests for
quota increases.


5. In the Case description panel, provide a description of your use case.
6. In the Contact options panel, select your preferred Contact methods (Web, Chat, or Phone) and
then choose Submit.

Best Practices for Hyperparameter Tuning


Hyperparameter optimization (HPO) is not a fully automated process. To improve optimization, follow
these best practices for hyperparameter tuning.

Topics
• Choosing a tuning strategy (p. 1647)
• Choosing the number of hyperparameters (p. 1648)
• Choosing hyperparameter ranges (p. 1648)
• Using the correct scales for hyperparameters (p. 1648)
• Choosing the best number of parallel training jobs (p. 1648)
• Running training jobs on multiple instances (p. 1649)
• Using a random seed to reproduce hyperparameter configurations (p. 1649)

Choosing a tuning strategy


For large jobs, using the Hyperband tuning strategy can reduce computation time. Hyperband has an
early stopping mechanism to stop under-performing jobs. Hyperband can also reallocate resources
towards well-utilized hyperparameter configurations and run parallel jobs. For smaller training jobs using
less runtime, use either random search or Bayesian optimization.

Use Bayesian optimization to make increasingly informed decisions about improving hyperparameter
configurations in the next run. Bayesian optimization uses information gathered from prior runs to
improve subsequent runs. Because of its sequential nature, Bayesian optimization cannot massively scale.

Use random search to run a large number of parallel jobs. In random search, subsequent jobs do not
depend on the results from prior jobs and can be run independently. Compared to other strategies,
random search is able to run the largest number of parallel jobs.

Use grid search to reproduce results of a tuning job, or if simplicity and transparency of the optimization
algorithm are important. You can also use grid search to explore the entire hyperparameter search space
evenly. Grid search methodically searches through every hyperparameter combination to find optimal
hyperparameter values. Unlike grid search, Bayesian optimization, random search, and Hyperband all
draw hyperparameters randomly from the search space. Because grid search analyzes every combination
of hyperparameters, optimal hyperparameter values will be identical between tuning jobs that use the
same hyperparameters.

Choosing the number of hyperparameters


During optimization, the computational complexity of a hyperparameter tuning job depends on the
following:

• The number of hyperparameters


• The range of values that Amazon SageMaker has to search

Although you can simultaneously specify up to 30 hyperparameters, limiting your search to a smaller
number can reduce computation time. Reducing computation time allows SageMaker to converge more
quickly to an optimal hyperparameter configuration.

Choosing hyperparameter ranges


The range of values that you choose to search can adversely affect hyperparameter optimization. For
example, a range that covers every possible hyperparameter value can lead to large compute times and a
model that doesn't generalize well to unseen data. If you know that using a subset of the largest possible
range is appropriate for your use case, consider limiting the range to that subset.

Using the correct scales for hyperparameters


During hyperparameter tuning, SageMaker attempts to infer if your hyperparameters are log-scaled or
linear-scaled. Initially, SageMaker assumes linear scaling for hyperparameters. If hyperparameters are
log-scaled, choosing the correct scale will make your search more efficient. You can also select Auto for
ScalingType in the CreateHyperParameterTuningJob API if you want SageMaker to detect the scale for
you.
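
For example, the following is a minimal sketch of a hyperparameter range entry that requests
automatic scale detection; the name and bounds are illustrative:

parameter_range = {
    "Name": "learning_rate",
    "MinValue": "0.0001",
    "MaxValue": "0.1",
    "ScalingType": "Auto"  # Let SageMaker infer linear or logarithmic scaling
}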

Choosing the best number of parallel training jobs


You can use the results of previous trials to improve the performance of subsequent trials. Choose the
largest number of parallel jobs that provides a meaningful incremental result while staying within
your Region and account compute constraints. Use the MaxParallelTrainingJobs field to limit the
number of training jobs that a hyperparameter tuning job can launch in parallel. For more information,
see Running multiple HPO jobs in parallel on Amazon SageMaker.


Running training jobs on multiple instances


When a training job runs on multiple machines in distributed mode, each machine emits an objective
metric. HPO can only use one of these emitted objective metrics to evaluate model performance. In
distributed mode, HPO uses the objective metric that was reported by the last running job across all
instances.

Using a random seed to reproduce hyperparameter configurations

You can specify an integer as a random seed for hyperparameter tuning and use that seed during
hyperparameter generation. Later, you can use the same seed to reproduce hyperparameter
configurations that are consistent with your previous results. For random search and Hyperband
strategies, using the same random seed can provide up to 100% reproducibility of the previous
hyperparameter configuration for the same tuning job. For Bayesian strategy, using the same random
seed will improve reproducibility for the same tuning job.

Debug and Profile Training Jobs Using Amazon SageMaker Debugger

Debug, profile, and monitor training jobs in real time to detect non-converging conditions, optimize
resource utilization by eliminating bottlenecks, improve training time, and reduce costs of your machine
learning models using Amazon SageMaker Debugger.

Amazon SageMaker Debugger Features


A machine learning (ML) training job can have problems such as system bottlenecks, overfitting,
saturated activation functions, and vanishing gradients, which can compromise model performance.

SageMaker Debugger profiles and debugs training jobs to help resolve such problems and improve your
ML model's compute resource utilization and performance. Debugger offers tools to send alerts when
training anomalies are found, take actions against the problems, and identify the root cause of them by
visualizing collected metrics and tensors.

SageMaker Debugger supports the Apache MXNet, PyTorch, TensorFlow, and XGBoost frameworks. For
more information about available frameworks and versions supported by SageMaker Debugger, see
Supported Frameworks and Algorithms (p. 1650).


The high-level Debugger workflow is as follows:

1. Modify your training script with the sagemaker-debugger Python SDK if needed.
2. Configure a SageMaker training job with SageMaker Debugger.
• Configure using the SageMaker Estimator API (for Python SDK).
• Configure using the SageMaker CreateTrainingJob request (for Boto3 or CLI).
• Configure custom training containers (p. 1795) with SageMaker Debugger.
3. Start a training job and monitor training issues in real time.
• SageMaker Studio Debugger dashboards in Studio Experiments and trials (p. 1721).
• List of Debugger Built-in Rules (p. 1748).
4. Get alerts and take prompt actions against the training issues.
• Receive texts and emails and stop training jobs when training issues are found using Debugger
Built-in Actions for Rules (p. 1698).
• Set up your own actions using Amazon CloudWatch Events and AWS Lambda (p. 1702).
5. Receive training reports, suggestions to fix the issues, and insights into your training jobs.
• Studio Debugger Insights dashboard for deep learning frameworks
• Deep learning framework profiling report
• SageMaker XGBoost training report
6. Explore deep analysis of the training issues and bottlenecks.
• For profiling training jobs, see Analyze Data Using the SMDebug Client Library (p. 1740).
• For debugging model output tensors, see Visualize Debugger Output Tensors in
TensorBoard (p. ).
7. Fix the issues, considering the suggestions provided by Debugger, and repeat steps 1–5 until you
optimize your model and achieve target accuracy.
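
As an illustration of step 2, the following is a minimal sketch that attaches a Debugger built-in rule
to a SageMaker Python SDK estimator; the training script, framework version, and instance settings
are placeholders:

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, rule_configs

estimator = PyTorch(
    entry_point="train.py",             # Placeholder training script
    framework_version="1.13",
    py_version="py39",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],  # Built-in rule
)

estimator.fit("s3://my-bucket/my-training-data")  # Placeholder S3 URI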

The SageMaker Debugger developer guide walks you through the following topics.

Topics
• Supported Frameworks and Algorithms (p. 1650)
• Amazon SageMaker Debugger Architecture (p. 1653)
• Get Started with Debugger Tutorials (p. 1654)
• Debug Training Jobs Using Amazon SageMaker Debugger (p. 1664)
• Profile Training Jobs Using Amazon SageMaker Debugger (p. 1709)
• List of Debugger Built-in Rules (p. 1748)
• Create Debugger Custom Rules for Training Job Analysis (p. 1793)
• Use Debugger with Custom Training Containers (p. 1795)
• Configure Debugger Using Amazon SageMaker API (p. 1799)
• Best Practices for Amazon SageMaker Debugger (p. 1809)
• Amazon SageMaker Debugger Advanced Topics and Reference Documentation (p. 1812)
• Amazon SageMaker Debugger Release Notes (p. 1820)

Supported Frameworks and Algorithms


The following table shows SageMaker machine learning frameworks and algorithms supported by
Debugger.
SageMaker-supported frameworks and algorithms | Monitoring system bottlenecks | Profiling deep learning framework operations | Debugging output tensors
TensorFlow | All AWS Deep Learning Containers | AWS TensorFlow deep learning containers >= v2.3.1, < v2.11 | AWS TensorFlow deep learning containers 1.15.4 or later
PyTorch | All AWS Deep Learning Containers | AWS PyTorch deep learning containers >= v1.6.0, < v2.0 | AWS PyTorch deep learning containers 1.5.0 or later
MXNet | All AWS Deep Learning Containers | - | AWS MXNet deep learning containers 1.6.0 or later
XGBoost | 1.0-1, 1.2-1, 1.3-1 | - | 1.0-1, 1.2-1, 1.3-1
SageMaker generic estimator | SageMaker built-in algorithms using image URIs, or custom training containers (p. 1795) (with the AWS deep learning container images, public Docker images, or your own Docker images) | - | Custom training containers (p. 1795) (available for TensorFlow, PyTorch, MXNet, and XGBoost with manual hook registration)

• Monitoring system bottlenecks – Monitor the system utilization rate for resources such as CPU,
GPU, memory, network, and data I/O metrics. This is a framework- and model-agnostic feature that is
available for any training job in SageMaker.
• Profiling deep learning framework operations – Profile the deep learning operations of the
TensorFlow and PyTorch frameworks, such as step durations, data loaders, forward and backward
operations, Python profiling metrics, and framework-specific metrics.
Warning
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow
2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks
and SDKs as follows.
• SageMaker Python SDK <= v2.130.0
• PyTorch >= v1.6.0, < v2.0
• TensorFlow >= v2.3.1, < v2.11
See also Amazon SageMaker Debugger Release Notes: March 16, 2023 (p. 1820).
• Debugging output tensors – Track and debug model parameters, such as weights, gradients, biases,
and scalar values of your training job. Available deep learning frameworks are Apache MXNet,
TensorFlow, PyTorch, and XGBoost.
Important
For the TensorFlow framework with Keras, SageMaker Debugger deprecates the zero code
change support for debugging models built using the tf.keras modules of TensorFlow
2.6 and later. This is due to breaking changes announced in the TensorFlow 2.6.0 release
note. For instructions on how to update your training script, see the section called
“TensorFlow” (p. 1667).


Important
Since PyTorch v1.12.0 and later, SageMaker Debugger deprecates the zero code change
support for debugging models.
This is due to breaking changes that cause SageMaker Debugger to interfere with the
torch.jit functionality. For instructions on how to update your training script, see the
section called “PyTorch” (p. 1665).

If the framework or algorithm that you want to train and debug is not listed in the table, go to the AWS
Discussion Forum and leave feedback on SageMaker Debugger.

AWS Regions
Amazon SageMaker Debugger is available in all regions where Amazon SageMaker is in service except the
following region.

• Asia Pacific (Jakarta): ap-southeast-3

To find out whether Amazon SageMaker is in service in your AWS Region, see AWS Regional Services.

Use Debugger with Custom Training Containers


Bring your training containers to SageMaker and gain insights into your training jobs using Debugger.
Maximize your work efficiency by optimizing your model on Amazon EC2 instances using the monitoring
and debugging features.

For more information about how to build your training container with the sagemaker-debugger client
library, push it to the Amazon Elastic Container Registry (Amazon ECR), and monitor and debug, see Use
Debugger with Custom Training Containers (p. 1795).

Debugger Open-Source GitHub Repositories


Debugger APIs are provided through the SageMaker Python SDK and designed to construct Debugger
hook and rule configurations for the SageMaker CreateTrainingJob and DescribeTrainingJob API
operations. The sagemaker-debugger client library provides tools to register hooks and access the
training data through its trial feature, all through its flexible and powerful API operations. It supports the
machine learning frameworks TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6 and later.

For direct resources about the Debugger and sagemaker-debugger API operations, see the following
links:

• The Amazon SageMaker Python SDK documentation


• The Amazon SageMaker Python SDK - Debugger APIs
• The sagemaker-debugger Python SDK documentation for the Amazon SageMaker Debugger open
source client library
• The sagemaker-debugger PyPI

If you use the SDK for Java to conduct SageMaker training jobs and want to configure Debugger APIs, see
the following references:

• Amazon SageMaker Debugger API Operations (p. 1812)


• Configure Debugger Using Amazon SageMaker API (p. 1799)


Amazon SageMaker Debugger Architecture


This topic walks you through a high-level overview of the Amazon SageMaker Debugger workflow.

Debugger supports profiling functionality for performance optimization to identify computation issues,
such as system bottlenecks and underutilization, and to help optimize hardware resource utilization at
scale.

Debugger's debugging functionality for model optimization is about analyzing non-converging training
issues that can arise while minimizing the loss functions using optimization algorithms, such as gradient
descent and its variations.

The following diagram shows the architecture of SageMaker Debugger. The blocks with bold boundary
lines are the components that Debugger manages to analyze your training job.

Debugger stores the following data from your training jobs in your secured Amazon S3 bucket:


• System metrics – Hardware resource utilization data, such as CPU, GPU, CPU and GPU memory,
network, and data input and output (I/O) metrics.
• Framework metrics – Metrics to track each framework operation per call or sampling, such as
convolutional layer operations in the forward pass, batch normalization operations in the backward
pass, data loader processes between steps, and gradient descent algorithm operations to calculate and
update the loss function.
• Output tensors – Collections of scalars and model parameters that are continuously updated during
the forward and backward passes while training ML models. The output tensors include scalar values
(accuracy and loss) and matrices (weights, gradients, input layers, and output layers).
Note
By default, Debugger monitors and debugs SageMaker training jobs without any Debugger-
specific parameters configured in SageMaker estimators. Debugger collects system metrics
every 500 milliseconds and basic output tensors (scalar outputs such as loss and accuracy)
every 500 steps. It also runs the ProfilerReport rule to analyze the system metrics and
aggregate the Studio Debugger insights dashboard and a profiling report. Debugger saves the
output data in your secured Amazon S3 bucket.

The Debugger built-in rules run on processing containers, which are designed to evaluate machine
learning models by processing the training data collected in your S3 bucket (see Process Data and
Evaluate Models). The built-in rules are fully managed by Debugger. You can also create your own rules
customized to your model to watch for any issues you want to monitor.

Get Started with Debugger Tutorials


The following topics walk you through tutorials from the basics to advanced use cases of monitoring,
profiling, and debugging SageMaker training jobs using Debugger. Explore the Debugger features and
learn how you can debug and improve your machine learning models efficiently by using Debugger.

Topics
• Debugger Tutorial Videos (p. 1654)
• Debugger Example Notebooks (p. 1655)
• Debugger Advanced Demos and Visualization (p. 1657)

Debugger Tutorial Videos


The following videos provide a tour of Amazon SageMaker Debugger capabilities using SageMaker
Studio and SageMaker notebook instances.

Topics
• Analyze, Detect, and Get Alerted on Problems with Training Runs Using Amazon SageMaker
Debugger (p. 1654)
• Debug Models with Amazon SageMaker Debugger in Studio (p. 1654)
• Deep Dive on Amazon SageMaker Debugger and SageMaker Model Monitor (p. 1655)

Analyze, Detect, and Get Alerted on Problems with Training Runs Using Amazon
SageMaker Debugger
Emily Webber, AWS Machine Learning Specialist | Length: 13 minutes 54 seconds

This tutorial video gives you a tour of Amazon SageMaker Debugger to capture, debug, and visualize
model output data from a training model with MXNet. Learn how Amazon SageMaker Debugger makes


the training process transparent by automatically capturing metrics, analyzing training runs, and
detecting problems.

Analyze, Detect, and Get Alerted on Problems with Training Runs Using Amazon SageMaker Debugger

You can find the example notebook in this video at Visualizing Debugging Tensors of MXNet training in
the Amazon SageMaker Examples GitHub repository.

Debug Models with Amazon SageMaker Debugger in Studio


Julien Simon, AWS Technical Evangelist | Length: 14 minutes 17 seconds

This tutorial video demonstrates how to use Amazon SageMaker Debugger to capture and inspect
debugging information from a training model. The example training model used in this video is a simple
convolutional neural network (CNN) based on Keras with the TensorFlow backend. SageMaker in a
TensorFlow framework and Debugger enable you to build an estimator directly using the training script
and debug the training job.

Debug Models with Amazon SageMaker Debugger (part 1)

You can find the example notebook in the video in this Studio Demo repository provided by the author.
You need to clone the debugger.ipynb notebook file and the mnist_keras_tf.py training script
to your SageMaker Studio or a SageMaker notebook instance. After you clone the two files, specify the
path keras_script_path to the mnist_keras_tf.py file inside the debugger.ipynb notebook.
For example, if you cloned the two files in the same directory, set it as keras_script_path =
"mnist_keras_tf.py".

Deep Dive on Amazon SageMaker Debugger and SageMaker Model Monitor


Julien Simon, AWS Technical Evangelist | Length: 44 minutes 34 seconds

This video session explores advanced features of Debugger and SageMaker Model Monitor that help
boost productivity and the quality of your models. First, this video shows how to detect and fix training
issues, visualize tensors, and improve models with Debugger. Next, at 22:41, the video shows how to
monitor models in production and identify prediction issues such as missing features or data drift using
SageMaker Model Monitor. Finally, it offers cost optimization tips to help you make the most of your
machine learning budget.

Debug Models with Debugger (part 2)

You can find the example notebook in the video in this AWS Dev Days 2020 repository offered by the
author.

Debugger Example Notebooks


SageMaker Debugger example notebooks are provided in the aws/amazon-sagemaker-examples
repository. The Debugger example notebooks walk you through basic to advanced use cases of
debugging and profiling training jobs.

We recommend that you run the example notebooks on SageMaker Studio or a SageMaker Notebook
instance because most of the examples are designed for training jobs in the SageMaker ecosystem,
including Amazon EC2, Amazon S3, and Amazon SageMaker Python SDK.

To clone the example repository to SageMaker Studio, follow the instructions at Amazon SageMaker
Studio Tour.

To find the examples in a SageMaker Notebook instance, follow the instructions at SageMaker Notebook
Instance Example Notebooks.


Important
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the
SMDebug client library. In your iPython kernel, Jupyter Notebook, or JupyterLab environment,
run the following code to install the latest versions of the libraries and restart the kernel.

import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)

Debugger Example Notebooks for Profiling Training Jobs


The following list shows Debugger example notebooks introducing Debugger's adaptability to monitor
and profile training jobs for various machine learning models, datasets, and frameworks.

• Notebook: Amazon SageMaker Debugger Profiling Data Analysis
  Framework: TensorFlow | Model: Keras ResNet50 | Dataset: Cifar-10
  Description: This notebook provides an introduction to interactive analysis of profiled data captured
  by SageMaker Debugger. Explore the full functionality of the SMDebug interactive analysis tools.

• Notebook: Profile machine learning training with Amazon SageMaker Debugger
  Framework: TensorFlow | Model: 1-D Convolutional Neural Network | Dataset: IMDB dataset
  Description: Profile a TensorFlow 1-D CNN for sentiment analysis of IMDB data that consists of movie
  reviews labeled as having positive or negative sentiment. Explore the Studio Debugger insights and
  Debugger profiling report.

• Notebook: Profiling TensorFlow ResNet model training with various distributed training settings
  Framework: TensorFlow | Model: ResNet50 | Dataset: Cifar-10
  Description: Run TensorFlow training jobs with various distributed training settings, monitor system
  resource utilization, and profile model performance using Debugger.

• Notebook: Profiling PyTorch ResNet model training with various distributed training settings
  Framework: PyTorch | Model: ResNet50 | Dataset: Cifar-10
  Description: Run PyTorch training jobs with various distributed training settings, monitor system
  resource utilization, and profile model performance using Debugger.


Debugger Example Notebooks for Analyzing Model Parameters


The following list shows Debugger example notebooks introducing Debugger's adaptability to debug
training jobs for various machine learning models, datasets, and frameworks.

• Notebook: Amazon SageMaker Debugger - Use built-in rule
  Framework: TensorFlow | Model: Convolutional Neural Network | Dataset: MNIST
  Description: Use the Amazon SageMaker Debugger built-in rules for debugging a TensorFlow model.

• Notebook: Amazon SageMaker Debugger - Tensorflow 2.1
  Framework: TensorFlow | Model: ResNet50 | Dataset: Cifar-10
  Description: Use the Amazon SageMaker Debugger hook configuration and built-in rules for debugging
  a model with the Tensorflow 2.1 framework.

• Notebook: Visualizing Debugging Tensors of MXNet training
  Framework: MXNet | Model: Gluon Convolutional Neural Network | Dataset: Fashion MNIST
  Description: Run a training job and configure SageMaker Debugger to store all tensors from this job,
  then visualize those tensors in a notebook.

• Notebook: Enable Spot Training with Amazon SageMaker Debugger
  Framework: MXNet | Model: Gluon Convolutional Neural Network | Dataset: Fashion MNIST
  Description: Learn how Debugger collects tensor data from a training job on a spot instance, and how
  to use the Debugger built-in rules with managed spot training.

• Notebook: Explain an XGBoost model that predicts an individual’s income with Amazon SageMaker Debugger
  Framework: XGBoost | Model: XGBoost Regression | Dataset: Adult Census dataset
  Description: Learn how to use the Debugger hook and built-in rules for collecting and visualizing
  tensor data from an XGBoost regression model, such as loss values, features, and SHAP values.

To find advanced visualizations of model parameters and use cases, see the next topic at Debugger
Advanced Demos and Visualization (p. 1657).

Debugger Advanced Demos and Visualization


The following demos walk you through advanced use cases and visualization scripts using Debugger.

Topics
• Train and Tune Your Models with Amazon SageMaker Experiments and Debugger (p. 1658)
• Using SageMaker Debugger to Monitor a Convolutional Autoencoder Model Training (p. 1661)
• Using SageMaker Debugger to Monitor Attentions in BERT Model Training (p. 1661)
• Using SageMaker Debugger to Visualize Class Activation Maps in Convolutional Neural Networks
(CNNs) (p. 1664)


Train and Tune Your Models with Amazon SageMaker Experiments and
Debugger
Dr. Nathalie Rauschmayr, AWS Applied Scientist | Length: 49 minutes 26 seconds

Train and Prune Models with SageMaker Experiments and Debugger

Find out how Amazon SageMaker Experiments and Debugger can simplify the management of your
training jobs. Amazon SageMaker Debugger provides transparent visibility into training jobs and saves
training metrics into your Amazon S3 bucket. SageMaker Experiments enables you to call the training
information as trials through SageMaker Studio and supports visualization of the training job. This helps
you keep model quality high while pruning less important parameters based on importance rank.

This video demonstrates a model pruning technique that makes pre-trained ResNet50 and AlexNet
models lighter and more affordable while keeping high standards for model accuracy.

A SageMaker estimator trains those models, supplied from the PyTorch model zoo, in an AWS Deep
Learning Container with the PyTorch framework, and Debugger extracts training metrics from the
training process.

The video also demonstrates how to set up a Debugger custom rule to watch the accuracy of a pruned
model, to trigger an Amazon CloudWatch event and an AWS Lambda function when the accuracy hits a
threshold, and to automatically stop the pruning process to avoid redundant iterations.

Learning objectives are as follows:

• Learn how to use SageMaker to accelerate ML model training and improve model quality.
• Understand how to manage training iterations with SageMaker Experiments by automatically
capturing input parameters, configurations, and results.
• Discover how Debugger makes the training process transparent by automatically capturing real-time
tensor data from metrics such as weights, gradients, and activation outputs of convolutional neural
networks.
• Use CloudWatch to trigger Lambda when Debugger catches issues.
• Master the SageMaker training process using SageMaker Experiments and Debugger.

You can find the notebooks and training scripts used in this video from SageMaker Debugger PyTorch
Iterative Model Pruning.

The following image shows how the iterative model pruning process reduces the size of AlexNet by
cutting out the 100 least significant filters based on importance rank evaluated by activation outputs
and gradients.

The pruning process reduced the initial 50 million parameters to 18 million. It also reduced the estimated
model size from 201 MB to 73 MB.


You also need to track model accuracy, and the following image shows how you can plot the model
pruning process to visualize changes in model accuracy based on the number of parameters in
SageMaker Studio.


In SageMaker Studio, choose the Experiments tab, select a list of tensors saved by Debugger from the
pruning process, and then compose a Trial Component List panel. Select all ten iterations and then
choose Add chart to create a Trial Component Chart. After you decide on a model to deploy, choose the
trial component and choose a menu to perform an action or choose Deploy model.
Note
To deploy a model through SageMaker Studio using the following notebook example, add a line
at the end of the train function in the train.py script.

# In the train.py script, look for the train function in line 58.
def train(epochs, batch_size, learning_rate):
    ...
    print('acc:{:.4f}'.format(correct/total))
    hook.save_scalar("accuracy", correct/total, sm_metric=True)

# Add the following code to line 128 of the train.py script to save the pruned models
# under the current SageMaker Studio model directory
torch.save(model.state_dict(), os.environ['SM_MODEL_DIR'] + '/model.pt')


Using SageMaker Debugger to Monitor a Convolutional Autoencoder Model Training

This notebook demonstrates how SageMaker Debugger visualizes tensors from an unsupervised (or self-
supervised) learning process on a MNIST image dataset of handwritten numbers.

The training model in this notebook is a convolutional autoencoder with the MXNet framework. The
convolutional autoencoder has a bottleneck-shaped convolutional neural network that consists of an
encoder part and a decoder part.

The encoder in this example has two convolution layers to produce a compressed representation (latent
variables) of the input images. In this case, the encoder produces a latent variable of size (1, 20) from an
original input image of size (28, 28), significantly reducing the size of the data for training by 40 times.

The decoder has two deconvolutional layers and ensures that the latent variables preserve key
information by reconstructing output images.

The convolutional encoder reduces the input data size for clustering algorithms and improves the
performance of clustering algorithms such as k-means, k-NN, and t-Distributed Stochastic Neighbor
Embedding (t-SNE).

This notebook example demonstrates how to visualize the latent variables using Debugger, as shown
in the following animation. It also demonstrates how the t-SNE algorithm classifies the latent variables
into ten clusters and projects them into a two-dimensional space. The scatter plot color scheme on the
right side of the image reflects the true values to show how well the autoencoder model and t-SNE algorithm
organize the latent variables into the clusters.

Using SageMaker Debugger to Monitor Attentions in BERT Model Training


Bidirectional Encoder Representations from Transformers (BERT) is a language representation model. As
the name of the model reflects, the BERT model builds on transfer learning and the Transformer model
for natural language processing (NLP).

The BERT model is pre-trained on unsupervised tasks such as predicting missing words in a sentence or
predicting the next sentence that naturally follows a previous sentence. The training data contains 3.3
billion words (tokens) of English text, from sources such as Wikipedia and electronic books. For a simple
example, the BERT model can give high attention to appropriate verb tokens or pronoun tokens from a
subject token.


The pre-trained BERT model can be fine-tuned with an additional output layer to achieve state-of-the-
art model training in NLP tasks, such as automated responses to questions, text classification, and many
others.

Debugger collects tensors from the fine-tuning process. In the context of NLP, the weight of neurons is
called attention.

This notebook demonstrates how to use the pre-trained BERT model from the GluonNLP model zoo on
the Stanford Question and Answering dataset and how to set up SageMaker Debugger to monitor the
training job.

Plotting attention scores and individual neurons in the query and key vectors can help to identify causes
of incorrect model predictions. With SageMaker Debugger, you can retrieve the tensors and plot the
attention-head view in real time as training progresses and understand what the model is learning.

The following animation shows the attention scores of the first 20 input tokens for ten iterations in the
training job provided in the notebook example.


Using SageMaker Debugger to Visualize Class Activation Maps in Convolutional Neural Networks (CNNs)

This notebook demonstrates how to use SageMaker Debugger to plot class activation maps for image
detection and classification in convolutional neural networks (CNNs). In deep learning, a convolutional
neural network (CNN or ConvNet) is a class of deep neural networks, most commonly applied to analyzing
visual imagery. One of the applications that adopts the class activation maps is self-driving cars, which
require instantaneous detection and classification of images such as traffic signs, roads, and obstacles.

In this notebook, the PyTorch ResNet model is trained on the German Traffic Sign Dataset, which
contains more than 40 classes of traffic-related objects and more than 50,000 images in total.

During the training process, SageMaker Debugger collects tensors to plot the class activation maps
in real time. As shown in the animated image, the class activation map (also called a saliency map)
highlights regions with high activation in red color.

Using tensors captured by Debugger, you can visualize how the activation map evolves during the model
training. The model starts by detecting the edge on the lower-left corner at the beginning of the training
job. As the training progresses, the focus shifts to the center and detects the speed limit sign, and the
model successfully predicts the input image as Class 3, which is a class of speed limit 60km/h signs, with
a 97% confidence level.

Debug Training Jobs Using Amazon SageMaker Debugger

To prepare your training script and run training jobs with SageMaker Debugger to debug model training
progress, you follow the typical two-step process: modify your training script using the sagemaker-
debugger Python SDK, and construct a SageMaker estimator using the SageMaker Python SDK. Go
through the following topics to learn how to use SageMaker Debugger's debugging functionality.

Topics
• Step 1: Adapt Your Training Script to Register a Hook (p. 1665)
• Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK (p. 1669)
• SageMaker Debugger Interactive Report for XGBoost (p. 1684)
• Action on Amazon SageMaker Debugger Rules (p. 1698)
• Visualize Amazon SageMaker Debugger Output Tensors in TensorBoard (p. 1707)


Step 1: Adapt Your Training Script to Register a Hook


Amazon SageMaker Debugger comes with a client library called the sagemaker-debugger Python SDK.
The sagemaker-debugger Python SDK provides tools for adapting your training script before training
and analysis tools after training. On this page, you'll learn how to adapt your training script using the
client library.

The sagemaker-debugger Python SDK provides wrapper functions that help register a hook to extract
model tensors, without altering your training script. To get started with collecting model output tensors
and debug them to find training issues, make the following modifications in your training script.
Tip
While you're following this page, use the sagemaker-debugger open source SDK
documentation for API references.

Topics
• Adapt Your PyTorch Training Script (p. 1665)
• Adapt Your TensorFlow Training Script (p. 1667)

Adapt Your PyTorch Training Script


To start collecting model output tensors and debug training issues, make the following modifications to
your PyTorch training script.

For PyTorch 1.12.0

If you bring a PyTorch training script, you can run the training job and extract model output tensors with
a few additional code lines in your training script. You need to use the hook APIs in the sagemaker-
debugger client library. Walk through the following instructions that break down the steps with code
examples.

1. Create a hook.

(Recommended) For training jobs within SageMaker

import smdebug.pytorch as smd


hook=smd.get_hook(create_if_not_exists=True)

When you launch a training job in the section called “Step 2: Launch and Debug Training Jobs Using
SageMaker Python SDK” (p. 1669) with any of the DebuggerHookConfig, TensorBoardConfig, or Rules
in your estimator, SageMaker adds a JSON configuration file to your training instance that is picked
up by the get_hook function. Note that if you do not include any of the configuration APIs in your
estimator, there will be no configuration file for the hook to find, and the function returns None.

(Optional) For training jobs outside SageMaker

If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2
instances, or your own local devices, use the smd.Hook class to create a hook. However, this approach
can only store the tensor collections, which are usable for TensorBoard visualization. SageMaker
Debugger’s built-in Rules don’t work with the local mode because the Rules require SageMaker ML
training instances and S3 to store outputs from the remote instances in real time. The smd.get_hook
API returns None in this case.

If you want to create a manual hook to save tensors in local mode, use the following code snippet
with the logic to check if the smd.get_hook API returns None and create a manual hook using the
smd.Hook class. Note that you can specify any output directory in your local machine.


import smdebug.pytorch as smd

hook=smd.get_hook(create_if_not_exists=True)

if hook is None:
    hook=smd.Hook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )

2. Wrap your model with the hook’s class methods.

The hook.register_module() method takes your model and iterates through each layer, looking
for any tensors that match with regular expressions that you’ll provide through the configuration in
the section called “Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK” (p. 1669).
The collectable tensors through this hook method are weights, biases, activations, gradients, inputs,
and outputs.

hook.register_module(model)

Tip
If you collect the entire set of output tensors from a large deep learning model, the total size of
those collections can grow exponentially and might cause bottlenecks. If you want to save
specific tensors, you can also use the hook.save_tensor() method. This method helps you
pick the variable for the specific tensor and save it to a custom collection named as you want.
For more information, see step 7 (p. 1667) of this instruction.
3. Wrap the loss function with the hook’s class methods.

The hook.register_loss method wraps the loss function. It extracts loss values at every
save_interval that you’ll set during configuration in the section called “Step 2: Launch and Debug
Training Jobs Using SageMaker Python SDK” (p. 1669), and saves them to the "losses" collection.

hook.register_loss(loss_function)

4. Add hook.set_mode(ModeKeys.TRAIN) in the train block. This indicates the tensor collection is
extracted during the training phase.

def train():
    ...
    hook.set_mode(ModeKeys.TRAIN)

5. Add hook.set_mode(ModeKeys.EVAL) in the validation block. This indicates the tensor collection is
extracted during the validation phase.

def validation():
    ...
    hook.set_mode(ModeKeys.EVAL)

6. Use hook.save_scalar() to save custom scalars. You can save scalar values that aren’t in your
model. For example, if you want to record the accuracy values computed during evaluation, add the
following line of code below the line where you calculate accuracy.

hook.save_scalar("accuracy", accuracy)

Note that you need to provide a string as the first argument to name the custom scalar collection. This
is the name that'll be used for visualizing the scalar values in TensorBoard, and can be any string you
want.


7. Use hook.save_tensor() to save custom tensors. Similarly to hook.save_scalar(), you can save
additional tensors, defining your own tensor collection. For example, you can extract input image data
that is passed into the model and save it as a custom tensor by adding the following code line, where
"images" is an example name of the custom tensor and image_inputs is an example variable for the
input image data.
hook.save_tensor("images", image_inputs)

Note that you must provide a string as the first argument to name the custom tensor.
hook.save_tensor() has a third argument, collections_to_write, to specify the tensor
collection in which to save the custom tensor. The default is collections_to_write="default". If you
don't explicitly specify the third argument, the custom tensor is saved to the "default" tensor
collection.
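
Putting the preceding steps together, the following is a minimal sketch of an adapted PyTorch training
script. The network, data, and hyperparameters are hypothetical placeholders for illustration only, and
smd.modes.TRAIN and smd.modes.EVAL are the mode constants equivalent to the ModeKeys values used
above.

import smdebug.pytorch as smd
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical model, loss, and optimizer for illustration only
model=nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn=nn.CrossEntropyLoss()
optimizer=optim.SGD(model.parameters(), lr=0.01)

# Step 1: create the hook (returns None outside SageMaker without a manual hook)
hook=smd.get_hook(create_if_not_exists=True)
if hook:
    hook.register_module(model)   # Step 2: wrap the model
    hook.register_loss(loss_fn)   # Step 3: wrap the loss function

def train(epochs, batch_size=32):
    if hook:
        hook.set_mode(smd.modes.TRAIN)   # Step 4: mark the training phase
    for _ in range(epochs):
        data=torch.rand(batch_size, 8)
        labels=torch.randint(0, 2, (batch_size,))
        optimizer.zero_grad()
        loss=loss_fn(model(data), labels)
        loss.backward()
        optimizer.step()

def validation(batch_size=32):
    if hook:
        hook.set_mode(smd.modes.EVAL)    # Step 5: mark the validation phase
    with torch.no_grad():
        data=torch.rand(batch_size, 8)
        labels=torch.randint(0, 2, (batch_size,))
        accuracy=(model(data).argmax(dim=1) == labels).float().mean().item()
        if hook:
            hook.save_scalar("accuracy", accuracy)   # Step 6: save a custom scalar

train(epochs=2)
validation()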

After you have completed adapting your training script, proceed to the section called “Step 2: Launch
and Debug Training Jobs Using SageMaker Python SDK” (p. 1669).

Adapt Your TensorFlow Training Script


To start collecting model output tensors and debug training issues, make the following modifications to
your TensorFlow training script.

Create a hook for training jobs within SageMaker

import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)

This creates a hook when you start a SageMaker training job. When you launch a training job in the
section called “Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK” (p. 1669) with
any of the DebuggerHookConfig, TensorBoardConfig, or Rules in your estimator, SageMaker adds
a JSON configuration file to your training instance that is picked up by the smd.get_hook method. Note
that if you do not include any of the configuration APIs in your estimator, there will be no configuration
file for the hook to find, and the function returns None.

(Optional) Create a hook for training jobs outside SageMaker

If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2 instances,
or your own local devices, use the smd.Hook class to create a hook. However, this approach can only store
the tensor collections, which are usable for TensorBoard visualization. SageMaker Debugger’s built-in Rules
don’t work with the local mode. The smd.get_hook method also returns None in this case.

If you want to create a manual hook, use the following code snippet with the logic to check if the hook
returns None and create a manual hook using the smd.Hook class.

import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)

if hook is None:
    hook=smd.KerasHook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )

After adding the hook creation code, proceed to the following topic for TensorFlow Keras.
Note
SageMaker Debugger currently supports TensorFlow Keras only.


Register the hook in your TensorFlow Keras training script

The following procedure walks you through how to use the hook and its methods to collect output
scalars and tensors from your model and optimizer.

1. Wrap your Keras model and optimizer with the hook’s class methods.

The hook.register_model() method takes your model and iterates through each layer, looking for
any tensors that match with regular expressions that you’ll provide through the configuration in the
section called “Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK” (p. 1669). The
collectable tensors through this hook method are weights, biases, and activations.

model=tf.keras.Model(...)
hook.register_model(model)

2. Wrap the optimizer by the hook.wrap_optimizer() method.

optimizer=tf.keras.optimizers.Adam(...)
optimizer=hook.wrap_optimizer(optimizer)

3. Compile the model in eager mode in TensorFlow.

To collect tensors from the model, such as the input and output tensors of each layer, you must run
the training in eager mode. Otherwise, SageMaker Debugger will not be able to collect the tensors.
However, other tensors, such as model weights, biases, and the loss, can be collected without explicitly
running in eager mode.

model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizer,
    metrics=["accuracy"],
    # Required for collecting tensors of each layer
    run_eagerly=True
)

4. Register the hook to the tf.keras.Model.fit() method.

To collect the tensors from the hooks that you registered, add callbacks=[hook] to the Keras
model.fit() class method. This will pass the sagemaker-debugger hook as a Keras callback.

model.fit(
    X_train, Y_train,
    batch_size=batch_size,
    epochs=epoch,
    validation_data=(X_valid, Y_valid),
    shuffle=True,
    callbacks=[hook]
)

5. TensorFlow 2.x provides only symbolic gradient variables that do not provide access to their values. To
collect gradients, wrap tf.GradientTape by the hook.wrap_tape() method, which requires you
to write your own training step as follows.

def training_step(model, dataset):
    with hook.wrap_tape(tf.GradientTape()) as tape:
        pred=model(data)
        loss_value=loss_fn(labels, pred)
    grads=tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))


By wrapping the tape, the sagemaker-debugger hook can identify output tensors such as gradients,
parameters, and losses. Wrapping the tape means that the hook.wrap_tape() method wraps functions
of the tape object, such as push_tape(), pop_tape(), and gradient(), so that it sets up the
writers of SageMaker Debugger and saves the tensors that are provided as input to gradient() (trainable
variables and loss) and as output of gradient() (gradients).
Note
To collect tensors from a custom training loop, make sure that you use eager mode. Otherwise,
SageMaker Debugger is not able to collect any tensors.

For a full list of actions that the sagemaker-debugger hook APIs offer to construct hooks and save
tensors, see Hook Methods in the sagemaker-debugger Python SDK documentation.
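
Putting the preceding steps together, the following is a minimal sketch of an adapted Keras training
script. The model, data, and hyperparameters are hypothetical placeholders for illustration only.

import numpy as np
import tensorflow as tf
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)

# Hypothetical small classifier and random data for illustration only
model=tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(3, activation="softmax")
])
optimizer=tf.keras.optimizers.Adam()

if hook:
    hook.register_model(model)                 # step 1: wrap the model
    optimizer=hook.wrap_optimizer(optimizer)   # step 2: wrap the optimizer

model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=optimizer,
    metrics=["accuracy"],
    run_eagerly=True    # step 3: eager mode is required for layer tensors
)

X=np.random.rand(64, 8).astype("float32")
y=np.random.randint(0, 3, size=(64,))

# step 4: pass the hook as a Keras callback
model.fit(X, y, batch_size=16, epochs=1, callbacks=[hook] if hook else None)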

After you have completed adapting your training script, proceed to the section called “Step 2: Launch
and Debug Training Jobs Using SageMaker Python SDK” (p. 1669).

Step 2: Launch and Debug Training Jobs Using SageMaker Python SDK

To configure a SageMaker estimator with SageMaker Debugger, use Amazon SageMaker Python SDK
and specify Debugger-specific parameters. To fully utilize the debugging functionality, there are three
parameters you need to configure: debugger_hook_config, tensorboard_output_config, and
rules.
Important
Before constructing and running the estimator fit method to launch a training job, make sure
that you adapt your training script following the instructions at the section called “Step 1: Adapt
Your Training Script to Register a Hook” (p. 1665).

Construct a SageMaker Estimator with Debugger-specific parameters


The code examples in this section show how to construct a SageMaker estimator with the Debugger-
specific parameters.
Note
The following code examples are templates for constructing the SageMaker framework
estimators and not directly executable. You need to proceed to the next sections and configure
the Debugger-specific parameters.

PyTorch

# An example of constructing a SageMaker PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

TensorFlow

# An example of constructing a SageMaker TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, ProfilerRule, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.built_in_rule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

MXNet

# An example of constructing a SageMaker MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

XGBoost

# An example of constructing a SageMaker XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

Generic estimator

# An example of constructing a SageMaker generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

Configure the following parameters to activate SageMaker Debugger:

• debugger_hook_config (an object of DebuggerHookConfig) – Required to activate the hook in
the training script that you adapted during the section called “Step 1: Adapt Your Training Script to
Register a Hook” (p. 1665), to configure the SageMaker training launcher (estimator) to collect output
tensors from your training job, and to save the tensors into your secured S3 bucket or local machine. To
learn how to configure the debugger_hook_config parameter, see Configure SageMaker Debugger to
Save Tensors (p. 1672).
• rules (a list of Rule objects) – Configure this parameter to activate SageMaker Debugger built-
in rules that you want to run in real time. The built-in rules provide logic that automatically debugs the
training progress of your model and finds training issues by analyzing the output tensors saved in your
secured S3 bucket. To learn how to configure the rules parameter, see Configure Debugger Built-in
Rules (p. 1678). To find a complete list of built-in rules for debugging output tensors, see the section
called “Debugger Rule” (p. 1750). If you want to create your own logic to detect any training issues,
see the section called “Create Custom Rules” (p. 1793).
Note
The built-in rules are available only through SageMaker training instances. You cannot use
them in local mode.
• tensorboard_output_config (an object of TensorBoardOutputConfig) – Configure SageMaker
Debugger to collect output tensors in the TensorBoard-compatible format and save them to your S3
output path specified in the TensorBoardOutputConfig object, as shown in the sketch after the
following notes. To learn more, see the section called “Visualize Debugger Output Tensors in
TensorBoard” (p. 1707).
Note
The tensorboard_output_config must be configured with the debugger_hook_config
parameter, which also requires you to adapt your training script by adding the sagemaker-
debugger hook.

Note
SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For
example, the format of the default S3 bucket URI in your account is
s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/.
There are two subfolders created by SageMaker Debugger: debug-output and rule-output.
If you add the tensorboard_output_config parameter, you'll also find a tensorboard-output
folder.
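
As noted above, the tensorboard_output_config parameter is configured with a
TensorBoardOutputConfig object and passed to the estimator alongside debugger_hook_config.
The following is a minimal sketch; the S3 URI is a placeholder to replace with your own bucket.

from sagemaker.debugger import TensorBoardOutputConfig

# Hypothetical S3 URI for the TensorBoard-compatible output
tensorboard_output_config=TensorBoardOutputConfig(
    s3_output_path="s3://your-bucket/tensorboard-output"
)

# Pass it to the estimator together with the hook configuration, for example:
# estimator=TensorFlow(
#     ...,
#     debugger_hook_config=debugger_hook_config,
#     tensorboard_output_config=tensorboard_output_config
# )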

See the following topics to find more examples of how to configure the Debugger-specific parameters in
detail.

Topics
• Configure SageMaker Debugger to Save Tensors (p. 1672)
• Configure Debugger Built-in Rules (p. 1678)
• Turn Off Debugger (p. 1683)
• Useful SageMaker Estimator Classmethods for Debugger (p. 1684)

Configure SageMaker Debugger to Save Tensors


Tensors are data collections of updated parameters from the backward and forward pass of each
training iteration. SageMaker Debugger collects the output tensors to analyze the state of a training
job. SageMaker Debugger's CollectionConfig and DebuggerHookConfig API operations provide
methods for grouping tensors into collections and saving them to a target S3 bucket.
Note
After it is properly configured and activated, SageMaker Debugger saves the output tensors in
a default S3 bucket, unless otherwise specified. The format of the default S3 bucket URI is
s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/.

While constructing a SageMaker estimator, activate SageMaker Debugger by specifying the
debugger_hook_config parameter. The following steps include examples of how to set up the
debugger_hook_config using the CollectionConfig and DebuggerHookConfig API operations to
pull tensors out of your training jobs and save them.

Configure Tensor Collections Using the CollectionConfig API


Use the CollectionConfig API operation to configure tensor collections. Debugger provides pre-built
tensor collections that cover a variety of regular expressions (regex) of parameters if you use Debugger-
supported deep learning frameworks and machine learning algorithms. As shown in the following
example code, add the built-in tensor collections you want to debug.

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(name="weights"),
    CollectionConfig(name="gradients")
]

The preceding collections set up the Debugger hook to save the tensors every 500 steps based on the
default "save_interval" value.

For a full list of available Debugger built-in collections, see Debugger Built-in Collections.

If you want to customize the built-in collections, such as changing the save intervals and tensor regex,
use the following CollectionConfig template to adjust parameters.

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="tensor_collection",
        parameters={
            "key_1": "value_1",
            "key_2": "value_2",
            ...
            "key_n": "value_n"
        }
    )
]

For more information about available parameter keys, see CollectionConfig in the Amazon SageMaker
Python SDK. For example, the following code example shows how you can adjust the save intervals of
the "losses" tensor collection at different phases of training: save loss every 100 steps in training phase
and validation loss every 10 steps in validation phase.

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "train.save_interval": "100",
            "eval.save_interval": "10"
        }
    )
]

Tip
This tensor collection configuration object can be used for both DebuggerHookConfig and Rule
API operations.

Configure the DebuggerHookConfig API to Save Tensors


Use the DebuggerHookConfig API to create a debugger_hook_config object using the
collection_configs object you created in the previous step.

from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    collection_configs=collection_configs
)

Debugger saves the model training output tensors into the default S3 bucket. The format of the default
S3 bucket URI is s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/
debug-output/.

If you want to specify an exact S3 bucket URI, use the following code example:

from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    s3_output_path="specify-your-s3-bucket-uri",
    collection_configs=collection_configs
)

For more information, see DebuggerHookConfig in the Amazon SageMaker Python SDK.
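
After a training job configured this way finishes writing tensors, you can read them back with the
smdebug trials API. The following is a minimal sketch; the S3 path is a placeholder, and you can
retrieve the actual path with estimator.latest_job_debugger_artifacts_path().

from smdebug.trials import create_trial

# Hypothetical S3 path; replace with your job's debug-output URI
trial=create_trial("s3://your-bucket/your-training-job/debug-output")

# Print every saved value in the "losses" collection, step by step
for name in trial.tensor_names(collection="losses"):
    tensor=trial.tensor(name)
    for step in tensor.steps():
        print(name, step, tensor.value(step))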

Example Notebooks and Code Samples to Configure Debugger Hook


The following sections provide notebooks and code examples of how to use the Debugger hook to save,
access, and visualize output tensors.

Topics
• Tensor Visualization Example Notebooks (p. 1674)
• Save Tensors Using Debugger Built-in Collections (p. 1676)
• Save Tensors Using Debugger Modified Built-in Collections (p. 1677)
• Save Tensors Using Debugger Custom Collections (p. 1677)

Tensor Visualization Example Notebooks


The following notebook examples show advanced use of Amazon SageMaker Debugger for
visualizing tensors. Debugger provides a transparent view into training deep learning models.

• Interactive Tensor Analysis in SageMaker Studio Notebook with MXNet

This notebook example shows how to visualize saved tensors using Amazon SageMaker Debugger.
By visualizing the tensors, you can see how the tensor values change while training deep learning
algorithms. This notebook includes a training job with a poorly configured neural network and uses
Amazon SageMaker Debugger to aggregate and analyze tensors, including gradients, activation
outputs, and weights. For example, the following plot shows the distribution of gradients of a
convolutional layer that is suffering from a vanishing gradient problem.

This notebook also illustrates how a good initial hyperparameter setting improves the training process
by generating the same tensor distribution plots.
• Visualizing and Debugging Tensors from MXNet Model Training

This notebook example shows how to save and visualize tensors from an MXNet Gluon model training
job using Amazon SageMaker Debugger. It illustrates that Debugger is set to save all tensors to an
Amazon S3 bucket and retrieves ReLu activation outputs for the visualization. The following figure
shows a three-dimensional visualization of the ReLu activation outputs. The color scheme is set to blue
to indicate values close to 0 and yellow to indicate values close to 1.

In this notebook, the TensorPlot class imported from tensor_plot.py is designed to
plot convolutional neural networks (CNNs) that take two-dimensional images for inputs. The
tensor_plot.py script provided with the notebook retrieves tensors using Debugger and visualizes
the CNN. You can run this notebook on SageMaker Studio to reproduce the tensor visualization and
implement your own convolutional neural network model.
• Real-time Tensor Analysis in a SageMaker Notebook with MXNet

This example guides you through installing required components for emitting tensors in an Amazon
SageMaker training job and using the Debugger API operations to access those tensors while training
is running. A gluon CNN model is trained on the Fashion MNIST dataset. While the job is running, you
will see how Debugger retrieves activation outputs of the first convolutional layer from each of 100
batches and visualizes them. Also, this will show you how to visualize weights after the job is done.

Save Tensors Using Debugger Built-in Collections

You can configure built-in collections of tensors using the CollectionConfig API and save them using the
DebuggerHookConfig API. The following example shows how to use the default settings of Debugger
hook configurations to construct a SageMaker TensorFlow estimator. You can also utilize this
configuration for MXNet, PyTorch, and XGBoost estimators.
Note
In the following example code, the s3_output_path parameter for DebuggerHookConfig
is optional. If you do not specify it, Debugger saves the tensors at s3://<output_path>/
debug-output/, where the <output_path> is the default output path of SageMaker training
jobs. For example:

"s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-training-YYYY-MM-DD-HH-
MM-SS-123/debug-output"

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call built-in collections
collection_configs=[
    CollectionConfig(name="weights"),
    CollectionConfig(name="gradients"),
    CollectionConfig(name="losses"),
    CollectionConfig(name="biases")
]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-built-in-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(
        BUCKET_NAME=BUCKET_NAME, LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()

To see a list of Debugger built-in collections, see Debugger Built-in Collections.

Save Tensors Using Debugger Modified Built-in Collections


You can modify the Debugger built-in collections using the CollectionConfig API operation. The
following example shows how to tweak the built-in losses collection and construct a SageMaker
TensorFlow estimator. You can also use this for MXNet, PyTorch, and XGBoost estimators.

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call and modify built-in collections
collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={"save_interval": "50"})
]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-modified-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(
        BUCKET_NAME=BUCKET_NAME, LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()

For a full list of CollectionConfig parameters, see Debugger CollectionConfig API.

Save Tensors Using Debugger Custom Collections


You can also save a reduced number of tensors instead of the full set of tensors (for example, if you want
to reduce the amount of data saved in your Amazon S3 bucket). The following example shows how to
customize the Debugger hook configuration to specify target tensors that you want to save. You can use
this for TensorFlow, MXNet, PyTorch, and XGBoost estimators.


import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to create a custom collection
collection_configs=[
    CollectionConfig(
        name="custom_activations_collection",
        parameters={
            "include_regex": "relu|tanh", # Required
            "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max"
        })
]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-custom-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(
        BUCKET_NAME=BUCKET_NAME, LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()

For a full list of CollectionConfig parameters, see Debugger CollectionConfig.

Configure Debugger Built-in Rules


Amazon SageMaker Debugger's built-in rules analyze tensors emitted during the training of a model.
SageMaker Debugger offers the Rule API operation, which monitors training job progress and detects
errors that affect the success of your model training. For example, the rules can detect whether gradients
are getting too large or too small, whether a model is overfitting or overtraining, and whether a training
job is failing to decrease the loss function and improve the model. To see a full list of available built-in
rules, see List of Debugger Built-in Rules (p. 1748).

In the following topics, you'll learn how to use the SageMaker Debugger built-in rules.

Topics
• Use Debugger Built-in Rules with the Default Parameter Settings (p. 1679)
• Use Debugger Built-in Rules with Custom Parameter Values (p. 1679)
• Example Notebooks and Code Samples to Configure Debugger Rules (p. 1680)


Use Debugger Built-in Rules with the Default Parameter Settings

To specify Debugger built-in rules in an estimator, you need to configure a list object. The following
example code shows the basic structure of listing the Debugger built-in rules:

from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n()),
    ... # You can also append more profiler rules in the ProfilerRule.sagemaker(rule_configs.*()) format.
]

For more information about default parameter values and descriptions of the built-in rule, see List of
Debugger Built-in Rules (p. 1748).

To find the SageMaker Debugger API reference, see sagemaker.debugger.rule_configs and
sagemaker.debugger.Rule.

For example, to inspect the overall training performance and progress of your model, construct a
SageMaker estimator with the following built-in rule configuration.

from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.stalled_training_rule())
]

When you start the training job, Debugger collects system resource utilization data every
500 milliseconds and the loss and accuracy values every 500 steps by default. Debugger
analyzes the resource utilization to identify if your model is having bottleneck problems. The
loss_not_decreasing, overfit, overtraining, and stalled_training_rule rules monitor whether your
model is optimizing the loss function without running into those training issues. If the rules detect
training anomalies, the rule evaluation status changes to IssueFound. You can set up automated actions,
such as notifying training issues and stopping training jobs, using Amazon CloudWatch Events and AWS
Lambda. For more information, see Action on Amazon SageMaker Debugger Rules (p. 1698).
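
For example, the built-in rule actions can stop the training job and send a notification when a rule
fires. The following is a minimal sketch of such a rule configuration; the email address is a placeholder,
and the details are covered in Action on Amazon SageMaker Debugger Rules (p. 1698).

from sagemaker.debugger import Rule, rule_configs

# Placeholder email address; replace with your own
actions=rule_configs.ActionList(
    rule_configs.StopTraining(),
    rule_configs.Email("your-email@example.com")
)

rules=[
    Rule.sagemaker(rule_configs.stalled_training_rule(), actions=actions)
]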

Use Debugger Built-in Rules with Custom Parameter Values

If you want to adjust the built-in rule parameter values and customize tensor collection regexes, configure
the base_config and rule_parameters parameters for the ProfilerRule.sagemaker and
Rule.sagemaker class methods. For the Rule.sagemaker class methods, you can also customize
tensor collections through the collections_to_save parameter. Instructions on how to use the
CollectionConfig class are provided at Configure Tensor Collections Using the CollectionConfig
API (p. 1673).

Use the following configuration template for built-in rules to customize parameter values. By changing
the rule parameters as you want, you can adjust the sensitivity of the rules to be triggered.

• The base_config argument is where you call the built-in rule methods.
• The rule_parameters argument is to adjust the default key values of the built-in rules listed in List
of Debugger Built-in Rules (p. 1748).

1679
Amazon SageMaker Developer Guide
Debug Training Jobs

• The collections_to_save argument takes in a tensor configuration through the


CollectionConfig API, which requires name and parameters arguments.
• To find available tensor collections for name, see Debugger Built-in Tensor Collections.
• For a full list of adjustable parameters, see Debugger CollectionConfig API.

For more information about the Debugger rule class, methods, and parameters, see SageMaker
Debugger Rule class in the Amazon SageMaker Python SDK.

from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]

The parameter descriptions and value customization examples are provided for each rule at List of
Debugger Built-in Rules (p. 1748).

Example Notebooks and Code Samples to Configure Debugger Rules

The following sections provide notebooks and code samples that show how to use Debugger rules to
monitor SageMaker training jobs.

Topics
• Debugger Built-in Rules Example Notebooks (p. 1680)
• Debugger Built-in Rules Example Code (p. 1681)
• Use Debugger Built-in Rules with Parameter Modifications (p. 1682)

Debugger Built-in Rules Example Notebooks

The following example notebooks show how to use Debugger built-in rules when running training jobs
with Amazon SageMaker:

• Using a SageMaker Debugger built-in rule with TensorFlow


• Using a SageMaker Debugger built-in rule with Managed Spot Training and MXNet
• Using a SageMaker Debugger built-in rule with XGBoost
• Using a SageMaker Debugger built-in rule with parameter modifications for a real-time training job
analysis with XGBoost

While running the example notebooks in SageMaker Studio, you can find the training job trial created
on the Studio Experiment List tab. For example, as shown in the following screenshot, you can find and
open a Describe Trial Component window of your current training job. On the Debugger tab, you can
check if the Debugger rules, vanishing_gradient() and loss_not_decreasing(), are monitoring
the training session in parallel. For full instructions on how to find your training job trial components in
the Studio UI, see SageMaker Studio - View Experiments, Trials, and Trial Components.

There are two ways of using the Debugger built-in rules in the SageMaker environment: use the built-in
rules as they are prepared, or adjust their parameters as you want. The following topics show you how to
use the built-in rules with example code.

Debugger Built-in Rules Example Code

The following code sample shows how to set the Debugger built-in rules using the Rule.sagemaker
method. To specify built-in rules that you want to run, use the rule_configs API operation to call
the built-in rules. To find a full list of Debugger built-in rules and default parameter values, see List of
Debugger Built-in Rules (p. 1748).

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call built-in rules that you want to use.
built_in_rules=[
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing())
]

# construct a SageMaker estimator with the Debugger built-in rules
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-built-in-rules-demo',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules
)
sagemaker_estimator.fit()

Note
The Debugger built-in rules run in parallel with your training job. The maximum number of
built-in rule containers for a training job is 20.

For more information about the Debugger rule class, methods, and parameters, see the SageMaker
Debugger Rule class in the Amazon SageMaker Python SDK.

To find an example of how to adjust the Debugger rule parameters, see the following Use Debugger
Built-in Rules with Parameter Modifications (p. 1682) section.

Use Debugger Built-in Rules with Parameter Modifications

The following code example shows the structure of built-in rules to adjust parameters. In this example,
the stalled_training_rule collects the losses tensor collection from a training job every 50
steps during the training stage and every 10 steps during the evaluation stage. If the training process
starts stalling and stops emitting tensor outputs for 120 seconds, the stalled_training_rule stops
the training job.

import time

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call the built-in rules and modify the CollectionConfig parameters
base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

built_in_rules_modified=[
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
            'threshold': '120',
            'training_job_name_prefix': base_job_name_prefix,
            'stop_training_on_fire': 'True'
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "50",
                    "eval.save_interval": "10"
                }
            )
        ]
    )
]

# construct a SageMaker estimator with the modified Debugger built-in rule
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name_prefix,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules_modified
)
sagemaker_estimator.fit()

For an advanced configuration of the Debugger built-in rules using the CreateTrainingJob API, see
Configure Debugger Using Amazon SageMaker API (p. 1799).

Turn Off Debugger
If you want to completely turn off Debugger, do one of the following:

• Before starting a training job, do the following:

To stop both monitoring and profiling, add the disable_profiler parameter to your estimator
and set it to True.
Warning
If you disable it, you won't be able to view the comprehensive Studio Debugger insights
dashboard and the autogenerated profiling report.

To stop debugging, set the debugger_hook_config parameter to False.


Warning
If you disable it, you won't be able to collect output tensors and cannot debug your model
parameters.

estimator=Estimator(
    ...
    disable_profiler=True,
    debugger_hook_config=False
)

For more information about the Debugger-specific parameters, see SageMaker Estimator in the
Amazon SageMaker Python SDK.
• While a training job is running, do the following:

To disable both monitoring and profiling while your training job is running, use the following
estimator classmethod:

estimator.disable_profiling()

To disable framework profiling only and keep system monitoring, use the update_profiler method:

estimator.update_profiler(disable_framework_metrics=True)

For more information about the estimator extension methods, see the estimator.disable_profiling and
estimator.update_profiler classmethods in the Amazon SageMaker Python SDK documentation.


Useful SageMaker Estimator Classmethods for Debugger


The following estimator class methods are useful for accessing your SageMaker training job information
and retrieving output paths of training data collected by Debugger. The following methods are
executable after you initiate a training job with the estimator.fit() method.

• To check the base S3 bucket URI of a SageMaker training job:

estimator.output_path

• To check the base job name of a SageMaker training job:

estimator.latest_training_job.job_name

• To see a full CreateTrainingJob API operation configuration of a SageMaker training job:

estimator.latest_training_job.describe()

• To check a full list of the Debugger rules while a SageMaker training job is running:

estimator.latest_training_job.rule_job_summary()

• To check the S3 bucket URI where the model parameter data (output tensors) are saved:

estimator.latest_job_debugger_artifacts_path()

• To check the S3 bucket URI where the model performance data (system and framework metrics) are
saved:

estimator.latest_job_profiler_artifacts_path()

• To check the Debugger rule configuration for debugging output tensors:

estimator.debugger_rule_configs

• To check the list of the Debugger rules for debugging while a SageMaker training job is running:

estimator.debugger_rules

• To check the Debugger rule configuration for monitoring and profiling system and framework metrics:

estimator.profiler_rule_configs

• To check the list of the Debugger rules for monitoring and profiling while a SageMaker training job is
running:

estimator.profiler_rules

For more information about the SageMaker estimator class and its methods, see Estimator API in the
Amazon SageMaker Python SDK.
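
As a quick reference, the following sketch pulls the most commonly used of these properties together in one cell. It assumes an estimator that has already been constructed and fit, as in the preceding examples on this page.

# Assumes `estimator` was constructed and estimator.fit() has been initiated,
# as in the preceding examples
print("S3 output path:     ", estimator.output_path)
print("Training job name:  ", estimator.latest_training_job.job_name)
print("Debug artifacts:    ", estimator.latest_job_debugger_artifacts_path())
print("Profiler artifacts: ", estimator.latest_job_profiler_artifacts_path())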

SageMaker Debugger Interactive Report for XGBoost


Receive training reports autogenerated by Debugger. The Debugger reports provide insights into your
training jobs and suggest recommendations to improve your model performance.


Note
You can download a Debugger report while your training job is running or after the job has
finished. During training, Debugger concurrently updates the report to reflect the current rules'
evaluation status. You can download a complete Debugger report only after the training job has
completed.
Important
In the report, plots and recommendations are provided for informational purposes and
are not definitive. You are responsible for making your own independent assessment of the
information.

SageMaker Debugger XGBoost Training Report


For SageMaker XGBoost training jobs, use the Debugger CreateXgboostReport (p. 1761) rule to receive
a comprehensive training report of the training progress and results. Following this guide, specify the
CreateXgboostReport (p. 1761) rule while constructing an XGBoost estimator, download the report
using the Amazon SageMaker Python SDK or the Amazon S3 console, and gain insights into the training
results.
Important
In the report, plots and recommendations are provided for informational purposes and
are not definitive. You are responsible for making your own independent assessment of the
information.

Topics
• Construct a SageMaker XGBoost Estimator with the Debugger XGBoost Report Rule (p. 1685)
• Download the Debugger XGBoost Training Report (p. 1686)
• Debugger XGBoost Training Report Walkthrough (p. 1689)

Construct a SageMaker XGBoost Estimator with the Debugger XGBoost Report Rule

The CreateXgboostReport (p. 1761) rule collects the following output tensors from your training job:

• hyperparameters – Saves at the first step.
• metrics – Saves loss and accuracy every 5 steps.
• feature_importance – Saves every 5 steps.
• predictions – Saves every 5 steps.
• labels – Saves every 5 steps.

The output tensors are saved at a default S3 bucket. For example,
s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/debug-output/.

When you construct a SageMaker estimator for an XGBoost training job, specify the rule as shown in the
following example code.

Using the SageMaker generic estimator

import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

region = boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-xgboost-report-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",

    # Add the Debugger XGBoost report rule
    rules=rules
)

estimator.fit(wait=False)
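
After the training job starts, you can also load the output tensors that the rule collects by using the SMDebug client library. The following is a minimal sketch; it assumes the smdebug package is installed in your notebook environment.

from smdebug.trials import create_trial

# Load the output tensors collected by Debugger from the job's debug output path
trial = create_trial(estimator.latest_job_debugger_artifacts_path())

# List the saved tensor names, such as the metrics and feature_importance collections
print(trial.tensor_names())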

Download the Debugger XGBoost Training Report

Download the Debugger XGBoost training report while your training job is running or after the job has
finished using the Amazon SageMaker Python SDK and AWS Command Line Interface (CLI).

Download using the SageMaker Python SDK and AWS CLI

1. Check the current job's default S3 output base URI.

estimator.output_path

2. Check the current job name.

estimator.latest_training_job.job_name

3. The Debugger XGBoost report is stored under <default-s3-output-base-uri>/<training-job-name>/rule-output.
Configure the rule output path as follows:

rule_output_path = estimator.output_path + "/" + estimator.latest_training_job.job_name + "/rule-output"

4. To check if the report is generated, list directories and files recursively under the
rule_output_path using aws s3 ls with the --recursive option.

! aws s3 ls {rule_output_path} --recursive

This should return a complete list of files under autogenerated folders that are named
CreateXgboostReport and ProfilerReport-1234567890. The XGBoost training
report is stored in the CreateXgboostReport folder, and the profiling report is stored in the
ProfilerReport-1234567890 folder. To learn more about the profiling report generated by
default with the XGBoost training job, see SageMaker Debugger Profiling Report (p. 1729).


The xgboost_report.html file is an XGBoost training report autogenerated by Debugger. The
xgboost_report.ipynb file is a Jupyter notebook that's used to aggregate training results into
the report. You can download all of the files, browse the HTML report file, and modify the
report using the notebook.
5. Download the files recursively using aws s3 cp. The following command saves all of the rule
output files, including the CreateXgboostReport and ProfilerReport-1234567890 folders, under
the current working directory.

! aws s3 cp {rule_output_path} ./ --recursive

Tip
If you are using a Jupyter notebook server, run !pwd to verify the current working
directory.
6. Under the /CreateXgboostReport directory, open xgboost_report.html. If you are using
JupyterLab, choose Trust HTML to see the autogenerated Debugger training report.

7. Open the xgboost_report.ipynb file to explore how the report is generated. You can
customize and extend the training report using the Jupyter notebook file.

Download using the Amazon S3 console

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://
console.aws.amazon.com/s3/.
2. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base
S3 bucket name should be in the following format: sagemaker-<region>-111122223333.
Look up the base S3 bucket through the Find bucket by name field.

3. In the base S3 bucket, look up the training job name by entering your job name prefix in Find
objects by prefix and then choosing the training job name.


4. In the training job's S3 bucket, choose the rule-output/ subfolder. There must be three subfolders
for training data collected by Debugger: debug-output/, profiler-output/, and rule-output/.

5. In the rule-output/ folder, choose the CreateXgboostReport/ folder. The folder contains
xgboost_report.html (the autogenerated report in HTML) and xgboost_report.ipynb (a Jupyter
notebook with scripts that are used for generating the report).
6. Choose the xgboost_report.html file, choose Download actions, and then choose Download.


7. Open the downloaded xgboost_report.html file in a web browser.

Debugger XGBoost Training Report Walkthrough

This section walks you through the Debugger XGBoost training report. The report is automatically
aggregated depending on the output tensor regex, recognizing whether your training job is a
binary classification, multiclass classification, or regression job.
Important
In the report, plots and recommendations are provided for informational purposes and
are not definitive. You are responsible for making your own independent assessment of the
information.

Topics
• Distribution of True Labels of the Dataset (p. 1690)
• Loss versus Step Graph (p. 1690)
• Feature Importance (p. 1691)
• Confusion Matrix (p. 1692)
• Evaluation of the Confusion Matrix (p. 1693)
• Accuracy Rate of Each Diagonal Element Over Iteration (p. 1694)
• Receiver Operating Characteristic Curve (p. 1695)
• Distribution of Residuals at the Last Saved Step (p. 1696)
• Absolute Validation Error per Label Bin Over Iteration (p. 1697)


Distribution of True Labels of the Dataset

This histogram shows the distribution of labeled classes (for classification) or values (for regression) in
your original dataset. Skewness in your dataset could contribute to inaccuracies. This visualization is
available for the following model types: binary classification, multiclassification, and regression.

Loss versus Step Graph

This is a line chart that shows the progression of loss on training data and validation data throughout
training steps. The loss is what you defined in your objective function, such as mean squared error. You
can gauge whether the model is overfit or underfit from this plot. This section also provides insights that
you can use to determine how to resolve the overfit and underfit problems. This visualization is available
for the following model types: binary classification, multiclassification, and regression.


Feature Importance

There are three different types of feature importance visualizations provided: Weight, Gain and
Coverage. We provide detailed definitions for each of the three in the report. Feature importance
visualizations help you learn what features in your training dataset contributed to the predictions.
Feature importance visualizations are available for the following model types: binary classification,
multiclassification, and regression.


Confusion Matrix

This visualization is only applicable to binary and multiclass classification models. Accuracy alone might
not be sufficient for evaluating the model performance. For some use cases, such as healthcare and fraud
detection, it’s also important to know the false positive rate and false negative rate. A confusion matrix
gives you the additional dimensions for evaluating your model performance.


Evaluation of the Confusion Matrix

This section provides you with more insights on the micro, macro, and weighted metrics on precision,
recall, and F1-score for your model.


Accuracy Rate of Each Diagonal Element Over Iteration

This visualization is only applicable to binary classification and multiclass classification models. This is a
line chart that plots the diagonal values in the confusion matrix throughout the training steps for each
class. This plot shows you how the accuracy of each class progresses throughout the training steps. You
can identify the under-performing classes from this plot.


Receiver Operating Characteristic Curve

This visualization is only applicable to binary classification models. The receiver operating characteristic
(ROC) curve is commonly used to evaluate binary classification model performance. The y-axis of the curve
is the true positive rate (TPR) and the x-axis is the false positive rate (FPR). The plot also displays the value for
the area under the curve (AUC). The higher the AUC value, the more predictive your classifier. You can
also use the ROC curve to understand the trade-off between TPR and FPR and identify the optimum
classification threshold for your use case. The classification threshold can be adjusted to tune the
behavior of the model to reduce more of one or another type of error (FP/FN).
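
If you want to reproduce this kind of threshold analysis outside the report, the following is a minimal sketch using scikit-learn (an assumption; the report generates these plots for you). The y_true and y_score arrays are placeholders for your validation labels and predicted probabilities.

import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder validation labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))

# One common heuristic for choosing a threshold: maximize TPR - FPR
best = np.argmax(tpr - fpr)
print("Candidate classification threshold:", thresholds[best])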


Distribution of Residuals at the Last Saved Step

This visualization is a column chart that shows the residual distributions in the last step Debugger
captures. In this visualization, you can check whether the residual distribution is close to a normal
distribution that's centered at zero. If the residuals are skewed, your features may not be sufficient for
predicting the labels.


Absolute Validation Error per Label Bin Over Iteration

This visualization is only applicable to regression models. The actual target values are split into 10
intervals. This visualization shows, in line plots, how validation errors progress for each interval
throughout the training steps. Absolute validation error is the absolute value of the difference between
the predicted and actual values during validation. You can identify the underperforming intervals from
this visualization.


Action on Amazon SageMaker Debugger Rules


Based on the Debugger rule evaluation status, you can set up automated actions such as stopping a
training job and sending notifications using Amazon Simple Notification Service (Amazon SNS). You can
also create your own actions using Amazon CloudWatch Events and AWS Lambda. To learn how to set up
automated actions based on the Debugger rule evaluation status, see the following topics.

Topics
• Debugger Built-in Actions for Rules (p. 1698)
• Create Actions on Rules Using Amazon CloudWatch and AWS Lambda (p. 1702)

Debugger Built-in Actions for Rules


Use Debugger built-in actions to respond to issues found by Debugger Rule (p. 1750). The Debugger
rule_configs class provides tools to configure a list of actions, including automatically stopping
training jobs and sending notifications using Amazon Simple Notification Service (Amazon SNS) when
the Debugger rules find training issues.

Step 1: Set Up Amazon SNS, Create an SMDebugRules Topic, and Subscribe to the Topic

This section walks you through how to set up an Amazon SNS SMDebugRules topic, subscribe to it, and
confirm the subscription to receive notifications from the Debugger rules.
Note
For more information about billing for Amazon SNS, see Amazon SNS pricing and Amazon SNS
FAQs.

To create a SMDebugRules topic

1. Sign in to the AWS Management Console and open the Amazon SNS console at https://
console.aws.amazon.com/sns/v3/home.
2. In the left navigation pane, choose Topics.
3. On the Topics page, choose Create topic.
4. On the Create topic page, in the Details section, do the following:

a. For Type, choose Standard for topic type.


b. In Name, enter SMDebugRules.
5. Skip all other optional settings and choose Create topic. If you want to learn more about the
optional settings, see Creating an Amazon SNS topic.

To subscribe to the SMDebugRules topic

1. Open the Amazon SNS console at https://console.aws.amazon.com/sns/v3/home.


2. In the left navigation pane, choose Subscriptions.
3. On the Subscriptions page, choose Create subscription.
4. On the Create subscription page, in the Details section, do the following:

a. For Topic ARN, choose the SMDebugRules topic ARN. The ARN should be in the format
arn:aws:sns:<region-id>:111122223333:SMDebugRules.
b. For Protocol, choose Email or SMS.
c. For Endpoint, enter the endpoint value, such as an email address or a phone number that you
want to receive notifications.


Note
Make sure you type the correct email address and phone number. Phone numbers must
include +, a country code, and a phone number, with no special characters or spaces. For
example, the phone number +1 (222) 333-4444 is formatted as +12223334444.
5. Skip all other optional settings and choose Create subscription. If you want to learn more about the
optional settings, see Subscribing to an Amazon SNS topic.

After you subscribe to the SMDebugRules topic, you receive a confirmation message by email or phone.

For more information about Amazon SNS, see Mobile text messaging (SMS) and Email notifications in the
Amazon SNS Developer Guide.
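
If you prefer to script the topic setup instead of using the console, the following is a rough equivalent using the AWS SDK for Python (Boto3); the email address is a placeholder for your own endpoint.

import boto3

sns = boto3.client("sns")

# Create the SMDebugRules topic (returns the existing ARN if the topic already exists)
topic_arn = sns.create_topic(Name="SMDebugRules")["TopicArn"]

# Subscribe an email endpoint; replace the placeholder address with your own
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="email",
    Endpoint="abc@example.com"
)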

Step 2: Set Up Your IAM Role to Attach Required Policies

In this step, you add the required policies to your IAM role.

To add the required policies to your IAM role

1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Policies, and choose Create policy.
3. On the Create policy page, do the following to create a new sns-access policy:

a. Choose the JSON tab.


b. Paste the following JSON policy document into the editor, replacing the 12-digit AWS
account ID (111122223333) with your AWS account ID.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "sns:Publish",
                "sns:CreateTopic",
                "sns:Subscribe"
            ],
            "Resource": "arn:aws:sns:*:111122223333:SMDebugRules"
        }
    ]
}


c. At the bottom of the page, choose Review policy.


d. On the Review policy page, for Name, enter sns-access.
e. At the bottom of the page, choose Create policy.
4. Go back to the IAM console, and choose Roles in the left navigation pane.
5. Look up the IAM role that you use for SageMaker model training and choose that IAM role.
6. On the Permissions tab of the Summary page, choose Attach policies.
7. Search for the sns-access policy, select the check box next to the policy, and then choose Attach
policy.

For more examples of setting up IAM policies for Amazon SNS, see Example cases for Amazon SNS access
control.
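
As an alternative to the console steps, the following Boto3 sketch creates and attaches the same policy; the role name is a placeholder for your own SageMaker execution role, and the account ID must be replaced with yours.

import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": ["sns:Publish", "sns:CreateTopic", "sns:Subscribe"],
        "Resource": "arn:aws:sns:*:111122223333:SMDebugRules"  # replace the account ID
    }]
}

# Create the sns-access policy and attach it to your training role (placeholder name)
policy_arn = iam.create_policy(
    PolicyName="sns-access",
    PolicyDocument=json.dumps(policy_document)
)["Policy"]["Arn"]

iam.attach_role_policy(
    RoleName="YourSageMakerExecutionRole",  # placeholder
    PolicyArn=policy_arn
)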

Step 3: Configure Debugger Rules with the Built-in Actions

After successfully finishing the required settings in the preceding steps, you can configure the Debugger
built-in actions for debugging rules as shown in the following example script. You can choose which
built-in actions to use while building the actions list object. The rule_configs is a helper module
that provides high-level tools to configure Debugger built-in rules and actions. The following built-in
actions are available for Debugger:

• rule_configs.StopTraining() – Stops a training job when the Debugger rule finds an issue.
• rule_configs.Email("abc@example.com") – Sends a notification via email when the Debugger rule
finds an issue. Use the email address that you used when you set up your SNS topic subscription.
• rule_configs.SMS("+1234567890") – Sends a notification via text message when the Debugger
rule finds an issue. Use the phone number that you used when you set up your SNS topic subscription.
Note
Make sure you type the correct email address and phone number. Phone numbers must
include +, a country code, and a phone number, with no special characters or spaces. For
example, the phone number +1 (222) 333-4444 is formatted as +12223334444.

You can use all of the built-in actions or a subset of them by wrapping them with the
rule_configs.ActionList() method, which takes the built-in actions and configures them as a list of
actions.

To add all of the three built-in actions to a single rule

If you want to assign all of the three built-in actions to a single rule, configure a Debugger built-in
action list while constructing an estimator. Use the following template to construct the estimator, and
Debugger will stop training jobs and send notifications through email and text for any rules that you use
to monitor your training job progress.

from sagemaker.debugger import Rule, rule_configs

# Configure an action list object for Debugger rules


actions = rule_configs.ActionList(
    rule_configs.StopTraining(),
    rule_configs.Email("abc@example.com"),
    rule_configs.SMS("+1234567890")
)

# Configure rules for debugging with the actions parameter


rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule(), # Required
        rule_parameters={"parameter_key": value }, # Optional
        actions=actions
    )
]

estimator = Estimator(
...
rules = rules
)

estimator.fit(wait=False)

To create multiple built-in action objects to assign different actions to a single rule

If you want to assign the built-in actions to be triggered at different threshold values of a single rule,
you can create multiple built-in action objects as shown in the following script. To avoid a conflict error
from running the same rule, you must submit different rule job names (specify different strings for the
rules' name attribute) as shown in the following example script template. This example shows how to set
up StalledTrainingRule (p. 1781) to take two different actions: send an email to abc@example.com when a
training job stalls for 60 seconds, and stop the training job if it stalls for 120 seconds.

from sagemaker.debugger import Rule, rule_configs


import time

base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

# Configure an action object for StopTraining


action_stop_training = rule_configs.ActionList(
    rule_configs.StopTraining()
)

# Configure an action object for Email

action_email = rule_configs.ActionList(
    rule_configs.Email("abc@example.com")
)

# Configure a rule with the Email built-in action to trigger
# if a training job stalls for 60 seconds
stalled_training_job_rule_email = Rule.sagemaker(
    base_config=rule_configs.stalled_training_rule(),
    rule_parameters={
        "threshold": "60",
        "training_job_name_prefix": base_job_name_prefix
    },
    actions=action_email
)
stalled_training_job_rule_email.name="StalledTrainingJobRuleEmail"

# Configure a rule with the StopTraining built-in action to trigger
# if a training job stalls for 120 seconds
stalled_training_job_rule = Rule.sagemaker(
    base_config=rule_configs.stalled_training_rule(),
    rule_parameters={
        "threshold": "120",
        "training_job_name_prefix": base_job_name_prefix
    },
    actions=action_stop_training
)
stalled_training_job_rule.name="StalledTrainingJobRuleStopTraining"

estimator = Estimator(
...
rules = [stalled_training_job_rule_email, stalled_training_job_rule]
)


estimator.fit(wait=False)

While the training job is running, the Debugger built-in action sends notification emails and text
messages whenever the rule finds issues with your training job. The following screenshot shows an
example of email notification for a training job that has a stalled training job issue.

The following screenshot shows an example text notification that Debugger sends when the rule finds a
StalledTraining issue.

Considerations for Using the Debugger Built-in Actions

• To use the Debugger built-in actions, an internet connection is required. This feature is not supported
in the network isolation mode provided by Amazon SageMaker or Amazon VPC.
• The built-in actions cannot be used for Debugger ProfilerRule (p. 1749).
• The built-in actions cannot be used on training jobs with spot training interruptions.
• In email or text notifications, None appears at the end of messages. This does not have any meaning,
so you can disregard the text None.

Create Actions on Rules Using Amazon CloudWatch and AWS Lambda


Amazon CloudWatch collects Amazon SageMaker model training job logs and Amazon SageMaker
Debugger rule processing job logs. Configure Debugger with Amazon CloudWatch Events and AWS
Lambda to take action based on Debugger rule evaluation status.

CloudWatch Logs for Debugger Rules and Training Jobs

To find training job logs and Debugger rule job logs

1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.


2. In the left navigation pane under the Log node, choose Log Groups.
3. In the log groups list, do the following:


• Choose /aws/sagemaker/TrainingJobs for training job logs.
• Choose /aws/sagemaker/ProcessingJobs for Debugger rule job logs.

You can use the training and Debugger rule job status in the CloudWatch logs to take further actions
when there are training issues.

For more information about monitoring training jobs using CloudWatch, see Monitor Amazon
SageMaker.
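
For example, the following Boto3 sketch lists the most recently active Debugger rule job log streams; the log group name is the one listed above.

import boto3

logs = boto3.client("logs")

# List the five most recently active Debugger rule job log streams
response = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/ProcessingJobs",
    orderBy="LastEventTime",
    descending=True,
    limit=5
)
for stream in response["logStreams"]:
    print(stream["logStreamName"])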

Set Up Debugger for Automated Training Job Termination Using CloudWatch and Lambda

The Debugger rules monitor the training job status, and a CloudWatch Events rule watches the Debugger
rule evaluation status of the training job.

Step 1: Create a Lambda Function

To create a Lambda function

1. Open the AWS Lambda console at https://console.aws.amazon.com/lambda/.


2. In the left navigation pane, choose Functions and then choose Create function.
3. On the Create function page, choose the Author from scratch option.
4. In the Basic information section, enter a Function name (for example, debugger-rule-stop-
training-job).
5. For Runtime, choose Python 3.7.
6. For Permissions, expand the drop-down options, and choose Change default execution role.
7. For Execution role, choose Use an existing role and choose the IAM role that you use for training
jobs on SageMaker.
Note
Make sure you use the execution role with AmazonSageMakerFullAccess and
AWSLambdaBasicExecutionRole attached. Otherwise, the Lambda function won't
properly react to the Debugger rule status changes of the training job. If you are unsure
which execution role is being used, run the following code in a Jupyter notebook cell to
retrieve the execution role output:

import sagemaker
sagemaker.get_execution_role()

8. At the bottom of the page, choose Create function.

The following figure shows an example of the Create function page with the input fields and selections
completed.


Step 2: Configure the Lambda function

To configure the Lambda function

1. In the Function code section of the configuration page, paste the following Python script in the
Lambda code editor pane. The lambda_handler function monitors the Debugger rule evaluation
status collected by CloudWatch and triggers the StopTrainingJob API operation. The AWS SDK
for Python (Boto3) client for SageMaker provides a high-level method, stop_training_job,
which triggers the StopTrainingJob API operation.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):

    training_job_name = event.get("detail").get("TrainingJobName")
    logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')

    eval_statuses = event.get("detail").get("DebugRuleEvaluationStatuses", None)

    if eval_statuses is None or len(eval_statuses) == 0:
        logging.info("Couldn't find any debug rule statuses, skipping...")
        return {
            'statusCode': 200,
            'body': json.dumps('Nothing to do')
        }

    # should only attempt stopping jobs with InProgress status
    training_job_status = event.get("detail").get("TrainingJobStatus", None)
    if training_job_status != 'InProgress':
        logging.debug(f"Current Training job status({training_job_status}) is not 'InProgress'. Exiting")
        return {
            'statusCode': 200,
            'body': json.dumps('Nothing to do')
        }

    client = boto3.client('sagemaker')

    for status in eval_statuses:
        logging.info(status.get("RuleEvaluationStatus") + ', RuleEvaluationStatus=' + str(status))
        if status.get("RuleEvaluationStatus") == "IssuesFound":
            secondary_status = event.get("detail").get("SecondaryStatus", None)
            logging.info(
                f'About to stop training job, since evaluation of rule configuration {status.get("RuleConfigurationName")} resulted in "IssuesFound". ' +
                f'\ntraining job "{training_job_name}" status is "{training_job_status}", secondary status is "{secondary_status}"' +
                f'\nAttempting to stop training job "{training_job_name}"'
            )
            try:
                client.stop_training_job(
                    TrainingJobName=training_job_name
                )
            except Exception as e:
                logging.error(
                    "Encountered error while trying to "
                    "stop training job {}: {}".format(
                        training_job_name, str(e)
                    )
                )
                raise e
    return None

For more information about the Lambda code editor interface, see Creating functions using the AWS
Lambda console editor.
2. Skip all other settings and choose Save at the top of the configuration page.

Step 3: Create a CloudWatch Events Rule and Link to the Lambda Function for Debugger

To create a CloudWatch Events rule and link to the Lambda function for Debugger

1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.


2. In the left navigation pane, choose Rules under the Events node.
3. Choose Create rule.
4. In the Event Source section of the Step 1: Create rule page, choose SageMaker for Service Name,
and choose SageMaker Training Job State Change for Event Type. The Event Pattern Preview
should look like the following example JSON strings:


{
    "source": [
        "aws.sagemaker"
    ],
    "detail-type": [
        "SageMaker Training Job State Change"
    ]
}

5. In the Targets section, choose Add target, and choose the debugger-rule-stop-training-job
Lambda function that you created. This step links the CloudWatch Events rule with the Lambda
function.
6. Choose Configure details and go to the Step 2: Configure rule details page.
7. Specify the CloudWatch rule definition name. For example, debugger-cw-event-rule.
8. Choose Create rule to finish.
9. Go back to the Lambda function configuration page and refresh the page. Confirm that it's
configured correctly in the Designer panel. The CloudWatch Events rule should be registered as a
trigger for the Lambda function.
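
If you want to script this setup instead of clicking through the console, the following is a rough Boto3 equivalent; the function and rule names match the examples above, and you should verify the permissions in your own account.

import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Create the CloudWatch Events rule that watches SageMaker training job state changes
rule_arn = events.put_rule(
    Name="debugger-cw-event-rule",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"]
    })
)["RuleArn"]

# Allow the rule to invoke the Lambda function, then register the function as a target
lambda_client.add_permission(
    FunctionName="debugger-rule-stop-training-job",
    StatementId="debugger-cw-event-rule-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn
)
function_arn = lambda_client.get_function(
    FunctionName="debugger-rule-stop-training-job"
)["Configuration"]["FunctionArn"]
events.put_targets(
    Rule="debugger-cw-event-rule",
    Targets=[{"Id": "1", "Arn": function_arn}]
)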

Run Example Notebooks to Test Automated Training Job Termination

You can run the following example notebooks, which are prepared for experimenting with stopping a
training job using Debugger's built-in rules.

• Amazon SageMaker Debugger - Reacting to CloudWatch Events from Rules

This example notebook runs a training job that has a vanishing gradient issue. The Debugger
VanishingGradient (p. 1769) built-in rule is used while constructing the SageMaker TensorFlow
estimator. When the Debugger rule detects the issue, the training job is terminated.


• Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule

This example notebook runs a training script with a code line that forces it to sleep for 10 minutes. The
Debugger StalledTrainingRule (p. 1781) built-in rule detects the stalled training issue and stops the training job.

Disable the CloudWatch Events Rule to Stop Using the Automated Training Job Termination

If you want to disable the automated training job termination, you need to disable the CloudWatch
Events rule. In the Lambda Designer panel, choose the EventBridge (CloudWatch Events) block linked
to the Lambda function. This shows an EventBridge panel below the Designer panel (for example, see
the previous screenshot). Select the check box next to EventBridge (CloudWatch Events): debugger-
cw-event-rule, and then choose Disable. If you want to use the automated termination functionality
later, you can enable the CloudWatch Events rule again.
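
You can also toggle the rule programmatically. The following one-line sketch assumes the rule name used earlier in this guide.

import boto3

events = boto3.client("events")

# Temporarily stop the automated training job termination
events.disable_rule(Name="debugger-cw-event-rule")

# Re-enable it later to restore the automation
# events.enable_rule(Name="debugger-cw-event-rule")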

Visualize Amazon SageMaker Debugger Output Tensors in TensorBoard

Important
This page is deprecated in favor of Amazon SageMaker with TensorBoard, which provides a
comprehensive TensorBoard experience integrated with SageMaker Training and the access
control functionalities of SageMaker Domain. To learn more, see Use TensorBoard to Debug and
Analyze Training Jobs in Amazon SageMaker (p. 2146).

Use SageMaker Debugger to create output tensor files that are compatible with TensorBoard. Load the
files to visualize in TensorBoard and analyze your SageMaker training jobs. Debugger automatically
generates output tensor files that are compatible with TensorBoard. For any hook configuration
you customize for saving output tensors, Debugger has the flexibility to create scalar summaries,
distributions, and histograms that you can import to TensorBoard.

You can enable this by passing DebuggerHookConfig and TensorBoardOutputConfig objects to an
estimator.

The following procedure explains how to save scalars, weights, and biases as full tensors, histograms, and
distributions that can be visualized with TensorBoard. Debugger saves them to the training container's
local path (the default path is /opt/ml/output/tensors) and syncs to the Amazon S3 locations
passed through the Debugger output configuration objects.


To save TensorBoard compatible output tensor files using Debugger

1. Set up a tensorboard_output_config configuration object to save TensorBoard output using
the Debugger TensorBoardOutputConfig class. For the s3_output_path parameter, specify the
default S3 bucket of the current SageMaker session or a preferred S3 bucket. This example does not
add the container_local_output_path parameter; instead, it is set to the default local path
/opt/ml/output/tensors.

import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig

bucket = sagemaker.Session().default_bucket()
tensorboard_output_config = TensorBoardOutputConfig(
s3_output_path='s3://{}'.format(bucket)
)

For additional information, see the Debugger TensorBoardOutputConfig API in the Amazon
SageMaker Python SDK.
2. Configure the Debugger hook and customize the hook parameter values. For example, the
following code configures a Debugger hook to save all scalar outputs every 100 steps in training
phases and 10 steps in validation phases, the weights parameters every 500 steps (the default
save_interval value for saving tensor collections is 500), and the bias parameters every 10
global steps until the global step reaches 500.

from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

hook_config = DebuggerHookConfig(
hook_parameters={
"train.save_interval": "100",
"eval.save_interval": "10"
},
collection_configs=[
CollectionConfig("weights"),
CollectionConfig(
name="biases",
parameters={
"save_interval": "10",
"end_step": "500",
"save_histogram": "True"
}
),
]
)

For more information about the Debugger configuration APIs, see the Debugger
CollectionConfig and DebuggerHookConfig APIs in the Amazon SageMaker Python SDK.
3. Construct a SageMaker estimator with the Debugger parameters passing the configuration objects.
The following example template shows how to create a generic SageMaker estimator. You can
replace estimator and Estimator with other SageMaker frameworks' estimator parent classes
and estimator classes. Available SageMaker framework estimators for this functionality are
TensorFlow, PyTorch, and MXNet.

from sagemaker.estimator import Estimator

estimator = Estimator(
...
# Debugger parameters
debugger_hook_config=hook_config,
tensorboard_output_config=tensorboard_output_config


)
estimator.fit()

The estimator.fit() method starts a training job, and Debugger writes the output tensor files
in real time to the Debugger S3 output path and to the TensorBoard S3 output path. To retrieve the
output paths, use the following estimator methods:

• For the Debugger S3 output path, use estimator.latest_job_debugger_artifacts_path().
• For the TensorBoard S3 output path, use estimator.latest_job_tensorboard_artifacts_path().
4. After the training has completed, check the names of saved output tensors:

from smdebug.trials import create_trial


trial = create_trial(estimator.latest_job_debugger_artifacts_path())
trial.tensor_names()

5. Check the TensorBoard output data in Amazon S3:

tensorboard_output_path=estimator.latest_job_tensorboard_artifacts_path()
print(tensorboard_output_path)
!aws s3 ls {tensorboard_output_path}/

6. Download the TensorBoard output data to your notebook instance. For example, the following AWS
CLI command downloads the TensorBoard files to /logs/fit under the current working directory
of your notebook instance.

!aws s3 cp --recursive {tensorboard_output_path} ./logs/fit

7. Compress the file directory to a TAR file to download to your local machine.

!tar -cf logs.tar logs

8. Download and extract the TensorBoard TAR file to a directory on your device, launch a Jupyter
notebook server, open a new notebook, and run the TensorBoard app.

!tar -xf logs.tar


%load_ext tensorboard
%tensorboard --logdir logs/fit

Profile Training Jobs Using Amazon SageMaker Debugger

To profile compute resource utilization and framework operations of your training job, use profiling tools
offered by Amazon SageMaker Debugger.

For any training job you run in SageMaker using the SageMaker Python SDK, Debugger starts profiling
basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization,
network, and I/O wait time. It collects these resource utilization metrics every 500 milliseconds. To see
the graphs of the resource utilization metrics of your training job, simply use the SageMaker Debugger UI
in SageMaker Studio Experiments.

Deep learning operations and steps might operate in intervals of milliseconds. Compared to Amazon
CloudWatch metrics, which collect metrics at intervals of 1 second, Debugger provides finer granularity
into the resource utilization metrics, down to 100-millisecond (0.1 second) intervals, so you can dive deep
into the metrics at the level of an operation or a step.

If you want to change the metric collection time interval, you need to add parameters for profiling
to your training job launcher. If you're using SageMaker Python SDK, you need to pass the
profiler_config parameter when you create an estimator. To learn how to adjust the resource
utilization metric collection interval, see the section called “Construct a SageMaker Estimator with
SageMaker Debugger” (p. 1711) and then the section called “Configure Debugger for Monitoring
Resource Utilization” (p. 1714).

Additionally, you can add profiling analysis tools called built-in profiling rules provided by SageMaker
Debugger. The built-in profiling rules run analysis against the resource utilization metrics and detect
computational performance issues. For more information, see the section called “Configure Built-in
Profiler Rules” (p. 1719). You can receive rule analysis results through the SageMaker Debugger UI in
SageMaker Studio Experiments or the SageMaker Debugger Profiling Report. You can also create custom
profiling rules using the SageMaker Python SDK.
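
For reference, a custom profiling rule is registered much like a built-in one; the following is a rough, hypothetical sketch only, so verify the exact ProfilerRule.custom parameters against the SageMaker Python SDK documentation. The image URI, source path, and class name below are placeholders.

from sagemaker.debugger import ProfilerRule

# Hypothetical sketch of registering a custom profiler rule;
# all values below are placeholders
custom_rule = ProfilerRule.custom(
    name="MyCustomProfilerRule",
    image_uri="<custom-rule-evaluator-image-uri>",
    instance_type="ml.t3.medium",
    volume_size_in_gb=10,
    source="path/to/my_custom_rule.py",
    rule_to_invoke="MyCustomRule"
)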

Use the following topics to learn more about profiling functionalities provided by SageMaker Debugger.

Topics
• Configure Debugger Using Amazon SageMaker Python SDK (p. 1710)
• Configure Built-in Profiling Rules Managed by Amazon SageMaker Debugger (p. 1719)
• Amazon SageMaker Debugger UI in Amazon SageMaker Studio Experiments (p. 1721)
• SageMaker Debugger Interactive Report (p. 1729)
• Analyze Data Using the SMDebug Client Library (p. 1740)

Configure Debugger Using Amazon SageMaker Python SDK


By default, SageMaker Debugger monitors resource utilization metrics, such as CPU utilization, GPU
utilization, GPU memory utilization, Network, and I/O wait time, of all SageMaker training jobs
submitted using the SageMaker Python SDK. SageMaker Debugger collects these resource utilization
metrics every 500 milliseconds. You don't need to make any additional changes in your code, training
script, or job launcher for basic resource utilization tracking purposes. If you want to check the
visualization of the resource utilization metrics of your training job in SageMaker Studio, you can jump
onto the Amazon SageMaker Debugger UI in Amazon SageMaker Studio Experiments (p. 1721).

If you want to change settings for profiling, you can specify Debugger-specific parameters while
creating a SageMaker training job launcher using SageMaker Python SDK, AWS SDK for Python (Boto3),
or AWS Command Line Interface (CLI). In this guide, we focus on how to change profiling options using
the Amazon SageMaker Python SDK. There are two parameters in the SageMaker estimator classes:
profiler_config for changing the profiler settings, and rules for activating additional analysis tools.
Important
To use the latest SageMaker Debugger features, you need to upgrade the SageMaker Python
SDK and the SMDebug client library. In your iPython kernel, Jupyter Notebook, or JupyterLab
environment, run the following code to install the latest versions of the libraries and restart the
kernel.

import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)


Construct a SageMaker Estimator with SageMaker Debugger


The following code samples are the templates you can start with. When you construct a SageMaker
estimator, add the profiler_config and rules parameters. In the subtopic sections of this page, you
can find more information about how to configure each parameter.
Note
The following example codes are not directly executable. You need to proceed to the next
sections to learn more about how to configure the parameters.

PyTorch

# An example of constructing a SageMaker PyTorch estimator


import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=PyTorch(
entry_point="directory/to/your_training_script.py",
role=sagemaker.get_execution_role(),
base_job_name="debugger-profiling-demo",
instance_count=1,
instance_type="ml.p3.2xlarge",
framework_version="1.12.0",
py_version="py37",

# SageMaker Debugger parameters


profiler_config=profiler_config,
rules=rules
)

estimator.fit(wait=False)

TensorFlow

# An example of constructing a SageMaker TensorFlow estimator


import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
entry_point="directory/to/your_training_script.py",
role=sagemaker.get_execution_role(),
base_job_name="debugger-profiling-demo",
instance_count=1,


instance_type="ml.p3.2xlarge",
framework_version="2.8.0",
py_version="py37",

# SageMaker Debugger parameters


profiler_config=profiler_config,
rules=rules
)

estimator.fit(wait=False)

MXNet

# An example of constructing a SageMaker MXNet estimator


import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=MXNet(
entry_point="directory/to/your_training_script.py",
role=sagemaker.get_execution_role(),
base_job_name="debugger-profiling-demo",
instance_count=1,
instance_type="ml.p3.2xlarge",
framework_version="1.7.0",
py_version="py37",

# SageMaker Debugger parameters


profiler_config=profiler_config,
rules=rules
)

estimator.fit(wait=False)

Note
For MXNet, when configuring the profiler_config parameter, you can only configure for
system monitoring. Profiling framework metrics is not supported for MXNet.
XGBoost

# An example of constructing a SageMaker XGBoost estimator


import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=XGBoost(
entry_point="directory/to/your_training_script.py",
role=sagemaker.get_execution_role(),
base_job_name="debugger-profiling-demo",
instance_count=1,
instance_type="ml.p3.2xlarge",
framework_version="1.5-1",

# Debugger-specific parameters


profiler_config=profiler_config,
rules=rules
)

estimator.fit(wait=False)

Note
For XGBoost, when configuring the profiler_config parameter, you can only configure
for system monitoring. Profiling framework metrics is not supported for XGBoost.
Generic estimator

# An example of constructing a SageMaker generic estimator
# using the XGBoost algorithm base image

import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import ProfilerConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
role=sagemaker.get_execution_role(),
image_uri=xgboost_container,
base_job_name="debugger-demo",
instance_count=1,
instance_type="ml.m5.2xlarge",

# Debugger-specific parameters
profiler_config=profiler_config,
rules=rules
)

estimator.fit(wait=False)

The following provides brief descriptions of the parameters.

• profiler_config – Configure Debugger to collect system metrics and framework metrics from your
training job and save them to your secured S3 bucket URI or local machine. You can set how frequently
or sparsely to collect the system metrics. To learn how to configure the profiler_config parameter,
see Configure Debugger for Monitoring Resource Utilization (p. 1714) and Configure Debugger for
Framework Profiling (p. 1714).
• rules – Configure this parameter to activate SageMaker Debugger built-in rules that you want to run
in parallel. Make sure that your training job has access to this S3 bucket. The rules run on processing
containers and automatically analyze your training job to find computational and operational
performance issues. The ProfilerReport rule is the most integrated rule that runs all built-in profiling
rules and saves the profiling results as a report into your secured S3 bucket. To learn how to configure
the rules parameter, see Configure Debugger Built-in Rules (p. 1678).

Note
Debugger securely saves output data in subfolders of your default S3 bucket. For example,
the format of the default S3 bucket URI is
s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/.
There are three subfolders created by Debugger: debug-output, profiler-output, and
rule-output. You can also retrieve the default S3 bucket URIs using the SageMaker estimator
classmethods (p. 1684).

See the following topics to find out how to configure the Debugger-specific parameters in detail.

Topics
• Configure Debugger for Monitoring Resource Utilization (p. 1714)
• Configure Debugger for Framework Profiling (p. 1714)
• Updating Debugger System Monitoring and Framework Profiling Configuration while a Training Job
is Running (p. 1718)
• Turn Off Debugger (p. 1718)

Configure Debugger for Monitoring Resource Utilization


To adjust Debugger system monitoring time intervals, use the ProfilerConfig API operation to create
a parameter object while constructing a SageMaker framework or generic estimator depending on your
preference.
Note
By default, for all SageMaker training jobs, Debugger collects resource utilization metrics from
Amazon EC2 instances every 500 milliseconds for system monitoring, without any Debugger-
specific parameters specified in SageMaker estimators.
Debugger saves the system metrics in a default S3 bucket. The format of the default S3 bucket
URI is s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/
profiler-output/.

The following code example shows how to set up the profiler_config parameter with a system
monitoring time interval of 1000 milliseconds.

from sagemaker.debugger import ProfilerConfig

profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000
)

• system_monitor_interval_millis (int) – Specify the monitoring intervals in milliseconds to
record system metrics. Available values are 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and
60000 (1 minute) milliseconds. The default value is 500 milliseconds.

To see the progress of system monitoring, see Open the Amazon SageMaker Debugger Insights
Dashboard (p. 1721).

Configure Debugger for Framework Profiling


To enable Debugger framework profiling, configure the framework_profile_params parameter when
you construct an estimator. Debugger framework profiling collects framework metrics, such as data from
the initialization stage, data loader processes, and Python operators of deep learning frameworks and
training scripts, as well as detailed profiling within and between steps, with the cProfile or Pyinstrument
options. Using the FrameworkProfile class, you can configure custom framework profiling options.
Warning
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11
and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and
SDKs as follows.

1714
Amazon SageMaker Developer Guide
Profile Training Jobs

• SageMaker Python SDK <= v2.130.0
• PyTorch >= v1.6.0, < v2.0
• TensorFlow >= v2.3.1, < v2.11

See also Amazon SageMaker Debugger Release Notes: March 16, 2023 (p. 1820).
Note
Before getting started with Debugger framework profiling, verify that the framework used to
build your model is supported by Debugger for framework profiling. For more information, see
Supported Frameworks and Algorithms (p. 1650).
Debugger saves the framework metrics in a default S3 bucket. The format of the default S3
bucket URI is s3://sagemaker-<region>-<12digit_account_id>/<training-job-
name>/profiler-output/.

Start a Training Job with the Default System Monitoring and Framework Profiling

The following example code is the simplest profiler_config parameter setting to start the default
system monitoring and the default framework profiling. The FrameworkProfile class in the following
example code initiates the default framework profiling when a training job starts. Debugger framework
profiling includes the following options: detailed profiling, data loader profiling, and Python profiling.

from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
framework_profile_params=FrameworkProfile()
)

With this profiler_config parameter configuration, Debugger calls the default settings of monitoring
and profiling. Debugger monitors system metrics every 500 milliseconds; profiles the fifth step with the
detailed profiling option; the seventh step with the data loader profiling option; and the ninth, tenth,
and eleventh steps with the Python profiling option.

To find available profiling configuration options, the default parameter settings, and examples of how to
configure them, see Start a Training Job with the Default System Monitoring and Customized Framework
Profiling with Different Profiling Options (p. 1717) and SageMaker Debugger APIs – FrameworkProfile in
the Amazon SageMaker Python SDK.

If you want to change the system monitoring interval and enable the default framework profiling,
you can specify the system_monitor_interval_millis parameter explicitly with the
framework_profile_params parameter. For example, to monitor every 1000 milliseconds and enable
the default framework profiling, use the following example code.

from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000,
framework_profile_params=FrameworkProfile()
)

For more information about the FrameworkProfile class, see SageMaker Debugger APIs –
FrameworkProfile in the Amazon SageMaker Python SDK.

Start a Training Job with the Default System Monitoring and Customized Framework Profiling for Target Steps or a Target Time Range

If you want to specify target steps or target time intervals to profile your training job, you need to
specify parameters for the FrameworkProfile class. The following code examples show how to specify
the target ranges for profiling along with system monitoring.

1715
Amazon SageMaker Developer Guide
Profile Training Jobs

• For a target step range

With the following example configuration, Debugger monitors the entire training job every 500
milliseconds (the default monitoring) and profiles a target step range from step 5 to step 15 (for 10
steps).

from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
)

With the following example configuration, Debugger monitors the entire training job every 1000
milliseconds and profiles a target step range from step 5 to step 15 (for 10 steps).

from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000,
framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
)

• For a target time range

With the following example configuration, Debugger monitors the entire training job every 500
milliseconds (the default monitoring) and profiles a target time range from the current Unix time for
600 seconds.

import time
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()),
duration=600)
)

With the following example configuration, Debugger monitors the entire training job every 1000
milliseconds and profiles a target time range from the current Unix time for 600 seconds.

import time
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
system_monitor_interval_millis=1000,
framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()),
duration=600)
)

The framework profiling is performed for all of the profiling options at the target step or time range.

To find more information about available profiling options, see SageMaker Debugger APIs –
FrameworkProfile in the Amazon SageMaker Python SDK.

The next section shows you how to script the available profiling options.

1716
Amazon SageMaker Developer Guide
Profile Training Jobs

Start a Training Job with the Default System Monitoring and Customized Framework Profiling with Different Profiling Options

You can use the following profiling configuration classes to manage the framework profiling options:

• DetailedProfilingConfig – Specify a target step or time range to profile framework operations using
the native framework profilers (TensorFlow profiler and PyTorch profiler). For example, if using
TensorFlow, the Debugger hooks enable the TensorFlow profiler to collect TensorFlow-specific
framework metrics. Detailed profiling enables you to profile all framework operators at a pre-step
(before the first step), within steps, and between steps of a training job.
Note
Detailed profiling might significantly increase GPU memory consumption. We do not
recommend enabling detailed profiling for more than a couple of steps.
• DataloaderProfilingConfig – Specify a target step or time range to profile deep learning framework
data loader processes. Debugger collects every data loader event of the frameworks.
Note
Data loader profiling might lower the training performance while collecting information from
data loaders. We don't recommend enabling data loader profiling for more than a couple of
steps.
Debugger is preconfigured to annotate data loader processes only for the AWS deep learning
containers. Debugger cannot profile data loader processes from any other custom or external
training containers.
• PythonProfilingConfig – Specify a target step or time range to profile Python functions. You can also
choose between two Python profilers: cProfile and Pyinstrument.
• cProfile – The standard Python profiler. cProfile collects information for every Python operator called during training. With cProfile, Debugger saves the cumulative time and annotation for each function call, providing complete detail about Python functions. In deep learning, for example, the most frequently called functions might be the convolutional filters and backward pass operators, and cProfile profiles every single one of them. For the cProfile option, you can further select a timer option: total time, CPU time, and off-CPU time. While you can profile every function call executing on processors (both CPU and GPU) in CPU time, you can also identify I/O or network bottlenecks with the off-CPU time option. The default is total time, and Debugger profiles both CPU and off-CPU time. With cProfile, you can drill down to every single function when analyzing the profile data.
• Pyinstrument – Pyinstrument is a low-overhead Python profiler that works based on sampling.
With the Pyinstrument option, Debugger samples profiling events every millisecond. Because
Pyinstrument measures elapsed wall-clock time instead of CPU time, the Pyinstrument option
can be a better choice over the cProfile option for reducing profiling noise (filtering out irrelevant
function calls that are cumulatively fast) and capturing operators that are actually compute
intensive (cumulatively slow) for training your model. With Pyinstrument, you are able to see a tree
of function calls and better understand the structure and root cause of the slowness.
Note
Enabling Python profiling might slow down the overall training time. cProfile profiles the
most frequently called Python operators at every call, so the processing time on profiling
increases with respect to the number of calls. For Pyinstrument, the cumulative profiling time
increases with respect to time because of its sampling mechanism.

The following example configuration shows the full structure when you use the different profiling
options with specified values.

import time
from sagemaker.debugger import (ProfilerConfig,
                                FrameworkProfile,
                                DetailedProfilingConfig,
                                DataloaderProfilingConfig,
                                PythonProfilingConfig,
                                PythonProfiler,
                                cProfileTimer)

profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=5,
            num_steps=1
        ),
        dataloader_profiling_config=DataloaderProfilingConfig(
            start_step=7,
            num_steps=1
        ),
        python_profiling_config=PythonProfilingConfig(
            start_step=9,
            num_steps=1,
            python_profiler=PythonProfiler.CPROFILE,
            cprofile_timer=cProfileTimer.TOTAL_TIME
        )
    )
)

For more information about available profiling options, see DetailedProfilingConfig, DataloaderProfilingConfig, and PythonProfilingConfig in the Amazon SageMaker Python SDK.
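If you prefer the sampling-based Pyinstrument profiler described earlier over cProfile, you can select it through the python_profiler parameter. The following is a minimal sketch of that swap; the step values are illustrative, and the cprofile_timer argument does not apply to Pyinstrument.

from sagemaker.debugger import (ProfilerConfig,
                                FrameworkProfile,
                                PythonProfilingConfig,
                                PythonProfiler)

profiler_config=ProfilerConfig(
    framework_profile_params=FrameworkProfile(
        python_profiling_config=PythonProfilingConfig(
            start_step=9,
            num_steps=1,
            python_profiler=PythonProfiler.PYINSTRUMENT
        )
    )
)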

Updating Debugger System Monitoring and Framework Profiling Configuration While a Training Job is Running
If you want to activate or update the Debugger monitoring configuration for a training job that is
currently running, use the following SageMaker estimator extension methods:

• To activate Debugger system monitoring for a running training job and receive a Debugger profiling
report, use the following:

estimator.enable_default_profiling()

When you use the enable_default_profiling method, Debugger initiates the default system monitoring and the ProfilerReport built-in rule, which generates a comprehensive profiling report at the end of the training job. You can call this method only if the current training job is running without Debugger monitoring and profiling.

For more information, see estimator.enable_default_profiling in the Amazon SageMaker Python SDK.
• To update system monitoring configuration, use the following:

estimator.update_profiler(
system_monitor_interval_millis=500
)

For more information, see estimator.update_profiler in the Amazon SageMaker Python SDK.
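Both methods operate on the estimator object for the running training job. If you started the job in a different session, you can re-attach to it first. The following is a minimal sketch with a hypothetical job name.

from sagemaker.estimator import Estimator

# Re-attach to a training job that is currently running
# (the job name below is hypothetical).
estimator = Estimator.attach("your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS")

# Start the default system monitoring and the ProfilerReport rule.
estimator.enable_default_profiling()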

Turn Off Debugger
If you want to completely turn off Debugger, do one of the following:

• Before starting a training job, do the following:

To turn off profiling, include the disable_profiler parameter in your estimator and set it to True.


Warning
If you disable it, you won't be able to view the comprehensive Studio Debugger insights
dashboard and the autogenerated profiling report.

To turn off debugging, set the debugger_hook_config parameter to False.


Warning
If you disable it, you won't be able to collect output tensors and cannot debug your model
parameters.

estimator=Estimator(
    ...
    disable_profiler=True,
    debugger_hook_config=False
)

For more information about the Debugger-specific parameters, see SageMaker Estimator in the
Amazon SageMaker Python SDK.
• While a training job is running, do the following:

To disable both monitoring and profiling while your training job is running, use the following estimator method:

estimator.disable_profiling()

To disable framework profiling only and keep system monitoring, use the update_profiler method:

estimator.update_profiler(disable_framework_metrics=True)

For more information about the estimator extension methods, see the estimator.disable_profiling and estimator.update_profiler methods in the Amazon SageMaker Python SDK documentation.

Configure Built-in Profiling Rules Managed by Amazon SageMaker Debugger
The Amazon SageMaker Debugger built-in profiling rules analyze system metrics and framework
operations collected during the training of a model. Debugger offers the ProfilerRule API operation
that helps configure the rules to monitor training compute resources and operations and to detect
anomalies. For example, the profiling rules can help you detect whether there are computational
problems such as CPU bottlenecks, excessive I/O wait time, imbalanced workload across GPU workers,
and compute resource underutilization. To see a full list of available built-in profiling rules, see List of
Debugger Built-in Rules (p. 1748).
Note
The built-in rules are provided through Amazon SageMaker processing containers and fully
managed by SageMaker Debugger at no additional cost. For more information about billing, see
the Amazon SageMaker Pricing page.

In the following topics, learn how to use the Debugger built-in rules.

Topics
• Use SageMaker Debugger Built-in Profiler Rules with the Default Parameter Settings (p. 1720)
• Use Debugger Built-in Profiler Rules with Custom Parameter Values (p. 1720)


Use SageMaker Debugger Built-in Profiler Rules with the Default Parameter
Settings
To add SageMaker Debugger built-in rules in your estimator, you need to configure a rules list object.
The following example code shows the basic structure of listing the SageMaker Debugger built-in rules.

from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_n()),
    ... # You can also append more debugging rules in the
        # Rule.sagemaker(rule_configs.*()) format.
]

estimator=Estimator(
    ...
    rules=rules
)

For a complete list of available built-in rules, see List of Debugger Built-in Rules (p. 1748).

To use the profiling rules and inspect the computational performance and progress of your training job, add the ProfilerReport rule of SageMaker Debugger. This rule activates all built-in rules under the Debugger ProfilerRule family. Furthermore, this rule generates an aggregated profiling report. For more information, see Profiling Report Generated Using SageMaker Debugger. You can use the following code to add the profiling report rule to your training estimator.

from sagemaker.debugger import ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]

When you start the training job with the ProfilerReport rule, Debugger collects resource utilization data every 500 milliseconds. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. If the rules detect training anomalies, the rule evaluation status changes to IssueFound. You can set up automated actions, such as sending notifications about training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see Action on Amazon SageMaker Debugger Rules (p. 1698).

Use Debugger Built-in Profiler Rules with Custom Parameter Values


If you want to adjust the built-in rule parameter values and customize the tensor collection regex, configure the base_config and rule_parameters parameters for the ProfilerRule.sagemaker and Rule.sagemaker class methods. For the Rule.sagemaker class method, you can also customize tensor collections through the collections_to_save parameter. For instructions on how to use the CollectionConfig class, see Configure Tensor Collections Using the CollectionConfig API (p. 1673).

Use the following configuration template for built-in rules to customize parameter values. By changing
the rule parameters as you want, you can adjust the sensitivity of the rules to be initiated.

• The base_config argument is where you call the built-in rule methods.
• The rule_parameters argument is to adjust the default key values of the built-in rules listed in List
of Debugger Built-in Rules (p. 1748).


For more information about the Debugger rule class, methods, and parameters, see SageMaker
Debugger Rule class in the Amazon SageMaker Python SDK.

from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInProfilerRuleName(),
        rule_parameters={
            "key": "value"
        }
    )
]

The parameter descriptions and value customization examples are provided for each rule at List of
Debugger Built-in Rules (p. 1748).
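For example, the following is a minimal sketch that adjusts the thresholds of the CPUBottleneck rule, which is described later in this guide; the values are illustrative (the defaults are 90 percent for cpu_threshold and 10 percent for gpu_threshold), and rule parameter values are passed as strings.

from sagemaker.debugger import ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.CPUBottleneck(),
        rule_parameters={
            "cpu_threshold": "80",  # flag CPU utilization above 80 percent
            "gpu_threshold": "20"   # while GPU utilization is below 20 percent
        }
    )
]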

For a low-level JSON configuration of the Debugger built-in rules using the CreateTrainingJob API,
see Configure Debugger Using Amazon SageMaker API (p. 1799).

Amazon SageMaker Debugger UI in Amazon SageMaker Studio Experiments
Use the Amazon SageMaker Debugger Insights dashboard in Amazon SageMaker Studio Experiments
to analyze your model performance and system bottlenecks while running training jobs on Amazon
Elastic Compute Cloud (Amazon EC2) instances. Gain insights into your training jobs and improve your
model training performance and accuracy with the Debugger dashboards. By default, Debugger monitors
system metrics (CPU, GPU, GPU memory, network, and data I/O) every 500 milliseconds and basic
output tensors (loss and accuracy) every 500 iterations for training jobs. You can also further customize
Debugger configuration parameter values and adjust the saving intervals through the Studio UI or using
the Amazon SageMaker Python SDK.
Important
If you're using an existing Studio app, delete the app and restart to use the latest Studio
features. For instructions on how to restart and update your Studio environment, see Update
Amazon SageMaker Studio.

Topics
• Open the Amazon SageMaker Debugger Insights Dashboard (p. 1721)
• Amazon SageMaker Debugger Insights Dashboard Controller (p. 1722)
• Amazon SageMaker Debugger Insights Dashboard (p. 1724)
• Shut Down the Amazon SageMaker Debugger Insights Instance (p. 1728)

Open the Amazon SageMaker Debugger Insights Dashboard


In the SageMaker Debugger Insights dashboard in Studio, you can see the compute resource utilization and system bottleneck information of your training job that runs on Amazon EC2 instances, in real time and after training.
Note
The SageMaker Debugger Insights dashboard runs a Studio application on an ml.m5.4xlarge
instance to process and render the visualizations. Each SageMaker Debugger Insights tab
runs one Studio kernel session. Multiple kernel sessions for multiple SageMaker Debugger
Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the
corresponding kernel session is also closed. The Studio application remains active and accrues


charges for the ml.m5.4xlarge instance usage. For information about pricing, see the Amazon
SageMaker Pricing page.
Important
When you are done using the SageMaker Debugger Insights dashboard, you must shut down the
ml.m5.4xlarge instance to avoid accruing charges. For instructions on how to shut down the
instance, see Shut Down the Amazon SageMaker Debugger Insights Instance (p. 1728).

To open the SageMaker Debugger Insights dashboard

1. On the Studio Home page, choose Experiments in the left navigation pane.
2. Search your training job in the Experiments page. If your training job is set up with an Experiments
run, the job should appear in the Experiments tab; if you didn't set up an Experiments run, the job
should appear in the Unassigned runs tab.
3. Choose the link of the training job name to see the job details.
4. Under the OVERVIEW menu, choose Debugger. This should show the following two sections.

• In the Debugger rules section, you can browse the status of the Debugger built-in rules associated
with the training job.
• In the Debugger insights section, you can find links to open SageMaker Debugger Insights on the
dashboard.
5. In the SageMaker Debugger Insights section, choose the link of the training job name to open the
SageMaker Debugger Insights dashboard. This opens a Debug [your-training-job-name] window. In
this window, Debugger provides an overview of the computational performance of your training job
on Amazon EC2 instances and helps you identify issues in compute resource utilization.

You can also download an aggregated profiling report by adding the built-in ProfilerReport rule of
SageMaker Debugger. For more information, see Configure Built-in Profiler Rules and Profiling Report
Generated Using SageMaker Debugger.

Amazon SageMaker Debugger Insights Dashboard Controller


There are different components of the Debugger controller for monitoring and profiling. In this guide,
you learn about the Debugger controller components.
Note
The SageMaker Debugger Insights dashboard runs a Studio app on an ml.m5.4xlarge instance
to process and render the visualizations. Each SageMaker Debugger Insights tab runs one Studio
kernel session. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on
the single instance. When you close a SageMaker Debugger Insights tab, the corresponding
kernel session is also closed. The Studio app remains active and accrues charges for the
ml.m5.4xlarge instance usage. For information about pricing, see the Amazon SageMaker
Pricing page.
Important
When you are done using the SageMaker Debugger Insights dashboard, shut down the
ml.m5.4xlarge instance to avoid accruing charges. For instructions on how to shut down the
instance, see Shut Down the Amazon SageMaker Debugger Insights Instance (p. 1728).

SageMaker Debugger Insights Controller UI

Using the Debugger controller located at the upper-left corner of the Insights dashboard, you can refresh
the dashboard, configure or update Debugger settings for monitoring system metrics, stop a training job,
and download a Debugger profiling report.


• If you want to manually refresh the dashboard, choose the refresh button (the round arrow at the upper-left corner).
• The Monitoring toggle button is on by default for any SageMaker training job initiated using the
SageMaker Python SDK. If not activated, you can use the toggle button to start monitoring. During
monitoring, Debugger only collects resource utilization metrics to detect computational problems
such as CPU bottlenecks and GPU underutilization. For a complete list of resource utilization problems
that Debugger monitors, see Debugger built-in rules for profiling hardware system resource utilization
(system metrics) (p. 1749).
• The Configure monitoring button opens a pop-up window that you can use to set or update the data
collection frequency and the S3 path to save the data.

You can specify values for the following fields.


• S3 bucket URI: Specify the base S3 bucket URI.
• Collect monitoring data every: Select a time interval to collect system metrics. You can choose one
of the monitoring intervals from the dropdown list. Available intervals are 100 milliseconds, 200
milliseconds, 500 milliseconds (default), 1 second, 5 seconds, and 1 minute.


Note
If you choose one of the lower time intervals, you increase the granularity of resource utilization metrics, so you can capture spikes and anomalies with a higher time resolution. However, the higher the resolution, the larger the volume of system metrics to process. This might introduce additional overhead and impact the overall training and processing time.
• Using the Stop training button, you can stop the training job when you find anomalies in resource
utilization.
• Using the Download report button, you can download an aggregated profiling report by using the
built-in ProfilerReport rule of SageMaker Debugger. The button is activated when you add the built-
in ProfilerReport rule to the estimator. For more information, see Configure Built-in Profiler Rules and
Profiling Report Generated Using SageMaker Debugger.

Amazon SageMaker Debugger Insights Dashboard


When you initiate a SageMaker training job, SageMaker Debugger starts monitoring the resource
utilization of the Amazon EC2 instances by default. You can track the system utilization rates, statistics
overview, and built-in rule analysis through the Insights dashboard. This guide walks you through the
content of the SageMaker Debugger Insights dashboard under the following tabs: System Metrics and
Rules.
Note
The SageMaker Debugger Insights dashboard runs a Studio application on an ml.m5.4xlarge
instance to process and render the visualizations. Each SageMaker Debugger Insights tab
runs one Studio kernel session. Multiple kernel sessions for multiple SageMaker Debugger
Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the
corresponding kernel session is also closed. The Studio application remains active and accrues
charges for the ml.m5.4xlarge instance usage. For information about pricing, see the Amazon
SageMaker Pricing page.
Important
When you are done using the SageMaker Debugger Insights dashboard, shut down the
ml.m5.4xlarge instance to avoid accruing charges. For instructions on how to shut down the
instance, see Shut Down the Amazon SageMaker Debugger Insights Instance (p. 1728).
Important
In the reports, plots and recommendations are provided for informational purposes and are not
definitive. You are responsible for making your own independent assessment of the information.

Topics
• System Metrics (p. 1724)
• Rules (p. 1727)

System Metrics

In the System Metrics tab, you can use the summary table and time series plots to understand resource utilization.

Resource utilization summary

This summary table shows the statistics of compute resource utilization metrics of all nodes (denoted
as algo-n). The resource utilization metrics include the total CPU utilization, the total GPU utilization,
the total CPU memory utilization, the total GPU memory utilization, the total I/O wait time, and the
total network in bytes. The table shows the minimum and the maximum values, and p99, p90, and p50
percentiles.


Resource utilization time series plots

Use the time series graphs to see more details of resource utilization and to identify at what time interval each instance shows an undesired utilization rate, such as low GPU utilization or CPU bottlenecks, which can waste an expensive instance.

The time series graph controller UI

The UI controller for adjusting the time series graphs provides the following elements.

• algo-1: Use this dropdown menu to choose the node that you want to look into.
• Zoom In: Use this button to zoom in the time series graphs and view shorter time intervals.
• Zoom Out: Use this button to zoom out the time series graphs and view wider time intervals.
• Pan Left: Move the time series graphs to an earlier time interval.
• Pan Right: Move the time series graphs to a later time interval.
• Fix Timeframe: Use this check box to fix or bring back the time series graphs to show the whole view
from the first data point to the last data point.

CPU utilization and I/O wait time

The first two graphs show CPU utilization and I/O wait time over time. By default, the graphs show the average CPU utilization rate and the I/O wait time spent on the CPU cores. You can select one or more CPU cores by selecting the labels to graph them on a single chart and compare utilization across cores. You can drag and zoom in and out to have a closer look at specific time intervals.


GPU utilization and GPU memory utilization

The following graphs show GPU utilization and GPU memory utilization over time. By default, the graphs
show the mean utilization rate over time. You can select the GPU core labels to see the utilization rate
of each core. Taking the mean of utilization rate over the total number of GPU cores shows the mean
utilization of the entire hardware system resource. By looking at the mean utilization rate, you can check
the overall system resource usage of an Amazon EC2 instance. The following figure shows an example
training job on an ml.p3.16xlarge instance with 8 GPU cores. You can monitor if the training job is
well distributed, fully utilizing all GPUs.


Overall system utilization over time

The following heatmap shows an example of the entire system utilization of an ml.p3.16xlarge
instance over time, projected onto the two-dimensional plot. Every CPU and GPU core is listed in the
vertical axis, and the utilization is recorded over time with a color scheme, where the bright colors
represent low utilization and the darker colors represent high utilization. See the labeled color bar on the
right side of the plot to find out which color level corresponds to which utilization rate.

Rules

Use the Rules tab to find a summary of the profiling rule analysis of your training job. If a profiling rule is activated with the training job, its name appears in solid white text. Inactive rules are dimmed in gray text. To activate these rules, follow the instructions in the section called “Configure Built-in Profiler Rules” (p. 1719).


Shut Down the Amazon SageMaker Debugger Insights Instance


When you are not using the SageMaker Debugger Insights dashboard, you should shut down the app
instance to avoid incurring additional fees.

To shut down the SageMaker Debugger Insights app instance in Studio


1. In Studio, select the Running Instances and Kernels icon.
2. Under the RUNNING APPS list, look for the sagemaker-debugger-1.0 app. Select the shutdown icon next to the app. The SageMaker Debugger Insights dashboards run on an ml.m5.4xlarge instance. This instance also disappears from RUNNING INSTANCES when you shut down the sagemaker-debugger-1.0 app.

SageMaker Debugger Interactive Report


Receive profiling reports autogenerated by Debugger. The Debugger report provides insights into your training jobs and suggests recommendations to improve your model performance. To learn more, see SageMaker Debugger Profiling Report (p. 1729).
Note
You can download a Debugger report while your training job is running or after the job has finished. During training, Debugger concurrently updates the report to reflect the current rules' evaluation status. You can download a complete Debugger report only after the training job has completed.
Important
In the reports, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

SageMaker Debugger Profiling Report


For any SageMaker training jobs, the SageMaker Debugger ProfilerReport (p. 1751) rule invokes all of
the monitoring and profiling rules (p. 1749) and aggregates the rule analysis into a comprehensive
report. Following this guide, download the report using the Amazon SageMaker Python SDK or the S3
console, and learn what you can interpret from the profiling results.


Important
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

Download the SageMaker Debugger Profiling Report


Download the SageMaker Debugger profiling report while your training job is running or after the job
has finished using the Amazon SageMaker Python SDK and AWS Command Line Interface (CLI).
Note
To get the profiling report generated by SageMaker Debugger, you must use the built-in
ProfilerReport rule offered by SageMaker Debugger. To activate the rule with your training job,
see Configure Built-in Profiler Rules.
Tip
You can also download the report with a single click in the SageMaker Studio Debugger insights
dashboard. This doesn't require any additional scripting to download the report. To find out
how to download the report from Studio, see Open the Amazon SageMaker Debugger Insights
Dashboard (p. 1721).

Download using SageMaker Python SDK and AWS CLI

1. Check the current job's default S3 output base URI.

estimator.output_path

2. Check the current job name.

estimator.latest_training_job.job_name

3. The Debugger profiling report is stored under <default-s3-output-base-uri>/<training-job-name>/rule-output. Configure the rule output path as follows:

rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"

4. To check if the report is generated, list directories and files recursively under the
rule_output_path using aws s3 ls with the --recursive option.

! aws s3 ls {rule_output_path} --recursive

This should return a complete list of files under an autogenerated folder named ProfilerReport-1234567890. The folder name is a combination of the string ProfilerReport and a unique 10-digit tag based on the Unix timestamp of when the ProfilerReport rule was initiated.

The profiler-report.html file is the autogenerated profiling report from Debugger. The remaining files are the built-in rule analysis components, stored in JSON, and a Jupyter notebook that is used to aggregate them into the report.
5. Download the files recursively using aws s3 cp. The following command saves all of the rule
output files to the ProfilerReport-1234567890 folder under the current working directory.


! aws s3 cp {rule_output_path} ./ --recursive

Tip
If using a Jupyter notebook server, run !pwd to double check the current working
directory.
6. Under the /ProfilerReport-1234567890/profiler-output directory, open profiler-
report.html. If using JupyterLab, choose Trust HTML to see the autogenerated Debugger
profiling report.

7. Open the profiler-report.ipynb file to explore how the report is generated. You can also
customize and extend the profiling report using the Jupyter notebook file.
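If you are working in a Jupyter environment, you can also render a clickable link to the downloaded report. The following is a minimal sketch, assuming the files were copied to the current working directory as in step 5; replace the folder name with the autogenerated name listed for your job.

from IPython.display import FileLink, display

# Render a link to the local copy of the report in the notebook output.
display(FileLink("ProfilerReport-1234567890/profiler-output/profiler-report.html"))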

Download using Amazon S3 Console

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://
console.aws.amazon.com/s3/.
2. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base
S3 bucket name should be in the following format: sagemaker-<region>-111122223333.
Look up the base S3 bucket through the Find bucket by name field.

3. In the base S3 bucket, look up the training job name by specifying your job name prefix into the
Find objects by prefix input field. Choose the training job name.


4. In the training job's S3 bucket, there must be three subfolders for training data collected by
Debugger: debug-output/, profiler-output/, and rule-output/. Choose rule-output/.

5. In the rule-output/ folder, choose ProfilerReport-1234567890, and then choose the profiler-output/ folder. The profiler-output/ folder contains profiler-report.html (the autogenerated profiling report in HTML), profiler-report.ipynb (a Jupyter notebook with scripts that are used for generating the report), and a profiler-report/ folder (which contains the rule analysis JSON files that are used as components of the report).
6. Select the profiler-report.html file, choose Actions, and Download.

7. Open the downloaded profiler-report.html file in a web browser.

Note
If you started your training job without configuring the Debugger-specific parameters, Debugger generates the report based only on the system monitoring rules because the Debugger parameters are not configured to save framework metrics. To enable framework metrics profiling and receive an extended Debugger profiling report, configure the profiler_config parameter when constructing or updating SageMaker estimators.
To learn how to configure the profiler_config parameter before starting a training job, see Configure Debugger for Framework Profiling (p. 1714).
To update the current training job and enable framework metrics profiling, see Update Debugger Framework Profiling Configuration (p. 1718).
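For example, the following is a minimal sketch of updating a running training job to start collecting framework metrics for a short step range; the step values are illustrative, and the estimator object is assumed to be attached to the running job.

from sagemaker.debugger import FrameworkProfile

estimator.update_profiler(
    framework_profile_params=FrameworkProfile(start_step=10, num_steps=5)
)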

Debugger Profiling Report Walkthrough

This section walks you through the Debugger profiling report section by section. The profiling report is
generated based on the built-in rules for monitoring and profiling. The report shows result plots only for
the rules that found issues.
Important
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

Topics
• Training Job Summary (p. 1733)
• System Usage Statistics (p. 1734)
• Framework metrics summary (p. 1735)
• Rules Summary (p. 1736)
• Analyzing the Training Loop – Step Durations (p. 1737)
• GPU Utilization Analysis (p. 1737)
• Batch Size (p. 1737)
• CPU Bottlenecks (p. 1738)
• I/O Bottlenecks (p. 1739)
• LoadBalancing in Multi-GPU Training (p. 1739)
• GPU Memory Analysis (p. 1739)

Training Job Summary

At the beginning of the report, Debugger provides a summary of your training job. In this section, you
can overview the time durations and timestamps at different training phases.


The summary table contains the following information:

• start_time – The exact time when the training job started.


• end_time – The exact time when the training job finished.
• job_duration_in_seconds – The total training time from the start_time to the end_time.
• training_loop_start – The exact time when the first step of the first epoch has started.
• training_loop_end – The exact time when the last step of the last epoch has finished.
• training_loop_duration_in_seconds – The total time between the training loop start time and the
training loop end time.
• initialization_in_seconds – Time spent on initializing the training job. The initialization phase covers
the period from the start_time to the training_loop_start time. The initialization time is spent on
compiling the training script, starting the training script, creating and initializing the model, initiating
EC2 instances, and downloading training data.
• finalization_in_seconds – Time spent on finalizing the training job, such as finishing the model
training, updating the model artifacts, and closing the EC2 instances. The finalization phase covers the
period from the training_loop_end time to the end_time.
• initialization (%) – The percentage of time spent on initialization over the total
job_duration_in_seconds.
• training loop (%) – The percentage of time spent on training loop over the total
job_duration_in_seconds.
• finalization (%) – The percentage of time spent on finalization over the total
job_duration_in_seconds.
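The three percentage rows are simple ratios of the duration fields above. The following is a minimal sketch of the arithmetic, with illustrative values.

# Illustrative values, in seconds.
job_duration_in_seconds = 1000
initialization_in_seconds = 150
training_loop_duration_in_seconds = 800
finalization_in_seconds = 50

initialization_pct = 100 * initialization_in_seconds / job_duration_in_seconds          # 15.0
training_loop_pct = 100 * training_loop_duration_in_seconds / job_duration_in_seconds   # 80.0
finalization_pct = 100 * finalization_in_seconds / job_duration_in_seconds              # 5.0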

System Usage Statistics

In this section, you can see an overview of system utilization statistics.

The Debugger profiling report includes the following information:

• node – Lists the name of nodes. If using distributed training on multiple nodes (multiple EC2 instances), the node names are in the format algo-n.


• metric – The system metrics collected by Debugger: CPU, GPU, CPU memory, GPU memory, I/O, and
Network metrics.
• unit – The unit of the system metrics.
• max – The maximum value of each system metric.
• p99 – The 99th percentile of each system utilization.
• p95 – The 95th percentile of each system utilization.
• p50 – The 50th percentile (median) of each system utilization.
• min – The minimum value of each system metric.

Framework metrics summary

In this section, the following pie charts show the breakdown of framework operations on CPUs and
GPUs.

Each of the pie charts analyzes the collected framework metrics in various aspects as follows:

• Ratio between TRAIN/EVAL phase and others – Shows the ratio between time durations spent on
different training phases.
• Ratio between forward and backward pass – Shows the ratio between time durations spent on
forward and backward pass in the training loop.
• Ratio between CPU/GPU operators – Shows the ratio between time spent on operators running on
CPU or GPU, such as convolutional operators.
• General metrics recorded in framework – Shows the ratio between time spent on major framework
metrics, such as data loading, forward and backward pass.


Overview: CPU Operators

This section provides information about the CPU operators in detail. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called CPU operators.

Overview: GPU Operators

This section provides information about the GPU operators in detail. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called GPU operators.

Rules Summary

In this section, Debugger aggregates all of the rule evaluation results, analysis, rule descriptions, and
suggestions.


Analyzing the Training Loop – Step Durations

In this section, you can find detailed statistics of step durations on each GPU core of each node. Debugger evaluates the mean, maximum, p99, p95, p50, and minimum values of step durations, and evaluates step outliers. The following histogram shows the step durations captured on different worker nodes and GPUs. You can enable or disable the histogram of each worker by choosing the legends on the right side. You can check if there is a particular GPU that's causing step duration outliers.

GPU Utilization Analysis

This section shows the detailed statistics about GPU core utilization based on the LowGPUUtilization rule. It also summarizes the GPU utilization statistics (mean, p95, and p5) to determine if the training job is underutilizing GPUs.

Batch Size

This section shows the detailed statistics of total CPU utilization, individual GPU utilizations, and GPU memory footprints. The BatchSize rule determines if you need to change the batch size to better utilize the GPUs. You can check whether the batch size is too small, resulting in underutilization, or too large, causing overutilization and out-of-memory issues. In the plot, the boxes show the p25 and p75 percentile ranges (filled with dark purple and bright yellow, respectively) from the median (p50), and the error bars show the 5th percentile for the lower bound and the 95th percentile for the upper bound.

CPU Bottlenecks

In this section, you can drill down into the CPU bottlenecks that the CPUBottleneck rule detected from
your training job. The rule checks if the CPU utilization is above cpu_threshold (90% by default) and
also if the GPU utilization is below gpu_threshold (10% by default).
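The following is a minimal sketch of the detection criterion described above, using the default thresholds; the function is illustrative and is not part of the rule API.

# Default thresholds described above.
cpu_threshold = 90  # percent
gpu_threshold = 10  # percent

def is_cpu_bottleneck(cpu_utilization, gpu_utilization):
    # A data point counts toward a CPU bottleneck when the CPUs are busy
    # while the GPU sits nearly idle, for example during heavy data
    # preprocessing.
    return cpu_utilization > cpu_threshold and gpu_utilization < gpu_threshold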

The pie charts show the following information:

• Low GPU usage caused by CPU bottlenecks – Shows the ratio of data points between the ones with GPU utilization above and below the threshold and the ones that match the CPU bottleneck criteria.


• Ratio between TRAIN/EVAL phase and others – Shows the ratio between time durations spent on
different training phases.
• Ratio between forward and backward pass – Shows the ratio between time durations spent on
forward and backward pass in the training loop.
• Ratio between CPU/GPU operators – Shows the ratio between time durations spent on GPUs and
CPUs by Python operators, such as data loader processes and forward and backward pass operators.
• General metrics recorded in framework – Shows major framework metrics and the ratio between
time durations spent on the metrics.

I/O Bottlenecks

In this section, you can find a summary of I/O bottlenecks. The rule evaluates the I/O wait time and GPU
utilization rates and monitors if the time spent on the I/O requests exceeds a threshold percent of the
total training time. It might indicate I/O bottlenecks where GPUs are waiting for data to arrive from
storage.

LoadBalancing in Multi-GPU Training

In this section, you can identify workload balancing issues across GPUs.

GPU Memory Analysis

In this section, you can analyze the GPU memory utilization collected by the GPUMemoryIncrease rule.
In the plot, the boxes show the p25 and p75 percentile ranges (filled with dark purple and bright yellow
respectively) from the median (p50), and the error bars show the 5th percentile for the lower bound and
95th percentile for the upper bound.


Analyze Data Using the SMDebug Client Library


While your training job is running or after it has completed, you can access the training data collected
by Debugger using the Amazon SageMaker Python SDK and the SMDebug client library. The SMDebug
library provides analysis and visualization tools that enable you to drill down into your training job data.

To install the library and use the SMDebug analysis tools (in a JupyterLab notebook or an IPython kernel)

! pip install -U smdebug

The following topics walk you through how to use the SMDebug tools to visualize and analyze the
training data collected by Debugger.

Analyze system and framework metrics

• Access the Monitoring and Profiling Data (p. 1740)


• Plot the System Metrics and Framework Metrics Data (p. 1741)
• Access the Profiling Data Using the Pandas Data Parsing Tool (p. 1743)
• Access the Python Profiling Stats Data (p. 1743)
• Merge Timelines of Different Profiling Trace Files (p. 1745)
• Profiling Data Loader (p. 1747)

Access the Monitoring and Profiling Data


The SMDebug TrainingJob class reads data from the S3 bucket where the system and framework
metrics are saved.

To set up a TrainingJob object and retrieve profiling event files of a training job

from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob


tj = TrainingJob(training_job_name, region)

Tip
You need to specify the training_job_name and region parameters to identify the training job to analyze. There are two ways to specify the training job information:

• Use the SageMaker Python SDK while the estimator is still attached to the training job.


import sagemaker
training_job_name=estimator.latest_training_job.job_name
region=sagemaker.Session().boto_region_name

• Pass strings directly.

training_job_name="your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
region="us-west-2"

Note
By default, SageMaker Debugger collects system metrics to monitor hardware resource
utilization and system bottlenecks. If you run the following functions without framework profiling enabled, you might receive error messages about the unavailability of framework metrics. To retrieve framework profiling data and gain insights into framework operations, you must enable framework profiling.

• If you use SageMaker Python SDK to manipulate your training job request, pass the
framework_profile_params to the profiler_config argument of your estimator. To
learn more, see Configure SageMaker Debugger Framework Profiling.
• If you use Studio, turn on profiling using the Profiling toggle button in the Debugger insights
dashboard. To learn more, see SageMaker Debugger Insights Dashboard Controller.

To retrieve the training job description and the S3 bucket URI where the metric data are saved

tj.describe_training_job()
tj.get_config_and_profiler_s3_output_path()

To check if the system and framework metrics are available from the S3 URI

tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()

To create system and framework reader objects after the metric data become available

system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()

To refresh and retrieve the latest training event files

The reader objects have an extended method, refresh_event_file_list(), to retrieve the latest
training event files.

system_metrics_reader.refresh_event_file_list()
framework_metrics_reader.refresh_event_file_list()

Plot the System Metrics and Framework Metrics Data


You can use the system and framework metrics reader objects with the following visualization classes to plot timeline graphs and histograms.
Note
To visualize the data with narrowed-down metrics in the following visualization object plot methods, specify the select_dimensions and select_events parameters. For example, if you specify select_dimensions=["GPU"], the plot methods filter the metrics that include the "GPU" keyword. If you specify select_events=["total"], the plot methods filter the metrics that include the "total" event tags at the end of the metric names. If you enable these parameters and give the keyword strings, the visualization classes return the charts with filtered metrics.

• The MetricsHistogram class

from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram

metrics_histogram = MetricsHistogram(system_metrics_reader)
metrics_histogram.plot(
starttime=0,
endtime=system_metrics_reader.get_timestamp_of_latest_available_file(),
select_dimensions=["CPU", "GPU", "I/O"], # optional
select_events=["total"] # optional
)

• The StepTimelineChart class

from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart

view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)

• The StepHistogram class

from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram

step_histogram = StepHistogram(framework_metrics_reader)
step_histogram.plot(
starttime=step_histogram.last_timestamp - 5 * 1000 * 1000,
endtime=step_histogram.last_timestamp,
show_workers=True
)

• The TimelineCharts class

from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

view_timeline_charts = TimelineCharts(
system_metrics_reader,
framework_metrics_reader,
select_dimensions=["CPU", "GPU", "I/O"], # optional
select_events=["total"] # optional
)

view_timeline_charts.plot_detailed_profiler_data([700,710])

• The Heatmap class

from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap

view_heatmap = Heatmap(
system_metrics_reader,
framework_metrics_reader,
select_dimensions=["CPU", "GPU", "I/O"], # optional
select_events=["total"], # optional
plot_height=450
)


Access the Profiling Data Using the Pandas Data Parsing Tool
The following PandasFrame class provides tools to convert the collected profiling data to Pandas DataFrames.

from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame

The PandasFrame class takes the tj object's S3 bucket output path, and its get_all_system_metrics() and get_all_framework_metrics() methods return the system metrics and the framework metrics in Pandas data format.

pf = PandasFrame(tj.profiler_s3_output_path)
system_metrics_df = pf.get_all_system_metrics()
framework_metrics_df = pf.get_all_framework_metrics(
selected_framework_metrics=[
'Step:ModeKeys.TRAIN',
'Step:ModeKeys.GLOBAL'
]
)
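The returned objects are standard Pandas DataFrames, so you can inspect them with the usual Pandas operations. The following is a minimal sketch, assuming the objects created above.

# Inspect the available columns and the first few records.
print(system_metrics_df.columns.tolist())
print(system_metrics_df.head())
print(framework_metrics_df.head())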

Access the Python Profiling Stats Data


The Python profiling provides framework metrics related to Python functions and operators in your
training scripts and the SageMaker deep learning frameworks.

Training Modes and Phases for Python Profiling

To profile specific intervals during training and partition statistics for each of these intervals, Debugger provides tools to set modes and phases.

For training modes, use the following PythonProfileModes class:

from smdebug.profiler.python_profile_utils import PythonProfileModes

This class provides the following options:

• PythonProfileModes.TRAIN – Use if you want to profile the target steps in the training phase. This mode option is available only for TensorFlow.
• PythonProfileModes.EVAL – Use if you want to profile the target steps in the evaluation phase. This mode option is available only for TensorFlow.
• PythonProfileModes.PREDICT – Use if you want to profile the target steps in the prediction phase. This mode option is available only for TensorFlow.
• PythonProfileModes.GLOBAL – Use if you want to profile the target steps in the global phase, which includes the previous three phases. This mode option is available only for PyTorch.
• PythonProfileModes.PRE_STEP_ZERO – Use if you want to profile the target steps in the initialization stage, before the first training step of the first epoch starts. This phase includes the initial job submission, uploading the training scripts to EC2 instances, preparing the EC2 instances, and downloading input data. This mode option is available for both TensorFlow and PyTorch.
• PythonProfileModes.POST_HOOK_CLOSE – Use if you want to profile the target steps in the finalization stage, after the training job has finished and the Debugger hook is closed. This phase includes profiling data while the training jobs are finalized and completed. This mode option is available for both TensorFlow and PyTorch.

For training phases, use the following StepPhase class:


from smdebug.profiler.analysis.utils.python_profile_analysis_utils import StepPhase

This class provides the following options:

• StepPhase.START – Use to specify the start point of the initialization phase.


• StepPhase.STEP_START – Use to specify the start step of the training phase.
• StepPhase.FORWARD_PASS_END – Use to specify the steps where the forward pass ends. This option
is available only for PyTorch.
• StepPhase.STEP_END – Use to specify the end steps in the training phase. This option is available
only for TensorFlow.
• StepPhase.END – Use to specify the ending point of the finalization (post-hook-close) phase. If the
callback hook is not closed, the finalization phase profiling does not occur.

Python Profiling Analysis Tools

Debugger supports the Python profiling with two profiling tools:

• cProfile – The standard Python profiler. cProfile collects framework metrics on CPU time for every function called while profiling was enabled.
• Pyinstrument – A low-overhead Python profiler that samples profiling events every millisecond.

To learn more about the Python profiling options and what's collected, see Start a Training Job
with the Default System Monitoring and Customized Framework Profiling with Different Profiling
Options (p. 1717).

The following methods of the PythonProfileAnalysis, cProfileAnalysis, and PyinstrumentAnalysis classes are provided to fetch and analyze the Python profiling data. Each function loads the latest data from the default S3 URI.

from smdebug.profiler.analysis.python_profile_analysis import PythonProfileAnalysis, cProfileAnalysis, PyinstrumentAnalysis

To set Python profiling objects for analysis, use the cProfileAnalysis or PyinstrumentAnalysis classes as
shown in the following example code. It shows how to set a cProfileAnalysis object, and if you want
to use PyinstrumentAnalysis, replace the class name.

python_analysis = cProfileAnalysis(
local_profile_dir=tf_python_stats_dir,
s3_path=tj.profiler_s3_output_path
)

The following methods are available for the cProfileAnalysis and PyinstrumentAnalysis classes
to fetch the Python profiling stats data:

• python_analysis.fetch_python_profile_stats_by_time(start_time_since_epoch_in_secs,
end_time_since_epoch_in_secs) – Takes in a start time and end time, and returns the function
stats of step stats whose start or end times overlap with the provided interval.
• python_analysis.fetch_python_profile_stats_by_step(start_step, end_step,
mode, start_phase, end_phase) – Takes in a start step and end step and returns the function
stats of all step stats whose profiled step satisfies start_step <= step < end_step.
• start_step and end_step (str) – Specify the start step and end step to fetch the Python profiling
stats data.


• mode (str) – Specify the mode of training job using the PythonProfileModes enumerator class.
The default is PythonProfileModes.TRAIN. Available options are provided in the Training Modes
and Phases for Python Profiling (p. 1743) section.
• start_phase (str) – Specify the start phase in the target step(s) using the StepPhase enumerator
class. This parameter enables profiling between different phases of training. The default is
StepPhase.STEP_START. Available options are provided in the Training Modes and Phases for
Python Profiling (p. 1743) section.
• end_phase (str) – Specify the end phase in the target step(s) using the StepPhase enumerator
class. This parameter sets up the end phase of training. Available options are as same as the ones for
the start_phase parameter. The default is StepPhase.STEP_END. Available options are provided
in the Training Modes and Phases for Python Profiling (p. 1743) section.
• python_analysis.fetch_profile_stats_between_modes(start_mode, end_mode) –
Fetches stats from the Python profiling between the start and end modes.
• python_analysis.fetch_pre_step_zero_profile_stats() – Fetches the stats from the
Python profiling until step 0.
• python_analysis.fetch_post_hook_close_profile_stats() – Fetches stats from the Python
profiling after the hook is closed.
• python_analysis.list_profile_stats() – Returns a DataFrame of the Python profiling stats.
Each row holds the metadata for each instance of profiling and the corresponding stats file (one per
step).
• python_analysis.list_available_node_ids() – Returns a list of the available node IDs for the Python profiling stats.
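For example, the following is a minimal sketch that fetches the stats for the step profiled in the earlier python_profiling_config example (step 9), assuming the python_analysis object created above.

from smdebug.profiler.python_profile_utils import PythonProfileModes
from smdebug.profiler.analysis.utils.python_profile_analysis_utils import StepPhase

# Fetch Python profiling stats for step 9 (9 <= step < 10), from step
# start to step end of the training phase.
step_stats = python_analysis.fetch_python_profile_stats_by_step(
    start_step=9,
    end_step=10,
    mode=PythonProfileModes.TRAIN,
    start_phase=StepPhase.STEP_START,
    end_phase=StepPhase.STEP_END
)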

The cProfileAnalysis class provides the following additional methods:

• fetch_profile_stats_by_training_phase() – Fetches and aggregates the Python profiling stats for every possible combination of start and end modes. For example, if training and validation phases are done while detailed profiling is enabled, the combinations are (PRE_STEP_ZERO, TRAIN), (TRAIN, TRAIN), (TRAIN, EVAL), (EVAL, EVAL), and (EVAL, POST_HOOK_CLOSE). All stats files within each of these combinations are aggregated.
• fetch_profile_stats_by_job_phase() – Fetches and aggregates the Python profiling stats by job phase. The job phases are initialization (profiling until step 0), training_loop (training and validation), and finalization (profiling after the hook is closed).

Merge Timelines of Different Profiling Trace Files


The SMDebug client library provides profiling analysis and visualization tools for merging timelines of system metrics, framework metrics, and Python profiling data collected by Debugger.
Tip
Before proceeding, you need to set up a TrainingJob object that is used throughout the examples on this page. For more information about setting up a TrainingJob object, see Access the Monitoring and Profiling Data (p. 1740).

The MergedTimeline class provides tools to integrate and correlate different profiling information
in a single timeline. After Debugger captures profiling data and annotations from different phases of a
training job, JSON files of trace events are saved in a default tracefolder directory.

• For annotations in the Python layers, the trace files are saved in *pythontimeline.json.
• For annotations in the TensorFlow C++ layers, the trace files are saved in *model_timeline.json.
• Tensorflow profiler saves events in a *trace.json.gz file.


Tip
If you want to list all of the JSON trace files, use the following AWS CLI command:

! aws s3 ls {tj.profiler_s3_output_path} --recursive | grep '\.json$'

Putting and aligning the trace events captured from the different profiling sources in a single plot can provide an overview of all of the events occurring in different phases of the training job.

Tip
To interact with the merged timeline in the tracing app using a keyboard, use the W key for zooming in, the A key for shifting to the left, the S key for zooming out, and the D key for shifting to the right.

The multiple event trace JSON files can be merged into one trace event JSON file using the following MergedTimeline API operation and class method from the smdebug.profiler.analysis.utils.merge_timelines module.

from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline

combined_timeline = MergedTimeline(path, file_suffix_filter, output_directory)
combined_timeline.merge_timeline(start, end, unit)

The MergedTimeline API operation passes the following parameters:

• path (str) – Specify a root folder (/profiler-output) that contains system and framework profiling
trace files. You can locate the profiler-output using the SageMaker estimator classmethod or
the TrainingJob object. For example, estimator.latest_job_profiler_artifacts_path() or
tj.profiler_s3_output_path.
• file_suffix_filter (list) – Specify a list of file suffix filters to merge timelines. Available suffix filters are ["model_timeline.json", "pythontimeline.json", "trace.json.gz"]. If this parameter is not manually specified, all of the trace files are merged by default.


• output_directory (str) – Specify a path to save the merged timeline JSON file. The default is to the
directory specified for the path parameter.

The merge_timeline() method takes the following parameters to run the merging process:

• start (int) – Specify start time (in microseconds and in Unix time format) or start step to merge
timelines.
• end (int) – Specify end time (in microseconds and in Unix time format) or end step to merge timelines.
• unit (str) – Choose between "time" and "step". The default is "time".

Using the following example code, run the merge_timeline() method to generate the merged timeline JSON file.

• Merge timeline with the "time" unit option. The following example code merges all available trace
files between the Unix start time (the absolute zero Unix time) and the current Unix time, which means
that you can merge the timelines for the entire training duration.

import time
from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
from smdebug.profiler.profiler_constants import CONVERT_TO_MICROSECS

combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")


combined_timeline.merge_timeline(0, int(time.time() * CONVERT_TO_MICROSECS))

• Merge timeline with the "step" unit option. The following example code merges all available
timelines between step 3 and step 9.

from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline

combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")


combined_timeline.merge_timeline(3, 9, unit="step")

Open the Chrome tracing app at chrome://tracing in a Chrome browser, and open the JSON file. You can explore the output to plot the merged timeline.

Profiling Data Loader


In PyTorch, data loader iterators, such as SingleProcessingDataLoaderIter and
MultiProcessingDataLoaderIter, are initiated at the beginning of every iteration over a dataset.
During the initialization phase, PyTorch spins up worker processes depending on the configured number
of workers, establishes a data queue to fetch data, and starts pin_memory threads.

To use the PyTorch data loader profiling analysis tool, import the following PT_dataloader_analysis
class:

from smdebug.profiler.analysis.utils.pytorch_dataloader_analysis import PT_dataloader_analysis

Pass the profiling data retrieved as a pandas DataFrame object in the Access the Profiling Data Using the
Pandas Data Parsing Tool (p. 1743) section:

pt_analysis = PT_dataloader_analysis(pf)

The following functions are available for the pt_analysis object:



• pt_analysis.analyze_dataloaderIter_initialization()

The analysis outputs the median and maximum duration for these initializations. If there are
outliers (that is, a duration greater than 2 * median), the function prints the start and end times
for those durations. You can use these to inspect system metrics during those time intervals.

The following list shows what analysis is available from this class method:
• Which type of data loader iterators were initialized.
• The number of workers per iterator.
• Inspect whether the iterator was initialized with or without pin_memory.
• Number of times the iterators were initialized during training.
• pt_analysis.analyze_dataloaderWorkers()

The following list shows what analysis is available from this class method:
• The number of worker processes that were spun off during the entire training.
• Median and maximum duration for the worker processes.
• Start and end time for the worker processes that are outliers.
• pt_analysis.analyze_dataloader_getnext()

The following list shows what analysis is available from this class method:
• Number of GetNext calls made during the training.
• Median and maximum duration in microseconds for GetNext calls.
• Start time, end time, duration, and worker ID for the outlier GetNext calls.
• pt_analysis.analyze_batchtime(start_timestamp, end_timestamp,
select_events=[".*"], select_dimensions=[".*"])

Debugger collects the start and end times of all the GetNext calls. You can find the amount of time
spent by the training script on one batch of data. Within the specified time window, you can identify
the calls that are not directly contributing to the training. These calls can be from the following
operations: computing the accuracy, adding the losses for debugging or logging purposes, and printing
the debugging information. Operations like these can be compute intensive or time consuming.
You can identify such operations by correlating the Python profiler, system metrics, and framework
metrics.

The following list shows what analysis is available from this class method:
• Profile time spent on each data batch, BatchTime_in_seconds, by finding the difference between
start times of current and subsequent GetNext calls.
• Find the outliers in BatchTime_in_seconds and start and end time for those outliers.
• Obtain the system and framework metrics during those BatchTime_in_seconds timestamps. This
indicates where the time was spent.
• pt_analysis.plot_the_window()

Plots a timeline chart between the start and end timestamps.
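
The following is a minimal sketch that runs these analyses end to end. It assumes that pf is the
pandas DataFrame retrieved with the pandas data parsing tool, and the start and end timestamps are
hypothetical placeholder values for the time window that you want to inspect:

from smdebug.profiler.analysis.utils.pytorch_dataloader_analysis import PT_dataloader_analysis

# pf is assumed to be the DataFrame retrieved with the pandas data parsing tool
pt_analysis = PT_dataloader_analysis(pf)

# Summarize data loader iterator initializations and their outliers
pt_analysis.analyze_dataloaderIter_initialization()

# Summarize the worker processes spun off during training
pt_analysis.analyze_dataloaderWorkers()

# Summarize GetNext call durations and their outliers
pt_analysis.analyze_dataloader_getnext()

# Profile per-batch time within a hypothetical time window
# (Unix timestamps in microseconds; placeholder values)
start_timestamp = 1690000000000000
end_timestamp = 1690000600000000
pt_analysis.analyze_batchtime(start_timestamp, end_timestamp)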

List of Debugger Built-in Rules


Use the Debugger built-in rules provided by Amazon SageMaker Debugger to analyze metrics and
tensors collected while training your models. The Debugger built-in rules monitor various common
conditions that are critical for the success of a training job. You can call the built-in rules using the
Amazon SageMaker Python SDK or the low-level SageMaker API operations. There's no additional cost
for using the built-in rules. For more information about billing, see the Amazon SageMaker Pricing page.
Note
The maximum number of built-in rules that you can attach to a training job is 20 for
ProfilerRule and 20 for Rule. SageMaker Debugger fully manages the built-in rules and
analyzes your training job synchronously.
Important
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the
SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment,
run the following code to install the latest versions of the libraries and restart the kernel.

import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)

Debugger ProfilerRule
The following rules are the Debugger built-in rules that are callable using the
ProfilerRule.sagemaker classmethod.

Debugger built-in rule for generating the profiling report

Scope of Validity: Profiling report for any SageMaker training job
Built-in Rules:
• ProfilerReport (p. 1751)

Debugger built-in rules for profiling hardware system resource utilization (system metrics)

Scope of Validity: Generic system monitoring rules for any SageMaker training job
Built-in Rules:
• BatchSize (p. 1752)
• CPUBottleneck (p. 1754)
• GPUMemoryIncrease (p. 1755)
• IOBottleneck (p. 1756)
• LoadBalancing (p. 1757)
• LowGPUUtilization (p. 1758)
• OverallSystemUsage (p. 1759)

Debugger built-in rules for profiling framework metrics

Scope of Validity: Profiling rules for deep learning frameworks (TensorFlow and PyTorch)
Built-in Rules:
• MaxInitializationTime (p. 1759)
• OverallFrameworkMetrics (p. 1760)
• StepOutlier (p. 1760)

Warning
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11
and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and
SDKs as follows.


• SageMaker Python SDK <= v2.130.0

• PyTorch >= v1.6.0, < v2.0


• TensorFlow >= v2.3.1, < v2.11

With the deprecation, SageMaker Debugger also discontinues support for the three
ProfilerRules for framework profiling. See also Amazon SageMaker Debugger Release Notes:
March 16, 2023 (p. 1820).

Debugger Rule
The following rules are the Debugger built-in rules that are callable using the Rule.sagemaker
classmethod.

Debugger built-in rules for generating training reports

Scope of Validity: Training report for a SageMaker XGBoost training job
Built-in Rules:
• create_xgboost_report (p. 1761)

Debugger built-in rules for debugging model training data (output tensors)

Scope of Validity: Deep learning frameworks (TensorFlow, MXNet, and PyTorch)
Built-in Rules:
• dead_relu (p. 1762)
• exploding_tensor (p. 1763)
• poor_weight_initialization (p. 1765)
• saturated_activation (p. 1766)
• vanishing_gradient (p. 1769)
• weight_update_ratio (p. 1770)

Scope of Validity: Deep learning frameworks (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm
Built-in Rules:
• all_zero (p. 1771)
• class_imbalance (p. 1773)
• loss_not_decreasing (p. 1775)
• overfit (p. 1777)
• overtraining (p. 1779)
• similar_across_runs (p. 1780)
• stalled_training_rule (p. 1781)
• tensor_variance (p. 1783)
• unchanged_tensor (p. 1784)

Scope of Validity: Deep learning applications
Built-in Rules:
• check_input_images (p. 1786)
• nlp_sequence_ratio (p. 1788)

Scope of Validity: XGBoost algorithm
Built-in Rules:
• confusion (p. 1789)
• feature_importance_overweight (p. 1791)
• tree_depth (p. 1792)

To use the built-in rules with default parameter values, use the following configuration format:


from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_n()),
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n())
]

To use the built-in rules with customized parameter values, use the following configuration format:

from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules = [
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInRuleName(),
        rule_parameters={
            "key": "value"
        }
    ),
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]

To find available keys for the rule_parameters parameter, see the parameter description tables.

Sample rule configuration codes are provided for each built-in rule below the parameter description
tables.

• For a full instruction and examples of using the Debugger built-in rules, see Debugger Built-in Rules
Example Code (p. 1681).
• For a full instruction on using the built-in rules with the low-level SageMaker API operations, see
Configure Debugger Using Amazon SageMaker API (p. 1799).

ProfilerReport
The ProfilerReport rule invokes all of the built-in rules for monitoring and profiling. It creates a profiling
report and updates it when the individual rules are triggered. You can download a comprehensive profiling
report while a training job is running or after the training job is complete. You can adjust the rule
parameter values to customize the sensitivity of the built-in monitoring and profiling rules. The following
example code shows the basic format to adjust the built-in rule parameters through the ProfilerReport
rule.


rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            <BuiltInRuleName>_<parameter_name> = value
        )
    )
]

If you trigger this ProfilerReport rule without any customized parameter as shown in the following
example code, then the ProfilerReport rule triggers all of the built-in rules for monitoring and profiling
with their default parameter values.

rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]

The following example code shows how to specify and adjust the CPUBottleneck rule's cpu_threshold
parameter and the IOBottleneck rule's threshold parameter.

rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            CPUBottleneck_cpu_threshold = 90,
            IOBottleneck_threshold = 90
        )
    )
]

To explore what's in the profiler report, see SageMaker Debugger Profiling Report. Also, because this
rule activates all of the profiling rules, you can check the rule analysis status using the SageMaker
Debugger UI in SageMaker Studio Experiments.
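
As a minimal sketch (assuming that estimator is the SageMaker estimator object that launched the
training job), you can also poll the evaluation status of each attached rule programmatically through
the SageMaker Python SDK:

# Print the evaluation status of each rule attached to the training job
for summary in estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], "-", summary["RuleEvaluationStatus"])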

Parameter Descriptions for the ProfilerReport Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

<BuiltInRuleName>_<parameter_name> Customizable parameter to adjust thresholds of


other built-in monitoring and profiling rules.

Optional

Default value: None

BatchSize
The BatchSize rule helps detect if the GPU is underutilized because of a small batch size. To detect this
issue, the rule monitors the average CPU utilization, GPU utilization, and GPU memory utilization. If
utilization on CPU, GPU, and GPU memory is low on average, it may indicate that the training job can
either run on a smaller instance type or run with a bigger batch size. This analysis does not work for
frameworks that heavily overallocate memory. Note, however, that increasing the batch size can lead to
processing or data loading bottlenecks, because more data preprocessing time is required in each iteration.


Parameter Descriptions for the BatchSize Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

cpu_threshold_p95 Defines the threshold for 95th quantile of CPU


utilization in percentage.

Optional

Valid values: Integer

Default value: 70 (in percentage)

gpu_threshold_p95 Defines the threshold for 95th quantile of GPU


utilization in percentage.

Optional

Valid values: Integer

Default value: 70 (in percentage)

gpu_memory_threshold_p95 Defines the threshold for 95th quantile of GPU


memory utilization in percentage.

Optional

Valid values: Integer

Default values: 70 (in percentage)

patience Defines the number of data points to skip until


the rule starts evaluation. The first several steps
of training jobs usually show high volume of data
processes, so keep the rule patient and prevent
it from being invoked too soon with a given
number of profiling data that you specify with this
parameter.

Optional

Valid values: Integer

Default values: 100

window Window size for computing quantiles.

Optional

Valid values: Integer

Default values: 500


scan_interval_us Time interval that timeline files are scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)
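
The following is a minimal sketch of attaching the BatchSize rule with customized thresholds, following
the configuration format shown earlier in this topic; the parameter values are illustrative, and the
parameter names come from the preceding table:

from sagemaker.debugger import ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(
        base_config=rule_configs.BatchSize(),
        rule_parameters={
            "cpu_threshold_p95": "70",
            "gpu_threshold_p95": "70",
            "gpu_memory_threshold_p95": "70",
            "patience": "100",
            "window": "500"
        }
    )
]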

CPUBottleneck
The CPUBottleneck rule helps detect if the GPU is underutilized due to CPU bottlenecks. The rule returns
True if the number of CPU bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the CPUBottleneck Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold Defines the threshold for the proportion of

bottlenecked time to the total training time. If
the proportion exceeds the percentage specified
for the threshold parameter, the rule switches its
status to True.

Optional

Valid values: Integer

Default value: 50 (in percentage)

gpu_threshold A threshold that defines low GPU utilization.

Optional

Valid values: Integer

Default value: 10 (in percentage)

cpu_threshold A threshold that defines high CPU utilization.

Optional

Valid values: Integer

Default values: 90 (in percentage)

patience Defines the number of data points to skip until


the rule starts evaluation. The first several steps
of training jobs usually show high volume of data
processes, so keep the rule patient and prevent
it from being invoked too soon with a given
number of profiling data that you specify with this
parameter.

Optional

Valid values: Integer

Default values: 100

scan_interval_us Time interval with which timeline files are


scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)
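
For example, the following minimal sketch attaches the CPUBottleneck rule to a training job; the
estimator settings (entry point, IAM role, instance type, and framework versions) are hypothetical
placeholders that you would replace with your own:

from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(
        base_config=rule_configs.CPUBottleneck(),
        rule_parameters={
            "threshold": "50",
            "cpu_threshold": "90",
            "gpu_threshold": "10"
        }
    )
]

estimator = PyTorch(
    entry_point="train.py",           # hypothetical training script
    role="<your-iam-role-arn>",       # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",         # hypothetical framework version
    py_version="py38",
    rules=rules
)
estimator.fit()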

GPUMemoryIncrease
The GPUMemoryIncrease rule helps detect a large increase in memory usage on GPUs.

Parameter Descriptions for the GPUMemoryIncrease Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

increase Defines the threshold for absolute memory


increase.

Optional

Valid values: Integer

Default value: 10 (in percentage)

patience Defines the number of data points to skip until


the rule starts evaluation. The first several steps
of training jobs usually show high volume of data
processes, so keep the rule patient and prevent
it from being invoked too soon with a given
number of profiling data that you specify with this
parameter.

Optional

Valid values: Integer

Default values: 100


window Window size for computing quantiles.

Optional

Valid values: Integer

Default values: 500

scan_interval_us Time interval that timeline files are scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)

IOBottleneck
This rule helps detect if the GPU is underutilized due to data IO bottlenecks. The rule returns True if the
number of IO bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the IOBottleneck Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold Defines the threshold at which the rule returns True.

Optional

Valid values: Integer

Default value: 50 (in percentage)

gpu_threshold A threshold that defines when GPU is considered


underutilized.

Optional

Valid values: Integer

Default value: 70 (in percentage)

io_threshold A threshold that defines high IO wait time.

Optional

Valid values: Integer

Default values: 50 (in percentage)


patience Defines the number of data points to skip until


the rule starts evaluation. The first several steps
of training jobs usually show high volume of data
processes, so keep the rule patient and prevent
it from being invoked too soon with a given
number of profiling data that you specify with this
parameter.

Optional

Valid values: Integer

Default values: 1000

scan_interval_us Time interval that timeline files are scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)

LoadBalancing
The LoadBalancing rule helps detect issues in workload balancing among multiple GPUs.

Parameter Descriptions for the LoadBalancing Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold Defines the workload percentage.

Optional

Valid values: Float

Default value: 0.5 (unitless proportion)

patience Defines the number of data points to skip until


the rule starts evaluation. The first several steps
of training jobs usually show high volume of data
processes, so keep the rule patient and prevent
it from being invoked too soon with a given
number of profiling data that you specify with this
parameter.

Optional

Valid values: Integer

Default values: 10

scan_interval_us Time interval that timeline files are scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)

LowGPUUtilization
The LowGPUUtilization rule helps detect if GPU utilization is low or suffers from fluctuations. This is
checked for each GPU on each worker. The rule returns True if the 95th quantile is below threshold_p95,
which indicates underutilization. The rule also returns True if the 95th quantile is above threshold_p95
and the 5th quantile is below threshold_p5, which indicates fluctuations.

Parameter Descriptions for the LowGPUUtilization Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold_p95 A threshold for 95th quantile below which GPU is


considered to be underutilized.

Optional

Valid values: Integer

Default value: 70 (in percentage)

threshold_p5 A threshold for 5th quantile. Default is 10 percent.

Optional

Valid values: Integer

Default values: 10 (in percentage)

patience Defines the number of data points to skip until


the rule starts evaluation. The first several steps
of training jobs usually show high volume of data
processes, so keep the rule patient and prevent
it from being invoked too soon with a given
number of profiling data that you specify with this
parameter.

Optional

Valid values: Integer

Default values: 1000

window Window size for computing quantiles.

Optional

Valid values: Integer

Default values: 500

scan_interval_us Time interval that timeline files are scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)
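
The monitoring rules can also be combined in a single rules list with their default parameter values, as
in this minimal sketch:

from sagemaker.debugger import ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.GPUMemoryIncrease()),
    ProfilerRule.sagemaker(rule_configs.LoadBalancing())
]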

OverallSystemUsage
The OverallSystemUsage rule measures overall system usage per worker node. The rule currently only
aggregates values per node and computes their percentiles.

Parameter Descriptions for the OverallSystemUsage Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

scan_interval_us Time interval to scan timeline files.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)

MaxInitializationTime
The MaxInitializationTime rule helps detect if the training initialization is taking too much time. The rule
waits until the first step is available.

Parameter Descriptions for the MaxInitializationTime Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold Defines the threshold in minutes to wait for the


first step to become available.

Optional

Valid values: Integer

Default value: 20 (in minutes)

scan_interval_us Time interval with which timeline files are


scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)

OverallFrameworkMetrics
The OverallFrameworkMetrics rule summarizes the time spent on framework metrics, such as forward
and backward pass, and data loading.

Parameter Descriptions for the OverallFrameworkMetrics Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

scan_interval_us Time interval to scan timeline files.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)

StepOutlier
The StepOutlier rule helps detect outliers in step durations. This rule returns True if there are outliers
with step durations larger than stddev times the standard deviation of all step durations in a time range.


Parameter Descriptions for the StepOutlier Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

stddev Defines a factor by which to multiply the standard


deviation. For example, the rule is invoked by
default when a step duration is larger or smaller
than 5 times the standard deviation.

Optional

Valid values: Integer

Default value: 5

mode Mode under which steps have been saved and on

which the rule should run. By default, the rule runs
on steps from the EVAL and TRAIN phases.

Optional

Valid values: String

Default value: None

n_outliers The number of outliers to ignore before the rule

returns True.

Optional

Valid values: Integer

Default value: 10

scan_interval_us Time interval with which timeline files are


scanned.

Optional

Valid values: Integer

Default values: 60000000 (in microseconds)
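
Because the ProfilerReport rule also invokes the framework profiling rules, their parameters can be
adjusted through the <BuiltInRuleName>_<parameter_name> format shown earlier in this topic. The
following is a minimal sketch with illustrative values:

rules = [
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            StepOutlier_stddev = 5,
            StepOutlier_n_outliers = 10,
            MaxInitializationTime_threshold = 20
        )
    )
]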

CreateXgboostReport
The CreateXgboostReport rule collects output tensors from an XGBoost training job and autogenerates a
comprehensive training report. You can download a comprehensive profiling report while a training job
is running or after the training job is complete, and check progress of training or the final result of the
training job. The CreateXgboostReport rule collects the following output tensors by default:

• hyperparameters – Saves at the first step


• metrics – Saves loss and accuracy every 5 steps


• feature_importance – Saves every 5 steps
• predictions – Saves every 5 steps
• labels – Saves every 5 steps

Parameter Descriptions for the CreateXgboostReport Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

rules=[
    Rule.sagemaker(
        rule_configs.create_xgboost_report()
    )
]

DeadRelu
This rule detects when the percentage of rectified linear unit (ReLU) activation functions in a trial is
considered dead because their activation activity has dropped below a threshold. If the percentage of
inactive ReLUs in a layer is greater than the threshold_layer value, the rule returns True.

Parameter Descriptions for the DeadRelu Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

tensor_regex A list of regex patterns used to restrict this


comparison to specific scalar-valued tensors. The
rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: List of strings or a comma-separated


string

Default value: ".*relu_output"


threshold_inactivity Defines a level of activity below which a ReLU is


considered to be dead. A ReLU might be active in
the beginning of a trial and then slowly die during
the training process. If the ReLU is active less than
the threshold_inactivity, it is considered to
be dead.

Optional

Valid values: Float

Default values: 1.0 (in percentage)

threshold_layer Returns True if the percentage of inactive ReLUs


in a layer is greater than threshold_layer.

Returns False if the percentage of inactive ReLUs


in a layer is less than threshold_layer.

Optional

Valid values: Float

Default values: 50.0 (in percentage)

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.dead_relu(),
        rule_parameters={
            "tensor_regex": ".*relu_output|.*ReLU_output",
            "threshold_inactivity": "1.0",
            "threshold_layer": "50.0"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_relu_collection",
                parameters={
                    "include_regex": ".*relu_output|.*ReLU_output",
                    "save_interval": "500"
                }
            )
        ]
    )
]

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.

ExplodingTensor
This rule detects whether the tensors emitted during training have non-finite values, either infinite or
NaN (not a number). If a non-finite value is detected, the rule returns True.


Parameter Descriptions for the ExplodingTensor Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

collection_names The list of collection names whose tensors the


rule inspects.

Optional

Valid values: String

Default value: None

tensor_regex A list of regex patterns used to restrict this


comparison to specific scalar-valued tensors. The
rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: String

Default value: None

only_nan True to monitor the base_trial tensors only


for NaN values and not for infinity.

False to treat both NaN and infinity as exploding


values and to monitor for both.

Optional

Default value: False

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.exploding_tensor(),
        rule_parameters={
            "tensor_regex": ".*gradient",
            "only_nan": "False"
        },
        collections_to_save=[
            CollectionConfig(
                name="gradients",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.

PoorWeightInitialization
This rule detects if your model parameters have been poorly initialized.

Good initialization breaks the symmetry of the weights and gradients in a neural network and maintains
commensurate activation variances across layers. Otherwise, the neural network doesn't learn effectively.
Initializers like Xavier aim to keep variance constant across activations, which is especially relevant for
training very deep neural nets. Too small an initialization can lead to vanishing gradients. Too large an
initialization can lead to exploding gradients. This rule checks the variance of activation inputs across
layers, the distribution of gradients, and the loss convergence for the initial steps to determine if a neural
network has been poorly initialized.

Parameter Descriptions for the PoorWeightInitialization Rule


Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

activation_inputs_regex A list of regex patterns used to restrict this


comparison to specific scalar-valued tensors. The
rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: String

Default value: ".*relu_input"

threshold If the ratio between minimum and maximum


variance of weights per layer exceeds the
threshold at a step, the rule returns True.

Optional

Valid values: Float

Default value: 10.0

distribution_range If the minimum difference between 5th and 95th


percentiles of the gradient distribution is less
than the distribution_range, the rule returns
True.

Optional

Valid values: Float

Default value: 0.001

patience The number of steps to wait until the loss is


considered to be no longer decreasing.

Optional

Valid values: Integer

Default value: 5

steps The number of steps this rule analyzes. You


typically need to check only the first few
iterations.

Optional

Valid values: Float

Default value: 10

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.poor_weight_initialization(),
        rule_parameters={
            "activation_inputs_regex": ".*relu_input|.*ReLU_input",
            "threshold": "10.0",
            "distribution_range": "0.001",
            "patience": "5",
            "steps": "10"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_relu_collection",
                parameters={
                    "include_regex": ".*relu_input|.*ReLU_input",
                    "save_interval": "500"
                }
            )
        ]
    )
]

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.

SaturatedActivation
This rule detects if the tanh and sigmoid activation layers are becoming saturated. An activation
layer is saturated when the input of the layer is close to the maximum or minimum of the activation
function. The minimum and maximum of the tanh and sigmoid activation functions are defined by their
respective min_threshold and max_threshold values. If the activity of a node drops below the
threshold_inactivity percentage, it is considered saturated. If more than a threshold_layer
percentage of the nodes are saturated, the rule returns True.

Parameter Descriptions for the SaturatedActivation Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

collection_names The list of collection names whose tensors the


rule inspects.

Optional

Valid values: List of strings or a comma-separated


string

Default value: None

tensor_regex A list of regex patterns used to restrict this


comparison to specific scalar-valued tensors. The
rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: String

Default value:
".*tanh_input|.*sigmoid_input".

threshold_tanh_min The minimum and maximum thresholds


that define the extremes of the input for
a tanh activation function, defined as:
(min_threshold, max_threshold). The
default values are determined based on a
vanishing gradient threshold of 0.0000001.

Optional

Valid values: Float

Default values: -9.4999

threshold_tanh_max The minimum and maximum thresholds


that define the extremes of the input for
a tanh activation function, defined as:
(min_threshold, max_threshold). The
default values are determined based on a
vanishing gradient threshold of 0.0000001.

Optional

Valid values: Float

Default values: 9.4999

threshold_sigmoid_min The minimum and maximum thresholds


that define the extremes of the input for
a sigmoid activation function, defined as:
(min_threshold, max_threshold). The
default values are determined based on a
vanishing gradient threshold of 0.0000001.

Optional

Valid values: Float

Default values: -23

threshold_sigmoid_max The minimum and maximum thresholds


that define the extremes of the input for
a sigmoid activation function, defined as:
(min_threshold, max_threshold). The
default values are determined based on a
vanishing gradient threshold of 0.0000001.

Optional

Valid values: Float

Default values: 16.99999

threshold_inactivity The percentage of inactivity below which the


activation layer is considered to be saturated. The
activation might be active in the beginning of a
trial and then slowly become less active during
the training process.

Optional

Valid values: Float

Default values: 1.0

threshold_layer Returns True if the number of saturated


activations in a layer is greater than the
threshold_layer percentage.

Returns False if the number of saturated


activations in a layer is less than the
threshold_layer percentage.

Optional

Valid values: Float

Default values: 50.0


built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.saturated_activation(),
        rule_parameters={
            "tensor_regex": ".*tanh_input|.*sigmoid_input",
            "threshold_tanh_min": "-9.4999",
            "threshold_tanh_max": "9.4999",
            "threshold_sigmoid_min": "-23",
            "threshold_sigmoid_max": "16.99999",
            "threshold_inactivity": "1.0",
            "threshold_layer": "50.0"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_activations_collection",
                parameters={
                    "include_regex": ".*tanh_input|.*sigmoid_input",
                    "save_interval": "500"
                }
            )
        ]
    )
]

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.

VanishingGradient
This rule detects if the gradients in a trial become extremely small or drop to a zero magnitude. If the
mean of the absolute values of the gradients drops below a specified threshold, the rule returns True.

Parameter Descriptions for the VanishingGradient Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold The value at which the gradient is determined to


be vanishing.

Optional

Valid values: Float

Default value: 0.0000001.

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.vanishing_gradient(),
        rule_parameters={
            "threshold": "0.0000001"
        },
        collections_to_save=[
            CollectionConfig(
                name="gradients",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.

WeightUpdateRatio
This rule keeps track of the ratio of updates to weights during training and detects if that ratio gets too
large or too small. If the ratio of updates to weights is larger than the large_threshold value or if
this ratio is smaller than small_threshold, the rule returns True.

Conditions for training are best when the updates are commensurate with the gradients. Excessively
large updates can push the weights away from optimal values, and very small updates result
in very slow convergence. This rule requires weights to be available for two training steps, and
train.save_interval needs to be set equal to num_steps.

Parameter Descriptions for the WeightUpdateRatio Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

num_steps The number of steps across which you want to

compare the weight ratios. If you pass no value,
the rule runs by default against the current
step and the immediately previous saved step.
If you override the default by passing a value
for this parameter, the comparison is done
between weights at step s and at a step >= s -
num_steps.

Optional

Valid values: Integer

Default value: None


large_threshold The maximum value that the ratio of updates to


weight can take before the rule returns True.

Optional

Valid values: Float

Default value: 10.0

small_threshold The minimum value that the ratio of updates to


weight can take, below which the rule returns
True.

Optional

Valid values: Float

Default value: 0.00000001

epsilon A small constant used to ensure that Debugger


does not divide by zero when computing the ratio
of updates to weights.

Optional

Valid values: Float

Default value: 0.000000001

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.weight_update_ratio(),
        rule_parameters={
            "num_steps": "100",
            "large_threshold": "10.0",
            "small_threshold": "0.00000001",
            "epsilon": "0.000000001"
        },
        collections_to_save=[
            CollectionConfig(
                name="weights",
                parameters={
                    "train.save_interval": "100"
                }
            )
        ]
    )
]

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
This rule is not available for the XGBoost algorithm.

AllZero
This rule detects if all or a specified percentage of the tensor values are zero.


This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the AllZero Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

collection_names The list of collection names whose tensors the


rule inspects.

Optional

Valid values: List of strings or a comma-separated


string

Default value: None

tensor_regex A list of regex patterns used to restrict this


comparison to specific scalar-valued tensors. The
rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: List of strings or a comma-separated


string

Default value: None

threshold Specifies the percentage of values in the tensor


that need to be zero for this rule to be invoked.

Optional

Valid values: Float

Default value: 100 (in percentage)

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.all_zero(),
        rule_parameters={
            "tensor_regex": ".*",
            "threshold": "100"
        },
        collections_to_save=[
            CollectionConfig(
                name="all",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

ClassImbalance
This rule measures sampling imbalances between classes and throws errors if the imbalance exceeds a
threshold or if too many mispredictions for underrepresented classes occur as a result of the imbalance.

Classification models require well-balanced classes in the training dataset or a proper weighting/
sampling of classes during training. The rule performs the following checks:

• It counts the occurrences per class. If the ratio of number of samples between smallest and largest
class is larger than the threshold_imbalance, an error is thrown.
• It checks the prediction accuracy per class. If resampling or weighting has not been correctly applied,
then the model can reach high accuracy for the class with many training samples, but low accuracy
for the classes with few training samples. If a fraction of mispredictions for a certain class is above
threshold_misprediction, an error is thrown.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the ClassImbalance Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold_imbalance The acceptable imbalance between the number


of samples in the smallest class and in the largest
class. Exceeding this threshold value throws an
error.

Optional

Valid values: Float

Default value: 10

threshold_misprediction A limit on the fraction of mispredictions allowed


for each class. Exceeding this threshold throws an
error. The underrepresented classes are most at
risk of crossing this threshold.

Optional

Valid values: Float

Default value: 0.7

samples The number of labels that have to be processed


before an imbalance is evaluated. The rule might
not be triggered until it has seen sufficient
samples across several steps. The more classes
that your dataset contains, the larger this sample
number should be.

Optional

Valid values: Integer

Default value: 500 (assuming a dataset like MNIST


with 10 classes)

argmax If True, np.argmax is applied to the prediction


tensor. Required when you have a vector of
probabilities for each class. It is used to determine
which class has the highest probability.

Conditional

Valid values: Boolean

Default value: False

labels_regex The name of the tensor that contains the labels.

Optional

Valid values: String

Default value: ".*labels"

predictions_regex The name of the tensor that contains the


predictions.

Optional

Valid values: String

Default value: ".*predictions"

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.class_imbalance(),
        rule_parameters={
            "threshold_imbalance": "10",
            "threshold_misprediction": "0.7",
            "samples": "500",
            "argmax": "False",
            "labels_regex": ".*labels",
            "predictions_regex": ".*predictions"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_output_collection",
                parameters={
                    "include_regex": ".*labels|.*predictions",
                    "save_interval": "500"
                }
            )
        ]
    )
]

LossNotDecreasing
This rule detects when the loss is not decreasing in value at an adequate rate. These losses must be
scalars.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the LossNotDecreasing Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

collection_names The list of collection names whose tensors the


rule inspects.

Optional

Valid values: List of strings or a comma-separated


string

Default value: None

tensor_regex A list of regex patterns that is used to restrict


this comparison to specific scalar-valued tensors.
The rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: List of strings or a comma-separated
string

Default value: None

use_losses_collection If set to True, looks for losses in the collection


named "losses" when the collection is present.

Optional

Valid values: Boolean

Default value: True

num_steps The minimum number of steps after which


the rule checks if the loss has decreased. Rule
evaluation happens every num_steps. The rule
compares the loss for this step with the loss at
a step which is at least num_steps behind the
current step. For example, suppose that the loss is
being saved every three steps, but num_steps is
set to 10. At step 21, loss for step 21 is compared
with loss for step 9. The next step at which loss is
checked is step 33, because ten steps after step
21 is step 31, and at step 31 and step 32 loss is
not saved.

Optional

Valid values: Integer

Default value: 10

diff_percent The minimum percentage difference by which the


loss should decrease between num_steps.

Optional

Valid values: 0.0 < float < 100

Default value: 0.1 (in percentage)

increase_threshold_percent The maximum threshold percent that loss


is allowed to increase if the loss has been
increasing.

Optional

Valid values: 0 < float < 100

Default value: 5 (in percentage)


mode The name of the Debugger mode to query


tensor values for rule checking. If this is not
passed, the rule checks in order by default for
the mode.EVAL, then mode.TRAIN, and then
mode.GLOBAL.

Optional

Valid values: String (EVAL, TRAIN, or GLOBAL)

Default value: GLOBAL

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        rule_parameters={
            "tensor_regex": ".*",
            "use_losses_collection": "True",
            "num_steps": "10",
            "diff_percent": "0.1",
            "increase_threshold_percent": "5",
            "mode": "GLOBAL"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

Overfit
This rule detects if your model is being overfit to the training data by comparing the validation and
training losses.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
A standard way to prevent overfitting is to regularize your model.

Parameter Descriptions for the Overfit Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

tensor_regex A list of regex patterns used to restrict this


comparison to specific scalar-valued tensors. The
rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: List of strings or a comma-separated


string

Default value: None

start_step The step from which to start comparing the


validation and training loss.

Optional

Valid values: Integer

Default value: 0

patience The number of steps for which the


ratio_threshold is allowed to exceed the value
set before the model is considered to be overfit.

Optional

Valid values: Integer

Default value: 1

ratio_threshold The maximum ratio of the difference between the


mean validation loss and mean training loss to the
mean training loss. If this threshold is exceeded
for a patience number of steps, the model is
being overfit and the rule returns True.

Optional

Valid values: Float

Default value: 0.1

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overfit(),
        rule_parameters={
            "tensor_regex": ".*",
            "start_step": "0",
            "patience": "1",
            "ratio_threshold": "0.1"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "100",
                    "eval.save_interval": "10"
                }
            )
        ]
    )
]

Overtraining
This rule detects if a model is being overtrained. After a number of training iterations on a well-behaved
model (both training and validation loss decrease), the model approaches a minimum of the loss
function and does not improve anymore. If the model continues training, the validation loss can start
increasing, because the model starts overfitting. This rule sets up thresholds and conditions to determine
if the model is not improving, and prevents overfitting problems due to overtraining.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).
Note
Overtraining can be avoided by early stopping. For information on early stopping, see Stop
Training Jobs Early (p. 1640). For an example that shows how to use spot training with
Debugger, see Enable Spot Training with Amazon SageMaker Debugger.

Parameter Descriptions for the Overtraining Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

patience_train The number of steps to wait before the training


loss is considered to no longer be improving.

Optional

Valid values: Integer

Default value: 5

patience_validation The number of steps to wait before the validation


loss is considered to no longer be improving.

Optional

Valid values: Integer

Default value: 10


delta The minimum amount by which the error must

improve before it is considered a new optimum.

Optional

Valid values: Float

Default value: 0.01

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overtraining(),
        rule_parameters={
            "patience_train": "5",
            "patience_validation": "10",
            "delta": "0.01"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

SimilarAcrossRuns
This rule compares tensors gathered from a base trial with tensors from another trial.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the SimilarAcrossRuns Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

other_trials A completed training job name whose tensors you


want to compare to those tensors gathered from
the current base_trial.

Required

Valid values: String

collection_names The list of collection names whose tensors the


rule inspects.

Optional

Valid values: List of strings or a comma-separated


string

Default value: None

tensor_regex A list of regex patterns used to restrict this


comparison to specific scalar-valued tensors. The
rule inspects only the tensors that match the
regex patterns specified in the list. If no patterns
are passed, the rule compares all tensors gathered
in the trials by default. Only scalar-valued tensors
can be matched.

Optional

Valid values: List of strings or a comma-separated


string

Default value: None

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.similar_across_runs(),
        rule_parameters={
            "other_trials": "<specify-another-job-name>",
            "collection_names": "losses",
            "tensor_regex": ".*"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

StalledTrainingRule
StalledTrainingRule detects if there is no progress being made on a training job, and stops the training
job if the rule fires. This rule requires tensors to be saved periodically within a time interval defined by
its threshold parameter. The rule keeps monitoring for new tensors, and if no new tensor has been
emitted for the threshold interval, the rule fires.


Parameter Descriptions for the StalledTrainingRule Rule

Parameter Name Description

base_trial The base trial training job name. This parameter


is automatically set to the current training job by
Amazon SageMaker Debugger.

Required

Valid values: String

threshold A threshold that defines how much time, in

seconds, the rule waits for a new tensor output
before it fires a stalled training issue. The default
value is 1800 seconds.

Optional

Valid values: Integer

Default value: 1800

stop_training_on_fire If set to True, the rule watches whether the base

training job outputs tensors within "threshold"
seconds and stops the training job when it fires.

Optional

Valid values: Boolean

Default value: False

training_job_name_prefix The prefix of the base training job name. If

stop_training_on_fire is true, the rule
searches for SageMaker training jobs with this
prefix in the same account. If inactivity is found,
the rule takes a StopTrainingJob action. Note
that if multiple jobs are found with the same
prefix, the rule skips termination. It is important
that the prefix is unique to each training job.

Optional

Valid values: String

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
            "threshold": "1800",
            "stop_training_on_fire": "True",
            "training_job_name_prefix": "<specify-training-base-job-name>"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

TensorVariance
This rule detects if you have tensors with very high or low variances. Very high or low variances in a
tensor could lead to neuron saturation, which reduces the learning ability of the neural network. Very
high variance in tensors can also eventually lead to exploding tensors. Use this rule to detect such issues
early.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the TensorVariance Rule

base_trial
    The base trial training job name. This parameter is automatically set to the current
    training job by Amazon SageMaker Debugger.
    Required
    Valid values: String

collection_names
    The list of collection names whose tensors the rule inspects.
    Optional
    Valid values: List of strings or a comma-separated string
    Default value: None

tensor_regex
    A list of regex patterns used to restrict this comparison to specific scalar-valued
    tensors. The rule inspects only the tensors that match the regex patterns specified in the
    list. If no patterns are passed, the rule compares all tensors gathered in the trials by
    default. Only scalar-valued tensors can be matched.
    Optional
    Valid values: List of strings or a comma-separated string
    Default value: None

max_threshold
    The threshold for the upper bound of tensor variance.
    Optional
    Valid values: Float
    Default value: None

min_threshold
    The threshold for the lower bound of tensor variance.
    Optional
    Valid values: Float
    Default value: None

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tensor_variance(),
        rule_parameters={
            "collection_names": "weights",
            "max_threshold": "10",
            "min_threshold": "0.00001",
        },
        collections_to_save=[
            CollectionConfig(
                name="weights",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

UnchangedTensor
This rule detects whether a tensor is no longer changing across steps.

This rule runs the numpy.allclose method to check if the tensor isn't changing.
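
The comparison follows the numpy.allclose semantics, which treats two arrays as equal when
|a - b| <= atol + rtol * |b| holds elementwise. A minimal sketch:

import numpy as np

previous = np.array([0.5, 1.0, -2.0])
current  = np.array([0.5, 1.0, -2.0])  # unchanged across steps

# True here, so the rule would treat the tensor as unchanged at this step.
print(np.allclose(previous, current, rtol=1e-05, atol=1e-08, equal_nan=False))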

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet,
and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or
tensor_regex parameter. If both the parameters are specified, the rule inspects the union of tensors
from both sets.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the UnchangedTensor Rule

base_trial
    The base trial training job name. This parameter is automatically set to the current
    training job by Amazon SageMaker Debugger.
    Required
    Valid values: String

collection_names
    The list of collection names whose tensors the rule inspects.
    Optional
    Valid values: List of strings or a comma-separated string
    Default value: None

tensor_regex
    A list of regex patterns used to restrict this comparison to specific scalar-valued
    tensors. The rule inspects only the tensors that match the regex patterns specified in the
    list. If no patterns are passed, the rule compares all tensors gathered in the trials by
    default. Only scalar-valued tensors can be matched.
    Optional
    Valid values: List of strings or a comma-separated string
    Default value: None

num_steps
    The number of steps across which the rule checks to determine if the tensor has changed.
    The rule checks the last num_steps that are available, which don't need to be consecutive.
    If num_steps is 2, at step s the rule doesn't necessarily check steps s-1 and s; if s-1
    isn't available, it checks the last available step along with s.
    Optional
    Valid values: Integer
    Default value: 3

rtol
    The relative tolerance parameter to be passed to the numpy.allclose method.
    Optional
    Valid values: Float
    Default value: 1e-05

atol
    The absolute tolerance parameter to be passed to the numpy.allclose method.
    Optional
    Valid values: Float
    Default value: 1e-08

equal_nan
    Whether to compare NaNs as equal. If True, NaNs in input array a are considered equal to
    NaNs in input array b in the output array. This parameter is passed to the numpy.allclose
    method.
    Optional
    Valid values: Boolean
    Default value: False

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.unchanged_tensor(),
        rule_parameters={
            "collection_names": "losses",
            "tensor_regex": "",
            "num_steps": "3",
            "rtol": "1e-05",
            "atol": "1e-08",
            "equal_nan": "False"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

CheckInputImages
This rule checks if input images have been correctly normalized. Specifically, it detects if the mean of the
sample data differs by more than a threshold value from zero. Many computer vision models require that
input data has a zero mean and unit variance.
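
For instance, a minimal sketch of the zero-mean check on a hypothetical batch of images
(channel-first layout, as in the default MXNet convention):

import numpy as np

batch = np.random.randint(0, 256, size=(64, 3, 32, 32)).astype("float32")
normalized = (batch / 255.0 - 0.5) / 0.5  # roughly zero mean, unit variance

threshold_mean = 0.2
print(abs(normalized.mean()) > threshold_mean)  # False: inputs look correctly normalized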

This rule is applicable to deep learning applications.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the CheckInputImages Rule

base_trial
    The base trial training job name. This parameter is automatically set to the current
    training job by Amazon SageMaker Debugger.
    Required
    Valid values: String

threshold_mean
    A threshold that defines by how much the mean of the input data can differ from 0.
    Optional
    Valid values: Float
    Default value: 0.2

threshold_samples
    The number of images that have to be sampled before an error can be thrown. If the value
    is too low, the estimation of the dataset mean will be inaccurate.
    Optional
    Valid values: Integer
    Default value: 500

regex
    The name of the input data tensor.
    Optional
    Valid values: String
    Default value: ".*hybridsequential0_input_0" (the name of the input tensor for Apache
    MXNet models using HybridSequential)

channel
    The position of the color channel in the input tensor shape array.
    Optional
    Valid values: Integer
    Default value: 1 (for example, MXNet expects input data in the form of (batch_size,
    channel, height, width))

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.check_input_images(),
        rule_parameters={
            "threshold_mean": "0.2",
            "threshold_samples": "500",
            "regex": ".*hybridsequential0_input_0",
            "channel": "1"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_inputs_collection",
                parameters={
                    "include_regex": ".*hybridsequential0_input_0",
                    "save_interval": "500"
                }
            )
        ]
    )
]


NLPSequenceRatio
This rule calculates the ratio of specific tokens in the input sequence, which is useful for
optimizing performance. For example, you can calculate the percentage of padding end-of-sentence
(EOS) tokens in your input sequence. If the number of EOS tokens is too high, an alternate
bucketing strategy should be performed. You can also calculate the percentage of unknown tokens
in your input sequence. If the number of unknown words is too high, an alternate vocabulary
could be used.

This rule is applicable to deep learning applications.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the NLPSequenceRatio Rule

base_trial
    The base trial training job name. This parameter is automatically set to the current
    training job by Amazon SageMaker Debugger.
    Required
    Valid values: String

tensor_regex
    A list of regex patterns used to restrict this comparison to specific scalar-valued
    tensors. The rule inspects only the tensors that match the regex patterns specified in the
    list. If no patterns are passed, the rule compares all tensors gathered in the trials by
    default. Only scalar-valued tensors can be matched.
    Optional
    Valid values: List of strings or a comma-separated string
    Default value: ".*embedding0_input_0" (assuming an embedding as the initial layer of the
    network)

token_values
    A string of a list of the numerical values of the tokens. For example, "3, 0".
    Optional
    Valid values: Comma-separated string of numerical values
    Default value: 0

token_thresholds_percent
    A string of a list of thresholds (in percentages) that correspond to each of the
    token_values. For example, "50.0, 50.0".
    Optional
    Valid values: Comma-separated string of floats
    Default value: "50"

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.nlp_sequence_ratio(),
        rule_parameters={
            "tensor_regex": ".*embedding0_input_0",
            "token_values": "0",
            "token_thresholds_percent": "50"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_inputs_collection",
                parameters={
                    "include_regex": ".*embedding0_input_0"
                }
            )
        ]
    )
]

Confusion
This rule evaluates the goodness of a confusion matrix for a classification problem.

It creates a matrix of size category_no*category_no and populates it with data coming from
(labels, predictions) pairs. For each (labels, predictions) pair, the count in
confusion[labels][predictions] is incremented by 1. When the matrix is fully populated, the
ratios of on-diagonal values and off-diagonal values are evaluated as follows:

• For elements on the diagonal: confusion[i][i]/sum_j(confusion[j][i])>=min_diag
• For elements off the diagonal: confusion[j][i]/sum_j(confusion[j][i])<=max_off_diag
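
The population step can be sketched as follows, with hypothetical labels and predictions:

import numpy as np

category_no = 3
labels      = [0, 1, 2, 2, 0]  # hypothetical true labels
predictions = [0, 1, 1, 2, 0]  # hypothetical predicted labels

confusion = np.zeros((category_no, category_no), dtype=int)
for l, p in zip(labels, predictions):
    confusion[l][p] += 1  # increment the (label, prediction) cell

print(confusion)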

This rule can be applied to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the Confusion Rule

base_trial
    The base trial training job name. This parameter is automatically set to the current
    training job by Amazon SageMaker Debugger.
    Required
    Valid values: String

category_no
    The number of categories.
    Optional
    Valid values: Integer ≥2
    Default value: "None"

labels
    The labels tensor collection or a 1-d vector of true labels.
    Optional
    Valid values: String
    Default value: "labels"

predictions
    The predictions tensor collection or a 1-d vector of estimated labels.
    Optional
    Valid values: String
    Default value: "predictions"

labels_collection
    The rule inspects the tensors in this collection for labels.
    Optional
    Valid values: String
    Default value: "labels"

predictions_collection
    The rule inspects the tensors in this collection for predictions.
    Optional
    Valid values: String
    Default value: "predictions"

min_diag
    The minimum threshold for the ratio of data on the diagonal.
    Optional
    Valid values: 0≤float≤1
    Default value: 0.9

max_off_diag
    The maximum threshold for the ratio of data off the diagonal.
    Optional
    Valid values: 0≤float≤1
    Default value: 0.1

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.confusion(),
        rule_parameters={
            "category_no": "10",
            "labels": "labels",
            "predictions": "predictions",
            "labels_collection": "labels",
            "predictions_collection": "predictions",
            "min_diag": "0.9",
            "max_off_diag": "0.1"
        },
        collections_to_save=[
            CollectionConfig(
                name="labels",
                parameters={
                    "save_interval": "500"
                }
            ),
            CollectionConfig(
                name="predictions",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

Note
This rule infers default values for the optional parameters if their values aren't specified.

FeatureImportanceOverweight
This rule accumulates the weights of the n largest feature importance values per step and ensures that
they do not exceed the threshold. For example, you can set the threshold for the top 3 features to not
hold more than 80 percent of the total weights of the model.
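
Conceptually, at each step the rule performs a check like the following sketch (the feature
importance values here are hypothetical):

import numpy as np

weights = np.array([0.5, 0.2, 0.15, 0.1, 0.05])  # hypothetical feature importance weights
nfeatures, threshold = 3, 0.8

top_share = np.sort(weights)[::-1][:nfeatures].sum() / weights.sum()
print(top_share, top_share > threshold)  # 0.85 True: the rule would fire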

This rule is valid only for the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the FeatureImportanceOverweight Rule

base_trial
    The base trial training job name. This parameter is automatically set to the current
    training job by Amazon SageMaker Debugger.
    Required
    Valid values: String

threshold
    Defines the threshold for the proportion of the cumulative sum of the n largest features.
    The number n is defined by the nfeatures parameter.
    Optional
    Valid values: Float
    Default value: 0.8

nfeatures
    The number of largest features.
    Optional
    Valid values: Integer
    Default value: 3

tensor_regex
    Regular expression (regex) of tensor names for the rule to analyze.
    Optional
    Valid values: String
    Default value: ".*feature_importance/weight"

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.feature_importance_overweight(),
        rule_parameters={
            "threshold": "0.8",
            "nfeatures": "3",
            "tensor_regex": ".*feature_importance/weight"
        },
        collections_to_save=[
            CollectionConfig(
                name="feature_importance",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

TreeDepth
This rule measures the depth of trees in an XGBoost model. XGBoost rejects splits if they do not improve
loss. This regularizes the training. As a result, the tree might not grow as deep as defined by the depth
parameter.
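
For example, the depth computation the description refers to can be sketched as follows (the
node ID is hypothetical):

import math

largest_node_id = 31          # hypothetical largest node ID in an XGBoost tree
depth = math.log2(largest_node_id)
print(depth)                  # ~4.95; the rule fires if this exceeds the configured depth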

This rule is valid only for the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in
Rules (p. 1678).

Parameter Descriptions for the TreeDepth Rule

base_trial
    The base trial training job name. This parameter is automatically set to the current
    training job by Amazon SageMaker Debugger.
    Required
    Valid values: String

depth
    The depth of the tree. The depth of the tree is obtained by computing the base 2 logarithm
    of the largest node ID.
    Optional
    Valid values: Float
    Default value: 4

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tree_depth(),
        rule_parameters={
            "depth": "4"
        },
        collections_to_save=[
            CollectionConfig(
                name="tree",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]

Create Debugger Custom Rules for Training Job Analysis

You can create custom rules to monitor your training job using the Debugger Rule APIs and the
open source smdebug Python library, which provides tools to build your own rule containers.

Topics
• Prerequisites for Creating Debugger Custom Rules (p. 1793)
• Use the Debugger Client Library smdebug to Create a Custom Rule Python Script (p. 1794)
• Use the Debugger APIs to Run Your Own Custom Rules (p. 1794)

Prerequisites for Creating Debugger Custom Rules


To create Debugger custom rules, you need the following prerequisites.

• The SageMaker Debugger Rule.custom API
• The open source smdebug Python library
• Your own custom rule Python script
• Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators (p. 1814)


Use the Debugger Client Library smdebug to Create a Custom Rule Python Script

The smdebug Rule API provides an interface for setting up your own custom rules. The following
Python script is a sample of how to construct a custom rule, CustomGradientRule. This tutorial
rule watches whether the gradients are getting too large, with a default threshold of 10. The
custom rule takes the base trial that a SageMaker estimator creates when it initiates a training
job.

from smdebug.rules.rule import Rule

class CustomGradientRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(collection="gradients"):
            t = self.base_trial.tensor(tname)
            abs_mean = t.reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False

You can add as many custom rule classes as you want in the same Python script and deploy them to
any training job trial by constructing custom rule objects, as shown in the following section.

Use the Debugger APIs to Run Your Own Custom Rules


The following code sample shows how to configure a custom rule with the Amazon SageMaker Python
SDK. This example assumes that the custom rule script you created in the previous step is located at
'path/to/my_custom_rule.py'.

from sagemaker.debugger import Rule, CollectionConfig

custom_rule = Rule.custom(
    name='MyCustomRule',
    image_uri='759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
    instance_type='ml.t3.medium',
    source='path/to/my_custom_rule.py',
    rule_to_invoke='CustomGradientRule',
    collections_to_save=[CollectionConfig("gradients")],
    rule_parameters={"threshold": "20.0"}
)

The following list explains the Debugger Rule.custom API arguments.

• name (str): Specify a name for your custom rule.
• image_uri (str): The image of the container that has the logic of understanding your custom
rule. It sources and evaluates the specified tensor collections you save in the training job.
You can find the list of open source SageMaker rule evaluator images at Amazon SageMaker
Debugger Registry URLs for Custom Rule Evaluators (p. 1814).
• instance_type (str): The instance type on which to run the rule Docker container. The
instance spins up in parallel with the training container.
• source (str): The local path or the Amazon S3 URI of your custom rule script.
• rule_to_invoke (str): The particular Rule class implementation in your custom rule script.
SageMaker supports only one rule to be evaluated at a time in a rule job.
• collections_to_save (str): The tensor collections to save for the rule to run on.
• rule_parameters (dictionary): Parameter inputs in dictionary format. You can adjust the
parameters that you configured in the custom rule script.

After you set up the custom_rule object, you can use it to build a SageMaker estimator for any
training job. Specify the entry_point for your training script. You do not need to make any
changes to your training script.

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='smdebug-custom-rule-demo-tf-keras',
    entry_point='path/to/your_training_script.py',
    train_instance_type='ml.p2.xlarge',
    ...

    # debugger-specific arguments below
    rules=[custom_rule]
)

estimator.fit()
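
After the training job starts, you can check the status of the rule evaluation job. For example,
with the SageMaker Python SDK:

# Print a summary of the rule evaluation jobs attached to the training job
estimator.latest_training_job.rule_job_summary()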

For more variations and advanced examples of using Debugger custom rules, see the following example
notebooks.

• Monitor your training job with Amazon SageMaker Debugger custom rules
• PyTorch iterative model pruning of ResNet and AlexNet
• Trigger Amazon CloudWatch Events using Debugger Rules to Take an Action Based on Training Status
with TensorFlow

Use Debugger with Custom Training Containers


Amazon SageMaker Debugger is available for any deep learning models that you bring to Amazon
SageMaker. The AWS CLI, SageMaker Estimator API, and the Debugger APIs enable you to use any
Docker base images to build and customize containers to train your models. To use Debugger with
customized containers, you need to make a minimal change to your training script to implement the
Debugger hook callback and retrieve tensors from training jobs.

You need the following resources to build a customized container with Debugger.

• Amazon SageMaker Python SDK


• The SMDebug open source client library
• A Docker base image of your choice
• Your training script with a Debugger hook registered – For more information about registering a
Debugger hook to your training script, see Register Debugger Hook to Your Training Script (p. 1796).

For an end-to-end example of using Debugger with a custom training container, see the following
example notebook.

• Build a Custom Training Container and Debug Training Jobs with Debugger


Tip
This custom container with Debugger guide is an extension of the Adapting your own training
container (p. 2686) guide, which walks you through how to build and push your custom
training container to Amazon ECR.

Prepare to Build a Custom Training Container

To build a Docker container, the basic structure of files should look like the following:

├── debugger_custom_container_test_notebook.ipynb  # a notebook to run Python snippet codes
└── debugger_custom_container_test_folder          # this is a Docker folder
    ├── your-training-script.py                    # your training script with a Debugger hook
    └── Dockerfile                                 # a Dockerfile to build your own container

Register Debugger Hook to Your Training Script


To debug your model training, you need to add a Debugger hook to your training script.
Note
This step is required to collect model parameters (output tensors) for debugging your model
training. If you only want to monitor and profile, you can skip this hook registration step and
exclude the debugger_hook_config parameter when constructing an estimator.

The following example code shows the structure of a training script using the Keras ResNet50 model and
how to pass the Debugger hook as a Keras callback for debugging. To find a complete training script, see
TensorFlow training script with SageMaker Debugger hook.

# An example of training script (your-training-script.py)
import argparse

import tensorflow.compat.v2 as tf
from tensorflow.keras.applications.resnet50 import ResNet50
import smdebug.tensorflow as smd

def train(batch_size, epoch, model, hook):
    ...
    model.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epoch,
              validation_data=(X_valid, Y_valid),
              shuffle=True,
              # smdebug modification: Pass the Debugger hook in the main() as a Keras callback
              callbacks=[hook])

def main():
    parser = argparse.ArgumentParser(description="Train resnet50 cifar10")

    # hyperparameter settings
    parser.add_argument(...)

    args = parser.parse_args()

    model = ResNet50(weights=None, input_shape=(32,32,3), classes=10)

    # Add the following line to register the Debugger hook for Keras.
    hook = smd.KerasHook.create_from_json_file()

    # Start the training.
    train(args.batch_size, args.epoch, model, hook)

if __name__ == "__main__":
    main()

For more information about registering the Debugger hook for the supported frameworks and
algorithms, see the following links in the SMDebug client library. A minimal PyTorch sketch
follows this list.

• SMDebug TensorFlow hook
• SMDebug PyTorch hook
• SMDebug MXNet hook
• SMDebug XGBoost hook
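
For instance, a minimal sketch of registering the SMDebug hook in a PyTorch training script (the
model and loss function below are placeholders):

import torch.nn as nn
import smdebug.pytorch as smd

model = nn.Linear(10, 2)           # placeholder model
criterion = nn.CrossEntropyLoss()  # placeholder loss function

# Create the hook from the JSON config that SageMaker writes into the training container
hook = smd.Hook.create_from_json_file()
hook.register_module(model)   # collect tensors from the model
hook.register_loss(criterion) # collect loss values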

In the following example notebooks' training scripts, you can find more examples about how to add the
Debugger hooks to training scripts and collect output tensors in detail:

• Debugger in script mode with the TensorFlow 2.1 framework

To see the difference between using Debugger in a Deep Learning Container and in script mode, open
this notebook and put it and the previous Debugger in a Deep Learning Container TensorFlow v2.1
notebook example side by side.

In script mode, the hook configuration part is removed from the script in which you set the estimator.
Instead, the Debugger hook feature is merged into the training script, TensorFlow Keras ResNet
training script in script mode. The training script imports the smdebug library in the required
TensorFlow Keras environment to communicate with the TensorFlow ResNet50 algorithm. It also
manually implements the smdebug hook functionality by adding the callbacks=[hook] argument
inside the train function (in line 49), and by adding the manual hook configuration (in line 89)
provided through SageMaker Python SDK.

This script mode example runs the training job in the TF 2.1 framework for direct comparison with
the zero script change in the TF 2.1 example. The benefit of setting up Debugger in script mode is the
flexibility to choose framework versions not covered by AWS Deep Learning Containers.
• Using Amazon SageMaker Debugger in a PyTorch Container in Script Mode

This notebook enables Debugger in script mode in PyTorch v1.3.1 framework. PyTorch v1.3.1 is
supported by SageMaker containers, and this example shows details of how to modify a training script.

The SageMaker PyTorch estimator is already in script mode by default. In the notebook, the line to
activate script_mode is not included in the estimator configuration.

This notebook shows detailed steps to change an original PyTorch training script to a modified
version with Debugger enabled. Additionally, this example shows how you can use Debugger built-in
rules to detect training issues such as the vanishing gradients problem, and the Debugger trial features
to call and analyze the saved tensors.

Create and Configure a Dockerfile


Open your SageMaker JupyterLab and create a new folder, debugger_custom_container_test_folder
in this example, to save your training script and Dockerfile. The following code example is a
Dockerfile that includes the essential docker build commands. Paste the following code into the
Dockerfile text file and save it. Upload your training script to the same folder.


# Specify a docker base image
FROM tensorflow/tensorflow:2.2.0rc2-gpu-py3
RUN /usr/bin/python3 -m pip install --upgrade pip
RUN pip install --upgrade protobuf

# Install required packages to enable the SageMaker Python SDK and the smdebug library
RUN pip install sagemaker-training
RUN pip install smdebug
CMD ["bin/bash"]

If you want to use a pre-built AWS Deep Learning Container image, see Available AWS Deep Learning
Containers Images.

Build and Push the Custom Training Container to Amazon ECR

Create a test notebook, debugger_custom_container_test_notebook.ipynb, and run the following
code in a notebook cell. This builds the Docker image from your Docker folder with the specified
ecr_repository name and pushes the container image to your Amazon ECR.

import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-debugger-mnist-byoc-tf2'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $byoc_image_uri
!docker push $byoc_image_uri

Tip
If you use one of the AWS Deep Learning Container base images, run the following code to log
in to Amazon ECR and access to the Deep Learning Container image repository.

! aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

Run and Debug Training Jobs Using the Custom Training Container

After you build and push your Docker container to Amazon ECR, configure a SageMaker estimator
with your training script and the Debugger-specific parameters. After you execute
estimator.fit(), Debugger collects output tensors, monitors them, and detects training issues.
Using the saved tensors, you can further analyze the training job with the smdebug core features
and tools. By configuring a Debugger rule monitoring workflow with Amazon CloudWatch Events and
AWS Lambda, you can automate stopping the training job whenever the Debugger rules spot training
issues.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import (
    Rule, ProfilerRule, DebuggerHookConfig, ProfilerConfig, CollectionConfig, rule_configs
)

profiler_config = ProfilerConfig(...)
debugger_hook_config = DebuggerHookConfig(...)
rules = [
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator = Estimator(
    image_uri=byoc_image_uri,
    entry_point="./debugger_custom_container_test_folder/your-training-script.py",
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-custom-container-test',
    instance_count=1,
    instance_type='ml.p3.2xlarge',

    # Debugger-specific parameters
    profiler_config=profiler_config,
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

# start training
estimator.fit()

Configure Debugger Using Amazon SageMaker API


The preceding topics focus on using Debugger through Amazon SageMaker Python SDK, which is a
wrapper around AWS SDK for Python (Boto3) and SageMaker API operations. This offers a high-level
experience of accessing the Amazon SageMaker API operations. In case you need to manually configure
the SageMaker API operations using AWS Boto3 or AWS Command Line Interface (CLI) for other SDKs,
such as Java, Go, and C++, this section covers how to configure the following low-level API operations.

Topics
• JSON (AWS CLI) (p. 1799)
• AWS Boto3 (p. 1804)

JSON (AWS CLI)


Amazon SageMaker Debugger built-in rules can be configured for a training job using the
DebugHookConfig, DebugRuleConfiguration, ProfilerConfig, and ProfilerRuleConfiguration objects
through the SageMaker CreateTrainingJob API operation. You need to specify the right image URI in the
RuleEvaluatorImage parameter, and the following examples walk you through how to set up the
JSON strings to request CreateTrainingJob.

The following code shows a complete JSON template to run a training job with required settings and
Debugger configurations. Save the template as a JSON file in your working directory and run the training
job using AWS CLI. For example, save the following code as debugger-training-job-cli.json.
Note
Ensure that you use the correct Docker container images. To find AWS Deep Learning Container
images, see Available Deep Learning Containers Images. To find a complete list of available
Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-in or
Custom Rules (p. 1813).


"TrainingJobName": "debugger-aws-cli-test",
"RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-
YYYYMMDDT123456",
"AlgorithmSpecification": {
// Specify a training Docker container image URI (Deep Learning Container or your own
training container) to TrainingImage.
"TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-
training:2.4.1-gpu-py37-cu110-ubuntu18.04",
"TrainingInputMode": "File",
"EnableSageMakerMetricsTimeSeries": false
},
"HyperParameters": {
"sagemaker_program": "entry_point/tf-hvd-train.py",
"sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-
profiling-test/source.tar.gz"
},
"OutputDataConfig": {
"S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
},
"DebugHookConfig": {
"S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-
output",
"CollectionConfigurations": [
{
"CollectionName": "losses",
"CollectionParameters" : {
"train.save_interval": "50"
}
}
]
},
"DebugRuleConfigurations": [
{
"RuleConfigurationName": "LossNotDecreasing",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
}
],
"ProfilerConfig": {
"S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/
profiler-output",
"ProfilingIntervalInMilliseconds": 500,
"ProfilingParameters": {
"DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex
\": \".*\", }",
"DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, }",
"PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\":
\"cprofile\", \"cProfileTimer\": \"total_time\"}",
"LocalPath": "/opt/ml/output/profiler/"
}
},
"ProfilerRuleConfigurations": [
{
"RuleConfigurationName": "ProfilerReport",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {"rule_to_invoke": "ProfilerReport"}
}
],
"ResourceConfig": {
"InstanceType": "ml.p3.8xlarge",
"InstanceCount": 1,
"VolumeSizeInGB": 30
},

1800
Amazon SageMaker Developer Guide
Configure Debugger Using SageMaker API

"StoppingCondition": {
"MaxRuntimeInSeconds": 86400
}
}

After saving the JSON file, run the following command in your terminal. (Use ! at the beginning of the
line if you use a Jupyter notebook.)

aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json

To configure a Debugger rule for debugging model parameters


The following code samples show how to configure a built-in VanishingGradient rule using this
SageMaker API.

To enable Debugger to collect output tensors

Specify the Debugger hook configuration as follows:

"DebugHookConfig": {
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
"CollectionConfigurations": [
{
"CollectionName": "gradients",
"CollectionParameters" : {
"save_interval": "500"
}
}
]
}

This will make the training job save the tensor collection, gradients, every save_interval of 500
steps. To find available CollectionName values, see Debugger Built-in Collections in the SMDebug
client library documentation. To find available CollectionParameters parameter keys and values, see
the sagemaker.debugger.CollectionConfig class in the SageMaker Python SDK documentation.

To enable Debugger rules for debugging the output tensors

The following DebugRuleConfigurations API example shows how to run the built-in
VanishingGradient rule on the saved gradients collection.

"DebugRuleConfigurations": [
{
"RuleConfigurationName": "VanishingGradient",
"RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {
"rule_to_invoke": "VanishingGradient",
"threshold": "20.0"
}
}
]

With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training
job using the VanishingGradient rule on the collection of gradients tensor. To find a complete list
of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-in or
Custom Rules (p. 1813). To find the key-value pairs for RuleParameters, see List of Debugger Built-in
Rules (p. 1748).


To configure a Debugger built-in rule for profiling system and framework metrics

The following example code shows how to specify the ProfilerConfig API operation to enable
collecting system and framework metrics.

To enable Debugger profiling to collect system and framework metrics

Target Step

"ProfilerConfig": {
// Optional. Path to an S3 bucket to save profiling outputs
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output",
// Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1
second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
"ProfilingIntervalInMilliseconds": 500,
"ProfilingParameters": {
"DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3,
\"MetricsRegex\": \".*\" }",
"DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
// For PythonProfilingConfig,
// available ProfilerName options: cProfile, Pyinstrument
// available cProfileTimer options only when using cProfile: cpu, off_cpu,
total_time
"PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName
\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
// Optional. Local path for profiling outputs
"LocalPath": "/opt/ml/output/profiler/"
}
}

Target Time Duration

"ProfilerConfig": {
// Optional. Path to an S3 bucket to save profiling outputs
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output",
// Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1
second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
"ProfilingIntervalInMilliseconds": 500,
"ProfilingParameters": {
"DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789,
\"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
"DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789,
\"DurationInSeconds\": 10 }",
// For PythonProfilingConfig,
// available ProfilerName options: cProfile, Pyinstrument
// available cProfileTimer options only when using cProfile: cpu, off_cpu,
total_time
"PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789,
\"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\":
\"total_time\" }",
// Optional. Local path for profiling outputs
"LocalPath": "/opt/ml/output/profiler/"
}
}

To enable Debugger rules for profiling the metrics

The following example code shows how to configure the ProfilerReport rule.

"ProfilerRuleConfigurations": [

1802
Amazon SageMaker Developer Guide
Configure Debugger Using SageMaker API

{
"RuleConfigurationName": "ProfilerReport",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {
"rule_to_invoke": "ProfilerReport",
"CPUBottleneck_cpu_threshold": "90",
"IOBottleneck_threshold": "90"
}
}
]

To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).

Update Debugger Profiling Configuration Using the UpdateTrainingJob API Operation

Debugger profiling configuration can be updated while your training job is running by using the
UpdateTrainingJob API operation. Configure new ProfilerConfig and ProfilerRuleConfiguration
objects, and specify the training job name in the TrainingJobName parameter.

{
    "ProfilerConfig": {
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": {
            "string" : "string"
        }
    },
    "ProfilerRuleConfigurations": [
        {
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": {
                "string" : "string"
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}
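
Assuming you fill in this template and save it as a JSON file (the file name below is
hypothetical), you can apply the update to the running job with the AWS CLI:

aws sagemaker update-training-job --cli-input-json file://update-debugger-training-job-cli.json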

Add Debugger Custom Rule Configuration to the CreateTrainingJob API Operation

A custom rule can be configured for a training job using the DebugHookConfig and
DebugRuleConfigurations objects in the CreateTrainingJob API operation. The following code
sample shows how to configure a custom ImproperActivation rule written with the smdebug
library using this SageMaker API operation. This example assumes that you've written the custom
rule in the custom_rules.py file and uploaded it to an Amazon S3 bucket. The example provides
pre-built Docker images that you can use to run your custom rules. These are listed at Amazon
SageMaker Debugger Registry URLs for Custom Rule Evaluators (p. 1814). You specify the URL
registry address for the pre-built Docker image in the RuleEvaluatorImage parameter.

"DebugHookConfig": {
"S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
"CollectionConfigurations": [
{
"CollectionName": "relu_activations",

1803
Amazon SageMaker Developer Guide
Configure Debugger Using SageMaker API

"CollectionParameters": {
"include_regex": "relu",
"save_interval": "500",
"end_step": "5000"
}
}
]
},
"DebugRulesConfigurations": [
{
"RuleConfigurationName": "improper_activation_job",
"RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-
debugger-rule-evaluator:latest",
"InstanceType": "ml.c4.xlarge",
"VolumeSizeInGB": 400,
"RuleParameters": {
"source_s3_uri": "s3://bucket/custom_rules.py",
"rule_to_invoke": "ImproperActivation",
"collection_names": "relu_activations"
}
}
]

To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).

AWS Boto3
Amazon SageMaker Debugger built-in rules can be configured for a training job using the
create_training_job() function of the AWS Boto3 SageMaker client. You need to specify the right
image URI in the RuleEvaluatorImage parameter, and the following examples walk you through how
to set up the request body for the create_training_job() function.

The following code shows a complete example of how to configure Debugger for the
create_training_job() request body and start a training job in us-west-2, assuming that a
training script entry_point/train.py is prepared using TensorFlow. To find an end-to-end example
notebook, see Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker
Debugger (Boto3).
Note
Ensure that you use the correct Docker container images. To find available AWS Deep Learning
Container images, see Available Deep Learning Containers Images. To find a complete list of
available Docker images for using the Debugger rules, see Use Debugger Docker Images for
Built-in or Custom Rules (p. 1813).

import sagemaker, boto3
import datetime, tarfile

# Start setting up a SageMaker session and a Boto3 SageMaker client
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()

# Upload a training script to a default Amazon S3 bucket of the current SageMaker session
source = 'source.tar.gz'
project = 'debugger-boto3-test'

tar = tarfile.open(source, 'w:gz')
tar.add('entry_point/train.py') # Specify the directory and name of your training script
tar.close()

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)

# Set up a Boto3 session client for SageMaker
sm = boto3.Session(region_name=region).client("sagemaker")

# Start a training job
sm.create_training_job(
    TrainingJobName='debugger-boto3-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
        'sagemaker_submit_directory': 's3://'+bucket+'/'+project+'/'+source,
        'sagemaker_program': '/entry_point/train.py' # training script file location and name under the sagemaker_submit_directory
    },
    AlgorithmSpecification={
        # Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
        'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn='arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20201014T161125',
    OutputDataConfig={'S3OutputPath': 's3://'+bucket+'/'+project+'/output'},
    ResourceConfig={
        'InstanceType': 'ml.p3.8xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    DebugHookConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': 'losses',
                'CollectionParameters' : {
                    'train.save_interval': '500',
                    'eval.save_interval': '50'
                }
            }
        ]
    },
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': 'LossNotDecreasing',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'LossNotDecreasing'}
        }
    ],
    ProfilerConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/profiler-output',
        'ProfilingIntervalInMilliseconds': 500,
        'ProfilingParameters': {
            'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
            'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
            'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
        }
    },
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'ProfilerReport',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
        }
    ]
)

To configure a Debugger rule for debugging model parameters


The following code samples show how to configure a built-in VanishingGradient rule using this
SageMaker API.

To enable Debugger to collect output tensors

Specify the Debugger hook configuration as follows:

DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'gradients',
            'CollectionParameters' : {
                'train.save_interval': '500',
                'eval.save_interval': '50'
            }
        }
    ]
}

This will make the training job save a tensor collection, gradients, every save_interval of 500 steps.
To find available CollectionName values, see Debugger Built-in Collections in the SMDebug client
library documentation. To find available CollectionParameters parameter keys and values, see the
sagemaker.debugger.CollectionConfig class in the SageMaker Python SDK documentation.

To enable Debugger rules for debugging the output tensors

The following DebugRuleConfigurations API example shows how to run the built-in
VanishingGradient rule on the saved gradients collection.

DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'VanishingGradient',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'VanishingGradient',
            'threshold': '20.0'
        }
    }
]

With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training
job using the VanishingGradient rule on the collection of gradients tensor. To find a complete list
of available Docker images for using the Debugger rules, see Use Debugger Docker Images for Built-in or
Custom Rules (p. 1813). To find the key-value pairs for RuleParameters, see List of Debugger Built-in
Rules (p. 1748).


To configure a Debugger built-in rule for profiling system and framework metrics

The following example code shows how to specify the ProfilerConfig API operation to enable
collecting system and framework metrics.

To enable Debugger profiling to collect system and framework metrics

Target Step

ProfilerConfig={
    # Optional. Path to an S3 bucket to save profiling outputs
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output',
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500,
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
        # For PythonProfilingConfig, available ProfilerName options: cprofile, pyinstrument.
        # Include cProfileTimer only when using cprofile. Available options: cpu, off_cpu, total_time
        'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        # Optional. Local path for profiling outputs
        'LocalPath': '/opt/ml/output/profiler/'
    }
}

Target Time Duration

ProfilerConfig={
    # Optional. Path to an S3 bucket to save profiling outputs
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output',
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500,
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10}',
        # For PythonProfilingConfig, available ProfilerName options: cprofile, pyinstrument.
        # Include cProfileTimer only when using cprofile. Available options: cpu, off_cpu, total_time
        'PythonProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        # Optional. Local path for profiling outputs
        'LocalPath': '/opt/ml/output/profiler/'
    }
}

To enable Debugger rules for profiling the metrics

The following example code shows how to configure the ProfilerReport rule.

ProfilerRuleConfigurations=[
    {
        'RuleConfigurationName': 'ProfilerReport',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'CPUBottleneck_cpu_threshold': '90',
            'IOBottleneck_threshold': '90'
        }
    }
]

To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).

Update Debugger Profiling Configuration Using the UpdateTrainingJob API Operation

Debugger profiling configuration can be updated while your training job is running by using the
update_training_job() function of the AWS Boto3 SageMaker client. Configure new ProfilerConfig
and ProfilerRuleConfiguration objects, and specify the training job name in the TrainingJobName
parameter.

ProfilerConfig={
    'DisableProfiler': boolean,
    'ProfilingIntervalInMilliseconds': number,
    'ProfilingParameters': {
        'string' : 'string'
    }
},
ProfilerRuleConfigurations=[
    {
        'RuleConfigurationName': 'string',
        'RuleEvaluatorImage': 'string',
        'RuleParameters': {
            'string' : 'string'
        }
    }
],
TrainingJobName='your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS'
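
For example, a minimal sketch that disables profiling on a running job (the job name below is
hypothetical):

import boto3

sm = boto3.client("sagemaker")

# Turn off Debugger profiling for an in-progress training job
sm.update_training_job(
    TrainingJobName="your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS",
    ProfilerConfig={"DisableProfiler": True}
)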

Add Debugger Custom Rule Configuration to the CreateTrainingJob API Operation

A custom rule can be configured for a training job using the DebugHookConfig and
DebugRuleConfigurations objects with the AWS Boto3 SageMaker client's create_training_job()
function. The following code sample shows how to configure a custom ImproperActivation rule
written with the smdebug library using this SageMaker API operation. This example assumes that
you've written the custom rule in the custom_rules.py file and uploaded it to an Amazon S3
bucket. The example provides pre-built Docker images that you can use to run your custom rules.
These are listed at Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators
(p. 1814). You specify the URL registry address for the pre-built Docker image in the
RuleEvaluatorImage parameter.

DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'relu_activations',
            'CollectionParameters': {
                'include_regex': 'relu',
                'save_interval': '500',
                'end_step': '5000'
            }
        }
    ]
},
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'improper_activation_job',
        'RuleEvaluatorImage': '552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
        'InstanceType': 'ml.c4.xlarge',
        'VolumeSizeInGB': 400,
        'RuleParameters': {
            'source_s3_uri': 's3://bucket/custom_rules.py',
            'rule_to_invoke': 'ImproperActivation',
            'collection_names': 'relu_activations'
        }
    }
]

To find a complete list of available Docker images for using the Debugger rules, see Use Debugger
Docker Images for Built-in or Custom Rules (p. 1813). To find the key-value pairs for RuleParameters,
see List of Debugger Built-in Rules (p. 1748).

Best Practices for Amazon SageMaker Debugger


Use the following guidelines when you run training jobs with Debugger.

Topics
• Choose a Machine Learning Framework (p. 1810)
• Use Studio Debugger Insights Dashboard (p. 1810)
• Download Debugger Reports and Gain More Insights (p. 1810)
• Capture Data from Your Training Job and Save Data to Amazon S3 (p. 1810)
• Analyze the Data with a Fleet of Debugger Built-in Rules (p. 1810)
• Take Actions Based on the Built-in Rule Status (p. 1810)
• Dive Deep into the Data Using the SMDebug Client Library (p. 1811)
• Monitor and Analyze Training Job Metrics (p. 1811)
• Monitoring System Utilization and Detecting Bottlenecks (p. 1811)
• Profiling Framework Operations (p. 1811)
• Debugging Model Output Tensors (p. 1812)


Choose a Machine Learning Framework


You can choose a machine learning framework and use SageMaker pre-built training containers or your
own containers. Use Debugger to detect training and performance issues, and analyze training progress
of your training job in SageMaker. SageMaker provides you options to use pre-built containers that are
prepared for a number of machine learning framework environments to train your model on Amazon
EC2. Any training job can be adapted to run in AWS Deep Learning Containers, SageMaker training
containers, and custom containers.

Use Studio Debugger Insights Dashboard

With the Studio Debugger insights dashboard, you are in control of your training jobs. Use the
Studio Debugger dashboards to keep the performance of your model on Amazon EC2 instances under
control and optimized. For any SageMaker training job running on an Amazon EC2 instance,
Debugger monitors resource utilization and basic model output data (loss and accuracy values).
Through the Studio Debugger dashboards, gain insights into your training jobs and improve your
model training performance. To learn more, see Amazon SageMaker Debugger UI in Amazon SageMaker
Studio Experiments (p. 1721).

Download Debugger Reports and Gain More Insights


You can view aggregated results and gain insights in Debugger reports. Debugger aggregates training
and profiling results collected from the built-in rule analysis into a report per training job. You can find
more detailed information about your training results through the Debugger reports. To learn more, see
SageMaker Debugger Interactive Report (p. 1729).

Capture Data from Your Training Job and Save Data to Amazon
S3
You can use a Debugger hook to save output tensors. After you choose a container and a framework that
fit your training script, use a Debugger hook to configure which tensors to save and to which directory
to save them, such as an Amazon S3 bucket. A Debugger hook helps you to build the configuration and to
keep it in your account to use in subsequent analyses, where it is secured for use with the most privacy-
sensitive applications. To learn more, see Configure SageMaker Debugger to Save Tensors (p. 1672).

Analyze the Data with a Fleet of Debugger Built-in Rules


You can use Debugger built-in rules to inspect tensors in parallel with a training job. To analyze the
training performance data, Debugger provides built-in rules that watch for abnormal training process
behaviors. For example, a Debugger rule detects issues when the training process suffers from system
bottleneck issues or training issues, such as vanishing gradients, exploding tensors, overfitting, or
overtraining. If necessary, you can also build customized rules by creating a rule definition with your own
criteria to define a training issue. To learn more about the Debugger rules, see Configure Debugger Built-
in Rules (p. 1678) for detailed instructions of using the Amazon SageMaker Python SDK. For a full list of
the Debugger built-in rules, see List of Debugger Built-in Rules (p. 1748). If you want to create a custom
rule, see Create Debugger Custom Rules for Training Job Analysis (p. 1793).

Take Actions Based on the Built-in Rule Status


You can use Debugger with Amazon CloudWatch Events and AWS Lambda. You can automate actions
based on the rule status, such as stopping training jobs early and setting up notifications through email
or text. When the Debugger rules detect problems and trigger an "IssuesFound" evaluation status,
CloudWatch Events detects the rule status changes and invokes the Lambda function to take actions.
To configure automated actions to your training issues, see Create Actions on Rules Using Amazon
CloudWatch and AWS Lambda (p. 1702).


Dive Deep into the Data Using the SMDebug Client Library
You can use the SMDebug tools to access and analyze training data collected by Debugger. The
TrainingJob and create_trial classes load the metrics and tensors saved by Debugger. These
classes provide extended class methods to analyze the data in real time or after the training has
finished. The SMDebug library also provides visualization tools: merged timelines of framework
metrics to aggregate different profiling information, line charts and heatmaps to track system
utilization, and histograms to find step duration outliers. To learn more about the SMDebug
library tools, see Analyze Data Using the
SMDebug Client Library (p. 1740).

Monitor and Analyze Training Job Metrics


Amazon CloudWatch supports high-resolution custom metrics, and its finest resolution is 1 second.
However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second
frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about
the resolution and the lifespan of the CloudWatch metrics, see GetMetricStatistics in the Amazon
CloudWatch API Reference.

If you want to profile your training job with a finer resolution down to 100-millisecond (0.1 second)
granularity and store the training metrics indefinitely in Amazon S3 for custom analysis at any
time, consider using Amazon SageMaker Debugger. SageMaker Debugger provides built-in rules to
automatically detect common training issues; it detects hardware resource utilization issues (such as
CPU, GPU, and I/O bottlenecks) and non-converging model issues (such as overfit, vanishing gradients,
and exploding tensors).

SageMaker Debugger also provides visualizations through Studio and its profiling report. Unlike
CloudWatch metrics, which accumulate resource utilization rates of CPU and GPU cores and average
them out across multiple instances, Debugger tracks the utilization rate of each core. This enables you to
identify unbalanced usage of hardware resources as you scale up to larger compute clusters. To explore
the Debugger visualizations, see SageMaker Debugger Insights Dashboard Walkthrough, Debugger
Profiling Report Walkthrough, and Analyze Data Using the SMDebug Client Library.

Monitoring System Utilization and Detecting Bottlenecks


With Amazon SageMaker Debugger monitoring, you can measure hardware system resource utilization
of Amazon EC2 instances. Monitoring is available for any SageMaker training job constructed with
the SageMaker framework estimators (TensorFlow, PyTorch, and MXNet) and the generic SageMaker
estimator (SageMaker built-in algorithms and your own custom containers). Debugger built-in rules for
monitoring detect system bottlenecks and notify you when they find them.
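
As a minimal sketch, system monitoring is controlled through the profiler configuration of a SageMaker
Python SDK estimator; the 500-millisecond interval below is just an example value:

from sagemaker.debugger import ProfilerConfig

# Collect system metrics (CPU, GPU, network, and I/O) every 500 milliseconds.
profiler_config = ProfilerConfig(system_monitor_interval_millis=500)

# estimator = Estimator(..., profiler_config=profiler_config)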

To learn how to enable Debugger system monitoring, see Configure Debugger Using Amazon SageMaker
Python SDK (p. 1710) and then Configure Debugger for Monitoring Resource Utilization (p. 1714).

For a full list of available built-in rules for monitoring, see Debugger built-in rules for profiling hardware
system resource utilization (system metrics) (p. 1749).

Profiling Framework Operations


With Amazon SageMaker Debugger profiling, you can profile deep learning framework operations. You
can profile your model training with the SageMaker TensorFlow training containers, the SageMaker
PyTorch framework containers, and your own training containers. Using the profiling feature of
Debugger, you can drill down into the Python operators and functions that are executed to perform
the training job. Debugger supports detailed profiling, Python profiling, data loader profiling, and
Horovod distributed training profiling. You can merge the profiled timelines to correlate them with the
system bottlenecks. Debugger built-in rules for profiling watch for framework operation related issues,
including excessive training initialization time due to data downloading before training starts and step
duration outliers in training loops.
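
For example, the following sketch extends the profiler configuration with framework profiling; the step
range is illustrative:

from sagemaker.debugger import FrameworkProfile, ProfilerConfig

# Profile framework operations for 10 steps starting at step 5 (example values),
# in addition to system monitoring.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10),
)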


To learn how to configure Debugger for framework profiling, see Configure Debugger Using Amazon
SageMaker Python SDK (p. 1710) and then Configure Debugger for Framework Profiling (p. 1714).

For a complete list of available built-in rules for profiling, see Debugger built-in rules for profiling
framework metrics (p. 1749).

Debugging Model Output Tensors


Debugging is available for deep learning frameworks using AWS Deep Learning Containers and the
SageMaker training containers. For fully supported framework versions (see the versions at Supported
Frameworks and Algorithms (p. 1650)), Debugger automatically registers hooks to collect output
tensors, and you can directly run your training script. For the versions marked with one asterisk, you
need to manually register the hooks to collect tensors. Debugger provides preconfigured tensor
collections with generalized names that you can use across the different frameworks. If you want to
customize the output tensor configuration, you can also use the CollectionConfig and DebuggerHookConfig
API operations and the Amazon SageMaker Python SDK to configure your own tensor collections.
Debugger built-in rules for debugging analyze the output tensors and identify model optimization
problems that block your model from minimizing the loss function. For example, the rules identify
overfitting, overtraining, loss not decreasing, exploding tensors, and vanishing gradients.
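
For framework versions that require manual registration, the following is a minimal PyTorch sketch; the
hook reads its configuration from the JSON file that SageMaker places in the training container, and
model and criterion are placeholders for your own modules:

import smdebug.pytorch as smd

# Build a hook from the Debugger configuration injected into the container,
# then register the model (and optionally the loss) to collect output tensors.
hook = smd.Hook.create_from_json_file()
hook.register_module(model)    # model: your torch.nn.Module
hook.register_loss(criterion)  # criterion: your loss module (optional)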

To learn how to configure Debugger for debugging output tensors, see Step 2: Launch and Debug
Training Jobs Using SageMaker Python SDK (p. 1669) and then Configure SageMaker Debugger to Save
Tensors (p. 1672).

For a full list of available built-in rules for debugging, see Debugger built-in rules for debugging model
training data (output tensors) (p. 1750).

Amazon SageMaker Debugger Advanced Topics and Reference Documentation
The following sections contain advanced topics, reference documentation for the API operations,
exceptions, and known limitations for Debugger.

Topics
• Amazon SageMaker Debugger API Operations (p. 1812)
• Use Debugger Docker Images for Built-in or Custom Rules (p. 1813)
• Amazon SageMaker Debugger Exceptions (p. 1815)
• Considerations for Amazon SageMaker Debugger (p. 1816)
• Amazon SageMaker Debugger Usage Statistics (p. 1818)

Amazon SageMaker Debugger API Operations


Amazon SageMaker Debugger has API operations in several locations that are used to implement its
monitoring and analysis of model training.

Amazon SageMaker Debugger also provides the open source sagemaker-debugger Python SDK that
is used to configure built-in rules, define custom rules, and register hooks to collect output tensor data
from training jobs.

The Amazon SageMaker Python SDK is a high-level SDK focused on machine learning experimentation.
The SDK can be used to deploy built-in or custom rules defined with the SMDebug Python library to
monitor and analyze these tensors using SageMaker estimators.

Debugger has added operations and types to the Amazon SageMaker API that enable the platform to
use Debugger when training a model and to manage the configuration of inputs and outputs.


• CreateTrainingJob and UpdateTrainingJob use the following Debugger APIs to configure tensor
collections, rules, rule images, and profiling options:
• CollectionConfiguration
• DebugHookConfig
• DebugRuleConfiguration
• TensorBoardOutputConfig
• ProfilerConfig
• ProfilerRuleConfiguration
• DescribeTrainingJob provides a full description of a training job, including the following Debugger
configurations and rule evaluation statuses:
• DebugHookConfig
• DebugRuleConfiguration
• DebugRuleEvaluationStatus
• ProfilerConfig
• ProfilerRuleConfiguration
• ProfilerRuleEvaluationStatus

The rule configuration API operations use the SageMaker Processing functionality when analyzing
model training. For more information about SageMaker Processing, see Process Data (p. 1196).

Use Debugger Docker Images for Built-in or Custom Rules


Amazon SageMaker provides two sets of Docker images for rules: one set for evaluating rules provided
by SageMaker (built-in rules) and one set for evaluating custom rules provided in Python source files.

If you use the Amazon SageMaker Python SDK, you can simply use the SageMaker high-level Debugger
API operations with the SageMaker Estimator API operations, without having to manually retrieve the
Debugger Docker images and configure the CreateTrainingJob API.

If you are not using the SageMaker Python SDK, you have to retrieve a relevant pre-built container base
image for the Debugger rules. Amazon SageMaker Debugger provides pre-built Docker images for built-
in and custom rules, and the images are stored in Amazon Elastic Container Registry (Amazon ECR). To
pull an image from an Amazon ECR repository (or to push an image to one), specify the image's full
registry URL in the CreateTrainingJob API. SageMaker uses the following URL pattern for the
Debugger rule container image registry addresses.

<account_id>.dkr.ecr.<Region>.amazonaws.com/<ECR repository name>:<tag>

For the account ID in each AWS Region, Amazon ECR repository name, and tag value, see the following
topics.

Topics
• Amazon SageMaker Debugger Registry URLs for Built-in Rule Evaluators (p. 1813)
• Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators (p. 1814)

Amazon SageMaker Debugger Registry URLs for Built-in Rule Evaluators


Use the following values for the components of the registry URLs for the images that provide built-in
rules for Amazon SageMaker Debugger. For account IDs, see the following table.

ECR Repository Name: sagemaker-debugger-rules


Tag: latest

Example of a full registry URL:

904829902805.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rules:latest

Account IDs for Built-in Rules Container Images by AWS Region

Region account_id

af-south-1 314341159256

ap-east-1 199566480951

ap-northeast-1 430734990657

ap-northeast-2 578805364391

ap-south-1 904829902805

ap-southeast-1 972752614525

ap-southeast-2 184798709955

ca-central-1 519511493484

cn-north-1 618459771430

cn-northwest-1 658757709296

eu-central-1 482524230118

eu-north-1 314864569078

eu-south-1 563282790590

eu-west-1 929884845733

eu-west-2 250201462417

eu-west-3 447278800020

me-south-1 986000313247

sa-east-1 818342061345

us-east-1 503895931360

us-east-2 915447279597

us-west-1 685455198987

us-west-2 895741380848

us-gov-west-1 515509971035

Amazon SageMaker Debugger Registry URLs for Custom Rule Evaluators


Use the following values for the components of the registry URL for the images that provide custom rule
evaluators for Amazon SageMaker Debugger. For account IDs, see the following table.

ECR Repository Name: sagemaker-debugger-rule-evaluator


Tag: latest

Example of a full registry URL:

552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest

Account IDs for Custom Rules Container Images by AWS Region

Region account_id

af-south-1 515950693465

ap-east-1 645844755771

ap-northeast-1 670969264625

ap-northeast-2 326368420253

ap-south-1 552407032007

ap-southeast-1 631532610101

ap-southeast-2 445670767460

ca-central-1 105842248657

cn-north-1 617202126805

cn-northwest-1 658559488188

eu-central-1 691764027602

eu-north-1 091235270104

eu-south-1 335033873580

eu-west-1 606966180310

eu-west-2 074613877050

eu-west-3 224335253976

me-south-1 050406412588

sa-east-1 466516958431

us-east-1 864354269164

us-east-2 840043622174

us-west-1 952348334681

us-west-2 759209512951

us-gov-west-1 515361955729

Amazon SageMaker Debugger Exceptions


Amazon SageMaker Debugger is designed to be aware that tensors required to execute a rule might
not be available at every step. As a result, it raises a few exceptions, which enable you to control what
happens when a tensor is missing. These exceptions are available in the smdebug.exceptions module.
You can import them as follows:

from smdebug.exceptions import *

The following exceptions are available:

• TensorUnavailableForStep – The tensor requested is not available for the step. This might mean
  that the hook did not save anything at this step, or that the step was saved but the requested tensor
  was not part of it. When you see this exception, the tensor can never become available for this step in
  the future. If reductions of the tensor were saved for the step, the exception notifies you that they can
  still be queried.
• TensorUnavailable – This tensor is not being saved or has not been saved by the smdebug API. This
means that this tensor is never seen for any step in smdebug.
• StepUnavailable – The step was not saved and Debugger has no data from the step.
• StepNotYetAvailable – The step has not yet been seen by smdebug. It might be available in the
future if the training is still going on. Debugger automatically loads new data as it becomes available.
• NoMoreData – Raised when the training ends. Once you see this, you know that there are no more
steps and no more tensors to be saved.
• IndexReaderException – The index reader is not valid.
• InvalidWorker – A worker was invoked that was not valid.
• RuleEvaluationConditionMet – Evaluation of the rule at the step resulted in the condition being
met.
• InsufficientInformationForRuleInvocation – Insufficient information was provided to invoke
the rule.
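
For example, an analysis script might tolerate steps that were skipped or not yet written, as in the
following sketch (the S3 path and the tensor name are placeholders):

import time

from smdebug.exceptions import StepNotYetAvailable, TensorUnavailableForStep
from smdebug.trials import create_trial

trial = create_trial("s3://your-bucket/debugger-output")

def value_when_ready(tensor_name, step):
    while True:
        try:
            return trial.tensor(tensor_name).value(step)
        except StepNotYetAvailable:
            time.sleep(10)  # training is still in progress; retry later
        except TensorUnavailableForStep:
            return None     # this step will never contain the tensor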

Considerations for Amazon SageMaker Debugger


Consider the following when using Amazon SageMaker Debugger.

Considerations for Distributed Training


The following list shows the scope of validity and considerations for using Debugger on training jobs
with deep learning frameworks and various distributed training options.

• Horovod

Scope of validity of using Debugger for training jobs with Horovod

Deep Learning Framework          Apache MXNet   TensorFlow 1.x   TensorFlow 2.x   TensorFlow 2.x with Keras   PyTorch

Monitoring system bottlenecks    Yes            Yes              Yes              Yes                         Yes

Profiling framework operations   No             No               No               Yes                         Yes

Debugging model output tensors   Yes            Yes              Yes              Yes                         Yes
• SageMaker distributed data parallel


Scope of validity of using Debugger for training jobs with SageMaker distributed data
parallel

Deep Learning Framework          TensorFlow 2.x   TensorFlow 2.x with Keras   PyTorch

Monitoring system bottlenecks    Yes              Yes                         Yes

Profiling framework operations   No*              No**                        Yes

Debugging model output tensors   Yes              Yes                         Yes

* Debugger does not support framework profiling for TensorFlow 2.x.

** SageMaker distributed data parallel does not support TensorFlow 2.x with Keras implementation.
• SageMaker distributed model parallel – Debugger does not support SageMaker distributed model
parallel training.
• Distributed training with SageMaker checkpoints – Debugger is not available for training jobs when
both the distributed training option and SageMaker checkpoints are enabled. You might see an error
that looks like the following:

SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled

To use Debugger for training jobs with distributed training options, you need to disable SageMaker
checkpointing and add manual checkpointing functions to your training script. For more information
about using Debugger with distributed training options and checkpoints, see Using SageMaker
Distributed Data Parallel with Amazon SageMaker Debugger and Checkpoints (p. 1862) and Saving
Checkpoints (p. 1939).
• Parameter Server – Debugger does not support parameter server-based distributed training.
• Profiling distributed training framework operations, such as the AllReduce operation of SageMaker
distributed data parallel and Horovod operations, is not available.

Considerations for Monitoring System Bottlenecks and Profiling Framework Operations
• For AWS TensorFlow, data loader metrics cannot be collected using the default local_path setting of
the FrameworkProfile class. The path has to be manually configured and end in "/". For example:

FrameworkProfile(local_path="/opt/ml/output/profiler/")

• For AWS TensorFlow, the data loader profiling configuration cannot be updated while a training job is
running.
• For AWS TensorFlow, a NoneType error might occur when you use analysis tools and notebook
examples with TensorFlow 2.3 training jobs and the detailed profiling option.
• Python profiling and detailed profiling are only supported for Keras API.
• To access the deep profiling feature for TensorFlow and PyTorch, currently you must specify the latest
AWS deep learning container images with CUDA 11. For example, you must specify the specific image
URI in the TensorFlow and PyTorch estimator as follows:
• For TensorFlow


image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04"

• For PyTorch

image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04"

Considerations for Debugging Model Output Tensors


• Avoid using functional API operations. Debugger cannot collect model output tensors from PyTorch
and MXNet training scripts composed of functional API operations.
• Debugger cannot collect model output tensors from the torch.nn.functional API operations.
When you write a PyTorch training script, it is recommended to use the torch.nn modules instead.
• Debugger cannot collect model output tensors from MXNet functional objects in hybrid blocks. For
example, the ReLu activation (F.relu) outputs cannot be collected from the following example of
mxnet.gluon.HybridBlock with F in the hybrid_forward function.

import mxnet as mx
from mxnet.gluon import HybridBlock, nn

class Model(HybridBlock):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        # use name_scope to give child Blocks appropriate names.
        with self.name_scope():
            self.dense0 = nn.Dense(20)
            self.dense1 = nn.Dense(20)

    def hybrid_forward(self, F, x):
        x = F.relu(self.dense0(x))
        return F.relu(self.dense1(x))

model = Model()
model.initialize(ctx=mx.cpu(0))
model.hybridize()
model(mx.nd.zeros((10, 10), ctx=mx.cpu(0)))

Amazon SageMaker Debugger Usage Statistics


Consider the following when using autogenerated reports by Amazon SageMaker Debugger.

Debugger Profiling Report Usage


For all SageMaker training jobs, Amazon SageMaker Debugger runs the ProfilerReport (p. 1751) rule
and autogenerates a SageMaker Debugger Profiling Report (p. 1729). The ProfilerReport rule
provides a Jupyter notebook file (profiler-report.ipynb) that generates a corresponding HTML file
(profiler-report.html).

Debugger collects profiling report usage statistics by including code in the Jupyter notebook that
collects the unique ProfilerReport rule's processing job ARN if the user opens the final
profiler-report.html file.

Debugger only collects information about whether a user opens the final HTML report. It DOES NOT
collect any information from training jobs, training data, training scripts, processing jobs, logs, or the
content of the profiling report itself.


You can opt out of the collection of usage statistics using either of the following options.

(Recommended) Option 1: Opt Out before Running a Training Job

To opt out, you need to add the following Debugger ProfilerReport rule configuration to your
training job request.

SageMaker Python SDK

estimator=sagemaker.estimator.Estimator(
    ...,
    rules=[
        ProfilerRule.sagemaker(
            base_config=rule_configs.ProfilerReport(),
            rule_parameters={"opt_out_telemetry": "True"}
        )
    ]
)

AWS CLI

"ProfilerRuleConfigurations": [
{
"RuleConfigurationName": "ProfilerReport-1234567890",
"RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-
debugger-rules:latest",
"RuleParameters": {
"rule_to_invoke": "ProfilerReport",
"opt_out_telemetry": "True"
}
}
]

AWS SDK for Python (Boto3)

ProfilerRuleConfigurations=[
    {
        'RuleConfigurationName': 'ProfilerReport-1234567890',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'opt_out_telemetry': 'True'
        }
    }
]

Option 2: Opt Out after a Training Job Has Completed

To opt out after training has completed, you need to modify the profiler-report.ipynb file.
Note
HTML reports autogenerated without the Option 1 configuration added to your training job request still
report the usage statistics even after you opt out using Option 2.

1. Follow the instructions on downloading the Debugger profiling report files in the Download the
SageMaker Debugger Profiling Report (p. 1730) page.
2. In the /ProfilerReport-1234567890/profiler-output directory, open profiler-report.ipynb.


3. Add opt_out=True to the setup_profiler_report() function in the fifth code cell as shown in
the following example code:

setup_profiler_report(processing_job_arn, opt_out=True)

4. Run the code cell to finish opting out.

Amazon SageMaker Debugger Release Notes


See the following release notes to track the latest updates for Amazon SageMaker Debugger.

Amazon SageMaker Debugger Release Notes: April 4, 2023


New features

SageMaker Debugger launches TensorBoard on SageMaker, a capability that brings the TensorBoard app
to SageMaker with access control.

Amazon SageMaker Debugger Release Notes: March 16, 2023


Deprecation notes

SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and
PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.

• SageMaker Python SDK <= v2.130.0


• PyTorch >= v1.6.0, < v2.0
• TensorFlow >= v2.3.1, < v2.11

With the deprecation, SageMaker Debugger also discontinues support for the following three
ProfilerRules for framework profiling.

• MaxInitializationTime
• OverallFrameworkMetrics
• StepOutlier

Amazon SageMaker Debugger Release Notes: February 21, 2023


Other changes

• The XGBoost report tab has been removed from the SageMaker Debugger profiler dashboard. You
can still access the XGBoost report by downloading it as a Jupyter notebook or an HTML file. For more
information, see SageMaker Debugger XGBoost Training Report.
• Starting from this release, the built-in profiler rules are not activated by default. To use the SageMaker
Debugger profiler rules to detect certain computational problems, you need to add the rules when you
configure a SageMaker training job launcher.

Amazon SageMaker Debugger Release Notes: December 1, 2020


Amazon SageMaker Debugger launched deep profiling features at re:Invent 2020.


Amazon SageMaker Debugger Release Notes: December 3, 2019


Amazon SageMaker Debugger initially launched at re:Invent 2019.

Distributed Training in Amazon SageMaker


SageMaker provides distributed training libraries and supports various distributed training options
for deep learning tasks such as computer vision (CV) and natural language processing (NLP). With
SageMaker’s distributed training libraries, you can run highly scalable and cost-effective custom data
parallel and model parallel deep learning training jobs. You can also use other distributed training
frameworks and packages such as PyTorch DistributedDataParallel (DDP), torchrun, MPI (mpirun), and
parameter server. Throughout the documentation, instructions and examples focus on how to set up the
distributed training options for deep learning tasks using the SageMaker Python SDK.
Tip
To learn best practices for distributed computing of machine learning (ML) training and
processing jobs in general, see Distributed computing with SageMaker best practices (p. 1944).

Get Started with Distributed Training


If you’re familiar with distributed training, choose one of the following options that matches your
preferred strategy or framework to get started. If you want to learn about distributed training in general,
see the section called “Basic Distributed Training Concepts” (p. 1824).

The SageMaker distributed training libraries are optimized for the SageMaker training environment,
help adapt your distributed training jobs to SageMaker, and improve training speed and throughput.
The libraries offer both data parallel and model parallel training strategies. They combine software and
hardware technologies to improve inter-GPU and inter-node communications, and extend SageMaker’s
training capabilities with built-in options that require minimal code changes to your training scripts.

• To use SageMaker's data parallelism library, configure the distribution parameter of the
SageMaker framework estimators. Supported framework estimators are PyTorch and TensorFlow. The
following code example shows how to set a framework estimator for distributed training with the data
parallelism library on two ml.p4d.24xlarge instances.

from sagemaker.framework import Framework

estimator = Framework(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}}
)

To learn how to prepare your training script and launch a distributed training job, see SageMaker's
data parallelism library (p. 1831) (see also Distributed Training APIs in the SageMaker Python SDK
documentation).
• To use SageMaker's model parallelism library, configure the distribution parameter of the
SageMaker framework estimators. Supported framework estimators are PyTorch and TensorFlow. The
following code example shows how to construct a framework estimator for distributed training with
the model parallelism library on two ml.p4d.24xlarge instances.

from sagemaker.framework import Framework

distribution={
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                ... # enter parameter key-value pairs here
            }
        },
    },
    "mpi": {
        "enabled": True,
        ... # enter parameter key-value pairs here
    }
}

estimator = Framework(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution=distribution
)

To learn how to prepare your training script, configure distribution parameters, and launch a
distributed training job, see SageMaker's model parallelism library (p. 1864) (see also Distributed
Training APIs in the SageMaker Python SDK documentation).

SageMaker also supports the following options to operate mpirun and torchrun in the backend.

• To use PyTorch DistributedDataParallel (DDP) in SageMaker with the mpirun backend, add
distribution={"pytorchddp": {"enabled": True}} to your PyTorch estimator. For more
information, see also PyTorch Distributed Training and SageMaker PyTorch Estimator's distribution
argument in the SageMaker Python SDK documentation.
Note
This option is available for PyTorch 1.12.0 and later.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={"pytorchddp": {"enabled": True}} # runs mpirun in the backend
)

• SageMaker supports the PyTorch torchrun launcher for distributed training on GPU-based Amazon
EC2 instances, such as P3 and P4, as well as Trn1 powered by the AWS Trainium device.

To use PyTorch DistributedDataParallel (DDP) in SageMaker with the torchrun backend, add
distribution={"torch_distributed": {"enabled": True}} to the PyTorch estimator.
Note
This option is available for PyTorch 1.13.0 and later.

The following code snippet shows an example of constructing a SageMaker PyTorch estimator to run
distributed training on two ml.p4d.24xlarge instances with the torch_distributed distribution
option.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={"torch_distributed": {"enabled": True}} # runs torchrun in the backend
)

For more information, see Distributed PyTorch Training and SageMaker PyTorch Estimator's
distribution argument in the SageMaker Python SDK documentation.

Notes for distributed training on Trn1

A Trn1 instance consists of up to 16 Trainium devices, and each Trainium device consists of two
NeuronCores. For specs of the AWS Trainium devices, see Trainium Architecture in the AWS Neuron
Documentation.

To train on the Trainium-powered instances, you only need to specify a Trn1 instance type code,
ml.trn1.*, as a string for the instance_type argument of the SageMaker PyTorch estimator class. To
find available Trn1 instance types, see AWS Trn1 Architecture in the AWS Neuron documentation.
Note
SageMaker Training on Amazon EC2 Trn1 instances is currently available only for the PyTorch
framework in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0. To find
a complete list of supported versions of PyTorch Neuron, see Neuron Containers in the AWS
Deep Learning Containers GitHub repository.

When you launch a training job on Trn1 instances using the SageMaker Python SDK, SageMaker
automatically picks up and runs the right container from Neuron Containers provided by AWS Deep
Learning Containers. The Neuron Containers are prepackaged with training environment settings
and dependencies for easier adaptation of your training job to the SageMaker Training platform and
Amazon EC2 Trn1 instances.
Note
To run your PyTorch training job on Trn1 instances with SageMaker, you should modify your
training script to initialize process groups with the xla backend and use PyTorch/XLA. To
support the XLA adoption process, the AWS Neuron SDK provides PyTorch Neuron that uses
XLA to make conversion of PyTorch operations to Trainium instructions. To learn how to
modify your training script, see Developer Guide for Training with PyTorch Neuron (torch-
neuronx) in the AWS Neuron Documentation.

For more information, see Distributed Training with PyTorch Neuron on Trn1 instances and SageMaker
PyTorch Estimator's distribution argument in the SageMaker Python SDK documentation.
• To use MPI in SageMaker, add distribution={"mpi": {"enabled": True}} to your estimator,
as shown in the sketch after this list. The MPI distribution option is available for the following
frameworks: MXNet, PyTorch, and TensorFlow.
• To use a parameter server in SageMaker, add distribution={"parameter_server":
{"enabled": True}} to your estimator. The parameter server option is available for the following
frameworks: MXNet, PyTorch, and TensorFlow.
Tip
For more information about using the MPI and parameter server options per framework, use
the following links to the SageMaker Python SDK documentation.
• MXNet Distributed Training and SageMaker MXNet Estimator's distribution argument
• PyTorch Distributed Training and SageMaker PyTorch Estimator's distribution argument
• TensorFlow Distributed Training and SageMaker TensorFlow Estimator's distribution
argument.
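
The following is a minimal sketch of the MPI option (using the TensorFlow estimator purely as an
example; the other framework estimators take the same distribution argument):

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    ...,  # entry point, role, and other required arguments
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    distribution={"mpi": {"enabled": True}}
)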


Basic Distributed Training Concepts


SageMaker’s distributed training libraries use the following distributed training terms and features.

Datasets and Batches

• Training Dataset: All of the data you use to train the model.
• Global batch size: The number of records selected from the training dataset in each iteration to send
to the GPUs in the cluster. This is the number of records over which the gradient is computed at each
iteration. If data parallelism is used, it is equal to the total number of model replicas multiplied by the
per-replica batch size: global batch size = (the number of model replicas) * (per-
replica batch size). A single batch of global batch size is often referred to as the mini-batch in
machine learning literature.
• Per-replica batch size: When data parallelism is used, this is the number of records sent to each model
replica. Each model replica performs a forward and backward pass with this batch to calculate weight
updates. The resulting weight updates are synchronized (averaged) across all replicas before the next
set of per-replica batches are processed.
• Micro-batch: A subset of the mini-batch or, if hybrid model and data parallelism is used, a subset
of the per-replica batch. When you use SageMaker’s distributed model parallelism library, each
micro-batch is fed into the training pipeline one-by-one and follows an execution schedule defined by
the library's runtime.

Training

• Epoch: One training cycle through the entire dataset. It is common to have multiple iterations per
epoch. The number of epochs you use in training depends on your model and use case.
• Iteration: A single forward and backward pass performed using a global batch sized batch (a mini-
batch) of training data. The number of iterations performed during training is determined by the
global batch size and the number of epochs used for training. For example, if a dataset includes 5,000
samples, and you use a global batch size of 500, it will take 10 iterations to complete a single epoch.
• Learning rate: A variable that influences the amount that weights are changed in response to the
calculated error of the model. The learning rate plays an important role in the model’s ability to
converge as well as the speed and optimality of convergence.

Instances and GPUs

• Instances: An AWS machine learning compute instance. These are also referred to as nodes.
• Cluster size: When using SageMaker's distributed training library, this is the number of instances
multiplied by the number of GPUs in each instance. For example, if you use two ml.p3.8xlarge
instances in a training job, which have 4 GPUs each, the cluster size is 8. While increasing cluster size
can lead to faster training times, communication between instances must be optimized; otherwise,
communication between the nodes can add overhead and lead to slower training times. The
SageMaker distributed training library is designed to optimize communication between Amazon EC2
ML compute instances, leading to higher device utilization and faster training times.

Distributed Training Solutions

• Data parallelism: A strategy in distributed training where a training dataset is split up across multiple
GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. Each GPU contains
a replica of the model, receives different batches of training data, performs a forward and backward
pass, and shares weight updates with the other nodes for synchronization before moving on to the
next batch and ultimately another epoch.


• Model parallelism: A strategy in distributed training where the model is partitioned across multiple
GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. The model might
be complex and have a large number of hidden layers and weights, making it unable to fit in the
memory of a single instance. Each GPU carries a subset of the model, through which the data flows
and the transformations are shared and compiled. The efficiency of model parallelism, in terms of GPU
utilization and training time, is heavily dependent on how the model is partitioned and the execution
schedule used to perform forward and backward passes.
• Pipeline Execution Schedule (Pipelining): The pipeline execution schedule determines the order in
which computations (micro-batches) are made and data is processed across devices during model
training. Pipelining is a technique to achieve true parallelization in model parallelism and overcome
the performance loss due to sequential computation by having the GPUs compute simultaneously on
different data samples. To learn more, see Pipeline Execution Schedule.

Advanced Concepts
Machine Learning (ML) practitioners commonly face two scaling challenges when training models:
scaling model size and scaling training data. While model size and complexity can result in better
accuracy, there is a limit to the model size you can fit into a single CPU or GPU. Furthermore, scaling
model size may result in more computations and longer training times.

Not all models handle training data scaling equally well because they need to ingest all the training data
in memory for training. They only scale vertically, and to bigger and bigger instance types. In most cases,
scaling training data results in longer training times.

Deep Learning (DL) is a specific family of ML algorithms consisting of several layers of artificial neural
networks. The most common training method is with mini-batch Stochastic Gradient Descent (SGD).
In mini-batch SGD, the model is trained by conducting small iterative changes of its coefficients in
the direction that reduces its error. Those iterations are conducted on equally sized subsamples of the
training dataset called mini-batches. For each mini-batch, the model is run on each record of the mini-
batch, its error is measured, and the gradient of the error is estimated. Then the average gradient is
computed across all the records of the mini-batch and provides an update direction for each model
coefficient. One full pass over the training dataset is called an epoch. Model training commonly consists
of dozens to hundreds of epochs. Mini-batch SGD has several benefits: First, its iterative design makes
training time theoretically linear in dataset size. Second, in a given mini-batch each record is processed
individually by the model without need for inter-record communication other than the final gradient
average. The processing of a mini-batch is consequently particularly suitable for parallelization and
distribution.

Parallelizing SGD training by distributing the records of a mini-batch over different computing devices is
called data parallel distributed training, and is the most commonly used DL distribution paradigm. Data
parallel training is a relevant distribution strategy to scale the mini-batch size and process each mini-
batch faster. However, data parallel training comes with the extra complexity of having to compute the
mini-batch gradient average with gradients coming from all the workers and communicating it to all
the workers, a step called allreduce. This step can represent a growing overhead as the training cluster
is scaled, and it can also drastically penalize training time if improperly implemented or implemented
over improper hardware.

Data parallel SGD still requires developers to be able to fit at least the model and a single record
in a computing device, such as a single CPU or GPU. When training very large models such as large
transformers in Natural Language Processing (NLP), or segmentation models over high-resolution
images, there may be situations in which this is not feasible. An alternative way to break up the workload
is to partition the model over multiple computing devices, an approach called model-parallel distributed
training.


Strategies
Distributed training is usually split by two approaches: data parallel and model parallel. Data parallel is
the most common approach to distributed training: You have a lot of data, batch it up, and send blocks
of data to multiple CPUs or GPUs (nodes) to be processed by the neural network or ML algorithm, then
combine the results. The neural network is the same on each node. A model parallel approach is used
with large models that won’t fit in a node’s memory in one piece; it breaks up the model and places
different parts on different nodes. In this situation, you need to send your batches of data out to each
node so that the data is processed on all parts of the model.

The terms network and model are often used interchangeably: A large model is really a large network
with many layers and parameters. Training with a large network produces a large model, and loading
the model back onto the network with all your pre-trained parameters and their weights loads a large
model into memory. When you break apart a model to split it across nodes, you’re also breaking apart
the underlying network. A network consists of layers, and to split up the network, you put layers on
different compute devices.

A common pitfall of naively splitting layers across devices is severe GPU under-utilization. Training is
inherently sequential in both forward and backward passes, and at a given time, only one GPU can
actively compute, while the others wait on the activations to be sent. Modern model parallel libraries
solve this problem by using pipeline execution schedules to improve device utilization. However, only
the Amazon SageMaker's distributed model parallel library includes automatic model splitting. The two
core features of the library, automatic model splitting and pipeline execution scheduling, simplifies the
process of implementing model parallelism by making automated decisions that lead to efficient device
utilization.

Train with Data Parallel and Model Parallel


If you are training with a large dataset, start with a data parallel approach. If you run out of memory
during training, you may want to switch to a model parallel approach, or try hybrid model and data
parallelism. You can also try the following to improve performance with data parallel:

• Change your model’s hyperparameters.


• Reduce the batch size.
• Keep reducing the batch size until it fits. If you reduce batch size to 1, and still run out of memory,
then you should try model-parallel training.

Try gradient compression (FP16, INT8):

• On NVIDIA TensorCore-equipped hardware, using mixed precision training creates both speed-up and
memory consumption reduction.
• SageMaker's distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the
box. No extra action is needed to enable AMP other than the framework-level modifications to your
training script (a sketch follows this list). If gradients are in FP16, the SageMaker data parallelism
library runs its AllReduce operation in FP16. For more information about implementing AMP APIs to
your training script, see the following resources:
• Frameworks - PyTorch in the NVIDIA Deep Learning Performance documentation
• Frameworks - TensorFlow in the NVIDIA Deep Learning Performance documentation
• Automatic Mixed Precision for Deep Learning in the NVIDIA Developer Docs
• Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs in the
PyTorch Blog
• TensorFlow mixed precision APIs in the TensorFlow documentation
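
The following is a minimal PyTorch sketch of such a framework-level AMP modification; the names
model, criterion, optimizer, and loader are placeholders for your own training objects:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()
for inputs, targets in loader:
    optimizer.zero_grad()
    with autocast():  # run the forward pass in FP16 where it is safe
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscale gradients, then take the optimizer step
    scaler.update()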

Try reducing the input size:


• Reduce the NLP sequence length. If you increase the sequence length, you need to adjust the batch
size down, or adjust the GPUs up to spread the batch.
• Reduce image resolution.

Check if you use batch normalization, since this can impact convergence. When you use distributed
training, your batch is split across GPUs, and the effect of a much lower batch size can be a higher error
rate, thereby preventing the model from converging. For example, if you prototyped your network on a
single GPU with a batch size of 64, then scaled up to four p3dn.24xlarge instances, you now have 32
GPUs and your per-GPU batch size drops from 64 to 2. This will likely break the convergence you saw
with a single node.

Start with model-parallel training when:

• Your model does not fit on a single device.


• Due to your model size, you’re facing limitations in choosing larger batch sizes, such as if your model
weights take up most of your GPU memory and you are forced to choose a smaller, suboptimal batch
size.

To learn more about the SageMaker distributed libraries, see the following:

• SageMaker's Data Parallelism Library (p. 1831)


• SageMaker's Model Parallelism Library (p. 1864)

Optimize Distributed Training


Customize hyperparameters for your use case and your data to get the best scaling efficiency. In the
following discussion, we highlight some of the most impactful training variables and provide references
to state-of-the-art implementations so you can learn more about your options. Also, we recommend that
you refer to your preferred framework’s distributed training documentation.

• Apache MXNet distributed training


• PyTorch distributed training
• TensorFlow distributed training

Batch Size
SageMaker distributed toolkits generally allow you to train on bigger batches. For example, if a model
fits within a single device but can only be trained with a small batch size, using either model-parallel
training or data parallel training enables you to experiment with larger batch sizes.

Be aware that batch size directly influences model accuracy by controlling the amount of noise in the
model update at each iteration. Increasing batch size reduces the amount of noise in the gradient
estimation, which can be beneficial when increasing from very small batch sizes, but can result in
degraded model accuracy as the batch size increases to large values.
Tip
Adjust your hyperparameters to ensure that your model trains to a satisfying convergence as
you increase its batch size.

A number of techniques have been developed to maintain good model convergence when the batch
size is increased.


Mini-Batch Size
In SGD, the mini-batch size quantifies the amount of noise present in the gradient estimation. A small
mini-batch results in a very noisy mini-batch gradient, which is not representative of the true gradient
over the dataset. A large mini-batch results in a mini-batch gradient close to the true gradient over the
dataset and potentially not noisy enough—likely to stay locked in irrelevant minima.

To learn more about these techniques, see the following papers:

• Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al.


• PowerAI DDL, Cho et al.
• Scale Out for Large Minibatch SGD: Residual Network Training on ImageNet-1K with Improved
Accuracy and Reduced Time to Train, Codreanu et al.
• ImageNet Training in Minutes, You et al.
• Large Batch Training of Convolutional Networks, You et al.
• Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes, You et al.
• Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes, Zheng et al.
• Deep Gradient Compression, Lin et al.

Scenarios
The following sections cover scenarios in which you may want to scale up training, and how you can do
so using AWS resources.

Scaling from a Single GPU to Many GPUs


The amount of data or the size of the model used in machine learning can create situations in which the
time to train a model is longer than you are willing to wait. Sometimes, the training doesn’t work at all
because the model or the training data is too large. One solution is to increase the number of GPUs you
use for training. On an instance with multiple GPUs, like a p3.16xlarge that has eight GPUs, the data
and processing is split across the eight GPUs. When you use distributed training libraries, this can result
in a near-linear speedup in the time it takes to train your model. It takes slightly over 1/8 the time it
would have taken on p3.2xlarge with one GPU.

Instance type GPUs

p3.2xlarge 1

p3.8xlarge 4

p3.16xlarge 8

p3dn.24xlarge 8

Note
The ml instance types used by SageMaker training have the same number of GPUs as the
corresponding p3 instance types. For example, ml.p3.8xlarge has the same number of GPUs
as p3.8xlarge (four).


Scaling from a Single Instance to Multiple Instances


If you want to scale your training even further, you can use more instances. However, you should choose
a larger instance type before you add more instances. Review the previous table to see how many GPUs
are in each p3 instance type.

If you have made the jump from a single GPU on a p3.2xlarge to four GPUs on a p3.8xlarge, but
decide that you require more processing power, you may see better performance and incur lower costs
if you choose a p3.16xlarge before trying to increase instance count. Depending on the libraries you
use, when you keep your training on a single instance, performance is better and costs are lower than a
scenario where you use multiple instances.

When you are ready to scale the number of instances, you can do this with the SageMaker Python
SDK estimator by setting instance_count. For example, you can set instance_type="ml.p3.16xlarge"
and instance_count=2. Instead of the eight GPUs on a single p3.16xlarge, you have 16 GPUs across
two identical instances. Scaling and throughput continue to improve in this way, starting with eight
GPUs on a single instance and increasing to 64 instances for a total of 256 GPUs.
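
The following is a minimal sketch of this setting, assuming the PyTorch estimator (any framework
estimator takes the same arguments):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...,  # entry point, role, and other required arguments
    instance_type="ml.p3.16xlarge",
    instance_count=2,  # 16 GPUs in total across two identical instances
)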


Availability Zones and Network Backplane


With multiple instances, it's important to understand the network that connects the instances, how
they read the training data, and how they share information between themselves (for example,
communication between the nodes in the cluster when doing an AllReduce operation).

First, your instances need to be in the same Region and same Availability Zone. For example, instances in
us-west-2 must all be in us-west-2a. When you use the SageMaker Python SDK, this is handled for
you. If you use Amazon EC2 and orchestrate your own training clusters, you need to be aware of this, or
your training speed will suffer.

Your training data should also be in the same Availability Zone. When you use a SageMaker estimator,
you pass in the Region and the S3 bucket, and if the data is not in the Region you set, you get an error.

Optimized GPU, Network, and Storage


The p3dn.24xlarge instance type was designed for fast local storage and a fast network backplane
with up to 100 gigabits, and we highly recommend it as the most performant option for distributed
training. SageMaker supports streaming data modes from S3, referred to as pipe mode. For HPC loads
like distributed training, we recommend Amazon FSx for your file storage.

Custom Training Scripts


While SageMaker makes it simple to deploy and scale the number of instances and GPUs, depending on
your framework of choice, managing the data and results can be very challenging, which is why external
supporting libraries are often used. This most basic form of distributed training requires modification of
your training script to manage the data distribution.

SageMaker also supports Horovod and implementations of distributed training native to each major
deep learning framework. If you choose to use examples from these frameworks, you can follow
SageMaker’s container guide for Deep Learning Containers, and various example notebooks that
demonstrate implementations.

SageMaker's Data Parallelism Library


The SageMaker data parallelism library extends SageMaker training capabilities on deep learning models
with near-linear scaling efficiency, achieving fast time-to-train with minimal code changes.

When training a model on a large amount of data, machine learning practitioners often turn to
distributed training to reduce the time to train. In some cases, where time is of the essence, the business
requirement is to finish training as quickly as possible or at least within a constrained time period. Then,
distributed training is scaled to use a cluster of multiple nodes—not just multiple GPUs in a computing
instance, but multiple instances with multiple GPUs. However, as the cluster size increases, performance
can drop significantly. This drop in performance is primarily caused by the communications overhead
between nodes in a cluster.

To resolve such overhead problems, SageMaker offers two distributed training options: SageMaker
model parallelism and SageMaker data parallelism. This guide focuses on how to train models using the
SageMaker data parallelism library.

• The library optimizes your training job for AWS network infrastructure and Amazon EC2 instance
topology.
• The library takes advantage of gradient updates to communicate between nodes with a custom
AllReduce algorithm.

To track the latest updates of the library, see the SageMaker Distributed Data Parallel Release Notes in
the SageMaker Python SDK documentation.


For more information about training with a model-parallel strategy, see SageMaker's Model Parallelism
Library (p. 1864).

Topics
• Introduction to SageMaker's Distributed Data Parallel Library (p. 1832)
• Supported Frameworks, AWS Regions, and Instances Types (p. 1834)
• Run a SageMaker Distributed Training Job with Data Parallelism (p. 1839)
• SageMaker Distributed Data Parallel Configuration Tips and Pitfalls (p. 1858)
• Amazon SageMaker Data Parallel Library FAQ (p. 1860)
• Data Parallel Troubleshooting (p. 1862)

Introduction to SageMaker's Distributed Data Parallel Library


Why Use SageMaker Distributed Data Parallel Library?
SageMaker's distributed data parallel library addresses communications overhead in two ways:

1. The library performs AllReduce, a key operation during distributed training that is responsible for a
large portion of communication overhead.
2. The library performs optimized node-to-node communication by fully utilizing AWS’s network
infrastructure and Amazon EC2 instance topology.

Use this data parallel library to increase speed by up to 25% in training models such as BERT. While
implementations like Horovod offer sub-linear performance at scale, this library offers near-linear
performance at scale. This means that you get a faster training time and a lower cost to train a model.
Note
The SageMaker distributed training libraries are available only through the AWS deep learning
containers for the TensorFlow, PyTorch, and HuggingFace frameworks within the SageMaker
training platform. To use the libraries, you must use the SageMaker Python SDK or the
SageMaker APIs through SDK for Python (Boto3) or AWS Command Line Interface. Throughout
the documentation, instructions and examples focus on how to use the distributed training
libraries with the SageMaker Python SDK.

Training Benchmarks
PyTorch with SageMaker's data parallel library


Using instance type ml.p3dn.24xlarge on 2-, 4-, and 8-node clusters:

• BERT: When used with PyTorch, the SageMaker library is 41%, 52%, and 13% faster than PyTorch-
DDP.
• MaskRCNN: When used with PyTorch, the SageMaker library is 4%, 19%, and 15% faster than PyTorch-
DDP.

These benchmarks were run on PyTorch v1.6 using ml.p3dn.24xlarge instances. You can find the
training code on the SageMaker examples website. The examples website also has benchmark training
code for these models using TensorFlow 2.3.

Optimal Bandwidth Use with Balanced Fusion Buffer


SageMaker's distributed data parallel library uses a communication pattern similar to parameter servers
to reduce the amount of data transferred and the number of steps involved in averaging gradients from
multiple GPUs. It also uses a new technique called balanced fusion buffers to make optimal use of the
bandwidth available across all nodes in the cluster.

One key disadvantage of traditional parameter servers is their suboptimal use of available network
bandwidth. Parameter servers treat variables as atomic units and place each variable on one server.
Since gradients become available sequentially during the backward pass, at any given instant, there
is imbalance in the volume of data being sent and received from different servers. Some servers are
receiving and sending more data, some less, and some none. This problem becomes worse as the number
of parameter servers increases.

The library addresses these problems by introducing balanced fusion buffers. A balanced fusion buffer is a
buffer in the GPU that holds the gradients until the size of the buffer exceeds a threshold. In a setup with
N parameter servers, when the buffer exceeds the threshold, the balanced fusion buffer is copied to CPU
memory, sharded into N parts, and the ith part is sent to the ith parameter server. Each server receives
exactly the same number of bytes from a balanced fusion buffer. The ith server receives the ith partition
of the balanced fusion buffer from all workers, sums them up, and sends the results back to all workers.
Since all the servers participate equally in averaging each balanced fusion buffer, server bandwidth is
efficiently utilized.
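
The following is a conceptual sketch of this sharding-and-averaging scheme (not the library's actual
implementation), written with NumPy for illustration:

import numpy as np

def balanced_fusion_allreduce(worker_buffers, num_servers):
    # worker_buffers: one flattened gradient buffer per worker, all the same size.
    # Split each worker's buffer into num_servers shards; server i receives the
    # i-th shard from every worker, so every server handles the same number of bytes.
    shards = [np.array_split(buf, num_servers) for buf in worker_buffers]
    # Each server averages its shard across workers and broadcasts the result back.
    averaged = [
        np.mean([worker[i] for worker in shards], axis=0)
        for i in range(num_servers)
    ]
    # Every worker reassembles the averaged shards into one gradient buffer.
    return np.concatenate(averaged)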

Optimal GPU Usage with Efficient AllReduce Overlapping with a Backward Pass
SageMaker's distributed data parallel library achieves optimal overlapping of the AllReduce operation
with the backward pass, significantly improving the GPU utilization, and achieves near-linear scaling
efficiency and faster time to train by optimizing tasks between CPUs and GPUs. The library performs
AllReduce in parallel while the GPUs are computing gradients, without taking away additional GPU cycles,
which makes the library faster.

• Leverages CPUs: The library uses CPUs to AllReduce gradients, offloading this task from the GPUs.
• Improved GPU usage: The cluster’s GPUs focus on computing gradients, improving their utilization
throughout training.

SageMaker Distributed Data Parallel Architecture


The library supports larger compute instances that have 8 GPUs per node: ml.p3.16xlarge,
ml.p3dn.24xlarge, and ml.p4d.24xlarge. The high-level workflow of the SageMaker distributed
data parallel library is as follows:

1. The library assigns ranks to GPUs (workers).


2. At each iteration, the library divides each global batch by the total number of workers (world size) and
assigns small batches (batch shards) to the workers.


• The size of the global batch is (number of nodes in a cluster) * (number of GPUs per
  node) * (batch shard size); see the worked example after this list.
• A batch shard (small batch) is the subset of the dataset assigned to each GPU (worker) per iteration.
3. The library launches a training script on each worker.
4. The library manages copies of model weights and gradients from the workers at the end of every
iteration.
5. The library synchronizes model weights and gradients across the workers to aggregate a single trained
model.
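
For example, with a hypothetical cluster of 3 nodes with 8 GPUs each and a batch shard of 32 samples
per GPU, the global batch size works out as follows:

num_nodes = 3          # nodes in the cluster
gpus_per_node = 8      # for example, ml.p3.16xlarge
batch_shard_size = 32  # samples processed by each GPU (worker) per iteration

world_size = num_nodes * gpus_per_node              # 24 workers
global_batch_size = world_size * batch_shard_size   # 768 samples per iteration
print(world_size, global_batch_size)                # 24 768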

The following architecture diagram shows an example of how the library sets up data parallelism for a
cluster of 3 nodes.

To start using the SageMaker distributed data parallel library, see Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847) to set up a SageMaker estimator
through Amazon SageMaker Python SDK, and Run a SageMaker Distributed Training Job with Data
Parallelism (p. 1839) to adapt your training script using the SageMaker distributed data parallel library.

Supported Frameworks, AWS Regions, and Instance Types


Before using the SageMaker data parallelism library, check the supported ML frameworks and
instance types, and confirm that your AWS account and AWS Region have sufficient quotas.

Supported Frameworks
The following tables show the deep learning frameworks and their versions that SageMaker and the
SageMaker data parallelism library support. The SageMaker data parallelism library is available in AWS
Deep Learning Containers (DLC) or downloadable as a binary file.


Note
To check the latest updates and release notes of the library, see also the SageMaker Data
Parallel Release Notes in the SageMaker Python SDK documentation.

Topics
• PyTorch (p. 1835)
• PyTorch Lightning (p. 1837)
• TensorFlow (p. 1837)
• Hugging Face Transformers (p. 1838)

PyTorch

For each supported PyTorch version, the following list gives the matching SageMaker data parallelism
library version, the smdistributed-dataparallel integrated image URI, and the URL of the binary file**.

• PyTorch v2.0.0 with smdistributed-dataparallel==v1.8.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl

• PyTorch v1.13.1 with smdistributed-dataparallel==v1.7.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl

• PyTorch v1.12.1 with smdistributed-dataparallel==v1.6.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl

• PyTorch v1.12.0 with smdistributed-dataparallel==v1.5.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl

• PyTorch v1.11.0 with smdistributed-dataparallel==v1.4.1
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.11.0/cu113/2022-04-14/smdistributed_dataparallel-1.4.1-cp38-cp38-linux_x86_64.whl

• PyTorch v1.10.2 with smdistributed-dataparallel==v1.4.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl

• PyTorch v1.9.1 with smdistributed-dataparallel==v1.2.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.9.0/cu111/2021-08-13/smdistributed_dataparallel-1.2.0-cp38-cp38-linux_x86_64.whl

• PyTorch v1.8.1 with smdistributed-dataparallel==v1.2.3
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.8.1/cu111/2021-12-13/smdistributed_dataparallel-1.2.3-cp36-cp36m-linux_x86_64.whl

• PyTorch v1.7.1 with smdistributed-dataparallel==v1.0.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.7.1/cu110/2021-01-26/smdistributed_dataparallel-1.0.0-cp36-cp36m-linux_x86_64.whl

• PyTorch v1.6.0 with smdistributed-dataparallel==v1.0.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.6.0/cu110/2021-01-14/smdistributed_dataparallel-1.0.0-cp36-cp36m-linux_x86_64.whl

Note
The SageMaker data parallelism library v1.4.0 and later works as a backend of PyTorch
distributed. In accordance with the change, the following smdistributed APIs for the PyTorch
distributed package are deprecated.

• smdistributed.dataparallel.torch.distributed is deprecated. Use the
  torch.distributed package instead.
• smdistributed.dataparallel.torch.parallel.DistributedDataParallel is
  deprecated. Use the torch.nn.parallel.DistributedDataParallel API instead.


If you need to use the previous versions of the library (v1.3.0 or before), see the archived
SageMaker data parallelism library documentation in the SageMaker Python SDK
documentation.

** The URLs of the binary files are for installing the SageMaker data parallelism library in custom
containers. For more information, see Create Your Own Docker Container with the SageMaker Distributed
Data Parallel Library (p. 1850).

PyTorch Lightning

The following PyTorch Lightning versions are supported, with the paired PyTorch version, SageMaker
data parallelism library version, smdistributed-dataparallel integrated image URI, and URL of the
binary file**.

• PyTorch Lightning 1.7.2, 1.7.0, 1.6.4, 1.6.3, and 1.5.10 with PyTorch v1.12.0 and
  smdistributed-dataparallel==v1.5.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
  Binary file URL**: https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl

Note
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the
PyTorch DLCs. When you construct a SageMaker PyTorch estimator and submit a training job
request in Step 2, you need to provide requirements.txt to install pytorch-lightning
and lightning-bolts in the SageMaker PyTorch training container.

# requirements.txt
pytorch-lightning
lightning-bolts

For more information about specifying the source directory to place the requirements.txt
file along with your training script and a job submission, see Using third-party libraries in the
Amazon SageMaker Python SDK documentation.

TensorFlow

For each supported TensorFlow version, the following list gives the matching SageMaker data
parallelism library version and the smdistributed-dataparallel integrated image URI.

• TensorFlow 2.9.1 with smdistributed-dataparallel==v1.4.1
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.1-gpu-py39-cu112-ubuntu20.04-sagemaker

• TensorFlow 2.8.0 with smdistributed-dataparallel==v1.3.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.8.0-gpu-py39-cu112-ubuntu20.04-sagemaker

• TensorFlow 2.7.1 with smdistributed-dataparallel==v1.3.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.7.1-gpu-py38-cu112-ubuntu20.04-sagemaker

• TensorFlow 2.6.2 with smdistributed-dataparallel==v1.2.1
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.6.2-gpu-py38-cu112-ubuntu20.04

• TensorFlow 2.5.1 with smdistributed-dataparallel==v1.2.1
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.1-gpu-py37-cu112-ubuntu18.04

• TensorFlow 2.4.1 with smdistributed-dataparallel==v1.2.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04

• TensorFlow 2.3.2 with smdistributed-dataparallel==v1.0.0
  Integrated image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.3.2-gpu-py37-cu110-ubuntu18.04

Hugging Face Transformers


The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch
and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and
paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers and the Prior Hugging
Face Container Versions.

AWS Regions
The SageMaker data parallelism library is available in all of the AWS Regions where the AWS Deep
Learning Containers for SageMaker are in service. For more information, see Available Deep Learning
Containers Images.

Supported Instance Types


The SageMaker data parallelism library requires one of the following ML instance types:

• ml.p3.16xlarge
• ml.p3dn.24xlarge
• ml.p4d.24xlarge
• ml.p4de.24xlarge

For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance
Types page. For information about instance pricing, see Amazon SageMaker Pricing.

If you encounter an error message similar to the following, follow the instructions at Request a service
quota increase for SageMaker resources.

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.

Run a SageMaker Distributed Training Job with Data Parallelism


SageMaker's distributed data parallel library APIs are designed for ease of use and to provide seamless
integration with existing distributed training toolkits.

• SageMaker Python SDK with the library API – In most cases, all you have to change in your training
script is the data parallel library import statements. Swap these out with the SageMaker data parallel
library equivalents.
• Focus on your model training without infrastructure management – When training a deep learning
model with the library on SageMaker, you can focus on writing your training script and model training.
You can run a training job using estimator classes provided by the SageMaker Python SDK. The
estimator classes help prepare ML instances, load datasets from specified data resources, submit the
training job using your training script, and shut down the instances after the training job is completed.

To begin, you need to adapt TensorFlow or PyTorch training scripts to use the library. The following
topics provide instructions on how to modify your training script.

Topics
• Step 1: Modify Your Own Training Script (p. 1839)
• Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK (p. 1847)

Step 1: Modify Your Own Training Script


Use this section to learn how to customize your training script to use the core features of the Amazon
SageMaker distributed data parallel library. To use the library-specific API functions and parameters,
we recommend you use this documentation alongside the SageMaker data parallel library APIs in the
SageMaker Python SDK documentation.

The training script examples provided in these sections are simplified and designed to highlight the
required changes you must make to use the library. For end-to-end, runnable notebook examples that
demonstrate how to use a TensorFlow or PyTorch training script with the SageMaker distributed data
parallel library, see Amazon SageMaker Distributed Training Notebook Examples (p. 1942).

Topics
• Modify a TensorFlow Training Script (p. 1839)
• Modify a PyTorch Training Script (p. 1842)
• Modify a PyTorch Lightning Script (p. 1845)

Modify a TensorFlow Training Script


The following steps show you how to modify a TensorFlow training script to utilize SageMaker's
distributed data parallel library.

The library APIs are designed to be similar to Horovod APIs. For additional details on each API
that the library offers for TensorFlow, see the SageMaker distributed data parallel TensorFlow API
documentation.
Note
SageMaker distributed data parallel works with TensorFlow training scripts composed of core tf
modules, but not with tf.keras modules. SageMaker distributed data parallel does not support
TensorFlow with Keras implementations.


Note
SageMaker's distributed data parallelism library supports Automatic Mixed Precision (AMP)
out of the box. No extra action is needed to enable AMP other than the framework-level
modifications to your training script. If gradients are in FP16, the SageMaker data parallelism
library runs its AllReduce operation in FP16. For more information about implementing AMP
APIs to your training script, see the following resources:

• Frameworks - TensorFlow in the NVIDIA Deep Learning Performance documentation


• Automatic Mixed Precision for Deep Learning in the NVIDIA Developer Docs
• TensorFlow mixed precision APIs in the TensorFlow documentation
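
For example, one framework-level way to enable AMP without tf.keras is TensorFlow's automatic
mixed precision graph rewrite. This is a general TensorFlow option rather than a library-specific API;
consult the resources above for the variant that fits your script.

import tensorflow as tf

# Enable TensorFlow's automatic mixed precision graph rewrite.
# One of several framework-level AMP options; see the resources above.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})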

1. Import the library's TensorFlow client and initialize it.

import smdistributed.dataparallel.tensorflow as sdp


sdp.init()

2. Pin each GPU to a single smdistributed.dataparallel process with local_rank—this refers


to the relative rank of the process within a given node. The sdp.tensorflow.local_rank() API
provides you the local rank of the device. The leader node is rank 0, and the worker nodes are
rank 1, 2, 3, and so on. This is invoked in the following code block as sdp.local_rank().
set_memory_growth is not directly related to SageMaker distributed, but must be set for
distributed training with TensorFlow.

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

3. Scale the learning rate by the number of workers. The sdp.tensorflow.size() API provides you
the number of workers in the cluster. This is invoked in the following code block as sdp.size().

learning_rate = learning_rate * sdp.size()

4. Use the library’s DistributedGradientTape to optimize AllReduce operations during training.


This wraps tf.GradientTape.

with tf.GradientTape() as tape:
    output = model(input)
    loss_value = loss(label, output)

# SageMaker data parallel: Wrap tf.GradientTape with the library's
# DistributedGradientTape
tape = sdp.DistributedGradientTape(tape)

5. Broadcast the initial model variables from the leader node (rank 0) to all the worker nodes (ranks
1 through n). This is needed to ensure a consistent initialization across all the worker ranks. Use
the sdp.tensorflow.broadcast_variables API after the model and optimizer variables are
initialized. This is invoked in the following code block as sdp.broadcast_variables().

sdp.broadcast_variables(model.variables, root_rank=0)
sdp.broadcast_variables(opt.variables(), root_rank=0)

6. Finally, modify your script to save checkpoints only on the leader node. The leader node has a
synchronized model. This also avoids worker nodes overwriting the checkpoints and possibly
corrupting the checkpoints.


if sdp.rank() == 0:
checkpoint.save(checkpoint_dir)

The following is an example TensorFlow training script for distributed training with the library.

import tensorflow as tf

# SageMaker data parallel: Import the library TF API
import smdistributed.dataparallel.tensorflow as sdp

# SageMaker data parallel: Initialize the library
sdp.init()

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    # SageMaker data parallel: Pin GPUs to a single library process
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Prepare Dataset
dataset = tf.data.Dataset.from_tensor_slices(...)

# Define Model
mnist_model = tf.keras.Sequential(...)
loss = tf.losses.SparseCategoricalCrossentropy()

# SageMaker data parallel: Scale Learning Rate
# LR for 8 node run : 0.000125
# LR for single node run : 0.001
opt = tf.optimizers.Adam(0.000125 * sdp.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # SageMaker data parallel: Wrap tf.GradientTape with the library's
    # DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    if first_batch:
        # SageMaker data parallel: Broadcast model and optimizer variables
        sdp.broadcast_variables(mnist_model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value

...

# SageMaker data parallel: Save checkpoints only from master node.
if sdp.rank() == 0:
    checkpoint.save(checkpoint_dir)

After you have completed adapting your training script, move on to Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847).


Modify a PyTorch Training Script


In the SageMaker data parallel library v1.4.0 and later, the library is available as a backend option
for the PyTorch distributed package. You only need to import the library once at the top of your
training script and set it as the PyTorch distributed backend during initialization. With the single line
of backend specification, you can keep your PyTorch training script unchanged and directly use the
PyTorch distributed modules. To find the latest API documentation for the library, see the SageMaker
distributed data parallel APIs for PyTorch in the SageMaker Python SDK documentation. To learn more
about the PyTorch distributed package and backend options, see Distributed communication package -
torch.distributed.
Important
Because the SageMaker distributed data parallelism library v1.4.0 and later works as a backend
of PyTorch distributed, the following smdistributed APIs for the PyTorch distributed package are
deprecated.

• smdistributed.dataparallel.torch.distributed is deprecated. Use the
  torch.distributed package instead.
• smdistributed.dataparallel.torch.parallel.DistributedDataParallel is
  deprecated. Use the torch.nn.parallel.DistributedDataParallel API instead.

If you need to use the previous versions of the library (v1.3.0 or before), see the archived
SageMaker distributed data parallel library documentation in the SageMaker Python SDK
documentation.

Use the SageMaker Distributed Data Parallel Library as the Backend of torch.distributed
To use the SageMaker distributed data parallel library, the only thing you need
to do is to import the SageMaker distributed data parallel library’s PyTorch client
(smdistributed.dataparallel.torch.torch_smddp). The client registers smddp as
a backend for PyTorch. When you initialize the PyTorch distributed process group using the
torch.distributed.init_process_group API, make sure you specify 'smddp' to the backend
argument.

import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist

dist.init_process_group(backend='smddp')

Note
The smddp backend currently does not support creating subprocess groups with the
torch.distributed.new_group() API. You cannot use the smddp backend concurrently
with other process group backends such as NCCL and Gloo.

If you already have a working PyTorch script and only need to add the backend specification, you can
proceed to Using the SageMaker Framework Estimators For PyTorch and TensorFlow (p. 1847) in the
Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK (p. 1847) topic.

If you still need to modify your training script to properly use the PyTorch distributed package, follow
the rest of the procedures on this page.

Preparing a PyTorch Training Script for Distributed Training


The following steps provide additional tips on how to prepare your training script to successfully run a
distributed training job using PyTorch.
Note
In v1.4.0, the SageMaker distributed data parallel library supports the following collective
primitive data types of the torch.distributed interface: all_reduce, broadcast, reduce,
all_gather, and barrier.


1. Import the PyTorch distributed modules.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

2. After parsing arguments and defining a batch size parameter (for example,
batch_size=args.batch_size), add two lines of code to resize the batch size per worker (GPU).
PyTorch's DataLoader operation does not automatically handle the batch resizing for distributed
training.

batch_size //= dist.get_world_size()
batch_size = max(batch_size, 1)

3. Pin each GPU to a single SageMaker data parallel library process with local_rank—this refers to
the relative rank of the process within a given node.

You can retrieve the rank of the process from the LOCAL_RANK environment variable.

import os
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

4. After defining a model, wrap it with the PyTorch DistributedDataParallel API.

model = ...

# Wrap the model with the PyTorch DistributedDataParallel API
model = DDP(model)

5. When you call the torch.utils.data.distributed.DistributedSampler API, specify the
   total number of processes (GPUs) participating in training across all the nodes in the cluster. This
   is called world_size, and you can retrieve the number from the
   torch.distributed.get_world_size() API. Also, specify the rank of each process among all
   processes using the torch.distributed.get_rank() API.

from torch.utils.data.distributed import DistributedSampler

train_sampler = DistributedSampler(
train_dataset,
num_replicas = dist.get_world_size(),
rank = dist.get_rank()
)

6. Modify your script to save checkpoints only on the leader process (rank 0). The leader process has
a synchronized model. This also avoids other processes overwriting the checkpoints and possibly
corrupting the checkpoints.

if dist.get_rank() == 0:
torch.save(...)

The following example code shows the structure of a PyTorch training script with smddp as the backend.

import os
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torch.optim.lr_scheduler import StepLR

# SageMaker data parallel: Import the library PyTorch API
import smdistributed.dataparallel.torch.torch_smddp

# SageMaker data parallel: Import PyTorch's distributed API
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# SageMaker data parallel: Initialize the process group
dist.init_process_group(backend='smddp')

class Net(nn.Module):
    ...
    # Define model

def train(...):
    ...
    # Model training

def test(...):
    ...
    # Model evaluation

def main():

    # SageMaker data parallel: Scale batch size by world size
    batch_size //= dist.get_world_size()
    batch_size = max(batch_size, 1)

    # Prepare dataset
    train_dataset = torchvision.datasets.MNIST(...)

    # SageMaker data parallel: Set num_replicas and rank in DistributedSampler
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank())

    train_loader = torch.utils.data.DataLoader(...)

    # SageMaker data parallel: Pin each GPU to a single library process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # SageMaker data parallel: Wrap the PyTorch model with the library's DDP
    model = DDP(Net().cuda(local_rank))

    # Train
    optimizer = optim.Adadelta(...)
    scheduler = StepLR(...)
    for epoch in range(1, args.epochs + 1):
        train(...)
        if dist.get_rank() == 0:
            test(...)
        scheduler.step()

    # SageMaker data parallel: Save model on the leader node (rank 0).
    if dist.get_rank() == 0:
        torch.save(...)

if __name__ == '__main__':
    main()

After you have completed adapting your training script, proceed to Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847).


Modify a PyTorch Lightning Script

If you want to bring your PyTorch Lightning training script and run a distributed data parallel training job
in SageMaker, you can run the training job with minimal changes in your training script. The necessary
changes include the following: import the smdistributed.dataparallel library’s PyTorch modules,
set up the environment variables for PyTorch Lightning to accept the SageMaker environment variables
that are preset by the SageMaker training toolkit, and activate the SageMaker data parallel library by
setting the process group backend to "smddp". To learn more, walk through the following instructions
that break down the steps with code examples.
Note
The PyTorch Lightning support is available in the SageMaker data parallel library v1.5.0 and
later.

1. Import the pytorch_lightning library and the smdistributed.dataparallel.torch modules.

import pytorch_lightning as pl
import smdistributed.dataparallel.torch.torch_smddp

2. Set the world size and the rank for the LightningEnvironment class object. When launching a
training job in SageMaker, the SageMaker training toolkit sets up the environment variables "RANK",
"LOCAL_RANK", and "WORLD_SIZE". These environment variables represent the processes' global
ranks, their local ranks, and the world size, respectively. Use these SageMaker environment variables
to configure the LightningEnvironment.

import os
from pytorch_lightning.plugins.environments.lightning_environment \
import LightningEnvironment

env = LightningEnvironment()
env.world_size = lambda: int(os.environ["WORLD_SIZE"])
env.global_rank = lambda: int(os.environ["RANK"])

3. Set distributed training strategy using the PyTorch Lightning DDPStrategy module, create a PyTorch
Lightning Trainer object, and adapt them to use the SageMaker data parallel library.

(Recommended) For PyTorch Lightning v1.6.0 and later

Create an object (ddp in the following code example) of the DDPStrategy class, and specify "smddp"
to the process_group_backend parameter. When configuring a PyTorch Lightning Trainer object,
use the SageMaker environment variables to specify the scale of the GPU cluster and the ddp object
to set up the distributed training strategy.
Note
We recommend that you check the versions of PyTorch Lightning tested for compatibility
with the SageMaker data parallel library in the section called “Supported Frameworks and
AWS Regions” (p. 1872).

from pytorch_lightning.strategies import DDPStrategy

ddp = DDPStrategy(
cluster_environment=env,
process_group_backend="smddp",
accelerator="gpu"
)

world_size = int(os.environ["WORLD_SIZE"])
num_gpus = int(os.environ["SM_NUM_GPUS"])
num_nodes = int(world_size/num_gpus)

trainer = pl.Trainer(
    devices=num_gpus,
    num_nodes=num_nodes,
    max_epochs=10,
    strategy=ddp
)

(Optional) For PyTorch Lightning v1.5.10

If you are using DDPPlugin, which is a deprecated functionality, set the distributed strategy as shown
in the following code example.

import torch
from pytorch_lightning.plugins.training_type.ddp import DDPPlugin

os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "smddp"

world_size = int(os.environ["WORLD_SIZE"])
num_gpus = int(os.environ["SM_NUM_GPUS"])
num_nodes = int(world_size/num_gpus)

ddp = DDPPlugin(
    parallel_devices=[torch.device("cuda", d) for d in range(num_gpus)],
    cluster_environment=env
)

trainer = pl.Trainer(
    gpus=num_gpus,
    num_nodes=num_nodes,
    max_epochs=10,
    strategy=ddp
)

4. Run trainer.fit to start the training job of a PyTorch model. The following code example shows
a PyTorch model object wrapped by the PyTorch Lightning Trainer’s fit method with the PyTorch
Lightning MNIST data module.

from pl_bolts.datamodules import MNISTDataModule

trainer.fit(model, datamodule=MNISTDataModule(batch_size=32))

After you have completed adapting your training script, proceed to Step 2: Launch a SageMaker
Distributed Training Job Using the SageMaker Python SDK (p. 1847).
Note
When you construct a SageMaker PyTorch estimator and submit a training job request in Step
2, you need to provide requirements.txt to install pytorch-lightning and lightning-
bolts in the SageMaker PyTorch training container.

# requirements.txt
pytorch-lightning
lightning-bolts

For more information about specifying the source directory to place the requirements.txt
file along with your training script and a job submission, see Using third-party libraries in the
Amazon SageMaker Python SDK documentation.


Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK

To run your adapted script in Step 1: Modify Your Own Training Script (p. 1839), start by creating a
SageMaker framework or generic estimator object with the prepared training script and the distributed
training configuration parameter. You can use the library in any kind of SageMaker environment and
web IDE, such as a SageMaker notebook instance or SageMaker Studio.

Use the high-level SageMaker Python SDK to do one of the following:

• If you want to achieve a quick adoption of your distributed training job in SageMaker, configure a
SageMaker PyTorch or TensorFlow framework estimator class. The framework estimator picks up your
training script and automatically matches the right image URI of the pre-built PyTorch or TensorFlow
Deep Learning Containers (DLC), given the value specified to the framework_version parameter.
• If you want to extend one of the pre-built containers or build a custom container to create your own
ML environment with SageMaker, use the SageMaker generic Estimator class and specify the image
URI of the custom Docker container hosted in your Amazon Elastic Container Registry (Amazon ECR).

Your training datasets should be stored in Amazon S3 or Amazon FSx for Lustre in the AWS Region in
which you are launching your training job. If you use Jupyter notebooks, you should have a SageMaker
notebook instance or a SageMaker Studio app running in the same AWS Region. For more information
about storing your training data, see the SageMaker Python SDK data inputs documentation.
Tip
We highly recommend that you use Amazon FSx for Lustre instead of Amazon S3 to increase
training performance. Amazon FSx has higher throughput and lower latency than Amazon S3.

Choose one of the following topics for instructions on how to run your TensorFlow or PyTorch training
scripts. After you launch a training job, you can monitor system utilization and model performance using
Debug and Profile Training Jobs Using Amazon SageMaker Debugger (p. 1649) or Amazon CloudWatch.

While you follow instructions in the following topics to learn more about technical details, we also
recommend that you try the Amazon SageMaker Distributed Training Notebook Examples (p. 1942) to
get started.

Topics
• Using the SageMaker Framework Estimators For PyTorch and TensorFlow (p. 1847)
• Using the SageMaker Generic Estimator to Extend Prebuilt Containers (p. 1849)
• Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library (p. 1850)

Using the SageMaker Framework Estimators For PyTorch and TensorFlow


You can activate the SageMaker distributed data parallel library by specifying it as the distribution
strategy in the SageMaker framework estimator class.

SageMaker PyTorch

from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="sub-folder-for-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py38",
    framework_version="1.12.0",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

pt_estimator.fit("s3://bucket/path/to/training/data")

Note
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the
SageMaker PyTorch DLCs. Create the following requirements.txt file and save in the
source directory where you save the training script.

# requirements.txt
pytorch-lightning
lightning-bolts

For example, the tree-structured directory should look like the following.

.
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt

For more information about specifying the source directory to place the
requirements.txt file along with your training script and a job submission, see Using
third-party libraries in the Amazon SageMaker Python SDK documentation.
SageMaker TensorFlow

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.9.1",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")

The following two parameters of the SageMaker framework estimator are required to activate the
SageMaker data parallelism.


distribution (dict): A dictionary with information on how to run distributed training (default: None).

• To use smdistributed.dataparallel as a distribution strategy, configure a dictionary as shown in
  the following code:

  distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }

• custom_mpi_options (str) (optional): Custom MPI options. The following is an example of
  how you can use this parameter when defining distribution. To learn more, see Custom MPI
  Options (p. 1859).

  distribution = {
      "smdistributed": {
          "dataparallel": {
              "enabled": True,
              "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
          }
      }
  }

instance_type (str): A type of Amazon EC2 instance to use.

• If you use the smdistributed with dataparallel distribution strategy, you must use one of the
  following instance types: ml.p4d.24xlarge, ml.p3dn.24xlarge, or ml.p3.16xlarge. For best
  performance, we recommend that you use one of the EFA-enabled instance types:
  ml.p3dn.24xlarge or ml.p4d.24xlarge.

Using the SageMaker Generic Estimator to Extend Prebuilt Containers


You can customize SageMaker prebuilt containers or extend them to handle any additional functional
requirements for your algorithm or model that the prebuilt SageMaker Docker image doesn't support.
For an example of how you can extend a pre-built container, see Extend a Prebuilt Container.

To extend a prebuilt container or adapt your own container to use the library, you must use one of the
images listed in Supported Frameworks (p. 1834).
Important
Starting with TensorFlow 2.4.1 and PyTorch 1.8.1, the framework DLCs support EFA-enabled instance
types (ml.p3dn.24xlarge, ml.p4d.24xlarge). We recommend that you use the DLC images
that contain TensorFlow 2.4.1 or later and PyTorch 1.8.1 or later.

For example, if you use PyTorch, your Dockerfile should contain a FROM statement similar to the
following:

# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/pytorch-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# /opt/ml and all subdirectories are utilized by SageMaker; use the /code subdirectory to store your user code.
COPY cifar10.py /opt/ml/code/cifar10.py

# Defines cifar10.py as the script entry point
ENV SAGEMAKER_PROGRAM cifar10.py

You can further customize your own Docker container to work with SageMaker using the SageMaker
Training toolkit and the binary file of the SageMaker distributed data parallel library. To learn more, see
the instructions in the following section.

Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library
To build your own Docker container for training and use the SageMaker data parallel library, you must
include the correct dependencies and the binary files of the SageMaker distributed parallel libraries
in your Dockerfile. This section provides instructions on how to create a complete Dockerfile with the
minimum set of dependencies for distributed training in SageMaker using the data parallel library.
Note
This custom Docker option with the SageMaker data parallel library as a binary is available only
for PyTorch.

To create a Dockerfile with the SageMaker training toolkit and the data parallel library

1. Start with a Docker image from NVIDIA CUDA. Use the cuDNN developer versions that contain CUDA
runtime and development tools (headers and libraries) to build from the PyTorch source code.

FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

Tip
The official AWS Deep Learning Container (DLC) images are built from the NVIDIA CUDA base
images. If you want to use the prebuilt DLC images as references while following the rest of
the instructions, see the AWS Deep Learning Containers for PyTorch Dockerfiles.
2. Add the following arguments to specify the versions of PyTorch and other packages. Also, indicate
   the Amazon S3 bucket paths to the SageMaker data parallel library and other software that uses
   AWS resources, such as the Amazon S3 plug-in.

To use versions of the third party libraries other than the ones provided in the following code
example, we recommend you look into the official Dockerfiles of AWS Deep Learning Container for
PyTorch to find versions that are tested, compatible, and suitable for your application.

To find URLs for the SMDATAPARALLEL_BINARY argument, see the lookup tables at Supported
Frameworks (p. 1834).

ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws

3. Set the following environment variables to properly build SageMaker training components and run
the data parallel library. You use these variables for the components in the subsequent steps.

# Set ENV variables required to build PyTorch


ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3

# Add OpenMPI to the path.


ENV PATH /opt/amazon/openmpi/bin:$PATH


# Add Conda to path


ENV PATH $CONDA_PREFIX/bin:$PATH

# Set this environment variable for SageMaker to launch SMDDP correctly.


ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

# Add environment variable for processes to be able to call fork()


ENV RDMAV_FORK_SAFE=1

# Indicate the container type


ENV DLC_CONTAINER_TYPE=training

# Add EFA and SMDDP to LD library path


ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/
smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH

4. Install or update curl, wget, and git to download and build packages in the subsequent steps.

RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
curl \
wget \
git \
&& rm -rf /var/lib/apt/lists/*

5. Install Elastic Fabric Adapter (EFA) software for Amazon EC2 network communication.

RUN DEBIAN_FRONTEND=noninteractive apt-get update


RUN mkdir /tmp/efa \
&& cd /tmp/efa \
&& curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-
${EFA_VERSION}.tar.gz \
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod -g \
&& rm -rf /tmp/efa

6. Install Conda to handle package management.

RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p $CONDA_PREFIX && \
rm ~/miniconda.sh && \
$CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml
numpy ipython && \
$CONDA_PREFIX/bin/conda clean -ya

7. Get, build, and install PyTorch and its dependencies. We build PyTorch from the source code because
we need to have control of the NCCL version to guarantee compatibility with the AWS OFI NCCL plug-
in.
a. Following the steps in the PyTorch official dockerfile, install build dependencies and set up ccache
to speed up recompilation.

RUN DEBIAN_FRONTEND=noninteractive \
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
ccache \
cmake \


git \
libjpeg-dev \
libpng-dev \
&& rm -rf /var/lib/apt/lists/*

# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache

b. Install PyTorch’s common and Linux dependencies.

# Common dependencies for PyTorch


RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi \
    typing_extensions future six requests dataclasses

# Linux specific dependency for PyTorch


RUN conda install -c pytorch magma-cuda113

c. Clone the PyTorch GitHub repository.

RUN --mount=type=cache,target=/opt/ccache \
cd / \
&& git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}

d. Install and build a specific NCCL version. To do this, replace the content in PyTorch's default
   NCCL folder (/pytorch/third_party/nccl) with the specific NCCL version from the NVIDIA
   repository. The NCCL version was set in step 3 of this guide.

RUN cd /pytorch/third_party/nccl \
&& rm -rf nccl \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
&& cd nccl \
&& make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-
gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
&& make pkg.txz.build \
&& tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1

e. Build and install PyTorch. This process usually takes slightly more than one hour to complete. It is
built using the NCCL version downloaded in a previous step.

RUN cd /pytorch \
&& CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
python setup.py install \
&& rm -rf /pytorch

8. Build and install AWS OFI NCCL plugin. This enables libfabric support for the SageMaker data parallel
library.

RUN DEBIAN_FRONTEND=noninteractive apt-get update \


&& apt-get install -y --no-install-recommends \
autoconf \
automake \
libtool
RUN mkdir /tmp/efa-ofi-nccl \
&& cd /tmp/efa-ofi-nccl \
&& git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
&& cd aws-ofi-nccl \
&& ./autogen.sh \
&& ./configure --with-libfabric=/opt/amazon/efa \
--with-mpi=/opt/amazon/openmpi \
--with-cuda=/usr/local/cuda \
--with-nccl=$CONDA_PREFIX \


&& make \
&& make install \
&& rm -rf /tmp/efa-ofi-nccl

9. Build and install TorchVision.

RUN pip install --no-cache-dir -U \
    packaging \
    mpi4py==3.0.3
RUN cd /tmp \
&& git clone https://github.com/pytorch/vision.git -b v0.9.1 \
&& cd vision \
&& BUILD_VERSION="0.9.1+cu111" python setup.py install \
&& cd /tmp \
&& rm -rf vision

10.Install and configure OpenSSH. OpenSSH is required for MPI to communicate between containers.
Allow OpenSSH to talk to containers without asking for confirmation.

RUN apt-get update \
 && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
 && apt-get install -y --no-install-recommends openssh-client openssh-server \
 && mkdir -p /var/run/sshd \
 && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
 && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
 && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
 && rm -rf /var/lib/apt/lists/*

# Configure OpenSSH so that nodes can communicate with each other


RUN mkdir -p /var/run/sshd && \
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /
etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
mkdir -p /root/.ssh/ && \
ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
&& printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

11.Install the PT S3 plug-in to efficiently access datasets in Amazon S3.

RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}


RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

12.Install the libboost library. This package is needed for the asynchronous I/O networking
functionality of the SageMaker data parallel library.

WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
 && tar -xzf boost_1_73_0.tar.gz \
 && cd boost_1_73_0 \
 && ./bootstrap.sh \
 && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
 && cd .. \
 && rm -rf boost_1_73_0.tar.gz \
 && rm -rf boost_1_73_0 \
 && cd ${CONDA_PREFIX}/include/boost

13.Install the following SageMaker tools for PyTorch training.


WORKDIR /root
RUN pip install --no-cache-dir -U \
smclarify \
"sagemaker>=2,<3" \
sagemaker-experiments==0.* \
sagemaker-pytorch-training

14.Finally, install the SageMaker data parallel binary and the remaining dependencies.

RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
jq \
libhwloc-dev \
libnuma1 \
libnuma-dev \
libssl1.1 \
libtool \
hwloc \
&& rm -rf /var/lib/apt/lists/*

RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}

15.After you finish creating the Dockerfile, see Adapting Your Own Training Container to learn how to
build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python
SDK.
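
For reference, a minimal sketch of that last step might look like the following. The image name is
hypothetical, and the adapted training script is assumed to be baked into the image as the entry point.

from sagemaker.estimator import Estimator

estimator = Estimator(
    # Hypothetical ECR image built from the Dockerfile in this section
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/smddp-custom-pytorch:latest",
    role="SageMakerRole",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    # Activate the SageMaker data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://bucket/path/to/training/data")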

The following example code shows a complete Dockerfile after combining all the previous code blocks.

# This file creates a docker image with minimum dependencies to run SageMaker data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Set appropriate versions and locations for components


ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws

# Set ENV variables required to build PyTorch


ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3

# Add OpenMPI to the path.


ENV PATH /opt/amazon/openmpi/bin:$PATH

# Add Conda to path


ENV PATH $CONDA_PREFIX/bin:$PATH

# Set this environment variable for SageMaker to launch SMDDP correctly.


ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

# Add environment variable for processes to be able to call fork()


ENV RDMAV_FORK_SAFE=1

# Indicate the container type


ENV DLC_CONTAINER_TYPE=training

# Add EFA and SMDDP to LD library path


ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/
smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH

# Install basic dependencies to download and build other dependencies


RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
curl \
wget \
git \
&& rm -rf /var/lib/apt/lists/*

# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
&& cd /tmp/efa \
&& curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-
${EFA_VERSION}.tar.gz \
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod -g \
&& rm -rf /tmp/efa

# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p $CONDA_PREFIX && \
rm ~/miniconda.sh && \
$CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml
numpy ipython && \
$CONDA_PREFIX/bin/conda clean -ya

# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
ccache \
cmake \
git \
libjpeg-dev \
libpng-dev && \
rm -rf /var/lib/apt/lists/*

# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache

# Common dependencies for PyTorch


RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi \
    typing_extensions future six requests dataclasses

# Linux specific dependency for PyTorch


RUN conda install -c pytorch magma-cuda113

# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
cd / \
&& git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}


# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.

# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
&& rm -rf nccl \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
&& cd nccl \
&& make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-
gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
&& make pkg.txz.build \
&& tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1

# Build and install PyTorch.


RUN cd /pytorch \
&& CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
python setup.py install \
&& rm -rf /pytorch

RUN ccache -C

# Build and install OFI plugin.
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
&& apt-get install -y --no-install-recommends \
autoconf \
automake \
libtool
RUN mkdir /tmp/efa-ofi-nccl \
&& cd /tmp/efa-ofi-nccl \
&& git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
&& cd aws-ofi-nccl \
&& ./autogen.sh \
&& ./configure --with-libfabric=/opt/amazon/efa \
--with-mpi=/opt/amazon/openmpi \
--with-cuda=/usr/local/cuda \
--with-nccl=$CONDA_PREFIX \
&& make \
&& make install \
&& rm -rf /tmp/efa-ofi-nccl

# Build and install Torchvision


RUN pip install --no-cache-dir -U \
packaging \
mpi4py==3.0.3
RUN cd /tmp \
&& git clone https://github.com/pytorch/vision.git -b v0.9.1 \
&& cd vision \
&& BUILD_VERSION="0.9.1+cu111" python setup.py install \
&& cd /tmp \
&& rm -rf vision

# Install OpenSSH.
# Required for MPI to communicate between containers; allow OpenSSH to talk to containers
# without asking for confirmation
RUN apt-get update \
 && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
&& apt-get install -y --no-install-recommends openssh-client openssh-server \
&& mkdir -p /var/run/sshd \
&& cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
&& mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
&& rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other


RUN mkdir -p /var/run/sshd && \


sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /
etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
mkdir -p /root/.ssh/ && \
ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
&& printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install libboost from source.
# This package is needed for smdataparallel functionality (asynchronous I/O networking).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
 && tar -xzf boost_1_73_0.tar.gz \
 && cd boost_1_73_0 \
 && ./bootstrap.sh \
 && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
 && cd .. \
 && rm -rf boost_1_73_0.tar.gz \
 && rm -rf boost_1_73_0 \
 && cd ${CONDA_PREFIX}/include/boost

# Install SageMaker PyTorch training.


WORKDIR /root
RUN pip install --no-cache-dir -U \
smclarify \
"sagemaker>=2,<3" \
sagemaker-experiments==0.* \
sagemaker-pytorch-training

# Install SageMaker data parallel binary (SMDDP)


# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
jq \
libhwloc-dev \
libnuma1 \
libnuma-dev \
libssl1.1 \
libtool \
hwloc \
&& rm -rf /var/lib/apt/lists/*

# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}

Tip
For more general information about creating a custom Dockerfile for training in SageMaker, see
Use Your Own Training Algorithms.
Tip
If you want to extend the custom Dockerfile to incorporate the SageMaker model parallel
library, see Create Your Own Docker Container with the SageMaker Distributed Model Parallel
Library (p. 1925).


SageMaker Distributed Data Parallel Configuration Tips and Pitfalls

Review the following tips and pitfalls before using SageMaker's distributed data parallel library. This list
includes tips that are applicable across frameworks.

Topics
• Data Preprocessing (p. 1858)
• Single Versus Multiple Nodes (p. 1858)
• Debug Scaling Efficiency with Debugger (p. 1858)
• Batch Size (p. 1858)
• Custom MPI Options (p. 1859)
• Use Amazon FSx and set up an optimal storage and throughput capacity (p. 1859)

Data Preprocessing
If you preprocess data during training using an external library that utilizes the CPU, you may run into a
CPU bottleneck because SageMaker distributed data parallel uses the CPU for AllReduce operations.
You may be able to improve training time by moving preprocessing steps to a library that uses GPUs or
by completing all preprocessing before training.
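For example, the following is a minimal sketch of moving a normalization step onto the GPU in a
PyTorch training loop; the function name and the normalization constants are illustrative, not part
of the library.

import torch

def normalize_on_gpu(images, device):
    # Move the raw batch to the GPU first and do the arithmetic there,
    # keeping the CPU free for the library's AllReduce work.
    mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)
    images = images.to(device, non_blocking=True).float().div_(255.0)
    return (images - mean) / std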

Single Versus Multiple Nodes


We recommend that you use this library with multiple nodes. The library can be used with a single-
host, multi-device setup (for example, a single ML compute instance with multiple GPUs); however,
when you use two or more nodes, the library’s AllReduce operation gives you significant performance
improvement. Also, on a single host, NVLink already contributes to in-node AllReduce efficiency.

Debug Scaling Efficiency with Debugger


You can use Amazon SageMaker Debugger to monitor and visualize CPU and GPU utilization and other
metrics of interest during training. You can use Debugger built-in rules to monitor computational
performance issues, such as CPUBottleneck, LoadBalancing, and LowGPUUtilization. You can
specify these rules with Debugger configurations when you define an Amazon SageMaker Python SDK
estimator. If you use AWS CLI and AWS Boto3 for training on SageMaker, you can enable Debugger as
shown in Configure Debugger Using Amazon SageMaker API.

To see an example using Debugger in a SageMaker training job, you can reference one of the notebook
examples in the SageMaker Notebook Examples GitHub repository. To learn more about Debugger, see
Amazon SageMaker Debugger.
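For example, a sketch of attaching the built-in profiler rules named above to an estimator with the
SageMaker Python SDK might look like the following; verify the exact rule names against the SDK
documentation for your version.

from sagemaker.debugger import ProfilerRule, rule_configs
from sagemaker.pytorch import PyTorch

# Built-in profiler rules for the computational performance issues mentioned above.
rules = [
    ProfilerRule.sagemaker(rule_configs.CPUBottleneck()),
    ProfilerRule.sagemaker(rule_configs.LoadBalancing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
]

estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your train script
    role="your-sagemaker-execution-role",   # placeholder
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    framework_version="1.13.1",
    py_version="py3",
    rules=rules,
)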

Batch Size
In distributed training, as more nodes are added, batch sizes should increase proportionally. To improve
convergence speed as you add more nodes to your training job and increase the global batch size,
increase the learning rate.

One way to achieve this is by using a gradual learning rate warmup where the learning rate is ramped up
from a small to a large value as the training job progresses. This ramp avoids a sudden increase of the
learning rate, allowing healthy convergence at the start of training. For example, you can use a Linear
Scaling Rule where each time the mini-batch size is multiplied by k, the learning rate is also multiplied by
k. To learn more about this technique, see the research paper, Accurate, Large Minibatch SGD: Training
ImageNet in 1 Hour, Sections 2 and 3.
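As an illustration only (the helper function and its arguments below are hypothetical, not part of any
SageMaker API), the linear scaling rule with a gradual warmup can be sketched as follows:

def warmup_scaled_lr(base_lr, k, step, warmup_steps):
    # Linear Scaling Rule: when the mini-batch size is multiplied by k,
    # the target learning rate is also multiplied by k. Ramp up linearly
    # over warmup_steps to avoid a sudden jump at the start of training.
    target_lr = base_lr * k
    if step < warmup_steps:
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr

# Example: scaling from 1 node to 4 nodes (k=4) with 500 warmup steps.
print(warmup_scaled_lr(base_lr=0.1, k=4, step=250, warmup_steps=500))  # 0.25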


Custom MPI Options


The SageMaker distributed data parallel library employs Message Passing Interface (MPI), a popular
standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s
NCCL library for GPU-level communication. When you use the data parallel library with a TensorFlow or
Pytorch Estimator, the respective container sets up the MPI environment and executes the mpirun
command to start jobs on the cluster nodes.

You can set custom MPI operations using the custom_mpi_options parameter in the Estimator.
Any mpirun flags passed in this field are added to the mpirun command and executed by SageMaker
for training. For example, you may define the distribution parameter of an Estimator using the
following to use the NCCL_DEBUG variable to print the NCCL version at the start of the program:

distribution = {'smdistributed':{'dataparallel':{'enabled': True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"}}}

Use Amazon FSx and set up an optimal storage and throughput capacity
When training a model on multiple nodes with distributed data parallelism, it is highly recommended to
use FSx for Lustre. Amazon FSx is a scalable and high-performance storage service that supports shared
file storage with a faster throughput. Using Amazon FSx storage at scale, you can achieve a faster data
loading speed across the compute nodes.

Typically, with distributed data parallelism, you would expect that the total training throughput scales
near-linearly with the number of GPUs. However, if you use suboptimal Amazon FSx storage, the training
performance might slow down due to a low Amazon FSx throughput.

For example, if you use the SCRATCH_2 deployment type of Amazon FSx file system with the minimum
1.2 TiB storage capacity, the I/O throughput capacity is 240 MB/s. Amazon FSx storage works in a way
that you can assign physical storage devices, and the more devices assigned, the larger throughput you
get. The smallest storage increment for the SCRATCH_2 type is 1.2 TiB, and the corresponding throughput
gain is 240 MB/s.

Assume that you have a model to train on a 4-node cluster over a 100 GB data set. With a given batch
size that’s optimized to the cluster, assume that the model can complete one epoch in about 30 seconds.
In this case, the minimum required I/O speed is approximately 3 GB/s (100 GB / 30 s), which is a much
higher throughput requirement than 240 MB/s. With such a limited Amazon FSx capacity, scaling
your distributed training job up to larger clusters might aggravate I/O bottleneck problems; model
training throughput might improve in later epochs as cache builds up, but Amazon FSx throughput can
still be a bottleneck.

To alleviate such I/O bottleneck problems, you should increase the Amazon FSx storage size to obtain
a higher throughput capacity. Typically, to find an optimal I/O throughput, you can experiment with
different Amazon FSx throughput capacities, assigning a capacity equal to or slightly lower than
your estimate, until you find that it is sufficient to resolve the I/O bottleneck problems. In the
aforementioned example, Amazon FSx storage with 2.4 GB/s throughput and 67 GB of RAM cache would
be sufficient. If the file system has an optimal throughput, the model training throughput should reach
its maximum either immediately or after the first epoch as the cache builds up.

To learn more about how to increase Amazon FSx storage and deployment types, see the following
pages in the Amazon FSx for Lustre documentation:

• How to increase storage capacity
• Aggregate file system performance


Amazon SageMaker Data Parallel Library FAQ


Use the following to find answers to commonly asked questions about SageMaker's data parallelism
library.

Q: When using the library, how are the allreduce-supporting CPU instances managed? Do I have to
create heterogeneous CPU-GPU clusters, or does the SageMaker service create extra C5s for jobs that
use the library?

The library uses the CPUs available with GPU instances. No additional C5 or CPU instances are launched;
if your SageMaker training job is 8-node ml.p3dn.24xlarge clusters, only 8 ml.p3dn.24xlarge
instances are used. No additional instances are provisioned.

Q: I have a training job taking 5 days on a single ml.p3.24xlarge instance with a set of
hyperparameters H1 (learning rate, batch size, optimizer, etc). Is using SageMaker's data parallelism
library and a five-time bigger cluster enough to achieve an approximate five-time speedup? Or do I
have to revisit its training hyperparameters after activating the library?

The library changes the overall batch size. The new overall batch size is scaled linearly with the number
of training instances used. As a result of this, hyperparameters, such as learning rate, have to be changed
to ensure convergence.

Q: Does the library support Spot?

Yes. You can use managed spot training. Specify the path to the checkpoint file in the SageMaker
training job, and save and restore checkpoints in your training script as mentioned in the last steps of the
section called “TensorFlow” (p. 1839) and the section called “PyTorch” (p. 1842).
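A minimal PyTorch-style sketch of such checkpointing logic follows; the file and key names are
illustrative, and /opt/ml/checkpoints is the local directory that SageMaker syncs with the
checkpoint_s3_uri you specify.

import os
import torch

CHECKPOINT_PATH = "/opt/ml/checkpoints/checkpoint.pt"

def save_checkpoint(model, optimizer, epoch):
    # Files saved under /opt/ml/checkpoints are uploaded to the S3 checkpoint URI.
    torch.save(
        {"epoch": epoch,
         "model_state_dict": model.state_dict(),
         "optimizer_state_dict": optimizer.state_dict()},
        CHECKPOINT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint after a spot interruption restarts the job.
    if os.path.exists(CHECKPOINT_PATH):
        ckpt = torch.load(CHECKPOINT_PATH)
        model.load_state_dict(ckpt["model_state_dict"])
        optimizer.load_state_dict(ckpt["optimizer_state_dict"])
        return ckpt["epoch"] + 1
    return 0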

Q: Is the library relevant in a single-host, multi-device setup?

The library can be used in single-host multi-device training but the library offers performance
improvements only in multi-host training.

Q: Where should the training dataset be stored?

The training dataset can be stored in an Amazon S3 bucket or on an Amazon FSx drive. See this
document for various supported input file systems for a training job.

Q: When using the library, is it mandatory to have training data in FSx for Lustre? Can Amazon EFS
and Amazon S3 be used?

We generally recommend you use Amazon FSx because of its lower latency and higher throughput. If you
prefer, you can use Amazon EFS or Amazon S3.

Q: Can the library be used with CPU nodes?

No. The library supports ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge instances at
this time.

Q: What frameworks and framework versions are currently supported by the library at launch?

The library currently supports PyTorch v1.6.0 or later and TensorFlow v2.3.0 or later. It doesn't support
TensorFlow 1.x. For more information about which version of the library is packaged within AWS deep
learning containers, see Release Notes for Deep Learning Containers.

Q: Does the library support AMP?

Yes, SageMaker's distributed data parallelism library supports Automatic Mixed Precision (AMP) out of
the box. No extra action is needed to use AMP other than the framework-level modifications to your
training script. If gradients are in FP16, the SageMaker data parallelism library runs its AllReduce
operation in FP16. For more information about implementing AMP APIs to your training script, see the
following resources, as well as the sketch after this list:

• Frameworks - PyTorch in the NVIDIA Deep Learning Performance documentation
• Frameworks - TensorFlow in the NVIDIA Deep Learning Performance documentation
• Automatic Mixed Precision for Deep Learning in the NVIDIA Developer Docs
• Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs in the
PyTorch Blog
• TensorFlow mixed precision APIs in the TensorFlow documentation
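For reference, a minimal sketch of the framework-level changes for native PyTorch AMP, assuming
model, optimizer, loss_fn, and data_loader are already defined in your training script:

import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in data_loader:
    optimizer.zero_grad()
    # Run the forward pass in mixed precision.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    # Scale the loss, backpropagate, then unscale and step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()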

Q: How do I identify if my distributed training job is slowed down due to I/O bottleneck?

With a larger cluster, the training job requires more I/O throughput, and therefore the training
throughput might take longer (more epochs) to ramp up to the maximum performance. This indicates
that I/O is being bottlenecked and cache is harder to build up as you scale nodes up (higher throughput
requirement and more complex network topology). For more information about monitoring the Amazon
FSx throughput on CloudWatch, see Monitoring FSx for Lustre in the FSx for Lustre User Guide.

Q: How do I resolve I/O bottlenecks when running a distributed training job with data parallelism?

We highly recommend that you use Amazon FSx as your data channel if you are using Amazon S3. If
you are already using Amazon FSx but still having I/O bottleneck problems, you might have set up your
Amazon FSx file system with a low I/O throughput and a small storage capacity. For more information
about how to estimate and choose the right size of I/O throughput capacity, see Use Amazon FSx and set
up an optimal storage and throughput capacity (p. 1859).

Q: (For the library v1.4.0 or later) How do I resolve the Invalid backend error while initializing a
process group?

If you encounter the error message ValueError: Invalid backend: 'smddp' when calling
init_process_group, this is due to the breaking change in the library v1.4.0 and later. You must
import the PyTorch client of the library, smdistributed.dataparallel.torch.torch_smddp,
which registers smddp as a backend for PyTorch. To learn more, see Use the SageMaker Distributed Data
Parallel Library as the Backend of torch.distributed (p. 1842).
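In code, the required change looks like the following; this init_process_group pattern matches the
library's own deprecation message quoted later in this guide.

import torch.distributed as dist

# Importing the PyTorch client of the library registers 'smddp' as a backend.
import smdistributed.dataparallel.torch.torch_smddp

dist.init_process_group(backend="smddp")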

Q: (For the library v1.4.0 or later) I would like to call the collective primitives of the torch.distributed
interface. Which primitives does the smddp backend support?

In v1.4.0, the library supports all_reduce, broadcast, reduce, all_gather, and barrier.

Q: (For the library v1.4.0 or later) Does this new API work with other custom DDP classes or libraries
like Apex DDP?

The SageMaker data parallel library is tested with other third-party distributed data parallel libraries and
framework implementations that use the torch.distributed modules. Using the SageMaker data
parallel library with custom DDP classes works as long as the collectives used by the custom DDP classes
are supported by the library. See the preceding question for a list of supported collectives. If you have
these use cases and need further support, reach out to the SageMaker team through the AWS Support
Center or AWS Developer Forums for Amazon SageMaker.

Q: Does the library support the bring-your-own-container (BYOC) option? If so, how do I install the
library and run a distributed training job by writing a custom Dockerfile?

If you want to integrate the SageMaker data parallel library and its minimum dependencies in your
own Docker container, BYOC is the right approach. You can build your own container using the binary
file of the library. The recommended process is to write a custom Dockerfile with the library and its
dependencies, build the Docker container, host it in Amazon ECR, and use the ECR image URI to launch
a training job using the SageMaker generic estimator class. For more instructions on how to prepare a


custom Dockerfile for distributed training in SageMaker with the SageMaker data parallel library, see
Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library (p. 1850).

Data Parallel Troubleshooting


If you have problems in running a training job when you use the library, use the following list to try
to troubleshoot. If you need further support, reach out to the SageMaker team through the AWS Support
Center or AWS Developer Forums for Amazon SageMaker.

Topics
• Using SageMaker Distributed Data Parallel with Amazon SageMaker Debugger and
Checkpoints (p. 1862)
• An Unexpected Prefix Attached to Model Parameter Keys (p. 1863)
• SageMaker Distributed Training Job Stalling During Initialization (p. 1863)
• SageMaker Distributed Training Job Stalling at the End of Training (p. 1863)
• Observing Scaling Efficiency Degradation Due to Amazon FSx Throughput Bottlenecks (p. 1864)
• SageMaker Distributed Training Job with PyTorch Returns Deprecation Warnings (p. 1864)

Using SageMaker Distributed Data Parallel with Amazon SageMaker Debugger and Checkpoints
To monitor system bottlenecks, profile framework operations, and debug model output tensors for
training jobs with SageMaker distributed data parallel, use Amazon SageMaker Debugger.

However, when you use SageMaker Debugger, SageMaker distributed data parallel, and SageMaker
checkpoints, you might see an error that looks like the following example.

SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled

This is due to an internal error between Debugger and checkpoints, which occurs when you enable
SageMaker distributed data parallel.

• If you enable all three features, SageMaker Python SDK automatically turns off Debugger by passing
debugger_hook_config=False, which is equivalent to the following framework estimator
example.

bucket=sagemaker.Session().default_bucket()
base_job_name="sagemaker-checkpoint-test"
checkpoint_in_bucket="checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

estimator = TensorFlow(
    ...
    distribution={"smdistributed": {"dataparallel": { "enabled": True }}},
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path="/opt/ml/checkpoints",
    debugger_hook_config=False
)

• If you want to keep using both SageMaker distributed data parallel and SageMaker Debugger, a
workaround is manually adding checkpointing functions to your training script instead of specifying
the checkpoint_s3_uri and checkpoint_local_path parameters from the estimator.
For more information about setting up manual checkpointing in a training script, see Saving
Checkpoints (p. 1939).


An Unexpected Prefix Attached to Model Parameter Keys


For PyTorch distributed training jobs, an unexpected prefix (model for example) might be attached to
state_dict keys (model parameters). The SageMaker data parallel library does not directly alter or
prepend any model parameter names when PyTorch training jobs save model artifacts. PyTorch's
distributed training changes the names in the state_dict to go over the network, prepending the
prefix. If you encounter a model failure problem due to different parameter names while you are using
the SageMaker data parallel library and checkpointing for PyTorch training, adapt the following example
code to remove the prefix at the step you load checkpoints in your training script.

state_dict = {k.partition('model.')[2]:state_dict[k] for k in state_dict.keys()}

This takes each state_dict key as a string value, separates the string at the first occurrence of
'model.', and takes the third list item (with index 2) of the partitioned string.

For more information about the prefix issue, see a discussion thread at Prefix parameter names in saved
model if trained by multi-GPU? in the PyTorch discussion forum.

For more information about the PyTorch methods for saving and loading models, see Saving & Loading
Model Across Devices in the PyTorch documentation.

SageMaker Distributed Training Job Stalling During Initialization


If your SageMaker distributed data parallel training job stalls during initialization when using
EFA-enabled instances (ml.p3dn.24xlarge and ml.p4d.24xlarge), this might be due to a
misconfiguration in the security group of the VPC subnet that's used for the training job. EFA requires a
proper security group configuration to enable traffic between the nodes.

To configure inbound and outbound rules for the security group

1. Sign in to the AWS Management Console and open the Amazon VPC console at
https://console.aws.amazon.com/vpc/.
2. Choose Security Groups in the left navigation pane.
3. Select the security group that's tied to the VPC subnet you use for training.
4. In the Details section, copy the Security group ID.
5. On the Inbound rules tab, choose Edit inbound rules.
6. On the Edit inbound rules page, do the following:
a. Choose Add rule.
b. For Type, choose All traffic.
c. For Source, choose Custom, paste the security group ID into the search box, and select the security
group that pops up.
7. Choose Save rules to finish configuring the inbound rule for the security group.
8. On the Outbound rules tab, choose Edit outbound rules.
9. Repeat steps 6 and 7 to add the same rule as an outbound rule.

After you complete the preceding steps for configuring the security group with the inbound and
outbound rules, rerun the training job and verify if the stalling issue is resolved.

For more information about configuring security groups for VPC and EFA, see Security groups for your
VPC and Elastic Fabric Adapter.

SageMaker Distributed Training Job Stalling at the End of Training


One of the root causes of stalling issues at the end of training is a mismatch in the number of batches
that are processed per epoch across different ranks. All workers (GPUs) synchronize their local gradients


in the backward pass to ensure they all have the same copy of the model at the end of the batch
iteration. If the batch sizes are unevenly assigned to different worker groups during the final epoch of
training, the training job stalls. For example, while a group of workers (group A) finishes processing all
batches and exits the training loop, another group of workers (group B) starts processing another batch
and still expects communication from group A to synchronize the gradients. This causes group B to wait
for group A, which already completed training and does not have any gradients to synchronize.

Therefore, when setting up your training dataset, it is important that each worker gets the same number
of data samples so that each worker goes through the same number of batches while training. Make sure
each rank gets the same number of batches to avoid this stalling issue.
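One way to enforce this in PyTorch is to let the distributed sampler and the data loader drop the
uneven tail of the dataset, as in the following sketch; dataset, rank, and world_size are assumed to
come from your own distributed setup.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# drop_last=True on both the sampler and the loader keeps the number of
# batches identical across ranks, avoiding the end-of-training stall.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, drop_last=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler, drop_last=True)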

Observing Scaling Efficiency Degradation Due to Amazon FSx Throughput Bottlenecks
One potential cause of lowered scaling efficiency is the FSx throughput limit. If you observe a sudden
drop in scaling efficiency when you switch to a larger training cluster, try using a larger FSx for Lustre
file system with a higher throughput limit. For more information, see Aggregate file system performance
and Managing storage and throughput capacity in the Amazon FSx for Lustre User Guide.

SageMaker Distributed Training Job with PyTorch Returns Deprecation Warnings


Since v1.4.0, the SageMaker distributed data parallelism library works as a backend of PyTorch
distributed. Because of the breaking change of using the library with PyTorch, you might encounter a
warning message that the smdistributed APIs for the PyTorch distributed package are deprecated.
The warning message should be similar to the following:

smdistributed.dataparallel.torch.dist is deprecated in the SageMaker distributed data
parallel library v1.4.0+.
Please use torch.distributed and specify 'smddp' as a backend when initializing process
group as follows:
torch.distributed.init_process_group(backend='smddp')
For more information, see the library's API documentation at
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html

In v1.4.0 and later, the library only needs to be imported once at the top of your training script and set
as the backend during the PyTorch distributed initialization. With the single line of backend specification,
you can keep your PyTorch training script unchanged and directly use the PyTorch distributed modules.
See Modify a PyTorch Training Script (p. 1842) to learn about the breaking changes and the new way to
use the library with PyTorch.

SageMaker's Model Parallelism Library


Use Amazon SageMaker's model parallel library to train large deep learning (DL) models that are difficult
to train due to GPU memory limitations. The library automatically and efficiently splits a model across
multiple GPUs and instances. Using the library, you can achieve a target prediction accuracy faster by
efficiently training larger DL models with billions or trillions of parameters.

You can use the library to automatically partition your own TensorFlow and PyTorch models across
multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through
the SageMaker Python SDK.

Use the following sections to learn more about model parallelism and the SageMaker model parallel
library. This library's API documentation is located at Distributed Training APIs in the SageMaker Python
SDK documentation.

To track the latest updates of the library, see the SageMaker Model Parallel Release Notes in the
SageMaker Python SDK documentation.

Topics


• Introduction to Model Parallelism (p. 1865)
• Supported Frameworks and AWS Regions (p. 1872)
• Core Features of the SageMaker Model Parallelism Library (p. 1875)
• Run a SageMaker Distributed Training Job with Model Parallelism (p. 1906)
• Checkpointing and Fine-Tuning a Model with Model Parallelism (p. 1926)
• SageMaker Distributed Model Parallelism Best Practices (p. 1932)
• The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls (p. 1935)
• Model Parallel Troubleshooting (p. 1938)

Introduction to Model Parallelism


Model parallelism is a distributed training method in which the deep learning model is partitioned across
multiple devices, within or across instances. This introduction page provides a high-level overview about
model parallelism, a description of how it can help overcome issues that arise when training DL models
that are typically very large in size, and examples of what the SageMaker model parallel library offers to
help manage model parallel strategies as well as memory consumption.

What is Model Parallelism?


Increasing the size of deep learning models (layers and parameters) yields better accuracy for complex
tasks such as computer vision and natural language processing. However, there is a limit to the
maximum model size you can fit in the memory of a single GPU. When training DL models, GPU memory
limitations can be bottlenecks in the following ways:

• They limit the size of the model you can train, since the memory footprint of a model scales
proportionally to the number of parameters.
• They limit the per-GPU batch size during training, driving down GPU utilization and training efficiency.

To overcome the limitations associated with training a model on a single GPU, SageMaker provides
the model parallel library to help distribute and train DL models efficiently on multiple compute
nodes. Furthermore, with the library, you can achieve most optimized distributed training using EFA-
supported devices, which enhance the performance of inter-node communication with low latency, high
throughput, and OS bypass.

Estimate Memory Requirements Before Using Model Parallelism


Before you use the SageMaker model parallel library, consider the following to get a sense of the
memory requirements of training large DL models.

For a training job that uses AMP (FP16) and Adam optimizers, the required GPU memory per parameter
is about 20 bytes, which we can break down as follows:

• An FP16 parameter ~ 2 bytes
• An FP16 gradient ~ 2 bytes
• An FP32 optimizer state ~ 8 bytes based on the Adam optimizers
• An FP32 copy of parameter ~ 4 bytes (needed for the optimizer apply (OA) operation)
• An FP32 copy of gradient ~ 4 bytes (needed for the OA operation)

Even a relatively small DL model with 10 billion parameters can require at least 200 GB of memory,
which is much larger than the typical GPU memory (for example, NVIDIA A100 with 40 GB/80 GB of
memory and V100 with 16/32 GB) available on a single GPU. Note that on top of the memory
requirements for model and optimizer states, there are other memory consumers, such as activations
generated in the forward pass. The memory required can be a lot greater than 200 GB.
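The arithmetic behind this estimate can be checked with a few lines of Python:

params = 10e9                          # a 10-billion-parameter model
bytes_per_param = 2 + 2 + 8 + 4 + 4    # FP16 param + FP16 grad + FP32 optimizer
                                       # state + FP32 param copy + FP32 grad copy
print(params * bytes_per_param / 1e9)  # 200.0 GB, before counting activations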

1865
Amazon SageMaker Developer Guide
SageMaker's Model Parallelism Library

For distributed training, we recommend that you use Amazon EC2 P3 and P4 instances that have NVIDIA
V100 and A100 Tensor Core GPUs respectively. For more details about specifications such as CPU cores,
RAM, attached storage volume, and network bandwidth, see the Accelerated Computing section in the
Amazon EC2 Instance Types page.

Even with these accelerated computing instances, models with about 10 billion parameters, such as
Megatron-LM and T5, and even larger models with hundreds of billions of parameters, such as GPT-3,
cannot fit a model replica in each GPU device.

How the Library Employs Model Parallelism and Memory Saving Techniques
The library consists of various types of model parallelism features and memory-saving features such as
optimizer state sharding, activation checkpointing, and activation offloading. All these techniques can be
combined to efficiently train large models that consist of hundreds of billions of parameters.

Topics
• Sharded data parallelism (available for PyTorch) (p. 1866)
• Pipeline parallelism (available for PyTorch and TensorFlow) (p. 1866)
• Tensor parallelism (available for PyTorch) (p. 1868)
• Optimizer state sharding (available for PyTorch) (p. 1870)
• Activation offloading and checkpointing (available for PyTorch) (p. 1872)
• Choosing the right techniques for your model (p. 1872)

Sharded data parallelism (available for PyTorch)


Sharded data parallelism is a memory-saving distributed training technique that splits the state of a
model (model parameters, gradients, and optimizer states) across GPUs within a data-parallel group.

SageMaker implements sharded data parallelism through the implementation of MiCS, which is a library
that minimizes communication scale and discussed in the blog post Near-linear scaling of gigantic-
model training on AWS.

You can apply sharded data parallelism to your model as a stand-alone strategy. Furthermore, if
you are using the most performant GPU instances equipped with NVIDIA A100 Tensor Core GPUs,
ml.p4d.24xlarge, you can take advantage of the improved training speed from the AllGather
operation offered by SMDDP Collectives.

To dive deep into sharded data parallelism and learn how to set it up or use a combination of sharded
data parallelism with other techniques like tensor parallelism and FP16 training, see the section called
“Sharded Data Parallelism” (p. 1876).

Pipeline parallelism (available for PyTorch and TensorFlow)


Pipeline parallelism partitions the set of layers or operations across the set of devices,
leaving each operation intact. When you specify a value for the number of model partitions
(pipeline_parallel_degree), the total number of GPUs (processes_per_host) must be divisible
by the number of the model partitions. To set this up properly, you have to specify the correct values
for the pipeline_parallel_degree and processes_per_host parameters. The simple math is as
follows:

(pipeline_parallel_degree) x (data_parallel_degree) = processes_per_host

The library takes care of calculating the number of model replicas (also called
data_parallel_degree) given the two input parameters you provide.

For example, if you set "pipeline_parallel_degree": 2 and "processes_per_host": 8 to use
an ML instance with eight GPU workers such as ml.p3.16xlarge, the library automatically sets up the
distributed model across the GPUs and four-way data parallelism. The following image illustrates how
a model is distributed across the eight GPUs achieving four-way data parallelism and two-way pipeline
parallelism. Each model replica, where we define it as a pipeline parallel group and label it as PP_GROUP,
is partitioned across two GPUs. Each partition of the model is assigned to four GPUs, where the four
partition replicas are in a data parallel group and labeled as DP_GROUP. Without tensor parallelism, the
pipeline parallel group is essentially the model parallel group.
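Following the estimator configuration conventions shown later in this guide, a sketch of the
distribution configuration for this example would be:

distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "pipeline_parallel_degree": 2,  # two-way pipeline parallelism
            },
        },
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,  # 8 GPUs / 2 partitions = four-way data parallelism
    },
}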

To dive deep into pipeline parallelism, see Core Features of the SageMaker Model Parallelism
Library (p. 1875).

To get started with running your model using pipeline parallelism, see Run a SageMaker Distributed
Training Job with the SageMaker Model Parallel Library.


Tensor parallelism (available for PyTorch)

Tensor parallelism splits individual layers, or nn.Modules, across devices, to be run in parallel. The
following figure shows the simplest example of how the library splits a model with four layers to achieve
two-way tensor parallelism ("tensor_parallel_degree": 2). The layers of each model replica are
bisected and distributed into two GPUs. In this example case, the model parallel configuration also
includes "pipeline_parallel_degree": 1 and "ddp": True (uses PyTorch DistributedDataParallel
package in the background), so the degree of data parallelism becomes eight. The library manages
communication across the tensor-distributed model replicas.


The usefulness of this feature lies in the fact that you can select specific layers, or a subset of layers,
to apply tensor parallelism to. To dive deep into tensor parallelism and other memory-saving features
for PyTorch, and to learn how to set a combination of pipeline and tensor parallelism, see Tensor
Parallelism (p. 1890).
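A sketch of the model parallel configuration described in this example, mirroring the smp_options
structure used in the estimator examples later in this guide:

smp_options = {
    "enabled": True,
    "parameters": {
        "tensor_parallel_degree": 2,   # two-way tensor parallelism
        "pipeline_parallel_degree": 1,
        "ddp": True,                   # uses PyTorch DistributedDataParallel
    },
}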

Optimizer state sharding (available for PyTorch)

To understand how the library performs optimizer state sharding, consider a simple example model
with four layers. The key idea in optimizer state sharding is that you don't need to replicate your
optimizer state in all of your GPUs. Instead, a single replica of the optimizer state is sharded across
data-parallel ranks, with no redundancy across devices. For example, GPU 0 holds the optimizer state
for layer one, GPU 1 holds the optimizer state for layer two, and so on. The following animated figure
shows a backward propagation with the optimizer state sharding technique. At the end of the backward
propagation, there's compute and network time for the optimizer apply (OA) operation to update
optimizer states and the all-gather (AG) operation to update the model parameters for the next
iteration. Most importantly, the reduce operation can overlap with the compute on GPU 0, resulting
in a more memory-efficient and faster backward propagation. In the current implementation, AG and
OA operations do not overlap with the compute. This can result in an extended computation during the
AG operation, so there might be a tradeoff.


For more information about how to use this feature, see Optimizer State Sharding.

Activation offloading and checkpointing (available for PyTorch)


To save GPU memory, the library supports activation checkpointing to avoid storing internal activations
in the GPU memory for user-specified modules during the forward pass. The library recomputes these
activations during the backward pass. In addition, the activation offloading feature offloads the stored
activations to CPU memory and fetches back to GPU during the backward pass to further reduce
activation memory footprint. For more information about how to use these features, see Activation
Checkpointing and Activation Offloading.

Choosing the right techniques for your model


For more information about choosing the right techniques and configurations, see SageMaker
Distributed Model Parallel Best Practices and Configuration Tips and Pitfalls.

Supported Frameworks and AWS Regions


Before using the SageMaker model parallelism library, check the supported frameworks and instance
types, and determine if there are enough quotas in your AWS account and AWS Region.
Note
To check the latest updates and release notes of the library, see the SageMaker Model Parallel
Release Notes in the SageMaker Python SDK documentation.

Supported Frameworks
The SageMaker model parallelism library supports the following deep learning frameworks and is
available in AWS Deep Learning Containers (DLC) or downloadable as a binary file.

PyTorch versions supported by SageMaker and the SageMaker model parallelism library

PyTorch v2.0.0
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.15.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
• URL of the binary file**:
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl

PyTorch v1.13.1
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.15.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
• URL of the binary file**:
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl

PyTorch v1.12.1
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.13.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
• URL of the binary file**:
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl

PyTorch v1.12.0
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.11.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
• URL of the binary file**:
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl

PyTorch v1.11.0
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.10.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
• URL of the binary file**:
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl

PyTorch v1.10.2
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.7.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
• URL of the binary file**: -

PyTorch v1.10.0
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.5.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
• URL of the binary file**: -

PyTorch v1.9.1
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.4.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
• URL of the binary file**: -

PyTorch v1.8.1*
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.6.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
• URL of the binary file**: -


Note
The SageMaker model parallelism library v1.6.0 and later provides extended features for
PyTorch. For more information, see Core Features of the SageMaker Model Parallelism
Library (p. 1875).

** The URLs of the binary files are for installing the SageMaker model parallelism library in custom
containers. For more information, see the section called “Create Your Own Docker Container with the
Library” (p. 1925).

TensorFlow versions supported by SageMaker and the SageMaker model parallelism library

TensorFlow v2.6.0
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.4.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.6.0-gpu-py38-cu112-ubuntu20.04

TensorFlow v2.5.1
• SageMaker model parallelism library version: smdistributed-modelparallel==v1.4.0
• smdistributed-modelparallel integrated DLC image URI:
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.1-gpu-py37-cu112-ubuntu18.04

Hugging Face Transformers versions supported by SageMaker and the SageMaker distributed data
parallel library

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch
and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and
paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers and the Prior Hugging
Face Container Versions.

AWS Regions
The SageMaker model parallelism library is available in all of the AWS Regions where the AWS Deep Learning
Containers for SageMaker are in service. For more information, see Available Deep Learning Containers
Images.

Supported Instance Types


The SageMaker model parallelism library requires one of the following ML instance types.

Instance type

ml.g4dn.12xlarge

ml.p3.16xlarge

ml.p3dn.24xlarge

ml.p4d.24xlarge

ml.p4de.24xlarge

For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance
Types page. For information about instance pricing, see Amazon SageMaker Pricing.


If you encountered an error message similar to the following, follow the instructions at Request a service
quota increase for SageMaker resources.

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.

Core Features of the SageMaker Model Parallelism Library


Amazon SageMaker's model parallelism library offers distribution strategies and memory-saving
techniques, such as sharded data parallelism, tensor parallelism, model partitioning by layers for pipeline
scheduling, and checkpointing. The model parallelism strategies and techniques help distribute large
models across multiple devices while optimizing training speed and memory consumption. The library
also provides Python helper functions, context managers, and wrapper functions to adapt your training
script for automated or manual partitioning of your model.

When you implement model parallelism in your training job, you keep the same two-step workflow
shown in the Run a SageMaker Distributed Training Job with Model Parallelism section. To adapt your
training script, you add few or no additional lines of code. To launch a training job with the adapted
training script, you need to set the distribution configuration parameters to activate the memory-saving
features or to pass values for the degree of parallelism.

To get started with examples, see the following Jupyter notebooks that demonstrate how to use the
SageMaker model parallelism library.

• PyTorch example notebooks
• TensorFlow example notebooks

To dive deep into the core features of the library, see the following topics.
Note
The SageMaker distributed training libraries are available through the AWS deep learning
containers for PyTorch, Hugging Face, and TensorFlow within the SageMaker Training platform.
To utilize the features of the distributed training libraries, we recommend that you use the
SageMaker Python SDK. You can also manually configure in JSON request syntax if you use
SageMaker APIs through SDK for Python (Boto3) or AWS Command Line Interface. Throughout
the documentation, instructions and examples focus on how to use the distributed training
libraries with the SageMaker Python SDK.
Important
The SageMaker model parallelism library supports all the core features for PyTorch, and
supports pipeline parallelism for TensorFlow.

Topics
• Sharded Data Parallelism (p. 1876)
• Pipelining a Model (p. 1887)
• Tensor Parallelism (p. 1890)
• Optimizer State Sharding (p. 1901)
• Activation Checkpointing (p. 1902)
• Activation Offloading (p. 1903)
• FP16 Training with Model Parallelism (p. 1904)
• Support for FlashAttention (p. 1906)


Sharded Data Parallelism


Sharded data parallelism is a memory-saving distributed training technique that splits the state of a
model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group.
Note
Sharded data parallelism is available for PyTorch in the SageMaker model parallelism library
v1.11.0 and later.

When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint
of the model by sharding the training state of the model over multiple GPUs. This provides two benefits:
you can fit larger models, which would otherwise run out of memory with standard data parallelism, or
you can increase the batch size using the freed-up GPU memory.

The standard data parallelism technique replicates the training states across the GPUs in the data
parallel group, and performs gradient aggregation based on the AllReduce operation. Sharded data
parallelism modifies the standard data-parallel distributed training procedure to account for the sharded
nature of the optimizer states. A group of ranks over which the model and optimizer states are sharded
is called a sharding group. The sharded data parallelism technique shards the trainable parameters of a
model and corresponding gradients and optimizer states across the GPUs in the sharding group.

SageMaker achieves sharded data parallelism through the implementation of MiCS, which is discussed
in the AWS blog post Near-linear scaling of gigantic-model training on AWS. In this implementation, you
can set the sharding degree as a configurable parameter, which must be less than the data parallelism
degree. During each forward and backward pass, MiCS temporarily recombines the model parameters
in all GPUs through the AllGather operation. After the forward or backward pass of each layer, MiCS
shards the parameters again to save GPU memory. During the backward pass, MiCS reduces gradients
and simultaneously shards them across GPUs through the ReduceScatter operation. Finally, MiCS
applies the local reduced and sharded gradients to their corresponding local parameter shards, using
the local shards of optimizer states. To bring down communication overhead, the SageMaker model
parallelism library prefetches the upcoming layers in the forward or backward pass, and overlaps the
network communication with the computation.

The training state of the model is replicated across the sharding groups. This means that before
gradients are applied to the parameters, the AllReduce operation must take place across the sharding
groups, in addition to the ReduceScatter operation that takes place within the sharding group.

In effect, sharded data parallelism introduces a tradeoff between the communication overhead and GPU
memory efficiency. Using sharded data parallelism increases the communication cost, but the memory
footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data
parallelism degree, thus larger models can be fit in the GPU cluster.

Selecting the degree of sharded data parallelism

When you select a value for the degree of sharded data parallelism, the value must evenly divide the
degree of data parallelism. For example, for an 8-way data parallelism job, choose 2, 4, or 8 for the
sharded data parallelism degree. While choosing the sharded data parallelism degree, we recommend
that you start with a small number, and gradually increase it until the model fits in the memory together
with the desired batch size.

Selecting the batch size

After setting up sharded data parallelism, make sure you find the most optimal training configuration
that can successfully run on the GPU cluster. For training large language models (LLMs), start from
batch size 1, and gradually increase it until you reach the point where you receive the out-of-memory
(OOM) error. If you encounter the OOM error even with the smallest batch size, apply a higher degree of
sharded data parallelism or a combination of sharded data parallelism and tensor parallelism.

Topics


• How to apply sharded data parallelism to your training job (p. 1877)
• Reference configurations (p. 1878)
• Sharded data parallelism with SMDDP Collectives (p. 1879)
• Mixed precision training with sharded data parallelism (p. 1882)
• Sharded data parallelism with tensor parallelism (p. 1883)
• Tips and considerations for using sharded data parallelism (p. 1886)

How to apply sharded data parallelism to your training job

To get started with sharded data parallelism, apply the required modifications to your training script, and
set up the SageMaker PyTorch estimator with the sharded-data-parallelism-specific parameters. Also
consider taking the reference configurations and example notebooks as a starting point.

Adapt your PyTorch training script

Follow the instructions at Step 1: Modify a PyTorch Training Script (p. 1915) to wrap the model
and optimizer objects with the smdistributed.modelparallel.torch wrappers of the
torch.nn.parallel and torch.distributed modules.
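At a high level, the wrapping looks like the following sketch; MyModel is a placeholder for your own
torch.nn.Module, and the linked section covers the complete set of changes.

import torch
import smdistributed.modelparallel.torch as smp

smp.init()

model = MyModel()  # your torch.nn.Module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Wrap the model and optimizer with the library's distributed wrappers.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)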

(Optional) Additional modification to register external model parameters

If your model is built with torch.nn.Module and uses parameters that are not defined within the
module class, you should register them to the module manually so that SMP can gather the full
parameters while training. To register parameters to a module, use
smp.register_parameter(module, parameter).

class Module(torch.nn.Module):
    def __init__(self, *args):
        super().__init__()
        self.layer1 = Layer1()
        self.layer2 = Layer2()
        smp.register_parameter(self, self.layer1.weight)

    def forward(self, input):
        x = self.layer1(input)
        # self.layer1.weight is required by self.layer2.forward
        y = self.layer2(x, self.layer1.weight)
        return y

Set up the SageMaker PyTorch estimator

When configuring a SageMaker PyTorch estimator in the section called “Step 2: Launch a Training
Job” (p. 1921), add the parameters for sharded data parallelism.

To turn on sharded data parallelism, add the sharded_data_parallel_degree parameter to the
SageMaker PyTorch estimator. This parameter specifies the number of GPUs over which the training
state is sharded. The value for sharded_data_parallel_degree must be an integer between one and
the data parallelism degree and must evenly divide the data parallelism degree. Note that the library
automatically detects the number of GPUs, and thus the data parallel degree. The following additional
parameters are available for configuring sharded data parallelism.

• "sdp_reduce_bucket_size" (int, default: 5e8) – Specifies the size of PyTorch DDP gradient buckets
in number of elements of the default dtype.
• "sdp_param_persistence_threshold" (int, default: 1e6) – Specifies the size of a parameter tensor
in number of elements that can persist at each GPU. Sharded data parallelism splits each parameter
tensor across GPUs of a data parallel group. If the number of elements in the parameter tensor
is smaller than this threshold, the parameter tensor is not split; this helps reduce communication
overhead because the parameter tensor is replicated across data-parallel GPUs.


• "sdp_max_live_parameters" (int, default: 1e9) – Specifies the maximum number of parameters


that can simultaneously be in a recombined training state during the forward and backward pass.
Parameter fetching with the AllGather operation pauses when the number of active parameters
reaches the given threshold. Note that increasing this parameter increases the memory footprint.
• "sdp_hierarchical_allgather" (bool, default: True) – If set to True, the AllGather operation
runs hierarchically: it runs within each node first, and then runs across nodes. For multi-node
distributed training jobs, the hierarchical AllGather operation is automatically activated.
• "sdp_gradient_clipping" (float, default: 1.0) – Specifies a threshold for gradient clipping the
L2 norm of the gradients before propagating them backward through the model parameters. When
sharded data parallelism is activated, gradient clipping is also activated. The default threshold is 1.0.
Adjust this parameter if you have the exploding gradients problem.

The following code shows an example of how to configure sharded data parallelism.

import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        # "pipeline_parallel_degree": 1,      # Optional, default is 1
        # "tensor_parallel_degree": 1,        # Optional, default is 1
        "ddp": True,
        # parameters for sharded data parallelism
        "sharded_data_parallel_degree": 2,    # Add this to activate sharded data parallelism
        "sdp_reduce_bucket_size": int(5e8),             # Optional
        "sdp_param_persistence_threshold": int(1e6),    # Optional
        "sdp_max_live_parameters": int(1e9),            # Optional
        "sdp_hierarchical_allgather": True,             # Optional
        "sdp_gradient_clipping": 1.0                    # Optional
    }
}

mpi_options = {
    "enabled" : True,           # Required
    "processes_per_host" : 8    # Required
}

smp_estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your train script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-job"
)

smp_estimator.fit('s3://my_bucket/my_training_data/')

Reference configurations

The SageMaker distributed training team provides the following reference configurations that you
can use as a starting point. You can extrapolate from the following configurations to experiment and
estimate the GPU memory usage for your model configuration.


Sharded data parallelism with SMDDP Collectives

Model (number of parameters) | Num instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree
GPT-NEOX-20B | 2 | ml.p4d.24xlarge | 2048 | 64 | 4 | 16
GPT-NEOX-20B | 8 | ml.p4d.24xlarge | 2048 | 768 | 12 | 32

For example, if you increase the sequence length for a 20-billion-parameter model or increase the
size of the model to 65 billion parameters, you need to try reducing the batch size first. If the model
still doesn’t fit with the smallest batch size (the batch size of 1), try increasing the degree of model
parallelism.

Sharded data parallelism with tensor parallelism and NCCL Collectives

Model (number of parameters) | Num instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree | Tensor parallel degree | Activation offloading
GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 2048 | 512 | 8 | 16 | 8 | Y
GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 4096 | 512 | 2 | 64 | 2 | Y

The combined usage of sharded data parallelism and tensor parallelism is useful when you want to fit
a large language model (LLM) into a large-scale cluster while using text data with a longer sequence
length. A longer sequence length forces a smaller batch size, which in turn helps manage the GPU
memory usage when training LLMs against longer text sequences. To learn more, see the section called
“Sharded data parallelism with tensor parallelism” (p. 1883).

For case studies, benchmarks, and more configuration examples, see the blog post New performance
improvements in Amazon SageMaker model parallel library.

Sharded data parallelism with SMDDP Collectives

The SageMaker data parallelism library offers collective communication primitives (SMDDP Collectives)
optimized for the AWS infrastructure. The collectives adopt an all-to-all-type communication pattern
that makes use of Elastic Fabric Adapter (EFA), resulting in high-throughput and less latency-sensitive
collectives; they offload the communication-related processing to the CPU and free up GPU cycles for
computation. On large clusters, SMDDP Collectives can offer improvements in distributed training
performance of up to 40% compared to NCCL. For case studies and benchmark results, see the blog
New performance improvements in the Amazon SageMaker model parallelism library.
Note
Sharded data parallelism with SMDDP Collectives is available in the SageMaker model
parallelism library v1.13.0 and later, and the SageMaker data parallelism library v1.6.0 and
later. See also Supported configurations (p. 1880) to use sharded data parallelism with SMDDP
Collectives.


In sharded data parallelism, which is a commonly used technique in large-scale distributed training, the
AllGather collective is used to reconstitute the sharded layer parameters for forward and backward
pass computations, in parallel with GPU computation. For large models, performing the AllGather
operation efficiently is critical to avoid GPU bottleneck problems and slowing down training speed.
When sharded data parallelism is activated, SMDDP Collectives drops into these performance-critical
AllGather collectives, improving training throughput.

Train with SMDDP Collectives

When your training job has sharded data parallelism activated and meets the Supported
configurations (p. 1880), SMDDP Collectives are automatically activated. Internally, SMDDP Collectives
optimize the AllGather collective to be performant on the AWS infrastructure and fall back to NCCL
for all other collectives. Furthermore, under unsupported configurations, all collectives, including
AllGather, automatically use the NCCL backend.

Starting with the SageMaker model parallelism library version 1.13.0, the "ddp_dist_backend"
parameter is available in the modelparallel options. The default value for this configuration
parameter is "auto", which uses SMDDP Collectives whenever possible, and falls back to NCCL
otherwise. To force the library to always use NCCL, specify "nccl" for the "ddp_dist_backend"
configuration parameter.

The following code example shows how to set up a PyTorch estimator using the sharded data parallelism
with the "ddp_dist_backend" parameter, which is set to "auto" by default and, therefore, optional
to add.

import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 1,
        "ddp": True,
        "sharded_data_parallel_degree": 64,
        "bf16": True,
        "ddp_dist_backend": "auto"  # Specify "nccl" to force the library to use NCCL.
    }
}

mpi_options = {
    "enabled": True,          # Required
    "processes_per_host": 8   # Required
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

Supported configurations


The AllGather operation with SMDDP Collectives is activated in training jobs when all of the following
configuration requirements are met.

• The sharded data parallelism degree is greater than 1
• instance_count is greater than 1
• instance_type is ml.p4d.24xlarge
• SageMaker training container for PyTorch v1.12.1 or later
• The SageMaker data parallelism library v1.6.0 or later
• The SageMaker model parallelism library v1.13.0 or later

Performance and memory tuning

SMDDP Collectives utilize additional GPU memory. There are two environment variables to configure the
GPU memory usage, depending on the model training use case.

• SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES – During the SMDDP AllGather operation, the
  AllGather input buffer is copied into a temporary buffer for inter-node communication. The
  SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES variable controls the size (in bytes) of this temporary
  buffer. If the size of the temporary buffer is smaller than the AllGather input buffer size, the
  AllGather collective falls back to use NCCL.
  • Default value: 16 * 1024 * 1024 (16 MB)
  • Acceptable values: any multiple of 8192
• SMDDP_AG_SORT_BUFFER_SIZE_BYTES – The SMDDP_AG_SORT_BUFFER_SIZE_BYTES variable sizes
  the temporary buffer (in bytes) that holds data gathered from inter-node communication. If the size
  of this temporary buffer is smaller than 1/8 * sharded_data_parallel_degree * AllGather
  input size, the AllGather collective falls back to use NCCL. The two fallback conditions are
  summarized in the sketch after this list.
  • Default value: 128 * 1024 * 1024 (128 MB)
  • Acceptable values: any multiple of 8192
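
The buffer-size rules above can be expressed in a few lines of code. The following is a purely
illustrative sketch; the helper function and its parameter names are hypothetical and not part of
either library:

def allgather_falls_back_to_nccl(
    allgather_input_bytes,
    sharded_data_parallel_degree,
    scratch_buffer_bytes=16 * 1024 * 1024,   # SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES default
    sort_buffer_bytes=128 * 1024 * 1024,     # SMDDP_AG_SORT_BUFFER_SIZE_BYTES default
):
    """Hypothetical helper: True if the SMDDP AllGather collective would
    fall back to NCCL under the buffer-size rules described above."""
    scratch_too_small = scratch_buffer_bytes < allgather_input_bytes
    sort_too_small = sort_buffer_bytes < (
        sharded_data_parallel_degree * allgather_input_bytes / 8
    )
    return scratch_too_small or sort_too_small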

Tuning guidance on the buffer size variables

The default values for the environment variables should work well for most use cases. We recommend
tuning these variables only if training runs into an out-of-memory (OOM) error.

The following list discusses some tuning tips to reduce the GPU memory footprint of SMDDP Collectives
while retaining the performance gain from them.

• Tuning SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES
  • The AllGather input buffer size is smaller for smaller models. Hence, the required size for
    SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES can be smaller for models with fewer parameters.
  • The AllGather input buffer size decreases as sharded_data_parallel_degree
    increases, because the model gets sharded across more GPUs. Hence, the required size for
    SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES can be smaller for training jobs with large values of
    sharded_data_parallel_degree.
• Tuning SMDDP_AG_SORT_BUFFER_SIZE_BYTES
  • The amount of data gathered from inter-node communication is less for models with fewer
    parameters. Hence, the required size for SMDDP_AG_SORT_BUFFER_SIZE_BYTES can be smaller for
    such models.

Some collectives might fall back to NCCL, in which case you might not get the performance gain from the
optimized SMDDP collectives. If additional GPU memory is available for use, you can consider increasing
the values of SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES and SMDDP_AG_SORT_BUFFER_SIZE_BYTES
to benefit from the performance gain.

The following code shows how you can configure the environment variables by passing them through
the custom_mpi_options key of mpi_options in the distribution parameter for the PyTorch
estimator.

import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    # ... All modelparallel configuration options go here
}

mpi_options = {
    "enabled": True,          # Required
    "processes_per_host": 8,  # Required
    # Use the following line to tune the values of the environment variables
    # for the SMDDP buffer sizes.
    "custom_mpi_options": "-x SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES=8192 -x SMDDP_AG_SORT_BUFFER_SIZE_BYTES=8192"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py",  # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo-with-tuning",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

Mixed precision training with sharded data parallelism

To further save GPU memory with half-precision floating point numbers and sharded data parallelism,
you can activate 16-bit floating point format (FP16) or Brain floating point format (BF16) by adding one
additional parameter to the distributed training configuration.
Note
Mixed precision training with sharded data parallelism is available in the SageMaker model
parallelism library v1.11.0 and later.

For FP16 Training with Sharded Data Parallelism

To run FP16 training with sharded data parallelism, add "fp16": True to the smp_options
configuration dictionary. In your training script, you can choose between the static and dynamic loss
scaling options through the smp.DistributedOptimizer module. For more information, see the
section called “FP16 Training with Model Parallelism” (p. 1904).

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "fp16": True
    }
}
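
In the training script, the loss scaling mode is chosen when wrapping the optimizer. The following is
a minimal sketch; the argument names follow the description in the FP16 Training with Model
Parallelism section (p. 1904), and the values shown are illustrative:

import torch.optim as optim
import smdistributed.modelparallel.torch as smp

# model: your model, defined earlier in the training script.
optimizer = optim.Adadelta(model.parameters(), lr=4.0)

# By default, static_loss_scaling=1.0 and dynamic_loss_scaling=False;
# the line below opts into dynamic loss scaling instead (illustrative).
optimizer = smp.DistributedOptimizer(optimizer, dynamic_loss_scaling=True)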

For BF16 Training with Sharded Data Parallelism

The sharded data parallelism feature of SageMaker supports training in the BF16 data type. The BF16
data type uses 8 bits to represent the exponent of a floating point number, while the FP16 data type
uses 5 bits. Preserving 8 bits for the exponent keeps the same exponent representation as a 32-bit
single precision floating point (FP32) number. This makes the conversion between FP32 and BF16
simpler and significantly less prone to the overflow and underflow issues that often arise in FP16
training, especially when training larger models. While both data types use 16 bits in total, this
increased representation range for the exponent in the BF16 format comes at the expense of reduced
precision. For training large models, this reduced precision is often considered an acceptable
trade-off for the range and training stability.
Note
Currently, BF16 training works only when sharded data parallelism is activated.

To run BF16 training with sharded data parallelism, add "bf16": True to the smp_options
configuration dictionary.

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "bf16": True
    }
}

Sharded data parallelism with tensor parallelism

If you use sharded data parallelism and also need to reduce the global batch size, consider using tensor
parallelism with sharded data parallelism. When training a large model with sharded data parallelism
on a very large compute cluster (typically 128 nodes or beyond), even a small batch size per GPU results
in a very large global batch size. This might lead to convergence issues or low computational
performance. Reducing the batch size per GPU sometimes is not possible with sharded data parallelism
alone, because the per-GPU batch size is already at its minimum and cannot be reduced further. In such
cases, using sharded data parallelism in combination with tensor parallelism helps reduce the global
batch size.

Choosing the optimal sharded data parallel and tensor parallel degrees depends on the scale of the
model, the instance type, and the global batch size that is reasonable for the model to converge. We
recommend that you start from a low tensor parallel degree to fit the global batch size into the compute
cluster, resolve CUDA out-of-memory errors, and achieve the best performance. See the following two
example cases to learn how the combination of tensor parallelism and sharded data parallelism helps
you adjust the global batch size by grouping GPUs for model parallelism, resulting in a lower number of
model replicas and a smaller global batch size.
Note
This feature is available in the SageMaker model parallelism library v1.15 and later, and supports
PyTorch v1.13.1.
Note
This feature is available for the models supported by the library's tensor parallelism functionality.
To find the list of supported models, see Support for Hugging Face Transformer Models. Also note
that you need to pass tensor_parallelism=True to the smp.model_creation argument while
modifying your training script. To learn more, see the training script train_gpt_simple.py in the
SageMaker Examples GitHub repository.


Example 1

Assume that we want to train a model over a cluster of 1536 GPUs (192 nodes with 8 GPUs each),
setting the degree of sharded data parallelism to 32 (sharded_data_parallel_degree=32) and
the batch size per GPU to 1, where each batch has a sequence length of 4096 tokens. In this case, there
are 1536 model replicas, the global batch size becomes 1536, and each global batch contains about 6
million tokens.

(1536 GPUs) * (1 batch per GPU) = (1536 global batches)
(1536 batches) * (4096 tokens per batch) = (6,291,456 tokens)

Adding tensor parallelism can lower the global batch size. One configuration example is setting the
tensor parallel degree to 8 and the batch size per GPU to 4. This forms 192 tensor parallel groups,
or 192 model replicas, where each model replica is distributed across 8 GPUs. The batch size of 4 is the
amount of training data per iteration and per tensor parallel group; that is, each model replica consumes
4 batches per iteration. In this case, the global batch size becomes 768, and each global batch contains
about 3 million tokens. Hence, the global batch size is reduced by half compared to the previous case
with sharded data parallelism only.

(1536 GPUs) / (8 tensor parallel degree) = (192 tensor parallelism groups)
(192 tensor parallelism groups) * (4 batches per tensor parallelism group) = (768 global batches)
(768 batches) * (4096 tokens per batch) = (3,145,728 tokens)
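
The arithmetic in both configurations can be captured in a short helper. This is a purely illustrative
sketch; the function and parameter names are hypothetical:

def global_batch_stats(total_gpus, tensor_parallel_degree,
                       batch_size_per_replica, seq_len):
    # Each tensor parallel group forms one model replica.
    model_replicas = total_gpus // tensor_parallel_degree
    global_batch_size = model_replicas * batch_size_per_replica
    tokens_per_global_batch = global_batch_size * seq_len
    return model_replicas, global_batch_size, tokens_per_global_batch

# Sharded data parallelism only (tensor parallel degree of 1):
print(global_batch_stats(1536, 1, 1, 4096))   # (1536, 1536, 6291456)

# With a tensor parallel degree of 8 and 4 batches per replica:
print(global_batch_stats(1536, 8, 4, 4096))   # (192, 768, 3145728)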

Example 2

When both sharded data parallelism and tensor parallelism are activated, the library first applies
tensor parallelism and shards the model across this dimension. For each tensor parallel rank, data
parallelism is applied as per sharded_data_parallel_degree.

For example, assume that we set up 32 GPUs with a tensor parallel degree of 4 (forming
groups of 4 GPUs) and a sharded data parallel degree of 4, ending up with a replication degree of
2. The assignment creates eight GPU groups based on the tensor parallel degree as follows:
(0,1,2,3), (4,5,6,7), (8,9,10,11), (12,13,14,15), (16,17,18,19), (20,21,22,23),
(24,25,26,27), (28,29,30,31). That is, four GPUs form one tensor parallel group. In this
case, the reduced data parallel group for the 0th-rank GPUs of the tensor parallel groups would be
(0,4,8,12,16,20,24,28). The reduced data parallel group is sharded based on the sharded data
parallel degree of 4, resulting in two replication groups for data parallelism. GPUs (0,4,8,12) form
one sharding group, which collectively holds a complete copy of all parameters for the 0th tensor parallel
rank, and GPUs (16,20,24,28) form another such group. Other tensor parallel ranks have similar
sharding and replication groups.
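
The following standalone sketch reproduces the group assignment in this example. It is illustrative
only; the helper functions are hypothetical and assume the contiguous group layout described above:

NUM_GPUS = 32
TP_DEGREE = 4    # tensor parallel degree
SDP_DEGREE = 4   # sharded data parallel degree

# Contiguous tensor parallel groups: (0,1,2,3), (4,5,6,7), ...
tp_groups = [list(range(i, i + TP_DEGREE)) for i in range(0, NUM_GPUS, TP_DEGREE)]

def reduced_dp_group(tp_rank):
    # All GPUs that hold the same tensor parallel partition.
    return list(range(tp_rank, NUM_GPUS, TP_DEGREE))

def sharding_groups(tp_rank):
    # Shard each reduced data parallel group by the sharded data
    # parallel degree to obtain the sharding (replication) groups.
    rdp = reduced_dp_group(tp_rank)
    return [rdp[i:i + SDP_DEGREE] for i in range(0, len(rdp), SDP_DEGREE)]

print(tp_groups[0])         # [0, 1, 2, 3]
print(reduced_dp_group(0))  # [0, 4, 8, 12, 16, 20, 24, 28]
print(sharding_groups(0))   # [[0, 4, 8, 12], [16, 20, 24, 28]]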

Figure 1: Tensor parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4,
4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form tensor parallelism
groups from TPG0 to TPG7. Replication groups are ({TPG0, TPG4}, {TPG1, TPG5}, {TPG2, TPG6}, and {TPG3,
TPG7}); each replication group pair shares the same color but is filled differently.

Figure 2: Sharded data parallelism groups for (nodes, sharded data parallel degree, tensor parallel
degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form
sharded data parallelism groups from SDPG0 to SDPG7. Replication groups are ({SDPG0, SDPG4}, {SDPG1,
SDPG5}, {SDPG2, SDPG6}, and {SDPG3, SDPG7}); each replication group pair shares the same color but is
filled differently.

How to activate sharded data parallelism with tensor parallelism

To use sharded data parallelism with tensor parallelism, you need to set both
sharded_data_parallel_degree and tensor_parallel_degree in the configuration for
distribution while creating an object of the SageMaker PyTorch estimator class.

You also need to activate prescaled_batch. This means that, instead of each GPU reading its
own batch of data, each tensor parallel group collectively reads a combined batch of the chosen
batch size. Effectively, instead of dividing the dataset into parts equal to the number of GPUs (or
the data parallel size, smp.dp_size()), it divides the dataset into parts equal to the number of GPUs
divided by tensor_parallel_degree (also called the reduced data parallel size, smp.rdp_size()).
For more details on prescaled batch, see Prescaled Batch in the SageMaker Python SDK documentation.
See also the example training script train_gpt_simple.py for GPT-2 in the SageMaker Examples GitHub
repository.

The following code snippet shows an example of creating a PyTorch estimator object based on the
aforementioned scenario in the section called “Example 2” (p. 1884).

mpi_options = "-verbose --mca orte_base_help_aggregate 0 "


smp_parameters = {
"ddp": True,
"fp16": True,
"prescaled_batch": True,
"sharded_data_parallel_degree": 4,
"tensor_parallel_degree": 4
}

pytorch_estimator = PyTorch(
entry_point="your_training_script.py",
role=role,
instance_type="ml.p4d.24xlarge",
volume_size=200,
instance_count=4,
sagemaker_session=sagemaker_session,
py_version="py3",
framework_version="1.13.1",
distribution={
"smdistributed": {
"modelparallel": {

1885
Amazon SageMaker Developer Guide
SageMaker's Model Parallelism Library

"enabled": True,
"parameters": smp_parameters,
}
},
"mpi": {
"enabled": True,
"processes_per_host": 8,
"custom_mpi_options": mpi_options,
},
},
source_dir="source_directory_of_your_code",
output_path=s3_output_location
)
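
In the training script itself, activating prescaled_batch changes how the dataset is sharded: it is
divided across reduced-data-parallel ranks rather than data-parallel ranks. The following is a minimal
sketch patterned on the MNIST sharding code later in this chapter; dataset is assumed to be defined
earlier in the script:

import smdistributed.modelparallel.torch as smp
from torchnet.dataset import SplitDataset

# dataset: a torch Dataset created earlier in the training script.
# With prescaled_batch, each tensor parallel group reads one combined batch,
# so shard across reduced-data-parallel (rdp) ranks instead of dp ranks.
if smp.rdp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.rdp_size() for i in range(smp.rdp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.rdp_rank()}")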

Tips and considerations for using sharded data parallelism

Consider the following when using the SageMaker model parallelism library's sharded data parallelism.

• Sharded data parallelism is compatible with FP16 training. To run FP16 training, see the section
  called “FP16 Training with Model Parallelism” (p. 1904).
• Sharded data parallelism is compatible with tensor parallelism. Consider the following items when
  using sharded data parallelism with tensor parallelism.
  • When using sharded data parallelism with tensor parallelism, the embedding layers
    are also automatically distributed across the tensor parallel group. In other words, the
    distribute_embedding parameter is automatically set to True. For more information about
    tensor parallelism, see the section called “Tensor Parallelism” (p. 1890).
  • Note that sharded data parallelism with tensor parallelism currently uses the NCCL collectives as the
    backend of the distributed training strategy.

  To learn more, see the section called “Sharded data parallelism with tensor parallelism” (p. 1883).
• Sharded data parallelism currently is not compatible with pipeline parallelism (p. 1866) or optimizer
  state sharding (p. 1901). To activate sharded data parallelism, turn off optimizer state sharding and
  set the pipeline parallel degree to 1.
• The activation checkpointing (p. 1902) and activation offloading (p. 1903) features are compatible
  with sharded data parallelism.
• To use sharded data parallelism with gradient accumulation, set the backward_passes_per_step
  argument to the number of accumulation steps while wrapping your model with the
  smdistributed.modelparallel.torch.DistributedModel module. This ensures that the
  gradient AllReduce operation across the model replication groups (sharding groups) takes place at
  the boundary of gradient accumulation. See the sketch after this list.
• You can checkpoint your models trained with sharded data parallelism using the library's
  checkpointing APIs, smp.save_checkpoint and smp.resume_from_checkpoint. For more
  information, see the section called “Checkpointing a distributed PyTorch model (for the SageMaker
  model parallelism library v1.10.0 and later)” (p. 1927).
• The behavior of the delayed_parameter_initialization configuration parameter changes under
  sharded data parallelism. When the two features are turned on simultaneously, parameters are
  initialized immediately upon model creation in a sharded manner instead of delaying the parameter
  initialization, so that each rank initializes and stores its own shard of parameters.
• When sharded data parallelism is activated, the library performs gradient clipping internally when
  the optimizer.step() call runs. You don't need to use utility APIs for gradient clipping, such as
  torch.nn.utils.clip_grad_norm_(). To adjust the threshold value for gradient clipping, you can
  set it through the sdp_gradient_clipping parameter in the distribution parameter configuration
  when you construct the SageMaker PyTorch estimator, as shown in the section called “How to
  apply sharded data parallelism to your training job” (p. 1877).
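
The following is a minimal sketch of the gradient accumulation setup described in the list above,
assuming accumulation over 4 steps; model is a torch.nn.Module assumed to be defined earlier in the
script:

import smdistributed.modelparallel.torch as smp

# Align the gradient AllReduce across sharding groups with the gradient
# accumulation boundary by setting backward_passes_per_step.
model = smp.DistributedModel(model, backward_passes_per_step=4)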


Pipelining a Model
One of the core features of SageMaker's model parallelism library is pipeline parallelism, which
determines the order in which computations are made and data is processed across devices during model
training. Pipelining is a technique to achieve true parallelization in model parallelism, by having the
GPUs compute simultaneously on different data samples, and to overcome the performance loss due
to sequential computation. When you use pipeline parallelism, the training job is executed in a
pipelined fashion over microbatches to maximize GPU usage.
Note
Pipeline parallelism, also called model partitioning, is available for both PyTorch and
TensorFlow. For supported versions of the frameworks, see the section called “Supported
Frameworks and AWS Regions” (p. 1872).

Pipeline Execution Schedule

Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline
one-by-one and follow an execution schedule defined by the library runtime. A microbatch is a smaller
subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by
which device for every time slot.

For example, depending on the pipeline schedule and the model partition, GPU i might perform
(forward or backward) computation on microbatch b while GPU i+1 performs computation on
microbatch b+1, thereby keeping both GPUs active at the same time. During a single forward or
backward pass, execution flow for a single microbatch might visit the same device multiple times,
depending on the partitioning decision. For instance, an operation that is at the beginning of the model
might be placed on the same device as an operation at the end of the model, while the operations in
between are on different devices, which means this device is visited twice.

The library offers two different pipeline schedules, simple and interleaved, which can be configured using
the pipeline parameter in the SageMaker Python SDK. In most cases, the interleaved pipeline achieves
better performance by utilizing the GPUs more efficiently, as shown in the configuration sketch that follows.
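
For example, selecting the interleaved schedule explicitly might look like the following sketch. The
pipeline parameter is the one referenced above; the surrounding values are illustrative placeholders:

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,              # illustrative
        "pipeline_parallel_degree": 2,  # illustrative
        "pipeline": "interleaved",      # or "simple"
        "ddp": True
    }
}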

Interleaved Pipeline

In an interleaved pipeline, backward execution of the microbatches is prioritized whenever possible. This
allows quicker release of the memory used for activations, using memory more efficiently. It also allows
for scaling the number of microbatches higher, reducing the idle time of the GPUs. At steady-state, each
device alternates between running forward and backward passes. This means that the backward pass of
one microbatch may run before the forward pass of another microbatch finishes.

The preceding figure illustrates an example execution schedule for the interleaved pipeline over 2 GPUs.
In the figure, F0 represents the forward pass for microbatch 0, and B1 represents the backward pass
for microbatch 1. Update represents the optimizer update of the parameters. GPU0 always prioritizes
backward passes whenever possible (for instance, executes B0 before F2), which allows for clearing of
the memory used for activations earlier.

Simple Pipeline

A simple pipeline, by contrast, finishes running the forward pass for each microbatch before starting
the backward pass. This means that it only pipelines the forward pass and backward pass stages within
themselves. The following figure illustrates an example of how this works, over 2 GPUs.


Pipelining Execution in Specific Frameworks

Use the following sections to learn about the framework-specific pipeline scheduling decisions
SageMaker's model parallelism library makes for TensorFlow and PyTorch.

Pipeline Execution with TensorFlow

The following image is an example of a TensorFlow graph partitioned by the model parallelism library,
using automated model splitting. When a graph is split, each resulting subgraph is replicated B times
(except for the variables), where B is the number of microbatches. In this figure, each subgraph is
replicated 2 times (B=2). An SMPInput operation is inserted at each input of a subgraph, and an
SMPOutput operation is inserted at each output. These operations communicate with the library
backend to transfer tensors to and from each other.

The following image is an example of 2 subgraphs split with B=2, with gradient operations added.
The gradient of an SMPInput op is an SMPOutput op, and vice versa. This enables the gradients to flow
backward during back-propagation.


The following animation demonstrates an example interleaved pipeline execution schedule with B=2 microbatches and
2 subgraphs. Each device sequentially executes one of the subgraph replicas to improve GPU utilization.
As B grows larger, the fraction of idle time slots goes to zero. Whenever it is time to do (forward or
backward) computation on a specific subgraph replica, the pipeline layer signals to the corresponding
blue SMPInput operations to start executing.

Once the gradients from all microbatches in a single mini-batch are computed, the library combines the
gradients across microbatches, which can then be applied to the parameters.

Pipeline Execution with PyTorch


Conceptually, pipelining follows a similar idea in PyTorch. However, because PyTorch does not rely on
static graphs, the model parallelism library's PyTorch feature uses a more dynamic pipelining paradigm.

As in TensorFlow, each batch is split into a number of microbatches, which are executed one at a time on
each device. However, the execution schedule is handled via execution servers launched on each device.
Whenever the output of a submodule that is placed on another device is needed on the current device,
an execution request is sent to the execution server of the remote device along with the input tensors to
the submodule. The server then executes this module with the given inputs and returns the response to
the current device.

Since the current device is idle during the remote submodule execution, the local execution for the
current microbatch pauses, and the library runtime switches execution to another microbatch which
the current device can actively work on. The prioritization of microbatches is determined by the chosen


pipeline schedule. For an interleaved pipeline schedule, microbatches that are in the backward stage of
the computation are prioritized whenever possible.

Tensor Parallelism
Tensor parallelism is a type of model parallelism in which specific model weights, gradients, and
optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual
weights intact but partitions the set of weights, tensor parallelism splits individual weights. This typically
involves distributed computation of specific operations, modules, or layers of the model.

Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory
(such as large embedding tables with a large vocabulary size or a large softmax layer with a large
number of classes). In this case, treating this large tensor or operation as an atomic unit is inefficient and
impedes balancing the memory load.

Tensor parallelism is also useful for extremely large models in which pure pipelining is simply not
enough. For example, with GPT-3-scale models that require partitioning over tens of instances, pure
microbatch pipelining is inefficient because the pipeline depth becomes too high and the overhead
becomes prohibitively large.
Note
Tensor parallelism is available for PyTorch in the SageMaker model parallelism library v1.6.0 and
later.

Topics
• How Tensor Parallelism Works (p. 1890)
• Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism (p. 1892)
• Support for Hugging Face Transformer Models (p. 1897)
• Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor
Parallelism (p. 1899)

How Tensor Parallelism Works

Tensor parallelism takes place at the level of nn.Modules; it partitions specific modules in the model
across tensor parallel ranks. This is in addition to the existing partition of the set of modules used in
pipeline parallelism.

When a module is partitioned through tensor parallelism, its forward and backward propagation
are distributed. The library handles the necessary communication across devices to implement the
distributed execution of these modules. The modules are partitioned across multiple data parallel ranks.
Contrary to the traditional distribution of workloads, each data parallel rank does not have the complete
model replica when the library’s tensor parallelism is used. Instead, each data parallel rank may have
only a partition of the distributed modules, in addition to the entirety of the modules that are not
distributed.

Example: Consider tensor parallelism across data parallel ranks, where the degree of data parallelism is
4 and the degree of tensor parallelism is 2. Assume that you have a data parallel group that holds the
following module tree, after partitioning the set of modules.

A
├── B
│   ├── E
│   └── F
├── C
└── D
    ├── G
    └── H


Assume that tensor parallelism is supported for the modules B, G, and H. One possible outcome of
tensor parallel partition of this model could be:

dp_rank 0 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 1 (tensor parallel rank 1): A, B:1, C, D, G:1, H
dp_rank 2 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 3 (tensor parallel rank 1): A, B:1, C, D, G:1, H

Each line represents the set of modules stored in that dp_rank, and the notation X:y represents the yth
fraction of the module X. Note the following:

1. Partitioning takes place across subsets of data parallel ranks, which we call TP_GROUP, not the entire
DP_GROUP, so that the exact model partition is replicated across dp_rank 0 and dp_rank 2, and
similarly across dp_rank 1 and dp_rank 3.
2. The modules E and F are no longer part of the model, since their parent module B is partitioned, and
any execution that is normally a part of E and F takes place within the (partitioned) B module.
3. Even though H is supported for tensor parallelism, in this example it is not partitioned, which
highlights that whether to partition a module depends on user input. The fact that a module is
supported for tensor parallelism does not necessarily mean it is partitioned.

How the library adapts tensor parallelism to PyTorch nn.Linear module

When tensor parallelism is performed over data parallel ranks, a subset of the parameters, gradients,
and optimizer states are partitioned across the tensor parallel devices for the modules that are
partitioned. For the rest of the modules, the tensor parallel devices operate in a regular data parallel
manner. To execute the partitioned module, a device first collects the necessary parts of all data samples
across peer devices in the same tensor parallelism group. The device then runs the local fraction of the
module on all these data samples, followed by another round of synchronization which both combines
the parts of the output for each data sample and returns the combined data samples to the GPUs from
which the data sample first originated. The following figure shows an example of this process over a
partitioned nn.Linear module.


The first figure shows a small model with a large nn.Linear module with data parallelism over the two
tensor parallelism ranks. The nn.Linear module is replicated into the two parallel ranks.

The second figure shows tensor parallelism applied on a larger model while splitting the nn.Linear
module. Each tp_rank holds half of the linear module and the entirety of the rest of the operations. While
the linear module runs, each tp_rank collects the relevant half of all data samples and passes them through
its half of the nn.Linear module. The result needs to be reduce-scattered (with summation as the
reduction operation) so that each rank has the final linear output for its own data samples. The rest of
the model runs in the typical data parallel manner.
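
The arithmetic behind the split nn.Linear module can be verified on a single process. The following
sketch splits the weight matrix along the input dimension across two hypothetical ranks and sums the
partial outputs, mimicking the reduce-scatter step with summation; it is illustrative only and does
not use the library:

import torch

torch.manual_seed(0)
x = torch.randn(4, 8)                     # 4 data samples, 8 input features
full = torch.nn.Linear(8, 6, bias=False)  # the undistributed module

# Each hypothetical tensor parallel rank holds half of the weight matrix,
# split along the input-feature dimension.
w0, w1 = full.weight[:, :4], full.weight[:, 4:]

# Each rank multiplies its half of the features; summing the partial
# outputs plays the role of the reduce-scatter (sum) step.
partial0 = x[:, :4] @ w0.t()
partial1 = x[:, 4:] @ w1.t()
combined = partial0 + partial1

assert torch.allclose(combined, full(x), atol=1e-6)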

Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism

In this section, you learn:

• How to configure a SageMaker PyTorch estimator and the SageMaker model parallelism option to use
tensor parallelism.
• How to adapt your training script using the extended smdistributed.modelparallel modules for
tensor parallelism.

To learn more about the smdistributed.modelparallel modules, see the SageMaker model parallel
APIs in the SageMaker Python SDK documentation.

Topics
• Tensor parallelism alone (p. 1893)
• Tensor parallelism combined with pipeline parallelism (p. 1895)


Tensor parallelism alone

The following is an example of a distributed training option to activate tensor parallelism alone, without
pipeline parallelism. Configure the mpi_options and smp_options dictionaries to specify distributed
training options to the SageMaker PyTorch estimator.
Note
Extended memory-saving features are available through Deep Learning Containers for PyTorch,
which implements the SageMaker model parallelism library v1.6.0 or later.

Configure a SageMaker PyTorch estimator

import sagemaker
from sagemaker.pytorch import PyTorch

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # 8 processes
    "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 1,  # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 4,    # tp over 4 devices
        "ddp": True
    }
}

smp_estimator = PyTorch(
    entry_point='your_training_script.py',  # Specify your train script
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py36',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')

Tip
To find a complete list of parameters for distribution, see Configuration Parameters for
Model Parallelism in the SageMaker Python SDK documentation.

Adapt your PyTorch training script

The following example training script shows how to adapt the SageMaker model parallelism library to a
training script. In this example, it is assumed that the script is named your_training_script.py.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)


def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target, reduction="mean")
        loss.backward()
        optimizer.step()


# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time. All processes then load the
# already-downloaded data after the barrier.
if smp.local_rank() == 0:
    datasets.MNIST("../data", train=True, download=True)
smp.barrier()
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

train_loader = torch.utils.data.DataLoader(dataset, batch_size=64)

# smdistributed: Enable tensor parallelism for all supported modules in the model,
# i.e., nn.Linear in this case. Alternatively, we can use
# smp.set_tensor_parallelism(model.fc1, True)
# to enable it only for model.fc1
with smp.tensor_parallelism():
    model = Net()

# smdistributed: Use the DistributedModel wrapper to distribute the
# modules for which tensor parallelism is enabled
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)

Tensor parallelism combined with pipeline parallelism

The following is an example of a distributed training option that enables tensor parallelism combined
with pipeline parallelism. Set up the mpi_options and smp_options parameters to specify model
parallel options with tensor parallelism when you configure a SageMaker PyTorch estimator.
Note
Extended memory-saving features are available through Deep Learning Containers for PyTorch,
which implements the SageMaker model parallelism library v1.6.0 or later.

Configure a SageMaker PyTorch estimator

import sagemaker
from sagemaker.pytorch import PyTorch

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # 8 processes
    "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,  # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,    # tp over 2 devices
        "ddp": True
    }
}

smp_estimator = PyTorch(
    entry_point='your_training_script.py',  # Specify your train script
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py36',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')

Adapt your PyTorch training script

The following example training script shows how to adapt the SageMaker model parallelism library to a
training script. Note that the training script now includes the smp.step decorator:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss


def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()


# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time. All processes then load the
# already-downloaded data after the barrier.
if smp.local_rank() == 0:
    datasets.MNIST("../data", train=True, download=True)
smp.barrier()
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = Net()

# smdistributed: enable tensor parallelism only for model.fc1
smp.set_tensor_parallelism(model.fc1, True)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)

Support for Hugging Face Transformer Models

The SageMaker model parallelism library's tensor parallelism offers out-of-the-box support for the
following Hugging Face Transformer models:

• GPT-2, BERT, and RoBERTa (Available in the SageMaker model parallelism library v1.7.0 and later)
• GPT-J (Available in the SageMaker model parallelism library v1.8.0 and later)
• GPT-Neo (Available in the SageMaker model parallelism library v1.10.0 and later)

Note
For any other Transformers models, you need to use the
smdistributed.modelparallel.torch.tp_register_with_module() API to apply tensor parallelism.
Note
To use tensor parallelism for training Hugging Face Transformer models, make sure you use
Hugging Face Deep Learning Containers for PyTorch that has the SageMaker model parallelism
library v1.7.0 and later. For more information, see the SageMaker model parallelism library
release notes.

Supported Models Out of the Box

For the Hugging Face transformer models supported by the library out of the box, you
don't need to manually implement hooks to translate Transformer APIs to smdistributed
transformer layers. You can activate tensor parallelism by using the context manager
smdistributed.modelparallel.torch.tensor_parallelism() and wrapping the model by
smdistributed.modelparallel.torch.DistributedModel(). You don't need to manually register hooks for
tensor parallelism using the smp.tp_register API.

The state_dict translation functions between Hugging Face Transformers and
smdistributed.modelparallel can be accessed as follows.

• smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_state_dict_to_hf_gpt2(state_dict,
  max_seq_len=None)
• smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_hf_state_dict_to_smdistributed(state_dict)
• smdistributed.modelparallel.torch.nn.huggingface.bert.translate_state_dict_to_hf_bert(state_dict,
  max_seq_len=None)
• smdistributed.modelparallel.torch.nn.huggingface.bert.translate_hf_state_dict_to_smdistributed(state_dict)
• smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_state_dict_to_hf_roberta(state_dict,
  max_seq_len=None)
• smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_hf_state_dict_to_smdistributed(state_dict)
• smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_state_dict_to_hf_gptj(state_dict,
  max_seq_len=None) (Available in the SageMaker model parallelism library v1.8.0 and later)
• smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_hf_gptj_state_dict_to_smdistributed(state_dict)
  (Available in the SageMaker model parallelism library v1.8.0 and later)
• smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_state_dict_to_hf_gptneo(state_dict,
  max_seq_len=None) (Available in the SageMaker model parallelism library v1.10.0 and later)
• smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_hf_state_dict_to_smdistributed(state_dict)
  (Available in the SageMaker model parallelism library v1.10.0 and later)

Example usage of the GPT-2 translation function

Start with wrapping the model as shown in the following code.

import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(hf_gpt2_config)

dist_model = smp.DistributedModel(model)

Given a state_dict from the DistributedModel object, you can load the weights into the original
Hugging Face GPT-2 model using the translate_state_dict_to_hf_gpt2 function as shown in the
following code.

from smdistributed.modelparallel.torch.nn.huggingface.gpt2 \
    import translate_state_dict_to_hf_gpt2

max_seq_len = 1024

# [... code block for training ...]

if smp.rdp_rank() == 0:
    state_dict = dist_model.state_dict()
    hf_state_dict = translate_state_dict_to_hf_gpt2(state_dict, max_seq_len)

    # You can now call model.load_state_dict(hf_state_dict) on the original HF model

Example usage of the RoBERTa translation function

Similarly, given a supported Hugging Face model state_dict, you can use the
translate_hf_state_dict_to_smdistributed function to convert it to a format readable by
smp.DistributedModel. This can be useful in transfer learning use cases, where a pre-trained model is
loaded into a smp.DistributedModel for model-parallel fine-tuning:

import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForMaskedLM
from smdistributed.modelparallel.torch.nn.huggingface.roberta \
    import translate_hf_state_dict_to_smdistributed

model = AutoModelForMaskedLM.from_config(roberta_config)
model = smp.DistributedModel(model)

pretrained_model = AutoModelForMaskedLM.from_pretrained("roberta-large")
translated_state_dict = translate_hf_state_dict_to_smdistributed(pretrained_model.state_dict())

# load the translated pretrained weights into the smp.DistributedModel
model.load_state_dict(translated_state_dict)

# start fine-tuning...

Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism

This section explains how the ranking mechanism of model parallelism works with tensor parallelism.
This is extended from the Ranking Basics for Core Features of the SageMaker Model Parallelism
Library (p. 1875). With tensor parallelism, the library introduces three types of ranking and process
group APIs: smp.tp_rank() for tensor parallel rank, smp.pp_rank() for pipeline parallel rank, and
smp.rdp_rank() for reduced-data parallel rank. The corresponding communication process groups are
tensor parallel group (TP_GROUP), pipeline parallel group (PP_GROUP), and reduced-data parallel group
(RDP_GROUP). These groups are defined as follows:

• A tensor parallel group (TP_GROUP) is an evenly divisible subset of the data parallel group, over which
tensor parallel distribution of modules takes place. When the degree of pipeline parallelism is 1,
TP_GROUP is the same as model parallel group (MP_GROUP).
• A pipeline parallel group (PP_GROUP) is the group of processes over which pipeline parallelism takes
place. When the tensor parallelism degree is 1, PP_GROUP is the same as MP_GROUP.
• A reduced-data parallel group (RDP_GROUP) is a set of processes that hold both the same pipeline
parallelism partitions and the same tensor parallel partitions, and perform data parallelism among
themselves. This is called the reduced data parallel group because it is a subset of the entire data
parallelism group, DP_GROUP. For the model parameters that are distributed within the TP_GROUP,
the gradient allreduce operation is performed only within the reduced-data parallel group, while for
the parameters that are not distributed, the gradient allreduce takes place over the entire DP_GROUP.
• A model parallel group (MP_GROUP) refers to a group of processes that collectively store the entire
model. It consists of the union of the PP_GROUPs of all the ranks that are in the TP_GROUP of the
current process. When the degree of tensor parallelism is 1, MP_GROUP is equivalent to PP_GROUP. It
is also consistent with the existing definition of MP_GROUP from previous smdistributed releases.
Note that the current TP_GROUP is a subset of both the current DP_GROUP and the current MP_GROUP.

To learn more about the communication process APIs in the SageMaker model parallelism library, see the
Common API and the PyTorch-specific APIs in the SageMaker Python SDK documentation.
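
To see where each process lands in these groups, a training script can print its ranks after
initialization. The following is a minimal illustrative sketch using the ranking APIs named above,
plus smp.rank(), smp.dp_rank(), and smp.mp_rank() from the common API:

import smdistributed.modelparallel.torch as smp

smp.init()

# One line per process, showing its position in each process group.
print(
    f"global rank {smp.rank()}: "
    f"tp_rank={smp.tp_rank()}, pp_rank={smp.pp_rank()}, "
    f"rdp_rank={smp.rdp_rank()}, dp_rank={smp.dp_rank()}, "
    f"mp_rank={smp.mp_rank()}"
)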


This figure shows the ranking mechanism, parameter distribution, and associated AllReduce operations of
tensor parallelism.

For example, consider process groups for a single node with 8 GPUs, where the degree of tensor
parallelism is 2, the degree of pipeline parallelism is 2, and the degree of data parallelism is 4. The upper
center part of the preceding figure shows an example of a model with 4 layers. The lower left and lower
right parts of the figure illustrate the 4-layer model distributed across 4 GPUs using both pipeline parallelism
and tensor parallelism, where tensor parallelism is used for the middle two layers. These two lower
figures are simple copies to illustrate different group boundary lines. The partitioned model is replicated
for data parallelism across GPUs 0-3 and 4-7. The lower left figure shows the definitions of MP_GROUP,
PP_GROUP, and TP_GROUP. The lower right figure shows RDP_GROUP, DP_GROUP, and WORLD over the
same set of GPUs. The gradients for the layers and layer slices that have the same color are allreduced
together for data parallelism. For example, the first layer (light blue) gets the allreduce operations
across DP_GROUP, whereas the dark orange slice in the second layer only gets the allreduce operations
within the RDP_GROUP of its process. The bold dark red arrows represent tensors with the batch of its
entire TP_GROUP.

GPU0: pp_rank 0, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 0
GPU1: pp_rank 1, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 1
GPU2: pp_rank 0, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 2
GPU3: pp_rank 1, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 3
GPU4: pp_rank 0, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 0
GPU5: pp_rank 1, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 1
GPU6: pp_rank 0, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 2
GPU7: pp_rank 1, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 3


In this example, pipeline parallelism occurs across the GPU pairs (0,1); (2,3); (4,5) and (6,7). In addition,
data parallelism (allreduce) takes place across GPUs 0, 2, 4, 6, and independently over GPUs 1, 3, 5, 7.
Tensor parallelism happens over subsets of DP_GROUPs, across the GPU pairs (0,2); (1,3); (4,6) and (5,7).

Optimizer State Sharding

Optimizer state sharding is a useful memory-saving technique that shards the optimizer state (the set of
tensors that describes the state of the optimizer) across data parallel device groups. You can use optimizer
state sharding whenever you use a stateful optimizer (such as Adam) or an FP16 optimizer (which stores
both FP16 and FP32 copies of the parameters).
Note
Optimizer state sharding is available for PyTorch in the SageMaker model parallelism library
v1.6.0 and later.

How to Use Optimizer State Sharding

You can turn on optimizer state sharding by setting "shard_optimizer_state": True in the
modelparallel configuration.

When this feature is turned on, the library partitions the set of model parameters based on the data
parallelism degree. The gradients corresponding to the ith partition get reduced only at the ith data
parallel rank. At the end of the first call to an smp.step decorator function, the optimizer wrapped
by smp.DistributedOptimizer redefines its parameters to be only limited to those parameters
corresponding to the partition of the current data parallel rank. The redefined parameters are called
virtual parameters and share underlying storage with the original parameters. During the first call to
optimizer.step, the optimizer states are created based on these redefined parameters, which are
sharded because of the original partition. After the optimizer update, the AllGather operation (as part of
the optimizer.step call) runs across the data parallel ranks to achieve consistent parameter states.
Tip
Optimizer state sharding can be useful when the degree of data parallelism is greater than 1
and the model has more than a billion parameters.
The degree of data parallelism is calculated by (processes_per_host *
instance_count / pipeline_parallel_degree), and the smp.dp_size() function
handles the sizing in the background.

Configure a SageMaker PyTorch estimator

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # 8 processes
    "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,  # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,    # tp over 2 devices
        "ddp": True,
        "shard_optimizer_state": True
    }
}

Adapt your PyTorch training script

See Adapt your PyTorch training script (p. 1895) in the Tensor parallelism combined with pipeline
parallelism section. There’s no additional modification required for the script.


Activation Checkpointing
Activation checkpointing (or gradient checkpointing) is a technique to reduce memory usage by clearing
activations of certain layers and recomputing them during a backward pass. Effectively, this trades extra
computation time for reduced memory usage. If a module is checkpointed, at the end of a forward
pass, the inputs to and outputs from the module stay in memory. Any intermediate tensors that would
have been part of the computation inside that module are freed up during the forward pass. During
the backward pass of checkpointed modules, these tensors are recomputed. At this point, the layers
beyond this checkpointed module have finished their backward pass, so the peak memory usage with
checkpointing can be lower.
Note
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

How to Use Activation Checkpointing

With smdistributed.modelparallel, you can use activation checkpointing at the granularity
of a module. For all torch.nn modules except torch.nn.Sequential, you can only checkpoint a
module tree if it lies within one partition from the perspective of pipeline parallelism. In the case of
the torch.nn.Sequential module, each module tree inside the sequential module must lie completely
within one partition for activation checkpointing to work. When you use manual partitioning, be aware
of these restrictions.

When you use automated model partitioning, you can find the partitioning assignment logs starting with
Partition assignments: in the training job logs. If a module is partitioned across multiple ranks
(for example, with one descendant on one rank and another descendant on a different rank), the library
ignores the attempt to checkpoint the module and raises a warning message that the module won't be
checkpointed.
Note
The SageMaker model parallelism library supports both overlapping and non-overlapping
allreduce operation in combination with checkpointing.
Note
PyTorch’s native checkpointing API is not compatible with smdistributed.modelparallel.

Example 1: The following sample code shows how to use activation checkpointing when you have a
model definition in your script.

import torch
import torch.nn as nn
import torch.nn.functional as F

from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        # This call of fc1 will be checkpointed
        x = checkpoint(self.fc1, x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)


Example 2: The following sample code shows how to use activation checkpointing when you have a
sequential model in your script.

import torch.nn as nn
import torch.nn.functional as F

from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint_sequential


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(1, 20, 5),
            nn.ReLU(),
            nn.Conv2d(20, 64, 5),
            nn.ReLU()
        )

    def forward(self, x):
        # This call of self.seq will be checkpointed
        x = checkpoint_sequential(self.seq, x)
        return F.log_softmax(x, 1)

Example 3: The following sample code shows how to use activation checkpointing when you import a
prebuilt model from a library, such as PyTorch and Hugging Face Transformers. Whether you checkpoint
sequential modules or not, do the following:

1. Wrap the model with smp.DistributedModel().
2. Define an object for the sequential layers.
3. Wrap the sequential layer object with smp.set_activation_checkpointing().

import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

smp.init()
model = AutoModelForCausalLM.from_config(config)  # config: your model configuration
model = smp.DistributedModel(model)

# Call the set_activation_checkpointing API
transformer_layers = model.module.module.module.transformer.seq_layers
smp.set_activation_checkpointing(
    transformer_layers, pack_args_as_tuple=True, strategy='each')

Activation Offloading
When activation checkpointing and pipeline parallelism are turned on and the number of microbatches
is greater than one, activation offloading is an additional feature that can further reduce memory
usage. Activation offloading asynchronously moves to the CPU the checkpointed activations for the
microbatches that are not currently running. Right before the GPU needs the activations for a
microbatch's backward pass, this functionality prefetches the offloaded activations back from the
CPU.
Note
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

How to Use Activation Offloading

Use activation offloading to reduce memory usage when the number of microbatches is greater than
1 and activation checkpointing is turned on (see Activation Checkpointing (p. 1902)). When activation
checkpointing is not used, activation offloading has no effect. When it is used with only one
microbatch, it does not save memory.


To use activation offloading, set "offload_activations": True in the modelparallel
configuration.

Activation offloading moves the checkpointed activations in nn.Sequential modules to the CPU
asynchronously. The data transfer over the PCIe link overlaps with GPU computation. The offloading
happens immediately, as soon as the forward pass for a particular checkpointed layer is computed.
The activations are loaded back to the GPU shortly before they are needed for the backward pass of a
particular microbatch. The CPU-GPU transfer similarly overlaps with computation.

To adjust how early the activations are loaded back into the GPU, use the configuration
parameter "activation_loading_horizon" (default: 4; must be an integer greater than 0). A larger
activation loading horizon causes the activations to be loaded back to the GPU earlier. If the
horizon is too large, the memory-saving impact of activation offloading might be diminished. If the
horizon is too small, the activations may not be loaded back in time, reducing the amount of overlap and
degrading performance.
Tip
Activation offloading can be useful for large models with over a hundred billion parameters.

Configure a SageMaker PyTorch estimator

mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,            # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,   # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,     # tp over 2 devices
        "ddp": True,
        "offload_activations": True,
        "activation_loading_horizon": 4  # optional. default is 4.
    }
}

FP16 Training with Model Parallelism

For FP16 training, apply the following modifications to your training script and estimator.
Note
This feature is available for PyTorch in the SageMaker model parallelism library v1.10.0 and
later.

Adapt your PyTorch training script

1. Wrap your model using the smdistributed.modelparallel.torch.model_creation() context manager.

# fp16_training_script.py

import torch
import smdistributed.modelparallel.torch as smp

with smp.model_creation(
    dtype=torch.float16 if args.fp16 else torch.get_default_dtype()
):
    model = ...


Tip
If you are using tensor parallelism, add tensor_parallelism=smp.tp_size() > 1 to the
smp.model_creation context manager. Adding this line also helps automatically detect
whether tensor parallelism is activated or not.

with smp.model_creation(
    ... ,
    tensor_parallelism=smp.tp_size() > 1
):
    model = ...

2. When you wrap the optimizer with
   smdistributed.modelparallel.torch.DistributedOptimizer, set either
   the static_loss_scale or dynamic_loss_scale argument. By default,
   static_loss_scale is set to 1.0, and dynamic_loss_scale is set to False. If you set
   dynamic_loss_scale=True, you can feed dynamic loss scaling options as a dictionary through the
   dynamic_loss_args argument. In most cases, we recommend that you use dynamic loss scaling with the
   default options. For more information, options, and examples of the optimizer wrapper function, see
   the smdistributed.modelparallel.torch.DistributedOptimizer API.

The following code is an example of wrapping an Adadelta optimizer object with dynamic loss
scaling for FP16 training.

optimizer = torch.optim.Adadelta(...)
optimizer = smp.DistributedOptimizer(
    optimizer,
    static_loss_scale=None,
    dynamic_loss_scale=True,
    dynamic_loss_args={
        "scale_window": 1000,
        "min_scale": 1,
        "delayed_shift": 2
    }
)

Configure a SageMaker PyTorch estimator

Add the FP16 parameter ("fp16") to the distribution configuration for model parallelism when creating
a SageMaker PyTorch estimator object. For a complete list of the configuration parameters for model
parallelism, see Parameters for smdistributed.

from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,
        "tensor_parallel_degree": 2,
        ...,
        "fp16": True
    }
}

fp16_estimator = PyTorch(
    entry_point="fp16_training_script.py", # Specify your train script
    ...,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {...}
    }
)

fp16_estimator.fit(...)

When FP16 training starts, the model and the optimizer are wrapped by FP16_Module and
FP16_Optimizer respectively, which are modified smdistributed versions of the Apex utils.
FP16_Module converts the model to FP16 dtype and deals with the forward pass in FP16.
Tip
You can apply gradient clipping by calling clip_master_grads before optimizer.step.

optimizer.clip_master_grads(max_norm)  # max_norm (float or int): max norm of the gradients

Tip
When using torch.optim.lr_scheduler and FP16 training, you need to pass
optimizer.optimizer to the LR scheduler rather than the optimizer. See the following
example code.

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(
    optimizer.optimizer if smp.state.cfg.fp16 else optimizer,
    step_size=1,
    gamma=args.gamma
)

Support for FlashAttention

Support for FlashAttention is a feature of the library that applies only to the distributed transformer
model, which is a Transformer model wrapped by smp.DistributedModel() for model-parallel
training. This feature is also compatible with the section called “Tensor Parallelism” (p. 1890).

The FlashAttention library only supports models whose attention_head_size is set to a value that's a
multiple of 8 and less than 128. Therefore, when you train a distributed transformer and want to make sure
that FlashAttention works properly, you should adjust parameters so that the attention head size complies
with the requirements. For more information, see also Installation and features in the FlashAttention GitHub
repository.

For example, assume that you configure a Transformer model with hidden_width=864 and
num_heads=48. The head size of FlashAttention is calculated as attention_head_size =
hidden_width / num_heads = 864 / 48 = 18. To enable FlashAttention, you need to adjust the
num_heads parameter to 54, so that attention_head_size = hidden_width / num_heads =
864 / 54 = 16, which is a multiple of 8.
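
To generalize this check, the following is a minimal sketch of a hypothetical helper (not part of the
SageMaker library) that enumerates head counts compatible with FlashAttention for a given hidden width:

def flash_attention_compatible_num_heads(hidden_width):
    # Return (num_heads, head_size) pairs where the head size is a
    # multiple of 8 and less than 128, per the requirements above.
    candidates = []
    for num_heads in range(1, hidden_width + 1):
        if hidden_width % num_heads != 0:
            continue
        head_size = hidden_width // num_heads
        if head_size % 8 == 0 and head_size < 128:
            candidates.append((num_heads, head_size))
    return candidates

print(flash_attention_compatible_num_heads(864))
# The output contains (54, 16), matching the example above.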

Run a SageMaker Distributed Training Job with Model Parallelism

Learn how to run a model-parallel training job of your own training script using the SageMaker Python
SDK with the SageMaker model parallelism library.

There are three use-case scenarios for running a SageMaker training job.


1. You can use one of the pre-built AWS Deep Learning Containers for TensorFlow and PyTorch. This
option is recommended if this is your first time using the model parallel library. To find a tutorial
for how to run a SageMaker model parallel training job, see the example notebooks at PyTorch
training with Amazon SageMaker's model parallelism library.
2. You can extend the pre-built containers to handle any additional functional requirements for your
algorithm or model that the pre-built SageMaker Docker image doesn't support. To find an example of
how you can extend a pre-built container, see Extend a Pre-built Container (p. 2675).
3. You can adapt your own Docker container to work with SageMaker using the SageMaker Training
toolkit. For an example, see Adapting Your Own Training Container.

For options 2 and 3 in the preceding list, refer to Extend a Pre-built Docker Container that Contains
SageMaker's Distributed Model Parallel Library (p. 1924) to learn how to install the model parallel
library in an extended or customized Docker container.

In all cases, you launch your training job by configuring a SageMaker TensorFlow or PyTorch estimator to
activate the library. To learn more, see the following topics.

Topics
• Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model Parallel
Library (p. 1907)
• Step 2: Launch a Training Job Using the SageMaker Python SDK (p. 1921)

Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model
Parallel Library
Use this section to learn how to customize your training script to use the core features of the Amazon
SageMaker model parallelism library. To use the library-specific API functions and parameters, we
recommend you use this documentation alongside the SageMaker model parallel library APIs in the
SageMaker Python SDK documentation.

The training script examples provided in these sections are simplified and designed to highlight the
required changes you must make to use the library. For end-to-end, runnable notebook examples that
demonstrate how to use a TensorFlow or PyTorch training script with the SageMaker model parallelism
library, see Amazon SageMaker Distributed Training Notebook Examples (p. 1942).

Topics
• Split the model of your training script using the SageMaker model parallelism library (p. 1907)
• Modify a TensorFlow training script (p. 1909)
• Modify a PyTorch Training Script (p. 1915)

Split the model of your training script using the SageMaker model parallelism library
There are two ways to modify your training script to set up model splitting: automated splitting or
manual splitting.

Automated model splitting


When you use SageMaker's model parallelism library, you can take advantage of automated model
splitting, also referred to as automated model partitioning. The library uses a partitioning algorithm that
balances memory, minimizes communication between devices, and optimizes performance. You can
configure the automated partitioning algorithm to optimize for speed or memory.

Alternatively, you can use manual model splitting. We recommend automated model splitting, unless
you are very familiar with the model architecture and have a good idea of how to efficiently partition
your model.


How it works

Auto-partitioning occurs during the first training step, when the smp.step-decorated function is first
called. During this call, the library first constructs a version of the model on the CPU RAM (to avoid GPU
memory limitations), and then analyzes the model graph and makes a partitioning decision. Based on
this decision, each model partition is loaded on a GPU, and only then the first step is executed. Because
of these analysis and partitioning steps, the first training step might take longer.

In either framework, the library manages the communication between devices through its own backend,
which is optimized for AWS infrastructure.

The auto-partition design adapts to the characteristics of the framework, and the library does the
partitioning at the granularity level that is more natural in each framework. For instance, in TensorFlow,
each specific operation can be assigned to a different device, whereas in PyTorch, the assignment is done
at the module level, where each module consists of multiple operations. The following sections review the
specifics of the design in each framework.

Automated model splitting with PyTorch

During the first training step, the model parallelism library internally runs a tracing step that is meant
to construct the model graph and determine the tensor and parameter shapes. After this tracing step,
the library constructs a tree, which consists of the nested nn.Module objects in the model, as well as
additional data gathered from tracing, such as the amount of stored nn.Parameters, and execution
time for each nn.Module.

Next, the library traverses this tree from the root and runs a partitioning algorithm that assigns each
nn.Module to a device, which balances computational load (measured by module execution time)
and memory use (measured by the total stored nn.Parameter size and activations). If multiple
nn.Modules share the same nn.Parameter, then these modules are placed on the same device to
avoid maintaining multiple versions of the same parameter. Once the partitioning decision is made, the
assigned modules and weights are loaded to their devices.

For instructions on how to register the smp.step decorator to your PyTorch training script, see the
section called “Automated splitting with PyTorch” (p. 1915).

Automated model splitting with TensorFlow

The model parallelism library analyzes the sizes of the trainable variables and the graph structure, and
internally uses a graph partitioning algorithm. This algorithm comes up with a device assignment for
each operation, with the objective of minimizing the amount of communication needed across devices,
subject to two constraints:

• Balancing the number of variables stored in each device
• Balancing the number of operations executed in each device

If you specify speed for optimize (in the model parallelism parameters in the Python SDK), the library
tries to balance the number of operations and tf.Variable objects in each device. Otherwise, it tries to
balance the total size of tf.Variables.

Once the partitioning decision is made, the library creates a serialized representation of the subgraph
that each device needs to execute and imports them onto each device. While partitioning, the library
places operations that consume the same tf.Variable and operations that are part of the same Keras
layer onto the same device. It also respects the colocation constraints imposed by TensorFlow. This
means that, for example, if there are two Keras layers that share a tf.Variable, then all operations
that are part of these layers are placed on a single device.

For instructions on how to register the smp.step decorator to your TensorFlow training script, see the
section called “Automated splitting with TensorFlow” (p. 1910).


Comparison of automated model splitting between frameworks

In TensorFlow, the fundamental unit of computation is a tf.Operation, and TensorFlow represents
the model as a directed acyclic graph (DAG) of tf.Operations. The model parallelism library therefore
partitions this DAG so that each node goes to one device. Crucially, tf.Operation objects are
sufficiently rich with customizable attributes, and they are universal in the sense that every model is
guaranteed to consist of a graph of such objects.

PyTorch, on the other hand, does not have an equivalent notion of operation that is sufficiently rich and
universal. The closest unit of computation in PyTorch that has these characteristics is an nn.Module,
which is at a much higher granularity level, and this is why the library does partitioning at this level in
PyTorch.

Manual Model Splitting

If you want to manually specify how to partition your model across devices, use the smp.partition
context manager. For instructions on how to set the context manager for manual partitioning, see the
following pages.

• the section called “Manual splitting with TensorFlow” (p. 1913)
• the section called “Manual splitting with PyTorch” (p. 1917)

To use this option after making modifications, in Step 2, you'll need to set auto_partition to False,
and define a default_partition in the framework estimator class of the SageMaker Python SDK.
Any operation that is not explicitly placed on a partition through the smp.partition context manager
is executed on the default_partition. In this case, the automated splitting logic is bypassed, and
each operation is placed based on your specification. Based on the resulting graph structure, the model
parallelism library creates a pipelined execution schedule automatically.
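
For illustration, the following is a minimal sketch of what such a configuration might look like in the
estimator's smdistributed parameters (auto_partition and default_partition are listed in Parameters for
smdistributed in the SageMaker Python SDK documentation; adjust the values for your model):

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,          # total number of model partitions
        "auto_partition": False,  # bypass the automated splitting logic
        "default_partition": 0    # partition for anything not explicitly placed
    }
}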

Modify a TensorFlow training script

In this section, you learn how to modify TensorFlow training scripts to configure the SageMaker model
parallelism library for auto-partitioning and manual partitioning. This selection of examples also
includes an example integrated with Horovod for hybrid model and data parallelism.
Note
To find which TensorFlow versions are supported by the library, see the section called
“Supported Frameworks and AWS Regions” (p. 1872).

The required modifications you must make to your training script to use the library are listed in
Automated splitting with TensorFlow (p. 1910).

To learn how to modify your training script to use hybrid model and data parallelism with Horovod, see
Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism (p. 1911).

If you want to use manual partitioning, also review Manual splitting with TensorFlow (p. 1913).
Tip
For end-to-end notebook examples that demonstrate how to use a TensorFlow training script
with the SageMaker model parallelism library, see TensorFlow Examples (p. 1943).

The following topics show examples of training scripts that you can use to configure SageMaker's model
parallelism library for auto-partitioning and manual partitioning TensorFlow models.
Note
Auto-partitioning is enabled by default. Unless otherwise specified, the example scripts use
auto-partitioning.

Topics


• Automated splitting with TensorFlow (p. 1910)
• Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism (p. 1911)
• Manual splitting with TensorFlow (p. 1913)
• Unsupported framework features (p. 1914)

Automated splitting with TensorFlow

The following training script changes are required to run a TensorFlow model with SageMaker's model
parallelism library:

1. Import and initialize the library with smp.init().
2. Define a Keras model by inheriting from smp.DistributedModel instead of the Keras Model class.
   Return the model outputs from the call method of the smp.DistributedModel object. Be mindful
   that any tensors returned from the call method will be broadcast across model-parallel devices,
   incurring communication overhead, so any tensors that are not needed outside the call method (such
   as intermediate activations) should not be returned.
3. Set drop_remainder=True in the tf.Dataset.batch() method. This ensures that the batch size
   is always divisible by the number of microbatches.
4. Seed the random operations in the data pipeline using smp.dp_rank(), for example, shuffle(ds,
   seed=smp.dp_rank()), to ensure consistency of data samples across GPUs that hold different model
   partitions.
5. Put the forward and backward logic in a step function and decorate it with smp.step.
6. Perform post-processing on the outputs across microbatches using StepOutput methods such as
   reduce_mean. The smp.step function must have a return value that depends on the output of
   smp.DistributedModel.
7. If there is an evaluation step, similarly place the forward logic inside an smp.step-decorated function
   and post-process the outputs using the StepOutput API.

To learn more about the SageMaker's model parallelism library API, refer to the API documentation.

The following Python script is an example of a training script after the changes are made.

import tensorflow as tf

# smdistributed: Import TF2.x API
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: If needed, seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API
class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        # define layers

    def call(self, x, training=None):
        # define forward pass and return the model output

model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions

@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    gradients = [g.accumulate() for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # smdistributed: Merge predictions and average losses across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()

for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()
    for images, labels in train_ds:
        loss = train_step(images, labels)
        accuracy = train_accuracy.result()

If you are done preparing your training script, proceed to Step 2: Launch a Training Job Using the
SageMaker Python SDK (p. 1921). If you want to run a hybrid model and data parallel training job,
continue to the next section.

Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism

You can use the SageMaker model parallelism library with Horovod for hybrid model and data
parallelism. To read more about how the library splits a model for hybrid parallelism, see Pipeline
parallelism (available for PyTorch and TensorFlow) (p. 1866).

In this step, we focus on how to modify your training script to adapt it to the SageMaker model parallelism
library.

To properly set up your training script to pick up the hybrid parallelism configuration that you'll set
in Step 2: Launch a Training Job Using the SageMaker Python SDK (p. 1921), use the library's helper
functions, smp.dp_rank() and smp.mp_rank(), which automatically detect the data parallel rank and
model parallel rank respectively.

To find all MPI primitives the library supports, see MPI Basics in the SageMaker Python SDK
documentation.

The required changes needed in the script are:

• Adding hvd.allreduce
• Broadcasting variables after the first batch, as required by Horovod
• Seeding shuffling and/or sharding operations in the data pipeline with smp.dp_rank().

Note
When you use Horovod, you must not directly call hvd.init in your training script. Instead,
you'll have to set "horovod" to True in the SageMaker Python SDK modelparallel
parameters in Step 2: Launch a Training Job Using the SageMaker Python SDK (p. 1921). This
allows the library to internally initialize Horovod based on the device assignments of model
partitions. Calling hvd.init() directly in your training script can cause problems.
Note
Using the hvd.DistributedOptimizer API directly in your training script might result in
poor training performance and speed, because the API implicitly places the AllReduce
operation inside smp.step. We recommend that you use the model parallelism library with
Horovod by directly calling hvd.allreduce after calling accumulate() or reduce_mean()
on the gradients returned from smp.step, as shown in the following example.

To learn more about the SageMaker's model parallelism library API, refer to the API documentation.

import tensorflow as tf
import horovod.tensorflow as hvd

# smdistributed: Import TF2.x API
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: Seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API
class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        # define layers

    def call(self, x, training=None):
        # define forward pass and return model outputs

model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions

@tf.function
def train_step(images, labels, first_batch):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    # Horovod: AllReduce the accumulated gradients
    gradients = [hvd.allreduce(g.accumulate()) for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Horovod: Broadcast the variables after first batch
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)

    # smdistributed: Merge predictions across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()

for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()

    for batch, (images, labels) in enumerate(train_ds):
        loss = train_step(images, labels, tf.constant(batch == 0))

Manual splitting with TensorFlow

Use smp.partition context managers to place operations in a specific partition. Any operation not
placed in any smp.partition context is placed in the default_partition. To learn more about
SageMaker's model parallelism library API, refer to the API documentation.

import tensorflow as tf

# smdistributed: Import TF2.x API.
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: If needed, seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API.
class MyModel(smp.DistributedModel):
    def __init__(self):
        # define layers

    def call(self, x):
        with smp.partition(0):
            x = self.layer0(x)
        with smp.partition(1):
            return self.layer1(x)

model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions

@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    gradients = [g.accumulate() for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # smdistributed: Merge predictions and average losses across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()

for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()
    for images, labels in train_ds:
        loss = train_step(images, labels)
        accuracy = train_accuracy.result()

Unsupported framework features

The following TensorFlow features are not supported by the library:

• tf.GradientTape() is currently not supported. You can use Optimizer.get_gradients() or
  Optimizer.compute_gradients() instead to compute gradients.
• The tf.train.Checkpoint.restore() API is currently not supported. For checkpointing, use
  smp.CheckpointManager instead, which provides the same API and functionality (see the sketch
  following this list). Note that checkpoint restores with smp.CheckpointManager should take place
  after the first step.

Modify a PyTorch Training Script

In this section, you learn how to modify PyTorch training scripts to configure the SageMaker model
parallelism library for auto-partitioning and manual partitioning.
Note
To find which PyTorch versions are supported by the library, see the section called “Supported
Frameworks and AWS Regions” (p. 1872).
Tip
For end-to-end notebook examples that demonstrate how to use a PyTorch training script with
the SageMaker model parallelism library, see PyTorch Examples (p. 1943).

Note that auto-partitioning is enabled by default. Unless otherwise specified, the following scripts use
auto-partitioning.

Topics
• Automated splitting with PyTorch (p. 1915)
• Manual splitting with PyTorch (p. 1917)
• Considerations (p. 1918)
• Unsupported framework features (p. 1920)

Automated splitting with PyTorch

The following training script changes are required to run a PyTorch training script with SageMaker's
model parallelism library:

1. Import and initialize the library with smdistributed.modelparallel.torch.init().
2. Wrap the model with smdistributed.modelparallel.torch.DistributedModel. Be mindful
   that any tensors returned from the forward method of the underlying nn.Module object will be
   broadcast across model-parallel devices, incurring communication overhead, so any tensors that are
   not needed outside the forward method (such as intermediate activations) should not be returned.
   Note
   For FP16 training, you need to use the smdistributed.modelparallel.torch.model_creation()
   context manager to wrap the model. For more information, see FP16 Training with Model
   Parallelism (p. 1904).
3. Wrap the optimizer with smdistributed.modelparallel.torch.DistributedOptimizer.
   Note
   For FP16 training, you need to set up static or dynamic loss scaling. For more information, see
   FP16 Training with Model Parallelism (p. 1904).
4. Use the returned DistributedModel object instead of a user model.
5. Put the forward and backward logic in a step function and decorate it with
   smdistributed.modelparallel.torch.step.
6. Restrict each process to its own device through torch.cuda.set_device(smp.local_rank()).
7. Move the input tensors to the GPU using the .to() API before the smp.step call (see the following
   example).
8. Replace torch.Tensor.backward and torch.autograd.backward with
   DistributedModel.backward.
9. Perform post-processing on the outputs across microbatches using StepOutput methods such as
   reduce_mean.
10. If there is an evaluation step, similarly place the forward logic inside an smp.step-decorated function
    and post-process the outputs using the StepOutput API.
11. Set drop_last=True in DataLoader. Alternatively, manually skip a batch in the training loop if the
    batch size is not divisible by the number of microbatches.

To learn more about the SageMaker's model parallelism library API, refer to the API documentation.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        # define layers

    def forward(self, x):
        # define forward pass and return model outputs

# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by the current process,
        # based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data-parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)

Manual splitting with PyTorch

Use smp.partition context managers to place modules on specific devices. Any module not placed
in any smp.partition context is placed on the default_partition. The default_partition
must be provided if auto_partition is set to False. The modules that are created within a specific
smp.partition context are placed on the corresponding partition.

To learn more about the SageMaker's model parallelism library API, refer to the API documentation.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        with smp.partition(0):
            # define child modules on device 0
        with smp.partition(1):
            # define child modules on device 1

    def forward(self, x):
        # define forward pass and return model outputs

# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by the current process,
        # based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data-parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)

Considerations

When you configure a PyTorch training script using SageMaker's model parallelism library, you should be
aware of the following:

• If you are using an optimization technique that relies on global gradient norms, for example the
  gradient norm of the entire model, such as some variants of the LAMB optimizer or global gradient
  clipping, you need to gather all the norms across the model partitions for correctness. You can use the
  library's basic communication data types to do this; a rough sketch follows this list.
• All torch.Tensor arguments to the forward methods of the nn.Modules in your model must be
used in the computation of the module output. In other words, the library does not support the case
where there is a torch.Tensor argument to a module on which the module output does not depend.


• The argument to the smp.DistributedModel.backward() call must depend on all model outputs.
In other words, there cannot be an output from the smp.DistributedModel.forward call that is
not used in the computation of the tensor that is fed into the smp.DistributedModel.backward
call.
• If there are torch.cuda.synchronize() calls in your code, you might need to call
torch.cuda.set_device(smp.local_rank()) immediately before the synchronize call.
Otherwise unnecessary CUDA contexts might be created in device 0, which will needlessly consume
memory.
• Since the library places nn.Modules on different devices, the modules in the model must not depend
on any global state that is modified inside smp.step. Any state that remains fixed throughout
training, or that is modified outside smp.step in a way that is visible to all processes, is allowed.
• You don’t need to move the model to GPU (for example, using model.to(device)) when using
the library. If you try to move the model to GPU before the model is partitioned (before the first
smp.step call), the move call is ignored. The library automatically moves the part of the model
assigned to a rank to its GPU. Once training with the library starts, don’t move the model to CPU
and use it, as it won’t have correct parameters for modules not assigned to the partition held by
the process. If you want to retrain a model or use it for inference without the library after it was
trained using the model parallelism library, the recommended way is to save the full model using our
checkpointing API and load it back to a regular PyTorch Module.
• If you have a list of modules such that output of one feeds into another, replacing that list with
nn.Sequential can significantly improve performance.
• The weight update (optimizer.step()) needs to happen outside of smp.step because that is when
the entire backward pass is done and gradients are ready. When using a hybrid model with model and
data parallelism, at this point, AllReduce of gradients is also guaranteed to finish.
• When using the library in combination with data parallelism, make sure that the number of batches
on all data parallel ranks is the same so that AllReduce does not hang waiting for a rank which is not
participating in the step.
• If you launch a training job using an ml.p4d instance type (such as ml.p4d.24xlarge), you must set the
data loader variable num_workers=0. For example, you may define your DataLoader as follows:

dataloader = torch.utils.data.DataLoader(
    data,
    batch_size=batch_size,
    num_workers=0,
    pin_memory=True,
    drop_last=True,
    shuffle=shuffle,
)

• The inputs to smp.step must be the model inputs generated by DataLoader. This is because
smp.step internally splits the input tensors along the batch dimension and pipelines them. This
means that passing DataLoader itself to the smp.step function to generate the model inputs inside
does not work.

For example, if you define a DataLoader as follows:

train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

You should access the model inputs generated by train_loader and pass those to an smp.step
decorated function. Do not pass train_loader directly to the smp.step function.

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        ...
        _, loss_mb = train_step(model, data, target)
        ...

@smp.step
def train_step(model, data, target):
    ...
    return output, loss

• The input tensors to smp.step must be moved to the current device using the .to() API, which must
  take place after the torch.cuda.set_device(smp.local_rank()) call.

  For example, you may define the train function as follows. This function adds data and target to
  the current device using the .to() API before using those input tensors to call train_step.

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by the current process,
        # based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

The input tensors to this smp.step-decorated function have been moved to the current device in
the train function above. The model does not need to be moved to the current device. The library
automatically moves the part of the model assigned to a rank to its GPU.

@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss
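
Returning to the first consideration in this list, the following is a rough sketch, not the library's
documented API, of combining the gradient norm across model partitions before global clipping. The
mp_group argument is an assumption standing in for the model parallel process group; consult the
smdistributed API documentation for the exact process group accessor and collectives to use.

import torch
import torch.distributed as dist

def global_grad_norm(parameters, mp_group):
    # Sum of squared gradient norms held by this model partition.
    local_sq = torch.zeros(1, device="cuda")
    for p in parameters:
        if p.grad is not None:
            local_sq += p.grad.detach().float().norm() ** 2
    # Combine the squared norms across all model partitions.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=mp_group)
    return local_sq.sqrt().item()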

Unsupported framework features

The following PyTorch features are unsupported by SageMaker's model parallelism library:

• If you use data parallelism with the native PyTorch DDP, the
torch.nn.parallel.DistributedDataParallel wrapper module is not supported by the library.
The library internally manages integrating with PyTorch DDP, including parameter broadcast and
gradient AllReduce. When using the library, module buffers are only broadcast once at the start of
training. If your model has module buffers that need to be synchronized across data parallel groups at
each step, you can do so through the torch.distributed API, using the process group that can be
obtained via smp.get_dp_process_group().
• For mixed precision training, the apex.amp module is not supported. The recommended way to use
the library with automatic mixed-precision is to use torch.cuda.amp, with the exception of using
smp.amp.GradScaler instead of the implementation in torch.
• torch.jit.ScriptModules or ScriptFunctions are not supported by smp.DistributedModel.
• apex: FusedLayerNorm, FusedAdam, FusedLAMB, and FusedNovoGrad from apex are not
  supported. You can use the library implementations of these through the smp.optimizers and smp.nn
  APIs instead.


Step 2: Launch a Training Job Using the SageMaker Python SDK


The SageMaker Python SDK supports managed training of models with ML frameworks such as
TensorFlow and PyTorch. To launch a training job using one of these frameworks, you define a
SageMaker TensorFlow estimator, a SageMaker PyTorch estimator, or a SageMaker generic Estimator to
use the modified training script and model parallelism configuration.

Topics
• Using the SageMaker TensorFlow and PyTorch Estimators (p. 1921)
• Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel
Library (p. 1924)
• Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library (p. 1925)

Using the SageMaker TensorFlow and PyTorch Estimators

The TensorFlow and PyTorch estimator classes contain the distribution parameter, which you can use
to specify configuration parameters for using distributed training frameworks. The SageMaker model
parallel library internally uses MPI for hybrid data and model parallelism, so you must use the MPI option
with the library.

The following template of a TensorFlow or PyTorch estimator shows how to configure the
distribution parameter for using the SageMaker model parallel library with MPI.

Using the SageMaker TensorFlow estimator

import sagemaker
from sagemaker.tensorflow import TensorFlow

smp_options = {
    "enabled": True,              # Required
    "parameters": {
        "partitions": 2,          # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "horovod": True,          # Use this for hybrid model and data parallelism
    }
}

mpi_options = {
    "enabled" : True,             # Required
    "processes_per_host" : 8,     # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = TensorFlow(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='2.6.3',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

Using the SageMaker PyTorch estimator

import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {                    # Required
        "pipeline_parallel_degree": 2, # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,
    }
}

mpi_options = {
    "enabled" : True,                  # Required
    "processes_per_host" : 8,          # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

To enable the library, you need to pass configuration dictionaries to the "smdistributed" and "mpi"
keys through the distribution argument of the SageMaker estimator constructors.

Configuration parameters for SageMaker model parallelism

• For the "smdistributed" key, pass a dictionary with the "modelparallel" key and the following
inner dictionaries.
Note
Using "modelparallel" and "dataparallel" in one training job is not supported.
• "enabled" – Required. To enable model parallelism, set "enabled": True.
• "parameters" – Required. Specify a set of parameters for SageMaker model parallelism.
• For a complete list of common parameters, see Parameters for smdistributed in the SageMaker
Python SDK documentation.

For TensorFlow, see TensorFlow-specific Parameters.

For PyTorch, see PyTorch-specific Parameters.


• "pipeline_parallel_degree" (or "partitions" in smdistributed-


modelparallel<v1.6.0) – Required. Among the parameters for smdistributed, this
parameter is required to specify how many model partitions you want to split into.
Important
There is a breaking change in the parameter name. The
"pipeline_parallel_degree" parameter replaces the "partitions" since
smdistributed-modelparallel v1.6.0. For more information, see Common
Parameters for SageMaker model parallelism configuration and SageMaker Distributed
Model Parallel Release Notes in the SageMaker Python SDK documentation.
• For the "mpi" key, pass a dictionary that contains the following:
• "enabled" – Required. Set True to launch the distributed training job with MPI.
• "processes_per_host" – Required. Specify the number of processes MPI should launch on
each host. In SageMaker a host is a single Amazon EC2 ML instance. The SageMaker Python SDK
maintains a one-to-one mapping between processes and GPUs across model and data parallelism.
This means that SageMaker schedules each process on a single, separate GPU and no GPU contains
more than one process. If you are using PyTorch, you must restrict each process to its own device
through torch.cuda.set_device(smp.local_rank()). To learn more, see Automated splitting
with PyTorch (p. 1915).
Important
processes_per_host must not be greater than the number of GPUs per instance and
typically will be equal to the number of GPUs per instance.
• "custom_mpi_options" (optional) – Use this key to pass any custom MPI options you might
need. If you do not pass any MPI custom options to the key, the MPI option is set by default to the
following flag.

--mca btl_vader_single_copy_mechanism none

Note
You do not need to explicitly specify this default flag to the key. If you explicitly specify it,
your distributed model parallel training job might fail with the following error:

The following MCA parameter has been listed multiple times on the command
line:
MCA param: btl_vader_single_copy_mechanism MCA parameters can only be listed
once
on a command line to ensure there is no ambiguity as to its value.
Please correct the situation and try again.

Tip
If you launch a training job using an EFA-enabled instance type, such as ml.p4d.24xlarge
and ml.p3dn.24xlarge, use the following flag for best performance:

-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1
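
For example, the following is a sketch of passing these flags through the "custom_mpi_options" key
described above:

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,
    "custom_mpi_options": "-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1"
}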

To launch the training job using the estimator and your SageMaker model parallel configured training
script, run the estimator.fit() function.

Use the following resources to learn more about using the model parallelism features in the SageMaker
Python SDK:

• Use TensorFlow with the SageMaker Python SDK


• Use PyTorch with the SageMaker Python SDK


• We recommend that you use a SageMaker notebook instance if you are a new user. To see an example of
  how you can launch a training job using a SageMaker notebook instance, see Amazon SageMaker
  Distributed Training Notebook Examples (p. 1942).
• You can also submit a distributed training job from your machine using AWS CLI. To set up AWS CLI on
your machine, see set up your AWS credentials and Region for development.

Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel
Library

To extend a pre-built container and use SageMaker's model parallelism library, you must use one of
the available AWS Deep Learning Containers (DLC) images for PyTorch or TensorFlow. The SageMaker
model parallelism library is included in the TensorFlow (2.3.0 and later) and PyTorch (1.6.0 and later) DLC
images with CUDA (cuxyz). For a complete list of DLC images, see Available Deep Learning Containers
Images in the AWS Deep Learning Containers GitHub repository.
Tip
We recommend that you use the image that contains the latest version of TensorFlow or
PyTorch to access the most up-to-date version of the SageMaker model parallelism library.

For example, your Dockerfile should contain a FROM statement similar to the following:

# Use the SageMaker DLC image URI for TensorFlow or PyTorch
FROM aws-dlc-account-id.dkr.ecr.aws-region.amazonaws.com/framework-training:{framework-version-tag}

# Add your dependencies here
RUN ...

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

Additionally, when you define a PyTorch or TensorFlow estimator, you must specify the
entry_point for your training script. This should be the same path identified with ENV
SAGEMAKER_SUBMIT_DIRECTORY in your Dockerfile.
Tip
You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR)
and use the image URI (image_uri) to define a SageMaker estimator for training. For more
information, see Extend a Pre-built Container (p. 2675).

After you finish hosting the Docker container and retrieving the image URI of the container, create a
SageMaker estimator object as follows. This example assumes that you have already defined
smp_options and mpi_options.

from sagemaker.estimator import Estimator

smd_mp_estimator = Estimator(
    entry_point="your_training_script.py",
    role=sagemaker.get_execution_role(),
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    image_uri='your_aws_account_id.dkr.ecr.region.amazonaws.com/name:tag',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library

To build your own Docker container for training and use the SageMaker model parallel library, you must
include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in
your Dockerfile. This section provides the minimum set of code blocks you must include to properly
prepare a SageMaker training environment and the model parallel library in your own Docker container.
Note
This custom Docker option with the SageMaker model parallel library as a binary is available
only for PyTorch.

To create a Dockerfile with the SageMaker training toolkit and the model parallel library

1. Start with one of the NVIDIA CUDA base images.

FROM <cuda-cudnn-base-image>

Tip
The official AWS Deep Learning Container (DLC) images are built from the NVIDIA CUDA
base images. We recommend you look into the official Dockerfiles of AWS Deep Learning
Container for PyTorch to find which versions of the libraries you need to install and how to
configure them. The official Dockerfiles are complete, benchmark tested, and managed by
the SageMaker and Deep Learning Container service teams. In the provided link, choose the
PyTorch version you use, choose the CUDA (cuxyz) folder, and choose the Dockerfile ending
with .gpu or .sagemaker.gpu.
2. To set up a distributed training environment, you need to install software for communication and
network devices, such as Elastic Fabric Adapter (EFA), NVIDIA Collective Communications Library
(NCCL), and Open MPI. Depending on the PyTorch and CUDA versions you choose, you must install
compatible versions of the libraries.
Important
Because the SageMaker model parallel library requires the SageMaker data parallel library
in the subsequent steps, we highly recommend that you follow the instructions at Create
Your Own Docker Container with the SageMaker Distributed Data Parallel Library (p. 1850) to
properly set up a SageMaker training environment for distributed training.

For more information about setting up EFA with NCCL and Open MPI, see Get started with EFA and
MPI and Get started with EFA and NCCL.
3. Add the following arguments to specify the URLs of the SageMaker distributed training packages for
PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use the
cross-node Remote Direct Memory Access (RDMA).

ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl

4. Install dependencies that the SageMaker model parallel library requires.


a. Install the METIS library.

ARG METIS=metis-5.1.0


RUN rm /etc/apt/sources.list.d/* \
&& wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
&& gunzip -f ${METIS}.tar.gz \
&& tar -xvf ${METIS}.tar \
&& cd ${METIS} \
&& apt-get update \
&& make config shared=1 \
&& make install \
&& cd .. \
&& rm -rf ${METIS}.tar* \
&& rm -rf ${METIS} \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean

b. Install the RAPIDS Memory Manager library. This requires CMake 3.14 or later.

ARG RMM_VERSION=0.15.0

RUN wget -nv https://github.com/rapidsai/rmm/archive/v${RMM_VERSION}.tar.gz \
&& tar -xvf v${RMM_VERSION}.tar.gz \
&& cd rmm-${RMM_VERSION} \
&& INSTALL_PREFIX=/usr/local ./build.sh librmm \
&& cd .. \
&& rm -rf v${RMM_VERSION}.tar* \
&& rm -rf rmm-${RMM_VERSION}

5. Install the SageMaker model parallel library.

RUN pip install --no-cache-dir -U ${SMD_MODEL_PARALLEL_URL}

6. Install the SageMaker data parallel library.

RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}

7. Install the sagemaker-training toolkit. The toolkit contains the common functionality that's necessary
to create a container compatible with the SageMaker training platform and the SageMaker Python
SDK.

RUN pip install sagemaker-training

8. After you finish creating the Dockerfile, see Adapting Your Own Training Container to learn how to
build the Docker container and host it in Amazon ECR.

Tip
For more general information about creating a custom Dockerfile for training in SageMaker, see
Use Your Own Training Algorithms.

Checkpointing and Fine-Tuning a Model with Model Parallelism


The SageMaker model parallelism library provides checkpointing APIs to save the model state and the
optimizer state split by the various model parallelism strategies, and to load checkpoints so that you can
resume training from where you stopped and fine-tune. The APIs also support options to save the
model and optimizer states partially or fully.

Topics
• Checkpointing a distributed model (p. 1927)
• Fine-tuning a distributed model (p. 1931)


Checkpointing a distributed model


Choose one of the following topics depending on your framework (PyTorch or TensorFlow) and
the version of the SageMaker model parallelism library you use.

Topics
• Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and
later) (p. 1927)
• Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between
v1.6.0 and v1.9.0) (p. 1929)
• Checkpointing a distributed TensorFlow model (p. 1930)

Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0
and later)

The SageMaker model parallelism library provides checkpoint APIs to save and load full or partial
checkpoints of the distributed model state and its optimizer state.
Note
This checkpointing method is recommended if you use PyTorch and the SageMaker model
parallelism library v1.10.0 or later.

Partial checkpointing

To save checkpoints of a model trained with model parallelism, use the


smdistributed.modelparallel.torch.save_checkpoint API with the partial checkpointing
option set to true (partial=True). This saves each model partition individually. In addition to
the model and the optimizer state, you can also save any additional custom data through the
user_content argument. The checkpointed model, optimizer, and user content are saved as separate
files. The save_checkpoint API call creates checkpoint folders in the following structure.

- path
- ${tag}_partial (folder for partial checkpoints)
- model_rankinfo.pt
- optimizer_rankinfo.pt
- fp16_states_rankinfo.pt
- user_content.pt
- $tag (checkpoint file for full checkpoints)
- user_content_$tag (user_content file for full checkpoints)
- newest (a file that indicates the newest checkpoint)

To resume training from partial checkpoints, use the


smdistributed.modelparallel.torch.resume_from_checkpoint API with partial=True,
and specify the checkpoint directory and the tag used while saving the partial checkpoints. Note that
the actual loading of model weights happens after model partitioning, during the first run of the
smdistributed.modelparallel.torch.step-decorated training step function.

When saving a partial checkpoint, the library also saves the model partition decision as files with the .pt
file extension. When resuming from the partial checkpoint, the library loads the partition decision files
together with the checkpoint. Once the partition decision is loaded, you can't change the partition.

The following code snippet shows how to set the checkpoint APIs in a PyTorch training script.

import smdistributed.modelparallel.torch as smp

model = ...


model = smp.DistributedModel(model)
optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)
user_content = ... # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=True,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path,
    partial=True
)

Full checkpointing

To save the final model artifact for inference purposes, use the
smdistributed.modelparallel.torch.save_checkpoint API with partial=False, which
combines the model partitions to create a single model artifact. Note that this does not combine the
optimizer states.

To initialize training with particular weights, given a full model checkpoint, you can use the
smdistributed.modelparallel.torch.resume_from_checkpoint API with partial=False.
Note that this does not load optimizer states.
Note
With tensor parallelism, in general, the state_dict must be translated between
the original model implementation and the DistributedModel implementation.
Optionally, you can provide the state_dict translation function as an argument to the
smdistributed.modelparallel.torch.resume_from_checkpoint. However, for the models
listed in the section called “Supported Models Out of the Box” (p. 1897), the library takes care of this
translation automatically.

The following code shows an example of how to use the checkpoint APIs for fully checkpointing a
PyTorch model trained with model parallelism.

import smdistributed.modelparallel.torch as smp

model = ...
model = smp.DistributedModel(model)
optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)
user_content = ... # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=False,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path,
    partial=False
)

Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library
between v1.6.0 and v1.9.0)

The SageMaker model parallelism library provides Python functions for saving partial or full checkpoints
for training jobs with tensor parallelism. The following procedure shows how to use smp.save() and
smp.load() to save and load a checkpoint when you use tensor parallelism.
Note
This checkpointing method is recommended if you use PyTorch, the section called “Tensor
Parallelism” (p. 1890), and the SageMaker model parallelism library between v1.6.0 and v1.9.0.

1. Prepare a model object and wrap it with the library's wrapper function smp.DistributedModel().

model = MyModel(...)
model = smp.DistributedModel(model)

2. Prepare an optimizer for the model. A set of model parameters is an iterable argument required by
optimizer functions. To prepare a set of model parameters, you must process model.parameters()
to assign unique IDs to individual model parameters.

If there are parameters with duplicated IDs in the model parameter iterable, loading the checkpointed
optimizer state fails. To create an iterable of model parameters with unique IDs for your optimizer, see
the following:

unique_params = []
unique_params_set = set()
for p in model.parameters():
    if p not in unique_params_set:
        unique_params.append(p)
        unique_params_set.add(p)
del unique_params_set

optimizer = MyOpt(unique_params, ...)

3. Wrap the optimizer using the library's wrapper function smp.DistributedOptimizer().

optimizer = smp.DistributedOptimizer(optimizer)

4. Save the model and the optimizer state using smp.save(). Depending on how you want to save
checkpoints, choose one of the following two options:
• Option 1: Save a partial model on each mp_rank for a single MP_GROUP.

model_dict = model.local_state_dict()  # save a partial model
opt_dict = optimizer.local_state_dict()  # save a partial optimizer state
# Save the dictionaries at rdp_rank 0 as a checkpoint
if smp.rdp_rank() == 0:
    smp.save(
        {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict},
        "/checkpoint.pt",
        partial=True,
    )

With tensor parallelism, the library saves checkpointed files named in the following format:
checkpoint.pt_{pp_rank}_{tp_rank}.
Note
With tensor parallelism, make sure you set the if statement as if smp.rdp_rank()
== 0 instead of if smp.dp_rank() == 0. When the optimizer state is sharded with
tensor parallelism, all reduced-data parallel ranks must save their own partition of the
optimizer state. Using the wrong if statement for checkpointing might result in a stalled
training job. For more information about using if smp.dp_rank() == 0 without tensor
parallelism, see General Instruction for Saving and Loading in the SageMaker Python SDK
documentation.
• Option 2: Save the full model.

if smp.rdp_rank() == 0:
    model_dict = model.state_dict(gather_to_rank0=True)  # save the full model
    if smp.rank() == 0:
        smp.save(
            {"model_state_dict": model_dict},
            "/checkpoint.pt",
            partial=False,
        )

Note
Consider the following for full checkpointing:
• If you set gather_to_rank0=True, all ranks other than 0 return empty dictionaries.
• For full checkpointing, you can only checkpoint the model. Full checkpointing of
optimizer states is currently not supported.
• The full model only needs to be saved at smp.rank() == 0.
5. Load the checkpoints using smp.load(). Depending on how you checkpointed in the previous step,
choose one of the following two options:
• Option 1: Load the partial checkpoints.

checkpoint = smp.load("/checkpoint.pt", partial=True)

model.load_state_dict(checkpoint["model_state_dict"], same_partition_load=False)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

You can set same_partition_load=True in model.load_state_dict() for a faster load if
you know that the partition will not change.
• Option 2: Load the full checkpoints.

if smp.rdp_rank() == 0:
    checkpoint = smp.load("/checkpoint.pt", partial=False)
    model.load_state_dict(checkpoint["model_state_dict"])

The if smp.rdp_rank() == 0 condition is not required, but it can help avoid redundant loading
among different MP_GROUPs. Full checkpointing of the optimizer state dict is currently not supported
with tensor parallelism.

Checkpointing a distributed TensorFlow model


To save a TensorFlow model while training with model parallelism, use the following functions provided
by the SageMaker model parallelism library.


• smdistributed.modelparallel.tensorflow.DistributedModel.save_model
• smdistributed.modelparallel.tensorflow.CheckpointManager
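
As a minimal sketch of saving the final model artifact (this assumes model is an
smp.DistributedModel instance built earlier in your training script, and the save path is illustrative):

import smdistributed.modelparallel.tensorflow as smp

# `model` is assumed to be an smp.DistributedModel built earlier in the script.
# Save the trained model for inference.
model.save_model(save_path="/opt/ml/model")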

Fine-tuning a distributed model


Fine-tuning must be configured in your training script. The following code snippet shows
an example structure of a training script that uses the AutoModelForCausalLM class of Hugging Face
Transformers with modifications for registering the smdistributed.modelparallel.torch
modules and settings for fine-tuning.
Note
Fine-tuning a distributed transformer (a Transformer model wrapped by
smp.DistributedModel()) with the smp.delay_param_initialization function activated
requires the fine-tuning job to be configured with an FSx for Lustre file system. In cases where
you want to fine-tune a large-scale model with the delayed parameter initialization option, you
should set up an FSx for Lustre file system.

import argparse
import os

import torch
from transformers import AutoModelForCausalLM

import smdistributed.modelparallel
import smdistributed.modelparallel.torch as smp

def parse_args():

    parser = argparse.ArgumentParser()

    # set an arg group for model
    model_grp = parser.add_argument_group(
        title="model", description="arguments to describe model configuration"
    )

    ... # set up numerous args to parse from the configuration dictionary to the script for training

    # add arg for activating fine-tuning
    model_grp.add_argument(
        "--fine_tune",
        type=int,
        default=0,
        help="Fine-tune model from checkpoint or pretrained model",
    )

def main():
    """Main function to train GPT."""
    args = parse_args()

    ... # parse numerous args

    if args.fine_tune > 0 and args.delayed_param > 0 and smp.rank() == 0:
        pretrained_model = AutoModelForCausalLM.from_pretrained(
            args.model_name or args.model_dir
        )
        model_state_dict = pretrained_model.state_dict()
        path = os.path.join(args.model_dir, "fullmodel.pt")
        torch.save(model_state_dict, path)

    # create a Transformer model and wrap it with smp.model_creation()
    # with options to configure model parallelism parameters offered by SageMaker
    with smp.model_creation(
        tensor_parallelism=smp.tp_size() > 1 or args.use_distributed_transformer > 0,
        zero_init=args.use_distributed_transformer == 0,
        dtype=dtype,
        distribute_embedding=args.sharded_data_parallel_degree > 1 and smp.tp_size() > 1,
        use_alibi=args.alibi > 0,
        attention_in_fp32=args.attention_in_fp32 > 0,
        fp32_residual_addition=args.residual_addition_in_fp32 > 0,
        query_key_layer_scaling=args.query_key_layer_scaling > 0 and args.bf16 < 1,
        fused_softmax=args.fused_softmax > 0,
        fused_dropout=args.fused_dropout > 0,
        fused_bias_gelu=args.fused_bias_gelu > 0,
        flash_attention=args.flash_attention > 0,
    ):
        if args.fine_tune > 0 and args.delayed_param == 0:
            model = AutoModelForCausalLM.from_pretrained(
                args.model_name or args.model_dir
            )
        else:
            model = AutoModelForCausalLM.from_config(model_config)

    # wrap the model with smp.DistributedModel() to apply SageMaker model parallelism
    model = smp.DistributedModel(
        model, trace_device="gpu", backward_passes_per_step=args.gradient_accumulation
    )

    # wrap the optimizer with smp.DistributedOptimizer() to apply SageMaker model parallelism
    optimizer = ...  # define an optimizer
    optimizer = smp.DistributedOptimizer(
        optimizer,
        static_loss_scale=None,
        dynamic_loss_scale=True,
        dynamic_loss_args={"scale_window": 1000, "min_scale": 1, "delayed_shift": 2},
    )

    # for fine-tuning, use smp.resume_from_checkpoint() to load a pre-trained model
    if args.fine_tune > 0 and args.delayed_param > 0:
        smp.resume_from_checkpoint(args.model_dir, tag="fullmodel.pt", partial=False)

For a complete example of training scripts and Jupyter notebooks, see the GPT-2 examples for PyTorch
in the SageMaker Examples GitHub repository.

SageMaker Distributed Model Parallelism Best Practices


Use the following guidelines when you run a distributed training job with the SageMaker model parallel
library.

Setting Up the Right Configuration for a Given Model


When scaling up a model, we recommend that you go over the following list in order. Each list item
discusses the advantage of using the library's techniques along with the tradeoffs that might arise.
Tip
If a model can fit well using a subset of the library's features, adding more model parallelism or
memory saving features does not usually improve performance.

Using large GPU instance types

• In the realm of model parallelism, it is best to use powerful instances with large GPU memories to
handle overhead from model parallelism operations such as partitioning models across multiple GPUs.
We recommend using ml.p4d or ml.p3dn instances for training large DL models. These instances are
also equipped with Elastic Fabric Adapter (EFA), which provides higher network bandwidth and enables
large-scale training with model parallelism.


Sharding optimizer state

• The impact of sharding optimizer state depends on the number of data parallel ranks. Typically,
a higher degree of data parallelism (proportional to the size of compute node) can improve the
efficiency of memory usage.

When you want to downsize a cluster, make sure you check the optimizer state sharding configuration.
For example, a large DL model with optimizer state sharding that fits on a node with 16 GPUs won't fit
on a node with 8 GPUs because there are simply not enough GPUs across which to shard the optimizer
state.

For more information, see Optimizer State Sharding (p. 1901).

Activation checkpointing

• Memory efficiency can be improved by using activation checkpointing for a group of modules. The
more you group the modules, the more efficient the memory usage. When checkpointing sequential
modules for layers, the strategy argument of the smp.set_activation_checkpointing
function groups the layers together for checkpointing. For example, grouping two or more layers
together for checkpointing is more memory efficient than checkpointing one layer at a time, and this
trades extra computation time for reduced memory usage.

For more information, see Activation Checkpointing (p. 1902).

Tensor parallelism
• The degree of tensor parallelism should be a power of two (2, 4, 8, ..., 2^n), where the maximum
degree must be equal to the number of GPUs per node. For example, if you use a node with 8 GPUs,
possible numbers for the degree of tensor parallelism are 2, 4, and 8. We don’t recommend arbitrary
numbers (such as 3, 5, 6, and 7) for the degree of tensor parallelism. When you use multiple nodes,
misconfiguring the degree of tensor parallelism might result in running tensor parallelism across the
nodes; this adds significant overhead from communication of activations across the nodes and can
become computationally expensive.

For more information, see Tensor Parallelism (p. 1890).

Pipeline parallelism across nodes

• You can run pipeline parallelism both within a single node and across multiple nodes. When you
use pipeline parallelism in combination with tensor parallelism, we recommend running pipeline
parallelism across multiple nodes and keeping tensor parallelism within individual nodes.
• Pipeline parallelism comes with the following three knobs: microbatches, active_microbatches,
and prescaled_batch.
• When you use tensor parallelism with pipeline parallelism, we recommend activating
prescaled_batch so that the batch size per model parallel group can be increased for efficient
pipelining. With prescaled_batch activated, the batch size set in the training script becomes
tp_size times the batch size set for each rank without prescaled_batch.
• Increasing the number of microbatches helps achieve efficient pipelining and better performance.
Note that the effective microbatch size is the batch size divided by number of microbatches. If you
increase the number of microbatches while keeping batch size constant, each microbatch processes
fewer samples.
• The number of active_microbatches is the maximum number of microbatches that are
simultaneously in process during pipelining. For each active microbatch in process, its activations
and gradients take up GPU memory. Therefore, increasing active_microbatches takes up more
GPU memory.


• If both GPU and GPU memory are underutilized, increase active_microbatches for better
parallelization during pipelining.
• For more information about how to use tensor parallelism with pipeline parallelism, see Tensor
parallelism combined with pipeline parallelism (p. 1895).
• To find descriptions of the aforementioned parameters, see Parameters for smdistributed in the
SageMaker Python SDK documentation.
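
As a sketch of where these knobs are set (the values shown are illustrative), they are passed
through the modelparallel parameters of the estimator's distribution argument:

smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 2,
        "tensor_parallel_degree": 8,   # keep tensor parallelism within a node
        "microbatches": 8,
        "active_microbatches": 4,
        "prescaled_batch": True,
        "ddp": True,
    },
}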

Offloading activations to CPU

• Make sure that this is used in combination with activation checkpointing and pipeline parallelism. To
ensure that the offloading and preloading happen in the background, specify a value greater than 1 to
the microbatches parameter.
• When offloading activations, you might be able to increase active_microbatches and sometimes
match with the total number of microbatches. This depends on which modules are checkpointed and
how the model is partitioned.

For more information, see Activation Offloading (p. 1903).

Reference configurations

The SageMaker model parallelism training team provides the following reference points based on
experiments with the GPT-2 model, a sequence length of 512, and a vocabulary size of 50,000.

• 10-billion-parameter model: 16 ml.p4d.24xlarge instances, pipeline parallelism degree 1, tensor
parallelism degree 4, optimizer state sharding enabled, activation checkpointing on each transformer
layer, prescaled batch enabled, batch_size=40
• 30-billion-parameter model: 16 ml.p4d.24xlarge instances, pipeline parallelism degree 1, tensor
parallelism degree 8, optimizer state sharding enabled, activation checkpointing on each transformer
layer, prescaled batch enabled, batch_size=32
• 60-billion-parameter model: 32 ml.p4d.24xlarge instances, pipeline parallelism degree 2, tensor
parallelism degree 8, optimizer state sharding enabled, activation checkpointing on each transformer
layer, prescaled batch enabled, batch_size=56, microbatches=4, active_microbatches

You can extrapolate from the preceding configurations to estimate GPU memory usage for your model
configuration. For example, if you increase the sequence length for a 10-billion-parameter model or
increase the size of the model to 20 billion, you might want to lower batch size first. If the model still
doesn’t fit, try increasing the degree of tensor parallelism.

Modifying Your Training Script


• Before you use the SageMaker model parallel library’s features in your training script, review The
SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls (p. 1935).
• To launch a training job faster, use the SageMaker local mode. This helps you quickly run a training job
locally on a SageMaker notebook instance. Depending on the scale of the ML instance on which your
SageMaker notebook instance is running, you might need to adjust the size of your model by changing
the model configurations, such as the hidden width, number of transformer layers, and attention
heads. Validate that the reduced model runs well on the notebook instance before using a large cluster
for training the full model.

Monitoring and Logging a Training Job Using the SageMaker Console and
Amazon CloudWatch
To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU
utilization, use visualization provided through the SageMaker console.

1. In the left navigation pane, choose Training.


2. Choose Training jobs.
3. In the main pane, choose the training job name for which you want to see more details.
4. Browse the main pane and find the Monitor section to see the automated visualization.
5. To see training job logs, choose View logs in the Monitor section. You can access the logs of the
distributed training job in CloudWatch. If you launched multi-node distributed training,
you should see multiple log streams with tags in the format of algo-n-1234567890. The algo-1 log
stream tracks training logs from the main (0th) node.

For more information, see Monitor and Analyze Training Jobs Using Amazon CloudWatch
Metrics (p. 2127).

Permissions
To run a SageMaker training job with model parallelism or the SageMaker distributed training example
notebooks, make sure you have the right permissions in your IAM role, such as the following:

• To use FSx for Lustre, add AmazonFSxFullAccess.


• To use Amazon S3 as a data channel, add AmazonS3FullAccess.
• To use Docker, build your own container, and push it to Amazon ECR, add
AmazonEC2ContainerRegistryFullAccess.
• To have full access to the entire suite of SageMaker features, add
AmazonSageMakerFullAccess.

The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls

Review the following tips and pitfalls before using Amazon SageMaker's model parallelism library.
This list includes tips that are applicable across frameworks. For TensorFlow and PyTorch specific tips,
see Modify a TensorFlow training script (p. 1909) and Modify a PyTorch Training Script (p. 1915),
respectively.

Batch Size and Number of Microbatches


• The library is most efficient when the batch size is increased. For use cases where the model fits within
a single device, but can only be trained with a small batch size, batch size can and should be increased
after the library is integrated. Model parallelism saves memory for large models, enabling you to train
using batch sizes that previously did not fit in memory.
• Choosing a number of microbatches that is too small or too large can lower performance. The library
executes each microbatch sequentially in each device, so microbatch size (batch size divided by
number of microbatches) must be large enough to fully utilize each GPU. At the same time, pipeline
efficiency increases with the number of microbatches, so striking the right balance is important.
Typically, a good starting point is to try 2 or 4 microbatches, increasing the batch size to the memory
limit, and then experiment with larger batch sizes and numbers of microbatches. As the number of
microbatches is increased, larger batch sizes might become feasible if an interleaved pipeline is used.
• Your batch size must always be divisible by the number of microbatches. Note that depending on the
size of the dataset, sometimes the last batch of every epoch can be of a smaller size than the rest, and
this smaller batch needs to be divisible by the number of microbatches as well. If it is not, you can set
drop_remainder=True in the tf.Dataset.batch() call (in TensorFlow), or set drop_last=True
in DataLoader (in PyTorch), so that this last, small batch is not used. If you are using a different API
for the data pipeline, you might need to manually skip the last batch whenever it is not divisible by the
number of microbatches.
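
For example, in PyTorch, a minimal sketch of dropping the last incomplete batch (train_dataset
and the batch size are illustrative assumptions):

from torch.utils.data import DataLoader

# With batch_size=32 and 4 microbatches, each microbatch has 8 samples;
# drop_last=True ensures every batch stays divisible by the microbatch count.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    drop_last=True,
)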

Manual Partitioning
• If you use manual partitioning, be mindful of the parameters that are consumed by multiple
operations and modules in your model, such as the embedding table in transformer architectures.
Modules that share the same parameter must be placed on the same device for correctness. When
auto-partitioning is used, the library automatically enforces this constraint.

Data Preparation
• If the model takes multiple inputs, make sure you seed the random operations in your data pipeline
(e.g., shuffling) with smp.dp_rank(). If the dataset is being deterministically sharded across data
parallel devices, make sure that the shard is indexed by smp.dp_rank(). This is to make sure that the
order of the data seen on all ranks that form a model partition is consistent.
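
For instance, a minimal PyTorch sketch of sharding indexed by smp.dp_rank() (train_dataset and
the batch size are illustrative assumptions):

import smdistributed.modelparallel.torch as smp
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Shard the dataset by data parallel rank so that all ranks forming a model
# partition see the data in a consistent order.
sampler = DistributedSampler(
    train_dataset,                 # assumed to be defined elsewhere
    num_replicas=smp.dp_size(),
    rank=smp.dp_rank(),
    seed=42,                       # fixed seed keeps shuffling deterministic
)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler, drop_last=True)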

Returning Tensors from smp.DistributedModel


• Any tensor that is returned from the smp.DistributedModel.call (for TensorFlow) or
smp.DistributedModel.forward (for PyTorch) function is broadcast to all other ranks, from the
rank that computed that particular tensor. As a result, any tensor that is not needed outside the call
and forward methods (intermediate activations, for example) should not be returned, as this causes
needless communication and memory overhead and hurts performance.

The @smp.step Decorator


• If an smp.step-decorated function has a tensor argument that does not have a batch dimension,
the argument name must be provided in the non_split_inputs list when calling smp.step. This
prevents the library from attempting to split the tensor into microbatches. For more information see
smp.step in the API documentation.
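
A minimal sketch (the train_step signature and the compute_loss helper are hypothetical):

import smdistributed.modelparallel.torch as smp

# `scale` is a tensor without a batch dimension, so it must be listed in
# non_split_inputs to prevent the library from splitting it into microbatches.
@smp.step(non_split_inputs=["scale"])
def train_step(model, data, target, scale):
    output = model(data)
    loss = compute_loss(output, target) * scale  # compute_loss assumed defined elsewhere
    model.backward(loss)
    return loss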

Delaying Parameter Initialization


For very large models over 100 billion parameters, weight initialization through the CPU
memory might result in an out-of-memory error. To get around this, the library offers
smp.delay_param_initialization context manager. This delays the physical allocation of
parameters until they move to GPU during the first execution of a smp.step-decorated function. This
avoids unnecessary memory usage of the CPU during the initialization of training. Use the context
manager when you create a model object as shown in the following code.

with smp.delay_param_initialization(enabled=True):
    model = MyModel()


Tensor Parallelism for PyTorch


• If you are using a seed for deterministic results, set the seed based on smp.dp_rank() (for example,
torch.manual_seed(42 + smp.dp_rank())). If you do not do this, different partitions of an
nn.Parameter are initialized in the same way, impacting convergence.
• SageMaker’s model parallelism library uses NCCL to implement collectives needed for the distribution
of the modules. Especially for smaller models, if too many NCCL calls are scheduled on the GPU
at the same time, memory usage might increase because of additional space used by NCCL. To
counteract this, smp throttles the NCCL calls so that the number of ongoing NCCL operations at
any given time is less than or equal to a given limit. The default limit is 8, but this can be adjusted
using the environment variable SMP_NCCL_THROTTLE_LIMIT. If you observe more memory usage
than you expect while using tensor parallelism, you can try reducing this limit. However, choosing
a limit that is too small might cause throughput loss. To disable throttling altogether, you can set
SMP_NCCL_THROTTLE_LIMIT=-1.
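
For example, a sketch of passing this environment variable through a training job (this assumes
the SageMaker Python SDK estimator's environment parameter; other estimator arguments are omitted):

estimator = PyTorch(
    ...,
    # Lower the NCCL throttle limit if tensor parallelism uses more memory than expected.
    environment={"SMP_NCCL_THROTTLE_LIMIT": "4"},
)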
• The following identity, which holds when the degree of tensor parallelism is 1, does not hold
when the degree of tensor parallelism is greater than 1: smp.mp_size() * smp.dp_size() ==
smp.size(). This is because the tensor parallel group is part of both the model parallelism group and
the data parallelism group. If your code has existing references to mp_rank, mp_size, MP_GROUP, and
so on, and if you want to work with only the pipeline parallel group, you might need to replace the
references with smp.pp_size(). The following identities are always true:
• smp.mp_size() * smp.rdp_size() == smp.size()
• smp.pp_size() * smp.dp_size() == smp.size()
• smp.pp_size() * smp.tp_size() * smp.rdp_size() == smp.size()
• Since the smp.DistributedModel wrapper modifies the model parameters when tensor parallelism
is enabled, the optimizer should be created after calling smp.DistributedModel, with the
distributed parameters. For example, the following does not work:

## WRONG
model = MyModel()
optimizer = SomeOptimizer(model.parameters())
model = smp.DistributedModel(model) # optimizer now has outdated parameters!

Instead, the optimizer should be created with the parameters of the smp.DistributedModel as
follows:

## CORRECT
model = smp.DistributedModel(MyModel())
optimizer = SomeOptimizer(model.parameters())

• When a module is replaced with its distributed counterpart through tensor parallelism, the distributed
module does not inherit its weights from the original module, and initializes new weights. This means
that, for instance, if the weights need to be initialized in a particular call (for example, through a
load_state_dict call), this needs to happen after the smp.DistributedModel call, once the
module distribution takes place.
• When accessing the parameters of distributed modules directly, note that the weight does not have
the same shape as the original module. For instance,

with smp.tensor_parallelism():
    linear = nn.Linear(60, 60)

# will pass
assert tuple(linear.weight.shape) == (60, 60)

distributed_linear = smp.DistributedModel(linear)

# will fail: the number of input channels will have been divided by smp.tp_size()
assert tuple(distributed_linear.module.weight.shape) == (60, 60)

• Using torch.utils.data.distributed.DistributedSampler is strongly recommended for
tensor parallelism. This ensures that every data parallel rank receives the same number of data
samples, which prevents hangs that might result from different dp_ranks taking a different number
of steps.
• If you use the join API of PyTorch's DistributedDataParallel class to handle cases in which
different data parallel ranks have different numbers of batches, you still need to make sure that ranks
that are in the same TP_GROUP have the same number of batches; otherwise the communication
collectives used in distributed execution of modules may hang. Ranks that are in different TP_GROUPs
can have different numbers of batches, as long as join API is used.
• If you want to checkpoint your model and use tensor parallelism, consider the following:
• To avoid stalling and race conditions while saving and loading models when you use tensor
parallelism, make sure you call the appropriate saving and loading functions for the model and
optimizer states from a reduced-data parallelism rank.
• If you are transitioning an existing pipeline parallel script and enabling tensor parallel for the script,
ensure that you modify any if smp.dp_rank() == 0 block used for saving and loading with if
smp.rdp_rank() == 0 blocks. Otherwise, it might cause your training job to stall.

For more information about checkpointing a model with tensor parallelism, see the section called
“Checkpointing a distributed model” (p. 1927).

Model Parallel Troubleshooting


If you run into an error, you can use the following list to try to troubleshoot your training job. If the
problem persists, contact AWS Support.

Topics
• Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism
Library (p. 1938)
• Saving Checkpoints (p. 1939)
• Convergence Using Model Parallel and TensorFlow (p. 1940)
• Stalling or Crashing Distributed Training Jobs (p. 1940)
• Receiving NCCL Error for a PyTorch Training Job (p. 1941)
• Receiving RecursionError for a PyTorch Training Job (p. 1942)

Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism Library
SageMaker Debugger is not available for the SageMaker model parallelism library. Debugger is enabled
by default for all SageMaker TensorFlow and PyTorch training jobs, and you might see an error that looks
like the following:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading'

To fix this issue, disable Debugger by passing debugger_hook_config=False when creating a
framework estimator as shown in the following example.

bucket=sagemaker.Session().default_bucket()
base_job_name="sagemaker-checkpoint-test"
checkpoint_in_bucket="checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

estimator = TensorFlow(
    ...
    distribution={"smdistributed": {"modelparallel": { "enabled": True }}},
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path="/opt/ml/checkpoints",
    debugger_hook_config=False
)

Saving Checkpoints
You might run into the following error when saving checkpoints of a large model on SageMaker:

InternalServerError: We encountered an internal error. Please try again

This could be caused by a SageMaker limitation while uploading the local checkpoint to Amazon S3
during training. To disable checkpointing in SageMaker, use the following example to explicitly upload
the checkpoints.

If you run into the preceding error, do not use checkpoint_s3_uri with the SageMaker estimator
call. While saving checkpoints for larger models, we recommend saving checkpoints to a custom
directory and passing that directory to the helper function as the local_path argument.

import os

def aws_s3_sync(source, destination):
    """aws s3 sync in quiet mode and time profile"""
    import time, subprocess
    cmd = ["aws", "s3", "sync", "--quiet", source, destination]
    print(f"Syncing files from {source} to {destination}")
    start_time = time.time()
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    end_time = time.time()
    print("Time Taken to Sync: ", (end_time - start_time))
    return

def sync_local_checkpoints_to_s3(local_path="/opt/ml/checkpoints",
        s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) + '/checkpoints'):
    """ sample function to sync checkpoints from local path to s3 """

    import boto3
    # check if the local path exists
    if not os.path.exists(local_path):
        raise RuntimeError(f"Provided local path {local_path} does not exist. Please check")

    # check if the s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")

    s3_bucket = s3_uri.replace('s3://', '').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e

    aws_s3_sync(local_path, s3_uri)
    return

def sync_s3_checkpoints_to_local(local_path="/opt/ml/checkpoints",
        s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) + '/checkpoints'):
    """ sample function to sync checkpoints from s3 to local path """

    import boto3
    # try to create the local path if it does not exist
    if not os.path.exists(local_path):
        print(f"Provided local path {local_path} does not exist. Creating...")
        try:
            os.makedirs(local_path)
        except Exception as e:
            raise RuntimeError(f"Failed to create {local_path}")

    # check if the s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")

    s3_bucket = s3_uri.replace('s3://', '').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e

    aws_s3_sync(s3_uri, local_path)
    return

Usage of helper functions:

# base_s3_uri - user input s3 uri or save to model directory (default)
# curr_host - to save checkpoints of current host
# iteration - current step/epoch during which checkpoint is saved

# save checkpoints on every node using local_rank
if smp.local_rank() == 0:
    base_s3_uri = os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))
    curr_host = os.environ['SM_CURRENT_HOST']
    full_s3_uri = f'{base_s3_uri}/checkpoints/{curr_host}/{iteration}'
    sync_local_checkpoints_to_s3(local_path=checkpoint_dir, s3_uri=full_s3_uri)

Convergence Using Model Parallel and TensorFlow


When you use SageMaker multi-node training with TensorFlow and the model parallelism library, the
loss may not converge as expected because the order of training input files may be different on each
node. This may cause different ranks in the same model parallel group to work on different input files,
causing inconsistencies. To prevent this, ensure the input files are ordered the same way in all the ranks
before they get converted to TensorFlow datasets. One way to achieve this is to sort the input file names
in the training script.
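
A minimal sketch of this in a TensorFlow training script (the S3 file pattern is an illustrative
assumption):

import tensorflow as tf

# Sort shard file names so every rank builds its dataset from files in the same order.
input_files = sorted(tf.io.gfile.glob("s3://my-bucket/train/*.tfrecord"))
dataset = tf.data.TFRecordDataset(input_files)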

Stalling or Crashing Distributed Training Jobs


If your training job stalls, crashes, or stops responding, read the following troubleshooting
items to identify the cause of the issue. If you need further support, reach out to the
SageMaker distributed training team through AWS Support.

• If you see a distributed training job stalling at the NCCL initialization step, consider the following:
• If you are using one of the EFA-enabled instances ( ml.p4d or ml.p3dn instances) with a custom
VPC and its subnet, ensure that the security group used has inbound and outbound connections

1940
Amazon SageMaker Developer Guide
SageMaker's Model Parallelism Library

for all ports to and from the same SG. You also generally need outbound connections to any IP
as a separate rule (for internet access). To find instructions on how to add inbound and outbound
rules for EFA communication, refer to SageMaker Distributed Training Job Stalling During
Initialization (p. 1863).
• If you see a distributed training job stalling when checkpointing the full model, this might
be because the state_dict() call on the model or optimizer was not made on all ranks with
rdp_rank()==0 (when using tensor parallelism) or dp_rank()==0 (when using only pipeline
parallelism). These ranks need to communicate to construct the checkpoint to be saved. Similar
stalling issues can also happen when checkpointing a partial optimizer state if shard_optimizer_state is
enabled.

For more information about checkpointing a model with model parallelism, see General Instruction
for Saving and Loading and Checkpointing a distributed PyTorch model (for the SageMaker model
parallelism library between v1.6.0 and v1.9.0) (p. 1929).
• If the training job crashes with a CUDA Out of Memory error, this means that the distributed training
configuration needs to be adjusted to fit the model on the GPU cluster. For more information and best
practices, see Setting Up the Right Configuration for a Given Model (p. 1932).
• If the training job crashes with an uncorrectable ECC error, this means that one of the GPUs in the
cluster has gone bad. If you need technical support, share the job ARN with the AWS team and restart
your training job from a checkpoint if possible.
• In rare cases, a job configuration that worked previously but is close to the limits of GPU memory
might fail later with a different cluster due to a CUDA Out of Memory error. This could be because
some GPU has lower available memory than usual due to ECC errors.
• Network timeout crash might happen when running a multinode job which doesn’t use
all GPUs in the node. To get around this, use all GPUs on the node by ensuring that the
processes_per_host parameter is set to the number of GPUs in each instance. For example, this
is processes_per_host=8 for ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge
instances.
• If you find that your training job takes a long time during the data downloading stage, make sure
the Amazon S3 path you provided to checkpoint_s3_uri for the SageMaker Estimator class
is unique for the current training job. If this path is reused across multiple training jobs running
simultaneously, all those checkpoints are uploaded and downloaded to the same Amazon S3 path and
might significantly increase checkpoint loading time.
• Use FSx for Lustre when you deal with large data and models.
• If your dataset is large and fetching it takes a long time, we recommend keeping your dataset in FSx
for Lustre.
• When training models beyond 10 billion parameters, we recommend using FSx for Lustre for
checkpointing.
• After you create a file system, make sure to wait for the status to become available before starting a
training job using it.
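
As a sketch of attaching an FSx for Lustre file system to a training job through the SageMaker
Python SDK (the file system ID and directory path are hypothetical):

from sagemaker.inputs import FileSystemInput

# Mount an existing FSx for Lustre file system as the training data channel.
train_fs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",     # hypothetical file system ID
    file_system_type="FSxLustre",
    directory_path="/fsx/train",               # hypothetical mount directory
    file_system_access_mode="ro",
)
estimator.fit({"train": train_fs})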

Receiving NCCL Error for a PyTorch Training Job


If you encounter the following error, it might be due to a process running out of GPU memory.

NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

You can resolve this by reducing the batch size or active_microbatches. If auto partitioning is not
resulting in a well-balanced partitioning, you might have to consider manual partitioning. For more
information, see Pipeline parallelism across nodes (p. 1933).


Receiving RecursionError for a PyTorch Training Job


The library does not support calling super().forward() inside a module's forward call. If you use
super().forward(), you might receive the following error message.

RecursionError: maximum recursion depth exceeded

To fix the error, instead of calling super().forward(), call super()._orig_forward().

Amazon SageMaker Distributed Training Notebook Examples
The following case studies and notebooks provide examples of implementing the SageMaker distributed
training libraries for the supported deep learning frameworks (PyTorch, TensorFlow, and HuggingFace)
and models, such as CNN and MaskRCNN for vision, and BERT for natural language processing.

These notebooks are provided in the SageMaker examples GitHub repository. You can also browse them
on the SageMaker examples website.

The examples are set up to use ml.p3.16xlarge instances for the worker nodes, but you may choose
ml.p3dn.24xlarge or ml.p4d.24xlarge instance types for which the SageMaker distributed training
libraries are optimized. You can test the notebooks using a cluster of a single node; however, to see a
performance improvement as shown in the Training Benchmarks section, use a cluster of multiple nodes
(two or more). The examples call out the section in which you modify this configuration.

Blogs and Case Studies


The following blogs discuss case studies about using the SageMaker distributed training libraries.

The SageMaker data parallelism library

• How I trained 10TB for Stable Diffusion on SageMaker, Medium (November 29, 2022)
• Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon
Search, AWS Machine Learning Blog (August 18, 2022)
• Training YOLOv5 on AWS with PyTorch and the SageMaker distributed data parallel library, Medium
(May 6, 2022)
• Speed up EfficientNet model training on SageMaker with PyTorch and the SageMaker distributed data
parallel library, Medium (March 21, 2022)
• Speed up EfficientNet training on AWS with the SageMaker distributed data parallel library, Towards
Data Science (January 12, 2022)
• Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker,
AWS Machine Learning Blog (June 25, 2021)
• Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker,
the Hugging Face website (April 8, 2021)

The SageMaker model parallelism library

• New performance improvements in the Amazon SageMaker model parallelism library, AWS Machine
Learning Blog (December 16, 2022)
• Train gigantic models with near-linear scaling using sharded data parallelism on Amazon SageMaker,
AWS Machine Learning Blog (October 31, 2022)


PyTorch Examples
The SageMaker data parallelism library

• CNN with PyTorch 1.6 and the SageMaker data parallelism library
• MaskRCNN with PyTorch 1.6 and the SageMaker data parallelism library
• BERT with PyTorch 1.6 and the SageMaker data parallelism library

The SageMaker model parallelism library

• Train GPT-2 with PyTorch 1.8.1 and Tensor Parallelism Using the SageMaker model parallelism library
• BERT with PyTorch 1.6 and the SageMaker model parallelism library

TensorFlow Examples
The SageMaker data parallelism library

• CNN with TensorFlow 2.3.1 and the SageMaker data parallelism library
• MaskRCNN with TensorFlow 2.3.1 and the SageMaker data parallelism library
• BERT with TensorFlow 2.3.1 and the SageMaker data parallelism library

The SageMaker model parallelism library

• CNN with TensorFlow 2.3.1 and the SageMaker model parallelism library

HuggingFace Examples
The following HuggingFace on SageMaker examples are available in the HuggingFace notebooks
repository.

The SageMaker data parallelism library

• HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker - Distributed Question


Answering
• HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker - Distributed Text
Summarization
• HuggingFace Distributed Data Parallel Training in TensorFlow on SageMaker

The SageMaker model parallelism library

• HuggingFace with TensorFlow Distributed model parallelism library Training on SageMaker

How to Access or Download the SageMaker Distributed Training Notebook Examples

Follow the instructions to access or download the SageMaker distributed training example notebooks.

Option 1: Use a SageMaker notebook instance


To use the aforementioned examples, we recommend that you use an Amazon SageMaker notebook
instance. A notebook instance runs Jupyter Notebook and JupyterServer apps on Amazon EC2 instances,
which are optimized for machine learning. If you do not have an active notebook instance, follow the
instructions in Create a Notebook Instance (p. 209) in the SageMaker developer guide to create one.

After you have created an instance, in the Notebook instances page of the SageMaker console, do the
following:

1. Open JupyterLab.
2. Select the examples icon in the left tray.
3. Browse the examples for Training and look for notebooks titled Distributed Data Parallel or
Distributed Model Parallel.

Option 2: Clone the SageMaker example repository to SageMaker Studio or a notebook instance
To download and use the aforementioned example notebooks, do the following to clone the example
GitHub repositories:

1. Open a terminal.
2. In the command line, navigate to the SageMaker folder.

cd SageMaker

3. Clone the SageMaker examples GitHub repository.

git clone https://github.com/aws/amazon-sagemaker-examples.git

Note
To download the HuggingFace example notebooks (p. 1943), clone the HuggingFace
notebooks GitHub repository:

git clone https://github.com/huggingface/notebooks huggingface-notebooks

4. In the JupyterLab interface, navigate into the amazon-sagemaker-examples folder.


5. In the training/distributed_training folder, there are folders for frameworks, and in each
of these, there are folders for data_parallel and model_parallel. Choose the example of your
choice and follow the instructions to launch distributed training with an SageMaker distributed
training library.

Distributed computing with SageMaker best practices


This best practices page presents various flavors of distributed computing for machine learning (ML) jobs
in general. The term distributed computing in this page encompasses distributed training for machine
learning tasks and parallel computing for data processing, data generation, feature engineering, and
reinforcement learning. In this page, we discuss common challenges in distributed computing
and the available options in SageMaker Training and SageMaker Processing. For additional reading
material about distributed computing, see What Is Distributed Computing?

You can configure ML tasks to run in a distributed manner across multiple nodes (instances), accelerators
(NVIDIA GPUs, AWS Trainium chips), and vCPU cores. By running distributed computation, you can
achieve a variety of goals such as computing operations faster, handling large datasets, or training large
ML models.


The following list covers common challenges that you might face when you run an ML training job at
scale.

• You need to make decisions on how to distribute computation depending on ML tasks, software
libraries you want to use, and compute resources.
• Not all ML tasks are straightforward to distribute. Also, not all ML libraries support distributed
computation.
• Distributed computation might not always result in a linear increase in compute efficiency. In
particular, you need to identify if data I/O and inter-GPU communication have bottlenecks or cause
overhead.
• Distributed computation might disturb numerical processes and change model accuracy. Specifically,
for data-parallel neural network training, when you change the global batch size while scaling up to a
larger compute cluster, you also need to adjust the learning rate accordingly.

SageMaker provides distributed training solutions to ease such challenges for various use cases. Choose
one of the following options that best fits your use case.

Topics
• Option 1: Use a SageMaker built-in algorithm that supports distributed training (p. 1945)
• Option 2: Run a custom ML code in the SageMaker managed training or processing
environment (p. 1945)
• Option 3: Write your own custom distributed training code (p. 1947)
• Option 4: Launch multiple jobs in parallel or sequentially (p. 1947)

Option 1: Use a SageMaker built-in algorithm that supports distributed training
SageMaker provides built-in algorithms that you can use out of the box through the SageMaker console
or the SageMaker Python SDK. Using the built-in algorithms, you don't need to spend time on code
customization, understanding the science behind the models, or running Docker on provisioned Amazon
EC2 instances.

A subset of the SageMaker built-in algorithms support distributed training. To check if the algorithm
of your choice supports distributed training, see the Parallelizable column in the Common Information
About Built-in Algorithms table. Some of the algorithms support multi-instance distributed training,
while the rest of the parallelizable algorithms support parallelization across multiple GPUs in a single
instance, as indicated in the Parallelizable column.

Option 2: Run custom ML code in the SageMaker managed training or processing environment
SageMaker jobs can instantiate a distributed training environment for specific use cases and frameworks.
This environment acts as a ready-to-use whiteboard, where you can bring and run your own ML code.

If your ML code uses a deep learning framework


You can launch distributed training jobs using the Deep Learning Containers (DLC) for SageMaker
Training, which you can orchestrate either through the dedicated Python modules in the SageMaker
Python SDK, or through the SageMaker APIs with AWS CLI, AWS SDK for Python (Boto3). SageMaker
provides training containers for machine learning frameworks, including PyTorch, TensorFlow, Hugging
Face Transformers, and Apache MXNet. You have two options to write deep learning code for distributed
training.


• The SageMaker distributed training libraries

The SageMaker distributed training libraries provide AWS-managed code for neural network data
parallelism and model parallelism. SageMaker distributed training also comes with launcher clients
built into the SageMaker Python SDK, and you don’t need to author parallel launch code. To learn
more, see SageMaker's data parallelism library and SageMaker's model parallelism library.
• Open-source distributed training libraries

Open-source frameworks have their own distribution mechanisms, such as DistributedDataParallel
(DDP) in PyTorch or the tf.distribute modules in TensorFlow. You can choose to run these distributed
training frameworks in the SageMaker-managed framework containers. For example, the sample code
for training MaskRCNN in SageMaker shows how to use both PyTorch DDP in the SageMaker PyTorch
framework container and Horovod in the SageMaker TensorFlow framework container.

SageMaker ML containers also come with MPI preinstalled, so you can parallelize your entry point script
using mpi4py. The MPI-integrated training containers are a great option when you launch a third-party
distributed training launcher or write ad hoc parallel code in the SageMaker managed training
environment.
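As a minimal sketch (the entry point script name, role, and instance settings are illustrative placeholders),
you can turn on the MPI launcher through the distribution argument of a framework estimator:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_mpi.py",      # hypothetical mpi4py-based script
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    framework_version="1.13.1",
    py_version="py39",
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    # Launch 8 MPI processes per host, one per GPU on ml.p3.16xlarge.
    distribution={"mpi": {"enabled": True, "processes_per_host": 8}},
)
estimator.fit("s3://amzn-s3-demo-bucket/train")  # placeholder S3 path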

Notes for data-parallel neural network training on GPUs

• Scale to multi-GPU and multi-machine parallelism when appropriate

Neural network training jobs often run on multi-CPU or multi-GPU instances. Each GPU-based
instance usually contains multiple GPU devices. Consequently, distributed GPU computing can
happen either within a single GPU instance with multiple GPUs (single-node multi-GPU training)
or across multiple GPU instances with multiple GPU cores in each (multi-node multi-GPU training).
Single-instance training is easier to code and debug, and the intra-node GPU-to-GPU throughput
is usually faster than the inter-node GPU-to-GPU throughput. Therefore, it is a good idea to scale
data parallelism vertically first (use one GPU instance with multiple GPUs) and expand to multiple
GPU instances if needed. This might not apply to cases where the CPU budget is high (for example, a
massive workload for data pre-processing) or when the CPU-to-GPU ratio of a multi-GPU instance is
too low. In all cases, you need to experiment with different combinations of instance types based on
your own ML training needs and workload.
• Monitor the quality of convergence

When training a neural network with data parallelism, increasing the number of GPUs while keeping
the mini-batch size per GPU constant increases the size of the global mini-batch for the mini-batch
stochastic gradient descent (MSGD) process. The size of the mini-batches for MSGD is known to
impact the descent noise and convergence. To scale properly while preserving accuracy, you need to
adjust other hyperparameters, such as the learning rate [Goyal et al. (2017)].
• Monitor I/O bottlenecks

As you increase the number of GPUs, the throughput for reading and writing storage should also
increase. Make sure that your data source and pipeline don’t become bottlenecks.
• Modify your training script as needed

Training scripts written for single-GPU training must be modified for multi-node multi-GPU training. In
most data parallelism libraries, script modification is required to do the following.
• Assign batches of training data to each GPU.
• Use an optimizer that can deal with gradient computation and parameter updates across multiple
GPUs.
• Assign responsibility of checkpointing to a specific host and GPU.
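The following is a minimal, hedged sketch of those three modifications using PyTorch's open-source
DistributedDataParallel; the function and variable names are illustrative, not a SageMaker-specific API,
and it assumes the launcher (for example, torchrun) sets the usual rank environment variables.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(model, dataset, epochs):
    dist.init_process_group("nccl")                  # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # 1. Assign batches of training data to each GPU.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # 2. DDP averages gradients across GPUs during backward().
    model = DDP(model.cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for data, target in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(data.cuda()), target.cuda())
            loss.backward()
            optimizer.step()

        # 3. Checkpoint from a single, designated process only.
        if rank == 0:
            torch.save(model.module.state_dict(), "/opt/ml/checkpoints/model.pt")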


If your ML code involves tabular data processing
PySpark is the Python frontend of Apache Spark, an open-source distributed computing framework.
PySpark has been widely adopted for distributed tabular data processing in large-scale production
workloads. If you want to run tabular data processing code, consider using the SageMaker Processing
PySpark containers and running parallel jobs. You can also run data processing jobs in parallel using the
SageMaker Training and SageMaker Processing APIs in Amazon SageMaker Studio, which is integrated
with Amazon EMR and AWS Glue.
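As an illustrative sketch, the SageMaker Python SDK exposes a PySparkProcessor for this purpose; the
application script, role, and S3 paths below are placeholders.

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-tabular-prep",
    framework_version="3.1",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    instance_count=2,                # two-node Spark cluster
    instance_type="ml.m5.4xlarge",
)
spark_processor.run(
    submit_app="preprocess.py",      # hypothetical PySpark application
    arguments=["--input", "s3://amzn-s3-demo-bucket/raw",
               "--output", "s3://amzn-s3-demo-bucket/processed"],
)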

Option 3: Write your own custom distributed training code
When you submit a training or processing job to SageMaker, the SageMaker Training and SageMaker
Processing APIs launch Amazon EC2 compute instances. You can customize the training and processing
environment on the instances by running your own Docker container or by installing additional libraries
in the AWS managed containers. For more information about Docker with SageMaker Training, see
Adapting your own Docker container to work with SageMaker and Create a container with your own
algorithms and models. For more information about Docker with SageMaker Processing, see Use Your
Own Processing Code.

Every SageMaker training job environment contains a configuration file at /opt/ml/input/config/
resourceconfig.json, and every SageMaker processing job environment contains a similar
configuration file at /opt/ml/config/resourceconfig.json. Your code can read this file to find
hostnames and establish inter-node communications. To learn more, including the schema of the JSON
file, see Distributed Training Configuration and How Amazon SageMaker Processing Configures Your
Processing Container. You can also install and use third-party distributed computing libraries such as Ray
or DeepSpeed in SageMaker.
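For example, inside a training container your script might discover its peers as follows. This is a minimal
sketch; the leader-election convention at the end is illustrative, not a SageMaker requirement.

import json

with open("/opt/ml/input/config/resourceconfig.json") as f:
    resource_config = json.load(f)

current_host = resource_config["current_host"]   # for example, "algo-1"
hosts = resource_config["hosts"]                 # for example, ["algo-1", "algo-2"]
is_leader = current_host == sorted(hosts)[0]     # pick one node deterministically
print(f"{current_host} of {len(hosts)} hosts; leader: {is_leader}")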

You can also use SageMaker Training and SageMaker Processing to run custom distributed computations
that do not require inter-worker communication. In the computing literature, such tasks are often
described as embarrassingly parallel or share-nothing. Examples include parallel processing of data
files, training models in parallel on different configurations, or running batch inference on a collection
of records. You can trivially parallelize such share-nothing use cases with Amazon SageMaker. When
you launch a SageMaker Training or SageMaker Processing job on a cluster with multiple nodes,
SageMaker by default replicates and launches your training code (in Python or Docker) on all the nodes.
Tasks that require spreading the input data across the nodes can set
S3DataDistributionType=ShardedByS3Key in the data input configuration of the SageMaker
TrainingInput API.
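A minimal sketch of that input configuration follows; the S3 prefix is a placeholder.

from sagemaker.inputs import TrainingInput

# Shard the objects under the prefix across the training nodes instead of
# replicating the full dataset to every node (the default, FullyReplicated).
train_input = TrainingInput(
    s3_data="s3://amzn-s3-demo-bucket/train",
    distribution="ShardedByS3Key",
)
# estimator.fit({"train": train_input})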

Option 4: Launch multiple jobs in parallel or sequentially
You can also distribute an ML compute workflow into smaller parallel or sequential compute tasks, each
represented by its own SageMaker Training or SageMaker Processing job. Splitting a task into multiple
jobs can be beneficial for the following situations or tasks:

• When you have specific data channels and metadata entries (such as hyperparameters, model
configuration, or instance types) for each sub-task.
• When you implement retry steps at a sub-task level.
• When you vary the configuration of the sub-tasks over the course of the workload, such as when
training on increasing batch sizes.
• When you need to run an ML task that takes longer than the maximum training time allowed for a
single training job (28 days maximum).
• When different steps of a compute workflow require different instance types.

For the specific case of hyperparameter search, use SageMaker Automated Model Tuning. SageMaker
Automated Model Tuning is a serverless parameter search orchestrator that launches multiple training
jobs on your behalf, according to a search logic that can be random, Bayesian, or Hyperband, as sketched
below.
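A hedged sketch follows; the estimator, metric regex, and parameter ranges are illustrative placeholders.

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Launch up to 10 training jobs (2 at a time) searching the learning rate.
tuner = HyperparameterTuner(
    estimator=estimator,             # an estimator defined elsewhere in your script
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-5, 1e-2)},
    metric_definitions=[{"Name": "validation:loss",
                         "Regex": "val_loss=([0-9\\.]+)"}],
    strategy="Bayesian",
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://amzn-s3-demo-bucket/train"})  # placeholder path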


Additionally, to orchestrate multiple training jobs, you can also consider workflow orchestration tools,
such as SageMaker Pipelines, AWS Step Functions, and Apache Airflow supported by Amazon Managed
Workflows for Apache Airflow (MWAA) and SageMaker Workflows.

Amazon SageMaker Training Compiler
Use Amazon SageMaker Training Compiler to train deep learning (DL) models faster on scalable GPU
instances managed by SageMaker.

What Is SageMaker Training Compiler?
State-of-the-art deep learning (DL) models consist of complex multi-layered neural networks with
billions of parameters that can take thousands of GPU hours to train. Optimizing such models on
training infrastructure requires extensive knowledge of DL and systems engineering; this is challenging
even for narrow use cases. Although there are open-source implementations of compilers that optimize
the DL training process, they can lack the flexibility to integrate DL frameworks with some hardware such
as GPU instances.

SageMaker Training Compiler is a capability of SageMaker that makes these hard-to-implement
optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate
training by more efficiently using SageMaker machine learning (ML) GPU instances. SageMaker Training
Compiler is available at no additional charge within SageMaker and can help reduce total billable time as
it accelerates training.

SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the
SageMaker Training Compiler–enabled AWS DLCs, you can compile and optimize training jobs on GPU
instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable
SageMaker Training Compiler to accelerate the speed of your training job on SageMaker ML instances for
accelerated computing.

How It Works
SageMaker Training Compiler converts DL models from their high-level language representation
to hardware-optimized instructions. Specifically, SageMaker Training Compiler applies graph-level
optimizations, dataflow-level optimizations, and backend optimizations to produce an optimized model
that efficiently uses hardware resources. As a result, you can train your models faster than when you train
them without compilation.

It is a two-step process to activate SageMaker Training Compiler for your training job:


1. Bring your own DL script and, if needed, adapt it to compile and train with SageMaker Training
Compiler. To learn more, see Bring Your Own Deep Learning Model (p. 1967).
2. Create a SageMaker estimator object with the compiler configuration parameter using the SageMaker
Python SDK.
a. Turn on SageMaker Training Compiler by adding
compiler_config=TrainingCompilerConfig() to the SageMaker estimator class.
b. Adjust hyperparameters (batch_size and learning_rate) to maximize the benefit that
SageMaker Training Compiler provides.

Compilation through SageMaker Training Compiler changes the memory footprint of the model.
Most commonly, this manifests as a reduction in memory utilization and a consequent increase in
the largest batch size that can fit on the GPU. In some cases, the compiler intelligently promotes
caching which leads to a decrease in the largest batch size that can fit on the GPU. Note that if you
want to change the batch size, you must adjust the learning rate appropriately.

For a reference for batch_size tested for popular models, see Tested Models (p. 1952).

When you adjust the batch size, you also have to adjust the learning_rate appropriately. For
best practices for adjusting the learning rate along with the change in batch size, see the section
called “Best Practices and Considerations” (p. 1989).
c. By running the estimator.fit() class method, SageMaker compiles your model and starts the
training job.

For instructions on how to launch a training job, see Enable SageMaker Training Compiler (p. 1975).

SageMaker Training Compiler does not alter the final trained model; it accelerates the training job by
using the GPU memory more efficiently and fitting a larger batch size per iteration. The final trained
model from a compiler-accelerated training job is identical to the one from an ordinary training job.
Tip
SageMaker Training Compiler only compiles DL models for training on supported GPU instances
managed by SageMaker. To compile your model for inference and deploy it to run anywhere in
the cloud and at the edge, use SageMaker Neo compiler.

Topics
• Supported Frameworks, AWS Regions, Instance Types, and Tested Models (p. 1949)
• Bring Your Own Deep Learning Model (p. 1967)
• Enable SageMaker Training Compiler (p. 1975)
• SageMaker Training Compiler Example Notebooks and Blogs (p. 1989)
• SageMaker Training Compiler Best Practices and Considerations (p. 1989)
• SageMaker Training Compiler FAQ (p. 1992)
• SageMaker Training Compiler Troubleshooting (p. 1993)
• Amazon SageMaker Training Compiler Release Notes (p. 1999)

Supported Frameworks, AWS Regions, Instance Types, and Tested Models
Before using SageMaker Training Compiler, check if your framework of choice is supported, the instance
types are available in your AWS account, and your AWS account is in one of the supported AWS Regions.
Note
SageMaker Training Compiler is available in the SageMaker Python SDK v2.70.0 or later.


Supported Frameworks
SageMaker Training Compiler supports the following deep learning frameworks and is available through
AWS Deep Learning Containers.

Topics
• PyTorch (p. 1950)
• TensorFlow (p. 1951)

PyTorch

Each entry lists the framework version, the Deep Learning Container URI, and whether the container is
extendable for Docker customization.

• PyTorch v1.13.1: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-trcomp-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker (extendable: No)
• PyTorch v1.12.0: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-trcomp-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker (extendable: No)

PyTorch with Hugging Face Transformers

• Transformers v4.21.1 with PyTorch v1.11.0: 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-trcomp-training:1.11.0-transformers4.21.1-gpu-py38-cu113-ubuntu20.04 (extendable: No)
• Transformers v4.17.0 with PyTorch v1.10.2: 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-trcomp-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04 (extendable: No)
• Transformers v4.11.0 with PyTorch v1.9.0: 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-training-comp:1.9.0-transformers4.11.0-gpu-py38-cu111-ubuntu20.04 (extendable: No)


TensorFlow

• TensorFlow v2.11.0: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker (extendable: Yes)
• TensorFlow v2.10.0: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker (extendable: Yes)
• TensorFlow v2.9.1: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.1-gpu-py39-cu112-ubuntu20.04-sagemaker (extendable: Yes)

TensorFlow with Hugging Face Transformers

• Transformers v4.17.0 with TensorFlow v2.6.3: 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-tensorflow-trcomp-training:2.6.3-transformers4.17.0-gpu-py38-cu112-ubuntu20.04 (extendable: No)
• Transformers v4.11.0 with TensorFlow v2.5.1: 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-tensorflow-training-comp:2.5.1-transformers4.11.0-gpu-py37-cu112-ubuntu18.04 (extendable: No)

For more information, see Available Images in the AWS Deep Learning Containers GitHub repository.

AWS Regions
The SageMaker Training Compiler containers are available in the AWS Regions where AWS Deep
Learning Containers are in service, except the China Regions.

Supported Instance Types


SageMaker Training Compiler is tested on and supports the following ML instance types.

• P4 instances
• P3 instances
• G4dn instances


• G5 instances

For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance
Types page. For information about instance pricing, see Amazon SageMaker Pricing.

If you encounter an error message similar to the following, follow the instructions at Request a service
quota increase for SageMaker resources.

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.

Tested Models
The following tables list the models that have been tested with SageMaker Training Compiler. For
reference, the largest batch size that is able to fit into memory is also included alongside other training
parameters. SageMaker Training Compiler can change the memory footprint of the model training
process; as a result, a larger batch size can often be used during training, further decreasing total
training time. In some cases, SageMaker Training Compiler intelligently promotes caching, which leads
to a decrease in the largest batch size that can fit on the GPU. You must retune your model
hyperparameters and find an optimal batch size for your case. To save time, use the following
reference tables to look up a batch size that can be a good starting point for your use case.
Note
The batch sizes are local batch sizes that fit into each individual GPU in the respective instance
type. You should also adjust the learning rate when changing the batch size.

PyTorch 1.13.1
Natural language processing (NLP) models

The following models are tested for training jobs for all combinations of single-node and multi-node
with single or multi GPU cores and Automatic Mixed Precision (AMP) as indicated.

Single-node/multi-node single-GPU/multi-GPU

Model | Dataset | Instance type | Precision | Sequence length | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 80 | 192
albert-base-v2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 332
albert-base-v2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 80 | 224
bert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 160 | 288
camembert-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 160 | 280
distilbert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 240 | 472
distilgpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 77 | 128
distilgpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 138 | 390
distilgpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 96 | 256
distilroberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 96 | 192
distilroberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 171 | 380
distilroberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 112 | 256
gpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 52 | 152
gpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 84 | 240
gpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 58 | 164
microsoft/deberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 48 | 128
microsoft/deberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 84 | 207
microsoft/deberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 53 | 133
roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 125 | 224
facebook/bart-base | xsum | g4dn.16xlarge | float16 | 128 | 10 | 16
facebook/bart-base | xsum | g5.4xlarge | float16 | 128 | 16 | 32
facebook/bart-large | xsum | g5.4xlarge | float16 | 128 | 5 | 8
facebook/bart-large | xsum | p3.2xlarge | float16 | 128 | 2 | 4
xlm-roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 16 | 31
xlm-roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 18 | 50
xlnet-base-cased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 240
bert-base-uncased | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 29 | 50
distilbert-base-uncased | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 45 | 64
gpt2 | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 18 | 45
roberta-base | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 23 | 44
gpt2 | wikitext-103-v1 | p4d.24xlarge | float16 | 512 | 36 | 64

Computer Vision (CV) models

Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP) as indicated.

Single/multi-node single/multi-GPU

Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for Training Compiler
ResNet152 | food101 | g4dn.16xlarge | float16 | 128 | 144
ResNet152 | food101 | g5.4xlarge | float16 | 128 | 192
ResNet152 | food101 | p3.2xlarge | float16 | 152 | 156
ViT | food101 | g4dn.16xlarge | float16 | 512 | 512
ViT | food101 | g5.4xlarge | float16 | 992 | 768
ViT | food101 | p3.2xlarge | float16 | 848 | 768

PyTorch 1.12.0

Natural language processing (NLP) models

The following models are tested for training jobs for all combinations of single-node and multi-node
with single or multi GPU cores and Automatic Mixed Precision (AMP) as indicated.

Single-node/multi-node single-GPU/multi-GPU

Model | Dataset | Instance type | Precision | Sequence length | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 128 | 248
bert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 160 | 288
camembert-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 160 | 279
camembert-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 105 | 164
distilgpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 136 | 256
distilgpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 80 | 118
gpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 84 | 240
gpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 80 | 119
microsoft/deberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 93 | 197
microsoft/deberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 113 | 130
roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 125 | 224
roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 78 | 112
xlnet-base-cased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 138 | 240
bert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 | 52 |
distilbert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 | 160 |
gpt2 | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 | 25 |
roberta-base | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 | 64 |

TensorFlow 2.11.0

Computer Vision (CV) models

Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP) as indicated.

Single/multi-node single/multi-GPU

Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for Training Compiler
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.2xlarge | float16 | 6 | 8
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.p3.2xlarge | float16 | 4 | 6
ResNet50 | ImageNet | ml.g5.2xlarge | float16 | 192 | 256
ResNet50 | ImageNet | ml.p3.2xlarge | float16 | 256 | 256
ResNet101 | ImageNet | ml.g5.2xlarge | float16 | 128 | 256
ResNet101 | ImageNet | ml.p3.2xlarge | float16 | 128 | 128
ResNet152 | ImageNet | ml.g5.2xlarge | float16 | 128 | 224
ResNet152 | ImageNet | ml.p3.2xlarge | float16 | 128 | 128
VisionTransformer | ImageNet | ml.g5.2xlarge | float16 | 112 | 144
VisionTransformer | ImageNet | ml.p3.2xlarge | float16 | 96 | 128

Natural Language Processing (NLP) models

Tested using Transformer models with Sequence_Len=128 and Automatic Mixed Precision (AMP) as
indicated.

Single/multi-node single/multi-GPU

Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 160 | 197
albert-base-v2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 95 | 127
bert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 160 | 128
bert-base-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 104 | 111
bert-large-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 65 | 48
bert-large-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 40 | 35
camembert-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 162
camembert-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 105 | 111
distilbert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 256 | 264
distilbert-base-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 169
gpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 120
gpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 80 | 83
jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 32 | 32
jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 32 | 36
microsoft/mpnet-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 144 | 160
microsoft/mpnet-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 106 | 110
roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 128
roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 72 | 98
albert-base-v2 | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 128 | 192
albert-base-v2 | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 95 | 96
distilbert-base-uncased | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 256 | 256
distilbert-base-uncased | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 140 | 184
google/electra-small-discriminator | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 256 | 384
google/electra-small-discriminator | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 256 | 268
gpt2 | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 116 | 116
gpt2 | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 85 | 83
gpt2 | wikitext-2-raw-v1 | ml.p4d.24xlarge | float16 | 94 | 110
microsoft/mpnet-base | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 187 | 164
microsoft/mpnet-base | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 106 | 111

TensorFlow 2.10.0

Computer Vision (CV) models

Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP) as indicated.

Single-node single-GPU/multi-GPU

Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for Training Compiler
DetectionTransformer-ResNet50 | COCO-2017 | ml.g4dn.2xlarge | float32 | 2 | 4
DetectionTransformer-ResNet50 | COCO-2017 | ml.g5.2xlarge | float32 | 3 | 6
DetectionTransformer-ResNet50 | COCO-2017 | ml.p3.2xlarge | float32 | 2 | 4
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g4dn.2xlarge | float16 | 4 | 6
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.2xlarge | float16 | 6 | 8
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.48xlarge | float16 | 48 | 64
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.p3.2xlarge | float16 | 4 | 6
ResNet50 | ImageNet | ml.g4dn.2xlarge | float16 | 224 | 256
ResNet50 | ImageNet | ml.g5.2xlarge | float16 | 192 | 160
ResNet50 | ImageNet | ml.g5.48xlarge | float16 | 2048 | 2048
ResNet50 | ImageNet | ml.p3.2xlarge | float16 | 224 | 160
ResNet101 | ImageNet | ml.g4dn.2xlarge | float16 | 160 | 128
ResNet101 | ImageNet | ml.g5.2xlarge | float16 | 192 | 256
ResNet101 | ImageNet | ml.g5.48xlarge | float16 | 2048 | 2048
ResNet101 | ImageNet | ml.p3.2xlarge | float16 | 160 | 224
ResNet152 | ImageNet | ml.g4dn.2xlarge | float16 | 128 | 128
ResNet152 | ImageNet | ml.g5.2xlarge | float16 | 192 | 224
ResNet152 | ImageNet | ml.g5.48xlarge | float16 | 1536 | 1792
ResNet152 | ImageNet | ml.p3.2xlarge | float16 | 128 | 160
VisionTransformer | ImageNet | ml.g4dn.2xlarge | float16 | 80 | 128
VisionTransformer | ImageNet | ml.g5.2xlarge | float16 | 112 | 144
VisionTransformer | ImageNet | ml.g5.48xlarge | float16 | 896 | 1152
VisionTransformer | ImageNet | ml.p3.2xlarge | float16 | 80 | 128

Natural Language Processing (NLP) models

Tested using Transformer models with Sequence_Len=128 and Automatic Mixed Precision (AMP) as
indicated.

Single-node single-GPU/multi-GPU

Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 112
albert-base-v2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 128
albert-base-v2 | wikitext-2-raw-v1 | p3.8xlarge | float16 | 128 | 135
albert-base-v2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 191
bert-base-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 64 | 94
bert-base-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 101
bert-base-uncased | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 96
bert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128
bert-large-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 35 | 21
bert-large-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 39 | 26
bert-large-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 60 | 50
camembert-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 96 | 90
camembert-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 98
camembert-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 96
camembert-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128
distilbert-base-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 256 | 160
distilbert-base-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 176
distilbert-base-uncased | wikitext-2-raw-v1 | p3.8xlarge | float16 | 128 | 160
distilbert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 256 | 258
google_electra-small-discriminator | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 256 | 216
google_electra-small-discriminator | wikitext-2-raw-v1 | p3.2xlarge | float16 | 256 | 230
google_electra-small-discriminator | wikitext-2-raw-v1 | p3.8xlarge | float16 | 256 | 224
google_electra-small-discriminator | wikitext-2-raw-v1 | g5.4xlarge | float16 | 256 | 320
gpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 80 | 64
gpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 80 | 77
gpt2 | wikitext-2-raw-v1 | p3.8xlarge | float16 | 80 | 72
gpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 120
jplu_tf-xlm-roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 28 | 24
jplu_tf-xlm-roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 32 | 24
jplu_tf-xlm-roberta-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 32 | 26
jplu_tf-xlm-roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 66 | 52
microsoft_mpnet-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 96 | 92
microsoft_mpnet-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 101
microsoft_mpnet-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 101
microsoft_mpnet-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 152
roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 64 | 72
roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 64 | 84
roberta-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 64 | 86
roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128

TensorFlow 2.9.1
Tested using TensorFlow Model Garden with Automatic Mixed Precision (AMP).

Single-node single-GPU/multi-GPU

Model | Dataset | Instance type | Batch size for native frameworks | Batch size for Training Compiler
ResNet50 | ImageNet | ml.g4dn.2xlarge | 192 | 256*
ResNet101 | ImageNet | ml.g4dn.2xlarge | 128 | 160
ResNet101 | ImageNet | ml.g5.2xlarge | 224 | 256*
ResNet101 | ImageNet | ml.p3.16xlarge | 1536 | 1792
ResNet152 | ImageNet | ml.g5.2xlarge | 192 | 224
ResNet152 | ImageNet | ml.p3.2xlarge | 160 | 160
ResNet152 | ImageNet | ml.p3.16xlarge | 1024 | 1280
VisionTransformer | ImageNet | ml.g4dn.2xlarge | 80 | 128*
VisionTransformer | ImageNet | ml.g5.2xlarge | 112 | 128*
VisionTransformer | ImageNet | ml.p3.2xlarge | 56 | 128*
VisionTransformer | ImageNet | ml.p3.16xlarge | 640 | 1024*
DetectionTransformer-ResNet50 | COCO-2017 | ml.g4dn.2xlarge | 2 | 2
DetectionTransformer-ResNet50 | COCO-2017 | ml.g5.2xlarge | 3 | 6
DetectionTransformer-ResNet50 | COCO-2017 | ml.p3.2xlarge | 2 | 4
DetectionTransformer-ResNet50 | COCO-2017 | ml.p3.16xlarge | 8 | 32
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g4dn.2xlarge | 4 | 4
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.2xlarge | 6 | 8
MaskRCNN-ResNet50-FPN | COCO-2017 | ml.p3.2xlarge | 4 | 6

* Batch sizes marked with an asterisk (*) indicate the largest batch size tested by the SageMaker
Training Compiler developer team. For the marked cells, the instance may be able to fit a larger batch
size than what is indicated.

Transformers 4.21.1 with PyTorch 1.11.0


Tested with Sequence_Len=512 and Automatic Mixed Precision (AMP).


Single-node single-GPU

Model | Dataset | Instance type | Number of instances | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | wikitext-2 | ml.g4dn.2xlarge | 1 | 14 | 28
albert-base-v2 | wikitext-2 | ml.g5.2xlarge | 1 | 18 | 40
albert-base-v2 | wikitext-2 | ml.p3.2xlarge | 1 | 14 | 32
bert-base-cased | wikitext-2 | ml.g4dn.2xlarge | 1 | 12 | 24
bert-base-cased | wikitext-2 | ml.g5.2xlarge | 1 | 28 | 44
bert-base-cased | wikitext-2 | ml.p3.2xlarge | 1 | 16 | 20
camembert-base | wikitext-2 | ml.g4dn.2xlarge | 1 | 16 | 28
camembert-base | wikitext-2 | ml.g5.2xlarge | 1 | 24 | 40
camembert-base | wikitext-2 | ml.p3.2xlarge | 1 | 16 | 24
distilbert-base-uncased | wikitext-2 | ml.g4dn.2xlarge | 1 | 28 | 52
distilbert-base-uncased | wikitext-2 | ml.g5.2xlarge | 1 | 40 | 76
distilbert-base-uncased | wikitext-2 | ml.p3.2xlarge | 1 | 32 | 48
distilbert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | 4 | 82 | 160
distilgpt2 | wikitext-2 | ml.g4dn.2xlarge | 1 | 6 | 18
distilgpt2 | wikitext-2 | ml.g5.2xlarge | 1 | 12 | 28
distilgpt2 | wikitext-2 | ml.p3.2xlarge | 1 | 6 | 16
distilroberta-base | wikitext-2 | ml.g4dn.2xlarge | 1 | 20 | 40
distilroberta-base | wikitext-2 | ml.g5.2xlarge | 1 | 28 | 56
distilroberta-base | wikitext-2 | ml.p3.2xlarge | 1 | 24 | 40
EleutherAI/gpt-neo-125M | wikitext-2 | ml.g4dn.2xlarge | 1 | 4 | 8
EleutherAI/gpt-neo-125M | wikitext-2 | ml.g5.2xlarge | 1 | 6 | 14
EleutherAI/gpt-neo-125M | wikitext-2 | ml.p3.2xlarge | 1 | 4 | 10
gpt2 | wikitext-2 | ml.g4dn.2xlarge | 1 | 4 | 8
gpt2 | wikitext-2 | ml.g5.2xlarge | 1 | 6 | 16
gpt2 | wikitext-2 | ml.p3.2xlarge | 1 | 4 | 10
gpt2 | wikitext-103-v1 | ml.p4d.24xlarge | 4 | 13 | 25
roberta-base | wikitext-2 | ml.g4dn.2xlarge | 1 | 12 | 20
roberta-base | wikitext-2 | ml.g5.2xlarge | 1 | 24 | 36
roberta-base | wikitext-2 | ml.p3.2xlarge | 1 | 12 | 20
roberta-base | wikitext-103-v1 | ml.p4d.24xlarge | 4 | 36 | 64
xlnet-base-cased | wikitext-2 | ml.g4dn.2xlarge | 1 | 2 | 6
xlnet-base-cased | wikitext-2 | ml.g5.2xlarge | 1 | 2 | 10
xlnet-base-cased | wikitext-2 | ml.p3.2xlarge | 1 | 2 | 8
bert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | 2 | 32 | 64
bert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | 4 | 32 | 64
bert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | 8 | 32 | 64
bert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | 16 | 32 | 64
roberta-large | wikitext-103-v1 | ml.p4d.24xlarge | 4 | 16 | 24
microsoft/deberta-v3-base | wikitext-103-v1 | ml.p4d.24xlarge | 16 | 9 | 23

Transformers 4.17.0 with PyTorch 1.10.2

Tested with Sequence_Len=512 and Automatic Mixed Precision (AMP).

Single-node single-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | ml.p3.2xlarge | 14 | 28
albert-base-v2 | ml.g4dn.2xlarge | 14 | 24
bert-base-cased | ml.p3.2xlarge | 16 | 24
bert-base-cased | ml.g4dn.2xlarge | 12 | 24
bert-base-uncased | ml.p3.2xlarge | 16 | 24
bert-base-uncased | ml.g4dn.2xlarge | 12 | 28
camembert-base | ml.p3.2xlarge | 12 | 24
camembert-base | ml.g4dn.2xlarge | 12 | 28
distilbert-base-uncased | ml.p3.2xlarge | 28 | 48
distilbert-base-uncased | ml.g4dn.2xlarge | 24 | 52
distilgpt2 | ml.p3.2xlarge | 6 | 12
distilgpt2 | ml.g4dn.2xlarge | 6 | 14
distilroberta-base | ml.p3.2xlarge | 20 | 40
distilroberta-base | ml.g4dn.2xlarge | 12 | 40
EleutherAI/gpt-neo-125M | ml.p3.2xlarge | 2 | 10
EleutherAI/gpt-neo-125M | ml.g4dn.2xlarge | 2 | 8
facebook/bart-base | ml.p3.2xlarge | 2 | 6
facebook/bart-base | ml.g4dn.2xlarge | 2 | 6
gpt2 | ml.p3.2xlarge | 4 | 8
gpt2 | ml.g4dn.2xlarge | 2 | 8
roberta-base | ml.p3.2xlarge | 12 | 20
roberta-base | ml.g4dn.2xlarge | 12 | 20
xlnet-base-cased | ml.p3.2xlarge | 2 | 8
xlnet-base-cased | ml.g4dn.2xlarge | 4 | 6

Transformers 4.11.0 with PyTorch 1.9.0

Tested with Sequence_Len=512 and Automatic Mixed Precision (AMP).

Single-node single-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | ml.p3.2xlarge | 12 | 32
bert-base-cased | ml.p3.2xlarge | 14 | 24
bert-base-chinese | ml.p3.2xlarge | 16 | 24
bert-base-multilingual-cased | ml.p3.2xlarge | 4 | 16
bert-base-multilingual-uncased | ml.p3.2xlarge | 8 | 16
bert-base-uncased | ml.p3.2xlarge | 12 | 24
cl-tohoku/bert-base-japanese-whole-word-masking | ml.p3.2xlarge | 12 | 24
cl-tohoku/bert-base-japanese | ml.p3.2xlarge | 12 | 24
distilbert-base-uncased | ml.p3.2xlarge | 28 | 32
distilbert-base-uncased-finetuned-sst-2-english | ml.p3.2xlarge | 28 | 32
distilgpt2 | ml.p3.2xlarge | 16 | 32
facebook/bart-base | ml.p3.2xlarge | 4 | 8
gpt2 | ml.p3.2xlarge | 6 | 20
nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large | ml.p3.2xlarge | 20 | 32
roberta-base | ml.p3.2xlarge | 12 | 20

Single-node multi-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
bert-base-chinese | ml.p3.8xlarge | 16 | 26
bert-base-multilingual-cased | ml.p3.8xlarge | 6 | 16
bert-base-multilingual-uncased | ml.p3.8xlarge | 6 | 16
bert-base-uncased | ml.p3.8xlarge | 14 | 24
distilbert-base-uncased | ml.p3.8xlarge | 14 | 32
distilgpt2 | ml.p3.8xlarge | 6 | 32
facebook/bart-base | ml.p3.8xlarge | 8 | 16
gpt2 | ml.p3.8xlarge | 8 | 20
roberta-base | ml.p3.8xlarge | 12 | 20

Transformers 4.17.0 with TensorFlow 2.6.3

Tested with Sequence_Len=128 and Automatic Mixed Precision (AMP).

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | ml.g4dn.16xlarge | 136 | 208
albert-base-v2 | ml.g5.4xlarge | 219 | 312
albert-base-v2 | ml.p3.2xlarge | 152 | 208
albert-base-v2 | ml.p3.8xlarge | 152 | 192
bert-base-uncased | ml.g4dn.16xlarge | 120 | 101
bert-base-uncased | ml.g5.4xlarge | 184 | 160
bert-base-uncased | ml.p3.2xlarge | 128 | 108
bert-large-uncased | ml.g4dn.16xlarge | 37 | 28
bert-large-uncased | ml.g5.4xlarge | 64 | 55
bert-large-uncased | ml.p3.2xlarge | 40 | 32
camembert-base | ml.g4dn.16xlarge | 96 | 100
camembert-base | ml.g5.4xlarge | 190 | 160
camembert-base | ml.p3.2xlarge | 129 | 108
camembert-base | ml.p3.8xlarge | 128 | 104
distilbert-base-uncased | ml.g4dn.16xlarge | 210 | 160
distilbert-base-uncased | ml.g5.4xlarge | 327 | 288
distilbert-base-uncased | ml.p3.2xlarge | 224 | 196
distilbert-base-uncased | ml.p3.8xlarge | 192 | 182
google_electra-small-discriminator | ml.g4dn.16xlarge | 336 | 288
google_electra-small-discriminator | ml.g5.4xlarge | 504 | 384
google_electra-small-discriminator | ml.p3.2xlarge | 352 | 323
gpt2 | ml.g4dn.16xlarge | 89 | 64
gpt2 | ml.g5.4xlarge | 140 | 146
gpt2 | ml.p3.2xlarge | 94 | 96
gpt2 | ml.p3.8xlarge | 96 | 88
jplu_tf-xlm-roberta-base | ml.g4dn.16xlarge | 52 | 16
jplu_tf-xlm-roberta-base | ml.g5.4xlarge | 64 | 44
microsoft_mpnet-base | ml.g4dn.16xlarge | 120 | 100
microsoft_mpnet-base | ml.g5.4xlarge | 192 | 160
microsoft_mpnet-base | ml.p3.2xlarge | 128 | 104
microsoft_mpnet-base | ml.p3.8xlarge | 130 | 92
roberta-base | ml.g4dn.16xlarge | 108 | 64
roberta-base | ml.g5.4xlarge | 176 | 142
roberta-base | ml.p3.2xlarge | 118 | 100
roberta-base | ml.p3.8xlarge | 112 | 88

Transformers 4.11.0 with TensorFlow 2.5.1

Tested with Sequence_Len=128 and Automatic Mixed Precision (AMP).


Single-node single-GPU

Model | Instance type | Batch size for native frameworks | Batch size for Training Compiler
albert-base-v2 | ml.p3.2xlarge | 128 | 128
bart-base | ml.p3.2xlarge | 12 | 64
bart-large | ml.p3.2xlarge | 4 | 28
bert-base-cased | ml.p3.2xlarge | 16 | 128
bert-base-chinese | ml.p3.2xlarge | 16 | 128
bert-base-multilingual-cased | ml.p3.2xlarge | 12 | 64
bert-base-multilingual-uncased | ml.p3.2xlarge | 16 | 96
bert-base-uncased | ml.p3.2xlarge | 16 | 96
bert-large-uncased | ml.p3.2xlarge | 4 | 24
cl-tohoku/bert-base-japanese | ml.p3.2xlarge | 16 | 128
cl-tohoku/bert-base-japanese-whole-word-masking | ml.p3.2xlarge | 16 | 128
distilbert-base-sst2 | ml.p3.2xlarge | 32 | 128
distilbert-base-uncased | ml.p3.2xlarge | 32 | 128
distilgpt2 | ml.p3.2xlarge | 32 | 128
gpt2 | ml.p3.2xlarge | 12 | 64
gpt2-large | ml.p3.2xlarge | 2 | 24
jplu/tf-xlm-roberta-base | ml.p3.2xlarge | 12 | 32
roberta-base | ml.p3.2xlarge | 4 | 64
roberta-large | ml.p3.2xlarge | 4 | 64
t5-base | ml.p3.2xlarge | 64 | 64
t5-small | ml.p3.2xlarge | 128 | 128

Bring Your Own Deep Learning Model


This guide walks you through how to adapt your training script for a compiler-accelerated training job.
The preparation of your training script depends on the following:

• Training settings such as single-core or distributed training.
• Frameworks and libraries that you use to create the training script.

Choose one of the following topics depending on the framework you use.


Topics
• PyTorch (p. 1968)
• TensorFlow (p. 1973)

Note
After you finish preparing your training script, you can run a SageMaker training job using the
SageMaker framework estimator classes. For more information, see the previous topic at Enable
SageMaker Training Compiler (p. 1975).

PyTorch
Bring your own PyTorch model to SageMaker, and run the training job with SageMaker Training Compiler.

Topics
• PyTorch Models with Hugging Face Transformers (p. 1968)

PyTorch Models with Hugging Face Transformers

PyTorch models with Hugging Face Transformers are based on PyTorch's torch.nn.Module API. Hugging
Face Transformers also provides Trainer and pretrained model classes for PyTorch to help reduce the
effort for configuring natural language processing (NLP) models. After preparing your training script, you
can launch a training job using the SageMaker PyTorch or HuggingFace estimator with the SageMaker
Training Compiler configuration when you proceed to the next topic at Enable SageMaker Training
Compiler (p. 1975).
Tip
When you create a tokenizer for an NLP model using Transformers in your training script, make
sure that you use a static input tensor shape by specifying padding='max_length'. Do not
use padding='longest' because padding to the longest sequence in the batch can change
the tensor shape for each training batch. The dynamic input shape can trigger recompilation of
the model and might increase total training time. For more information about padding options
of the Transformers tokenizers, see Padding and truncation in the Hugging Face Transformers
documentation.

Topics
• Large Language Models Using the Hugging Face Transformers Trainer Class (p. 1968)
• Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer
API) (p. 1969)

Large Language Models Using the Hugging Face Transformers Trainer Class

If you use the transformers library’s Trainer class, you don’t need to make any additional changes to your
training script. SageMaker Training Compiler automatically compiles your Trainer model if you enable it
through the estimator class. The following code shows the basic form of a PyTorch training script with
Hugging Face Trainer API.

from transformers import Trainer, TrainingArguments

training_args=TrainingArguments(**kwargs)
trainer=Trainer(args=training_args, **kwargs)

Topics
• For single GPU training (p. 1969)
• For distributed training (p. 1969)


• Best Practices to Use SageMaker Training Compiler with Trainer (p. 1969)

For single GPU training


You don't need to change your code when you use the transformers.Trainer class.

For distributed training


PyTorch v1.11.0 and later

To run distributed training with SageMaker Training Compiler, you must add the following _mp_fn()
function in your training script and wrap the main() function. It redirects the _mp_fn(index) function
calls from the SageMaker distributed runtime for PyTorch (pytorchxla) to the main() function of your
training script.

def _mp_fn(index):
    main()

This function accepts the index argument to indicate the rank of the current GPU in the cluster
for distributed training. To find more example scripts, see the Hugging Face Transformers language
modeling example scripts.

For Transformers v4.17 and before with PyTorch v1.10.2 and before

SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job, and
you don't need to make any modification in your training script. Instead, SageMaker Training Compiler
requires you to pass a SageMaker distributed training launcher script to the entry_point argument and
pass your training script to the hyperparameters argument in the SageMaker Hugging Face estimator.

Best Practices to Use SageMaker Training Compiler with Trainer

• Make sure that you use SyncFree optimizers by setting the optim argument to adamw_torch_xla
when setting up transformers.TrainingArguments. See also Optimizer in the Hugging Face Transformers
documentation.
• Ensure that the throughput of the data processing pipeline is higher than the training throughput. You
can tweak the dataloader_num_workers and preprocessing_num_workers arguments of the
transformers.TrainingArguments class to achieve this. Typically, these need to be greater than or equal
to the number of GPUs but less than the number of CPUs. The sketch following this list shows these
settings.
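The following minimal sketch shows the settings together; the output path and worker count are
illustrative and depend on your instance's CPU and GPU counts.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    optim="adamw_torch_xla",      # SyncFree optimizer for XLA
    dataloader_num_workers=8,     # >= number of GPUs, < number of CPUs
)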

After you have completed adapting your training script, proceed to the section called “Run PyTorch
Training Jobs with Training Compiler” (p. 1976).

Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer
API)
If you have a training script that uses PyTorch directly, you need to make additional changes to your
PyTorch training script to implement PyTorch/XLA. Follow the instructions to modify your script to
properly set up the PyTorch/XLA primitives.

Topics
• For single GPU training (p. 1969)
• For distributed training (p. 1970)
• Best Practices to Use SageMaker Training Compiler with PyTorch/XLA (p. 1972)

For single GPU training

1. Import the optimization libraries.


import torch_xla
import torch_xla.core.xla_model as xm

2. Change the target device to be XLA instead of torch.device("cuda")

device=xm.xla_device()

3. If you're using PyTorch's Automatic Mixed Precision (AMP), do the following:

   a. Replace torch.cuda.amp with the following:

      import torch_xla.amp

   b. Replace torch.optim.SGD and torch.optim.Adam with the following:

      from torch_xla.amp.syncfree import Adam
      from torch_xla.amp.syncfree import SGD

   c. Replace torch.cuda.amp.GradScaler with the following:

      from torch_xla.amp import GradScaler

4. If you're not using AMP, replace optimizer.step() with the following:

xm.optimizer_step(optimizer)

5. If you're using a distributed dataloader, wrap your dataloader in the PyTorch/XLA's ParallelLoader
class:

import torch_xla.distributed.parallel_loader as pl
parallel_loader=pl.ParallelLoader(dataloader, [device]).per_device_loader(device)

6. Add mark_step at the end of the training loop when you're not using parallel_loader:

xm.mark_step()

7. To checkpoint your training, use the PyTorch/XLA's model checkpoint method:

xm.save(model.state_dict(), path_to_save)

After you have completed adapting your training script, proceed to the section called “Run PyTorch
Training Jobs with Training Compiler” (p. 1976).

For distributed training

In addition to the changes listed in the previous For single GPU training (p. 1969) section, add the
following changes to properly distribute workload across GPUs.

1. If you're using AMP, add all_reduce after scaler.scale(loss).backward():

gradients=xm._fetch_gradients(optimizer)
xm.all_reduce('sum', gradients, scale=1.0/xm.xrt_world_size())

2. If you need to set variables for local_ranks and world_size, use similar code to the following:

local_rank=xm.get_local_ordinal()


world_size=xm.xrt_world_size()

3. For any world_size (num_gpus_per_node*num_nodes) greater than 1, you must define a train
sampler which should look similar to the following:

import torch_xla.core.xla_model as xm

if xm.xrt_world_size() > 1:
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=xm.xrt_world_size(),
        rank=xm.get_ordinal(),
        shuffle=True
    )
else:
    train_sampler = None

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    drop_last=args.drop_last,
    shuffle=False if train_sampler else True,
    num_workers=args.num_workers
)

4. Make the following changes to make sure you use the parallel_loader provided by the
torch_xla distributed module.

import torch_xla.distributed.parallel_loader as pl
train_device_loader=pl.MpDeviceLoader(train_loader, device)

The train_device_loader functions like a regular PyTorch loader, as follows:

for step, (data, target) in enumerate(train_device_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()

With all of these changes, you should be able to launch distributed training with any PyTorch model
without the Transformer Trainer API. Note that these instructions can be used for both single-node
multi-GPU and multi-node multi-GPU.
5. For PyTorch v1.11.0 and later

To run distributed training with SageMaker Training Compiler, you must add the following _mp_fn()
function in your training script and wrap the main() function. It redirects the _mp_fn(index)
function calls from the SageMaker distributed runtime for PyTorch (pytorchxla) to the main()
function of your training script.

def _mp_fn(index):
    main()

This function accepts the index argument to indicate the rank of the current GPU in the cluster
for distributed training. To find more example scripts, see the Hugging Face Transformers language
modeling example scripts.

For Transformers v4.17 and before with PyTorch v1.10.2 and before


SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job
and requires you to pass a SageMaker distributed training launcher script to the entry_point
argument and pass your training script to the hyperparameters argument in the SageMaker
Hugging Face estimator.

After you have completed adapting your training script, proceed to the section called “Run PyTorch
Training Jobs with Training Compiler” (p. 1976).

Best Practices to Use SageMaker Training Compiler with PyTorch/XLA

If you want to leverage the SageMaker Training Compiler on your native PyTorch training script, you may
want to first get familiar with PyTorch on XLA devices. The following sections list some best practices to
enable XLA for PyTorch.
Note
This section for best practices assumes that you use the following PyTorch/XLA modules:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

Understand the lazy mode in PyTorch/XLA

One significant difference between PyTorch/XLA and native PyTorch is that the PyTorch/XLA system
runs in lazy mode while the native PyTorch runs in eager mode. Tensors in lazy mode are placeholders
for building the computational graph until they are materialized after the compilation and evaluation
are complete. The PyTorch/XLA system builds the computational graph on the fly when you call PyTorch
APIs to build the computation using tensors and operators. The computational graph gets compiled
and executed when xm.mark_step() is called explicitly or implicitly by pl.MpDeviceLoader/
pl.ParallelLoader, or when you explicitly request the value of a tensor such as by calling
loss.item() or print(loss).

Minimize the number of compilation-and-executions using pl.MpDeviceLoader/pl.ParallelLoader and xm.step_closure

For best performance, you should keep in mind the possible ways to initiate compilation-and-executions
as described in Understand the lazy mode in PyTorch/XLA (p. 1972) and should try to minimize the
number of compilation-and-executions. Ideally, only one compilation-and-execution is necessary per
training iteration and is initiated automatically by pl.MpDeviceLoader/pl.ParallelLoader. The
MpDeviceLoader is optimized for XLA and should always be used if possible for best performance.
During training, you might want to examine some intermediate results such as loss values. In such case,
the printing of lazy tensors should be wrapped using xm.add_step_closure() to avoid unnecessary
compilation-and-executions.
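For example, a hedged sketch of logging the loss without forcing an extra compilation-and-execution
(the helper name _log_loss is illustrative, and loss and step are assumed to come from your training
loop):

import torch_xla.core.xla_model as xm

def _log_loss(loss, step):
    # Runs after the step's graph executes, when the tensor is materialized.
    print(f"step {step}: loss={loss.item()}")

# Inside the training loop, defer reading the lazy tensor instead of
# calling loss.item() directly (which triggers an immediate execution).
xm.add_step_closure(_log_loss, args=(loss, step))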

Use AMP and syncfree optimizers

Training in Automatic Mixed Precision (AMP) mode significantly accelerates your training speed
by leveraging the Tensor cores of NVIDIA GPUs. SageMaker Training Compiler provides syncfree
optimizers that are optimized for XLA to improve AMP performance. Currently, the following three
syncfree optimizers are available and should be used if possible for best performance.

torch_xla.amp.syncfree.SGD
torch_xla.amp.syncfree.Adam
torch_xla.amp.syncfree.AdamW

These syncfree optimizers should be paired with torch_xla.amp.GradScaler for gradient scaling/
unscaling.


Tip
Starting with PyTorch 1.13.1, SageMaker Training Compiler improves performance by letting
PyTorch/XLA automatically override the optimizers (such as SGD, Adam, and AdamW)
in torch.optim or transformers.optimization with their syncfree versions
in torch_xla.amp.syncfree (such as torch_xla.amp.syncfree.SGD,
torch_xla.amp.syncfree.Adam, and torch_xla.amp.syncfree.AdamW). You don't need to
change the code lines where you define optimizers in your training script.

TensorFlow
Bring your own TensorFlow model to SageMaker, and run the training job with SageMaker Training
Compiler.

TensorFlow Models
SageMaker Training Compiler automatically optimizes model training workloads that are built on top of
the native TensorFlow API or the high-level Keras API.
Tip
For preprocessing your input dataset, ensure that you use a static input shape. Dynamic input
shape can initiate recompilation of the model and might increase total training time.

Using Keras (Recommended)

For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras
(tf.keras.Model).

For single GPU training

There's no additional change you need to make in the training script.

Without Keras

SageMaker Training Compiler does not support eager execution in TensorFlow. Accordingly, you should
wrap your model and training loops with the TensorFlow function decorator (@tf.function) to
leverage compiler acceleration.

SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure
your TensorFlow functions are set to run in graph mode.

For single GPU training

TensorFlow 2.0 and later has eager execution on by default, so you should add the @tf.function
decorator in front of every function that you use for constructing a TensorFlow model.
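As a minimal, hedged sketch (the model, optimizer, and loss below are placeholders), a custom train step
wrapped for graph mode might look like the following:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function  # run the step as a compiled graph instead of eagerly
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        logits = model(inputs, training=True)
        loss = loss_fn(labels, logits)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss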

TensorFlow Models with Hugging Face Transformers


TensorFlow models with Hugging Face Transformers are based on TensorFlow's tf.keras.Model API.
Hugging Face Transformers also provides pretrained model classes for TensorFlow to help reduce the
effort for configuring natural language processing (NLP) models. After creating your own training script
using the Transformers library, you can run the training script using the SageMaker HuggingFace
estimator with the SageMaker Training Compiler configuration class as shown in the previous topic at
Run TensorFlow Training Jobs with SageMaker Training Compiler (p. 1983).

SageMaker Training Compiler automatically optimizes model training workloads that are built on top of
the native TensorFlow API or the high-level Keras API, such as the TensorFlow transformer models.
Tip
When you create a tokenizer for an NLP model using Transformers in your training script, make
sure that you use a static input tensor shape by specifying padding='max_length'. Do not


use padding='longest' because padding to the longest sequence in the batch can change
the tensor shape for each training batch. The dynamic input shape can initiate recompilation of
the model and might increase total training time. For more information about padding options
of the Transformers tokenizers, see Padding and truncation in the Hugging Face Transformers
documentation.

Topics
• Using Keras (p. 1974)
• Without Keras (p. 1975)

Using Keras

For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras
(tf.keras.Model). As noted in the Quick tour page in the Hugging Face Transformers documentation, you
can use the models as regular TensorFlow Keras models.

For single GPU training

There's no additional change you need to make in the training script.

For distributed training

SageMaker Training Compiler acceleration works transparently for multi-GPU workloads when the model
is constructed and trained using Keras APIs within a tf.distribute.Strategy.scope() call.

1. Choose the right distributed training strategy.


a. For single-node multi-GPU, use tf.distribute.MirroredStrategy to set the strategy.

strategy = tf.distribute.MirroredStrategy()

b. For multi-node multi-GPU, add the following code to properly set the TensorFlow distributed
training configuration before creating the strategy.

import json
import os

def set_sm_dist_config():
    DEFAULT_PORT = '8890'
    DEFAULT_CONFIG_FILE = '/opt/ml/input/config/resourceconfig.json'
    with open(DEFAULT_CONFIG_FILE) as f:
        config = json.loads(f.read())
    current_host = config['current_host']
    tf_config = {
        'cluster': {
            'worker': []
        },
        'task': {'type': 'worker', 'index': -1}
    }
    for i, host in enumerate(config['hosts']):
        tf_config['cluster']['worker'].append("%s:%s" % (host, DEFAULT_PORT))
        if current_host == host:
            tf_config['task']['index'] = i
    os.environ['TF_CONFIG'] = json.dumps(tf_config)

set_sm_dist_config()

Use tf.distribute.MultiWorkerMirroredStrategy to set the strategy.

strategy = tf.distribute.MultiWorkerMirroredStrategy()


2. Using the strategy of your choice, wrap the model.

with strategy.scope():
    # create a model and do fit

Without Keras

If you want to bring custom models with custom training loops using TensorFlow without Keras, you
should wrap the model and the training loop with the TensorFlow function decorator (@tf.function)
to leverage compiler acceleration.

SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure
your TensorFlow functions are set to run in graph mode.

For single GPU training

TensorFlow 2.0 and later has eager execution on by default, so you should add the @tf.function
decorator in front of every function that you use for constructing a TensorFlow model.

For distributed training

In addition to the changes needed for Using Keras for distributed training, you need to ensure that
functions to be run on each GPU are annotated with @tf.function, while cross-GPU communication
functions are not annotated. An example training code should look like the following:

@tf.function()
def compiled_step(inputs, outputs):
    with tf.GradientTape() as tape:
        pred = model(inputs, training=True)
        total_loss = loss_object(outputs, pred) / args.batch_size
    gradients = tape.gradient(total_loss, model.trainable_variables)
    return total_loss, pred, gradients

def train_step(inputs, outputs):
    total_loss, pred, gradients = compiled_step(inputs, outputs)
    if args.weight_decay > 0.:
        gradients = [g + v * args.weight_decay
                     for g, v in zip(gradients, model.trainable_variables)]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss.update_state(total_loss)
    train_accuracy.update_state(outputs, pred)

@tf.function()
def train_step_dist(inputs, outputs):
    strategy.run(train_step, args=(inputs, outputs))

Note that this instruction can be used for both single-node multi-GPU and multi-node multi-GPU.

Enable SageMaker Training Compiler


SageMaker Training Compiler is built into the SageMaker Python SDK and AWS Deep Learning
Containers, so you don't need to change your workflows to enable Training Compiler. Choose one of
the following topics that matches your use case.

Topics


• Run PyTorch Training Jobs with SageMaker Training Compiler (p. 1976)
• Run TensorFlow Training Jobs with SageMaker Training Compiler (p. 1983)

Run PyTorch Training Jobs with SageMaker Training Compiler


You can use any of the SageMaker interfaces to run a training job with SageMaker Training Compiler:
Amazon SageMaker Studio, Amazon SageMaker notebook instances, AWS SDK for Python (Boto3), and
AWS Command Line Interface.

Topics
• Using the SageMaker Python SDK (p. 1976)
• Using the SageMaker CreateTrainingJob API Operation (p. 1983)

Using the SageMaker Python SDK


SageMaker Training Compiler for PyTorch is available through the SageMaker PyTorch and
HuggingFace framework estimator classes. To turn on SageMaker Training Compiler, add the
compiler_config parameter to the SageMaker estimators. Import the TrainingCompilerConfig
class and pass an instance of it to the compiler_config parameter. The following code examples show
the structure of SageMaker estimator classes with SageMaker Training Compiler turned on.
Tip
To get started with prebuilt models provided by PyTorch or Transformers, try using the batch
sizes provided in the reference table at Tested Models (p. 1952).
Note
The native PyTorch support is available in the SageMaker Python SDK v2.121.0 and later. Make
sure that you update the SageMaker Python SDK accordingly.
Note
Starting with PyTorch v1.12.0, SageMaker Training Compiler containers for PyTorch are available.
Note that the SageMaker Training Compiler containers for PyTorch are not prepackaged with
Hugging Face Transformers. If you need to install the library in the container, make sure that
you add the requirements.txt file under the source directory when submitting a training job.
For PyTorch v1.11.0 and before, use the previous versions of the SageMaker Training Compiler
containers for Hugging Face and PyTorch.
For a complete list of framework versions and corresponding container information, see the
section called “Supported Frameworks” (p. 1950).

For information that fits your use case, see one of the following options.

For single GPU training

PyTorch v1.12.0 and later

To compile and train a PyTorch model, configure a SageMaker PyTorch estimator with SageMaker
Training Compiler as shown in the following code example.
Note
This native PyTorch support is available in the SageMaker Python SDK v2.120.0 and later.
Make sure that you update the SageMaker Python SDK.

from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='train.py',
    source_dir='path-to-requirements-file',  # Optional. Add this if you need to install additional packages.
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()

Hugging Face Transformers with PyTorch v1.11.0 and before

To compile and train a transformer model with PyTorch, configure a SageMaker Hugging Face
estimator with SageMaker Training Compiler as shown in the following code example.

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()

To prepare your training script, see the following pages.

• For single GPU training (p. 1969) of a PyTorch model using Hugging Face Transformers' Trainer
API
• For single GPU training (p. 1969) of a PyTorch model without Hugging Face Transformers' Trainer
API

To find end-to-end examples, see the following notebooks:

• Compile and Train a Hugging Face Transformers Trainer Model for Question and Answering with
the SQuAD dataset
• Compile and Train a Hugging Face Transformer BERT Model with the SST Dataset using SageMaker
Training Compiler
• Compile and Train a Binary Classification Trainer Model with the SST2 Dataset for Single-Node
Single-GPU Training

For distributed training

PyTorch v1.12

For PyTorch v1.12, you can run distributed training with SageMaker Training Compiler by adding
the pytorchxla option to the distribution parameter of the SageMaker PyTorch
estimator class.
Note
This native PyTorch support is available in the SageMaker Python SDK v2.121.0 and later.
Make sure that you update the SageMaker Python SDK.

from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='your_training_script.py',
    source_dir='path-to-requirements-file',  # Optional. Add this if you need to install additional packages.
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()

Tip
To prepare your training script, see PyTorch (p. 1968)
Transformers v4.21 with PyTorch v1.11

For PyTorch v1.11 and later, SageMaker Training Compiler is available for distributed training with
the pytorchxla option specified in the distribution parameter.

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='your_training_script.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()

Tip
To prepare your training script, see the following pages.

• For distributed training (p. 1969) of a PyTorch model using Hugging Face Transformers'
Trainer API


• For distributed training (p. 1970) of a PyTorch model without Hugging Face Transformers'
Trainer API

Transformers v4.17 with PyTorch v1.10.2 and before

For the supported version of PyTorch v1.10.2 and before, SageMaker Training Compiler requires an
alternate mechanism for launching a distributed training job. To run distributed training, SageMaker
Training Compiler requires you to pass a SageMaker distributed training launcher script to the
entry_point argument, and pass your training script to the hyperparameters argument. The
following code example shows how to configure a SageMaker Hugging Face estimator applying the
required changes.

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

training_script="your_training_script.py"

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate,
    "training_script": training_script  # Specify the file name of your training script.
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='distributed_training_launcher.py',  # Specify the distributed training launcher script.
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()

The launcher script should look like the following. It wraps your training script and configures the
distributed training environment depending on the size of the training instance of your choice.

# distributed_training_launcher.py
#!/bin/python

import subprocess
import sys

if __name__ == "__main__":
    arguments_command = " ".join([arg for arg in sys.argv[1:]])
    """
    The following line takes care of setting up an inter-node communication
    as well as managing intra-node workers for each GPU.
    """
    subprocess.check_call(
        "python -m torch_xla.distributed.sm_dist " + arguments_command,
        shell=True,
    )

Tip
To prepare your training script, see the following pages.

• For distributed training (p. 1969) of a PyTorch model using Hugging Face Transformers'
Trainer API
• For distributed training (p. 1970) of a PyTorch model without Hugging Face Transformers'
Trainer API

Tip
To find end-to-end examples, see the following notebooks:

• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Single-Node Multi-GPU Training
• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Multi-Node Multi-GPU Training

The following list is the minimal set of parameters required to run a SageMaker training job with the
compiler.
Note
When using the SageMaker Hugging Face estimator, you must specify the
transformers_version, pytorch_version, hyperparameters, and compiler_config
parameters to enable SageMaker Training Compiler. You cannot use image_uri to manually
specify the Training Compiler integrated Deep Learning Containers that are listed at Supported
Frameworks (p. 1950).

• entry_point (str) – Required. Specify the file name of your training script.
Note
To run a distributed training with SageMaker Training Compiler and PyTorch v1.10.2 and
before, specify the file name of a launcher script to this parameter. The launcher script should
be prepared to wrap your training script and configure the distributed training environment.
For more information, see the following example notebooks:
• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Single-Node Multi-GPU Training
• Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2
Dataset for Multi-Node Multi-GPU Training
• source_dir (str) – Optional. Add this if you need to install additional packages. To install packages, you
need to prepare a requirements.txt file under this directory.
• instance_count (int) – Required. Specify the number of instances.
• instance_type (str) – Required. Specify the instance type.
• transformers_version (str) – Required only when using the SageMaker Hugging Face estimator.
Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To
find available versions, see Supported Frameworks (p. 1950).


• framework_version or pytorch_version (str) – Required. Specify the PyTorch version supported
by SageMaker Training Compiler. To find available versions, see Supported Frameworks (p. 1950).
Note
When using the SageMaker Hugging Face estimator, you must specify both
transformers_version and pytorch_version.
• hyperparameters (dict) – Optional. Specify hyperparameters for the training job, such as n_gpus,
batch_size, and learning_rate. When you enable SageMaker Training Compiler, try larger batch
sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted
batch sizes to improve training speed, see the section called “Tested Models” (p. 1952) and SageMaker
Training Compiler Example Notebooks and Blogs (p. 1989).
Note
To run a distributed training with SageMaker Training Compiler and PyTorch v1.10.2 and
before, you need to add an additional parameter, "training_script", to specify your
training script, as shown in the preceding code example.
• compiler_config (TrainingCompilerConfig object) – Required to activate SageMaker Training
Compiler. Include this parameter to turn on SageMaker Training Compiler. The following are
parameters for the TrainingCompilerConfig class.
• enabled (bool) – Optional. Specify True or False to turn on or turn off SageMaker Training
Compiler. The default value is True.
• debug (bool) – Optional. To receive more detailed training logs from your compiler-accelerated
training jobs, change it to True. However, the additional logging might add overhead and slow
down the compiled training job. The default value is False.
• distribution (dict) – Optional. To run a distributed training job with SageMaker Training Compiler,
add distribution = { 'pytorchxla' : { 'enabled': True }}.

Warning
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training
Compiler. We recommend that you turn off Debugger when running SageMaker Training
Compiler to make sure there's no impact on performance. For more information, see the section
called “Considerations” (p. 1990). To turn the Debugger functionalities off, add the following
two arguments to the estimator:

disable_profiler=True,
debugger_hook_config=False

If the training job with the compiler is launched successfully, you receive the following logs during the
job initialization phase:

• With TrainingCompilerConfig(debug=False)

Found configuration for Training Compiler


Configuring SM Training Compiler...

• With TrainingCompilerConfig(debug=True)

Found configuration for Training Compiler


Configuring SM Training Compiler...
Training Compiler set to debug mode


Using the SageMaker CreateTrainingJob API Operation


SageMaker Training Compiler configuration options must be specified through the
AlgorithmSpecification and HyperParameters fields in the request syntax for the
CreateTrainingJob API operation.

"AlgorithmSpecification": {
"TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
"sagemaker_training_compiler_enabled": "true",
"sagemaker_training_compiler_debug_mode": "false",
"sagemaker_pytorch_xla_multi_worker_enabled": "false" // set to "true" for
distributed training
}

To find a complete list of deep learning container image URIs that have SageMaker Training Compiler
implemented, see Supported Frameworks (p. 1950).
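
For illustration, a minimal AWS SDK for Python (Boto3) sketch of such a request might look like the following. The job name, image URI, role ARN, and S3 path are hypothetical placeholders; the remaining required request fields follow the standard CreateTrainingJob syntax.

import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_training_job(
    TrainingJobName="trcomp-pytorch-example",  # hypothetical name
    AlgorithmSpecification={
        "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>",
        "TrainingInputMode": "File",
    },
    HyperParameters={
        "sagemaker_training_compiler_enabled": "true",
        "sagemaker_training_compiler_debug_mode": "false",
        "sagemaker_pytorch_xla_multi_worker_enabled": "false",
    },
    RoleArn="arn:aws:iam::111122223333:role/YourSageMakerExecutionRole",
    OutputDataConfig={"S3OutputPath": "s3://your-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.p3.2xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
)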

Run TensorFlow Training Jobs with SageMaker Training Compiler

You can use any of the SageMaker interfaces to run a training job with SageMaker Training Compiler:
Amazon SageMaker Studio, Amazon SageMaker notebook instances, AWS SDK for Python (Boto3), and
AWS Command Line Interface.

Topics
• Using the SageMaker Python SDK (p. 1983)
• Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning
Containers (p. 1987)
• Enable SageMaker Training Compiler Using the SageMaker CreateTrainingJob API
Operation (p. 1989)

Using the SageMaker Python SDK


To turn on SageMaker Training Compiler, add the compiler_config parameter to the SageMaker
TensorFlow or Hugging Face estimator. Import the TrainingCompilerConfig class and pass an
instance of it to the compiler_config parameter. The following code examples show the structure of
the SageMaker estimator classes with SageMaker Training Compiler turned on.
Tip
To get started with prebuilt models provided by the TensorFlow and Transformers libraries, try
using the batch sizes provided in the reference table at Tested Models (p. 1952).
Note
SageMaker Training Compiler for TensorFlow is available through the SageMaker TensorFlow
and Hugging Face framework estimators.

For information that fits your use case, see one of the following options.

For single GPU training

TensorFlow

from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_estimator=TensorFlow(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.9.1',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_estimator.fit()

To prepare your training script, see the following pages.

• For single GPU training (p. 1973) of a model constructed using TensorFlow Keras (tf.keras.*).
• For single GPU training (p. 1973) of a model constructed using TensorFlow modules (tf.*
excluding the TensorFlow Keras modules).

Hugging Face Estimator with TensorFlow

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()

To prepare your training script, see the following pages.

• For single GPU training (p. 1974) of a TensorFlow Keras model with Hugging Face Transformers
• For single GPU training (p. 1975) of a TensorFlow model with Hugging Face Transformers

For distributed training

Hugging Face Estimator with TensorFlow

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()

Tip
To prepare your training script, see the following pages.

• For distributed training (p. 1974) of a TensorFlow Keras model with Hugging Face
Transformers
• For distributed training (p. 1975) of a TensorFlow model with Hugging Face Transformers


The following list is the minimal set of parameters required to run a SageMaker training job with the
compiler.
Note
When using the SageMaker Hugging Face estimator, you must specify the
transformers_version, tensorflow_version, hyperparameters, and
compiler_config parameters to enable SageMaker Training Compiler. You cannot use
image_uri to manually specify the Training Compiler integrated Deep Learning Containers that
are listed at Supported Frameworks (p. 1950).

• entry_point (str) – Required. Specify the file name of your training script.
• instance_count (int) – Required. Specify the number of instances.
• instance_type (str) – Required. Specify the instance type.
• transformers_version (str) – Required only when using the SageMaker Hugging Face estimator.
Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To
find available versions, see Supported Frameworks (p. 1950).
• framework_version or tensorflow_version (str) – Required. Specify the TensorFlow
version supported by SageMaker Training Compiler. To find available versions, see Supported
Frameworks (p. 1950).
Note
When using the SageMaker TensorFlow estimator, you must specify framework_version.
When using the SageMaker Hugging Face estimator, you must specify both
transformers_version and tensorflow_version.
• hyperparameters (dict) – Optional. Specify hyperparameters for the training job, such as n_gpus,
batch_size, and learning_rate. When you enable SageMaker Training Compiler, try larger batch
sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted
batch sizes to improve training speed, see the section called “Tested Models” (p. 1952) and SageMaker
Training Compiler Example Notebooks and Blogs (p. 1989).
• compiler_config (TrainingCompilerConfig object) – Required. Include this parameter to turn on
SageMaker Training Compiler. The following are parameters for the TrainingCompilerConfig class.
• enabled (bool) – Optional. Specify True or False to turn on or turn off SageMaker Training
Compiler. The default value is True.
• debug (bool) – Optional. To receive more detailed training logs from your compiler-accelerated
training jobs, change it to True. However, the additional logging might add overhead and slow
down the compiled training job. The default value is False.

Warning
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training
Compiler. We recommend that you turn off Debugger when running SageMaker Training
Compiler to make sure there's no impact on performance. For more information, see the section
called “Considerations” (p. 1990). To turn the Debugger functionalities off, add the following
two arguments to the estimator:

disable_profiler=True,
debugger_hook_config=False

If the training job with the compiler is launched successfully, you receive the following logs during the
job initialization phase:

• With TrainingCompilerConfig(debug=False)

Found configuration for Training Compiler


Configuring SM Training Compiler...


• With TrainingCompilerConfig(debug=True)

Found configuration for Training Compiler


Configuring SM Training Compiler...
Training Compiler set to debug mode

Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning Containers

AWS Deep Learning Containers (DLC) for TensorFlow use adapted versions of TensorFlow that include
changes on top of the open-source TensorFlow framework. The SageMaker Framework Deep Learning
Containers are optimized for the underlying AWS infrastructure and Amazon SageMaker. In addition to
the advantages of using the DLCs, the SageMaker Training Compiler integration adds further performance
improvements over native TensorFlow. Furthermore, you can create a custom training container by
extending the DLC image.
Note
This Docker customization feature is currently available only for TensorFlow.

To extend and customize the SageMaker TensorFlow DLCs for your use case, use the following
instructions.

Create a Dockerfile

Use the following Dockerfile template to extend the SageMaker TensorFlow DLC. You must use the
SageMaker TensorFlow DLC image as the base image of your Docker container. To find the SageMaker
TensorFlow DLC image URIs, see Supported Frameworks.

# SageMaker TensorFlow Deep Learning Container image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/tensorflow-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker container
# to determine the user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# Add more code lines to customize for your use case
...

For more information, see Step 2: Create and upload the Dockerfile and Python training scripts.

Consider the following pitfalls when extending SageMaker Framework DLCs:

• Do not explicitly uninstall or change the version of TensorFlow packages in SageMaker containers.
Doing so causes the AWS optimized TensorFlow packages to be overwritten by open-source
TensorFlow packages, which might result in performance degradation.
• Watch out for packages that have a particular TensorFlow version or flavor as a dependency. These
packages might implicitly uninstall the AWS optimized TensorFlow and install open-source TensorFlow
packages.

For example, there’s a known issue that the tensorflow/models and tensorflow/text libraries always
attempt to reinstall open source TensorFlow. If you need to install these libraries to choose a specific
version for your use case, we recommend that you look into the SageMaker TensorFlow DLC Dockerfiles
for v2.9 or later. The paths to the Dockerfiles are typically in the following format: tensorflow/
training/docker/<tensorflow-version>/py3/<cuda-version>/Dockerfile.gpu. In the
Dockerfiles, you should find the code lines that reinstall the AWS managed TensorFlow binary (specified
by the TF_URL environment variable) and other dependencies in order. The reinstallation section should
look like the following example:

# tf-models does not respect existing installations of TensorFlow
# and always installs open source TensorFlow

RUN pip3 install --no-cache-dir -U \
    tf-models-official==x.y.z

RUN pip3 uninstall -y tensorflow tensorflow-gpu \
  ; pip3 install --no-cache-dir -U \
    ${TF_URL} \
    tensorflow-io==x.y.z \
    tensorflow-datasets==x.y.z

Build and push to ECR

To build and push your Docker container to Amazon ECR, follow the instructions in the following links:

• Step 3: Build the container


• Step 4: Test the container
• Step 5: Push the container to Amazon ECR

Run using the SageMaker Python SDK Estimator

Use the SageMaker TensorFlow framework estimator as usual. You must specify image_uri to use the
new container you hosted in Amazon ECR.

import sagemaker, boto3
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'tf-custom-container-test'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'

byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(
    account_id, region, uri_suffix, ecr_repository + tag
)

byoc_image_uri
# This should return something like
# 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest

estimator = TensorFlow(
    image_uri=byoc_image_uri,
    role=get_execution_role(),
    base_job_name='tf-custom-container-test-job',
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

# Start training
estimator.fit()

Enable SageMaker Training Compiler Using the SageMaker CreateTrainingJob API Operation

SageMaker Training Compiler configuration options must be specified through the
AlgorithmSpecification and HyperParameters fields in the request syntax for the
CreateTrainingJob API operation.

"AlgorithmSpecification": {
"TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
"sagemaker_training_compiler_enabled": "true",
"sagemaker_training_compiler_debug_mode": "false"
}

To find a complete list of deep learning container image URIs that have SageMaker Training Compiler
implemented, see Supported Frameworks (p. 1950).

SageMaker Training Compiler Example Notebooks and Blogs

The following blogs, case studies, and notebooks provide examples of how to implement SageMaker
Training Compiler.

Example notebooks are provided in the SageMaker examples GitHub repository, and you can also browse
them on the SageMaker examples website.

Blogs and Case Studies


The following blogs discuss case studies about using SageMaker Training Compiler.

• New – Introducing SageMaker Training Compiler
• Hugging Face Transformers BERT fine-tuning using Amazon SageMaker Training Compiler
• Speed up Hugging Face Training Jobs on AWS by Up to 50% with SageMaker Training Compiler

Example Notebooks
To find examples of using SageMaker Training Compiler, see the Training Compiler page in the Amazon
SageMaker Example Read the Docs website.

SageMaker Training Compiler Best Practices and Considerations

Review the following best practices and considerations when using SageMaker Training Compiler.

Best Practices
Use the following guidelines to achieve the best results when you run training jobs with SageMaker
Training Compiler.


General Best Practices

• Make sure that you use one of the Supported Instance Types (p. 1951) and Tested Models (p. 1952).
• When you create a tokenizer for an NLP model using the Hugging Face Transformers library
in your training script, make sure that you use a static input tensor shape by specifying
padding='max_length'. Do not use padding='longest' because padding to the longest
sequence in the batch can change the tensor shape for each training batch. The dynamic input shape
can initiate recompilation of the model and might increase total training time. For more information
about padding options of the Transformers tokenizers, see Padding and truncation in the Hugging Face
Transformers documentation.
• Measure GPU memory utilization to make sure that you use the maximum batch size that can fit into
the GPU memory. Amazon SageMaker Training Compiler reduces the memory footprint of your model
during training, which typically allows you to fit a larger batch_size in the GPU memory. Using a
larger batch_size results in a better GPU utilization and reduces the total training time.

When you adjust the batch size, you also have to adjust the learning_rate appropriately. For
example, if you increase the batch size by a factor of k, you need to adjust learning_rate linearly
(simple multiplication by k) or multiply it by the square root of k. This is to achieve the same or similar
convergence behavior in the reduced training time; a sketch of this adjustment follows this list. For a
reference of batch_size values tested for popular models, see Tested Models (p. 1952).
• To debug the compiler-accelerated training job, enable the debug flag in the compiler_config
parameter. This enables SageMaker to put the debugging logs into SageMaker training job logs.

huggingface_estimator=HuggingFace(
    ...
    compiler_config=TrainingCompilerConfig(debug=True)
)

Note that if you enable full debugging of the training job with the compiler, this might add some
overhead.
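
The following is a minimal sketch of the learning rate adjustment described in the list above; all variable names and values are illustrative, not part of any SageMaker API.

import math

# Hypothetical values measured without the compiler.
batch_size_native = 16
learning_rate_native = 5e-5

# Larger batch size that fits into GPU memory with the compiler.
batch_size_compiled = 64
k = batch_size_compiled / batch_size_native

# Two common scaling heuristics; pick one and validate convergence.
learning_rate_linear = learning_rate_native * k           # 2e-4
learning_rate_sqrt = learning_rate_native * math.sqrt(k)  # 1e-4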

Best Practices for PyTorch

• If you bring a PyTorch model and want to checkpoint it, make sure you use PyTorch/XLA's model
save function to properly checkpoint your model (see the sketch after this list). For more information
about the function, see torch_xla.core.xla_model.save in the PyTorch on XLA Devices documentation.

To learn how to add the modifications to your PyTorch script, see Large Language Models Using
PyTorch Directly (without the Hugging Face Transformers Trainer API) (p. 1969).

For more information about the actual application of using the model save function, see Checkpoint
Writing and Loading in the Hugging Face on PyTorch/XLA TPUs: Faster and cheaper training blog.
• To achieve the most optimal training time for distributed training, consider the following.
• Use instances with multiple GPUs instead of single-GPU instances. For example, a single
ml.p3dn.24xlarge instance has a faster training time compared to 8 x ml.p3.2xlarge instances.
• Use instances with EFA support such as ml.p3dn.24xlarge and ml.p4d.24xlarge. These
instance types have accelerated networking speed and reduce training time.
• Tune the preprocessing_num_workers parameter for datasets, so that model training is not
delayed by slow preprocessing.
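
As referenced in the checkpointing item above, a minimal sketch using the PyTorch/XLA save function might look like the following; the checkpoint path is a placeholder.

import torch_xla.core.xla_model as xm

# xm.save moves tensors from the XLA device to the CPU before writing and,
# by default, writes only from the master worker in distributed training.
xm.save(model.state_dict(), "/opt/ml/checkpoints/checkpoint.pt")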

Considerations
Consider the following when using SageMaker Training Compiler.


Performance degradation due to logging, checkpointing, and profiling


• Avoid logging, checkpointing, and profiling model tensors that lead to explicit evaluations. To
understand what an explicit evaluation is, consider the following code compiling example.

a = b+c
e = a+d

A compiler interprets the code as follows and reduces the memory footprint for the variable a:

e = b+c+d

Now consider the following case in which the code is changed to add a print function for the variable
a.

a = b+c
e = a+d
print(a)

The compiler makes an explicit evaluation of the variable a as follows.

e = b+c+d
a = b+c # Explicit evaluation
print(a)

In PyTorch, for example, avoid using torch.Tensor.item(), which might introduce explicit evaluations. In
deep learning, such explicit evaluations can cause overhead because they break fused operations in a
compilation graph of a model and lead to recomputation of the tensors.

If you still want to periodically evaluate the model during training while using SageMaker Training
Compiler, we recommend logging and checkpointing at a lower frequency to reduce overhead due to
explicit evaluations. For example, log every 10 epochs instead of every epoch.
• Graph compilation runs during the first few steps of training. As a result, the first few steps are
expected to be exceptionally slow. However, this is a one-time compilation cost and can be amortized
by training for a longer duration because compilation makes future steps much faster. The initial
compilation overhead depends on the size of the model, the size of the input tensors, and the
distribution of input tensor shapes.
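
As referenced above, a hedged sketch of reduced-frequency logging might look like the following; train_loader, train_step, and log_interval are illustrative placeholders from your own script.

log_interval = 100  # hypothetical value; tune for your job

for step, (inputs, targets) in enumerate(train_loader):
    loss = train_step(inputs, targets)
    # Only force an explicit evaluation of the loss tensor occasionally,
    # so that most steps keep their operations fused.
    if step % log_interval == 0:
        print(f"step {step}: loss {loss.item()}")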

Incorrect use of the PyTorch/XLA APIs when using PyTorch directly


PyTorch/XLA defines a set of APIs to replace some of the existing PyTorch training APIs. Failing to use
them properly causes PyTorch training to fail.

• One of the most typical errors when compiling a PyTorch model is due to a wrong device type
for operators and tensors. To properly compile a PyTorch model, make sure you use XLA devices
(xm.xla_device()) instead of using CUDA or mixing CUDA devices and XLA devices.
• mark_step() is a barrier just for XLA. Failing to set it correctly causes a training job to stall.
• PyTorch/XLA provides additional distributed training APIs. Failing to program the APIs properly causes
gradients to be collected incorrectly, which causes a training convergence failure.

To properly set up your PyTorch script and avoid the aforementioned incorrect API uses, see Large
Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API) (p. 1969).


SageMaker Training Compiler FAQ


Use the following FAQ items to find answers to commonly asked questions about SageMaker Training
Compiler.

Q. How do I know SageMaker Training Compiler is working?

If you successfully launched your training job with SageMaker Training Compiler, you receive the
following log messages:

• With TrainingCompilerConfig(debug=False)

Found configuration for Training Compiler


Configuring SM Training Compiler...

• With TrainingCompilerConfig(debug=True)

Found configuration for Training Compiler


Configuring SM Training Compiler...
Training Compiler set to debug mode

Q. Which models does SageMaker Training Compiler accelerate?

SageMaker Training Compiler supports the most popular deep learning models from the Hugging
Face transformers library. With most of the operators that the compiler supports, these models can
be trained faster with SageMaker Training Compiler. Compilable models include but are not limited
to the following: bert-base-cased, bert-base-chinese, bert-base-uncased, distilbert-base-uncased,
distilbert-base-uncased-finetuned-sst-2-english, gpt2, roberta-base, roberta-large, t5-base, and
xlm-roberta-base. The compiler works with most DL operators and
data structures and can accelerate many other DL models beyond those that have been tested.

Q. What happens if I enable SageMaker Training Compiler with a model that isn't tested?

For an untested model, you might need to first modify the training script to be compatible with
SageMaker Training Compiler. For more information, see Bring Your Own Deep Learning Model (p. 1967)
and follow the instructions on how to prepare your training script.

Once you have updated your training script, you can start the training job. The compiler proceeds to
compile the model. However, training speed may not increase and might even decrease relative to the
baseline with an untested model. You might need to retune training parameters such as batch_size
and learning_rate to achieve any speedup benefits.

If compilation of the untested model fails, the compiler returns an error. See SageMaker Training
Compiler Troubleshooting (p. 1993) for detailed information about the failure types and error messages.

Q. Will I always get a faster training job with SageMaker Training Compiler?

No, not necessarily. First, SageMaker Training Compiler adds some compilation overhead before the
ongoing training process can be accelerated. The optimized training job must run sufficiently long to
amortize and make up for this incremental compilation overhead at the beginning of the training job.

Additionally, as with any model training process, training with suboptimal parameters can increase
training time. SageMaker Training Compiler can change the characteristics of the training job by, for
example, changing the memory footprint of the job. Because of these differences, you might need
to retune your training job parameters to speed up training. A reference table specifying the best
performing parameters for training jobs with different instance types and models can be found at Tested
Models (p. 1952).


Finally, some code in a training script might add additional overhead or disrupt the compiled
computation graph and slow training. If working with a customized or untested model, see the
instructions at Best Practices to Use SageMaker Training Compiler with PyTorch/XLA (p. 1972).

Q. Can I always use a larger batch size with SageMaker Training Compiler?

Batch size increases in most, but not all, cases. The optimizations made by SageMaker Training Compiler
can change the characteristics of your training job, such as the memory footprint. Typically, a Training
Compiler job occupies less memory than an uncompiled training job with the native framework, which
allows for a larger batch size during training. A larger batch size, and a corresponding adjustment to the
learning rate, increases training throughput and can decrease total training time.

However, there could be cases where SageMaker Training Compiler might actually increase memory
footprint based on its optimization scheme. The compiler uses an analytical cost model to predict the
execution schedule with the lowest cost of execution for any compute-intensive operator. This model
could find an optimal schedule that increases memory use. In this case, you won’t be able to increase
batch sizes, but your sample throughput is still higher.

Q. Does SageMaker Training Compiler work with other SageMaker training features, such as the
SageMaker distributed training libraries and SageMaker Debugger?

SageMaker Training Compiler is currently not compatible with SageMaker’s distributed training libraries.

SageMaker Training Compiler is compatible with SageMaker Debugger, but Debugger might degrade
computational performance by adding overhead.

Q. Does SageMaker Training Compiler support custom containers (bring your own container)?

SageMaker Training Compiler is provided through AWS Deep Learning Containers, and you can
extend a subset of the containers to customize for your use-case. Containers that are extended from
AWS DLCs are supported by SageMaker Training Compiler. For more information, see Supported
Frameworks and Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning
Containers (p. 1987). If you need further support, reach out to the SageMaker team through AWS
Support or AWS Developer Forums for Amazon SageMaker.

SageMaker Training Compiler Troubleshooting


If you run into an error, you can use the following list to try to troubleshoot your training job. If you need
further support, reach out to the SageMaker team through AWS Support or AWS Developer Forums for
Amazon SageMaker.

Training job is not converging as expected when compared to the native framework training job

Convergence issues range from “the model is not learning when SageMaker Training Compiler is turned
on” to “the model is learning but slower than the native framework”. In this troubleshooting guide, we
assume your convergence is fine without SageMaker Training Compiler (in the native framework) and
consider this the baseline.

When faced with such convergence issues, the first step is to identify if the issue is limited to distributed
training or stems from single-GPU training. Distributed training with SageMaker Training Compiler is an
extension of single-GPU training with additional steps.

1. Set up a cluster with multiple instances or GPUs.


2. Distribute input data to all workers.
3. Synchronize the model updates from all workers.


Therefore, any convergence issue in single-GPU training propagates to distributed training with multiple
workers.


A flow chart to troubleshoot convergence issues in training jobs when using SageMaker Training
Compiler. Descriptions are in the following sections.

Convergence issues occurring in single-GPU training


If your convergence issue stems from single-GPU training, this is likely due to improper settings for
hyperparameters or the torch_xla APIs.

Check the hyperparameters

Training with SageMaker Training Compiler changes the memory footprint of a model. The
compiler intelligently arbitrates between re-use and re-compute, leading to a corresponding increase or
decrease in memory consumption. To leverage this, it is essential to re-tune the batch size and associated
hyperparameters when migrating a training job to SageMaker Training Compiler. However, incorrect
hyperparameter settings often cause oscillation in training loss and possibly a slower convergence
as a result. In rare cases, aggressive hyperparameters might result in the model not learning (the
training loss metric doesn’t decrease or returns NaN). To identify if the convergence issue is due to the
hyperparameters, do a side-by-side test of two training jobs with and without SageMaker Training
Compiler while keeping all the hyperparameters the same.

Check if the torch_xla APIs are properly set up for single-GPU training

If the convergence issue persists with the baseline hyperparameters, you need to check if there’s any
improper usage of the torch_xla APIs, specifically the ones for updating the model. Fundamentally,
torch_xla continues to accumulate instructions (deferring execution) in the form of a graph until it is
explicitly instructed to run the accumulated graph. The torch_xla.core.xla_model.mark_step()
function facilitates the execution of the accumulated graph. The graph execution should be synchronized
using this function after each model update and before printing and logging any variables. If it lacks
the synchronization step, the model might use stale values from memory during prints, logs, and the
subsequent forward passes, instead of using the most recent values that have to be synchronized after
every iteration and model update.
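
A hedged sketch of this synchronization pattern for single-GPU training might look like the following; model, loss_fn, optimizer, and train_loader are placeholders from your own script.

import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = model.to(device)

for inputs, targets in train_loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    xm.mark_step()      # run the accumulated graph after the model update
    print(loss.item())  # safe to read now; the value is synchronized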

It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly
from the use of AMP) or gradient clipping techniques. The appropriate order of gradient computation
with AMP is as follows.

1. Gradient computation with scaling


2. Gradient un-scaling, gradient clipping, and then scaling
3. Model update
4. Synchronizing the graph execution with mark_step()

To find the right APIs for the operations mentioned in the list, see the guide for migrating your training
script to SageMaker Training Compiler.
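
As an illustrative sketch of that ordering, assuming a gradient scaler with the torch.cuda.amp.GradScaler-style interface (not an official SageMaker snippet; model, optimizer, loss, and scaler come from your own script):

import torch
import torch_xla.core.xla_model as xm

scaler.scale(loss).backward()          # 1. gradient computation with scaling
scaler.unscale_(optimizer)             # 2. un-scale, then clip
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)                 # 3. model update (re-scaling is handled internally)
scaler.update()
xm.mark_step()                         # 4. synchronize the graph execution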

Consider using Automatic Model Tuning

If the convergence issue arises when re-tuning the batch size and associated hyperparameters such as
the learning rate while using SageMaker Training Compiler, consider using Automatic Model Tuning to
tune your hyperparameters. You can refer to the example notebook on tuning hyperparameters with
SageMaker Training Compiler.

Convergence issues occurring in distributed training


If your convergence issue persists in distributed training, this is likely due to improper settings for weight
initialization or the torch_xla APIs.

Check weight initialization across the workers


If the convergence issue arises when running a distributed training job with multiple workers, ensure
there is a uniform deterministic behavior across all workers by setting a constant seed where applicable.
Beware of techniques such as weight initialization, which involves randomization. Each worker might end
up training a different model in the absence of a constant seed.

Check if the torch_xla APIs are properly set up for distributed training

If the issue still persists, this is likely due to improper use of the torch_xla APIs for distributed training.
Make sure that you add the following in your estimator to set up a cluster for distributed training with
SageMaker Training Compiler.

distribution={'pytorchxla': {'enabled': True}}

This should be accompanied by a function _mp_fn(index) in your training script, which is invoked once
per worker. Without the _mp_fn(index) function, you might end up letting each of the workers train the
model independently without sharing model updates.

Next, make sure that you use the torch_xla.distributed.parallel_loader.MpDeviceLoader
API along with the distributed data sampler, as guided in the documentation about migrating your
training script to SageMaker Training Compiler, as in the following example.

torch.utils.data.distributed.DistributedSampler()

This ensures that the input data is properly distributed across all workers.
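
A hedged sketch of wiring the sampler and the device loader together might look like the following; train_dataset and the batch size are placeholders.

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset,
    num_replicas=xm.xrt_world_size(),  # total number of workers
    rank=xm.get_ordinal(),             # this worker's index
)
loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, sampler=sampler)

# MpDeviceLoader feeds batches to the XLA device for this worker.
device = xm.xla_device()
train_device_loader = pl.MpDeviceLoader(loader, device)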

Finally, to synchronize model updates from all workers, use
torch_xla.core.xla_model._fetch_gradients to gather gradients from all workers and
torch_xla.core.xla_model.all_reduce to combine all the gathered gradients into a single
update.
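
Under those assumptions, a minimal sketch of the reduction step might look like the following; note that _fetch_gradients is the private helper named above, and optimizer is a placeholder from your own script.

import torch_xla.core.xla_model as xm

gradients = xm._fetch_gradients(optimizer)
# Average the gathered gradients across all workers before the update.
xm.all_reduce('sum', gradients, scale=1.0 / xm.xrt_world_size())
optimizer.step()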

It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly
from use of AMP) or gradient clipping techniques. The appropriate order of gradient computation with
AMP is as follows.

1. Gradient computation with scaling


2. Gradient synchronization across all workers
3. Gradient un-scaling, gradient clipping, and then gradient scaling
4. Model update
5. Synchronizing the graph execution with mark_step()

Note that this checklist has an additional item for synchronizing all workers, compared to the checklist
for single-GPU training.

Training job fails due to missing PyTorch/XLA configuration

If a training job fails with the Missing XLA configuration error message, it might be due to a
misconfiguration in the number of GPUs per instance that you use.

XLA requires additional environment variables to compile the training job. The most common missing
environment variable is GPU_NUM_DEVICES. For the compiler to work properly, you must set this
environment variable equal to the number of GPUs per instance.

There are three approaches to set the GPU_NUM_DEVICES environment variable:


• Approach 1 – Use the environment argument of the SageMaker estimator class. For example, if you
use an ml.p3.8xlarge instance that has four GPUs, do the following:

# Using the SageMaker Python SDK's HuggingFace estimator

hf_estimator=HuggingFace(
    ...
    instance_type="ml.p3.8xlarge",
    hyperparameters={...},
    environment={
        ...
        "GPU_NUM_DEVICES": "4"  # corresponds to number of GPUs on the specified instance
    },
)

• Approach 2 – Use the hyperparameters argument of the SageMaker estimator class and parse it in
your training script.
1. To specify the number of GPUs, add a key-value pair to the hyperparameters argument.

For example, if you use an ml.p3.8xlarge instance that has four GPUs, do the following:

# Using the SageMaker Python SDK's HuggingFace estimator

hf_estimator=HuggingFace(
    ...
    entry_point="train.py",
    instance_type="ml.p3.8xlarge",
    hyperparameters={
        ...
        "n_gpus": 4  # corresponds to number of GPUs on specified instance
    }
)
hf_estimator.fit()

2. In your training script, parse the n_gpus hyperparameter and specify it as an input for the
GPU_NUM_DEVICES environment variable.

# train.py
import os, argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    ...
    # Data, model, and output directories
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])

    args, _ = parser.parse_known_args()

    os.environ["GPU_NUM_DEVICES"] = args.n_gpus

• Approach 3 – Hard-code the GPU_NUM_DEVICES environment variable in your training script. For
example, add the following to your script if you use an instance that has four GPUs.

# train.py
import os

os.environ["GPU_NUM_DEVICES"] = "4"  # environment variable values must be strings

Tip
To find the number of GPU devices on machine learning instances that you want to use, see
Accelerated Computing in the Amazon EC2 Instance Types page.

SageMaker Training Compiler doesn't reduce the total training time

If the total training time does not decrease with SageMaker Training Compiler, we highly recommend
that you review the SageMaker Training Compiler Best Practices and Considerations (p. 1989) page to
check your training configuration, the padding strategy for the input tensor shape, and hyperparameters.

Amazon SageMaker Training Compiler Release Notes


See the following release notes to track the latest updates for Amazon SageMaker Training Compiler.

SageMaker Training Compiler Release Notes: February 13, 2023


Currency Updates

• Added support for PyTorch v1.13.1

Bug Fixes

• Fixed a race condition issue on GPU that was causing NaN loss in some models, such as vision
transformer (ViT) models.

Other Changes

• SageMaker Training Compiler improves performance by letting PyTorch/XLA
automatically override the optimizers (such as SGD, Adam, AdamW) in torch.optim or
transformers.optimization with their syncfree versions in torch_xla.amp.syncfree
(such as torch_xla.amp.syncfree.SGD, torch_xla.amp.syncfree.Adam,
torch_xla.amp.syncfree.AdamW). You don't need to change the code lines where you define
optimizers in your training script.

Migration to AWS Deep Learning Containers


This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:

• PyTorch v1.13.1

763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-trcomp-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker

To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see
Supported Frameworks, AWS Regions, Instance Types, and Tested Models (p. 1949).

SageMaker Training Compiler Release Notes: January 9, 2023


Breaking Changes


• tf.keras.optimizers.Optimizer points to a new optimizer in TensorFlow 2.11.0 and later. The
old optimizers are moved to tf.keras.optimizers.legacy. You might encounter job failures due
to this breaking change when you do the following.
  • Load checkpoints from an old optimizer. We recommend that you switch to the legacy optimizers.
  • Use TensorFlow v1. We recommend that you migrate to TensorFlow v2, or switch to the legacy
    optimizers if you need to continue using TensorFlow v1.

For a more detailed list of breaking changes from the optimizer changes, see the official TensorFlow
v2.11.0 release notes in the TensorFlow GitHub repository.

Migration to AWS Deep Learning Containers

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:

• TensorFlow v2.11.0

763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-
ubuntu20.04-sagemaker

To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see
Supported Frameworks, AWS Regions, Instance Types, and Tested Models (p. 1949).

SageMaker Training Compiler Release Notes: December 8, 2022


Bug Fixes

• Fixed the seed for PyTorch training jobs starting with PyTorch v1.12 to ensure that there is no
discrepancy in model initialization across different processes. See also PyTorch Reproducibility.
• Fixed the issue causing PyTorch distributed training jobs on G4dn and G5 instances to not default to
communication through PCIe.

Known Issues

• Improper use of PyTorch/XLA APIs
